nips nips2011 nips2011-59 knowledge-graph by maker-knowledge-mining

59 nips-2011-Composite Multiclass Losses


Source: pdf

Author: Elodie Vernet, Mark D. Reid, Robert C. Williamson

Abstract: We consider loss functions for multiclass prediction problems. We show when a multiclass loss can be expressed as a “proper composite loss”, which is the composition of a proper loss and a link function. We extend existing results for binary losses to multiclass losses. We determine the stationarity condition, Bregman representation, order-sensitivity, existence and uniqueness of the composite representation for multiclass losses. We subsume existing results on “classification calibration” by relating it to properness and show that the simple integral representation for binary proper losses can not be extended to multiclass losses. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We consider loss functions for multiclass prediction problems. [sent-10, score-0.471]

2 We show when a multiclass loss can be expressed as a “proper composite loss”, which is the composition of a proper loss and a link function. [sent-11, score-1.117]

3 We extend existing results for binary losses to multiclass losses. [sent-12, score-0.763]

4 We determine the stationarity condition, Bregman representation, order-sensitivity, existence and uniqueness of the composite representation for multiclass losses. [sent-13, score-0.587]

5 We subsume existing results on “classification calibration” by relating it to properness and show that the simple integral representation for binary proper losses can not be extended to multiclass losses. [sent-14, score-1.568]

6 1 Introduction The motivation of this paper is to understand the intrinsic structure and properties of suitable loss functions for the problem of multiclass prediction, which includes multiclass probability estimation. [sent-15, score-0.785]

7 Suppose we are given a data sample S := (xi , yi )i∈[m] where xi ∈ X is an observation and yi ∈ {1, . . . , n} =: [n] is its label. [sent-16, score-0.076]

8 We assume the sample S is drawn iid according to some distribution P = PX,Y on X × [n]. [sent-19, score-0.022]

9 Given a new observation x we want to predict the probability pi := P(Y = i|X = x) of x belonging to class i, for i ∈ [n]. [sent-20, score-0.295]

10 Multiclass classification requires the learner to predict the most likely class of x; that is, to find y = arg maxi∈[n] pi . [sent-21, score-0.292]

11 Let ∆n := {(p1 , . . . , pn ) : ∑i∈[n] pi = 1, and 0 ≤ pi ≤ 1, ∀i ∈ [n]} denote the n-simplex. [sent-26, score-0.609]

12 For multiclass probability estimation, a loss is a function ℓ : ∆n → Rn+ . [sent-27, score-0.339]

13 The partial losses ℓi are the components of ℓ(q) = (ℓ1 (q), . . . , ℓn (q))′. [sent-29, score-0.421]

14 Proper losses are particularly suitable for probability estimation. [sent-33, score-0.467]

15 They have been studied in detail when n = 2 (the “binary case”), where there is a nice integral representation [1, 2, 3] and a characterisation [4] when differentiable. [sent-34, score-0.132]

16 Classification calibrated losses are an analog of proper losses for the problem of classification [5]. [sent-35, score-1.415]

17 The relationship between classification calibration and properness was determined in [4] for n = 2. [sent-36, score-0.631]

18 Most of these results have had no multiclass analogue until now. [sent-37, score-0.342]

19 The design of losses for multiclass prediction has received recent attention [6, 7, 8, 9, 10, 11, 12] although none of these papers developed the connection to proper losses, and most restrict consideration to margin losses (which imply certain symmetry conditions). [sent-38, score-1.602]

20 Glasmachers [13] has shown that certain learning algorithms can still behave well when the losses do not satisfy the conditions in these earlier papers because the requirements are actually stronger than needed. [sent-39, score-0.497]

21 We suppose we are given data (xi , yi )i∈[m] such that yi ∈ Y is the label corresponding to xi ∈ X . [sent-47, score-0.119]

22 We denote by EX,Y and EY|X , respectively, the expectation and the conditional expectation with respect to PX,Y . [sent-49, score-0.036]

23 The conditional risk L associated with a loss ℓ is the function L : ∆n × ∆n ∋ (p, q) → L(p, q) = EY∼p ℓY (q) = p · ℓ(q) = ∑i∈[n] pi ℓi (q) ∈ R+ , where Y ∼ p means Y is drawn according to a multinomial distribution with parameter p. [sent-50, score-0.422]

24 Minimizing L(q) over q : X → ∆n is equivalent to minimizing L(p(x), q(x)) over q(x) ∈ ∆n for all x ∈ X , where p(x) = (p1 (x), . . . , pn (x))′. [sent-53, score-0.026]

25 Here p′ denotes the transpose of p, and pi (x) = P(Y = i|X = x). [sent-56, score-0.411]

26 Thus it suffices to only consider the conditional risk; confer [3]. [sent-57, score-0.036]

27 A loss ℓ : ∆n → Rn+ is proper if L(p, p) ≤ L(p, q), ∀p, q ∈ ∆n (see the first sketch following this summary). [sent-58, score-0.453]

28 It is strictly proper if the inequality is strict when p ≠ q. [sent-59, score-0.448]

29 The conditional Bayes risk is L : ∆n ∋ p → infq∈∆n L(p, q). [sent-60, score-0.09]

30 Strictly proper losses induce Fisher consistent estimators of probabilities: if ℓ is strictly proper, p = arg minq L(p, q). [sent-63, score-0.893]

31 In order to differentiate the losses we project the n-simplex into a subset of Rn−1 . [sent-64, score-0.445]

32 Let ∆̃n := {p̃ = (p̃1 , . . . , p̃n−1 ) : p̃i ≥ 0, ∀i ∈ [n − 1], ∑i∈[n−1] p̃i ≤ 1} denote the projection of the n-simplex ∆n into Rn−1 . [sent-74, score-0.223]

33 Its inverse is Π−1 : ∆̃n ∋ p̃ = (p̃1 , . . . , p̃n−1 ) → p = (p̃1 , . . . , p̃n−1 , 1 − ∑i∈[n−1] p̃i ) ∈ ∆n . [sent-77, score-0.223]

34 The losses above are defined on the simplex ∆n since the argument (an estimator) represents a probability vector. [sent-81, score-0.709]

35 Suppose there exists an invertible function ψ : ∆n → V . [sent-84, score-0.079]

36 Then ℓ can be written as a composition of a loss λ defined on the simplex with ψ −1 . [sent-85, score-0.217]

37 If λ is proper, we say ℓ is a proper composite loss, with associated proper loss λ and link ψ (see the second sketch following this summary). [sent-88, score-0.995]

38 The kth unit vector ek is the n-vector with all components zero except the kth, which is 1. [sent-90, score-0.126]

39 Define ∆̊n := {(p1 , . . . , pn ) : ∑i∈[n] pi = 1, and 0 < pi < 1, ∀i ∈ [n]} (the interior of the simplex) and ∂∆n := ∆n \ ∆̊n . [sent-99, score-0.609]

40 3 Relating Properness to Classification Calibration Properness is an attractive property of a loss for the task of class probability estimation. [sent-100, score-0.153]

41 However, if one is merely interested in classifying (predicting ŷ ∈ [n] given x ∈ X ), then one requires less. [sent-101, score-0.021]

42 We relate classification calibration (the analog of properness for classification problems) to properness. [sent-102, score-0.723]

43 We cover ∆n with n subsets, each representing one class: Ti (c) := {p ∈ ∆n : ∀ j ≠ i, pi cj ≥ pj ci } (see the third sketch following this summary). [sent-104, score-0.285]

44 Observe that for i ≠ j, the sets {p ∈ ∆n : pi cj = pj ci } are subsets of dimension n − 2 passing through c and all ek such that k ≠ i and k ≠ j. [sent-105, score-0.305]

45 Each of these sets splits ∆n into two parts; Ti (c) is the intersection of the half-spaces delimited by the preceding (n − 2)-dimensional subsets that lie on the same side as ei . [sent-106, score-0.107]

46 Then the following hold: for all p ∈ ∆n , there exists i such that p ∈ Ti (c). [sent-109, score-0.029]

47 Ti (c) ∩ T j (c) ⊆ {p ∈ ∆n : pi c j = p j ci }, a subspace of dimension n − 2. [sent-113, score-0.282]

48 For all p, q ∈ ∆n , p ≠ q, there exists c ∈ ∆n and i ∈ [n] such that p ∈ Ti (c) and q ∉ Ti (c). [sent-118, score-0.029]

49 Classification calibrated losses have been developed and studied under several different definitions and names [6, 5]. [sent-119, score-0.641]

50 Below we generalise the notion of c-calibration, which was proposed for n = 2 in [4] as a generalisation of the notion of classification calibration in [5]. [sent-120, score-0.493]

51 Definition 2 Suppose ℓ : ∆n → Rn+ is a loss and c ∈ ∆̊n . [sent-121, score-0.109]

52 We say ℓ is c-calibrated at p ∈ ∆n if, for all i ∈ [n] such that p ∈ Ti (c) and all q ∉ Ti (c), L(p) < L(p, q). [sent-122, score-0.025]

53 We say that ℓ is c-calibrated if ∀p ∈ ∆n , ℓ is c-calibrated at p. [sent-123, score-0.025]

54 Definition 2 means that if the probability vector q one predicts doesn’t belong to the same subset (i.e. [sent-124, score-0.024]

55 doesn’t predict the same class) as the real probability vector p, then the conditional risk L(p, q) is strictly larger than the conditional Bayes risk L(p). [sent-126, score-0.161]

56 Classification calibration in the sense used in [5] corresponds to 1/2-calibrated losses when n = 2. [sent-127, score-0.74]

57 For cmid := (1/n, . . . , 1/n)′, cmid -calibration induces Fisher-consistent estimates in the case of classification. [sent-131, score-0.177]

58 Furthermore, “ℓ is cmid -calibrated and, for all i ∈ [n], ℓi is continuous and bounded below” is equivalent to “ℓ is infinite sample consistent as defined by [6]”. [sent-132, score-0.243]

59 This is because if ℓ is continuous and Ti (c) is closed, then ∀q ∉ Ti (c), L(p) < L(p, q) if and only if L(p) < infq∉Ti (c) L(p, q). [sent-133, score-0.043]

60 The following result generalises the correspondence between binary classification calibration and properness [4, Theorem 16] to multiclass losses (n > 2). [sent-134, score-1.451]

61 Proposition 3 A continuous loss ℓ : ∆n → Rn+ is strictly proper if and only if it is c-calibrated for all c ∈ ∆̊n . [sent-135, score-0.58]

62 In particular, a continuous strictly proper loss is cmid -calibrated. [sent-136, score-0.757]

63 In the binary case, ℓ is classification calibrated if and only if the following implication holds [5]: L(fn ) → ming L(g) ⇒ PX,Y (Y ≠ fn (X)) → ming PX,Y (Y ≠ g(X)). (1) [sent-138, score-0.362]

64 Tewari and Bartlett [8] have characterised when (1) holds in the multiclass case. [sent-139, score-0.37]

65 Since there is no reason to assume the equivalence between classification calibration and (1) still holds for n > 2, we give different names for these two notions. [sent-140, score-0.386]

66 We keep the name classification calibration for the notion linked to Fisher consistency (as defined above) and call the notion of Tewari and Bartlett (equivalent to (1)) prediction calibration. [sent-141, score-0.656]

67 Let C = co({ℓ(v) : v ∈ V }), the convex hull of the image of V under ℓ. [sent-143, score-0.025]

68 ℓ is said to be prediction calibrated if there exists a prediction function pred : Rn → [n] such that ∀p ∈ ∆n : inf{p · z : z ∈ C, pred(z) ∉ arg maxi∈[n] pi } > inf{p · z : z ∈ C} = L(p). [sent-144, score-0.406]
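
The three Python snippets below are not part of the paper or of the extracted summary; they are small numerical sketches with illustrative function names. The first one checks the properness condition above for the multiclass log loss ℓi(q) = −log qi: the conditional risk L(p, q) = p · ℓ(q) is minimised at q = p.

import numpy as np

def log_loss(q):
    # partial losses: l_i(q) = -log q_i
    return -np.log(q)

def conditional_risk(p, q, loss=log_loss):
    # L(p, q) = E_{Y~p} l_Y(q) = p . l(q)
    return float(np.dot(p, loss(q)))

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(4))          # a point in the interior of the simplex
for _ in range(1000):
    q = rng.dirichlet(np.ones(4))      # a competing prediction
    # properness: L(p, p) <= L(p, q) for every q
    assert conditional_risk(p, p) <= conditional_risk(p, q) + 1e-12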
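
The second sketch assumes the softmax map as the inverse link ψ−1 (strictly, the multinomial logit link is invertible only once the shift-invariance of softmax is accounted for); composing the proper log loss λ with it gives the familiar multiclass logistic loss as a proper composite loss ℓ = λ ∘ ψ−1.

import numpy as np

def softmax(v):
    # assumed inverse link psi^{-1}: scores in R^n -> probability vector in the simplex
    z = np.exp(v - np.max(v))
    return z / z.sum()

def log_loss(q):
    return -np.log(q)

def composite_loss(v):
    # l(v) = lambda(psi^{-1}(v)), with lambda the proper log loss
    return log_loss(softmax(v))

v = np.array([2.0, -1.0, 0.5])
print(composite_loss(v))               # the three partial losses at the score vector v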
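
The third sketch concerns the sets Ti(c) := {p ∈ ∆n : pi cj ≥ pj ci, ∀j ≠ i} and c-calibration; helper names are illustrative. It checks numerically that the Ti(c) cover the simplex for c = cmid and that, for the strictly proper (and, on the interior, continuous) log loss, L(p, q) > L(p) whenever p ∈ Ti(c) and q ∉ Ti(c), in line with Definition 2 and Proposition 3.

import numpy as np

def in_T(i, c, p):
    # membership in T_i(c) = {p : p_i c_j >= p_j c_i for all j != i}
    return all(p[i] * c[j] >= p[j] * c[i] for j in range(len(c)) if j != i)

def cond_risk_log(p, q):
    # L(p, q) for the log loss; L(p, p) is the conditional Bayes risk
    return float(-np.dot(p, np.log(q)))

rng = np.random.default_rng(1)
n = 4
c = np.full(n, 1.0 / n)                # c_mid = (1/n, ..., 1/n)'
for _ in range(500):
    p = rng.dirichlet(np.ones(n))
    q = rng.dirichlet(np.ones(n))
    # the T_i(c) cover the simplex
    assert any(in_T(i, c, p) for i in range(n))
    for i in range(n):
        if in_T(i, c, p) and not in_T(i, c, q):
            # Definition 2 at this (p, q): the risk of q exceeds the Bayes risk at p
            assert cond_risk_log(p, q) > cond_risk_log(p, p)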


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('losses', 0.421), ('proper', 0.344), ('calibration', 0.319), ('multiclass', 0.315), ('properness', 0.312), ('pi', 0.223), ('ti', 0.204), ('cmid', 0.177), ('calibrated', 0.176), ('pn', 0.163), ('px', 0.125), ('composite', 0.124), ('loss', 0.109), ('classi', 0.096), ('rn', 0.09), ('anu', 0.089), ('infq', 0.089), ('strictly', 0.084), ('suppose', 0.081), ('ey', 0.079), ('fisher', 0.069), ('nicta', 0.067), ('composition', 0.067), ('tewari', 0.061), ('qn', 0.059), ('notion', 0.057), ('uniqueness', 0.056), ('fn', 0.055), ('risk', 0.054), ('analog', 0.053), ('invertible', 0.05), ('ek', 0.05), ('integral', 0.049), ('link', 0.049), ('cation', 0.049), ('doesn', 0.049), ('prediction', 0.047), ('relating', 0.044), ('names', 0.044), ('continuous', 0.043), ('ex', 0.041), ('simplex', 0.041), ('pred', 0.039), ('reid', 0.039), ('relate', 0.039), ('kth', 0.038), ('yi', 0.038), ('bartlett', 0.036), ('conditional', 0.036), ('generalises', 0.036), ('estimator', 0.035), ('inf', 0.034), ('ens', 0.034), ('generalise', 0.034), ('characterization', 0.034), ('existence', 0.033), ('subsets', 0.032), ('characterised', 0.032), ('papers', 0.032), ('williamson', 0.03), ('stationarity', 0.03), ('ci', 0.03), ('exists', 0.029), ('representation', 0.029), ('subspace', 0.029), ('predict', 0.028), ('analogue', 0.027), ('aid', 0.027), ('subsume', 0.027), ('binary', 0.027), ('generalisation', 0.026), ('implication', 0.026), ('co', 0.026), ('minimizing', 0.026), ('say', 0.025), ('transpose', 0.025), ('hull', 0.025), ('differentiate', 0.024), ('subspaces', 0.024), ('probability', 0.024), ('holds', 0.023), ('presenting', 0.023), ('consistent', 0.023), ('constructs', 0.022), ('requirements', 0.022), ('suitable', 0.022), ('iid', 0.022), ('behave', 0.022), ('symmetry', 0.022), ('intersection', 0.022), ('arg', 0.021), ('bregman', 0.021), ('correspondence', 0.021), ('classifying', 0.021), ('maxi', 0.021), ('mark', 0.02), ('strict', 0.02), ('nice', 0.02), ('concave', 0.02), ('class', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 59 nips-2011-Composite Multiclass Losses

Author: Elodie Vernet, Mark D. Reid, Robert C. Williamson

Abstract: We consider loss functions for multiclass prediction problems. We show when a multiclass loss can be expressed as a “proper composite loss”, which is the composition of a proper loss and a link function. We extend existing results for binary losses to multiclass losses. We determine the stationarity condition, Bregman representation, order-sensitivity, existence and uniqueness of the composite representation for multiclass losses. We subsume existing results on “classification calibration” by relating it to properness and show that the simple integral representation for binary proper losses can not be extended to multiclass losses. 1

2 0.1565105 178 nips-2011-Multiclass Boosting: Theory and Algorithms

Author: Mohammad J. Saberian, Nuno Vasconcelos

Abstract: The problem of multi-class boosting is considered. A new framework, based on multi-dimensional codewords and predictors is introduced. The optimal set of codewords is derived, and a margin enforcing loss proposed. The resulting risk is minimized by gradient descent on a multidimensional functional space. Two algorithms are proposed: 1) CD-MCBoost, based on coordinate descent, updates one predictor component at a time, 2) GD-MCBoost, based on gradient descent, updates all components jointly. The algorithms differ in the weak learners that they support but are both shown to be 1) Bayes consistent, 2) margin enforcing, and 3) convergent to the global minimum of the risk. They also reduce to AdaBoost when there are only two classes. Experiments show that both methods outperform previous multiclass boosting approaches on a number of datasets. 1

3 0.11214818 106 nips-2011-Generalizing from Several Related Classification Tasks to a New Unlabeled Sample

Author: Gilles Blanchard, Gyemin Lee, Clayton Scott

Abstract: We consider the problem of assigning class labels to an unlabeled test data set, given several labeled training data sets drawn from similar distributions. This problem arises in several applications where data distributions fluctuate because of biological, technical, or other sources of variation. We develop a distributionfree, kernel-based approach to the problem. This approach involves identifying an appropriate reproducing kernel Hilbert space and optimizing a regularized empirical risk over the space. We present generalization error analysis, describe universal kernels, and establish universal consistency of the proposed methodology. Experimental results on flow cytometry data are presented. 1

4 0.10637868 25 nips-2011-Adaptive Hedge

Author: Tim V. Erven, Wouter M. Koolen, Steven D. Rooij, Peter Grünwald

Abstract: Most methods for decision-theoretic online learning are based on the Hedge algorithm, which takes a parameter called the learning rate. In most previous analyses the learning rate was carefully tuned to obtain optimal worst-case performance, leading to suboptimal performance on easy instances, for example when there exists an action that is significantly better than all others. We propose a new way of setting the learning rate, which adapts to the difficulty of the learning problem: in the worst case our procedure still guarantees optimal performance, but on easy instances it achieves much smaller regret. In particular, our adaptive method achieves constant regret in a probabilistic setting, when there exists an action that on average obtains strictly smaller loss than all other actions. We also provide a simulation study comparing our approach to existing methods. 1

5 0.096873373 207 nips-2011-Optimal learning rates for least squares SVMs using Gaussian kernels

Author: Mona Eberts, Ingo Steinwart

Abstract: We prove a new oracle inequality for support vector machines with Gaussian RBF kernels solving the regularized least squares regression problem. To this end, we apply the modulus of smoothness. With the help of the new oracle inequality we then derive learning rates that can also be achieved by a simple data-dependent parameter selection method. Finally, it turns out that our learning rates are asymptotically optimal for regression functions satisfying certain standard smoothness conditions. 1

6 0.081665635 248 nips-2011-Semi-supervised Regression via Parallel Field Regularization

7 0.070841663 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction

8 0.068604968 252 nips-2011-ShareBoost: Efficient multiclass learning with feature sharing

9 0.06736514 282 nips-2011-The Fast Convergence of Boosting

10 0.061336458 222 nips-2011-Prismatic Algorithm for Discrete D.C. Programming Problem

11 0.061191533 33 nips-2011-An Exact Algorithm for F-Measure Maximization

12 0.058622744 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition

13 0.057837542 306 nips-2011-t-divergence Based Approximate Inference

14 0.055151228 145 nips-2011-Learning Eigenvectors for Free

15 0.054384228 28 nips-2011-Agnostic Selective Classification

16 0.051131267 198 nips-2011-On U-processes and clustering performance

17 0.050705884 187 nips-2011-Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning

18 0.048883956 19 nips-2011-Active Classification based on Value of Classifier

19 0.047353785 8 nips-2011-A Model for Temporal Dependencies in Event Streams

20 0.044420838 305 nips-2011-k-NN Regression Adapts to Local Intrinsic Dimension


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.122), (1, -0.044), (2, -0.055), (3, -0.059), (4, 0.017), (5, 0.053), (6, -0.006), (7, -0.072), (8, -0.119), (9, 0.014), (10, -0.072), (11, 0.001), (12, 0.045), (13, 0.011), (14, -0.01), (15, -0.015), (16, -0.022), (17, -0.045), (18, -0.06), (19, 0.021), (20, 0.031), (21, -0.059), (22, 0.096), (23, 0.054), (24, -0.018), (25, 0.067), (26, -0.046), (27, -0.091), (28, -0.138), (29, 0.009), (30, 0.125), (31, 0.19), (32, 0.013), (33, -0.06), (34, 0.006), (35, 0.027), (36, 0.091), (37, -0.16), (38, -0.07), (39, -0.005), (40, 0.05), (41, -0.027), (42, -0.065), (43, 0.026), (44, 0.028), (45, 0.105), (46, 0.047), (47, 0.024), (48, -0.028), (49, 0.066)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9581092 59 nips-2011-Composite Multiclass Losses

Author: Elodie Vernet, Mark D. Reid, Robert C. Williamson

Abstract: We consider loss functions for multiclass prediction problems. We show when a multiclass loss can be expressed as a “proper composite loss”, which is the composition of a proper loss and a link function. We extend existing results for binary losses to multiclass losses. We determine the stationarity condition, Bregman representation, order-sensitivity, existence and uniqueness of the composite representation for multiclass losses. We subsume existing results on “classification calibration” by relating it to properness and show that the simple integral representation for binary proper losses can not be extended to multiclass losses. 1

2 0.6350463 106 nips-2011-Generalizing from Several Related Classification Tasks to a New Unlabeled Sample

Author: Gilles Blanchard, Gyemin Lee, Clayton Scott

Abstract: We consider the problem of assigning class labels to an unlabeled test data set, given several labeled training data sets drawn from similar distributions. This problem arises in several applications where data distributions fluctuate because of biological, technical, or other sources of variation. We develop a distributionfree, kernel-based approach to the problem. This approach involves identifying an appropriate reproducing kernel Hilbert space and optimizing a regularized empirical risk over the space. We present generalization error analysis, describe universal kernels, and establish universal consistency of the proposed methodology. Experimental results on flow cytometry data are presented. 1

3 0.61414921 207 nips-2011-Optimal learning rates for least squares SVMs using Gaussian kernels

Author: Mona Eberts, Ingo Steinwart

Abstract: We prove a new oracle inequality for support vector machines with Gaussian RBF kernels solving the regularized least squares regression problem. To this end, we apply the modulus of smoothness. With the help of the new oracle inequality we then derive learning rates that can also be achieved by a simple data-dependent parameter selection method. Finally, it turns out that our learning rates are asymptotically optimal for regression functions satisfying certain standard smoothness conditions. 1

4 0.57464504 178 nips-2011-Multiclass Boosting: Theory and Algorithms

Author: Mohammad J. Saberian, Nuno Vasconcelos

Abstract: The problem of multi-class boosting is considered. A new framework, based on multi-dimensional codewords and predictors is introduced. The optimal set of codewords is derived, and a margin enforcing loss proposed. The resulting risk is minimized by gradient descent on a multidimensional functional space. Two algorithms are proposed: 1) CD-MCBoost, based on coordinate descent, updates one predictor component at a time, 2) GD-MCBoost, based on gradient descent, updates all components jointly. The algorithms differ in the weak learners that they support but are both shown to be 1) Bayes consistent, 2) margin enforcing, and 3) convergent to the global minimum of the risk. They also reduce to AdaBoost when there are only two classes. Experiments show that both methods outperform previous multiclass boosting approaches on a number of datasets. 1

5 0.49904379 152 nips-2011-Learning in Hilbert vs. Banach Spaces: A Measure Embedding Viewpoint

Author: Kenji Fukumizu, Gert R. Lanckriet, Bharath K. Sriperumbudur

Abstract: The goal of this paper is to investigate the advantages and disadvantages of learning in Banach spaces over Hilbert spaces. While many works have been carried out in generalizing Hilbert methods to Banach spaces, in this paper, we consider the simple problem of learning a Parzen window classifier in a reproducing kernel Banach space (RKBS)—which is closely related to the notion of embedding probability measures into an RKBS—in order to carefully understand its pros and cons over the Hilbert space classifier. We show that while this generalization yields richer distance measures on probabilities compared to its Hilbert space counterpart, it however suffers from serious computational drawback limiting its practical applicability, which therefore demonstrates the need for developing efficient learning algorithms in Banach spaces.

6 0.46895471 252 nips-2011-ShareBoost: Efficient multiclass learning with feature sharing

7 0.46594495 33 nips-2011-An Exact Algorithm for F-Measure Maximization

8 0.44725657 256 nips-2011-Solving Decision Problems with Limited Information

9 0.41512519 19 nips-2011-Active Classification based on Value of Classifier

10 0.40012008 28 nips-2011-Agnostic Selective Classification

11 0.39535093 282 nips-2011-The Fast Convergence of Boosting

12 0.38607374 103 nips-2011-Generalization Bounds and Consistency for Latent Structural Probit and Ramp Loss

13 0.37259987 305 nips-2011-k-NN Regression Adapts to Local Intrinsic Dimension

14 0.35750458 143 nips-2011-Learning Anchor Planes for Classification

15 0.34671468 162 nips-2011-Lower Bounds for Passive and Active Learning

16 0.34443581 295 nips-2011-Unifying Non-Maximum Likelihood Learning Objectives with Minimum KL Contraction

17 0.34166002 248 nips-2011-Semi-supervised Regression via Parallel Field Regularization

18 0.33958045 25 nips-2011-Adaptive Hedge

19 0.32525572 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition

20 0.32062158 284 nips-2011-The Impact of Unlabeled Patterns in Rademacher Complexity Theory for Kernel Classifiers


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.052), (4, 0.024), (20, 0.068), (26, 0.029), (31, 0.031), (33, 0.023), (43, 0.1), (45, 0.102), (57, 0.025), (74, 0.03), (82, 0.287), (83, 0.042), (89, 0.014), (99, 0.08)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.79119349 59 nips-2011-Composite Multiclass Losses

Author: Elodie Vernet, Mark D. Reid, Robert C. Williamson

Abstract: We consider loss functions for multiclass prediction problems. We show when a multiclass loss can be expressed as a “proper composite loss”, which is the composition of a proper loss and a link function. We extend existing results for binary losses to multiclass losses. We determine the stationarity condition, Bregman representation, order-sensitivity, existence and uniqueness of the composite representation for multiclass losses. We subsume existing results on “classification calibration” by relating it to properness and show that the simple integral representation for binary proper losses can not be extended to multiclass losses. 1

2 0.60552973 187 nips-2011-Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning

Author: Eric Moulines, Francis R. Bach

Abstract: We consider the minimization of a convex objective function defined on a Hilbert space, which is only available through unbiased estimates of its gradients. This problem includes standard machine learning algorithms such as kernel logistic regression and least-squares regression, and is commonly referred to as a stochastic approximation problem in the operations research community. We provide a non-asymptotic analysis of the convergence of two well-known algorithms, stochastic gradient descent (a.k.a. Robbins-Monro algorithm) as well as a simple modification where iterates are averaged (a.k.a. Polyak-Ruppert averaging). Our analysis suggests that a learning rate proportional to the inverse of the number of iterations, while leading to the optimal convergence rate in the strongly convex case, is not robust to the lack of strong convexity or the setting of the proportionality constant. This situation is remedied when using slower decays together with averaging, robustly leading to the optimal rate of convergence. We illustrate our theoretical results with simulations on synthetic and standard datasets.

3 0.60424173 172 nips-2011-Minimax Localization of Structural Information in Large Noisy Matrices

Author: Mladen Kolar, Sivaraman Balakrishnan, Alessandro Rinaldo, Aarti Singh

Abstract: We consider the problem of identifying a sparse set of relevant columns and rows in a large data matrix with highly corrupted entries. This problem of identifying groups from a collection of bipartite variables such as proteins and drugs, biological species and gene sequences, malware and signatures, etc is commonly referred to as biclustering or co-clustering. Despite its great practical relevance, and although several ad-hoc methods are available for biclustering, theoretical analysis of the problem is largely non-existent. The problem we consider is also closely related to structured multiple hypothesis testing, an area of statistics that has recently witnessed a flurry of activity. We make the following contributions 1. We prove lower bounds on the minimum signal strength needed for successful recovery of a bicluster as a function of the noise variance, size of the matrix and bicluster of interest. 2. We show that a combinatorial procedure based on the scan statistic achieves this optimal limit. 3. We characterize the SNR required by several computationally tractable procedures for biclustering including element-wise thresholding, column/row average thresholding and a convex relaxation approach to sparse singular vector decomposition. 1

4 0.52600199 186 nips-2011-Noise Thresholds for Spectral Clustering

Author: Sivaraman Balakrishnan, Min Xu, Akshay Krishnamurthy, Aarti Singh

Abstract: Although spectral clustering has enjoyed considerable empirical success in machine learning, its theoretical properties are not yet fully developed. We analyze the performance of a spectral algorithm for hierarchical clustering and show that on a class of hierarchically structured similarity matrices, this algorithm can tolerate noise that grows with the number of data points while still perfectly recovering the hierarchical clusters with high probability. We additionally improve upon previous results for k-way spectral clustering to derive conditions under which spectral clustering makes no mistakes. Further, using minimax analysis, we derive tight upper and lower bounds for the clustering problem and compare the performance of spectral clustering to these information theoretic limits. We also present experiments on simulated and real world data illustrating our results. 1

5 0.52290195 29 nips-2011-Algorithms and hardness results for parallel large margin learning

Author: Phil Long, Rocco Servedio

Abstract: We study the fundamental problem of learning an unknown large-margin halfspace in the context of parallel computation. Our main positive result is a parallel algorithm for learning a large-margin halfspace that is based on interior point methods from convex optimization and fast parallel algorithms for matrix computations. We show that this algorithm learns an unknown γ-margin halfspace over n dimensions using poly(n, 1/γ) processors ˜ and runs in time O(1/γ) + O(log n). In contrast, naive parallel algorithms that learn a γ-margin halfspace in time that depends polylogarithmically on n have Ω(1/γ 2 ) runtime dependence on γ. Our main negative result deals with boosting, which is a standard approach to learning large-margin halfspaces. We give an information-theoretic proof that in the original PAC framework, in which a weak learning algorithm is provided as an oracle that is called by the booster, boosting cannot be parallelized: the ability to call the weak learner multiple times in parallel within a single boosting stage does not reduce the overall number of successive stages of boosting that are required. 1

6 0.51846647 198 nips-2011-On U-processes and clustering performance

7 0.5163582 294 nips-2011-Unifying Framework for Fast Learning Rate of Non-Sparse Multiple Kernel Learning

8 0.51450777 202 nips-2011-On the Universality of Online Mirror Descent

9 0.51149225 203 nips-2011-On the accuracy of l1-filtering of signals with block-sparse structure

10 0.51112103 263 nips-2011-Sparse Manifold Clustering and Embedding

11 0.51071066 109 nips-2011-Greedy Model Averaging

12 0.51019782 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity

13 0.50994706 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions

14 0.5095315 195 nips-2011-On Learning Discrete Graphical Models using Greedy Methods

15 0.50848603 71 nips-2011-Directed Graph Embedding: an Algorithm based on Continuous Limits of Laplacian-type Operators

16 0.50796002 80 nips-2011-Efficient Online Learning via Randomized Rounding

17 0.50734931 296 nips-2011-Uniqueness of Belief Propagation on Signed Graphs

18 0.50603288 265 nips-2011-Sparse recovery by thresholded non-negative least squares

19 0.50462866 22 nips-2011-Active Ranking using Pairwise Comparisons

20 0.50402701 25 nips-2011-Adaptive Hedge