jmlr jmlr2013 jmlr2013-26 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Takafumi Kanamori, Akiko Takeda, Taiji Suzuki
Abstract: There are two main approaches to binary classification problems: the loss function approach and the uncertainty set approach. The loss function approach is widely used in real-world data analysis. Statistical decision theory has been used to elucidate its properties such as statistical consistency. Conditional probabilities can also be estimated by using the minimum solution of the loss function. In the uncertainty set approach, an uncertainty set is defined for each binary label from training samples. The best separating hyperplane between the two uncertainty sets is used as the decision function. Although the uncertainty set approach provides an intuitive understanding of learning algorithms, its statistical properties have not been sufficiently studied. In this paper, we show that the uncertainty set is deeply connected with the convex conjugate of a loss function. On the basis of the conjugate relation, we propose a way of revising the uncertainty set approach so that it will have good statistical properties such as statistical consistency. We also introduce statistical models corresponding to uncertainty sets in order to estimate conditional probabilities. Finally, we present numerical experiments, verifying that learning with revised uncertainty sets improves prediction accuracy. Keywords: loss function, uncertainty set, convex conjugate, consistency
Reference: text
sentIndex sentText sentNum sentScore
1 There are two main approaches to binary classification problems: the loss function approach and the uncertainty set approach. [sent-14, score-0.48]
2 In the uncertainty set approach, an uncertainty set is defined for each binary label from training samples. [sent-18, score-0.632]
3 The best separating hyperplane between the two uncertainty sets is used as the decision function. [sent-20, score-0.402]
4 Although the uncertainty set approach provides an intuitive understanding of learning algorithms, its statistical properties have not been sufficiently studied. [sent-21, score-0.351]
5 In this paper, we show that the uncertainty set is deeply connected with the convex conjugate of a loss function. [sent-23, score-0.626]
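For reference, the convex conjugate referred to here is the standard object from convex analysis (see, e.g., Rockafellar); the notation below is generic and not taken from the paper:

\[
f^{*}(y) \;=\; \sup_{x \in \operatorname{dom} f} \bigl\{ \langle x, y\rangle - f(x) \bigr\}.
\]

The paper's claim is that applying this operation to a loss function yields a description of the corresponding uncertainty set.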
6 On the basis of the conjugate relation, we propose a way of revising the uncertainty set approach so that it will have good statistical properties such as statistical consistency. [sent-24, score-0.52]
7 We also introduce statistical models corresponding to uncertainty sets in order to estimate conditional probabilities. [sent-25, score-0.401]
8 Keywords: loss function, uncertainty set, convex conjugate, consistency. [sent-27, score-0.548]
9 A decision function minimizing the empirical mean of the loss function over the training samples is employed as an estimator (Cortes and Vapnik, 1995; Schölkopf et al. [sent-37, score-0.38]
10 For example, the hinge loss, exponential loss and logistic loss are used for the support vector machine (SVM), AdaBoost and logistic regression, respectively. [sent-40, score-0.362]
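As a reminder, the standard margin-based forms of these three losses are shown below, written with margin z = y f(x); the paper's own parameterization may differ by constants:

\[
\ell_{\mathrm{hinge}}(z) = \max\{1 - z,\, 0\}, \qquad
\ell_{\exp}(z) = e^{-z}, \qquad
\ell_{\mathrm{logistic}}(z) = \log\bigl(1 + e^{-z}\bigr).
\]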
11 The loss function approach provides not only an estimator of the decision function, but also an estimator of the conditional probability of binary labels for a given input. [sent-46, score-0.386]
12 The sign of the estimated decision function is used for the label prediction, and the magnitude of the decision function is connected to the conditional probability via the loss function. [sent-47, score-0.527]
13 For example, the logistic loss and exponential loss produce logistic models, whereas the hinge loss cannot be used to estimate the conditional probability except the probability 0. [sent-50, score-0.636]
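The connection mentioned here can be made concrete with the well-known population minimizers (standard results, e.g., Friedman et al. for the exponential loss; not reproduced from this paper). Writing p(x) = P(y = +1 | x),

\[
f^{*}_{\mathrm{logistic}}(x) = \log\frac{p(x)}{1-p(x)}, \qquad
f^{*}_{\exp}(x) = \frac{1}{2}\log\frac{p(x)}{1-p(x)},
\]

so p(x) can be recovered as 1/(1+e^{-f^*(x)}) or 1/(1+e^{-2 f^*(x)}), respectively, whereas the hinge-loss minimizer is sign(2p(x) − 1), which retains only the side of 1/2 on which p(x) lies.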
14 There have been studies on the statistical properties of learning with uncertainty sets. [sent-68, score-0.351]
15 …(2003) proposed the minimax probability machine (MPM) using ellipsoidal uncertainty sets and studied its statistical properties in the worst-case setting. [sent-70, score-0.431]
16 In statistical learning using uncertainty sets, the main concern is to develop optimization algorithms under the maximum margin criterion (Mavroforakis and Theodoridis, 2006). [sent-71, score-0.409]
17 So far, however, the statistical properties of learning with uncertainty sets have not been studied as much as those of learning with loss functions. [sent-72, score-0.515]
18 The main purpose of this paper is to study the relation between the loss function approach and uncertainty set approach, and to use the relation to transform learning with uncertainty sets into loss-based learning in order to clarify the statistical properties of learning algorithms. [sent-73, score-0.913]
19 As a result, we can establish a correspondence between uncertainty sets and statistical models of conditional probabilities. [sent-75, score-0.401]
20 Note that some of the existing learning methods using uncertainty sets do not necessarily have good statistical properties, such as statistical consistency. [sent-76, score-0.386]
21 We propose a way of revising uncertainty sets to establish statistical consistency. [sent-77, score-0.407]
22 Figure 1 shows how uncertainty sets, loss functions and statistical models are related. [sent-78, score-0.515]
23 Starting from a learning algorithm with uncertainty sets, we obtain the corresponding loss function and statistical model via the convex conjugate. [sent-79, score-0.583]
24 By revising uncertainty sets, we can obtain the corresponding loss functions and statistical models. [sent-81, score-0.571]
25 We think that our method of revising uncertainty sets can bridge the gap between intuitive statistical modeling and the nice statistical properties of learning algorithms. [sent-84, score-0.442]
26 Section 2 reviews the existing learning methods using loss functions and uncertainty sets. [sent-86, score-0.48]
27 We describe the relation between the loss function and the uncertainty set in ν-SVM. [sent-87, score-0.521]
28 Section 3 is an investigation of the general relation between loss functions and uncertainty sets. [sent-89, score-0.521]
29 Figure 1: Relations among uncertainty sets, loss functions and statistical models. [sent-97, score-0.515]
30 1, we derive uncertainty sets from loss functions by using the convex conjugate of loss functions. [sent-99, score-0.79]
31 2 shows how to revise uncertainty sets in order to obtain loss functions from them. [sent-103, score-0.48]
32 …how to revise the uncertainty set so that it will have good statistical properties. [sent-106, score-0.407]
33 Both the hinge loss and the exponential loss are convex in f, and they provide an upper bound of the 0-1 loss. [sent-161, score-0.43]
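In symbols (writing the 0-1 loss as an indicator of a nonpositive margin; the paper may use a strict inequality instead):

\[
\mathbf{1}\,[\,y f(x) \le 0\,] \;\le\; \max\{1 - y f(x),\, 0\}
\qquad\text{and}\qquad
\mathbf{1}\,[\,y f(x) \le 0\,] \;\le\; e^{-y f(x)}.
\]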
34 The quantitative relation between the 0-1 loss and the surrogate loss was studied by Bartlett et al. [sent-163, score-0.42]
35 (1998) and Bartlett and Tewari (2007), surrogate loss functions provide statistical models for the conditional probability of a label y for a given x, that is, P(y|x). [sent-176, score-0.373]
36 …classification problems, that is, statistical learning based on the uncertainty set. [sent-193, score-0.351]
37 Each uncertainty set is assumed to include the mean vector of the distribution of input point x conditioned on each label (Takeda et al. [sent-205, score-0.359]
38 An ellipsoidal uncertainty set is also used for the robust classification… [sent-214, score-0.366]
39 Figure 2: Decision boundary estimated by solving the minimum distance problem with the uncertainty sets U_p and U_n. [sent-217, score-0.356]
40 We use the uncertainty set to estimate the linear decision function f(x) = w^T x + b. [sent-218, score-0.402]
41 (2003), but the ellipsoidal uncertainty set also plays an important role in MPM. [sent-232, score-0.366]
42 When the bias term b in the linear decision function is estimated such that the decision boundary bisects the line segment connecting x*_p and x*_n, the estimated decision boundary will have the maximum margin between the uncertainty sets, U_p and U_n. [sent-234, score-0.626]
43 …-SVM. Here, we will describe how the loss function approach and the uncertainty set approach are related to each other in ν-SVM. [sent-239, score-0.48]
44 …ν-SVM is minimized over training samples such that y_i(w^T x_i + b) < ρ, and training samples having a large margin, that is, y_i(w^T x_i + b) ≥ ρ, do not contribute to the loss function max{ρ − y_i(w^T x_i + b), 0}. [sent-265, score-0.669]
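For orientation, one common way to write the ν-SVM primal problem (Schölkopf et al., 2000) is shown below; the paper's problem (3) may differ in constants or in additional constraints such as ρ ≥ 0:

\[
\min_{w,\,b,\,\rho}\;\; \frac{1}{2}\|w\|^{2} \;-\; \nu\rho \;+\; \frac{1}{m}\sum_{i=1}^{m} \max\bigl\{\rho - y_i\bigl(w^{\top} x_i + b\bigr),\, 0\bigr\}.
\]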
45 We briefly show that the dual problem of (3) yields a minimum distance problem in which the reduced convex hulls of training samples are used as uncertainty sets (see Bennett and Bredensteiner 2000 for details). [sent-285, score-0.39]
46 Hence, solving problem (7) is identical to solving the minimum distance problem with the uncertainty set of reduced convex hulls: inf_{x_p, x_n} ‖x_p − x_n‖ subject to x_p ∈ U_p, x_n ∈ U_n. [sent-321, score-0.509]
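A minimal computational sketch of this minimum distance problem between reduced convex hulls (using cvxpy; the function name and the cap parameter are ours, and the exact cap implied by ν-SVM depends on ν and the class sizes):

```python
import numpy as np
import cvxpy as cp

def min_distance_reduced_hulls(Xp, Xn, cap=0.5):
    """Nearest points of the reduced convex hulls of two sample sets.

    Xp, Xn: arrays of shape (m_p, d) and (m_n, d) holding the positive and
    negative training inputs; cap bounds each convex weight, which is what
    "reduces" the hulls.
    """
    ap = cp.Variable(Xp.shape[0], nonneg=True)   # convex weights, positive hull
    an = cp.Variable(Xn.shape[0], nonneg=True)   # convex weights, negative hull
    constraints = [cp.sum(ap) == 1, cp.sum(an) == 1, ap <= cap, an <= cap]
    objective = cp.Minimize(cp.norm(Xp.T @ ap - Xn.T @ an))
    cp.Problem(objective, constraints).solve()

    x_p = Xp.T @ ap.value            # nearest point in the reduced positive hull
    x_n = Xn.T @ an.value            # nearest point in the reduced negative hull
    w = x_p - x_n                    # normal of the separating hyperplane
    b = -w @ (x_p + x_n) / 2.0       # boundary bisects the connecting segment
    return w, b
```

The returned (w, b) give a linear decision function f(x) = w·x + b whose boundary bisects the segment between the two nearest points, as described around Figure 2.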
47 …ν-SVM, with which we can investigate the relation between loss functions and uncertainty sets. [sent-334, score-0.521]
48 We can derive the uncertainty set associated with the loss function ℓ in (9) in a similar way to what was done with ν-SVM. [sent-365, score-0.48]
49 Accordingly, the optimization problem in (10) can be represented as inf_{c_p, c_n, z_p, z_n} c_p + c_n + λ‖z_p − z_n‖ subject to z_p ∈ U_p[c_p], z_n ∈ U_n[c_n], c_p, c_n ∈ R. [sent-388, score-0.455]
50 Some calculation yields that w = λ(z_p − z_n)/‖z_p − z_n‖ holds for z_p ≠ z_n, and for z_p = z_n any vector such that ‖w‖^2 ≤ λ^2 satisfies… [sent-392, score-0.356]
51 The shape of uncertainty sets and the max-margin criterion respectively correspond to the loss function and the regularization principle. [sent-394, score-0.509]
52 Now let us show some examples of uncertainty sets (11) associated with popular loss functions. [sent-396, score-0.48]
53 For c ≥ 0, the uncertainty set consists of the reduced convex hull of the training samples, and it does not depend on the parameter c. [sent-416, score-0.424]
54 Suppose that Σ_o is invertible. [sent-426, score-0.467]
55 For α_o = (α_i)_{i∈M_o} satisfying the constraints, we get z = ∑_{i∈M_o} α_i x_i = (X − x̄_o 1^T) α_o + x̄_o, where 1 = (1, …, 1)^T. [sent-433, score-0.453]
56 …Statistical Models Associated with Uncertainty Sets. The extended minimum distance problem (12) with the parametrized uncertainty set (11) corresponds to the loss function in (9). [sent-446, score-0.514]
57 1 derived parametrized uncertainty sets associated with convex loss functions. [sent-464, score-0.582]
58 Conversely, if an uncertainty set is represented in the form of (11), a corresponding loss function exists. [sent-465, score-0.48]
59 However, if the uncertainty set does not have the form of (11), the corresponding loss function does not exist. [sent-467, score-0.48]
60 One way to deal with the drawback is to revise the uncertainty set so that it possesses a corresponding loss function. [sent-468, score-0.48]
61 In Example 2, the second expression of the uncertainty set involves the convex function h*(z) = (z − x̄_o)^T Σ_o^{-1} (z − x̄_o). [sent-483, score-0.776]
62 …Revised Uncertainty Sets and Corresponding Loss Functions. The uncertainty sets can be revised such that the primal form (17) is represented as minimization of the empirical mean of a loss function. [sent-510, score-0.51]
63 In the large sample limit, h*((1/m_o) ∑_{i∈M_o} x_i) is approximated by h*(µ_o). [sent-574, score-0.528]
64 The expanded minimum distance problem using the revised uncertainty sets Ū_p[c] and Ū_n[c] is min_{c_p, c_n, z_p, z_n} c_p + c_n + λ‖z_p − z_n‖ subject to z_p ∈ Ū_p[c_p], z_n ∈ Ū_n[c_n]. [sent-577, score-0.605]
65 The revision of uncertainty sets leads to the empirical mean of the revised loss function ℓ̄. [sent-582, score-0.6]
66 Now let us show some examples to illustrate how revision of uncertainty sets works. [sent-584, score-0.436]
67 For o ∈ {p, n}, let x̄_o and Σ_o be the empirical mean and the empirical covariance matrix, x̄_o = (1/m_o) ∑_{i∈M_o} x_i, Σ_o = (1/m_o) ∑_{i∈M_o} (x_i − x̄_o)(x_i − x̄_o)^T. [sent-593, score-1.312]
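A small numpy sketch of these per-class statistics (function and variable names are ours, not the paper's):

```python
import numpy as np

def class_statistics(X, y, label):
    """Empirical mean and covariance of the inputs carrying the given label."""
    Xo = X[y == label]                    # the samples indexed by M_o
    m_o = Xo.shape[0]
    mean_o = Xo.mean(axis=0)              # empirical mean x̄_o
    centered = Xo - mean_o
    cov_o = centered.T @ centered / m_o   # empirical covariance, normalized by m_o
    return mean_o, cov_o
```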
68 If Σ_o is invertible, we have Ū_o[c] = { z ∈ conv{x_i : i ∈ M_o} : (z − x̄_o)^T Σ_o^{-1} (z − x̄_o) ≤ cm/m_o }. [sent-594, score-0.859]
69 Figure 3 illustrates an example of the revision of the uncertainty set. [sent-612, score-0.436]
70 In the left panel, the uncertainty set does not match the distribution of the training samples. [sent-613, score-0.356]
71 On the other hand, the revised uncertainty set in the right panel well approximates the dispersal of the training samples. [sent-614, score-0.356]
72 The above uncertainty set is useful when the probability in the training phase is slightly different from that in the test phase. [sent-623, score-0.386]
73 …fication reduces from quadratic to linear around the decision boundary, though the original uncertainty set U_o[c] does not correspond to minimization of an empirical loss function. [sent-649, score-0.566]
74 Note that the revision of uncertainty sets presented in Section 4 can be used, if necessary. [sent-661, score-0.436]
75 Example 7 (ellipsoidal uncertainty sets in RKHS) Let us consider an uncertainty set U[c] in RKHS H defined… [sent-669, score-0.632]
76 This is the kernel variant of the ellipsoidal uncertainty set in Example 2. [sent-686, score-0.366]
77 In addition, we can verify the statistical consistency of the learning algorithm with the (revised) uncertainty sets by taking the corresponding loss function into account. [sent-689, score-0.515]
78 Appendix B presents a rigorous proof of the duality between (27) and (25) with the uncertainty set (26). [sent-727, score-0.354]
79 …3 show that some popular uncertainty sets and their revisions yield loss functions satisfying the above sufficient condition. [sent-748, score-0.48]
80 The hinge loss and logistic loss do not satisfy this assumption, whereas the quadratic loss and exponential loss satisfy it. [sent-771, score-0.69]
81 As a result, it is shown that the existence of ρ̂ is guaranteed for the truncated quadratic loss, the exponential loss and the loss function derived from the uncertainty set with the estimation error in Example 6; see the examples provided in Appendix C. [sent-810, score-0.644]
82 Experiments. We conducted some numerical experiments to examine the prediction performance of our method of revising uncertainty sets. [sent-813, score-0.436]
83 This loss function corresponds to the revised version of the ellipsoidal uncertainty set with the estimation error. [sent-827, score-0.846]
84 Figure 5 presents the squared loss and absolute loss of the estimated conditional probability versus the size of the training samples for C-SVM and the proposed method with h = 0 and h = 1. [sent-899, score-0.522]
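As a simple illustration of the two evaluation measures used here (a sketch; the paper's exact evaluation protocol, e.g., how the true P(y|x) is obtained, is not reproduced):

```python
import numpy as np

def probability_losses(p_hat, p_true):
    """Squared and absolute loss between estimated and true P(y = 1 | x)."""
    p_hat = np.asarray(p_hat, dtype=float)
    p_true = np.asarray(p_true, dtype=float)
    squared = np.mean((p_hat - p_true) ** 2)
    absolute = np.mean(np.abs(p_hat - p_true))
    return squared, absolute
```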
85 Table 2: Squared loss and absolute loss of the estimated conditional probability P(y|x). [sent-944, score-0.448]
86 The uncertainty set of MPM and unbiased MPM is an ellipsoid defined… [sent-953, score-0.36]
87 The revision of the ellipsoidal uncertainty set leads to the uncertainty set of our algorithm. [sent-958, score-0.802]
88 Hence, the revision of uncertainty sets can improve the prediction accuracy of uncertainty-set-based learning. [sent-961, score-0.436]
89 Conclusion. We studied the relation between the loss function approach and the uncertainty set approach in binary classification… [sent-1001, score-0.521]
90 Given a loss function, there exists a corresponding parametrized uncertainty set. [sent-1004, score-0.514]
91 In general, however, the uncertainty set does not correspond to the empirical loss function. [sent-1005, score-0.48]
92 We presented a way of revising the uncertainty set so that it will correspond to an empirical loss function. [sent-1006, score-0.536]
93 We compared C-SVM, MPM, unbiased MPM, and the proposed method with loss functions (24) having h = 0 and h = 1: boldface letters indicate that the average squared loss is the smallest. [sent-1149, score-0.372]
94 Table 5: Squared loss ×100 of estimated conditional probability for C-SVM and the proposed method with loss functions (24) having h = 0 and h = 1: boldface letters indicate that the average test error is the smallest. [sent-1229, score-0.448]
95 We are currently investigating the possibility of relaxing the assumptions so as to include the hinge loss and other popular loss functions such as the logistic loss. [sent-1231, score-0.362]
96 The relation between the loss function approach and the uncertainty set approach is a useful tool for statistical modeling. [sent-1237, score-0.556]
97 We believe that learning algorithms with revised uncertainty sets can bridge the gap between intuitive statistical modeling and nice statistical properties. [sent-1239, score-0.386]
98 …, (x_m, y_m)}, the empirical loss R_T(f, ρ) and the regularized empirical loss R_{T,λ}(f, ρ) are defined… [sent-1328, score-0.356]
99 Figure 6: The derivative of the loss function corresponding to the revised uncertainty set with the estimation error. [sent-1642, score-0.509]
100 Therefore, the loss function corresponding to the revised uncertainty set in Example 6 satisfies… [sent-1650, score-0.48]
wordName wordTfidf (topN-words)
[('mo', 0.467), ('uncertainty', 0.316), ('mpm', 0.261), ('takeda', 0.239), ('anamori', 0.205), ('uzuki', 0.205), ('elation', 0.196), ('onjugate', 0.196), ('xo', 0.196), ('loss', 0.164), ('uo', 0.144), ('roblems', 0.13), ('revision', 0.12), ('lassification', 0.116), ('mn', 0.109), ('wt', 0.108), ('rt', 0.091), ('inf', 0.09), ('zn', 0.089), ('un', 0.086), ('conv', 0.086), ('decision', 0.086), ('conjugate', 0.078), ('cn', 0.076), ('mp', 0.07), ('convex', 0.068), ('lo', 0.066), ('gm', 0.066), ('bartlett', 0.063), ('xi', 0.061), ('margin', 0.058), ('yi', 0.058), ('bredensteiner', 0.056), ('mptt', 0.056), ('revising', 0.056), ('surrogate', 0.051), ('steinwart', 0.051), ('ellipsoidal', 0.05), ('conditional', 0.05), ('subdifferential', 0.046), ('unbiased', 0.044), ('bennett', 0.043), ('fm', 0.043), ('label', 0.043), ('suppose', 0.042), ('lanckriet', 0.042), ('relation', 0.041), ('sup', 0.04), ('rkhs', 0.04), ('training', 0.04), ('banana', 0.04), ('nath', 0.04), ('ringnorm', 0.04), ('twonorm', 0.04), ('estimated', 0.04), ('duality', 0.038), ('diabetis', 0.037), ('nfuipe', 0.037), ('qspqptfe', 0.037), ('rmo', 0.037), ('bhattacharyya', 0.036), ('statistical', 0.035), ('subject', 0.035), ('classi', 0.035), ('samples', 0.034), ('parametrized', 0.034), ('hinge', 0.034), ('ln', 0.033), ('assumption', 0.032), ('crisp', 0.032), ('rockafellar', 0.032), ('cm', 0.032), ('primal', 0.03), ('probability', 0.03), ('lm', 0.029), ('covering', 0.029), ('derivative', 0.029), ('thyroid', 0.029), ('titanic', 0.029), ('regularization', 0.029), ('lkopf', 0.028), ('sign', 0.028), ('ym', 0.028), ('karatzoglou', 0.028), ('limz', 0.028), ('mavroforakis', 0.028), ('nagoya', 0.028), ('taiji', 0.028), ('zup', 0.028), ('converges', 0.028), ('estimator', 0.028), ('inequality', 0.027), ('lemma', 0.027), ('bayes', 0.027), ('margins', 0.027), ('splice', 0.027), ('assures', 0.027), ('bm', 0.026), ('condition', 0.026), ('risk', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
Author: Takafumi Kanamori, Akiko Takeda, Taiji Suzuki
2 0.12310771 114 jmlr-2013-The Rate of Convergence of AdaBoost
Author: Indraneel Mukherjee, Cynthia Rudin, Robert E. Schapire
Abstract: The AdaBoost algorithm was designed to combine many “weak” hypotheses that perform slightly better than random guessing into a “strong” hypothesis that has very low error. We study the rate at which AdaBoost iteratively converges to the minimum of the “exponential loss.” Unlike previous work, our proofs do not require a weak-learning assumption, nor do they require that minimizers of the exponential loss are finite. Our first result shows that the exponential loss of AdaBoost’s computed parameter vector will be at most ε more than that of any parameter vector of ℓ1 -norm bounded by B in a number of rounds that is at most a polynomial in B and 1/ε. We also provide lower bounds showing that a polynomial dependence is necessary. Our second result is that within C/ε iterations, AdaBoost achieves a value of the exponential loss that is at most ε more than the best possible value, where C depends on the data set. We show that this dependence of the rate on ε is optimal up to constant factors, that is, at least Ω(1/ε) rounds are necessary to achieve within ε of the optimal exponential loss. Keywords: AdaBoost, optimization, coordinate descent, convergence rate
3 0.077027813 68 jmlr-2013-Machine Learning with Operational Costs
Author: Theja Tulabandhula, Cynthia Rudin
Abstract: This work proposes a way to align statistical modeling with decision making. We provide a method that propagates the uncertainty in predictive modeling to the uncertainty in operational cost, where operational cost is the amount spent by the practitioner in solving the problem. The method allows us to explore the range of operational costs associated with the set of reasonable statistical models, so as to provide a useful way for practitioners to understand uncertainty. To do this, the operational cost is cast as a regularization term in a learning algorithm’s objective function, allowing either an optimistic or pessimistic view of possible costs, depending on the regularization parameter. From another perspective, if we have prior knowledge about the operational cost, for instance that it should be low, this knowledge can help to restrict the hypothesis space, and can help with generalization. We provide a theoretical generalization bound for this scenario. We also show that learning with operational costs is related to robust optimization. Keywords: statistical learning theory, optimization, covering numbers, decision theory
4 0.073945895 79 jmlr-2013-On the Mutual Nearest Neighbors Estimate in Regression
Author: Arnaud Guyader, Nick Hengartner
Abstract: Motivated by promising experimental results, this paper investigates the theoretical properties of a recently proposed nonparametric estimator, called the Mutual Nearest Neighbors rule, which estimates the regression function m(x) = E[Y |X = x] as follows: first identify the k nearest neighbors of x in the sample Dn , then keep only those for which x is itself one of the k nearest neighbors, and finally take the average over the corresponding response variables. We prove that this estimator is consistent and that its rate of convergence is optimal. Since the estimate with the optimal rate of convergence depends on the unknown distribution of the observations, we also present adaptation results by data-splitting. Keywords: nonparametric estimation, nearest neighbor methods, mathematical statistics
5 0.061164327 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning
Author: Sivan Sabato, Nathan Srebro, Naftali Tishby
Abstract: We obtain a tight distribution-specific characterization of the sample complexity of large-margin classification with L2 regularization: We introduce the margin-adapted dimension, which is a simple function of the second order statistics of the data distribution, and show distribution-specific upper and lower bounds on the sample complexity, both governed by the margin-adapted dimension of the data distribution. The upper bounds are universal, and the lower bounds hold for the rich family of sub-Gaussian distributions with independent features. We conclude that this new quantity tightly characterizes the true sample complexity of large-margin classification. To prove the lower bound, we develop several new tools of independent interest. These include new connections between shattering and hardness of learning, new properties of shattering with linear classifiers, and a new lower bound on the smallest eigenvalue of a random Gram matrix generated by sub-Gaussian variables. Our results can be used to quantitatively compare large margin learning to other learning rules, and to improve the effectiveness of methods that use sample complexity bounds, such as active learning. Keywords: supervised learning, sample complexity, linear classifiers, distribution-dependence
6 0.06079771 120 jmlr-2013-Variational Algorithms for Marginal MAP
7 0.054254603 77 jmlr-2013-On the Convergence of Maximum Variance Unfolding
8 0.054241735 8 jmlr-2013-A Theory of Multiclass Boosting
9 0.053278219 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification
10 0.05095382 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
11 0.050939444 104 jmlr-2013-Sparse Single-Index Model
12 0.050635051 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres
13 0.050310966 31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting
14 0.046561062 117 jmlr-2013-Universal Consistency of Localized Versions of Regularized Kernel Methods
15 0.046427511 76 jmlr-2013-Nonparametric Sparsity and Regularization
16 0.045685597 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
17 0.044714611 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference
18 0.044116296 70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach
19 0.042932693 73 jmlr-2013-Multicategory Large-Margin Unified Machines
20 0.041719265 18 jmlr-2013-Beyond Fano's Inequality: Bounds on the Optimal F-Score, BER, and Cost-Sensitive Risk and Their Implications
topicId topicWeight
[(0, -0.22), (1, 0.066), (2, 0.072), (3, 0.098), (4, -0.015), (5, -0.108), (6, 0.062), (7, -0.006), (8, 0.024), (9, -0.155), (10, 0.0), (11, 0.066), (12, 0.147), (13, -0.056), (14, -0.13), (15, -0.011), (16, -0.056), (17, 0.027), (18, -0.041), (19, 0.112), (20, -0.009), (21, 0.009), (22, 0.025), (23, 0.065), (24, 0.03), (25, 0.1), (26, -0.05), (27, -0.041), (28, -0.1), (29, -0.137), (30, 0.006), (31, -0.162), (32, -0.015), (33, -0.007), (34, -0.038), (35, 0.042), (36, -0.01), (37, -0.053), (38, -0.025), (39, 0.141), (40, -0.132), (41, 0.124), (42, 0.012), (43, 0.098), (44, 0.043), (45, 0.09), (46, 0.224), (47, -0.196), (48, -0.133), (49, -0.111)]
simIndex simValue paperId paperTitle
same-paper 1 0.93894356 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
Author: Takafumi Kanamori, Akiko Takeda, Taiji Suzuki
2 0.58152759 68 jmlr-2013-Machine Learning with Operational Costs
Author: Theja Tulabandhula, Cynthia Rudin
Abstract: This work proposes a way to align statistical modeling with decision making. We provide a method that propagates the uncertainty in predictive modeling to the uncertainty in operational cost, where operational cost is the amount spent by the practitioner in solving the problem. The method allows us to explore the range of operational costs associated with the set of reasonable statistical models, so as to provide a useful way for practitioners to understand uncertainty. To do this, the operational cost is cast as a regularization term in a learning algorithm’s objective function, allowing either an optimistic or pessimistic view of possible costs, depending on the regularization parameter. From another perspective, if we have prior knowledge about the operational cost, for instance that it should be low, this knowledge can help to restrict the hypothesis space, and can help with generalization. We provide a theoretical generalization bound for this scenario. We also show that learning with operational costs is related to robust optimization. Keywords: statistical learning theory, optimization, covering numbers, decision theory
3 0.53890723 114 jmlr-2013-The Rate of Convergence of AdaBoost
Author: Indraneel Mukherjee, Cynthia Rudin, Robert E. Schapire
Abstract: The AdaBoost algorithm was designed to combine many “weak” hypotheses that perform slightly better than random guessing into a “strong” hypothesis that has very low error. We study the rate at which AdaBoost iteratively converges to the minimum of the “exponential loss.” Unlike previous work, our proofs do not require a weak-learning assumption, nor do they require that minimizers of the exponential loss are finite. Our first result shows that the exponential loss of AdaBoost’s computed parameter vector will be at most ε more than that of any parameter vector of ℓ1 -norm bounded by B in a number of rounds that is at most a polynomial in B and 1/ε. We also provide lower bounds showing that a polynomial dependence is necessary. Our second result is that within C/ε iterations, AdaBoost achieves a value of the exponential loss that is at most ε more than the best possible value, where C depends on the data set. We show that this dependence of the rate on ε is optimal up to constant factors, that is, at least Ω(1/ε) rounds are necessary to achieve within ε of the optimal exponential loss. Keywords: AdaBoost, optimization, coordinate descent, convergence rate
4 0.52022588 73 jmlr-2013-Multicategory Large-Margin Unified Machines
Author: Chong Zhang, Yufeng Liu
Abstract: Hard and soft classifiers are two important groups of techniques for classification problems. Logistic regression and Support Vector Machines are typical examples of soft and hard classifiers respectively. The essential difference between these two groups is whether one needs to estimate the class conditional probability for the classification task or not. In particular, soft classifiers predict the label based on the obtained class conditional probabilities, while hard classifiers bypass the estimation of probabilities and focus on the decision boundary. In practice, for the goal of accurate classification, it is unclear which one to use in a given situation. To tackle this problem, the Largemargin Unified Machine (LUM) was recently proposed as a unified family to embrace both groups. The LUM family enables one to study the behavior change from soft to hard binary classifiers. For multicategory cases, however, the concept of soft and hard classification becomes less clear. In that case, class probability estimation becomes more involved as it requires estimation of a probability vector. In this paper, we propose a new Multicategory LUM (MLUM) framework to investigate the behavior of soft versus hard classification under multicategory settings. Our theoretical and numerical results help to shed some light on the nature of multicategory classification and its transition behavior from soft to hard classifiers. The numerical results suggest that the proposed tuned MLUM yields very competitive performance. Keywords: hard classification, large-margin, soft classification, support vector machine
5 0.45277441 77 jmlr-2013-On the Convergence of Maximum Variance Unfolding
Author: Ery Arias-Castro, Bruno Pelletier
Abstract: Maximum Variance Unfolding is one of the main methods for (nonlinear) dimensionality reduction. We study its large sample limit, providing specific rates of convergence under standard assumptions. We find that it is consistent when the underlying submanifold is isometric to a convex subset, and we provide some simple examples where it fails to be consistent. Keywords: maximum variance unfolding, isometric embedding, U-processes, empirical processes, proximity graphs.
6 0.44256449 79 jmlr-2013-On the Mutual Nearest Neighbors Estimate in Regression
7 0.40481493 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification
8 0.38762352 31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting
9 0.37673199 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning
10 0.36416367 104 jmlr-2013-Sparse Single-Index Model
11 0.36398432 8 jmlr-2013-A Theory of Multiclass Boosting
12 0.36103454 120 jmlr-2013-Variational Algorithms for Marginal MAP
13 0.35184383 18 jmlr-2013-Beyond Fano's Inequality: Bounds on the Optimal F-Score, BER, and Cost-Sensitive Risk and Their Implications
14 0.34642726 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
15 0.34175804 107 jmlr-2013-Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization
16 0.33615947 117 jmlr-2013-Universal Consistency of Localized Versions of Regularized Kernel Methods
17 0.31149396 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres
18 0.3083905 70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach
19 0.30551463 76 jmlr-2013-Nonparametric Sparsity and Regularization
20 0.28438151 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
topicId topicWeight
[(0, 0.014), (5, 0.163), (6, 0.024), (10, 0.079), (20, 0.011), (23, 0.041), (41, 0.014), (68, 0.037), (70, 0.041), (75, 0.043), (85, 0.07), (87, 0.018), (93, 0.019), (97, 0.317)]
simIndex simValue paperId paperTitle
same-paper 1 0.73014295 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
Author: Takafumi Kanamori, Akiko Takeda, Taiji Suzuki
2 0.53702712 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
Author: Yuchen Zhang, John C. Duchi, Martin J. Wainwright
Abstract: We analyze two communication-efficient algorithms for distributed optimization in statistical settings involving large-scale data sets. The first algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves √ mean-squared error (MSE) that decays as O (N −1 + (N/m)−2 ). Whenever m ≤ N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as O (N −1 + (N/m)−3 ), and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as O (N −1 + (N/m)−3/2 ), easing computation at the expense of a potentially slower MSE rate. We also provide an experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with N ≈ 2.4 × 108 samples and d ≈ 740,000 covariates. Keywords: distributed learning, stochastic optimization, averaging, subsampling
3 0.53286928 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
Author: Sébastien Gerchinovitz
Abstract: We consider the problem of online linear regression on arbitrary deterministic sequences when the ambient dimension d can be much larger than the number of time rounds T . We introduce the notion of sparsity regret bound, which is a deterministic online counterpart of recent risk bounds derived in the stochastic setting under a sparsity scenario. We prove such regret bounds for an online-learning algorithm called SeqSEW and based on exponential weighting and data-driven truncation. In a second part we apply a parameter-free version of this algorithm to the stochastic setting (regression model with random design). This yields risk bounds of the same flavor as in Dalalyan and Tsybakov (2012a) but which solve two questions left open therein. In particular our risk bounds are adaptive (up to a logarithmic factor) to the unknown variance of the noise if the latter is Gaussian. We also address the regression model with fixed design. Keywords: sparsity, online linear regression, individual sequences, adaptive regret bounds
4 0.53044176 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
Author: Tony Cai, Wen-Xin Zhou
Abstract: We consider in this paper the problem of noisy 1-bit matrix completion under a general non-uniform sampling distribution using the max-norm as a convex relaxation for the rank. A max-norm constrained maximum likelihood estimate is introduced and studied. The rate of convergence for the estimate is obtained. Information-theoretical methods are used to establish a minimax lower bound under the general sampling model. The minimax upper and lower bounds together yield the optimal rate of convergence for the Frobenius norm loss. Computational algorithms and numerical performance are also discussed. Keywords: 1-bit matrix completion, low-rank matrix, max-norm, trace-norm, constrained optimization, maximum likelihood estimate, optimal rate of convergence
5 0.52883512 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach
Author: Alon Gonen, Sivan Sabato, Shai Shalev-Shwartz
Abstract: We study pool-based active learning of half-spaces. We revisit the aggressive approach for active learning in the realizable case, and show that it can be made efficient and practical, while also having theoretical guarantees under reasonable assumptions. We further show, both theoretically and experimentally, that it can be preferable to mellow approaches. Our efficient aggressive active learner of half-spaces has formal approximation guarantees that hold when the pool is separable with a margin. While our analysis is focused on the realizable setting, we show that a simple heuristic allows using the same algorithm successfully for pools with low error as well. We further compare the aggressive approach to the mellow approach, and prove that there are cases in which the aggressive approach results in significantly better label complexity compared to the mellow approach. We demonstrate experimentally that substantial improvements in label complexity can be achieved using the aggressive approach, for both realizable and low-error settings. Keywords: active learning, linear classifiers, margin, adaptive sub-modularity
6 0.52545184 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
7 0.52400804 73 jmlr-2013-Multicategory Large-Margin Unified Machines
9 0.52145463 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
10 0.5162254 14 jmlr-2013-Asymptotic Results on Adaptive False Discovery Rate Controlling Procedures Based on Kernel Estimators
11 0.51576102 114 jmlr-2013-The Rate of Convergence of AdaBoost
12 0.51546043 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion
13 0.51496923 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning
15 0.51424694 76 jmlr-2013-Nonparametric Sparsity and Regularization
16 0.51387197 68 jmlr-2013-Machine Learning with Operational Costs
17 0.51362354 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning
18 0.51179796 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees
19 0.51147735 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation