jmlr jmlr2013 jmlr2013-26 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Takafumi Kanamori, Akiko Takeda, Taiji Suzuki
Abstract: There are two main approaches to binary classification problems: the loss function approach and the uncertainty set approach. The loss function approach is widely used in real-world data analysis. Statistical decision theory has been used to elucidate its properties such as statistical consistency. Conditional probabilities can also be estimated by using the minimum solution of the loss function. In the uncertainty set approach, an uncertainty set is defined for each binary label from training samples. The best separating hyperplane between the two uncertainty sets is used as the decision function. Although the uncertainty set approach provides an intuitive understanding of learning algorithms, its statistical properties have not been sufficiently studied. In this paper, we show that the uncertainty set is deeply connected with the convex conjugate of a loss function. On the basis of the conjugate relation, we propose a way of revising the uncertainty set approach so that it will have good statistical properties such as statistical consistency. We also introduce statistical models corresponding to uncertainty sets in order to estimate conditional probabilities. Finally, we present numerical experiments, verifying that learning with revised uncertainty sets improves prediction accuracy. Keywords: loss function, uncertainty set, convex conjugate, consistency
Reference: text
sentIndex sentText sentNum sentScore
1 There are two main approaches to binary classification problems: the loss function approach and the uncertainty set approach. [sent-14, score-0.48]
2 In the uncertainty set approach, an uncertainty set is defined for each binary label from training samples. [sent-18, score-0.632]
3 The best separating hyperplane between the two uncertainty sets is used as the decision function. [sent-20, score-0.402]
4 Although the uncertainty set approach provides an intuitive understanding of learning algorithms, its statistical properties have not been sufficiently studied. [sent-21, score-0.351]
5 In this paper, we show that the uncertainty set is deeply connected with the convex conjugate of a loss function. [sent-23, score-0.626]
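For reference, the convex conjugate referred to here is the standard object from convex analysis (see, e.g., Rockafellar); the notation below is generic and not taken from the paper:

\[
f^{*}(y) \;=\; \sup_{x \in \operatorname{dom} f} \bigl\{ \langle x, y\rangle - f(x) \bigr\}.
\]

The paper's claim is that applying this operation to a loss function yields a description of the corresponding uncertainty set.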
6 On the basis of the conjugate relation, we propose a way of revising the uncertainty set approach so that it will have good statistical properties such as statistical consistency. [sent-24, score-0.52]
7 We also introduce statistical models corresponding to uncertainty sets in order to estimate conditional probabilities. [sent-25, score-0.401]
8 Keywords: loss function, uncertainty set, convex conjugate, consistency. [sent-27, score-0.548]
9 A decision function minimizing the empirical mean of the loss function over the training samples is employed as an estimator (Cortes and Vapnik, 1995; Schölkopf et al. [sent-37, score-0.38]
10 For example, the hinge loss, exponential loss and logistic loss are used for the support vector machine (SVM), AdaBoost and logistic regression, respectively. [sent-40, score-0.362]
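As a reminder, the standard margin-based forms of these three losses are shown below, written with margin z = y f(x); the paper's own parameterization may differ by constants:

\[
\ell_{\mathrm{hinge}}(z) = \max\{1 - z,\, 0\}, \qquad
\ell_{\exp}(z) = e^{-z}, \qquad
\ell_{\mathrm{logistic}}(z) = \log\bigl(1 + e^{-z}\bigr).
\]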
11 The loss function approach provides not only an estimator of the decision function, but also an estimator of the conditional probability of binary labels for a given input. [sent-46, score-0.386]
12 The sign of the estimated decision function is used for the label prediction, and the magnitude of the decision function is connected to the conditional probability via the loss function. [sent-47, score-0.527]
13 For example, the logistic loss and exponential loss produce logistic models, whereas the hinge loss cannot be used to estimate the conditional probability except the probability 0. [sent-50, score-0.636]
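The connection mentioned here can be made concrete with the well-known population minimizers (standard results, e.g., Friedman et al. for the exponential loss; not reproduced from this paper). Writing p(x) = P(y = +1 | x),

\[
f^{*}_{\mathrm{logistic}}(x) = \log\frac{p(x)}{1-p(x)}, \qquad
f^{*}_{\exp}(x) = \frac{1}{2}\log\frac{p(x)}{1-p(x)},
\]

so p(x) can be recovered as 1/(1+e^{-f^*(x)}) or 1/(1+e^{-2 f^*(x)}), respectively, whereas the hinge-loss minimizer is sign(2p(x) − 1), which retains only the side of 1/2 on which p(x) lies.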
14 There have been studies on the statistical properties of learning with uncertainty sets. [sent-68, score-0.351]
15 …(2003) proposed the minimax probability machine (MPM) using ellipsoidal uncertainty sets and studied its statistical properties in the worst-case setting. [sent-70, score-0.431]
16 In statistical learning using uncertainty sets, the main concern is to develop optimization algorithms under the maximum margin criterion (Mavroforakis and Theodoridis, 2006). [sent-71, score-0.409]
17 So far, however, the statistical properties of learning with uncertainty sets have not been studied as much as those of learning with loss functions. [sent-72, score-0.515]
18 The main purpose of this paper is to study the relation between the loss function approach and uncertainty set approach, and to use the relation to transform learning with uncertainty sets into loss-based learning in order to clarify the statistical properties of learning algorithms. [sent-73, score-0.913]
19 As a result, we can establish a correspondence between uncertainty sets and statistical models of conditional probabilities. [sent-75, score-0.401]
20 Note that some of the existing learning methods using uncertainty sets do not necessarily have good statistical properties, such as statistical consistency. [sent-76, score-0.386]
21 We propose a way of revising uncertainty sets to establish statistical consistency. [sent-77, score-0.407]
22 Figure 1 shows how uncertainty sets, loss functions and statistical models are related. [sent-78, score-0.515]
23 Starting from a learning algorithm with uncertainty sets, we obtain the corresponding loss function and statistical model via the convex conjugate. [sent-79, score-0.583]
24 By revising uncertainty sets, we can obtain the corresponding loss functions and statistical models. [sent-81, score-0.571]
25 We think that our method of revising uncertainty sets can bridge the gap between intuitive statistical modeling and the nice statistical properties of learning algorithms. [sent-84, score-0.442]
26 Section 2 reviews the existing learning methods using loss functions and uncertainty sets. [sent-86, score-0.48]
27 We describe the relation between the loss function and the uncertainty set in ν-SVM. [sent-87, score-0.521]
28 Section 3 is an investigation of the general relation between loss functions and uncertainty sets. [sent-89, score-0.521]
29 Figure 1: Relations among uncertainty sets, loss functions and statistical models. [sent-97, score-0.515]
30 1, we derive uncertainty sets from loss functions by using the convex conjugate of loss functions. [sent-99, score-0.79]
31 2 shows how to revise uncertainty sets in order to obtain loss functions from them. [sent-103, score-0.48]
32 …how to revise the uncertainty set so that it will have good statistical properties. [sent-106, score-0.407]
33 Both the hinge loss and the exponential loss are convex in f, and they provide an upper bound of the 0-1 loss. [sent-161, score-0.43]
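In symbols (writing the 0-1 loss as an indicator of a nonpositive margin; the paper may use a strict inequality instead):

\[
\mathbf{1}\,[\,y f(x) \le 0\,] \;\le\; \max\{1 - y f(x),\, 0\}
\qquad\text{and}\qquad
\mathbf{1}\,[\,y f(x) \le 0\,] \;\le\; e^{-y f(x)}.
\]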
34 The quantitative relation between the 0-1 loss and the surrogate loss was studied by Bartlett et al. [sent-163, score-0.42]
35 (1998) and Bartlett and Tewari (2007), surrogate loss functions provide statistical models for the conditional probability of a label y for a given x, that is, P(y|x). [sent-176, score-0.373]
36 …classification problems, that is, statistical learning based on the uncertainty set. [sent-193, score-0.351]
37 Each uncertainty set is assumed to include the mean vector of the distribution of input point x conditioned on each label (Takeda et al. [sent-205, score-0.359]
38 An ellipsoidal uncertainty set is also used for the robust classification… [sent-214, score-0.366]
39 Figure 2: Decision boundary estimated by solving the minimum distance problem with the uncertainty sets U_p and U_n. [sent-217, score-0.356]
40 We use the uncertainty set to estimate the linear decision function f(x) = w^T x + b. [sent-218, score-0.402]
41 (2003), but the ellipsoidal uncertainty set also plays an important role in MPM. [sent-232, score-0.366]
42 When the bias term b in the linear decision function is estimated such that the decision boundary bisects the line segment connecting x*_p and x*_n, the estimated decision boundary will have the maximum margin between the uncertainty sets, U_p and U_n. [sent-234, score-0.626]
43 …-SVM. Here, we will describe how the loss function approach and the uncertainty set approach are related to each other in ν-SVM. [sent-239, score-0.48]
44 …ν-SVM is minimized over training samples such that y_i(w^T x_i + b) < ρ, and training samples having a large margin, that is, y_i(w^T x_i + b) ≥ ρ, do not contribute to the loss function max{ρ − y_i(w^T x_i + b), 0}. [sent-265, score-0.669]
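For orientation, one common way to write the ν-SVM primal problem (Schölkopf et al., 2000) is shown below; the paper's problem (3) may differ in constants or in additional constraints such as ρ ≥ 0:

\[
\min_{w,\,b,\,\rho}\;\; \frac{1}{2}\|w\|^{2} \;-\; \nu\rho \;+\; \frac{1}{m}\sum_{i=1}^{m} \max\bigl\{\rho - y_i\bigl(w^{\top} x_i + b\bigr),\, 0\bigr\}.
\]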
45 We briefly show that the dual problem of (3) yields a minimum distance problem in which the reduced convex hulls of training samples are used as uncertainty sets (see Bennett and Bredensteiner 2000 for details). [sent-285, score-0.39]
46 Hence, solving problem (7) is identical to solving the minimum distance problem with the uncertainty set of reduced convex hulls: inf_{x_p, x_n} ‖x_p − x_n‖ subject to x_p ∈ U_p, x_n ∈ U_n. [sent-321, score-0.509]
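A minimal computational sketch of this minimum distance problem between reduced convex hulls (using cvxpy; the function name and the cap parameter are ours, and the exact cap implied by ν-SVM depends on ν and the class sizes):

```python
import numpy as np
import cvxpy as cp

def min_distance_reduced_hulls(Xp, Xn, cap=0.5):
    """Nearest points of the reduced convex hulls of two sample sets.

    Xp, Xn: arrays of shape (m_p, d) and (m_n, d) holding the positive and
    negative training inputs; cap bounds each convex weight, which is what
    "reduces" the hulls.
    """
    ap = cp.Variable(Xp.shape[0], nonneg=True)   # convex weights, positive hull
    an = cp.Variable(Xn.shape[0], nonneg=True)   # convex weights, negative hull
    constraints = [cp.sum(ap) == 1, cp.sum(an) == 1, ap <= cap, an <= cap]
    objective = cp.Minimize(cp.norm(Xp.T @ ap - Xn.T @ an))
    cp.Problem(objective, constraints).solve()

    x_p = Xp.T @ ap.value            # nearest point in the reduced positive hull
    x_n = Xn.T @ an.value            # nearest point in the reduced negative hull
    w = x_p - x_n                    # normal of the separating hyperplane
    b = -w @ (x_p + x_n) / 2.0       # boundary bisects the connecting segment
    return w, b
```

The returned (w, b) give a linear decision function f(x) = w·x + b whose boundary bisects the segment between the two nearest points, as described around Figure 2.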
47 …ν-SVM, with which we can investigate the relation between loss functions and uncertainty sets. [sent-334, score-0.521]
48 We can derive the uncertainty set associated with the loss function ℓ in (9) in a similar way to what was done with ν-SVM. [sent-365, score-0.48]
49 Accordingly, the optimization problem in (10) can be represented as inf_{c_p, c_n, z_p, z_n} c_p + c_n + λ‖z_p − z_n‖ subject to z_p ∈ U_p[c_p], z_n ∈ U_n[c_n], c_p, c_n ∈ R. [sent-388, score-0.455]
50 Some calculation yields that w = λ(z_p − z_n)/‖z_p − z_n‖ holds for z_p ≠ z_n, and for z_p = z_n any vector such that ‖w‖^2 ≤ λ^2 satisfies… [sent-392, score-0.356]
51 The shape of uncertainty sets and the max-margin criterion respectively correspond to the loss function and the regularization principle. [sent-394, score-0.509]
52 Now let us show some examples of uncertainty sets (11) associated with popular loss functions. [sent-396, score-0.48]
53 For c ≥ 0, the uncertainty set consists of the reduced convex hull of the training samples, and it does not depend on the parameter c. [sent-416, score-0.424]
54 Suppose that Σ_o is invertible. [sent-426, score-0.467]
55 For α_o = (α_i)_{i∈M_o} satisfying the constraints, we get z = ∑_{i∈M_o} α_i x_i = (X − x̄_o 1^T) α_o + x̄_o, where 1 = (1, …, 1)^T. [sent-433, score-0.453]
56 …Statistical Models Associated with Uncertainty Sets. The extended minimum distance problem (12) with the parametrized uncertainty set (11) corresponds to the loss function in (9). [sent-446, score-0.514]
57 1 derived parametrized uncertainty sets associated with convex loss functions. [sent-464, score-0.582]
58 Conversely, if an uncertainty set is represented in the form of (11), a corresponding loss function exists. [sent-465, score-0.48]
59 However, if the uncertainty set does not have the form of (11), the corresponding loss function does not exist. [sent-467, score-0.48]
60 One way to deal with the drawback is to revise the uncertainty set so that it possesses a corresponding loss function. [sent-468, score-0.48]
61 In Example 2, the second expression of the uncertainty set involves the convex function h*(z) = (z − x̄_o)^T Σ_o^{-1} (z − x̄_o). [sent-483, score-0.776]
62 …Revised Uncertainty Sets and Corresponding Loss Functions. The uncertainty sets can be revised such that the primal form (17) is represented as minimization of the empirical mean of a loss function. [sent-510, score-0.51]
63 In the large sample limit, h*((1/m_o) ∑_{i∈M_o} x_i) is approximated by h*(µ_o). [sent-574, score-0.528]
64 The expanded minimum distance problem using the revised uncertainty sets Ū_p[c] and Ū_n[c] is min_{c_p, c_n, z_p, z_n} c_p + c_n + λ‖z_p − z_n‖ subject to z_p ∈ Ū_p[c_p], z_n ∈ Ū_n[c_n]. [sent-577, score-0.605]
65 The revision of uncertainty sets leads to the empirical mean of the revised loss function ℓ̄. [sent-582, score-0.6]
66 Now let us show some examples to illustrate how revision of uncertainty sets works. [sent-584, score-0.436]
67 For o ∈ {p, n}, let x̄_o and Σ_o be the empirical mean and the empirical covariance matrix, x̄_o = (1/m_o) ∑_{i∈M_o} x_i, Σ_o = (1/m_o) ∑_{i∈M_o} (x_i − x̄_o)(x_i − x̄_o)^T. [sent-593, score-1.312]
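A small numpy sketch of these per-class statistics (function and variable names are ours, not the paper's):

```python
import numpy as np

def class_statistics(X, y, label):
    """Empirical mean and covariance of the inputs carrying the given label."""
    Xo = X[y == label]                    # the samples indexed by M_o
    m_o = Xo.shape[0]
    mean_o = Xo.mean(axis=0)              # empirical mean x̄_o
    centered = Xo - mean_o
    cov_o = centered.T @ centered / m_o   # empirical covariance, normalized by m_o
    return mean_o, cov_o
```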
68 If Σ_o is invertible, we have Ū_o[c] = { z ∈ conv{x_i : i ∈ M_o} : (z − x̄_o)^T Σ_o^{-1} (z − x̄_o) ≤ cm/m_o }. [sent-594, score-0.859]
69 Figure 3 illustrates an example of the revision of the uncertainty set. [sent-612, score-0.436]
70 In the left panel, the uncertainty set does not match the distribution of the training samples. [sent-613, score-0.356]
71 On the other hand, the revised uncertainty set in the right panel well approximates the dispersal of the training samples. [sent-614, score-0.356]
72 The above uncertainty set is useful when the probability in the training phase is slightly different from that in the test phase. [sent-623, score-0.386]
73 …fication reduces from quadratic to linear around the decision boundary, though the original uncertainty set U_o[c] does not correspond to minimization of an empirical loss function. [sent-649, score-0.566]
74 Note that the revision of uncertainty sets presented in Section 4 can be used, if necessary. [sent-661, score-0.436]
75 Example 7 (ellipsoidal uncertainty sets in RKHS) Let us consider an uncertainty set U[c] in RKHS H defined… [sent-669, score-0.632]
76 This is the kernel variant of the ellipsoidal uncertainty set in Example 2. [sent-686, score-0.366]
77 In addition, we can verify the statistical consistency of the learning algorithm with the (revised) uncertainty sets by taking the corresponding loss function into account. [sent-689, score-0.515]
78 Appendix B presents a rigorous proof of the duality between (27) and (25) with the uncertainty set (26). [sent-727, score-0.354]
79 …3 show that some popular uncertainty sets and their revisions yield loss functions satisfying the above sufficient condition. [sent-748, score-0.48]
80 The hinge loss and logistic loss do not satisfy this assumption, whereas the quadratic loss and exponential loss satisfy it. [sent-771, score-0.69]
81 As a result, it is shown that the existence of ρ̂ is guaranteed for the truncated quadratic loss, the exponential loss and the loss function derived from the uncertainty set with the estimation error in Example 6; see the examples provided in Appendix C. [sent-810, score-0.644]
82 Experiments. We conducted some numerical experiments to examine the prediction performance of our method of revising uncertainty sets. [sent-813, score-0.436]
83 This loss function corresponds to the revised version of the ellipsoidal uncertainty set with the estimation error. [sent-827, score-0.846]
84 Figure 5 presents the squared loss and absolute loss of the estimated conditional probability versus the size of the training samples for C-SVM and the proposed method with h = 0 and h = 1. [sent-899, score-0.522]
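As a simple illustration of the two evaluation measures used here (a sketch; the paper's exact evaluation protocol, e.g., how the true P(y|x) is obtained, is not reproduced):

```python
import numpy as np

def probability_losses(p_hat, p_true):
    """Squared and absolute loss between estimated and true P(y = 1 | x)."""
    p_hat = np.asarray(p_hat, dtype=float)
    p_true = np.asarray(p_true, dtype=float)
    squared = np.mean((p_hat - p_true) ** 2)
    absolute = np.mean(np.abs(p_hat - p_true))
    return squared, absolute
```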
85 Table 2: Squared loss and absolute loss of the estimated conditional probability P(y|x). [sent-944, score-0.448]
86 The uncertainty set of MPM and unbiased MPM is an ellipsoid defined… [sent-953, score-0.36]
87 The revision of the ellipsoidal uncertainty set leads to the uncertainty set of our algorithm. [sent-958, score-0.802]
88 Hence, the revision of uncertainty sets can improve the prediction accuracy of uncertainty-set-based learning. [sent-961, score-0.436]
89 Conclusion. We studied the relation between the loss function approach and the uncertainty set approach in binary classification… [sent-1001, score-0.521]
90 Given a loss function, there exists a corresponding parametrized uncertainty set. [sent-1004, score-0.514]
91 In general, however, the uncertainty set does not correspond to the empirical loss function. [sent-1005, score-0.48]
92 We presented a way of revising the uncertainty set so that it will correspond to an empirical loss function. [sent-1006, score-0.536]
93 We compared C-SVM, MPM, unbiased MPM, and the proposed method with loss functions (24) having h = 0 and h = 1: boldface letters indicate that the average squared loss is the smallest. [sent-1149, score-0.372]
94 Table 5: Squared loss ×100 of estimated conditional probability for C-SVM and the proposed method with loss functions (24) having h = 0 and h = 1: boldface letters indicate that the average test error is the smallest. [sent-1229, score-0.448]
95 We are currently investigating the possibility of relaxing the assumptions so as to include the hinge loss and other popular loss functions such as the logistic loss. [sent-1231, score-0.362]
96 The relation between the loss function approach and the uncertainty set approach is a useful tool for statistical modeling. [sent-1237, score-0.556]
97 We believe that learning algorithms with revised uncertainty sets can bridge the gap between intuitive statistical modeling and nice statistical properties. [sent-1239, score-0.386]
98 …, (x_m, y_m)}, the empirical loss R_T(f, ρ) and the regularized empirical loss R_{T,λ}(f, ρ) are defined… [sent-1328, score-0.356]
99 Figure 6: The derivative of the loss function corresponding to the revised uncertainty set with the estimation error. [sent-1642, score-0.509]
100 Therefore, the loss function corresponding to the revised uncertainty set in Example 6 satisfies… [sent-1650, score-0.48]
wordName wordTfidf (topN-words)
[('mo', 0.467), ('uncertainty', 0.316), ('mpm', 0.261), ('takeda', 0.239), ('anamori', 0.205), ('uzuki', 0.205), ('elation', 0.196), ('onjugate', 0.196), ('xo', 0.196), ('loss', 0.164), ('uo', 0.144), ('roblems', 0.13), ('revision', 0.12), ('lassification', 0.116), ('mn', 0.109), ('wt', 0.108), ('rt', 0.091), ('inf', 0.09), ('zn', 0.089), ('un', 0.086), ('conv', 0.086), ('decision', 0.086), ('conjugate', 0.078), ('cn', 0.076), ('mp', 0.07), ('convex', 0.068), ('lo', 0.066), ('gm', 0.066), ('bartlett', 0.063), ('xi', 0.061), ('margin', 0.058), ('yi', 0.058), ('bredensteiner', 0.056), ('mptt', 0.056), ('revising', 0.056), ('surrogate', 0.051), ('steinwart', 0.051), ('ellipsoidal', 0.05), ('conditional', 0.05), ('subdifferential', 0.046), ('unbiased', 0.044), ('bennett', 0.043), ('fm', 0.043), ('label', 0.043), ('suppose', 0.042), ('lanckriet', 0.042), ('relation', 0.041), ('sup', 0.04), ('rkhs', 0.04), ('training', 0.04), ('banana', 0.04), ('nath', 0.04), ('ringnorm', 0.04), ('twonorm', 0.04), ('estimated', 0.04), ('duality', 0.038), ('diabetis', 0.037), ('nfuipe', 0.037), ('qspqptfe', 0.037), ('rmo', 0.037), ('bhattacharyya', 0.036), ('statistical', 0.035), ('subject', 0.035), ('classi', 0.035), ('samples', 0.034), ('parametrized', 0.034), ('hinge', 0.034), ('ln', 0.033), ('assumption', 0.032), ('crisp', 0.032), ('rockafellar', 0.032), ('cm', 0.032), ('primal', 0.03), ('probability', 0.03), ('lm', 0.029), ('covering', 0.029), ('derivative', 0.029), ('thyroid', 0.029), ('titanic', 0.029), ('regularization', 0.029), ('lkopf', 0.028), ('sign', 0.028), ('ym', 0.028), ('karatzoglou', 0.028), ('limz', 0.028), ('mavroforakis', 0.028), ('nagoya', 0.028), ('taiji', 0.028), ('zup', 0.028), ('converges', 0.028), ('estimator', 0.028), ('inequality', 0.027), ('lemma', 0.027), ('bayes', 0.027), ('margins', 0.027), ('splice', 0.027), ('assures', 0.027), ('bm', 0.026), ('condition', 0.026), ('risk', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
Author: Takafumi Kanamori, Akiko Takeda, Taiji Suzuki
2 0.12310771 114 jmlr-2013-The Rate of Convergence of AdaBoost
Author: Indraneel Mukherjee, Cynthia Rudin, Robert E. Schapire
Abstract: The AdaBoost algorithm was designed to combine many “weak” hypotheses that perform slightly better than random guessing into a “strong” hypothesis that has very low error. We study the rate at which AdaBoost iteratively converges to the minimum of the “exponential loss.” Unlike previous work, our proofs do not require a weak-learning assumption, nor do they require that minimizers of the exponential loss are finite. Our first result shows that the exponential loss of AdaBoost’s computed parameter vector will be at most ε more than that of any parameter vector of ℓ1 -norm bounded by B in a number of rounds that is at most a polynomial in B and 1/ε. We also provide lower bounds showing that a polynomial dependence is necessary. Our second result is that within C/ε iterations, AdaBoost achieves a value of the exponential loss that is at most ε more than the best possible value, where C depends on the data set. We show that this dependence of the rate on ε is optimal up to constant factors, that is, at least Ω(1/ε) rounds are necessary to achieve within ε of the optimal exponential loss. Keywords: AdaBoost, optimization, coordinate descent, convergence rate
3 0.077027813 68 jmlr-2013-Machine Learning with Operational Costs
Author: Theja Tulabandhula, Cynthia Rudin
Abstract: This work proposes a way to align statistical modeling with decision making. We provide a method that propagates the uncertainty in predictive modeling to the uncertainty in operational cost, where operational cost is the amount spent by the practitioner in solving the problem. The method allows us to explore the range of operational costs associated with the set of reasonable statistical models, so as to provide a useful way for practitioners to understand uncertainty. To do this, the operational cost is cast as a regularization term in a learning algorithm’s objective function, allowing either an optimistic or pessimistic view of possible costs, depending on the regularization parameter. From another perspective, if we have prior knowledge about the operational cost, for instance that it should be low, this knowledge can help to restrict the hypothesis space, and can help with generalization. We provide a theoretical generalization bound for this scenario. We also show that learning with operational costs is related to robust optimization. Keywords: statistical learning theory, optimization, covering numbers, decision theory
4 0.073945895 79 jmlr-2013-On the Mutual Nearest Neighbors Estimate in Regression
Author: Arnaud Guyader, Nick Hengartner
Abstract: Motivated by promising experimental results, this paper investigates the theoretical properties of a recently proposed nonparametric estimator, called the Mutual Nearest Neighbors rule, which estimates the regression function m(x) = E[Y |X = x] as follows: first identify the k nearest neighbors of x in the sample Dn , then keep only those for which x is itself one of the k nearest neighbors, and finally take the average over the corresponding response variables. We prove that this estimator is consistent and that its rate of convergence is optimal. Since the estimate with the optimal rate of convergence depends on the unknown distribution of the observations, we also present adaptation results by data-splitting. Keywords: nonparametric estimation, nearest neighbor methods, mathematical statistics
5 0.061164327 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning
Author: Sivan Sabato, Nathan Srebro, Naftali Tishby
Abstract: We obtain a tight distribution-specific characterization of the sample complexity of large-margin classification with L2 regularization: We introduce the margin-adapted dimension, which is a simple function of the second order statistics of the data distribution, and show distribution-specific upper and lower bounds on the sample complexity, both governed by the margin-adapted dimension of the data distribution. The upper bounds are universal, and the lower bounds hold for the rich family of sub-Gaussian distributions with independent features. We conclude that this new quantity tightly characterizes the true sample complexity of large-margin classification. To prove the lower bound, we develop several new tools of independent interest. These include new connections between shattering and hardness of learning, new properties of shattering with linear classifiers, and a new lower bound on the smallest eigenvalue of a random Gram matrix generated by sub-Gaussian variables. Our results can be used to quantitatively compare large margin learning to other learning rules, and to improve the effectiveness of methods that use sample complexity bounds, such as active learning. Keywords: supervised learning, sample complexity, linear classifiers, distribution-dependence
6 0.06079771 120 jmlr-2013-Variational Algorithms for Marginal MAP
7 0.054254603 77 jmlr-2013-On the Convergence of Maximum Variance Unfolding
8 0.054241735 8 jmlr-2013-A Theory of Multiclass Boosting
9 0.053278219 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification
10 0.05095382 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
11 0.050939444 104 jmlr-2013-Sparse Single-Index Model
12 0.050635051 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres
13 0.050310966 31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting
14 0.046561062 117 jmlr-2013-Universal Consistency of Localized Versions of Regularized Kernel Methods
15 0.046427511 76 jmlr-2013-Nonparametric Sparsity and Regularization
16 0.045685597 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
17 0.044714611 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference
18 0.044116296 70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach
19 0.042932693 73 jmlr-2013-Multicategory Large-Margin Unified Machines
20 0.041719265 18 jmlr-2013-Beyond Fano's Inequality: Bounds on the Optimal F-Score, BER, and Cost-Sensitive Risk and Their Implications
topicId topicWeight
[(0, -0.22), (1, 0.066), (2, 0.072), (3, 0.098), (4, -0.015), (5, -0.108), (6, 0.062), (7, -0.006), (8, 0.024), (9, -0.155), (10, 0.0), (11, 0.066), (12, 0.147), (13, -0.056), (14, -0.13), (15, -0.011), (16, -0.056), (17, 0.027), (18, -0.041), (19, 0.112), (20, -0.009), (21, 0.009), (22, 0.025), (23, 0.065), (24, 0.03), (25, 0.1), (26, -0.05), (27, -0.041), (28, -0.1), (29, -0.137), (30, 0.006), (31, -0.162), (32, -0.015), (33, -0.007), (34, -0.038), (35, 0.042), (36, -0.01), (37, -0.053), (38, -0.025), (39, 0.141), (40, -0.132), (41, 0.124), (42, 0.012), (43, 0.098), (44, 0.043), (45, 0.09), (46, 0.224), (47, -0.196), (48, -0.133), (49, -0.111)]
simIndex simValue paperId paperTitle
same-paper 1 0.93894356 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
Author: Takafumi Kanamori, Akiko Takeda, Taiji Suzuki
2 0.58152759 68 jmlr-2013-Machine Learning with Operational Costs
Author: Theja Tulabandhula, Cynthia Rudin
Abstract: This work proposes a way to align statistical modeling with decision making. We provide a method that propagates the uncertainty in predictive modeling to the uncertainty in operational cost, where operational cost is the amount spent by the practitioner in solving the problem. The method allows us to explore the range of operational costs associated with the set of reasonable statistical models, so as to provide a useful way for practitioners to understand uncertainty. To do this, the operational cost is cast as a regularization term in a learning algorithm’s objective function, allowing either an optimistic or pessimistic view of possible costs, depending on the regularization parameter. From another perspective, if we have prior knowledge about the operational cost, for instance that it should be low, this knowledge can help to restrict the hypothesis space, and can help with generalization. We provide a theoretical generalization bound for this scenario. We also show that learning with operational costs is related to robust optimization. Keywords: statistical learning theory, optimization, covering numbers, decision theory
3 0.53890723 114 jmlr-2013-The Rate of Convergence of AdaBoost
Author: Indraneel Mukherjee, Cynthia Rudin, Robert E. Schapire
Abstract: The AdaBoost algorithm was designed to combine many “weak” hypotheses that perform slightly better than random guessing into a “strong” hypothesis that has very low error. We study the rate at which AdaBoost iteratively converges to the minimum of the “exponential loss.” Unlike previous work, our proofs do not require a weak-learning assumption, nor do they require that minimizers of the exponential loss are finite. Our first result shows that the exponential loss of AdaBoost’s computed parameter vector will be at most ε more than that of any parameter vector of ℓ1 -norm bounded by B in a number of rounds that is at most a polynomial in B and 1/ε. We also provide lower bounds showing that a polynomial dependence is necessary. Our second result is that within C/ε iterations, AdaBoost achieves a value of the exponential loss that is at most ε more than the best possible value, where C depends on the data set. We show that this dependence of the rate on ε is optimal up to constant factors, that is, at least Ω(1/ε) rounds are necessary to achieve within ε of the optimal exponential loss. Keywords: AdaBoost, optimization, coordinate descent, convergence rate
4 0.52022588 73 jmlr-2013-Multicategory Large-Margin Unified Machines
Author: Chong Zhang, Yufeng Liu
Abstract: Hard and soft classifiers are two important groups of techniques for classification problems. Logistic regression and Support Vector Machines are typical examples of soft and hard classifiers respectively. The essential difference between these two groups is whether one needs to estimate the class conditional probability for the classification task or not. In particular, soft classifiers predict the label based on the obtained class conditional probabilities, while hard classifiers bypass the estimation of probabilities and focus on the decision boundary. In practice, for the goal of accurate classification, it is unclear which one to use in a given situation. To tackle this problem, the Largemargin Unified Machine (LUM) was recently proposed as a unified family to embrace both groups. The LUM family enables one to study the behavior change from soft to hard binary classifiers. For multicategory cases, however, the concept of soft and hard classification becomes less clear. In that case, class probability estimation becomes more involved as it requires estimation of a probability vector. In this paper, we propose a new Multicategory LUM (MLUM) framework to investigate the behavior of soft versus hard classification under multicategory settings. Our theoretical and numerical results help to shed some light on the nature of multicategory classification and its transition behavior from soft to hard classifiers. The numerical results suggest that the proposed tuned MLUM yields very competitive performance. Keywords: hard classification, large-margin, soft classification, support vector machine
5 0.45277441 77 jmlr-2013-On the Convergence of Maximum Variance Unfolding
Author: Ery Arias-Castro, Bruno Pelletier
Abstract: Maximum Variance Unfolding is one of the main methods for (nonlinear) dimensionality reduction. We study its large sample limit, providing specific rates of convergence under standard assumptions. We find that it is consistent when the underlying submanifold is isometric to a convex subset, and we provide some simple examples where it fails to be consistent. Keywords: maximum variance unfolding, isometric embedding, U-processes, empirical processes, proximity graphs.
6 0.44256449 79 jmlr-2013-On the Mutual Nearest Neighbors Estimate in Regression
7 0.40481493 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification
8 0.38762352 31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting
9 0.37673199 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning
10 0.36416367 104 jmlr-2013-Sparse Single-Index Model
11 0.36398432 8 jmlr-2013-A Theory of Multiclass Boosting
12 0.36103454 120 jmlr-2013-Variational Algorithms for Marginal MAP
13 0.35184383 18 jmlr-2013-Beyond Fano's Inequality: Bounds on the Optimal F-Score, BER, and Cost-Sensitive Risk and Their Implications
14 0.34642726 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
15 0.34175804 107 jmlr-2013-Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization
16 0.33615947 117 jmlr-2013-Universal Consistency of Localized Versions of Regularized Kernel Methods
17 0.31149396 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres
18 0.3083905 70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach
19 0.30551463 76 jmlr-2013-Nonparametric Sparsity and Regularization
20 0.28438151 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
topicId topicWeight
[(0, 0.014), (5, 0.163), (6, 0.024), (10, 0.079), (20, 0.011), (23, 0.041), (41, 0.014), (68, 0.037), (70, 0.041), (75, 0.043), (85, 0.07), (87, 0.018), (93, 0.019), (97, 0.317)]
simIndex simValue paperId paperTitle
same-paper 1 0.73014295 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
Author: Takafumi Kanamori, Akiko Takeda, Taiji Suzuki
2 0.53702712 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
Author: Yuchen Zhang, John C. Duchi, Martin J. Wainwright
Abstract: We analyze two communication-efficient algorithms for distributed optimization in statistical settings involving large-scale data sets. The first algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves √ mean-squared error (MSE) that decays as O (N −1 + (N/m)−2 ). Whenever m ≤ N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as O (N −1 + (N/m)−3 ), and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as O (N −1 + (N/m)−3/2 ), easing computation at the expense of a potentially slower MSE rate. We also provide an experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with N ≈ 2.4 × 108 samples and d ≈ 740,000 covariates. Keywords: distributed learning, stochastic optimization, averaging, subsampling
3 0.53286928 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
Author: Sébastien Gerchinovitz
Abstract: We consider the problem of online linear regression on arbitrary deterministic sequences when the ambient dimension d can be much larger than the number of time rounds T . We introduce the notion of sparsity regret bound, which is a deterministic online counterpart of recent risk bounds derived in the stochastic setting under a sparsity scenario. We prove such regret bounds for an online-learning algorithm called SeqSEW and based on exponential weighting and data-driven truncation. In a second part we apply a parameter-free version of this algorithm to the stochastic setting (regression model with random design). This yields risk bounds of the same flavor as in Dalalyan and Tsybakov (2012a) but which solve two questions left open therein. In particular our risk bounds are adaptive (up to a logarithmic factor) to the unknown variance of the noise if the latter is Gaussian. We also address the regression model with fixed design. Keywords: sparsity, online linear regression, individual sequences, adaptive regret bounds
4 0.53044176 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
Author: Tony Cai, Wen-Xin Zhou
Abstract: We consider in this paper the problem of noisy 1-bit matrix completion under a general non-uniform sampling distribution using the max-norm as a convex relaxation for the rank. A max-norm constrained maximum likelihood estimate is introduced and studied. The rate of convergence for the estimate is obtained. Information-theoretical methods are used to establish a minimax lower bound under the general sampling model. The minimax upper and lower bounds together yield the optimal rate of convergence for the Frobenius norm loss. Computational algorithms and numerical performance are also discussed. Keywords: 1-bit matrix completion, low-rank matrix, max-norm, trace-norm, constrained optimization, maximum likelihood estimate, optimal rate of convergence
5 0.52883512 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach
Author: Alon Gonen, Sivan Sabato, Shai Shalev-Shwartz
Abstract: We study pool-based active learning of half-spaces. We revisit the aggressive approach for active learning in the realizable case, and show that it can be made efficient and practical, while also having theoretical guarantees under reasonable assumptions. We further show, both theoretically and experimentally, that it can be preferable to mellow approaches. Our efficient aggressive active learner of half-spaces has formal approximation guarantees that hold when the pool is separable with a margin. While our analysis is focused on the realizable setting, we show that a simple heuristic allows using the same algorithm successfully for pools with low error as well. We further compare the aggressive approach to the mellow approach, and prove that there are cases in which the aggressive approach results in significantly better label complexity compared to the mellow approach. We demonstrate experimentally that substantial improvements in label complexity can be achieved using the aggressive approach, for both realizable and low-error settings. Keywords: active learning, linear classifiers, margin, adaptive sub-modularity
6 0.52545184 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
7 0.52400804 73 jmlr-2013-Multicategory Large-Margin Unified Machines
9 0.52145463 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
10 0.5162254 14 jmlr-2013-Asymptotic Results on Adaptive False Discovery Rate Controlling Procedures Based on Kernel Estimators
11 0.51576102 114 jmlr-2013-The Rate of Convergence of AdaBoost
12 0.51546043 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion
13 0.51496923 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning
15 0.51424694 76 jmlr-2013-Nonparametric Sparsity and Regularization
16 0.51387197 68 jmlr-2013-Machine Learning with Operational Costs
17 0.51362354 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning
18 0.51179796 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees
19 0.51147735 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation