jmlr jmlr2013 jmlr2013-73 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Chong Zhang, Yufeng Liu
Abstract: Hard and soft classifiers are two important groups of techniques for classification problems. Logistic regression and Support Vector Machines are typical examples of soft and hard classifiers, respectively. The essential difference between these two groups is whether one needs to estimate the class conditional probability for the classification task or not. In particular, soft classifiers predict the label based on the obtained class conditional probabilities, while hard classifiers bypass the estimation of probabilities and focus on the decision boundary. In practice, for the goal of accurate classification, it is unclear which one to use in a given situation. To tackle this problem, the Large-margin Unified Machine (LUM) was recently proposed as a unified family to embrace both groups. The LUM family enables one to study the behavior change from soft to hard binary classifiers. For multicategory cases, however, the concept of soft and hard classification becomes less clear. In that case, class probability estimation becomes more involved as it requires estimation of a probability vector. In this paper, we propose a new Multicategory LUM (MLUM) framework to investigate the behavior of soft versus hard classification under multicategory settings. Our theoretical and numerical results help to shed some light on the nature of multicategory classification and its transition behavior from soft to hard classifiers. The numerical results suggest that the proposed tuned MLUM yields very competitive performance. Keywords: hard classification, large-margin, soft classification, support vector machine
Reference: text
sentIndex sentText sentNum sentScore
1 Logistic regression and Support Vector Machines are typical examples of soft and hard classifiers respectively. [sent-6, score-0.29]
2 In particular, soft classifiers predict the label based on the obtained class conditional probabilities, while hard classifiers bypass the estimation of probabilities and focus on the decision boundary. [sent-8, score-0.333]
3 The LUM family enables one to study the behavior change from soft to hard binary classifiers. [sent-11, score-0.344]
4 For multicategory cases, however, the concept of soft and hard classification becomes less clear. [sent-12, score-0.588]
5 In this paper, we propose a new Multicategory LUM (MLUM) framework to investigate the behavior of soft versus hard classification under multicategory settings. [sent-14, score-0.588]
6 Our theoretical and numerical results help to shed some light on the nature of multicategory classification and its transition behavior from soft to hard classifiers. [sent-15, score-0.606]
7 Keywords: hard classification, large-margin, soft classification, support vector machine [sent-17, score-0.276]
8 Wahba (2002) discussed the concept of soft versus hard classification. [sent-25, score-0.276]
9 For a given problem, the choice between hard and soft classifiers can be difficult. [sent-34, score-0.276]
10 The LUM family is a rich group of classifiers in the sense that it connects hard and soft classifiers in one spectrum. [sent-37, score-0.313]
11 It provides a natural platform for comparisons between soft and hard classifiers. [sent-38, score-0.276]
12 More importantly, it enables us to observe the performance transition from soft to hard classification. [sent-39, score-0.276]
13 Furthermore, multicategory consistency is much more involved, especially for hard classifiers. [sent-43, score-0.417]
14 For instance, there are a lot of developments on multicategory SVMs in the literature (Vapnik, 1998; Weston and Watkins, 1999; Crammer et al. [sent-44, score-0.312]
15 Recently, Liu and Yuan (2011) proposed a group of consistent multicategory piecewise linear hinge loss functions, namely a family of reinforced hinge loss functions, which covers the loss by Lee et al. [sent-49, score-0.618]
16 For probability estimation, there are several existing multicategory soft classifiers, such as Adaboost in Boosting (Freund and Schapire, 1997; Zou et al. [sent-51, score-0.525]
17 , 2000), proximal SVMs (Tang and Zhang, 2006), and multicategory composite least squares classifiers (Park et al. [sent-54, score-0.312]
18 It helps to shed some light on the choice between multicategory soft and hard classifiers, and provide some insights on the behavior change from soft to hard classification methods. [sent-58, score-0.882]
19 (2006) to the multicategory case and study its convergence rate. [sent-61, score-0.312]
20 A MLUM member, in-between hard and soft classifiers, tends to work the best. [sent-68, score-0.276]
21 Methodology In this section, we first introduce the background of binary classification, then discuss different ways of generalization to multicategory problems. [sent-82, score-0.343]
22 The notion of soft and hard classification is first reviewed in the binary classification context. [sent-83, score-0.307]
23 Then we propose a MLUM framework which helps us to understand soft versus hard classification in the multicategory setting. [sent-84, score-0.588]
24 Different methods can be roughly grouped into two categories, namely, soft and hard classifiers. [sent-111, score-0.276]
25 The parameter c is an index of soft versus hard classifiers. [sent-121, score-0.276]
26 In particular, c = 0 corresponds to a typical soft classifier, and c → ∞ corresponds to the SVM, a typical hard classifier. [sent-122, score-0.276]
27 Consequently, the LUM connects soft and hard classifiers as a family, and enables one to thoroughly investigate the transition behavior in this spectrum. [sent-123, score-0.276]
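To make the role of the index c concrete, here is a minimal numerical sketch of a LUM-type binary loss as a function of the functional margin u = y f(x). The piecewise form below follows the commonly cited LUM parameterization with an additional parameter a > 0; treat the exact constants as an assumption rather than the paper's implementation.

```python
import numpy as np

def lum_loss(u, c=1.0, a=1.0):
    """LUM-type loss of the functional margin u = y * f(x).

    c indexes the soft-to-hard spectrum: c = 0 gives a smooth, soft-style loss,
    c -> infinity recovers the SVM hinge loss, and c = 1 gives a DWD-like loss.
    a > 0 controls how fast the smooth piece decays (assumed parameterization).
    """
    u = np.asarray(u, dtype=float)
    thresh = c / (1.0 + c)
    out = np.empty_like(u)
    low = u < thresh
    out[low] = 1.0 - u[low]                      # hinge-like linear piece
    out[~low] = (1.0 / (1.0 + c)) * (a / ((1.0 + c) * u[~low] - c + a)) ** a
    return out

u = np.linspace(-1.0, 2.0, 7)
print(lum_loss(u, c=0.0))     # soft end of the spectrum
print(lum_loss(u, c=1.0))     # in-between (DWD-like) member
print(lum_loss(u, c=1e6))     # essentially the hinge loss [1 - u]_+
```

Varying c while holding everything else fixed is what produces the soft-to-hard transition discussed in this section.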
28 In the next section, we briefly introduce some existing methods for multicategory classification problems. [sent-147, score-0.312]
29 2 Existing Multicategory Classification Methods To solve a multicategory problem, a natural and direct way is to implement multiple binary classifiers. [sent-149, score-0.343]
30 Hence, it is desirable to have a simultaneous multicategory classifier that considers k classes altogether. [sent-173, score-0.329]
31 The idea of simultaneous multicategory classifiers is as follows. [sent-174, score-0.312]
32 Similarly to (3), we are interested in solving the following optimization problem $\min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} V\big(f(x_i), y_i\big) + \lambda J(f)$, (4) with V being a loss function for a multicategory problem, and J(f) being a regularization term defined for the multicategory problems. [sent-187, score-0.679]
33 For soft classification in multiclass problems, Zhu and Hastie (2005) used the generalized logistic loss $V = -f_y(x) + \log\big(e^{f_1(x)} + \cdots + e^{f_k(x)}\big)$. [sent-194, score-0.423]
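For concreteness, this generalized logistic loss is the familiar multinomial (softmax) negative log-likelihood; a minimal sketch for a single observation (variable names are mine, not the paper's):

```python
import numpy as np

def generalized_logistic_loss(f, y):
    """V = -f_y(x) + log(sum_j exp(f_j(x))) for one observation.

    f : length-k array with the decision function values f_1(x), ..., f_k(x)
    y : integer class label in {0, ..., k-1}
    """
    f = np.asarray(f, dtype=float)
    m = f.max()                                  # stabilize the log-sum-exp
    return -(f[y] - m) + np.log(np.exp(f - m).sum())

print(generalized_logistic_loss([2.0, -1.0, -1.0], y=0))  # small loss: correct class dominates
print(generalized_logistic_loss([2.0, -1.0, -1.0], y=1))  # larger loss: wrong class favored
```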
34 (2009) extended the Adaboost to a multicategory learning method with the exponential loss $V = \exp\big(-\tfrac{1}{k}\, z^{T} f\big)$. [sent-200, score-0.367]
35 In the literature of hard classifiers, there are several ways to extend the binary hinge loss of the SVM to the simultaneous multicategory case. [sent-201, score-0.525]
36 Here we list several commonly used versions with the sum-to-zero constraint: Loss 1 (Naive hinge loss) $[1 - f_y(x)]_+$; Loss 2 (Vapnik, 1998) $\sum_{j \neq y} [1 - (f_y(x) - f_j(x))]_+$; Loss 3 (Crammer et al. [sent-202, score-0.249]
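The sentence above is truncated before Loss 3 is displayed; the third function in the sketch below uses the Crammer-Singer style formulation as commonly stated, which may not match the paper's exact display. A minimal sketch of these hinge-type multicategory losses:

```python
import numpy as np

def naive_hinge(f, y):
    """Loss 1: [1 - f_y(x)]_+."""
    f = np.asarray(f, dtype=float)
    return max(0.0, 1.0 - f[y])

def vapnik_hinge(f, y):
    """Loss 2: sum over j != y of [1 - (f_y(x) - f_j(x))]_+."""
    f = np.asarray(f, dtype=float)
    margins = f[y] - np.delete(f, y)
    return np.maximum(0.0, 1.0 - margins).sum()

def crammer_singer_hinge(f, y):
    """A Crammer-Singer style loss: [1 - (f_y(x) - max_{j != y} f_j(x))]_+ (assumed form)."""
    f = np.asarray(f, dtype=float)
    return max(0.0, 1.0 - (f[y] - np.delete(f, y).max()))

f = np.array([0.8, -0.3, -0.5])      # a sum-to-zero decision vector
print(naive_hinge(f, 0), vapnik_hinge(f, 0), crammer_singer_hinge(f, 0))
```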
37 Next we examine the sum-to-zero constraint $\sum_{j=1}^{k} f_j(x) = 0$ for different multicategory losses. [sent-216, score-0.312]
38 This constraint was also used in many other simultaneous multicategory classification papers, for example, Tang and Zhang (2006), Wang and Shen (2007), and Zhu et al. [sent-242, score-0.312]
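As an implementation note (a convenience of my own, not necessarily the construction used in the paper), the sum-to-zero constraint can be enforced by centering an unconstrained score vector, or equivalently by parameterizing only k - 1 of the functions:

```python
import numpy as np

def project_sum_to_zero(f):
    """Project an unconstrained score vector onto {f : sum_j f_j(x) = 0}."""
    f = np.asarray(f, dtype=float)
    return f - f.mean()

f_raw = np.array([1.2, 0.1, -0.4])
f = project_sum_to_zero(f_raw)
print(f, f.sum())     # constrained scores; the sum is (numerically) zero
```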
39 3 MLUM Family Soft and hard classifiers have both been studied in the literature of simultaneous multicategory classification. [sent-245, score-0.4]
40 The LUM family is a broad family which embraces both soft and hard classifiers in binary cases, yet no such convenient platform is available in the multicategory framework. [sent-247, score-0.693]
41 In this paper, we propose a new family of MLUMs to study multicategory problems. [sent-248, score-0.349]
42 The main motivation to use the MLUM loss function (6) is based on the argmax rule for multicategory classification. [sent-251, score-0.395]
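For reference, the argmax rule simply assigns the label whose decision function component is largest; a one-line sketch:

```python
import numpy as np

def argmax_rule(f):
    """Predict the label index with the largest decision function component."""
    return int(np.argmax(np.asarray(f)))

print(argmax_rule([-0.2, 0.9, -0.7]))   # predicts class index 1
```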
43 (2006) and Zhang (2004a), from the binary case to the multicategory one. [sent-277, score-0.343]
44 In the multicategory classification literature, to tackle the Fisher consistency problem, we need the following definitions. [sent-294, score-0.329]
45 In multicategory problems, the conditional probability becomes a vector, and this makes the transition behavior of probability estimation from soft to hard classification more complex. [sent-384, score-0.678]
46 In particular, we need to generalize the flat region in Figure 3 to the multicategory setting. [sent-385, score-0.312]
47 Zhang (2004b) further studied the relationship between the two excess risks in multicategory problems. [sent-418, score-0.379]
48 In particular, the convergence rates of the two excess risks are well studied in Wang and Shen (2007), in the setting of L1 penalized multicategory SVM with linear learning. [sent-419, score-0.396]
49 In this section, we employ the generalization of the excess V -risk from binary to multicategory cases as in Zhang (2004b), and explore the explicit form for MLUM. [sent-420, score-0.424]
50 The plots show the transition of the MLUM from soft classification (top left panel) to hard classification (bottom right panel). [sent-426, score-0.276]
51 A similar result for multicategory problems is obtained in Zhang (2004b). [sent-434, score-0.312]
52 Here we extend the results in Zhang (2004a) to the multicategory version. [sent-436, score-0.312]
53 3 includes two examples where the MLUM with c = 1, corresponding to a new multicategory DWD, can outperform both hard and soft classifiers, which was not observed in the binary LUM case in Liu et al. [sent-583, score-0.633]
54 Note that this multicategory DWD is different from the version proposed in Huang et al. [sent-585, score-0.312]
55 4 provides some summary on the effect of hard versus soft classifiers and gives some insight on the choice of classification methods. [sent-588, score-0.276]
56 Interestingly, we find that the behavior of MLUM with c = 1 can outperform the hard (c → ∞) and soft (c = 0) classifiers in certain cases. [sent-593, score-0.276]
57 Here for the MPLR, we replace the loss function ℓ by the logistic loss $V(f, y) = \log(1 + e^{-y f})$, while keeping the convex combination so that the classifier is tuned with different γ values. [sent-619, score-0.351]
58 Note that in most examples, the tuned MLUM performs better than the other methods, and is recommended. [sent-621, score-0.237]
59 In this case, estimation of the conditional probability function is challenging, and we expect hard classifiers to perform better than the soft ones, because the former bypass probability estimation. [sent-667, score-0.366]
61 This is a multicategory generalization of Example 2 in Liu et al. [sent-707, score-0.312]
62 Thus class probabilities are difficult to estimate in this case, and the classification accuracy of soft classifiers may be sacrificed by probability estimation. [sent-710, score-0.23]
63 Interestingly, the soft classifier gives worse probability estimation than hard classifiers in this example. [sent-786, score-0.323]
64 In such cases the accurate probability estimation may help the soft classifier to build more accurate classification boundaries. [sent-806, score-0.235]
66 The left panel shows that this is an example in which the soft classifier works the best. [sent-875, score-0.271]
69 The left panel shows the soft classifier works reasonably well in this example. [sent-949, score-0.271]
70 (2011) showed that among many examples, the classifier that works the best in the LUM family appears to be either a soft or a hard classifier. [sent-989, score-0.329]
71 In the MLUM family, however, we observe that the MLUM with c = 1 (a new multicategory DWD) can sometimes yield the best performance in terms of classification accuracy. [sent-990, score-0.312]
72 In these two examples, we add a small percentage of noisy data into the originally clean data sets, and observe that soft classifiers can be very sensitive to outliers in terms of classification accuracy, while MLUMs with c ≥ 1 appear to be quite robust. [sent-992, score-0.236]
74 4 Summary of Simulation Results Our simulated examples provide some insights on hard versus soft classification methods, and suggest that no single classifier works universally the best in all cases. [sent-1103, score-0.321]
75 Our simulation studies showed different cases in which hard (c = ∞), soft (c = 0) and in-between (c = 1) classifiers work the best, respectively. [sent-1105, score-0.276]
76 When the underlying conditional probability function is a step function, probability estimation can be a difficult problem for soft classifiers, and the prediction accuracy may be compromised. [sent-1106, score-0.278]
77 Interestingly, the soft classifier is quite vulnerable to potential outliers in the data. [sent-1142, score-0.236]
78 In Example 5, we add 5% outliers to the clean samples, and the soft classifier performs the worst among all the methods. [sent-1143, score-0.251]
79 However, in Example 6, when we add only 1% outliers to the data, the classification accuracy of the soft classifier is reduced by over 40%, while the reductions for the other classifiers are just around 15-20%. [sent-1145, score-0.236]
80 Therefore the tuned MLUM procedure is sufficient, and tuning over the entire interval γ ∈ [0, 1] can be avoided. [sent-1151, score-0.244]
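A plain cross-validated grid search is one way to implement the tuned MLUM; the candidate grids below (c over {0, 1, 10, 100, 1000} and γ over {0, 0.5, 1}) and the mlum_fit / mlum_error callables are illustrative assumptions, not the authors' code:

```python
import itertools

def tune_mlum(X_train, y_train, X_valid, y_valid, mlum_fit, mlum_error,
              c_grid=(0.0, 1.0, 10.0, 100.0, 1000.0), gamma_grid=(0.0, 0.5, 1.0)):
    """Pick (c, gamma) minimizing validation error; mlum_fit/mlum_error are user-supplied."""
    best = None
    for c, gamma in itertools.product(c_grid, gamma_grid):
        model = mlum_fit(X_train, y_train, c=c, gamma=gamma)
        err = mlum_error(model, X_valid, y_valid)
        if best is None or err < best[0]:
            best = (err, c, gamma, model)
    return best   # (validation error, chosen c, chosen gamma, fitted model)
```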
81 Since we observed in the simulation studies that potential outliers may decrease the accuracy of soft classifiers, we perform Principal Component Analysis (PCA) on each data set and examine whether there are any obvious potential outliers in the PCA projected plots. [sent-1214, score-0.284]
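A minimal sketch of this PCA-based outlier screening via a plain SVD (the two-component projection is my assumption of what the PCA projected plots display):

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project rows of X onto the top principal components for visual outlier screening."""
    Xc = X - X.mean(axis=0)                        # center each covariate
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T                # scores on the leading components

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[0] += 10.0                                       # plant an obvious outlier
scores = pca_project(X)
print(np.argmax(np.linalg.norm(scores, axis=1)))   # the outlier's row index stands out
```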
82 The Iris data set doesn’t appear to have any obvious outliers and the soft classifier works very well there. [sent-1217, score-0.252]
83 The tuned MLUM performs slightly worse than the RMSVM, but the difference is not large. [sent-1221, score-0.237]
85 The right panel suggests that the classification error is smallest when c is tuned with γ = 1. [sent-1268, score-0.317]
86 Interestingly, hard and soft classifiers perform similarly in terms of classification accuracy. [sent-1270, score-0.276]
89 Overall, the tuned MLUM performs the best, as in the simulated examples. [sent-1324, score-0.252]
90 In this paper, we generalize the binary LUM family to the simultaneous multicategory classifiers, namely, the MLUM family. [sent-1327, score-0.38]
91 The MLUM is very general and it includes many popular multicategory classification methods as special cases. [sent-1328, score-0.312]
92 In particular, the MLUM family includes both soft and hard classifiers, and provides a platform to explore the transition behavior from soft to hard classification problems. [sent-1329, score-0.603]
94 The numerical examples show that hard and soft classifiers behave quite differently in various settings, and they help to shed some light on the choice between the two. [sent-1412, score-0.308]
95 We numerically demonstrate that the tuned MLUM outperforms several other multicategory techniques and should be a competitive addition to the existing classification toolbox. [sent-1415, score-0.534]
96 The left panel shows that the test errors of the soft and hard classifiers are roughly comparable, while c = 1 (DWD) is the worst. [sent-1436, score-0.396]
97 After some calculation we have $S(\mathbf{f}, x) - S(\mathbf{f}^*, x) = (1-\gamma)[\delta_1 - \delta_2] + P_1\{\gamma(-\delta_4) + (1-\gamma)(-\delta_1)\} + P_2\{\gamma \delta_3 + (1-\gamma)\delta_2\}$ (13) [sent-1509, score-0.534]
99 The left panel suggests that the soft classification method performs better than the others in terms of classification accuracy. [sent-1561, score-0.27]
100 New multicategory boosting algorithms based on multicategory Fisher-consistent losses. [sent-1912, score-0.624]
wordName wordTfidf (topN-words)
[('mlum', 0.749), ('multicategory', 0.312), ('tuned', 0.222), ('soft', 0.188), ('lum', 0.155), ('classi', 0.129), ('achines', 0.111), ('nified', 0.111), ('ulticategory', 0.111), ('fy', 0.105), ('pj', 0.101), ('ers', 0.091), ('fixed', 0.09), ('hard', 0.088), ('maes', 0.087), ('rmsvm', 0.08), ('arge', 0.07), ('panel', 0.067), ('excess', 0.067), ('vehicle', 0.066), ('iu', 0.064), ('liu', 0.063), ('msvm', 0.062), ('vertebral', 0.062), ('fisher', 0.058), ('hang', 0.057), ('fih', 0.056), ('mplr', 0.056), ('loss', 0.055), ('iris', 0.054), ('margin', 0.054), ('er', 0.051), ('outliers', 0.048), ('dermatology', 0.048), ('dwd', 0.043), ('fk', 0.042), ('hinge', 0.039), ('fi', 0.039), ('gbm', 0.037), ('family', 0.037), ('pk', 0.035), ('breast', 0.034), ('mae', 0.033), ('wine', 0.033), ('cation', 0.032), ('binary', 0.031), ('error', 0.028), ('argmax', 0.028), ('shen', 0.027), ('ex', 0.027), ('reinforced', 0.026), ('test', 0.025), ('chapel', 0.025), ('mlums', 0.025), ('proneural', 0.025), ('probability', 0.025), ('bayes', 0.024), ('pc', 0.023), ('risk', 0.023), ('estimation', 0.022), ('losses', 0.022), ('tuning', 0.022), ('yuan', 0.022), ('zhang', 0.021), ('cancer', 0.021), ('fh', 0.021), ('bartlett', 0.02), ('logistic', 0.019), ('zhu', 0.019), ('arginf', 0.019), ('glioblastoma', 0.019), ('conditional', 0.018), ('shed', 0.018), ('consistency', 0.017), ('class', 0.017), ('svm', 0.017), ('classes', 0.017), ('rates', 0.017), ('bregman', 0.016), ('tang', 0.016), ('works', 0.016), ('yes', 0.016), ('hill', 0.016), ('covariates', 0.015), ('simulated', 0.015), ('uni', 0.015), ('performs', 0.015), ('minimizer', 0.015), ('pca', 0.015), ('multiclass', 0.014), ('rosset', 0.014), ('nq', 0.014), ('boser', 0.014), ('kz', 0.014), ('examples', 0.014), ('lee', 0.014), ('explore', 0.014), ('dd', 0.014), ('wahba', 0.014), ('reproducing', 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 73 jmlr-2013-Multicategory Large-Margin Unified Machines
Author: Chong Zhang, Yufeng Liu
Abstract: Hard and soft classifiers are two important groups of techniques for classification problems. Logistic regression and Support Vector Machines are typical examples of soft and hard classifiers respectively. The essential difference between these two groups is whether one needs to estimate the class conditional probability for the classification task or not. In particular, soft classifiers predict the label based on the obtained class conditional probabilities, while hard classifiers bypass the estimation of probabilities and focus on the decision boundary. In practice, for the goal of accurate classification, it is unclear which one to use in a given situation. To tackle this problem, the Largemargin Unified Machine (LUM) was recently proposed as a unified family to embrace both groups. The LUM family enables one to study the behavior change from soft to hard binary classifiers. For multicategory cases, however, the concept of soft and hard classification becomes less clear. In that case, class probability estimation becomes more involved as it requires estimation of a probability vector. In this paper, we propose a new Multicategory LUM (MLUM) framework to investigate the behavior of soft versus hard classification under multicategory settings. Our theoretical and numerical results help to shed some light on the nature of multicategory classification and its transition behavior from soft to hard classifiers. The numerical results suggest that the proposed tuned MLUM yields very competitive performance. Keywords: hard classification, large-margin, soft classification, support vector machine
2 0.05305396 8 jmlr-2013-A Theory of Multiclass Boosting
Author: Indraneel Mukherjee, Robert E. Schapire
Abstract: Boosting combines weak classifiers to form highly accurate predictors. Although the case of binary classification is well understood, in the multiclass setting, the “correct” requirements on the weak classifier, or the notion of the most efficient boosting algorithms are missing. In this paper, we create a broad and general framework, within which we make precise and identify the optimal requirements on the weak-classifier, as well as design the most effective, in a certain sense, boosting algorithms that assume such requirements. Keywords: multiclass, boosting, weak learning condition, drifting games
3 0.048588317 22 jmlr-2013-Classifying With Confidence From Incomplete Information
Author: Nathan Parrish, Hyrum S. Anderson, Maya R. Gupta, Dun Yu Hsiao
Abstract: We consider the problem of classifying a test sample given incomplete information. This problem arises naturally when data about a test sample is collected over time, or when costs must be incurred to compute the classification features. For example, in a distributed sensor network only a fraction of the sensors may have reported measurements at a certain time, and additional time, power, and bandwidth is needed to collect the complete data to classify. A practical goal is to assign a class label as soon as enough data is available to make a good decision. We formalize this goal through the notion of reliability—the probability that a label assigned given incomplete data would be the same as the label assigned given the complete data, and we propose a method to classify incomplete data only if some reliability threshold is met. Our approach models the complete data as a random variable whose distribution is dependent on the current incomplete data and the (complete) training data. The method differs from standard imputation strategies in that our focus is on determining the reliability of the classification decision, rather than just the class label. We show that the method provides useful reliability estimates of the correctness of the imputed class labels on a set of experiments on time-series data sets, where the goal is to classify the time-series as early as possible while still guaranteeing that the reliability threshold is met. Keywords: classification, sensor networks, signals, reliability
4 0.047334012 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification
Author: Xin Tong
Abstract: The Neyman-Pearson (NP) paradigm in binary classification treats type I and type II errors with different priorities. It seeks classifiers that minimize type II error, subject to a type I error constraint under a user specified level α. In this paper, plug-in classifiers are developed under the NP paradigm. Based on the fundamental Neyman-Pearson Lemma, we propose two related plug-in classifiers which amount to thresholding respectively the class conditional density ratio and the regression function. These two classifiers handle different sampling schemes. This work focuses on theoretical properties of the proposed classifiers; in particular, we derive oracle inequalities that can be viewed as finite sample versions of risk bounds. NP classification can be used to address anomaly detection problems, where asymmetry in errors is an intrinsic property. As opposed to a common practice in anomaly detection that consists of thresholding normal class density, our approach does not assume a specific form for anomaly distributions. Such consideration is particularly necessary when the anomaly class density is far from uniformly distributed. Keywords: plug-in approach, Neyman-Pearson paradigm, nonparametric statistics, oracle inequality, anomaly detection
5 0.043172956 114 jmlr-2013-The Rate of Convergence of AdaBoost
Author: Indraneel Mukherjee, Cynthia Rudin, Robert E. Schapire
Abstract: The AdaBoost algorithm was designed to combine many “weak” hypotheses that perform slightly better than random guessing into a “strong” hypothesis that has very low error. We study the rate at which AdaBoost iteratively converges to the minimum of the “exponential loss.” Unlike previous work, our proofs do not require a weak-learning assumption, nor do they require that minimizers of the exponential loss are finite. Our first result shows that the exponential loss of AdaBoost’s computed parameter vector will be at most ε more than that of any parameter vector of ℓ1 -norm bounded by B in a number of rounds that is at most a polynomial in B and 1/ε. We also provide lower bounds showing that a polynomial dependence is necessary. Our second result is that within C/ε iterations, AdaBoost achieves a value of the exponential loss that is at most ε more than the best possible value, where C depends on the data set. We show that this dependence of the rate on ε is optimal up to constant factors, that is, at least Ω(1/ε) rounds are necessary to achieve within ε of the optimal exponential loss. Keywords: AdaBoost, optimization, coordinate descent, convergence rate
6 0.042932693 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
7 0.040256269 70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach
8 0.036101419 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning
9 0.035933349 20 jmlr-2013-CODA: High Dimensional Copula Discriminant Analysis
10 0.026715608 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
11 0.026209088 68 jmlr-2013-Machine Learning with Operational Costs
12 0.024756551 19 jmlr-2013-BudgetedSVM: A Toolbox for Scalable SVM Approximations
13 0.023983888 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut
14 0.023275435 7 jmlr-2013-A Risk Comparison of Ordinary Least Squares vs Ridge Regression
15 0.023234136 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
16 0.023192635 103 jmlr-2013-Sparse Robust Estimation and Kalman Smoothing with Nonsmooth Log-Concave Densities: Modeling, Computation, and Theory
17 0.023029281 23 jmlr-2013-Cluster Analysis: Unsupervised Learning via Supervised Learning with a Non-convex Penalty
18 0.023015741 2 jmlr-2013-A Binary-Classification-Based Metric between Time-Series Distributions and Its Use in Statistical and Learning Problems
19 0.022812454 76 jmlr-2013-Nonparametric Sparsity and Regularization
20 0.022164073 87 jmlr-2013-Performance Bounds for λ Policy Iteration and Application to the Game of Tetris
topicId topicWeight
[(0, -0.121), (1, 0.049), (2, 0.007), (3, 0.051), (4, 0.013), (5, -0.047), (6, 0.036), (7, -0.014), (8, -0.006), (9, -0.106), (10, 0.081), (11, -0.018), (12, 0.036), (13, -0.104), (14, -0.073), (15, -0.056), (16, 0.049), (17, 0.028), (18, -0.144), (19, 0.054), (20, -0.107), (21, 0.03), (22, 0.071), (23, -0.026), (24, 0.036), (25, 0.033), (26, -0.048), (27, 0.134), (28, -0.132), (29, -0.05), (30, 0.112), (31, -0.167), (32, -0.288), (33, 0.028), (34, -0.031), (35, 0.018), (36, -0.027), (37, 0.011), (38, -0.012), (39, 0.121), (40, -0.093), (41, 0.076), (42, -0.093), (43, -0.059), (44, 0.126), (45, 0.061), (46, 0.061), (47, 0.075), (48, 0.024), (49, 0.191)]
simIndex simValue paperId paperTitle
same-paper 1 0.93564993 73 jmlr-2013-Multicategory Large-Margin Unified Machines
Author: Chong Zhang, Yufeng Liu
Abstract: Hard and soft classifiers are two important groups of techniques for classification problems. Logistic regression and Support Vector Machines are typical examples of soft and hard classifiers respectively. The essential difference between these two groups is whether one needs to estimate the class conditional probability for the classification task or not. In particular, soft classifiers predict the label based on the obtained class conditional probabilities, while hard classifiers bypass the estimation of probabilities and focus on the decision boundary. In practice, for the goal of accurate classification, it is unclear which one to use in a given situation. To tackle this problem, the Largemargin Unified Machine (LUM) was recently proposed as a unified family to embrace both groups. The LUM family enables one to study the behavior change from soft to hard binary classifiers. For multicategory cases, however, the concept of soft and hard classification becomes less clear. In that case, class probability estimation becomes more involved as it requires estimation of a probability vector. In this paper, we propose a new Multicategory LUM (MLUM) framework to investigate the behavior of soft versus hard classification under multicategory settings. Our theoretical and numerical results help to shed some light on the nature of multicategory classification and its transition behavior from soft to hard classifiers. The numerical results suggest that the proposed tuned MLUM yields very competitive performance. Keywords: hard classification, large-margin, soft classification, support vector machine
2 0.77522713 22 jmlr-2013-Classifying With Confidence From Incomplete Information
Author: Nathan Parrish, Hyrum S. Anderson, Maya R. Gupta, Dun Yu Hsiao
Abstract: We consider the problem of classifying a test sample given incomplete information. This problem arises naturally when data about a test sample is collected over time, or when costs must be incurred to compute the classification features. For example, in a distributed sensor network only a fraction of the sensors may have reported measurements at a certain time, and additional time, power, and bandwidth is needed to collect the complete data to classify. A practical goal is to assign a class label as soon as enough data is available to make a good decision. We formalize this goal through the notion of reliability—the probability that a label assigned given incomplete data would be the same as the label assigned given the complete data, and we propose a method to classify incomplete data only if some reliability threshold is met. Our approach models the complete data as a random variable whose distribution is dependent on the current incomplete data and the (complete) training data. The method differs from standard imputation strategies in that our focus is on determining the reliability of the classification decision, rather than just the class label. We show that the method provides useful reliability estimates of the correctness of the imputed class labels on a set of experiments on time-series data sets, where the goal is to classify the time-series as early as possible while still guaranteeing that the reliability threshold is met. Keywords: classification, sensor networks, signals, reliability
3 0.44029611 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification
Author: Xin Tong
Abstract: The Neyman-Pearson (NP) paradigm in binary classification treats type I and type II errors with different priorities. It seeks classifiers that minimize type II error, subject to a type I error constraint under a user specified level α. In this paper, plug-in classifiers are developed under the NP paradigm. Based on the fundamental Neyman-Pearson Lemma, we propose two related plug-in classifiers which amount to thresholding respectively the class conditional density ratio and the regression function. These two classifiers handle different sampling schemes. This work focuses on theoretical properties of the proposed classifiers; in particular, we derive oracle inequalities that can be viewed as finite sample versions of risk bounds. NP classification can be used to address anomaly detection problems, where asymmetry in errors is an intrinsic property. As opposed to a common practice in anomaly detection that consists of thresholding normal class density, our approach does not assume a specific form for anomaly distributions. Such consideration is particularly necessary when the anomaly class density is far from uniformly distributed. Keywords: plug-in approach, Neyman-Pearson paradigm, nonparametric statistics, oracle inequality, anomaly detection
4 0.43544838 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
Author: Takafumi Kanamori, Akiko Takeda, Taiji Suzuki
Abstract: There are two main approaches to binary classification problems: the loss function approach and the uncertainty set approach. The loss function approach is widely used in real-world data analysis. Statistical decision theory has been used to elucidate its properties such as statistical consistency. Conditional probabilities can also be estimated by using the minimum solution of the loss function. In the uncertainty set approach, an uncertainty set is defined for each binary label from training samples. The best separating hyperplane between the two uncertainty sets is used as the decision function. Although the uncertainty set approach provides an intuitive understanding of learning algorithms, its statistical properties have not been sufficiently studied. In this paper, we show that the uncertainty set is deeply connected with the convex conjugate of a loss function. On the basis of the conjugate relation, we propose a way of revising the uncertainty set approach so that it will have good statistical properties such as statistical consistency. We also introduce statistical models corresponding to uncertainty sets in order to estimate conditional probabilities. Finally, we present numerical experiments, verifying that the learning with revised uncertainty sets improves the prediction accuracy. Keywords: loss function, uncertainty set, convex conjugate, consistency
5 0.33729947 11 jmlr-2013-Algorithms for Discovery of Multiple Markov Boundaries
Author: Alexander Statnikov, Nikita I. Lytkin, Jan Lemeire, Constantin F. Aliferis
Abstract: Algorithms for Markov boundary discovery from data constitute an important recent development in machine learning, primarily because they offer a principled solution to the variable/feature selection problem and give insight on local causal structure. Over the last decade many sound algorithms have been proposed to identify a single Markov boundary of the response variable. Even though faithful distributions and, more broadly, distributions that satisfy the intersection property always have a single Markov boundary, other distributions/data sets may have multiple Markov boundaries of the response variable. The latter distributions/data sets are common in practical data-analytic applications, and there are several reasons why it is important to induce multiple Markov boundaries from such data. However, there are currently no sound and efficient algorithms that can accomplish this task. This paper describes a family of algorithms TIE* that can discover all Markov boundaries in a distribution. The broad applicability as well as efficiency of the new algorithmic family is demonstrated in an extensive benchmarking study that involved comparison with 26 state-of-the-art algorithms/variants in 15 data sets from a diversity of application domains. Keywords: Markov boundary discovery, variable/feature selection, information equivalence, violations of faithfulness
6 0.32624745 20 jmlr-2013-CODA: High Dimensional Copula Discriminant Analysis
7 0.31521362 8 jmlr-2013-A Theory of Multiclass Boosting
9 0.25656477 38 jmlr-2013-Dynamic Affine-Invariant Shape-Appearance Handshape Features and Classification in Sign Language Videos
10 0.24236466 68 jmlr-2013-Machine Learning with Operational Costs
11 0.23448452 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
13 0.23170806 70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach
14 0.22275296 12 jmlr-2013-Alleviating Naive Bayes Attribute Independence Assumption by Attribute Weighting
15 0.21457112 118 jmlr-2013-Using Symmetry and Evolutionary Search to Minimize Sorting Networks
16 0.21350376 19 jmlr-2013-BudgetedSVM: A Toolbox for Scalable SVM Approximations
17 0.20393164 114 jmlr-2013-The Rate of Convergence of AdaBoost
19 0.19618714 100 jmlr-2013-Similarity-based Clustering by Left-Stochastic Matrix Factorization
20 0.19035929 23 jmlr-2013-Cluster Analysis: Unsupervised Learning via Supervised Learning with a Non-convex Penalty
topicId topicWeight
[(0, 0.014), (5, 0.18), (6, 0.026), (10, 0.07), (20, 0.016), (23, 0.026), (34, 0.342), (41, 0.018), (68, 0.02), (70, 0.022), (75, 0.067), (85, 0.044), (87, 0.022), (93, 0.018)]
simIndex simValue paperId paperTitle
1 0.75940418 53 jmlr-2013-Improving CUR Matrix Decomposition and the Nystrom Approximation via Adaptive Sampling
Author: Shusen Wang, Zhihua Zhang
Abstract: The CUR matrix decomposition and the Nyström approximation are two important low-rank matrix approximation techniques. The Nyström method approximates a symmetric positive semidefinite matrix in terms of a small number of its columns, while CUR approximates an arbitrary data matrix by a small number of its columns and rows. Thus, CUR decomposition can be regarded as an extension of the Nyström approximation. In this paper we establish a more general error bound for the adaptive column/row sampling algorithm, based on which we propose more accurate CUR and Nyström algorithms with expected relative-error bounds. The proposed CUR and Nyström algorithms also have low time complexity and can avoid maintaining the whole data matrix in RAM. In addition, we give theoretical analysis for the lower error bounds of the standard Nyström method and the ensemble Nyström method. The main theoretical results established in this paper are novel, and our analysis makes no special assumption on the data matrices. Keywords: large-scale matrix computation, CUR matrix decomposition, the Nyström method, randomized algorithms, adaptive sampling
same-paper 2 0.72227889 73 jmlr-2013-Multicategory Large-Margin Unified Machines
Author: Chong Zhang, Yufeng Liu
Abstract: Hard and soft classifiers are two important groups of techniques for classification problems. Logistic regression and Support Vector Machines are typical examples of soft and hard classifiers respectively. The essential difference between these two groups is whether one needs to estimate the class conditional probability for the classification task or not. In particular, soft classifiers predict the label based on the obtained class conditional probabilities, while hard classifiers bypass the estimation of probabilities and focus on the decision boundary. In practice, for the goal of accurate classification, it is unclear which one to use in a given situation. To tackle this problem, the Largemargin Unified Machine (LUM) was recently proposed as a unified family to embrace both groups. The LUM family enables one to study the behavior change from soft to hard binary classifiers. For multicategory cases, however, the concept of soft and hard classification becomes less clear. In that case, class probability estimation becomes more involved as it requires estimation of a probability vector. In this paper, we propose a new Multicategory LUM (MLUM) framework to investigate the behavior of soft versus hard classification under multicategory settings. Our theoretical and numerical results help to shed some light on the nature of multicategory classification and its transition behavior from soft to hard classifiers. The numerical results suggest that the proposed tuned MLUM yields very competitive performance. Keywords: hard classification, large-margin, soft classification, support vector machine
3 0.51709867 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
Author: Yuchen Zhang, John C. Duchi, Martin J. Wainwright
Abstract: We analyze two communication-efficient algorithms for distributed optimization in statistical settings involving large-scale data sets. The first algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves mean-squared error (MSE) that decays as $O(N^{-1} + (N/m)^{-2})$. Whenever $m \le \sqrt{N}$, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as $O(N^{-1} + (N/m)^{-3})$, and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as $O(N^{-1} + (N/m)^{-3/2})$, easing computation at the expense of a potentially slower MSE rate. We also provide an experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with $N \approx 2.4 \times 10^{8}$ samples and $d \approx 740{,}000$ covariates. Keywords: distributed learning, stochastic optimization, averaging, subsampling
4 0.5138815 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
Author: Tony Cai, Wen-Xin Zhou
Abstract: We consider in this paper the problem of noisy 1-bit matrix completion under a general non-uniform sampling distribution using the max-norm as a convex relaxation for the rank. A max-norm constrained maximum likelihood estimate is introduced and studied. The rate of convergence for the estimate is obtained. Information-theoretical methods are used to establish a minimax lower bound under the general sampling model. The minimax upper and lower bounds together yield the optimal rate of convergence for the Frobenius norm loss. Computational algorithms and numerical performance are also discussed. Keywords: 1-bit matrix completion, low-rank matrix, max-norm, trace-norm, constrained optimization, maximum likelihood estimate, optimal rate of convergence
5 0.5123474 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
Author: Takafumi Kanamori, Akiko Takeda, Taiji Suzuki
Abstract: There are two main approaches to binary classification problems: the loss function approach and the uncertainty set approach. The loss function approach is widely used in real-world data analysis. Statistical decision theory has been used to elucidate its properties such as statistical consistency. Conditional probabilities can also be estimated by using the minimum solution of the loss function. In the uncertainty set approach, an uncertainty set is defined for each binary label from training samples. The best separating hyperplane between the two uncertainty sets is used as the decision function. Although the uncertainty set approach provides an intuitive understanding of learning algorithms, its statistical properties have not been sufficiently studied. In this paper, we show that the uncertainty set is deeply connected with the convex conjugate of a loss function. On the basis of the conjugate relation, we propose a way of revising the uncertainty set approach so that it will have good statistical properties such as statistical consistency. We also introduce statistical models corresponding to uncertainty sets in order to estimate conditional probabilities. Finally, we present numerical experiments, verifying that the learning with revised uncertainty sets improves the prediction accuracy. Keywords: loss function, uncertainty set, convex conjugate, consistency
6 0.5091784 15 jmlr-2013-Bayesian Canonical Correlation Analysis
7 0.5082987 114 jmlr-2013-The Rate of Convergence of AdaBoost
8 0.5049693 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach
10 0.50354147 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning
11 0.50306082 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
12 0.50300759 20 jmlr-2013-CODA: High Dimensional Copula Discriminant Analysis
14 0.5005877 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees
15 0.50036728 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion
16 0.50011033 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation
17 0.50008953 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
18 0.49989226 75 jmlr-2013-Nested Expectation Propagation for Gaussian Process Classification with a Multinomial Probit Likelihood
19 0.49954286 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning
20 0.49861756 68 jmlr-2013-Machine Learning with Operational Costs