jmlr jmlr2013 jmlr2013-73 knowledge-graph by maker-knowledge-mining

73 jmlr-2013-Multicategory Large-Margin Unified Machines


Source: pdf

Author: Chong Zhang, Yufeng Liu

Abstract: Hard and soft classifiers are two important groups of techniques for classification problems. Logistic regression and Support Vector Machines are typical examples of soft and hard classifiers respectively. The essential difference between these two groups is whether one needs to estimate the class conditional probability for the classification task or not. In particular, soft classifiers predict the label based on the obtained class conditional probabilities, while hard classifiers bypass the estimation of probabilities and focus on the decision boundary. In practice, for the goal of accurate classification, it is unclear which one to use in a given situation. To tackle this problem, the Largemargin Unified Machine (LUM) was recently proposed as a unified family to embrace both groups. The LUM family enables one to study the behavior change from soft to hard binary classifiers. For multicategory cases, however, the concept of soft and hard classification becomes less clear. In that case, class probability estimation becomes more involved as it requires estimation of a probability vector. In this paper, we propose a new Multicategory LUM (MLUM) framework to investigate the behavior of soft versus hard classification under multicategory settings. Our theoretical and numerical results help to shed some light on the nature of multicategory classification and its transition behavior from soft to hard classifiers. The numerical results suggest that the proposed tuned MLUM yields very competitive performance. Keywords: hard classification, large-margin, soft classification, support vector machine

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Logistic regression and Support Vector Machines are typical examples of soft and hard classifiers respectively. [sent-6, score-0.29]

2 In particular, soft classifiers predict the label based on the obtained class conditional probabilities, while hard classifiers bypass the estimation of probabilities and focus on the decision boundary. [sent-8, score-0.333]

3 The LUM family enables one to study the behavior change from soft to hard binary classifiers. [sent-11, score-0.344]

4 For multicategory cases, however, the concept of soft and hard classification becomes less clear. [sent-12, score-0.588]

5 In this paper, we propose a new Multicategory LUM (MLUM) framework to investigate the behavior of soft versus hard classification under multicategory settings. [sent-14, score-0.588]

6 Our theoretical and numerical results help to shed some light on the nature of multicategory classification and its transition behavior from soft to hard classifiers. [sent-15, score-0.606]

7 Keywords: hard classification, large-margin, soft classification, support vector machine [sent-17, score-0.276]

8 Wahba (2002) discussed the concept of soft versus hard classification. [sent-25, score-0.276]

9 For a given problem, the choice between hard and soft classifiers can be difficult. [sent-34, score-0.276]

10 The LUM family is a rich group of classifiers in the sense that it connects hard and soft classifiers in one spectrum. [sent-37, score-0.313]

11 It provides a natural platform for comparisons between soft and hard classifiers. [sent-38, score-0.276]

12 More importantly, it enables us to observe the performance transition from soft to hard classification. [sent-39, score-0.276]

13 Furthermore, multicategory consistency is much more involved, especially for hard classifiers. [sent-43, score-0.417]

14 For instance, there are a lot of developments on multicategory SVMs in the literature (Vapnik, 1998; Weston and Watkins, 1999; Crammer et al. [sent-44, score-0.312]

15 Recently, Liu and Yuan (2011) proposed a group of consistent multicategory piecewise linear hinge loss functions, namely a family of reinforced hinge loss functions, which covers the loss by Lee et al. [sent-49, score-0.618]
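For concreteness, the sketch below illustrates the reinforced multicategory hinge loss family mentioned above. The piecewise form γ[(k − 1) − f_y(x)]_+ + (1 − γ) ∑_{j≠y}[1 + f_j(x)]_+ is an assumption recalled from Liu and Yuan (2011), not quoted from this paper, and the exact definition should be checked against that reference; the helper name is hypothetical.

```python
import numpy as np

def reinforced_hinge_loss(F, y, gamma=0.5):
    """Assumed sketch of the reinforced multicategory hinge loss (Liu and Yuan, 2011):
    gamma * [(k - 1) - f_y(x)]_+ + (1 - gamma) * sum_{j != y} [1 + f_j(x)]_+ .
    F is an n x k matrix of functional values f_j(x_i); y holds class indices 0..k-1."""
    n, k = F.shape
    fy = F[np.arange(n), y]
    first = np.maximum(0.0, (k - 1) - fy)
    second = np.maximum(0.0, 1.0 + F)
    second[np.arange(n), y] = 0.0          # exclude the j = y term from the sum
    return gamma * first + (1.0 - gamma) * second.sum(axis=1)
```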

16 For probability estimation, there are several existing multicategory soft classifiers, such as Adaboost in Boosting (Freund and Schapire, 1997; Zou et al. [sent-51, score-0.525]

17 , 2000), proximal SVMs (Tang and Zhang, 2006), and multicategory composite least squares classifiers (Park et al. [sent-54, score-0.312]

18 It helps to shed some light on the choice between multicategory soft and hard classifiers, and provide some insights on the behavior change from soft to hard classification methods. [sent-58, score-0.882]

19 (2006) to the multicategory case and study its convergence rate. [sent-61, score-0.312]

20 A MLUM member, in-between hard and soft classifiers, tends to work the best. [sent-68, score-0.276]

21 Methodology In this section, we first introduce the background of binary classification, then discuss different ways of generalization to multicategory problems. [sent-82, score-0.343]

22 The notion of soft and hard classification is first reviewed in the binary classification context. [sent-83, score-0.307]

23 Then we propose a MLUM framework which helps us to understand soft versus hard classification in the multicategory setting. [sent-84, score-0.588]

24 Different methods can be roughly grouped into two categories, namely, soft and hard classifiers. [sent-111, score-0.276]

25 The parameter c is an index of soft versus hard classifiers. [sent-121, score-0.276]

26 In particular, c = 0 corresponds to a typical soft classifier, and c → ∞ corresponds to the SVM, a typical hard classifier. [sent-122, score-0.276]
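To make the c-indexed transition concrete, here is a small NumPy sketch of the binary LUM loss as a function of the functional margin u = y f(x). The piecewise form is an assumption recalled from Liu et al. (2011), and the helper name and parameter defaults are illustrative.

```python
import numpy as np

def lum_loss(u, c=1.0, a=1.0):
    """Assumed binary LUM loss:
      V(u) = 1 - u                                   if u <  c/(1+c)
      V(u) = (1/(1+c)) * (a / ((1+c)*u - c + a))**a  if u >= c/(1+c)
    c = 0 gives a soft (DWD-like) classifier; c -> infinity approaches the SVM hinge loss."""
    u = np.asarray(u, dtype=float)
    thresh = c / (1.0 + c)
    denom = np.maximum((1.0 + c) * u - c + a, 1e-12)   # denom >= a > 0 on the smooth branch
    smooth = (1.0 / (1.0 + c)) * (a / denom) ** a
    return np.where(u < thresh, 1.0 - u, smooth)

# Illustrate the soft-to-hard transition by sweeping c.
margins = np.linspace(-2, 3, 7)
for c in (0.0, 1.0, 1000.0):        # soft, in-between, nearly hinge
    print(c, np.round(lum_loss(margins, c=c), 3))
```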

27 Consequently, the LUM connects soft and hard classifiers as a family, and enables one to thoroughly investigate the transition behavior in this spectrum. [sent-123, score-0.276]

28 In the next section, we briefly introduce some existing methods for multicategory classification problems. [sent-147, score-0.312]

29 2 Existing Multicategory Classification Methods To solve a multicategory problem, a natural and direct way is to implement multiple binary classifiers. [sent-149, score-0.343]

30 Hence, it is desirable to have a simultaneous multicategory classifier that considers k classes altogether. [sent-173, score-0.329]

31 The idea of simultaneous multicategory classifiers is as follows. [sent-174, score-0.312]

32 Similarly to (3), we are interested in solving the following optimization problem: min_{f ∈ H} (1/n) ∑_{i=1}^{n} V(f(x_i), y_i) + λ J(f), (4) with V being a loss function for a multicategory problem, and J(f) being a regularization term defined for multicategory problems. [sent-187, score-0.679]
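A minimal sketch of problem (4) with linear functions and a ridge penalty follows. The helper fit_multicategory, the choice of penalty, and the toy squared-error loss in the usage example are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize

def fit_multicategory(X, y, loss, k, lam=0.1):
    """Generic sketch of problem (4): minimize (1/n) sum_i V(f(x_i), y_i) + lam * J(f)
    with linear functions f_j(x) = x @ W[:, j] + b[j] and ridge penalty J(f) = ||W||_F^2.
    `loss(F, y)` must return one value per sample given the n x k matrix F of f_j(x_i)."""
    n, d = X.shape

    def unpack(theta):
        return theta[:d * k].reshape(d, k), theta[d * k:]

    def objective(theta):
        W, b = unpack(theta)
        F = X @ W + b
        return loss(F, y).mean() + lam * np.sum(W ** 2)

    res = minimize(objective, np.zeros(d * k + k), method="L-BFGS-B")
    return unpack(res.x)

# Tiny usage example with a squared-error loss against one-hot targets (illustrative only).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int) + (X[:, 0] - X[:, 1] > 1).astype(int)  # 3 classes
    sq_loss = lambda F, y: ((F - np.eye(F.shape[1])[y]) ** 2).sum(axis=1)
    W, b = fit_multicategory(X, y, sq_loss, k=3)
    print("training accuracy:", np.mean((X @ W + b).argmax(axis=1) == y))
```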

33 For soft classification in multiclass problems, Zhu and Hastie (2005) used the generalized logistic loss V = −f_y(x) + log(e^{f_1(x)} + · · · + e^{f_k(x)}). [sent-194, score-0.423]
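The generalized logistic loss quoted above can be written directly in NumPy; the stabilizing shift by the row maximum leaves the loss value unchanged.

```python
import numpy as np

def generalized_logistic_loss(F, y):
    """Generalized logistic loss of Zhu and Hastie (2005):
    V = -f_y(x) + log(e^{f_1(x)} + ... + e^{f_k(x)}).
    F is an n x k matrix of f_j(x_i); y holds class indices in {0, ..., k-1}."""
    F = F - F.max(axis=1, keepdims=True)   # stable log-sum-exp; the loss is invariant to this shift
    return np.log(np.exp(F).sum(axis=1)) - F[np.arange(len(y)), y]

# Example: three points, three classes.
F = np.array([[2.0, 0.0, -2.0], [0.0, 0.0, 0.0], [-1.0, 3.0, -2.0]])
y = np.array([0, 1, 1])
print(generalized_logistic_loss(F, y))
```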

34 (2009) extended the Adaboost to a multicategory learning method with the exponential loss V = exp(−(1/k) z^T f). [sent-200, score-0.367]

35 In the literature of hard classifiers, there are several ways to extend the binary hinge loss of the SVM to the simultaneous multicategory case. [sent-201, score-0.525]

36 Here we list several commonly used versions with the sum-to-zero constraint: Loss 1 (naive hinge loss) [1 − f_y(x)]_+ ; Loss 2 (Vapnik, 1998) ∑_{j≠y} [1 − (f_y(x) − f_j(x))]_+ ; Loss 3 (Crammer et al. [sent-202, score-0.249]
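The first two listed losses can be coded as follows; writing Loss 2 as ∑_{j≠y}[1 − (f_y(x) − f_j(x))]_+ (the standard Weston-Watkins form) is an assumption about the garbled original.

```python
import numpy as np

def naive_hinge_loss(F, y):
    """Loss 1 (naive hinge): [1 - f_y(x)]_+ under the sum-to-zero constraint."""
    fy = F[np.arange(len(y)), y]
    return np.maximum(0.0, 1.0 - fy)

def weston_watkins_loss(F, y):
    """Loss 2 (Vapnik, 1998; Weston and Watkins, 1999), assumed form:
    sum_{j != y} [1 - (f_y(x) - f_j(x))]_+ ."""
    fy = F[np.arange(len(y)), y][:, None]
    slack = np.maximum(0.0, 1.0 - (fy - F))
    slack[np.arange(len(y)), y] = 0.0      # the j = y term is excluded from the sum
    return slack.sum(axis=1)
```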

37 Next we examine the sum-to-zero constraint ∑_{j=1}^{k} f_j(x) = 0 for different multicategory losses. [sent-216, score-0.312]
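In practice the sum-to-zero constraint can be imposed by centering the k functional values, as in this one-line helper (an illustrative implementation detail, not from the paper).

```python
import numpy as np

def enforce_sum_to_zero(F):
    """Center each row of the n x k matrix F so that sum_j f_j(x) = 0 for every sample."""
    return F - F.mean(axis=1, keepdims=True)
```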

38 This constraint was also used in many other simultaneous multicategory classification papers, for example, Tang and Zhang (2006), Wang and Shen (2007), and Zhu et al. [sent-242, score-0.312]

39 3 MLUM Family Soft and hard classifiers have been both studied in the literature of simultaneous multicategory classification. [sent-245, score-0.4]

40 The LUM family is a broad family which embraces both soft and hard classifiers in binary cases, yet no such convenient platform is available in the multicategory framework. [sent-247, score-0.693]

41 In this paper, we propose a new family of MLUMs to study multicategory problems. [sent-248, score-0.349]

42 The main motivation to use the MLUM loss function (6) is based on the argmax rule for multicategory classification. [sent-251, score-0.395]
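The argmax rule mentioned above amounts to a single line given the matrix of functional values (an illustrative helper).

```python
import numpy as np

def argmax_rule(F):
    """Arg-max prediction rule: assign x_i to the class with the largest functional value f_j(x_i)."""
    return np.asarray(F).argmax(axis=1)
```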

43 (2006) and Zhang (2004a), from the binary case to the multicategory one. [sent-277, score-0.343]

44 In the multicategory classification literature, to tackle the Fisher consistency problem, we need the following definitions. [sent-294, score-0.329]

45 In multicategory problems, the conditional probability becomes a vector, and this makes the transition behavior of probability estimation from soft to hard classification more complex. [sent-384, score-0.678]

46 In particular, we need to generalize the flat region in Figure 3 to the multicategory setting. [sent-385, score-0.312]

47 Zhang (2004b) further studied the relationship between the two excess risks in multicategory problems. [sent-418, score-0.379]

48 In particular, the convergence rates of the two excess risks are well studied in Wang and Shen (2007), in the setting of L1 penalized multicategory SVM with linear learning. [sent-419, score-0.396]

49 In this section, we employ the generalization of the excess V -risk from binary to multicategory cases as in Zhang (2004b), and explore the explicit form for MLUM. [sent-420, score-0.424]

50 The plots show the transition of the MLUM from soft classification (top left panel) to hard classification (bottom right panel). [sent-426, score-0.276]

51 A similar result for multicategory problems is obtained in Zhang (2004b). [sent-434, score-0.312]

52 Here we extend the results in Zhang (2004a) to the multicategory version. [sent-436, score-0.312]

53 3 includes two examples where the MLUM with c = 1, corresponding to a new multicategory DWD, can outperform both hard and soft classifiers, which was not observed in the binary LUM case in Liu et al. [sent-583, score-0.633]

54 Note that this multicategory DWD is different from the version proposed in Huang et al. [sent-585, score-0.312]

55 4 provides some summary on the effect of hard versus soft classifiers and gives some insight on the choice of classification methods. [sent-588, score-0.276]

56 Interestingly, we find that the behavior of MLUM with c = 1 can outperform the hard (c → ∞) and soft (c = 0) classifiers in certain cases. [sent-593, score-0.276]

57 Here for the MPLR, we replace the loss function ℓ by the logistic loss V(f, y) = log(1 + e^{−yf}), while keeping the convex combination so that the classifier is tuned with different γ values. [sent-619, score-0.351]
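A hedged sketch of such an MPLR-style loss follows. The convex combination written here, γ ℓ(f_y) + (1 − γ) ∑_{j≠y} ℓ(−f_j), is an assumed reinforced form patterned after Liu and Yuan (2011); the paper's Equation (6) defines the actual MLUM loss, and the function names are hypothetical.

```python
import numpy as np

def logistic_ell(u):
    """Binary logistic loss ell(u) = log(1 + exp(-u)) of a functional margin u."""
    return np.logaddexp(0.0, -u)        # numerically stable log(1 + e^{-u})

def mplr_loss(F, y, gamma=0.5):
    """Sketch of an MPLR-style loss: the LUM loss ell is swapped for the logistic loss while a
    convex combination in gamma is kept.  The combination used here,
    gamma * ell(f_y) + (1 - gamma) * sum_{j != y} ell(-f_j), is an assumption, not Equation (6)."""
    n, k = F.shape
    fy = F[np.arange(n), y]
    mask = np.ones_like(F, dtype=bool)
    mask[np.arange(n), y] = False
    others = logistic_ell(-F[mask].reshape(n, k - 1)).sum(axis=1)
    return gamma * logistic_ell(fy) + (1.0 - gamma) * others
```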

58 Note that in most examples, the tuned MLUM performs better than the other methods, and is recommended. [sent-621, score-0.237]

59 In this case, estimation of the conditional probability function is challenging, and we expect hard classifiers to perform better than the soft ones, because the former bypass probability estimation. [sent-667, score-0.366]

60 [figure legend residue: curves "tuned γ, fixed c", "tuned MLUM", "Bayes"; "tuned c, fixed γ", "tuned MLUM", "Bayes"] [sent-690, score-1.068]

61 This is a multicategory generalization of Example 2 in Liu et al. [sent-707, score-0.312]

62 Thus class probabilities are difficult to estimate in this case, and the classification accuracy of soft classifiers may be sacrificed by probability estimation. [sent-710, score-0.23]

63 Interestingly, the soft classifier gives worse probability estimation than hard classifiers in this example. [sent-786, score-0.323]

64 In such cases, accurate probability estimation may help the soft classifier to build more accurate classification boundaries. [sent-806, score-0.235]

65 [figure legend residue: y-axis "Error"; curves "tuned γ, fixed c", "tuned MLUM", "Bayes"; x-axis c ∈ {0, 1, 10, 100, 1000}] [sent-846, score-0.534]

66 The left panel shows that this is an example in which the soft classifier works the best. [sent-875, score-0.271]

67 [figure legend residue: y-axis "Error"; curves "tuned γ, fixed c", "tuned MLUM", "Bayes"] [sent-935, score-0.534]

68 [figure legend residue: curves "tuned c, fixed γ", "tuned MLUM", "Bayes"; x-axis c ∈ {0, 1, 10, 100, 1000}] [sent-937, score-0.534]

69 The left panel shows the soft classifier works reasonably well in this example. [sent-949, score-0.271]

70 (2011) showed that among many examples, the classifier that works the best in the LUM family appears to be either a soft or a hard classifier. [sent-989, score-0.329]

71 In the MLUM family, however, we observe that the MLUM with c = 1 (a new multicategory DWD) can sometimes yield the best performance in terms of classification accuracy. [sent-990, score-0.312]

72 In these two examples, we add a small percentage of noisy data into the originally clean data sets, and observe that soft classifiers can be very sensitive to outliers in terms of classification accuracy, while MLUMs with c ≥ 1 appear to be quite robust. [sent-992, score-0.236]

73 [figure legend residue: curves "tuned γ, fixed c", "tuned MLUM"; x-axis c ∈ {0, 1, 10, 100, 1000}] [sent-1053, score-0.534]

74 4 Summary of Simulation Results Our simulated examples provide some insights on hard versus soft classification methods, and suggest that no single classifier works universally the best in all cases. [sent-1103, score-0.321]

75 Our simulation studies showed different cases in which hard (c = ∞), soft (c = 0) and in-between (c = 1) classifiers work the best, respectively. [sent-1105, score-0.276]

76 When the underlying conditional probability function is a step function, probability estimation can be a difficult problem for soft classifiers, and the prediction accuracy may be compromised. [sent-1106, score-0.278]

77 Interestingly, the soft classifier is quite vulnerable to potential outliers in the data. [sent-1142, score-0.236]

78 In Example 5, we add 5% outliers to the clean samples, and the soft classifier performs the worst among all methods. [sent-1143, score-0.251]

79 However in Example 6 when we add only 1% outliers to the data, the classification accuracy of the soft classifier is reduced by over 40% while those of the others are just around 15 − 20%. [sent-1145, score-0.236]

80 Therefore the tuned MLUM procedure is sufficient to avoid tuning over the entire interval γ ∈ [0, 1]. [sent-1151, score-0.244]

81 Since we observed in the simulation studies that potential outliers may decrease the accuracy of soft classifiers, we perform Principal Component Analysis (PCA) on each data set and examine whether there are any obvious potential outliers in the PCA-projected plots. [sent-1214, score-0.284]
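A simple way to carry out this screening step is sketched below; the robust z-score rule, the threshold of 3, and the helper name are illustrative choices, not the paper's procedure.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_outlier_screen(X, n_components=2, z_thresh=3.0):
    """Project the covariates onto the leading principal components and flag points whose
    robust z-score (median/MAD based) along any component exceeds z_thresh."""
    scores = PCA(n_components=n_components).fit_transform(X)
    med = np.median(scores, axis=0)
    mad = np.median(np.abs(scores - med), axis=0) + 1e-12
    robust_z = np.abs(scores - med) / (1.4826 * mad)
    return np.where((robust_z > z_thresh).any(axis=1))[0]   # indices of potential outliers
```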

82 The Iris data set doesn’t appear to have any obvious outliers and the soft classifier works very well there. [sent-1217, score-0.252]

83 The tuned MLUM performs slightly worse than the RMSVM, but the difference is not large. [sent-1221, score-0.237]

84 [figure legend residue: curves "tuned γ, fixed c", "tuned MLUM"; x-axis c ∈ {0, 1, 10, 100, 1000}] [sent-1255, score-0.534]

85 The right panel suggests that the classification error is smallest when c is tuned with γ = 1. [sent-1268, score-0.317]

86 Interestingly, hard and soft classifiers perform similarly in terms of classification accuracy. [sent-1270, score-0.276]

87 [figure legend residue: curves "tuned c, fixed γ", "tuned MLUM"; y-axis "Error"] [sent-1292, score-0.534]

88 [figure legend residue: curves "tuned γ, fixed c", "tuned MLUM"; x-axis c ∈ {0, 1, 10, 100, 1000}] [sent-1298, score-0.534]

89 Overall, the tuned MLUM performs the best, as in the simulated examples. [sent-1324, score-0.252]

90 In this paper, we generalize the binary LUM family to the simultaneous multicategory classifiers, namely, the MLUM family. [sent-1327, score-0.38]

91 The MLUM is very general and it includes many popular multicategory classification methods as special cases. [sent-1328, score-0.312]

92 In particular, the MLUM family includes both soft and hard classifiers, and provides a platform to explore the transition behavior from soft to hard classification problems. [sent-1329, score-0.603]

93 [figure legend residue: curves "tuned γ, fixed c", "tuned MLUM"; x-axis c ∈ {0, 1, 10, 100, 1000}] [sent-1337, score-0.534]

94 The numerical examples show that hard and soft classifiers behave quite differently in various settings, and they help to shed some light on the choice between the two. [sent-1412, score-0.308]

95 We numerically demonstrate that the tuned MLUM outperforms several other multicategory techniques and should be a competitive addition to the existing classification toolbox. [sent-1415, score-0.534]

96 The left panel shows that the test errors of the soft and hard classifiers are roughly comparable, while c = 1 (DWD) is the worst. [sent-1436, score-0.396]

97 After some calculation we have S(f, x) − S(f∗, x) = (1 − γ)[δ1 − δ2] + P1[γ(−δ4) + (1 − γ)(−δ1)] + P2[γ(δ3) + (1 − γ)(δ2)] (13). [sent-1509, score-0.534]

98 [figure legend residue: curves "tuned γ, fixed c", "tuned MLUM"; x-axis c ∈ {0, 1, 10, 100, 1000}] [sent-1549, score-0.534]

99 The left panel suggests that the soft classification method performs better than the others in terms of classification accuracy. [sent-1561, score-0.27]

100 New multicategory boosting algorithms based on multicategory fisher-consistent losses. [sent-1912, score-0.624]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('mlum', 0.749), ('multicategory', 0.312), ('tuned', 0.222), ('soft', 0.188), ('lum', 0.155), ('classi', 0.129), ('achines', 0.111), ('nified', 0.111), ('ulticategory', 0.111), ('fy', 0.105), ('pj', 0.101), ('ers', 0.091), ('fixed', 0.09), ('hard', 0.088), ('maes', 0.087), ('rmsvm', 0.08), ('arge', 0.07), ('panel', 0.067), ('excess', 0.067), ('vehicle', 0.066), ('iu', 0.064), ('liu', 0.063), ('msvm', 0.062), ('vertebral', 0.062), ('fisher', 0.058), ('hang', 0.057), ('fih', 0.056), ('mplr', 0.056), ('loss', 0.055), ('iris', 0.054), ('margin', 0.054), ('er', 0.051), ('outliers', 0.048), ('dermatology', 0.048), ('dwd', 0.043), ('fk', 0.042), ('hinge', 0.039), ('fi', 0.039), ('gbm', 0.037), ('family', 0.037), ('pk', 0.035), ('breast', 0.034), ('mae', 0.033), ('wine', 0.033), ('cation', 0.032), ('binary', 0.031), ('error', 0.028), ('argmax', 0.028), ('shen', 0.027), ('ex', 0.027), ('reinforced', 0.026), ('test', 0.025), ('chapel', 0.025), ('mlums', 0.025), ('proneural', 0.025), ('probability', 0.025), ('bayes', 0.024), ('pc', 0.023), ('risk', 0.023), ('estimation', 0.022), ('losses', 0.022), ('tuning', 0.022), ('yuan', 0.022), ('zhang', 0.021), ('cancer', 0.021), ('fh', 0.021), ('bartlett', 0.02), ('logistic', 0.019), ('zhu', 0.019), ('arginf', 0.019), ('glioblastoma', 0.019), ('conditional', 0.018), ('shed', 0.018), ('consistency', 0.017), ('class', 0.017), ('svm', 0.017), ('classes', 0.017), ('rates', 0.017), ('bregman', 0.016), ('tang', 0.016), ('works', 0.016), ('yes', 0.016), ('hill', 0.016), ('covariates', 0.015), ('simulated', 0.015), ('uni', 0.015), ('performs', 0.015), ('minimizer', 0.015), ('pca', 0.015), ('multiclass', 0.014), ('rosset', 0.014), ('nq', 0.014), ('boser', 0.014), ('kz', 0.014), ('examples', 0.014), ('lee', 0.014), ('explore', 0.014), ('dd', 0.014), ('wahba', 0.014), ('reproducing', 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 73 jmlr-2013-Multicategory Large-Margin Unified Machines

Author: Chong Zhang, Yufeng Liu

Abstract: Hard and soft classifiers are two important groups of techniques for classification problems. Logistic regression and Support Vector Machines are typical examples of soft and hard classifiers respectively. The essential difference between these two groups is whether one needs to estimate the class conditional probability for the classification task or not. In particular, soft classifiers predict the label based on the obtained class conditional probabilities, while hard classifiers bypass the estimation of probabilities and focus on the decision boundary. In practice, for the goal of accurate classification, it is unclear which one to use in a given situation. To tackle this problem, the Largemargin Unified Machine (LUM) was recently proposed as a unified family to embrace both groups. The LUM family enables one to study the behavior change from soft to hard binary classifiers. For multicategory cases, however, the concept of soft and hard classification becomes less clear. In that case, class probability estimation becomes more involved as it requires estimation of a probability vector. In this paper, we propose a new Multicategory LUM (MLUM) framework to investigate the behavior of soft versus hard classification under multicategory settings. Our theoretical and numerical results help to shed some light on the nature of multicategory classification and its transition behavior from soft to hard classifiers. The numerical results suggest that the proposed tuned MLUM yields very competitive performance. Keywords: hard classification, large-margin, soft classification, support vector machine

2 0.05305396 8 jmlr-2013-A Theory of Multiclass Boosting

Author: Indraneel Mukherjee, Robert E. Schapire

Abstract: Boosting combines weak classifiers to form highly accurate predictors. Although the case of binary classification is well understood, in the multiclass setting, the “correct” requirements on the weak classifier, or the notion of the most efficient boosting algorithms are missing. In this paper, we create a broad and general framework, within which we make precise and identify the optimal requirements on the weak-classifier, as well as design the most effective, in a certain sense, boosting algorithms that assume such requirements. Keywords: multiclass, boosting, weak learning condition, drifting games

3 0.048588317 22 jmlr-2013-Classifying With Confidence From Incomplete Information

Author: Nathan Parrish, Hyrum S. Anderson, Maya R. Gupta, Dun Yu Hsiao

Abstract: We consider the problem of classifying a test sample given incomplete information. This problem arises naturally when data about a test sample is collected over time, or when costs must be incurred to compute the classification features. For example, in a distributed sensor network only a fraction of the sensors may have reported measurements at a certain time, and additional time, power, and bandwidth is needed to collect the complete data to classify. A practical goal is to assign a class label as soon as enough data is available to make a good decision. We formalize this goal through the notion of reliability—the probability that a label assigned given incomplete data would be the same as the label assigned given the complete data, and we propose a method to classify incomplete data only if some reliability threshold is met. Our approach models the complete data as a random variable whose distribution is dependent on the current incomplete data and the (complete) training data. The method differs from standard imputation strategies in that our focus is on determining the reliability of the classification decision, rather than just the class label. We show that the method provides useful reliability estimates of the correctness of the imputed class labels on a set of experiments on time-series data sets, where the goal is to classify the time-series as early as possible while still guaranteeing that the reliability threshold is met. Keywords: classification, sensor networks, signals, reliability

4 0.047334012 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification

Author: Xin Tong

Abstract: The Neyman-Pearson (NP) paradigm in binary classification treats type I and type II errors with different priorities. It seeks classifiers that minimize type II error, subject to a type I error constraint under a user specified level α. In this paper, plug-in classifiers are developed under the NP paradigm. Based on the fundamental Neyman-Pearson Lemma, we propose two related plug-in classifiers which amount to thresholding respectively the class conditional density ratio and the regression function. These two classifiers handle different sampling schemes. This work focuses on theoretical properties of the proposed classifiers; in particular, we derive oracle inequalities that can be viewed as finite sample versions of risk bounds. NP classification can be used to address anomaly detection problems, where asymmetry in errors is an intrinsic property. As opposed to a common practice in anomaly detection that consists of thresholding normal class density, our approach does not assume a specific form for anomaly distributions. Such consideration is particularly necessary when the anomaly class density is far from uniformly distributed. Keywords: plug-in approach, Neyman-Pearson paradigm, nonparametric statistics, oracle inequality, anomaly detection

5 0.043172956 114 jmlr-2013-The Rate of Convergence of AdaBoost

Author: Indraneel Mukherjee, Cynthia Rudin, Robert E. Schapire

Abstract: The AdaBoost algorithm was designed to combine many “weak” hypotheses that perform slightly better than random guessing into a “strong” hypothesis that has very low error. We study the rate at which AdaBoost iteratively converges to the minimum of the “exponential loss.” Unlike previous work, our proofs do not require a weak-learning assumption, nor do they require that minimizers of the exponential loss are finite. Our first result shows that the exponential loss of AdaBoost’s computed parameter vector will be at most ε more than that of any parameter vector of ℓ1 -norm bounded by B in a number of rounds that is at most a polynomial in B and 1/ε. We also provide lower bounds showing that a polynomial dependence is necessary. Our second result is that within C/ε iterations, AdaBoost achieves a value of the exponential loss that is at most ε more than the best possible value, where C depends on the data set. We show that this dependence of the rate on ε is optimal up to constant factors, that is, at least Ω(1/ε) rounds are necessary to achieve within ε of the optimal exponential loss. Keywords: AdaBoost, optimization, coordinate descent, convergence rate

6 0.042932693 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems

7 0.040256269 70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach

8 0.036101419 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning

9 0.035933349 20 jmlr-2013-CODA: High Dimensional Copula Discriminant Analysis

10 0.026715608 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso

11 0.026209088 68 jmlr-2013-Machine Learning with Operational Costs

12 0.024756551 19 jmlr-2013-BudgetedSVM: A Toolbox for Scalable SVM Approximations

13 0.023983888 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut

14 0.023275435 7 jmlr-2013-A Risk Comparison of Ordinary Least Squares vs Ridge Regression

15 0.023234136 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning

16 0.023192635 103 jmlr-2013-Sparse Robust Estimation and Kalman Smoothing with Nonsmooth Log-Concave Densities: Modeling, Computation, and Theory

17 0.023029281 23 jmlr-2013-Cluster Analysis: Unsupervised Learning via Supervised Learning with a Non-convex Penalty

18 0.023015741 2 jmlr-2013-A Binary-Classification-Based Metric between Time-Series Distributions and Its Use in Statistical and Learning Problems

19 0.022812454 76 jmlr-2013-Nonparametric Sparsity and Regularization

20 0.022164073 87 jmlr-2013-Performance Bounds for λ Policy Iteration and Application to the Game of Tetris


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.121), (1, 0.049), (2, 0.007), (3, 0.051), (4, 0.013), (5, -0.047), (6, 0.036), (7, -0.014), (8, -0.006), (9, -0.106), (10, 0.081), (11, -0.018), (12, 0.036), (13, -0.104), (14, -0.073), (15, -0.056), (16, 0.049), (17, 0.028), (18, -0.144), (19, 0.054), (20, -0.107), (21, 0.03), (22, 0.071), (23, -0.026), (24, 0.036), (25, 0.033), (26, -0.048), (27, 0.134), (28, -0.132), (29, -0.05), (30, 0.112), (31, -0.167), (32, -0.288), (33, 0.028), (34, -0.031), (35, 0.018), (36, -0.027), (37, 0.011), (38, -0.012), (39, 0.121), (40, -0.093), (41, 0.076), (42, -0.093), (43, -0.059), (44, 0.126), (45, 0.061), (46, 0.061), (47, 0.075), (48, 0.024), (49, 0.191)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93564993 73 jmlr-2013-Multicategory Large-Margin Unified Machines

Author: Chong Zhang, Yufeng Liu

Abstract: Hard and soft classifiers are two important groups of techniques for classification problems. Logistic regression and Support Vector Machines are typical examples of soft and hard classifiers respectively. The essential difference between these two groups is whether one needs to estimate the class conditional probability for the classification task or not. In particular, soft classifiers predict the label based on the obtained class conditional probabilities, while hard classifiers bypass the estimation of probabilities and focus on the decision boundary. In practice, for the goal of accurate classification, it is unclear which one to use in a given situation. To tackle this problem, the Largemargin Unified Machine (LUM) was recently proposed as a unified family to embrace both groups. The LUM family enables one to study the behavior change from soft to hard binary classifiers. For multicategory cases, however, the concept of soft and hard classification becomes less clear. In that case, class probability estimation becomes more involved as it requires estimation of a probability vector. In this paper, we propose a new Multicategory LUM (MLUM) framework to investigate the behavior of soft versus hard classification under multicategory settings. Our theoretical and numerical results help to shed some light on the nature of multicategory classification and its transition behavior from soft to hard classifiers. The numerical results suggest that the proposed tuned MLUM yields very competitive performance. Keywords: hard classification, large-margin, soft classification, support vector machine

2 0.77522713 22 jmlr-2013-Classifying With Confidence From Incomplete Information

Author: Nathan Parrish, Hyrum S. Anderson, Maya R. Gupta, Dun Yu Hsiao

Abstract: We consider the problem of classifying a test sample given incomplete information. This problem arises naturally when data about a test sample is collected over time, or when costs must be incurred to compute the classification features. For example, in a distributed sensor network only a fraction of the sensors may have reported measurements at a certain time, and additional time, power, and bandwidth is needed to collect the complete data to classify. A practical goal is to assign a class label as soon as enough data is available to make a good decision. We formalize this goal through the notion of reliability—the probability that a label assigned given incomplete data would be the same as the label assigned given the complete data, and we propose a method to classify incomplete data only if some reliability threshold is met. Our approach models the complete data as a random variable whose distribution is dependent on the current incomplete data and the (complete) training data. The method differs from standard imputation strategies in that our focus is on determining the reliability of the classification decision, rather than just the class label. We show that the method provides useful reliability estimates of the correctness of the imputed class labels on a set of experiments on time-series data sets, where the goal is to classify the time-series as early as possible while still guaranteeing that the reliability threshold is met. Keywords: classification, sensor networks, signals, reliability

3 0.44029611 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification

Author: Xin Tong

Abstract: The Neyman-Pearson (NP) paradigm in binary classification treats type I and type II errors with different priorities. It seeks classifiers that minimize type II error, subject to a type I error constraint under a user specified level α. In this paper, plug-in classifiers are developed under the NP paradigm. Based on the fundamental Neyman-Pearson Lemma, we propose two related plug-in classifiers which amount to thresholding respectively the class conditional density ratio and the regression function. These two classifiers handle different sampling schemes. This work focuses on theoretical properties of the proposed classifiers; in particular, we derive oracle inequalities that can be viewed as finite sample versions of risk bounds. NP classification can be used to address anomaly detection problems, where asymmetry in errors is an intrinsic property. As opposed to a common practice in anomaly detection that consists of thresholding normal class density, our approach does not assume a specific form for anomaly distributions. Such consideration is particularly necessary when the anomaly class density is far from uniformly distributed. Keywords: plug-in approach, Neyman-Pearson paradigm, nonparametric statistics, oracle inequality, anomaly detection

4 0.43544838 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems

Author: Takafumi Kanamori, Akiko Takeda, Taiji Suzuki

Abstract: There are two main approaches to binary classification problems: the loss function approach and the uncertainty set approach. The loss function approach is widely used in real-world data analysis. Statistical decision theory has been used to elucidate its properties such as statistical consistency. Conditional probabilities can also be estimated by using the minimum solution of the loss function. In the uncertainty set approach, an uncertainty set is defined for each binary label from training samples. The best separating hyperplane between the two uncertainty sets is used as the decision function. Although the uncertainty set approach provides an intuitive understanding of learning algorithms, its statistical properties have not been sufficiently studied. In this paper, we show that the uncertainty set is deeply connected with the convex conjugate of a loss function. On the basis of the conjugate relation, we propose a way of revising the uncertainty set approach so that it will have good statistical properties such as statistical consistency. We also introduce statistical models corresponding to uncertainty sets in order to estimate conditional probabilities. Finally, we present numerical experiments, verifying that the learning with revised uncertainty sets improves the prediction accuracy. Keywords: loss function, uncertainty set, convex conjugate, consistency

5 0.33729947 11 jmlr-2013-Algorithms for Discovery of Multiple Markov Boundaries

Author: Alexander Statnikov, Nikita I. Lytkin, Jan Lemeire, Constantin F. Aliferis

Abstract: Algorithms for Markov boundary discovery from data constitute an important recent development in machine learning, primarily because they offer a principled solution to the variable/feature selection problem and give insight on local causal structure. Over the last decade many sound algorithms have been proposed to identify a single Markov boundary of the response variable. Even though faithful distributions and, more broadly, distributions that satisfy the intersection property always have a single Markov boundary, other distributions/data sets may have multiple Markov boundaries of the response variable. The latter distributions/data sets are common in practical data-analytic applications, and there are several reasons why it is important to induce multiple Markov boundaries from such data. However, there are currently no sound and efficient algorithms that can accomplish this task. This paper describes a family of algorithms TIE* that can discover all Markov boundaries in a distribution. The broad applicability as well as efficiency of the new algorithmic family is demonstrated in an extensive benchmarking study that involved comparison with 26 state-of-the-art algorithms/variants in 15 data sets from a diversity of application domains. Keywords: Markov boundary discovery, variable/feature selection, information equivalence, violations of faithfulness

6 0.32624745 20 jmlr-2013-CODA: High Dimensional Copula Discriminant Analysis

7 0.31521362 8 jmlr-2013-A Theory of Multiclass Boosting

8 0.31267625 18 jmlr-2013-Beyond Fano's Inequality: Bounds on the Optimal F-Score, BER, and Cost-Sensitive Risk and Their Implications

9 0.25656477 38 jmlr-2013-Dynamic Affine-Invariant Shape-Appearance Handshape Features and Classification in Sign Language Videos

10 0.24236466 68 jmlr-2013-Machine Learning with Operational Costs

11 0.23448452 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning

12 0.23404264 2 jmlr-2013-A Binary-Classification-Based Metric between Time-Series Distributions and Its Use in Statistical and Learning Problems

13 0.23170806 70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach

14 0.22275296 12 jmlr-2013-Alleviating Naive Bayes Attribute Independence Assumption by Attribute Weighting

15 0.21457112 118 jmlr-2013-Using Symmetry and Evolutionary Search to Minimize Sorting Networks

16 0.21350376 19 jmlr-2013-BudgetedSVM: A Toolbox for Scalable SVM Approximations

17 0.20393164 114 jmlr-2013-The Rate of Convergence of AdaBoost

18 0.19976509 103 jmlr-2013-Sparse Robust Estimation and Kalman Smoothing with Nonsmooth Log-Concave Densities: Modeling, Computation, and Theory

19 0.19618714 100 jmlr-2013-Similarity-based Clustering by Left-Stochastic Matrix Factorization

20 0.19035929 23 jmlr-2013-Cluster Analysis: Unsupervised Learning via Supervised Learning with a Non-convex Penalty


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.014), (5, 0.18), (6, 0.026), (10, 0.07), (20, 0.016), (23, 0.026), (34, 0.342), (41, 0.018), (68, 0.02), (70, 0.022), (75, 0.067), (85, 0.044), (87, 0.022), (93, 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.75940418 53 jmlr-2013-Improving CUR Matrix Decomposition and the Nystrom Approximation via Adaptive Sampling

Author: Shusen Wang, Zhihua Zhang

Abstract: The CUR matrix decomposition and the Nyström approximation are two important low-rank matrix approximation techniques. The Nyström method approximates a symmetric positive semidefinite matrix in terms of a small number of its columns, while CUR approximates an arbitrary data matrix by a small number of its columns and rows. Thus, CUR decomposition can be regarded as an extension of the Nyström approximation. In this paper we establish a more general error bound for the adaptive column/row sampling algorithm, based on which we propose more accurate CUR and Nyström algorithms with expected relative-error bounds. The proposed CUR and Nyström algorithms also have low time complexity and can avoid maintaining the whole data matrix in RAM. In addition, we give theoretical analysis for the lower error bounds of the standard Nyström method and the ensemble Nyström method. The main theoretical results established in this paper are novel, and our analysis makes no special assumption on the data matrices. Keywords: large-scale matrix computation, CUR matrix decomposition, the Nyström method, randomized algorithms, adaptive sampling

same-paper 2 0.72227889 73 jmlr-2013-Multicategory Large-Margin Unified Machines

Author: Chong Zhang, Yufeng Liu

Abstract: Hard and soft classifiers are two important groups of techniques for classification problems. Logistic regression and Support Vector Machines are typical examples of soft and hard classifiers respectively. The essential difference between these two groups is whether one needs to estimate the class conditional probability for the classification task or not. In particular, soft classifiers predict the label based on the obtained class conditional probabilities, while hard classifiers bypass the estimation of probabilities and focus on the decision boundary. In practice, for the goal of accurate classification, it is unclear which one to use in a given situation. To tackle this problem, the Largemargin Unified Machine (LUM) was recently proposed as a unified family to embrace both groups. The LUM family enables one to study the behavior change from soft to hard binary classifiers. For multicategory cases, however, the concept of soft and hard classification becomes less clear. In that case, class probability estimation becomes more involved as it requires estimation of a probability vector. In this paper, we propose a new Multicategory LUM (MLUM) framework to investigate the behavior of soft versus hard classification under multicategory settings. Our theoretical and numerical results help to shed some light on the nature of multicategory classification and its transition behavior from soft to hard classifiers. The numerical results suggest that the proposed tuned MLUM yields very competitive performance. Keywords: hard classification, large-margin, soft classification, support vector machine

3 0.51709867 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization

Author: Yuchen Zhang, John C. Duchi, Martin J. Wainwright

Abstract: We analyze two communication-efficient algorithms for distributed optimization in statistical settings involving large-scale data sets. The first algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves mean-squared error (MSE) that decays as O(N^{-1} + (N/m)^{-2}). Whenever m ≤ √N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as O(N^{-1} + (N/m)^{-3}), and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as O(N^{-1} + (N/m)^{-3/2}), easing computation at the expense of a potentially slower MSE rate. We also provide an experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with N ≈ 2.4 × 10^8 samples and d ≈ 740,000 covariates. Keywords: distributed learning, stochastic optimization, averaging, subsampling

4 0.5138815 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

Author: Tony Cai, Wen-Xin Zhou

Abstract: We consider in this paper the problem of noisy 1-bit matrix completion under a general non-uniform sampling distribution using the max-norm as a convex relaxation for the rank. A max-norm constrained maximum likelihood estimate is introduced and studied. The rate of convergence for the estimate is obtained. Information-theoretical methods are used to establish a minimax lower bound under the general sampling model. The minimax upper and lower bounds together yield the optimal rate of convergence for the Frobenius norm loss. Computational algorithms and numerical performance are also discussed. Keywords: 1-bit matrix completion, low-rank matrix, max-norm, trace-norm, constrained optimization, maximum likelihood estimate, optimal rate of convergence

5 0.5123474 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems

Author: Takafumi Kanamori, Akiko Takeda, Taiji Suzuki

Abstract: There are two main approaches to binary classification problems: the loss function approach and the uncertainty set approach. The loss function approach is widely used in real-world data analysis. Statistical decision theory has been used to elucidate its properties such as statistical consistency. Conditional probabilities can also be estimated by using the minimum solution of the loss function. In the uncertainty set approach, an uncertainty set is defined for each binary label from training samples. The best separating hyperplane between the two uncertainty sets is used as the decision function. Although the uncertainty set approach provides an intuitive understanding of learning algorithms, its statistical properties have not been sufficiently studied. In this paper, we show that the uncertainty set is deeply connected with the convex conjugate of a loss function. On the basis of the conjugate relation, we propose a way of revising the uncertainty set approach so that it will have good statistical properties such as statistical consistency. We also introduce statistical models corresponding to uncertainty sets in order to estimate conditional probabilities. Finally, we present numerical experiments, verifying that the learning with revised uncertainty sets improves the prediction accuracy. Keywords: loss function, uncertainty set, convex conjugate, consistency

6 0.5091784 15 jmlr-2013-Bayesian Canonical Correlation Analysis

7 0.5082987 114 jmlr-2013-The Rate of Convergence of AdaBoost

8 0.5049693 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach

9 0.50412953 52 jmlr-2013-How to Solve Classification and Regression Problems on High-Dimensional Data with a Supervised Extension of Slow Feature Analysis

10 0.50354147 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning

11 0.50306082 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning

12 0.50300759 20 jmlr-2013-CODA: High Dimensional Copula Discriminant Analysis

13 0.5024417 2 jmlr-2013-A Binary-Classification-Based Metric between Time-Series Distributions and Its Use in Statistical and Learning Problems

14 0.5005877 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees

15 0.50036728 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion

16 0.50011033 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation

17 0.50008953 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression

18 0.49989226 75 jmlr-2013-Nested Expectation Propagation for Gaussian Process Classification with a Multinomial Probit Likelihood

19 0.49954286 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning

20 0.49861756 68 jmlr-2013-Machine Learning with Operational Costs