jmlr jmlr2009 jmlr2009-48 knowledge-graph by maker-knowledge-mining

48 jmlr-2009-Learning Nondeterministic Classifiers


Source: pdf

Author: Juan José del Coz, Jorge Díez, Antonio Bahamonde

Abstract: Nondeterministic classifiers are defined as those allowed to predict more than one class for some entries from an input space. Given that the true class should be included in predictions and the number of classes predicted should be as small as possible, this kind of classifier can be considered as an Information Retrieval (IR) procedure. In this paper, we propose a family of IR loss functions to measure the performance of nondeterministic learners. After discussing such measures, we derive an algorithm for learning optimal nondeterministic hypotheses. Given an entry from the input space, the algorithm requires the posterior probabilities to compute the subset of classes with the lowest expected loss. From a general point of view, nondeterministic classifiers provide an improvement in the proportion of predictions that include the true class compared to their deterministic counterparts; the price to be paid for this increase is usually a tiny proportion of predictions with more than one class. The paper includes an extensive experimental study using three deterministic learners to estimate posterior probabilities: a multiclass Support Vector Machine (SVM), a Logistic Regression, and a Naïve Bayes. The data sets considered comprise both UCI multi-class learning tasks and microarray expressions of different kinds of cancer. We successfully compare nondeterministic classifiers with other alternative approaches. Additionally, we shall see how the quality of posterior probabilities (measured by the Brier score) determines the goodness of nondeterministic predictions. Keywords: nondeterministic, multiclassification, reject option, multi-label classification, posterior probabilities

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 In this paper, we propose a family of IR loss functions to measure the performance of nondeterministic learners. [sent-9, score-0.779]

2 After discussing such measures, we derive an algorithm for learning optimal nondeterministic hypotheses. [sent-10, score-0.779]

3 Given an entry from the input space, the algorithm requires the posterior probabilities to compute the subset of classes with the lowest expected loss. [sent-11, score-0.298]

4 From a general point of view, nondeterministic classifiers provide an improvement in the proportion of predictions that include the true class compared to their deterministic counterparts; the price to be paid for this increase is usually a tiny proportion of predictions with more than one class. [sent-12, score-1.024]

5 The paper includes an extensive experimental study using three deterministic learners to estimate posterior probabilities: a multiclass Support Vector Machine (SVM), a Logistic Regression, and a Naïve Bayes. [sent-13, score-0.312]

6 We successfully compare nondeterministic classifiers with other alternative approaches. [sent-15, score-0.779]

7 Additionally, we shall see how the quality of posterior probabilities (measured by the Brier score) determines the goodness of nondeterministic predictions. [sent-16, score-0.991]

8 In this paper we explore how to learn classifiers with multiple outcomes, like nondeterministic automata; we shall call them nondeterministic classifiers. [sent-23, score-1.587]

9 To fix ideas, let us consider a screening for a set of medical diseases (or other diagnostic situations); for some inputs, a nondeterministic classifier may return more than one of them. [sent-25, score-0.861]

10 Thus nondeterministic predictions may discard some options and allow domain experts to make practical decisions. [sent-28, score-0.826]

11 Even when the nondeterministic classifier returns most of the available classes for the representation of an entry, we can interpret this as the learned hypothesis acknowledging its ignorance about how to deal with that entry. [sent-29, score-0.845]

12 It is evident that nondeterministic classifiers will include true classes in their predictions more frequently than deterministic hypotheses, which have only one possibility to be right. [sent-30, score-1.006]

13 In this sense, nondeterministic predictions are backed by greater reliability. [sent-31, score-0.826]

14 To be useful, however, nondeterministic classifiers should not only predict a set of classes containing the correct or true one, but their prediction sets should also be as small as possible. [sent-32, score-0.871]

15 Hence, the loss functions for nondeterministic classifiers can be built as combinations of IR measures, as Fβ functions are. [sent-35, score-0.779]

16 Starting from the distribution of posterior probabilities of classes, given one entry, we present an algorithm that computes the subset of classes with the lowest expected loss. [sent-36, score-0.249]

17 In the experiments reported at the end of the paper, we employed three deterministic learners that provide posterior probabilities: Support Vector Machines (SVM), Logistic Regression (LR), and Naïve Bayes (NB). [sent-37, score-0.274]

18 We successfully compared the achievements of our nondeterministic classifiers with those obtained by other alternative approaches. [sent-38, score-0.779]

19 The formal settings both for nondeterministic classifiers and their loss functions are presented in the third section. [sent-41, score-0.779]

20 After that, in Section 4, we derive an algorithm to learn nondeterministic hypotheses. [sent-42, score-0.779]

21 We see that the quality of posterior probabilities determines the goodness of nondeterministic predictions. [sent-45, score-0.962]

22 In this context, provided that posterior probabilities are exactly known, an optimal rejection rule can be devised (Chow, 1970; Bartlett and Wegkamp, 2008): an entry is rejected if the maximum posterior probability is less than a threshold. [sent-51, score-0.339]
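
As an illustration, here is a minimal Python sketch of that rejection rule, assuming exact posteriors are supplied as a plain list (the function name and threshold value are ours, not the paper's):

```python
def chow_reject(posteriors, threshold):
    """Chow-style rejection: reject the entry when the maximum posterior
    probability falls below `threshold`; otherwise predict the index of
    the most probable class."""
    best = max(range(len(posteriors)), key=lambda j: posteriors[j])
    if posteriors[best] < threshold:
        return None  # reject: abstain from predicting any class
    return best

# With threshold 0.6, an uncertain entry is rejected outright.
print(chow_reject([0.5, 0.3, 0.2], 0.6))  # -> None (rejected)
print(chow_reject([0.8, 0.1, 0.1], 0.6))  # -> 0
```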

23 Notice that classifiers with reject option are a relaxed version of nondeterministic classifiers. [sent-52, score-0.844]

24 Rejection is a nondeterministic classification that includes the complete set of classes. [sent-53, score-0.779]

25 On the other hand, instead of avoiding difficult classifications, for each entry, nondeterministic classifiers venture a set of possible classes, not necessarily the complete set. [sent-54, score-0.779]

26 The Naïve Credal Classifier models prior ignorance about the distribution of classes by means of a set of prior densities (also called the prior credal set), which is turned into a set of posterior probabilities by element-wise application of Bayes' rule. [sent-59, score-0.365]

27 The classifier returns all the classes that are non-dominated by any other class according to the posterior credal set. [sent-60, score-0.289]
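
A hedged sketch of that dominance test follows; here the posterior credal set is represented simply as a list of posterior distributions, and class i is taken to dominate class j when it has strictly higher probability under every distribution in the set (the exact dominance criterion used by NCC may differ):

```python
def non_dominated_classes(credal_set):
    """Return the indices of classes not dominated by any other class.
    `credal_set` is a list of posterior distributions of equal length;
    class i dominates class j if Pr(i) > Pr(j) for every distribution."""
    k = len(credal_set[0])
    dominates = lambda i, j: all(p[i] > p[j] for p in credal_set)
    return [j for j in range(k)
            if not any(dominates(i, j) for i in range(k) if i != j)]

# Class 2 is beaten everywhere; classes 0 and 1 swap order across the
# two distributions, so both survive and the prediction is set-valued.
print(non_dominated_classes([[0.5, 0.3, 0.2], [0.3, 0.5, 0.2]]))  # [0, 1]
```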

28 However, training instances in multi-label tasks can belong to more than one class, while nondeterministic training sets are the same as those of standard classification. [sent-62, score-0.779]

29 There, we dealt with an interesting application of nondeterministic classifiers, in which classes (or ranks, in that context) are linearly ordered. [sent-68, score-0.845]

30 In this application, nondeterministic classifiers return an interval of ranks. [sent-71, score-0.779]

31 First, we test whether nondeterministic classifiers outperform Naïve Credal Classifiers and other alternative approaches. [sent-80, score-0.779]

32 We then investigate the role played by the ingredients of nondeterministic classifiers. [sent-81, score-0.779]

33 Within this context, we define: Definition 1. A nondeterministic hypothesis is a function h from the input space to the set of nonempty subsets of Y; in symbols, if P(Y) is the set of all subsets of Y, h : X → P(Y) \ {∅}. [sent-91, score-0.779]

34 The aim of such a learning task is to find a nondeterministic hypothesis h from a space H that optimizes the expected prediction performance (or risk) on samples S′ independently and identically distributed (i.i.d.). [sent-92, score-0.779]

35 In nondeterministic classification, we would like to favor those decisions of h that contain the true classes, and a smaller rather than a larger number of classes. [sent-96, score-0.779]

36 Thus, nondeterministic classification can be seen as a kind of Information Retrieval task for each entry. [sent-98, score-0.779]

37 According to the matrix, Equation (1), if h is a nondeterministic hypothesis and (x, y) ∈ X × Y , we thus have the following definitions. [sent-107, score-0.779]

38 Fβ = (1 + β²)PR / (β²P + R) = (1 + β²)a / ((1 + β²)a + b + β²c)   (2). Thus, for a nondeterministic classifier h and a pair (x, y), Fβ(h(x), y) = (1 + β²) / (β² + |h(x)|) if y ∈ h(x), and 0 otherwise. [sent-120, score-0.779]
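
A minimal sketch of this per-instance loss, with prediction sets as plain Python sets and β = 1 as a default (the default is our choice; the paper studies several β values):

```python
def f_beta(prediction, true_class, beta=1.0):
    """Per-instance F_beta of a nondeterministic prediction, Equation (2):
    (1 + beta^2) / (beta^2 + |h(x)|) when the true class is inside the
    predicted set, and 0 otherwise."""
    if true_class not in prediction:
        return 0.0
    return (1.0 + beta ** 2) / (beta ** 2 + len(prediction))

print(f_beta({"A"}, "A"))       # 1.0    correct singleton prediction
print(f_beta({"A", "B"}, "A"))  # 0.666  true class inside a 2-class set
print(f_beta({"B"}, "A"))       # 0.0    true class missed
```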

39 1 Nondeterministic Classification in a Binary Task To complete this section, let us show what nondeterministic classifiers look like in the simplest case, which will be further developed in the following sections. [sent-145, score-0.779]

40 In doubtful situations, however, the nondeterministic classifier should opt for predicting both classes. [sent-150, score-0.779]

41 Algorithm 1: the nondeterministic classifier nd_•, an algorithm for computing the prediction with one or more classes for an entry x, provided that the posterior probabilities of the classes are given. Input: {C_j : j = 1, …}. [sent-154, score-1.143]

42 Nondeterministic Classification Using Multiclass Posterior Probabilities. In the general multiclass setting presented at the beginning of Section 3, let x be an entry of the input space X, and let us now assume that we know the conditional probabilities of classes given the entry, Pr(C_j|x). [sent-161, score-0.229]

43 We seek the subset of {C_1, …, C_k} that minimizes the risk defined in Equation (1) when we use the nondeterministic loss given by Fβ (Equations 2, 3, and 4). [sent-166, score-0.779]

44 If the conditional probabilities Pr(C_j|x) are known, Algorithm 1 returns the nondeterministic prediction for h(x) that minimizes the risk given by the loss 1 − Fβ. [sent-169, score-0.855]

45 It states that the optimum classification for an input x is the set of r classes with the highest posterior probabilities, where r is the lowest integer that fulfills ∑_{j=1}^{r} Pr(C_j|x) ≥ (β² + r)·Pr(C_{r+1}|x)   (8), or the set of all classes when this condition is not fulfilled by any r. [sent-203, score-0.239]
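
Here is a sketch of Algorithm 1 built directly on Equation (8); it assumes the posteriors arrive as a list and returns class indices (names and defaults are ours):

```python
def nd_predict(posteriors, beta=1.0):
    """Nondeterministic prediction of Algorithm 1: rank classes by
    posterior probability and return the smallest prefix of size r
    fulfilling Equation (8),
        sum_{j=1..r} Pr(C_j|x) >= (beta^2 + r) * Pr(C_{r+1}|x),
    or all classes if no r < k fulfills it."""
    order = sorted(range(len(posteriors)), key=lambda j: -posteriors[j])
    cumulative = 0.0
    for r, j in enumerate(order[:-1], start=1):
        cumulative += posteriors[j]
        if cumulative >= (beta ** 2 + r) * posteriors[order[r]]:
            return order[:r]
    return order  # the condition never held: predict every class

# Binary check with beta = 1: Equation (8) keeps a single class only
# when the top posterior reaches 2/3.
print(nd_predict([0.7, 0.3]))  # -> [0]
print(nd_predict([0.6, 0.4]))  # -> [0, 1]
```

This also illustrates why β = 0 yields a deterministic classifier: the condition reduces to Pr(C_1|x) ≥ Pr(C_2|x), which the top-ranked class always satisfies.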

46 Additionally, we would like to underscore that Equation (8) hinders the use of naïve thresholds to compute nondeterministic predictions. [sent-205, score-0.779]

47 Thus, a nondeterministic classifier that always predicts the top r classes for a constant value r is not a correct option. [sent-206, score-0.845]

48 ∑_{j=1}^{r} Pr(C_j|x) ≥ λ.   (9) Note that given a λ value in [0, 1], Equation (9) straightforwardly gives rise to a nondeterministic classifier as follows. [sent-209, score-0.779]
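
A sketch of that construction as we read it from the surrounding text: predict the fewest top-ranked classes whose posteriors sum to at least λ (our reading of Equations 9 and 10 is an assumption based on the description later in the paper):

```python
def lambda_predict(posteriors, lam):
    """Threshold alternative of Equation (10): return the fewest
    top-ranked classes whose posterior probabilities sum to >= lam."""
    order = sorted(range(len(posteriors)), key=lambda j: -posteriors[j])
    total, chosen = 0.0, []
    for j in order:
        chosen.append(j)
        total += posteriors[j]
        if total >= lam:
            break
    return chosen

print(lambda_predict([0.5, 0.3, 0.2], lam=0.7))  # -> [0, 1]
```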

49 Therefore, in the experiments reported at the end of the paper, we shall consider the classifiers defined by Equation (10) as a possible alternative method to the nondeterministic classifier of Algorithm 1. [sent-219, score-0.808]

50 We then contrast our method with an implementation of Equation (10); once again our proposals outperform this alternative way to learn nondeterministic classifiers. [sent-229, score-0.779]

51 On the other hand, we analyze the influence of a number of factors related to nondeterministic learners. [sent-230, score-0.779]

52 We accordingly discuss how the scores of a nondeterministic learner are affected by the quality of posterior probabilities. [sent-231, score-1.012]

53 We see that the performance of a nondeterministic classifier is highly correlated with the accuracy of its deterministic counterpart. [sent-232, score-0.872]

54 Experimental Settings. We used three different methods for learning posterior probabilities in order to build nondeterministic classifiers. [sent-235, score-0.962]

55 Additionally, we excluded those learning tasks with missing values or in which every deterministic learner considered (NB, SVM, LR) achieves a proportion of successful classifications of over 95%; otherwise nondeterministic learners would be too similar to their deterministic counterparts. [sent-262, score-1.095]

56 Every table of scores (Tables 4, 5, 6, 7, 8) is devoted to reporting the experimental results achieved on one of the kinds of data sets by one of the deterministic learners and by the two nondeterministic algorithms that are to be compared. [sent-267, score-1.045]

57 First, they contain the scores of the deterministic learner d: the F1 (or accuracy, or Recall), and the Brier score, a measure of the quality of posterior probabilities (Brier, 1950; Yeung et al.): BS = (1/(2n)) ∑_{i=1}^{n} ∑_{j=1}^{k} (Pr(C_j|x_i) − δ(y_i = C_j))².

58 Then we report, for each nondeterministic learner, the F1, Precision, Recall, and the average number of classes predicted (|h(x)|). [sent-271, score-0.845]
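
A sketch of that Brier score as reconstructed above, with one-hot targets and labels given as class indices:

```python
def brier_score(posteriors, labels):
    """Brier score over n entries: (1 / 2n) * sum_i sum_j
    (Pr(C_j|x_i) - delta(y_i = C_j))^2, where delta is 1 exactly when
    class j is the true label of entry i."""
    n = len(posteriors)
    total = 0.0
    for probs, y in zip(posteriors, labels):
        for j, p in enumerate(probs):
            total += (p - (1.0 if j == y else 0.0)) ** 2
    return total / (2.0 * n)

# A confident correct prediction scores 0; a uniform guess over two
# classes scores 0.25.
print(brier_score([[1.0, 0.0]], [0]))  # 0.0
print(brier_score([[0.5, 0.5]], [0]))  # 0.25
```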

59 [Table 4 residue: per-data-set NB scores on the UCI data sets zoo, iris, glass, ecoli, balance, vehicle, vowel, contra, yeast, car, image, waveform, landsat, and letter.] [sent-275, score-0.419]

60 Table 4: Scores obtained by Naïve Bayes, the Naïve Credal Classifier and nondeterministic classifiers on UCI data sets using a 5-fold cross validation repeated 2 times. [sent-415, score-0.8]

61 The best nondeterministic F1 for each data set is boldfaced. Unless explicitly stated, we use the expression statistically significant differences to mean that p < 0.… [sent-417, score-0.837]

62 Additionally, in order to provide a quick view of the order of magnitude of the scores, we have boldfaced the best nondeterministic F1 score for each data set. [sent-419, score-0.876]

63 Naïve Credal Classifiers. In this subsection, we compare our nondeterministic learner with NCC (Corani and Zaffalon, 2008b), a state-of-the-art set-valued (nondeterministic) algorithm. [sent-427, score-0.806]

64 The nondeterministic nd_nb is significantly (remember that we are using Wilcoxon tests) better than NCC both in Recall and F1. [sent-430, score-0.86]

65 [Table 5 residue: per-data-set SVM F1 and Brier scores on the same UCI data sets.] [sent-441, score-0.419]

66 [Table 5 caption tail: the best nondeterministic F1 for each data set is boldfaced. Table 6 residue: per-data-set LR scores on the same UCI data sets.] [sent-583, score-1.256]

67 However, the scores of NCC on these data sets are inadmissible; their classifiers predict almost all classes for every example; their average values are: F1 = 25.… [sent-725, score-1.028]

68 When the accuracy of the deterministic classifiers decreases, the average number of classes predicted would be expected to increase. [sent-731, score-0.678]

69 [Tables 7 and 8 residue: per-data-set SVM and LR scores on the cancer microarray data sets brain, nci, lung 6, leukemia 3, lung 4, lung 11, tumors 11, tumors 14, lung 16, and leukemia 7.] [sent-833, score-1.356]

70 The best nondeterministic F1 for each data set is boldfaced. [sent-935, score-0.837]

71 We shall now compare the nondeterministic classifiers learned by Algorithm 1 with the alternative classifier defined in Equation (10), which uses a threshold λ for the sum of posterior probabilities. [sent-942, score-0.915]

72 The λ nondeterministic classifiers will be denoted by nd_λ^d, where d stands for the deterministic counterpart. [sent-944, score-0.872]

73 Moreover, λ classifiers always predict more classes than nd_svm and nd_lr. [sent-957, score-0.312]

74 In cancer microarray data (Tables 7 and 8), nd_svm always wins in F1, Precision, and average |h(x)|, while nd_svm always loses in Recall. [sent-961, score-0.238]

75 However, when posterior probabilities are provided by LR, the differences are not significant in F1, although nd_lr has 5 wins, 1 tie and 4 losses; in Precision and average size of predictions the differences are significant in favor of nd_lr. [sent-963, score-0.57]

76 4 The Importance of Posterior Probabilities The objective of this subsection is to experimentally investigate the degree of dependency between nondeterministic scores and the accuracy of posterior probabilities. [sent-971, score-0.985]

77 Comparing the results in Tables 5 and 6, it can be seen that the scores of nd_lr are significantly worse than those of nd_svm in F1, Precision, Recall (p < 0.…). [sent-974, score-0.319]

78 The general message is that nd_lr includes unnecessary classes in its predictions. [sent-976, score-0.236]

79 Yet again, inferior posterior probabilities seem to be responsible for the inclusion of unnecessary classes in nondeterministic predictions. [sent-984, score-1.028]

80 In the preceding discussion of the scores achieved by nondeterministic learners, we found significant differences when the Brier scores of the deterministic counterparts presented significant differences. [sent-985, score-1.07]

81 In fact, the scores of a learner built with Algorithm 1 depend on the quality of the posterior probabilities supplied by the corresponding deterministic learner. [sent-986, score-0.402]

82 It seems plausible to draw the conclusion that the better the posterior probabilities, the better the nondeterministic scores. [sent-987, score-0.886]

83 In order to quantify this statement, we compared deterministic Brier scores with nondeterministic F1 , Recall, and Precision values; see Figure 2. [sent-988, score-0.971]

84 We separated the scores achieved by UCI and cancer data sets and included the scores of nd_nb in UCI data sets. [sent-989, score-0.352]

85 Similar results would be achieved if we compared nondeterministic scores with deterministic accuracy. [sent-990, score-0.971]
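
A sketch of how such a comparison can be quantified: a plain Pearson correlation between per-data-set Brier scores and nondeterministic F1 values (the numbers below are made up; Figure 2 reports the actual scatter):

```python
def pearson(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Illustrative per-data-set scores: lower Brier score tends to go with
# higher F1, so the correlation comes out strongly negative.
brier = [0.05, 0.10, 0.20, 0.30]
f1 = [0.97, 0.93, 0.85, 0.78]
print(pearson(brier, f1))
```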

86 Therefore, in order to choose a nondeterministic approach in a practical application, given a data set, it would be advisable to first analyze the Brier score of different deterministic learners. [sent-1052, score-0.911]

87 Figure 3: Evolution of F1, F2, Precision and Recall on two UCI data sets (yeast and vowel) for different β values and for the nondeterministic learners generated by SVM, LR, and NB. [sent-1203, score-0.853]

88 In Figure 3 we show the evolution of F1, F2, Precision and Recall on two UCI data sets (yeast and vowel) for different β values and for the nondeterministic learners generated by SVM, LR, and NB. [sent-1207, score-0.853]
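
Such a sweep could be reproduced with the earlier sketches: vary β and average the per-instance F_β of Algorithm 1's predictions (nd_predict and f_beta are the hypothetical helpers sketched above; the evaluation set here is a toy stand-in):

```python
# Toy evaluation set: (posterior distribution, true class index) pairs.
toy_set = [([0.7, 0.2, 0.1], 0), ([0.4, 0.35, 0.25], 1)]
for beta in [0.0, 0.5, 1.0, 2.0]:
    scores = [f_beta(set(nd_predict(p, beta)), y, beta)
              for p, y in toy_set]
    print(beta, sum(scores) / len(scores))
```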

89 Initially, β = 0 makes the nondeterministic classifiers deterministic. [sent-1209, score-0.779]

90 The kind of decisions that one would like to make from nondeterministic classifications must be considered. [sent-1214, score-0.779]

91 However, when some point near 1 is exceeded, the F1 score of the nondeterministic learner typically falls below the accuracy of the corresponding deterministic learner. [sent-1217, score-0.938]

92 Conclusions We have studied classifiers that are allowed to predict more than one class for entries from an input space: nondeterministic or set-valued classifiers. [sent-1221, score-0.805]

93 After discussing such measures, we derived an algorithm to learn optimal nondeterministic hypotheses. [sent-1223, score-0.779]

94 Given an entry from the input space, the algorithm requires the posterior probabilities to compute the subset of classes with the lowest expected loss. [sent-1224, score-0.298]

95 Additionally, an important advantage of our nondeterministic classifiers over NCC is that we can control the degree of nondeterministic behavior. [sent-1230, score-1.558]

96 However, the nondeterministic behavior of NCC is quite difficult to predict. [sent-1233, score-0.779]

97 With the posterior probabilities provided by these deterministic learners, we built another alternative method to predict more than one class: the set of classes with the highest posterior probabilities summing to more than a threshold λ. [sent-1235, score-0.551]

98 On the other hand, in the experiments reported in this paper, we studied the role of the deterministic learners that explicitly provide posterior probabilities. [sent-1237, score-0.274]

99 We found that the better the posterior probabilities, the better the nondeterministic classifiers. [sent-1238, score-0.886]

100 In fact, we obtained very high correlations between the Brier scores of the deterministic learners and the F1, Precision and Recall values of their nondeterministic counterparts. [sent-1239, score-1.047]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('nondeterministic', 0.779), ('lr', 0.17), ('brier', 0.169), ('fit', 0.145), ('ncc', 0.131), ('precision', 0.123), ('pr', 0.118), ('credal', 0.116), ('posterior', 0.107), ('bahamonde', 0.106), ('scores', 0.099), ('ondeterministic', 0.097), ('oz', 0.097), ('deterministic', 0.093), ('del', 0.082), ('nb', 0.081), ('zaffalon', 0.077), ('probabilities', 0.076), ('na', 0.076), ('lung', 0.075), ('uci', 0.074), ('learners', 0.074), ('vowel', 0.074), ('ers', 0.073), ('cancer', 0.073), ('ez', 0.073), ('classi', 0.068), ('corani', 0.068), ('lassifiers', 0.068), ('classes', 0.066), ('yeast', 0.061), ('boldfaced', 0.058), ('tamayo', 0.054), ('leukemia', 0.051), ('svm', 0.05), ('cr', 0.05), ('entry', 0.049), ('bs', 0.049), ('recall', 0.049), ('predictions', 0.047), ('tumors', 0.044), ('microarray', 0.043), ('ve', 0.041), ('correlation', 0.041), ('reject', 0.039), ('ecoli', 0.039), ('score', 0.039), ('multiclass', 0.038), ('gene', 0.035), ('er', 0.033), ('waveform', 0.033), ('yeung', 0.033), ('glass', 0.033), ('proportion', 0.029), ('zoo', 0.029), ('landsat', 0.029), ('angelo', 0.029), ('contra', 0.029), ('coz', 0.029), ('nci', 0.029), ('pnas', 0.029), ('staunton', 0.029), ('uniovi', 0.029), ('shall', 0.029), ('learner', 0.027), ('beta', 0.027), ('bayes', 0.026), ('option', 0.026), ('predict', 0.026), ('ease', 0.025), ('car', 0.025), ('antonio', 0.025), ('imprecise', 0.025), ('earning', 0.025), ('vehicle', 0.024), ('iris', 0.022), ('wins', 0.022), ('retrieval', 0.022), ('additionally', 0.022), ('tables', 0.022), ('zr', 0.022), ('vs', 0.022), ('letter', 0.021), ('repeated', 0.021), ('frequently', 0.021), ('armstrong', 0.019), ('bumgarner', 0.019), ('carcasses', 0.019), ('ciencia', 0.019), ('clare', 0.019), ('ministerio', 0.019), ('multiclassi', 0.019), ('ndlr', 0.019), ('ndnb', 0.019), ('ndsvm', 0.019), ('pomeroy', 0.019), ('scherf', 0.019), ('shafer', 0.019), ('tsoumakas', 0.019), ('wegkamp', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 48 jmlr-2009-Learning Nondeterministic Classifiers


2 0.052360337 10 jmlr-2009-Application of Non Parametric Empirical Bayes Estimation to High Dimensional Classification

Author: Eitan Greenshtein, Junyong Park

Abstract: We consider the problem of classification using high dimensional features’ space. In a paper by Bickel and Levina (2004), it is recommended to use naive-Bayes classifiers, that is, to treat the features as if they are statistically independent. Consider now a sparse setup, where only a few of the features are informative for classification. Fan and Fan (2008) suggested a variable selection and classification method, called FAIR. The FAIR method improves the design of naive-Bayes classifiers in sparse setups. The improvement is due to reducing the noise in estimating the features’ means. This reduction arises because only the means of a few selected variables need to be estimated. We also consider the design of naive Bayes classifiers. We show that a good alternative to variable selection is estimation of the means through a certain non parametric empirical Bayes procedure. In sparse setups the empirical Bayes implicitly performs an efficient variable selection. It also adapts very well to non sparse setups, and has the advantage of making use of the information from many “weakly informative” variables, which variable selection type of classification procedures give up on using. We compare our method with FAIR and other classification methods in simulation for sparse and non sparse setups, and in real data examples involving classification of normal versus malignant tissues based on microarray data. Keywords: non parametric empirical Bayes, high dimension, classification

3 0.050312657 50 jmlr-2009-Learning When Concepts Abound

Author: Omid Madani, Michael Connor, Wiley Greiner

Abstract: Many learning tasks, such as large-scale text categorization and word prediction, can benefit from efficient training and classification when the number of classes, in addition to instances and features, is large, that is, in the thousands and beyond. We investigate the learning of sparse class indices to address this challenge. An index is a mapping from features to classes. We compare the index-learning methods against other techniques, including one-versus-rest and top-down classification using perceptrons and support vector machines. We find that index learning is highly advantageous for space and time efficiency, at both training and classification times. Moreover, this approach yields similar and at times better accuracies. On problems with hundreds of thousands of instances and thousands of classes, the index is learned in minutes, while other methods can take hours or days. As we explain, the design of the learning update enables conveniently constraining each feature to connect to a small subset of the classes in the index. This constraint is crucial for scalability. Given an instance with l active (positive-valued) features, each feature on average connecting to d classes in the index (in the order of 10s in our experiments), update and classification take O(dl log(dl)). Keywords: index learning, many-class learning, multiclass learning, online learning, text categorization

4 0.045683157 56 jmlr-2009-Model Monitor (M2): Evaluating, Comparing, and Monitoring Models    (Machine Learning Open Source Software Paper)

Author: Troy Raeder, Nitesh V. Chawla

Abstract: This paper presents Model Monitor (M 2 ), a Java toolkit for robustly evaluating machine learning algorithms in the presence of changing data distributions. M 2 provides a simple and intuitive framework in which users can evaluate classifiers under hypothesized shifts in distribution and therefore determine the best model (or models) for their data under a number of potential scenarios. Additionally, M 2 is fully integrated with the WEKA machine learning environment, so that a variety of commodity classifiers can be used if desired. Keywords: machine learning, open-source software, distribution shift, scenario analysis

5 0.040930178 8 jmlr-2009-An Anticorrelation Kernel for Subsystem Training in Multiple Classifier Systems

Author: Luciana Ferrer, Kemal Sönmez, Elizabeth Shriberg

Abstract: We present a method for training support vector machine (SVM)-based classification systems for combination with other classification systems designed for the same task. Ideally, a new system should be designed such that, when combined with existing systems, the resulting performance is optimized. We present a simple model for this problem and use the understanding gained from this analysis to propose a method to achieve better combination performance when training SVM systems. We include a regularization term in the SVM objective function that aims to reduce the average class-conditional covariance between the resulting scores and the scores produced by the existing systems, introducing a trade-off between such covariance and the system’s individual performance. That is, the new system “takes one for the team”, falling somewhat short of its best possible performance in order to increase the diversity of the ensemble. We report results on the NIST 2005 and 2006 speaker recognition evaluations (SREs) for a variety of subsystems. We show a gain of 19% on the equal error rate (EER) of a combination of four systems when applying the proposed method with respect to the performance obtained when the four systems are trained independently of each other. Keywords: system combination, ensemble diversity, multiple classifier systems, support vector machines, speaker recognition, kernel methods

6 0.03753683 99 jmlr-2009-Using Local Dependencies within Batches to Improve Large Margin Classifiers

7 0.036561687 73 jmlr-2009-Prediction With Expert Advice For The Brier Game

8 0.033631116 23 jmlr-2009-Discriminative Learning Under Covariate Shift

9 0.030401971 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory

10 0.029704096 45 jmlr-2009-Learning Approximate Sequential Patterns for Classification

11 0.028217837 58 jmlr-2009-NEUROSVM: An Architecture to Reduce the Effect of the Choice of Kernel on the Performance of SVM

12 0.028186871 35 jmlr-2009-Feature Selection with Ensembles, Artificial Variables, and Redundancy Elimination    (Special Topic on Model Selection)

13 0.027355907 4 jmlr-2009-A Survey of Accuracy Evaluation Metrics of Recommendation Tasks

14 0.027240397 47 jmlr-2009-Learning Linear Ranking Functions for Beam Search with Application to Planning

15 0.027194344 62 jmlr-2009-Nonlinear Models Using Dirichlet Process Mixtures

16 0.024797732 15 jmlr-2009-Cautious Collective Classification

17 0.024466192 71 jmlr-2009-Perturbation Corrections in Approximate Inference: Mixture Modelling Applications

18 0.024044259 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms

19 0.022961251 3 jmlr-2009-A Parameter-Free Classification Method for Large Scale Learning

20 0.022598274 96 jmlr-2009-Transfer Learning for Reinforcement Learning Domains: A Survey


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.116), (1, -0.07), (2, 0.063), (3, -0.014), (4, 0.024), (5, -0.044), (6, 0.015), (7, 0.106), (8, 0.068), (9, 0.038), (10, 0.075), (11, -0.06), (12, -0.005), (13, -0.109), (14, 0.009), (15, -0.103), (16, 0.013), (17, -0.105), (18, -0.041), (19, -0.152), (20, -0.048), (21, 0.107), (22, 0.035), (23, 0.339), (24, -0.204), (25, 0.096), (26, -0.123), (27, -0.097), (28, -0.074), (29, 0.064), (30, -0.052), (31, 0.132), (32, 0.049), (33, 0.061), (34, 0.076), (35, -0.027), (36, 0.146), (37, -0.063), (38, -0.163), (39, -0.164), (40, 0.118), (41, -0.03), (42, 0.251), (43, -0.188), (44, -0.062), (45, -0.074), (46, -0.051), (47, 0.148), (48, 0.1), (49, 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94881362 48 jmlr-2009-Learning Nondeterministic Classifiers


2 0.39949945 56 jmlr-2009-Model Monitor (M2): Evaluating, Comparing, and Monitoring Models    (Machine Learning Open Source Software Paper)


3 0.36372596 10 jmlr-2009-Application of Non Parametric Empirical Bayes Estimation to High Dimensional Classification


4 0.35773349 8 jmlr-2009-An Anticorrelation Kernel for Subsystem Training in Multiple Classifier Systems


5 0.34270567 50 jmlr-2009-Learning When Concepts Abound


6 0.28834745 15 jmlr-2009-Cautious Collective Classification

7 0.26067609 99 jmlr-2009-Using Local Dependencies within Batches to Improve Large Margin Classifiers

8 0.22609819 73 jmlr-2009-Prediction With Expert Advice For The Brier Game

9 0.2205884 47 jmlr-2009-Learning Linear Ranking Functions for Beam Search with Application to Planning

10 0.19878326 55 jmlr-2009-Maximum Entropy Discrimination Markov Networks

11 0.19529925 58 jmlr-2009-NEUROSVM: An Architecture to Reduce the Effect of the Choice of Kernel on the Performance of SVM

12 0.18884964 45 jmlr-2009-Learning Approximate Sequential Patterns for Classification

13 0.18767913 46 jmlr-2009-Learning Halfspaces with Malicious Noise

14 0.16754164 69 jmlr-2009-Optimized Cutting Plane Algorithm for Large-Scale Risk Minimization

15 0.16738014 11 jmlr-2009-Bayesian Network Structure Learning by Recursive Autonomy Identification

16 0.16423672 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms

17 0.15352827 62 jmlr-2009-Nonlinear Models Using Dirichlet Process Mixtures

18 0.15238769 3 jmlr-2009-A Parameter-Free Classification Method for Large Scale Learning

19 0.15211566 80 jmlr-2009-Reproducing Kernel Banach Spaces for Machine Learning

20 0.14374121 32 jmlr-2009-Exploiting Product Distributions to Identify Relevant Variables of Correlation Immune Functions


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(8, 0.036), (11, 0.029), (19, 0.019), (26, 0.018), (38, 0.034), (43, 0.385), (47, 0.016), (52, 0.081), (55, 0.049), (58, 0.039), (66, 0.104), (90, 0.058), (96, 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.68938065 48 jmlr-2009-Learning Nondeterministic Classifiers


2 0.37611055 82 jmlr-2009-Robustness and Regularization of Support Vector Machines

Author: Huan Xu, Constantine Caramanis, Shie Mannor

Abstract: We consider regularized support vector machines (SVMs) and show that they are precisely equivalent to a new robust optimization formulation. We show that this equivalence of robust optimization and regularization has implications for both algorithms, and analysis. In terms of algorithms, the equivalence suggests more general SVM-like algorithms for classification that explicitly build in protection to noise, and at the same time control overfitting. On the analysis front, the equivalence of robustness and regularization provides a robust optimization interpretation for the success of regularized SVMs. We use this new robustness interpretation of SVMs to give a new proof of consistency of (kernelized) SVMs, thus establishing robustness as the reason regularized SVMs generalize well. Keywords: robustness, regularization, generalization, kernel, support vector machine

3 0.37027857 87 jmlr-2009-Sparse Online Learning via Truncated Gradient

Author: John Langford, Lihong Li, Tong Zhang

Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of onlinelearning algorithms with convex loss functions. This method has several essential properties: 1. The degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. 2. The approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1 -regularization method in the batch setting. We prove that small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. 3. The approach works well empirically. We apply the approach to several data sets and find for data sets with large numbers of features, substantial sparsity is discoverable. Keywords: truncated gradient, stochastic gradient descent, online learning, sparsity, regularization, Lasso

4 0.37027821 38 jmlr-2009-Hash Kernels for Structured Data

Author: Qinfeng Shi, James Petterson, Gideon Dror, John Langford, Alex Smola, S.V.N. Vishwanathan

Abstract: We propose hashing to facilitate efficient kernels. This generalizes previous work using sampling and we show a principled way to compute the kernel matrix for data streams and sparse feature spaces. Moreover, we give deviation bounds from the exact kernel matrix. This has applications to estimation on strings and graphs. Keywords: hashing, stream, string kernel, graphlet kernel, multiclass classification

5 0.36691198 62 jmlr-2009-Nonlinear Models Using Dirichlet Process Mixtures

Author: Babak Shahbaba, Radford Neal

Abstract: We introduce a new nonlinear model for classification, in which we model the joint distribution of response variable, y, and covariates, x, non-parametrically using Dirichlet process mixtures. We keep the relationship between y and x linear within each component of the mixture. The overall relationship becomes nonlinear if the mixture contains more than one component, with different regression coefficients. We use simulated data to compare the performance of this new approach to alternative methods such as multinomial logit (MNL) models, decision trees, and support vector machines. We also evaluate our approach on two classification problems: identifying the folding class of protein sequences and detecting Parkinson’s disease. Our model can sometimes improve predictive accuracy. Moreover, by grouping observations into sub-populations (i.e., mixture components), our model can sometimes provide insight into hidden structure in the data. Keywords: mixture models, Dirichlet process, classification

6 0.36539051 19 jmlr-2009-Controlling the False Discovery Rate of the Association Causality Structure Learned with the PC Algorithm    (Special Topic on Mining and Learning with Graphs and Relations)

7 0.3617034 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model

8 0.36002126 32 jmlr-2009-Exploiting Product Distributions to Identify Relevant Variables of Correlation Immune Functions

9 0.3584564 75 jmlr-2009-Provably Efficient Learning with Typed Parametric Models

10 0.35613167 9 jmlr-2009-Analysis of Perceptron-Based Active Learning

11 0.35441187 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models

12 0.35427913 94 jmlr-2009-The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs

13 0.35380128 10 jmlr-2009-Application of Non Parametric Empirical Bayes Estimation to High Dimensional Classification

14 0.35351557 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost

15 0.35347581 58 jmlr-2009-NEUROSVM: An Architecture to Reduce the Effect of the Choice of Kernel on the Performance of SVM

16 0.35341641 37 jmlr-2009-Generalization Bounds for Ranking Algorithms via Algorithmic Stability

17 0.35321987 18 jmlr-2009-Consistency and Localizability

18 0.35292479 71 jmlr-2009-Perturbation Corrections in Approximate Inference: Mixture Modelling Applications

19 0.35251537 70 jmlr-2009-Particle Swarm Model Selection    (Special Topic on Model Selection)

20 0.35223091 99 jmlr-2009-Using Local Dependencies within Batches to Improve Large Margin Classifiers