jmlr jmlr2013 jmlr2013-12 knowledge-graph by maker-knowledge-mining

12 jmlr-2013-Alleviating Naive Bayes Attribute Independence Assumption by Attribute Weighting


Source: pdf

Author: Nayyar A. Zaidi, Jesús Cerquides, Mark J. Carman, Geoffrey I. Webb

Abstract: Despite the simplicity of the Naive Bayes classifier, it has continued to perform well against more sophisticated newcomers and has remained, therefore, of great interest to the machine learning community. Of numerous approaches to refining the naive Bayes classifier, attribute weighting has received less attention than it warrants. Most approaches, perhaps influenced by attribute weighting in other machine learning algorithms, use weighting to place more emphasis on highly predictive attributes than those that are less predictive. In this paper, we argue that for naive Bayes attribute weighting should instead be used to alleviate the conditional independence assumption. Based on this premise, we propose a weighted naive Bayes algorithm, called WANBIA, that selects weights to minimize either the negative conditional log likelihood or the mean squared error objective functions. We perform extensive evaluations and find that WANBIA is a competitive alternative to state of the art classifiers like Random Forest, Logistic Regression and A1DE. Keywords: classification, naive Bayes, attribute independence assumption, weighted naive Bayes classification

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Of numerous approaches to refining the naive Bayes classifier, attribute weighting has received less attention than it warrants. [sent-13, score-0.423]

2 Most approaches, perhaps influenced by attribute weighting in other machine learning algorithms, use weighting to place more emphasis on highly predictive attributes than those that are less predictive. [sent-14, score-0.44]

3 In this paper, we argue that for naive Bayes attribute weighting should instead be used to alleviate the conditional independence assumption. [sent-15, score-0.442]

4 Keywords: classification, naive Bayes, attribute independence assumption, weighted naive Bayes classification. [sent-18, score-0.515]

5 If there is sufficient data present for every possible combination of attribute values, direct estimation of each relevant multi-variate probability will be reliable. [sent-23, score-0.261]

6 In practice, naive Bayes’ attribute independence assumption is often violated, and as a result its probability estimates are often suboptimal. [sent-33, score-0.39]

7 These methods are aimed at enhancing naive Bayes’ accuracy by relaxing the assumption of conditional independence between attributes given the class label (Langley and Sage, 1994; Friedman and Goldszmidt, 1996; Zheng et al. [sent-37, score-0.204]

8 The second category comprises attribute weighting methods and has received relatively little attention (Hilden and Bjerregaard, 1976; Ferreira et al. [sent-41, score-0.313]

9 There is some evidence that attribute weighting has primarily been viewed as a means of increasing the influence of highly predictive attributes and discounting attributes that have little predictive value. [sent-43, score-0.463]

10 For example, weighting by mutual information between an attribute and the class directly uses a measure of how predictive each individual attribute is (Zhang and Sheng, 2004). [sent-45, score-0.574]

11 In contrast, we argue that the primary value of attribute weighting is its capacity to reduce the impact on prediction accuracy of violations of the assumption of conditional attribute independence. [sent-46, score-0.574]

12 • We present novel algorithms for learning attribute weights for naive Bayes. [sent-50, score-0.4]

13 It should be noted that the motivation of our work differs from most previous attribute weighting methods. [sent-51, score-0.313]

14 We view weighting as a way to reduce the effects of the violations of the attribute independence assumption on which naive Bayes is based. [sent-52, score-0.442]

15 Also, our work differs from semi-naive Bayes methods, as we weight the attributes rather than modifying the structure of naive Bayes. [sent-53, score-0.206]

16 Our novel techniques for learning naive Bayes weights are described in Section 4 where we also discuss their connection with naive Bayes and Logistic Regression in terms of parameter optimization. [sent-58, score-0.249]

17 In the most general case, this weight depends on the attribute value: P̂(y, x) = P̂(y) ∏_{i=1}^{a} P̂(x_i | y)^{w_{i, x_i}}. [sent-78, score-0.282]
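To make the role of these weights concrete, the following is a minimal sketch (not the authors' implementation) of the weighted joint estimate above for discrete data; the probability tables p_y and p_x_given_y and the weight arrays w are assumed to have been estimated elsewhere.

```python
import numpy as np

def weighted_nb_joint(x, p_y, p_x_given_y, w):
    """Weighted naive Bayes joint estimate P(y, x) = P(y) * prod_i P(x_i | y)^w[i][x_i].

    x           : length-a sequence of discrete attribute values (integer-coded)
    p_y         : class prior estimates, shape (|Y|,)
    p_x_given_y : list of a arrays, p_x_given_y[i][v, c] = P(X_i = v | y = c)
    w           : list of a arrays, w[i][v] = weight for value v of attribute i
    """
    joint = p_y.copy()
    for i, xi in enumerate(x):
        joint *= p_x_given_y[i][xi, :] ** w[i][xi]
    return joint  # unnormalised score per class; normalise to obtain P(y | x)
```

Setting every weight to 1 recovers standard naive Bayes, while sharing one weight per attribute gives the intermediate form referred to in the next sentence.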

18 Unless explicitly stated, in this paper we intend the intermediate form when we refer to attribute weighting, as we believe it provides an effective trade-off between computational complexity and inductive power. [sent-83, score-0.261]

19 Appropriate weights can reduce the error that results from violations of naive Bayes’ conditional attribute independence assumption. [sent-84, score-0.419]

20 Attribute weighting is strictly more powerful than attribute selection, as it is possible to obtain identical results to attribute selection by setting the weights of selected attributes to 1.0 (and those of the remaining attributes to 0). [sent-93, score-0.678]

21 Assignment of other weights can create classifiers that cannot be expressed using attribute selection. [sent-95, score-0.29]

22 Dealing with Dependent Attributes by Weighting: A Simple Example. This example shows the relative performance of naive Bayes and weighted naive Bayes as we vary the conditional dependence between attributes. [sent-97, score-0.235]

23 In particular it demonstrates how optimal assignment of weights will never result in higher error than attribute selection or standard naive Bayes, and that for certain violations of the attribute independence assumption it can result in lower error than either. [sent-98, score-0.68]

24 For each of the 4 possible attribute value combinations (x_1, x_2) ∈ {(0, 0), (0, 1), (1, 0), (1, 1)}, we selected values for the class probability given the attribute value combination, P(y | x_1, x_2), from a predefined set. [sent-103, score-0.261]

25 We then set the values for the attribute value probabilities P(x_1, x_2) by fixing the marginal distributions to one half, P(x_1) = P(x_2) = 1/2, and varying the correlation ρ between the attributes. [sent-107, score-0.336]

26 Figure 1: Variation of error of naive Bayes, selective naive Bayes, weighted naive Bayes, and a classifier based only on the prior probabilities of the class, as a function of the conditional dependence (conditional mutual information) between the two attributes. [sent-124, score-0.368]

27 P(X_1 = 0, X_2 = 0) = P(X_1 = 1, X_2 = 1) = (1 + ρ)/4. Note that when ρ = −1 the attributes are perfectly anti-correlated (x_1 = ¬x_2), when ρ = 0 the attributes are independent (since the joint distribution P(x_1, x_2) is uniform), and when ρ = 1 the attributes are perfectly correlated. [sent-126, score-0.225]
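The joint distribution used in this example is straightforward to reproduce; the short sketch below builds P(x_1, x_2) for a given ρ under the stated marginals of 1/2 and checks the three limiting cases.

```python
import numpy as np

def joint_x1_x2(rho):
    """Joint over (x1, x2) in {0,1}^2 with P(x1) = P(x2) = 1/2 and correlation rho:
    P(0,0) = P(1,1) = (1 + rho)/4 and P(0,1) = P(1,0) = (1 - rho)/4."""
    same = (1.0 + rho) / 4.0
    diff = (1.0 - rho) / 4.0
    return np.array([[same, diff],
                     [diff, same]])

for rho in (-1.0, 0.0, 1.0):
    # rho = -1: perfectly anti-correlated, rho = 0: uniform, rho = 1: perfectly correlated
    print(rho, joint_x1_x2(rho).ravel())
```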

28 It can be seen that when conditional mutual information (CMI) is small, naive Bayes performs better than selective naive Bayes and the prior classifier. [sent-152, score-0.243]

29 As CMI is increased, naive Bayes performance deteriorates compared to selective naive Bayes. [sent-155, score-0.243]

30 We have shown that weighted naive Bayes is capable of expressing more accurate classifiers than selective naive Bayes. [sent-159, score-0.258]

31 Langley and Sage (1994) proposed the Selective Bayes (SB) classifier, using feature selection to accommodate redundant attributes in the prediction process and to augment naive Bayes with the ability to exclude attributes that introduce dependencies. [sent-170, score-0.26]

32 The technique is based on searching through the entire space of all attribute subsets. [sent-171, score-0.261]

33 On each iteration, the method considers adding each unused attribute to the subset on a trial basis and measures the performance of the resulting classifier on the training data. [sent-174, score-0.261]

34 The attribute that most improves the accuracy is permanently added to the subset. [sent-175, score-0.261]

35 The algorithm terminates when addition of any attribute results in reduced accuracy, at which point it returns the list of current attributes along with their ranks. [sent-176, score-0.336]

36 The rank of each attribute is based on the order in which it is added to the subset. [sent-177, score-0.261]
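The greedy wrapper described in the last few sentences can be sketched as follows; the accuracy estimate on the training data is kept abstract (the evaluate callback), and the tie-breaking and stopping details are assumptions for illustration.

```python
def forward_select(attributes, evaluate):
    """Greedy forward selection: repeatedly add the unused attribute that most improves
    training accuracy; stop when no addition improves it. `evaluate(subset)` is assumed
    to train the classifier on that attribute subset and return its training accuracy."""
    selected, ranks = [], {}
    best = evaluate(selected)
    while True:
        candidates = [a for a in attributes if a not in selected]
        if not candidates:
            break
        score, a = max((evaluate(selected + [a]), a) for a in candidates)
        if score <= best:                 # no attribute improves accuracy: terminate
            break
        selected.append(a)
        ranks[a] = len(selected)          # rank = order in which the attribute was added
        best = score
    return selected, ranks
```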

37 Similar to Langley and Sage (1994), Correlation-based Feature Selection (CFS) used a correlation measure as a metric to determine the relevance of the attribute subset (Hall, 2000). [sent-178, score-0.261]

38 The Selective Bayesian Classifier (SBC) of Ratanamahatana and Gunopulos (2003) also employs decision trees for attribute selection for naive Bayes. [sent-189, score-0.388]

39 Only those attributes appearing in the top three levels of a decision tree are selected for inclusion in naive Bayes. [sent-190, score-0.202]

40 Naive Bayes is trained on an attribute set which comprises the union of attributes appearing in all five decision trees. [sent-193, score-0.353]

41 This strategy uses a single weight and therefore is not strictly performing attribute weighting. [sent-195, score-0.282]

42 Their approach is motivated as a means of alleviating the effects of violations of the attribute independence assumption. [sent-196, score-0.28]

43 Zhang and Sheng (2004) used the gain ratio of an attribute with the class labels as its weight. [sent-206, score-0.261]

44 The gain ratio is a well-studied attribute weighting technique and is generally used for splitting nodes in decision trees (Duda et al. [sent-208, score-0.33]

45 The weight of each attribute is set to the gain ratio of the attribute relative to the average gain ratio across all attributes. [sent-210, score-0.543]

46 w_i = GR(i) / ( (1/a) ∑_{i'=1}^{a} GR(i') ). (7) The gain ratio of an attribute is then simply the mutual information between that attribute and the class label divided by the entropy of that attribute: GR(i) = I(X_i, Y) / H(X_i), where I(X_i, Y) = ∑_y ∑_{x_i} P(x_i, y) log [ P(x_i, y) / (P(x_i) P(y)) ]. [sent-213, score-0.522]
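A minimal sketch of gain-ratio weighting is shown below; the use of raw empirical frequencies and of natural logarithms is an assumption made for illustration rather than a detail taken from Zhang and Sheng (2004).

```python
import numpy as np

def gain_ratio_weights(X, y):
    """Weights w_i = GR(i) / mean_i GR(i), with GR(i) = I(X_i, Y) / H(X_i) (Equation 7).
    X is an (n, a) matrix of discretised attributes, y an (n,) vector of class labels."""
    n, a = X.shape
    classes = np.unique(y)
    gr = np.zeros(a)
    for i in range(a):
        xi = X[:, i]
        values = np.unique(xi)
        # empirical joint distribution P(x_i, y) and its marginals
        p_xy = np.array([[np.mean((xi == v) & (y == c)) for c in classes] for v in values])
        p_x = p_xy.sum(axis=1)
        p_y = p_xy.sum(axis=0)
        nz = p_xy > 0
        mi = np.sum(p_xy[nz] * np.log(p_xy[nz] / np.outer(p_x, p_y)[nz]))
        h_x = -np.sum(p_x[p_x > 0] * np.log(p_x[p_x > 0]))
        gr[i] = mi / h_x if h_x > 0 else 0.0
    return gr / gr.mean()  # gain ratio relative to the average gain ratio
```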

47 An attribute weighting scheme based on differential evolution algorithms for naive Bayes classification has been proposed by Wu and Cai (2011). [sent-217, score-0.423]

48 First, a population of attribute weight vectors is randomly generated; weights in the population are constrained to be between 0 and 1. [sent-218, score-0.311]

49 A scheme used in Hall (2007) is similar in spirit to SBC, where the weight assigned to each attribute is inversely proportional to the minimum depth at which it is first tested in an unpruned decision tree. [sent-223, score-0.299]

50 For example, one can assign a weight to an attribute i as: w_i = (1/T) ∑_{t=1}^{T} 1/√(d_{ti}). (8) [sent-226, score-0.303]

51 where d_{ti} is the minimum depth at which attribute i appears in decision tree t, and T is the total number of decision trees generated. [sent-227, score-0.295]
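The sketch below illustrates Equation 8; the use of scikit-learn decision trees, the bootstrap details, and the choice to let attributes that are never tested in a tree contribute 0 are all assumptions for illustration, not details taken from Hall (2007).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def min_test_depths(tree, n_attributes):
    """Minimum depth (root = 1) at which each attribute is tested in a fitted tree;
    attributes never tested get np.inf."""
    t = tree.tree_
    depths = np.full(n_attributes, np.inf)

    def walk(node, depth):
        if t.children_left[node] == -1:          # leaf node
            return
        f = t.feature[node]
        depths[f] = min(depths[f], depth)
        walk(t.children_left[node], depth + 1)
        walk(t.children_right[node], depth + 1)

    walk(0, 1)
    return depths

def depth_based_weights(X, y, n_trees=10, sample_frac=0.5, seed=0):
    """w_i = (1/T) * sum_t 1 / sqrt(d_ti) over T unpruned trees grown on bootstrap samples."""
    rng = np.random.default_rng(seed)
    n, a = X.shape
    w = np.zeros(a)
    for _ in range(n_trees):
        idx = rng.choice(n, size=int(sample_frac * n), replace=True)
        d = min_test_depths(DecisionTreeClassifier().fit(X[idx], y[idx]), a)
        w += np.where(np.isfinite(d), 1.0 / np.sqrt(d), 0.0)
    return w / n_trees
```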

52 To understand whether the improvement in naive Bayes accuracy was due to attribute weighting or selection, a variant of the above approach was also proposed where all non-zero weights are set to one. [sent-228, score-0.452]

53 For example, the weight of an attribute i can be defined as: w_i = 1/√(d_i), where d_i is the minimum depth at which attribute i is tested in the tree. [sent-232, score-0.303]

54 The approach of Ferreira et al. (2001) is the only one to use Equation 4, weighting each attribute value rather than each attribute. [sent-235, score-0.313]

55 They used entropy-based discretization for numeric attributes and assigned a weight to each partition (value) of the attribute that is proportional to its predictive capability of the class. [sent-236, score-0.357]

56 In that case the weight for value j of attribute i can be written as: w_{ij} ∝ ( ∑_y | P(y | X_i = j) − 1/|Y| |^α )^{1/α}. [sent-242, score-0.303]

57 where P(y | X_i = j) denotes the probability that the class is y given that the i-th attribute of a data point has value j. [sent-243, score-0.261]

58 Alternatively, the baseline class distribution can be set to the class probabilities across all values of the attribute. [sent-244, score-0.261]

59 where P(y | X_i = miss) is the class prior probability across all data points for which attribute i is not missing. [sent-248, score-0.261]
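The per-value rule above is easy to state in code; the sketch below assumes discretised (integer-coded) attributes and plain empirical frequencies, and leaves out any handling of missing values and any normalisation of the resulting weights, since those details are not recoverable from this summary.

```python
import numpy as np

def per_value_weights(X, y, alpha=2.0):
    """w_ij proportional to ( sum_y |P(y | X_i = j) - 1/|Y||^alpha )^(1/alpha):
    how far the class distribution for value j of attribute i departs from uniform."""
    classes = np.unique(y)
    uniform = 1.0 / len(classes)
    weights = []
    for i in range(X.shape[1]):
        xi = X[:, i]
        w_i = {}
        for j in np.unique(xi):
            mask = xi == j
            p_y_given_j = np.array([np.mean(y[mask] == c) for c in classes])
            w_i[j] = np.sum(np.abs(p_y_given_j - uniform) ** alpha) ** (1.0 / alpha)
        weights.append(w_i)
    return weights  # one {value: weight} dict per attribute, up to a proportionality constant
```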

60 Many researchers have investigated techniques for extending the basic naive Bayes independence model with a small number of additional dependencies between attributes in order to improve classification performance (Zheng and Webb, 2000). [sent-251, score-0.204]

61 In this paper we are interested in improving performance of the NB classifier by reducing the effect of attribute independence violations through attribute weighting. [sent-263, score-0.541]

62 While it is conceivable that semi-naive Bayes methods and more general Bayesian Network classifier learning could also benefit from attribute weighting, we leave its investigation to future work. [sent-265, score-0.261]

63 WANBIA. Many previous approaches to attribute weighting for naive Bayes have found weights using some form of mechanism that increases the weights of attributes that are highly predictive of the class and decreases the weights of attributes that are less predictive of the class. [sent-270, score-0.66]

64 Naive Bayes delivers Bayes optimal classification if the attribute independence assumption holds. [sent-272, score-0.28]

65 Weighting should only be applied to remedy violations of the attribute independence assumption. [sent-273, score-0.28]

66 In contrast, a method that uses a measure such as mutual information with the class to weight the attribute will … (remainder lost to the table listing the compared methods: NB, GRW, SBC, MH, SB, CFS, SBC-FS, MH-FS, FNB-d1, FNB-d2, AnDE, TAN, RF, LR, WANBIACLL, WANBIAMSE). [sent-278, score-0.282]

67 Use gain ratio as attribute weights in naive Bayes, shown in Equation 7 (Zhang and Sheng, 2004). [sent-281, score-0.4]

68 Assign weight to attribute i as given in Equation 8 where L = 5 with a bootstrap size of 10%. [sent-282, score-0.282]

69 Assign weight to attribute i as given in Equation 8 where L = 10 with a bootstrap size of 50% (Hall, 2007). [sent-284, score-0.282]

70 Weights computed per attribute value using Equation 10 with α = 2. [sent-292, score-0.261]

71 Weights computed per attribute value using Equation 10 with α = 2. [sent-293, score-0.261]

72 Following from Equations 1, 2 and 5, let us re-define the weighted naive Bayes model as: P̂(y | x; π, Θ, w) = π_y ∏_i θ_{X_i = x_i | y}^{w_i} / ( ∑_{y′} π_{y′} ∏_i θ_{X_i = x_i | y′}^{w_i} ), (10) with constraints ∑_y π_y = 1 and, for all y and i, ∑_j θ_{X_i = j | y} = 1, where • {π_y, θ_{X_i = x_i | y}} are the naive Bayes parameters. [sent-309, score-0.235]

73 • w is a vector of class-independent weights, wi for each attribute i. [sent-312, score-0.282]
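The posterior in Equation 10 can be evaluated directly once the naive Bayes parameters and the weight vector are available; the sketch below does so in log space for numerical stability and is an illustration of the model rather than the authors' code.

```python
import numpy as np

def wanbia_posterior(x, log_pi, log_theta, w):
    """P(y | x; pi, Theta, w) for the weighted NB model of Equation 10.

    log_pi    : log class priors, shape (|Y|,)
    log_theta : list of a arrays, log_theta[i][v, c] = log theta_{X_i = v | y = c}
    w         : class-independent attribute weights, shape (a,)
    """
    scores = log_pi.copy()
    for i, xi in enumerate(x):
        scores += w[i] * log_theta[i][xi, :]
    scores -= scores.max()            # stabilise before exponentiating
    post = np.exp(scores)
    return post / post.sum()          # the normalisation is the denominator of Equation 10
```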

74 The naive Bayes (NB) model for estimating P(y | x) is parameterized by a class probability vector π ∈ [0, 1]^{|Y|} and a matrix Θ consisting of class- and attribute-dependent probability vectors θ_{i,y} ∈ [0, 1]^{|X_i|}. [sent-340, score-0.371]

75 Parameters of Weighted Attribute Naive Bayes. The WANBIA model is an extension of the NB model where we introduce a weight vector w ∈ [0, 1]^a containing a class-independent weight w_i for each attribute i. [sent-352, score-0.324]

76 Experiments. In this section, we compare the performance of our proposed methods WANBIACLL and WANBIAMSE with state-of-the-art classifiers, existing semi-naive Bayes methods, and weighted naive Bayes methods based on both attribute selection and attribute weighting. [sent-365, score-0.647]

77 Note that unlike the NB model, the LR model does not require that the domain of attribute values be discrete. [sent-369, score-0.261]

78 Nondiscrete data can also be handled by Naive Bayes models, but a different parameterization for the distribution over attribute values must be used. [sent-370, score-0.261]

79 • Probability Estimates - The base probabilities of each algorithm are estimated using m-estimation, since in our initial experiments it leads to more accurate probabilities than Laplace estimation for naive Bayes and weighted naive Bayes. [sent-455, score-0.235]

80 Probabilities take the general m-estimate form (N_{x_i, y} + m·p) / (N_y + m) (Equation 20), where N_{x_i, y} is the count of data points with attribute value x_i and class label y, N_y is the count of data points with class label y, p is a prior estimate of P(x_i | y), and m is the smoothing parameter. [sent-458, score-0.277]
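As a concrete illustration, here is a sketch of m-estimated conditional probabilities; the uniform prior m/|X_i| is an assumed choice, since the exact prior term of Equation 20 is not recoverable from the extracted text.

```python
import numpy as np

def m_estimate_conditionals(X, y, m=1.0):
    """m-estimate of P(x_i | y): (N_{x_i,y} + m * prior) / (N_y + m),
    with an assumed uniform prior of 1/|X_i| over the values of attribute i."""
    classes = np.unique(y)
    tables = []
    for i in range(X.shape[1]):
        values = np.unique(X[:, i])
        table = np.zeros((len(values), len(classes)))
        for vi, v in enumerate(values):
            for ci, c in enumerate(classes):
                n_vy = np.sum((X[:, i] == v) & (y == c))
                n_y = np.sum(y == c)
                table[vi, ci] = (n_vy + m / len(values)) / (n_y + m)
        tables.append(table)
    return tables  # tables[i][v, c] approximates P(X_i = v | y = c)
```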

81 Table 5 compares the WDL of naive Bayes in Equation 3 with weighted naive Bayes in Equation 6 as the weight w is varied from 0. [sent-561, score-0.256]

82 Single versus Multiple Naive Bayes Weights Learning. To study the effect of single versus multiple weight learning for naive Bayes (naive Bayes in Equation 5 versus naive Bayes in Equation 6), we constrained WANBIA to learn only a single weight for all attributes. [sent-853, score-0.262]

83 A1DE relaxes NB's attribute independence assumption by (only) making each attribute independent given the class and one attribute. [sent-962, score-0.541]

84 As A1DE is based on learning without search, every attribute takes a turn to be a super-parent. [sent-1020, score-0.261]

85 Similarly, TAN augments the naive Bayes structure by allowing each attribute to depend on at most one non-class attribute. [sent-1022, score-0.371]

86 The estimate is: P̂(y, x) = P̂(y) ∏_{i=1}^{a} P̂(x_i | y, π(x_i)), where π(x_i) is the parent of attribute x_i. [sent-1024, score-0.277]
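For comparison with the weighted NB estimate earlier, the sketch below evaluates the TAN joint estimate for a given parent assignment; how the parents are chosen is outside this summary and is taken as given, and the convention for the root attribute (a dummy parent value of 0) is an assumption for illustration.

```python
import numpy as np

def tan_joint(x, p_y, cond_tables, parent):
    """TAN estimate P(y, x) = P(y) * prod_i P(x_i | y, x_parent(i)).

    cond_tables[i][v, vp, c] approximates P(X_i = v | y = c, X_parent(i) = vp);
    parent[i] is the index of the parent attribute of i, or -1 for the root,
    whose table is indexed with a dummy parent value of 0.
    """
    joint = p_y.copy()
    for i, xi in enumerate(x):
        xp = x[parent[i]] if parent[i] >= 0 else 0
        joint *= cond_tables[i][xi, xp, :]
    return joint  # unnormalised score per class
```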

87 This performance gain is attributed to the fact that WANBIA successfully alleviates the conditional attribute independence assumption. [sent-1155, score-0.28]

88 • WANBIA has significantly better 0-1 loss, RMSE, bias and variance than most existing weighted naive Bayes schemes based on attribute selection and attribute weighting. [sent-1158, score-0.694]

89 As a result, WANBIA sets a new standard for attribute weighting for naive Bayes. [sent-1159, score-0.371]

90 However, it is credible that WANBIA’s strategy for alleviating NB’s attribute independence assumption is complementary to A1DE and TAN, allowing both to be applied to obtain even better classification accuracy. [sent-1162, score-0.28]

91 Our work has been primarily motivated by the observation that naive Bayes' conditional attribute independence assumption is often violated and, therefore, it is useful to alleviate it. [sent-1169, score-0.39]

92 We build an argument that in current research, weighting in naive Bayes has been viewed as a way of enhancing the impact of attributes that are highly correlated with the class. [sent-1170, score-0.237]

93 We argue that weighting provides a natural framework for alleviating the attribute independence assumption. [sent-1171, score-0.332]

94 Our two proposed naive Bayes weighting methods, WANBIACLL and WANBIAMSE, fix naive Bayes' parameters to be the MAP estimates and learn weights by maximizing the conditional log-likelihood and minimizing the mean squared error, respectively. [sent-1172, score-0.301]
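To make the WANBIA-CLL idea concrete, here is a minimal sketch in which the naive Bayes parameters are held fixed and only the weight vector is fitted by minimising the negative conditional log-likelihood. The choice of scipy's L-BFGS-B solver, the numerical gradients, and the [0, 1] box on the weights (matching the model definition in this summary) are simplifications for the sketch, not the authors' exact procedure.

```python
import numpy as np
from scipy.optimize import minimize

def fit_wanbia_cll_weights(X, y, log_pi, log_theta):
    """Learn class-independent attribute weights by minimising the negative
    conditional log-likelihood while keeping (log_pi, log_theta) fixed."""
    n, a = X.shape
    classes = np.unique(y)
    y_idx = np.searchsorted(classes, y)       # map labels to class indices

    def neg_cll(w):
        total = 0.0
        for x_row, yi in zip(X, y_idx):
            scores = log_pi.copy()
            for i, xi in enumerate(x_row):
                scores += w[i] * log_theta[i][xi, :]
            scores -= scores.max()            # log-sum-exp stabilisation
            total -= scores[yi] - np.log(np.exp(scores).sum())
        return total

    res = minimize(neg_cll, x0=np.ones(a), method="L-BFGS-B",
                   bounds=[(0.0, 1.0)] * a)
    return res.x
```

The analogous WANBIA-MSE variant would replace neg_cll with a mean squared error objective over the predicted class posteriors.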

95 In interpreting the results presented in this work, it is important to keep in mind that attribute weighting and semi-naive Bayesian relaxation of the attribute independence assumption are mutually compatible. [sent-1178, score-0.593]

96 It remains a direction for future research to explore techniques for attribute weighting in the context of semi-naive Bayes classifiers. [sent-1179, score-0.313]

97 A next step will be to learn a weight for each attribute value per class and a weight for the prior probabilities. [sent-1182, score-0.303]

98 For example, for the i-th attribute and y-th class, the weight term is β_{i,y} for LR and w_i log θ_{x_i | y} for WANBIA. [sent-1184, score-0.303]

99 A decision tree-based attribute weighting filter for naive Bayes. [sent-4917, score-0.44]

100 Attribute weighting via differential evolution algorithm for attribute weighted naive Bayes (wnb). [sent-5010, score-0.438]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('wanbia', 0.684), ('nb', 0.417), ('lr', 0.32), ('tan', 0.269), ('attribute', 0.261), ('naive', 0.11), ('bayes', 0.094), ('rmse', 0.087), ('attributes', 0.075), ('wanbiamse', 0.06), ('aidi', 0.055), ('arman', 0.055), ('ebb', 0.055), ('erquides', 0.055), ('wanbiacll', 0.055), ('weighting', 0.052), ('lleviating', 0.052), ('ssumption', 0.052), ('eighting', 0.045), ('ndependence', 0.045), ('cfs', 0.044), ('cll', 0.039), ('sbc', 0.038), ('rf', 0.036), ('medium', 0.034), ('bias', 0.033), ('grw', 0.031), ('weights', 0.029), ('webb', 0.029), ('mh', 0.025), ('magic', 0.024), ('bupa', 0.023), ('hepatitis', 0.023), ('labor', 0.023), ('langley', 0.023), ('pioneer', 0.023), ('shuttle', 0.023), ('spambase', 0.023), ('syncon', 0.023), ('volcanoes', 0.023), ('zoo', 0.023), ('selective', 0.023), ('classi', 0.023), ('pendigits', 0.022), ('yx', 0.022), ('sb', 0.022), ('weight', 0.021), ('anneal', 0.021), ('autos', 0.021), ('bcw', 0.021), ('cmc', 0.021), ('crx', 0.021), ('hypo', 0.021), ('lyn', 0.021), ('mush', 0.021), ('optdigits', 0.021), ('pid', 0.021), ('tae', 0.021), ('wi', 0.021), ('adult', 0.02), ('satellite', 0.02), ('thyroid', 0.02), ('vowel', 0.02), ('dermatology', 0.02), ('echocardiogram', 0.02), ('haberman', 0.02), ('sonar', 0.02), ('independence', 0.019), ('ferreira', 0.018), ('ttt', 0.018), ('membership', 0.018), ('abalone', 0.018), ('nursery', 0.018), ('cleveland', 0.018), ('hungarian', 0.018), ('ionosphere', 0.018), ('phoneme', 0.018), ('nemenyi', 0.018), ('promoters', 0.018), ('ptn', 0.018), ('logistic', 0.017), ('decision', 0.017), ('chess', 0.017), ('vehicle', 0.017), ('ers', 0.016), ('fs', 0.016), ('sick', 0.016), ('ande', 0.016), ('cerquides', 0.016), ('sizebottom', 0.016), ('zheng', 0.016), ('xi', 0.016), ('friedman', 0.015), ('iris', 0.015), ('yeast', 0.015), ('weighted', 0.015), ('variance', 0.014), ('car', 0.014), ('wine', 0.014), ('localization', 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 12 jmlr-2013-Alleviating Naive Bayes Attribute Independence Assumption by Attribute Weighting

Author: Nayyar A. Zaidi, Jesús Cerquides, Mark J. Carman, Geoffrey I. Webb

Abstract: Despite the simplicity of the Naive Bayes classifier, it has continued to perform well against more sophisticated newcomers and has remained, therefore, of great interest to the machine learning community. Of numerous approaches to refining the naive Bayes classifier, attribute weighting has received less attention than it warrants. Most approaches, perhaps influenced by attribute weighting in other machine learning algorithms, use weighting to place more emphasis on highly predictive attributes than those that are less predictive. In this paper, we argue that for naive Bayes attribute weighting should instead be used to alleviate the conditional independence assumption. Based on this premise, we propose a weighted naive Bayes algorithm, called WANBIA, that selects weights to minimize either the negative conditional log likelihood or the mean squared error objective functions. We perform extensive evaluations and find that WANBIA is a competitive alternative to state of the art classifiers like Random Forest, Logistic Regression and A1DE. Keywords: classification, naive Bayes, attribute independence assumption, weighted naive Bayes classification

2 0.089875937 85 jmlr-2013-Pairwise Likelihood Ratios for Estimation of Non-Gaussian Structural Equation Models

Author: Aapo Hyvärinen, Stephen M. Smith

Abstract: We present new measures of the causal direction, or direction of effect, between two non-Gaussian random variables. They are based on the likelihood ratio under the linear non-Gaussian acyclic model (LiNGAM). We also develop simple first-order approximations of the likelihood ratio and analyze them based on related cumulant-based measures, which can be shown to find the correct causal directions. We show how to apply these measures to estimate LiNGAM for more than two variables, and even in the case of more variables than observations. We further extend the method to cyclic and nonlinear models. The proposed framework is statistically at least as good as existing ones in the cases of few data points or noisy data, and it is computationally and conceptually very simple. Results on simulated fMRI data indicate that the method may be useful in neuroimaging where the number of time points is typically quite small. Keywords: structural equation model, Bayesian network, non-Gaussianity, causality, independent component analysis

3 0.086174041 89 jmlr-2013-QuantMiner for Mining Quantitative Association Rules

Author: Ansaf Salleb-Aouissi, Christel Vrain, Cyril Nortet, Xiangrong Kong, Vivek Rathod, Daniel Cassard

Abstract: In this paper, we propose Q UANT M INER, a mining quantitative association rules system. This system is based on a genetic algorithm that dynamically discovers “good” intervals in association rules by optimizing both the support and the confidence. The experiments on real and artificial databases have shown the usefulness of Q UANT M INER as an interactive, exploratory data mining tool. Keywords: association rules, numerical and categorical attributes, unsupervised discretization, genetic algorithm, simulated annealing

4 0.053165179 42 jmlr-2013-Fast Generalized Subset Scan for Anomalous Pattern Detection

Author: Edward McFowland III, Skyler Speakman, Daniel B. Neill

Abstract: We propose Fast Generalized Subset Scan (FGSS), a new method for detecting anomalous patterns in general categorical data sets. We frame the pattern detection problem as a search over subsets of data records and attributes, maximizing a nonparametric scan statistic over all such subsets. We prove that the nonparametric scan statistics possess a novel property that allows for efficient optimization over the exponentially many subsets of the data without an exhaustive search, enabling FGSS to scale to massive and high-dimensional data sets. We evaluate the performance of FGSS in three real-world application domains (customs monitoring, disease surveillance, and network intrusion detection), and demonstrate that FGSS can successfully detect and characterize relevant patterns in each domain. As compared to three other recently proposed detection algorithms, FGSS substantially decreased run time and improved detection power for massive multivariate data sets. Keywords: pattern detection, anomaly detection, knowledge discovery, Bayesian networks, scan statistics

5 0.023534654 22 jmlr-2013-Classifying With Confidence From Incomplete Information

Author: Nathan Parrish, Hyrum S. Anderson, Maya R. Gupta, Dun Yu Hsiao

Abstract: We consider the problem of classifying a test sample given incomplete information. This problem arises naturally when data about a test sample is collected over time, or when costs must be incurred to compute the classification features. For example, in a distributed sensor network only a fraction of the sensors may have reported measurements at a certain time, and additional time, power, and bandwidth is needed to collect the complete data to classify. A practical goal is to assign a class label as soon as enough data is available to make a good decision. We formalize this goal through the notion of reliability—the probability that a label assigned given incomplete data would be the same as the label assigned given the complete data, and we propose a method to classify incomplete data only if some reliability threshold is met. Our approach models the complete data as a random variable whose distribution is dependent on the current incomplete data and the (complete) training data. The method differs from standard imputation strategies in that our focus is on determining the reliability of the classification decision, rather than just the class label. We show that the method provides useful reliability estimates of the correctness of the imputed class labels on a set of experiments on time-series data sets, where the goal is to classify the time-series as early as possible while still guaranteeing that the reliability threshold is met. Keywords: classification, sensor networks, signals, reliability

6 0.018724892 8 jmlr-2013-A Theory of Multiclass Boosting

7 0.01759454 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach

8 0.016747858 73 jmlr-2013-Multicategory Large-Margin Unified Machines

9 0.016688015 52 jmlr-2013-How to Solve Classification and Regression Problems on High-Dimensional Data with a Supervised Extension of Slow Feature Analysis

10 0.01480098 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems

11 0.014198742 76 jmlr-2013-Nonparametric Sparsity and Regularization

12 0.013717206 31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting

13 0.013597163 33 jmlr-2013-Dimension Independent Similarity Computation

14 0.012742317 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut

15 0.012416026 77 jmlr-2013-On the Convergence of Maximum Variance Unfolding

16 0.012126769 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion

17 0.011772245 84 jmlr-2013-PC Algorithm for Nonparanormal Graphical Models

18 0.011719499 98 jmlr-2013-Segregating Event Streams and Noise with a Markov Renewal Process Model

19 0.011643461 93 jmlr-2013-Random Walk Kernels and Learning Curves for Gaussian Process Regression on Random Graphs

20 0.011454578 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.064), (1, -0.0), (2, -0.013), (3, 0.018), (4, 0.028), (5, 0.023), (6, 0.004), (7, 0.016), (8, -0.042), (9, -0.021), (10, 0.009), (11, -0.13), (12, -0.02), (13, -0.265), (14, -0.039), (15, 0.016), (16, -0.278), (17, 0.104), (18, -0.014), (19, -0.139), (20, -0.072), (21, 0.099), (22, 0.115), (23, 0.055), (24, -0.266), (25, -0.031), (26, -0.18), (27, -0.013), (28, -0.105), (29, 0.254), (30, -0.07), (31, 0.049), (32, -0.138), (33, 0.099), (34, -0.138), (35, -0.106), (36, 0.066), (37, -0.059), (38, 0.108), (39, 0.074), (40, 0.016), (41, -0.057), (42, 0.122), (43, 0.066), (44, 0.016), (45, 0.138), (46, 0.012), (47, -0.065), (48, 0.04), (49, -0.064)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97212583 12 jmlr-2013-Alleviating Naive Bayes Attribute Independence Assumption by Attribute Weighting

Author: Nayyar A. Zaidi, Jesús Cerquides, Mark J. Carman, Geoffrey I. Webb

Abstract: Despite the simplicity of the Naive Bayes classifier, it has continued to perform well against more sophisticated newcomers and has remained, therefore, of great interest to the machine learning community. Of numerous approaches to refining the naive Bayes classifier, attribute weighting has received less attention than it warrants. Most approaches, perhaps influenced by attribute weighting in other machine learning algorithms, use weighting to place more emphasis on highly predictive attributes than those that are less predictive. In this paper, we argue that for naive Bayes attribute weighting should instead be used to alleviate the conditional independence assumption. Based on this premise, we propose a weighted naive Bayes algorithm, called WANBIA, that selects weights to minimize either the negative conditional log likelihood or the mean squared error objective functions. We perform extensive evaluations and find that WANBIA is a competitive alternative to state of the art classifiers like Random Forest, Logistic Regression and A1DE. Keywords: classification, naive Bayes, attribute independence assumption, weighted naive Bayes classification

2 0.65831667 85 jmlr-2013-Pairwise Likelihood Ratios for Estimation of Non-Gaussian Structural Equation Models

Author: Aapo Hyvärinen, Stephen M. Smith

Abstract: We present new measures of the causal direction, or direction of effect, between two non-Gaussian random variables. They are based on the likelihood ratio under the linear non-Gaussian acyclic model (LiNGAM). We also develop simple first-order approximations of the likelihood ratio and analyze them based on related cumulant-based measures, which can be shown to find the correct causal directions. We show how to apply these measures to estimate LiNGAM for more than two variables, and even in the case of more variables than observations. We further extend the method to cyclic and nonlinear models. The proposed framework is statistically at least as good as existing ones in the cases of few data points or noisy data, and it is computationally and conceptually very simple. Results on simulated fMRI data indicate that the method may be useful in neuroimaging where the number of time points is typically quite small. Keywords: structural equation model, Bayesian network, non-Gaussianity, causality, independent component analysis

3 0.42641136 89 jmlr-2013-QuantMiner for Mining Quantitative Association Rules

Author: Ansaf Salleb-Aouissi, Christel Vrain, Cyril Nortet, Xiangrong Kong, Vivek Rathod, Daniel Cassard

Abstract: In this paper, we propose Q UANT M INER, a mining quantitative association rules system. This system is based on a genetic algorithm that dynamically discovers “good” intervals in association rules by optimizing both the support and the confidence. The experiments on real and artificial databases have shown the usefulness of Q UANT M INER as an interactive, exploratory data mining tool. Keywords: association rules, numerical and categorical attributes, unsupervised discretization, genetic algorithm, simulated annealing

4 0.37805915 42 jmlr-2013-Fast Generalized Subset Scan for Anomalous Pattern Detection

Author: Edward McFowland III, Skyler Speakman, Daniel B. Neill

Abstract: We propose Fast Generalized Subset Scan (FGSS), a new method for detecting anomalous patterns in general categorical data sets. We frame the pattern detection problem as a search over subsets of data records and attributes, maximizing a nonparametric scan statistic over all such subsets. We prove that the nonparametric scan statistics possess a novel property that allows for efficient optimization over the exponentially many subsets of the data without an exhaustive search, enabling FGSS to scale to massive and high-dimensional data sets. We evaluate the performance of FGSS in three real-world application domains (customs monitoring, disease surveillance, and network intrusion detection), and demonstrate that FGSS can successfully detect and characterize relevant patterns in each domain. As compared to three other recently proposed detection algorithms, FGSS substantially decreased run time and improved detection power for massive multivariate data sets. Keywords: pattern detection, anomaly detection, knowledge discovery, Bayesian networks, scan statistics

5 0.1751367 67 jmlr-2013-MLPACK: A Scalable C++ Machine Learning Library

Author: Ryan R. Curtin, James R. Cline, N. P. Slagle, William B. March, Parikshit Ram, Nishant A. Mehta, Alexander G. Gray

Abstract: MLPACK is a state-of-the-art, scalable, multi-platform C++ machine learning library released in late 2011 offering both a simple, consistent API accessible to novice users and high performance and flexibility to expert users by leveraging modern features of C++. MLPACK provides cutting-edge algorithms whose benchmarks exhibit far better performance than other leading machine learning libraries. MLPACK version 1.0.3, licensed under the LGPL, is available at http://www.mlpack.org. Keywords: C++, dual-tree algorithms, machine learning software, open source software, largescale learning (The remainder of this entry overran into the paper body: an introduction, the list of methods included in version 1.0.3, benchmark tables, and an API example.)

6 0.16725385 22 jmlr-2013-Classifying With Confidence From Incomplete Information

7 0.13262017 73 jmlr-2013-Multicategory Large-Margin Unified Machines

8 0.11970364 15 jmlr-2013-Bayesian Canonical Correlation Analysis

9 0.11938871 31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting

10 0.11118052 33 jmlr-2013-Dimension Independent Similarity Computation

11 0.10706795 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems

12 0.10424669 66 jmlr-2013-MAGIC Summoning: Towards Automatic Suggesting and Testing of Gestures With Low Probability of False Positives During Use

13 0.10410181 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion

14 0.10212845 92 jmlr-2013-Random Spanning Trees and the Prediction of Weighted Graphs

15 0.10034028 68 jmlr-2013-Machine Learning with Operational Costs

16 0.09861134 115 jmlr-2013-Training Energy-Based Models for Time-Series Imputation

17 0.097254284 57 jmlr-2013-Kernel Bayes' Rule: Bayesian Inference with Positive Definite Kernels

18 0.090916201 52 jmlr-2013-How to Solve Classification and Regression Problems on High-Dimensional Data with a Supervised Extension of Slow Feature Analysis

19 0.087875322 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning

20 0.08140295 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.011), (5, 0.08), (6, 0.028), (10, 0.042), (20, 0.027), (23, 0.018), (68, 0.021), (70, 0.015), (75, 0.04), (85, 0.012), (87, 0.017), (89, 0.553)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.80615193 12 jmlr-2013-Alleviating Naive Bayes Attribute Independence Assumption by Attribute Weighting

Author: Nayyar A. Zaidi, Jesús Cerquides, Mark J. Carman, Geoffrey I. Webb

Abstract: Despite the simplicity of the Naive Bayes classifier, it has continued to perform well against more sophisticated newcomers and has remained, therefore, of great interest to the machine learning community. Of numerous approaches to refining the naive Bayes classifier, attribute weighting has received less attention than it warrants. Most approaches, perhaps influenced by attribute weighting in other machine learning algorithms, use weighting to place more emphasis on highly predictive attributes than those that are less predictive. In this paper, we argue that for naive Bayes attribute weighting should instead be used to alleviate the conditional independence assumption. Based on this premise, we propose a weighted naive Bayes algorithm, called WANBIA, that selects weights to minimize either the negative conditional log likelihood or the mean squared error objective functions. We perform extensive evaluations and find that WANBIA is a competitive alternative to state of the art classifiers like Random Forest, Logistic Regression and A1DE. Keywords: classification, naive Bayes, attribute independence assumption, weighted naive Bayes classification

2 0.74401498 63 jmlr-2013-Learning Trees from Strings: A Strong Learning Algorithm for some Context-Free Grammars

Author: Alexander Clark

Abstract: Standard models of language learning are concerned with weak learning: the learner, receiving as input only information about the strings in the language, must learn to generalise and to generate the correct, potentially infinite, set of strings generated by some target grammar. Here we define the corresponding notion of strong learning: the learner, again only receiving strings as input, must learn a grammar that generates the correct set of structures or parse trees. We formalise this using a modification of Gold’s identification in the limit model, requiring convergence to a grammar that is isomorphic to the target grammar. We take as our starting point a simple learning algorithm for substitutable context-free languages, based on principles of distributional learning, and modify it so that it will converge to a canonical grammar for each language. We prove a corresponding strong learning result for a subclass of context-free grammars. Keywords: context-free grammars, grammatical inference, identification in the limit, structure learning

3 0.6277225 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning

Author: Markus Thom, Günther Palm

Abstract: Sparseness is a useful regularizer for learning in a wide range of applications, in particular in neural networks. This paper proposes a model targeted at classification tasks, where sparse activity and sparse connectivity are used to enhance classification capabilities. The tool for achieving this is a sparseness-enforcing projection operator which finds the closest vector with a pre-defined sparseness for any given vector. In the theoretical part of this paper, a comprehensive theory for such a projection is developed. In conclusion, it is shown that the projection is differentiable almost everywhere and can thus be implemented as a smooth neuronal transfer function. The entire model can hence be tuned end-to-end using gradient-based methods. Experiments on the MNIST database of handwritten digits show that classification performance can be boosted by sparse activity or sparse connectivity. With a combination of both, performance can be significantly better compared to classical non-sparse approaches. Keywords: supervised learning, sparseness projection, sparse activity, sparse connectivity

4 0.23038052 51 jmlr-2013-Greedy Sparsity-Constrained Optimization

Author: Sohail Bahmani, Bhiksha Raj, Petros T. Boufounos

Abstract: Sparsity-constrained optimization has wide applicability in machine learning, statistics, and signal processing problems such as feature selection and Compressed Sensing. A vast body of work has studied the sparsity-constrained optimization from theoretical, algorithmic, and application aspects in the context of sparse estimation in linear models where the fidelity of the estimate is measured by the squared error. In contrast, relatively less effort has been made in the study of sparsityconstrained optimization in cases where nonlinear models are involved or the cost function is not quadratic. In this paper we propose a greedy algorithm, Gradient Support Pursuit (GraSP), to approximate sparse minima of cost functions of arbitrary form. Should a cost function have a Stable Restricted Hessian (SRH) or a Stable Restricted Linearization (SRL), both of which are introduced in this paper, our algorithm is guaranteed to produce a sparse vector within a bounded distance from the true sparse optimum. Our approach generalizes known results for quadratic cost functions that arise in sparse linear regression and Compressed Sensing. We also evaluate the performance of GraSP through numerical simulations on synthetic and real data, where the algorithm is employed for sparse logistic regression with and without ℓ2 -regularization. Keywords: sparsity, optimization, compressed sensing, greedy algorithm

5 0.22303367 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

Author: Tony Cai, Wen-Xin Zhou

Abstract: We consider in this paper the problem of noisy 1-bit matrix completion under a general non-uniform sampling distribution using the max-norm as a convex relaxation for the rank. A max-norm constrained maximum likelihood estimate is introduced and studied. The rate of convergence for the estimate is obtained. Information-theoretical methods are used to establish a minimax lower bound under the general sampling model. The minimax upper and lower bounds together yield the optimal rate of convergence for the Frobenius norm loss. Computational algorithms and numerical performance are also discussed. Keywords: 1-bit matrix completion, low-rank matrix, max-norm, trace-norm, constrained optimization, maximum likelihood estimate, optimal rate of convergence

6 0.2230332 59 jmlr-2013-Large-scale SVD and Manifold Learning

7 0.21625027 52 jmlr-2013-How to Solve Classification and Regression Problems on High-Dimensional Data with a Supervised Extension of Slow Feature Analysis

8 0.21327713 34 jmlr-2013-Distance Preserving Embeddings for General n-Dimensional Manifolds

9 0.21295175 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression

10 0.21201745 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components

11 0.21123627 74 jmlr-2013-Multivariate Convex Regression with Adaptive Partitioning

12 0.20875515 14 jmlr-2013-Asymptotic Results on Adaptive False Discovery Rate Controlling Procedures Based on Kernel Estimators

13 0.20795947 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees

14 0.20718321 20 jmlr-2013-CODA: High Dimensional Copula Discriminant Analysis

15 0.20667641 92 jmlr-2013-Random Spanning Trees and the Prediction of Weighted Graphs

16 0.20570341 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso

17 0.20531496 82 jmlr-2013-Optimally Fuzzy Temporal Memory

18 0.20465405 64 jmlr-2013-Lovasz theta function, SVMs and Finding Dense Subgraphs

19 0.20445915 33 jmlr-2013-Dimension Independent Similarity Computation

20 0.20399271 38 jmlr-2013-Dynamic Affine-Invariant Shape-Appearance Handshape Features and Classification in Sign Language Videos