nips nips2012 nips2012-200 knowledge-graph by maker-knowledge-mining

200 nips-2012-Local Supervised Learning through Space Partitioning


Source: pdf

Author: Joseph Wang, Venkatesh Saligrama

Abstract: We develop a novel approach for supervised learning based on adaptively partitioning the feature space into different regions and learning local region-specific classifiers. We formulate an empirical risk minimization problem that incorporates both partitioning and classification into a single global objective. We show that space partitioning can be equivalently reformulated as a supervised learning problem and consequently any discriminative learning method can be utilized in conjunction with our approach. Nevertheless, we consider locally linear schemes by learning linear partitions and linear region classifiers. Locally linear schemes can not only approximate complex decision boundaries and ensure low training error but also provide tight control on over-fitting and generalization error. We train locally linear classifiers by using LDA, logistic regression and perceptrons, and so our scheme is scalable to large data sizes and high dimensions. We present experimental results demonstrating improved performance over state-of-the-art classification techniques on benchmark datasets. We also show improved robustness to label noise.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract We develop a novel approach for supervised learning based on adaptively partitioning the feature space into different regions and learning local region-specific classifiers. [sent-5, score-0.645]

2 We formulate an empirical risk minimization problem that incorporates both partitioning and classification into a single global objective. [sent-6, score-0.408]

3 We show that space partitioning can be equivalently reformulated as a supervised learning problem and consequently any discriminative learning method can be utilized in conjunction with our approach. [sent-7, score-0.411]

4 Nevertheless, we consider locally linear schemes by learning linear partitions and linear region classifiers. [sent-8, score-0.57]

5 Locally linear schemes can not only approximate complex decision boundaries and ensure low training error but also provide tight control on over-fitting and generalization error. [sent-9, score-0.474]

6 We train locally linear classifiers by using LDA, logistic regression and perceptrons, and so our scheme is scalable to large data sizes and high-dimensions. [sent-10, score-0.349]

7 1 Introduction We develop a novel approach for supervised learning based on adaptively partitioning the feature space into different regions and learning local region classifiers. [sent-13, score-0.79]

8 Each reject classifier, gj , makes a binary decision and the observation is either classified by the associated region classifier, fj , or passed to the next reject classifier. [sent-17, score-0.938]

9 Each reject classifier, gj , thus partitions the feature space into regions. [sent-18, score-0.466]

10 The region classifier fj operates only on examples within the local region that is consistent with the reject classifier partitions. [sent-19, score-0.666]

11 We incorporate both feature space partitioning (reject classifiers) and region-specific classifiers into a single global empirical risk/loss function. [sent-20, score-0.44]

12 In this context we show that each step of the coordinate descent can be reformulated as a supervised learning problem that seeks to optimize a 0/1 empirical loss function. [sent-22, score-0.272]

13 This result is somewhat surprising in the context of partitioning and has broader implications. [sent-23, score-0.28]

14 First, we can now solve feature space partitioning through empirical risk minimization (ERM), and so powerful existing methods, including boosting, decision trees and kernel methods, can be used in conjunction with our approach for training flexible partitioning classifiers. [sent-24, score-0.933]

15 Second, because data is usually locally “well-behaved,” simpler region-classifiers, such as linear classifiers, often suffice for controlling local empirical error. [sent-25, score-0.291]

16 Furthermore, since complex boundaries for partitions can be approximated by piecewise linear functions, feature spaces can be partitioned to arbitrary degree of precision using linear boundaries (reject classifiers). [sent-26, score-0.533]

17 Thus the combination of piecewise linear partitions along with linear region classifiers has the ability to adapt to complex data sets leading to low training error. [sent-27, score-0.569]

18 Reject Classifiers, gj (x), partition space and region classifiers, fj (x), are applied locally within the partitioned region. [sent-29, score-0.455]

19 We use linear perceptrons and logistic regression for training the partitioning classifiers and the region classifiers. [sent-31, score-0.733]

20 Our scheme splits the space into 3 regions and, unlike AdaBoost, does not overtrain. [sent-32, score-0.214]

21 Generalization error can be controlled by limiting the number of linear partitions and linear region classifiers, since the VC dimension of such a structure is reasonably small. [sent-33, score-0.375]

22 Fig. 1 (right) demonstrates the substantial benefits of our approach on the banana dataset [1] over competing methods such as boosting and decision trees, both of which evidently overtrain. [sent-36, score-0.295]

23 Limiting reject and region classifiers to linear methods has computational advantages as well. [sent-37, score-0.407]

24 Since the datasets are locally well-behaved we can locally train with linear discriminant analysis (LDA), logistic regression and variants of perceptrons. [sent-38, score-0.501]

25 Indeed, we present some evidence that the partitioning step can adaptively cluster the dataset into groups, letting the region classifiers operate on simpler problems. [sent-42, score-0.474]

26 Additionally, linear methods such as LDA, logistic regression, and the perceptron naturally extend to multi-class problems, leading to computationally efficient and statistically meaningful results, as evidenced on challenging datasets with performance improvements over state-of-the-art techniques. [sent-43, score-0.309]

27 Boosting algorithms [2] learn complex decision boundaries characterized as a weighted linear combination of weak classifiers. [sent-46, score-0.315]

28 In contrast our method takes unions and intersections of simpler decision regions to learn more complex decision boundaries. [sent-47, score-0.47]

29 Decision trees are built by greedily partitioning the feature space [3]. [sent-49, score-0.439]

30 One main difference is that decision trees typically attempt to greedily minimize some loss or a heuristic, such as region purity or entropy, at each split/partition of the feature space. [sent-50, score-0.475]

31 Also, decision trees typically split/partition on a single feature/component, resulting in unions of rectangular decision regions; in contrast, we allow arbitrary partitions, leading to complex decision regions. [sent-52, score-0.677]

32 In these methods a multiclass problem is decomposed into several binary problems using a code matrix and the predicted outcomes of these binary problems are fused to obtain multiclass labels. [sent-54, score-0.424]

33 We present evidence (see Fig. 3) that suggests that our space partitioning classifier groups/clusters multiple classes into different regions; nevertheless, our formulation is different in that we do not explicitly code classes into different regions and our method does not require fusion of intermediate outcomes. [sent-57, score-0.439]

34 This is because we show that space partitioning itself can be re-formulated as a supervised learning problem. [sent-59, score-0.378]

35 Consequently, any existing method, including boosting and decision trees, could be used as a method of choice for learning space partitioning and region-specific decision functions. [sent-60, score-0.696]

36 We use simple linear classifiers for partitioning and region-classifiers in many of our experiments. [sent-61, score-0.337]

37 A recently proposed alternative approach is to build a global classifier ignoring clusters of errors, and then to build separate classifiers in each error-cluster region [11]. [sent-71, score-0.256]

38 This approach greedily approximates a piecewise linear classifier in this manner; however, it fails to take into account the performance of the classifiers in the error-cluster regions. [sent-72, score-0.229]

39 Our approach (Fig. 1) adaptively partitions data into regions and fits simple classifiers within each region. [sent-85, score-0.292]

40 We predict the output for a test sample, x, based on the output of the trained simple classifier associated with the region x belongs to. [sent-86, score-0.256]

41 In the sequel we formulate space partitioning and region-classification into a single objective and show that space partitioning is equivalent to solving a binary classification problem with 0/1 empirical loss. [sent-88, score-0.74]

42 1 Binary Space Partitioning as Supervised Learning In this section we consider learning binary space partitioning for ease of exposition. [sent-90, score-0.37]

43 The empirical risk/loss associated with the binary space-partitioned classifiers is given by: R(g, f0, f1) = (1/n) Σ_{i=1..n} 1{g(xi) = 0} 1{f0(xi) ≠ yi} + (1/n) Σ_{i=1..n} 1{g(xi) = 1} 1{f1(xi) ≠ yi}   (1). Our goal is to minimize the empirical error jointly over the family of functions g(·) ∈ G and fi(·) ∈ F. [sent-94, score-0.247]
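
As a rough illustration of Eq. 1 (not part of the paper), the 0/1 empirical risk of a fixed partitioner g and region classifiers f0, f1 can be evaluated directly from their predictions; the Python sketch below assumes vectorized predict functions and is only meant to make the bookkeeping concrete.

    import numpy as np

    def empirical_risk(g, f0, f1, X, y):
        # Eq. 1: each example is charged to the region classifier that the
        # partitioning function g routes it to, and errors are counted 0/1.
        region = g(X)                          # 0 or 1 per example
        err0 = (region == 0) & (f0(X) != y)    # errors made in region 0
        err1 = (region == 1) & (f1(X) != y)    # errors made in region 1
        return err0.mean() + err1.mean()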

44 From the above equation, when the partitioning function g(·) is fixed, it is clear how one can view choice of classifiers f0 (·) and f1 (·) as ERM problems. [sent-95, score-0.28]

45 We observe several aspects of our proposed scheme: (1) Binary partitioning is a binary classification problem on the training set (xi, ℓi), i = 1, 2, . [sent-105, score-0.403]

46 (3) The partitioning error is zero on a training example xi with weight wi = 1 if we choose g(xi ) = 0 on examples where f0 (xi ) = yi . [sent-112, score-0.683]

47 In contrast, if f0(xi) ≠ yi, the partitioning error can be reduced by choosing g(xi) = 1, thus rejecting the example from consideration by f0. [sent-113, score-0.42]
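
A minimal sketch of how the partitioning problem can be set up as a weighted binary classification problem, following the discussion above; the zero-weight rule (ignore points on which f0 and f1 are both right or both wrong, since routing cannot change the loss there) is an assumption consistent with the weighting described later for the multi-region case.

    import numpy as np

    def partition_training_set(f0_pred, f1_pred, y):
        # Pseudo-labels for g: prefer g(x)=0 where f0 is correct, g(x)=1 where it errs.
        wrong0 = (f0_pred != y)
        wrong1 = (f1_pred != y)
        labels = wrong0.astype(int)
        # Assumed weights: only points where the routing decision matters count.
        weights = (wrong0 != wrong1).astype(float)
        return labels, weights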

48 An important consequence of this reformulation is that we can now use powerful learning techniques such as decision trees, boosting and SVMs for learning space partitioning classifiers. [sent-116, score-0.272]

49 These surrogate losses are upper bounds for indicator losses and usually attempt to obtain large margins. [sent-121, score-0.268]

50 This is because even though each surrogate upper bounds indicator loss functions at each step, when put together they do not upper bound the global objective of Eq. [sent-124, score-0.255]

51 The empirical error can be upper bounded: 1{g(x) = 1} = 1{−h(x) ≤ 0} ≤ φ(−h(x)). We then form a global surrogate for the empirical loss function. [sent-132, score-0.374]

52 Replacing the indicators in Eq. 1 with surrogate functions, the global surrogate is given by: R̂(g, f0, f1) = (1/n) Σ_{i=1..n} φ(h(xi)) φ(yi f0(xi)) + (1/n) Σ_{i=1..n} φ(−h(xi)) φ(yi f1(xi)),   (3) which is an upper bound on Eq. [sent-134, score-0.284]
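
A sketch of the global surrogate in Eq. 3, instantiated with the logistic surrogate φ(z) = log2(1 + exp(−z)), which upper-bounds the indicator 1{z ≤ 0}; the choice of φ and the ±1 label encoding are assumptions made for illustration, not prescribed by the excerpt.

    import numpy as np

    def phi(z):
        # Logistic surrogate: phi(z) >= 1{z <= 0}.
        return np.log2(1.0 + np.exp(-z))

    def global_surrogate(h, f0, f1, y):
        # h, f0, f1: real-valued scores on the training set; y in {-1, +1}.
        return (phi(h) * phi(y * f0)).mean() + (phi(-h) * phi(y * f1)).mean()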

53 Optimizing the partitioning function g(·) can be posed as a supervised learning problem, resulting in the following lemma (see Supplementary for a proof): Lemma 2. [sent-136, score-0.346]

54 R(g1, . . . , gr−1, f1, . . . , fr) = (1/n) Σ_{i=1..n} L1(xi, yi)   (5). In the expression above we have made the dependence on reject classifiers and region-classifiers explicit. [sent-146, score-0.333]

55 We minimize Eq. 5 over all gj, fj by means of coordinate descent, namely, to optimize gk we hold fj, ∀j, and gj, j ≠ k, fixed. [sent-148, score-0.614]

56 For j = 1, . . . , r do: train region classifier fj(x) to optimize the 0/1 empirical loss of Eq. [sent-153, score-0.355]

57 For k = r − 1, . . . , 2, 1 do: train reject classifier gk(x) to optimize the 0/1 empirical loss of Eq. [sent-158, score-0.49]
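
Schematically, the coordinate-descent procedure above alternates between refitting the region classifiers and the reject classifiers while all other components are held fixed; the helper names train_region and train_reject below are hypothetical placeholders for whatever base learner (LDA, logistic regression, perceptron) is plugged in.

    def coordinate_descent(X, y, r, train_region, train_reject, max_iters=20):
        f = [None] * r         # region classifiers f_1..f_r
        g = [None] * (r - 1)   # reject classifiers g_1..g_{r-1}
        for _ in range(max_iters):
            for j in range(r):                  # forward pass over regions
                f[j] = train_region(X, y, f, g, region=j)
            for k in reversed(range(r - 1)):    # backward pass over reject stages
                g[k] = train_reject(X, y, f, g, stage=k)
            # a convergence check on the routing and labels would go here
        return f, g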

58 Consequently, we can also reduce this problem to a supervised learning problem: gk(·) = argmin_{g ∈ G} (1/n) Σ_{i=1..n} wi 1{g(xi) ≠ ℓi},   (7) where ℓi = 0 if fk(xi) = yi and ℓi = 1 if fk(xi) ≠ yi, and the weight wi is determined by the subsequent-stage loss Lk+1(xi, yi) and the correctness of the composite classifier Ck(x) on xi. [sent-163, score-0.838]

59 The composite classifier F(x) based on the reject and region classifiers can be written compactly as follows: F(x) = fs(x), where s = min({j | gj(x) = 0} ∪ {r}).   (8) Observe that if the kth region classifier correctly classifies the example xi, i. [sent-164, score-0.829]

60 e., fk(xi) = yi, then this would encourage gk(xi) = 0. [sent-166, score-0.358]

61 This is because gk (xi ) = 1 would induce an increased cost in terms of increasing Lk+1 (xi , yi ). [sent-167, score-0.254]

62 Similarly, if the kth region classifier incorrectly classifies, namely, fk(xi) ≠ yi, the optimization would prefer gk(xi) = 1. [sent-168, score-0.541]

63 Also note that if the kth region classifier as well as the subsequent stages are incorrect on an example, then the weight on that example is zero. [sent-169, score-0.289]
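
At prediction time, Eq. 8 is simply a cascade: the example is passed down the chain of reject classifiers and labeled by the first region classifier whose gate accepts it, with the last region classifier acting as the default. A minimal sketch (g and f are lists of fitted predict functions):

    def predict_composite(x, g, f):
        r = len(f)
        for j in range(r - 1):
            if g[j](x) == 0:      # gate j accepts: classify locally with f_j
                return f[j](x)
        return f[r - 1](x)        # rejected by all gates: last region classifier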

64 We can deal with minimizing indicator losses and resulting convergence issues by deriving a global surrogate as we did in Sec. [sent-171, score-0.261]

65 4 Local Linear Classification Linear classification is a natural method for learning local decision boundaries, with the global decision regions approximated by piecewise linear functions. [sent-176, score-0.699]

66 In local linear classification, local classifiers, f1 , f2 , . [sent-177, score-0.239]

67 Bias error (empirical error) can be made arbitrarily small by approximating the decision boundary by many local linear classifiers. [sent-185, score-0.321]

68 Variance error (classifier complexity) can be made small by restricting the number of local linear classifiers used. (Figure 2: Local LDA classification regions for XOR data; the black line is the reject classifier boundary.) [sent-186, score-0.521]

69 The VC dimension of the composite classifier (Eq. 8) with r − 1 linear classifiers gj and r linear classifiers fj in a d-dimensional space is bounded by 2(2r − 1) log(e(2r − 1))(d + 1). [sent-191, score-0.339]
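
For a sense of scale, the stated bound can be evaluated directly (natural logarithm assumed):

    import math

    def vc_bound(r, d):
        # Bound on the VC dimension of the composite classifier with r regions
        # built from linear reject and region classifiers in d dimensions.
        m = 2 * r - 1
        return 2 * m * math.log(math.e * m) * (d + 1)

    # e.g. vc_bound(3, 10) is roughly 287, so few regions keep complexity modest.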

70 The VC-dimension of local linear classifiers grows linearly with dimension and nearly linearly with respect to the number of regions. [sent-192, score-0.214]

71 In practice, few regions are necessary to achieve low training error as highly non-linear decision boundaries can be approximated well locally with linear boundaries. [sent-195, score-0.594]

72 Learning the local linear classifier with 2 regions using LDA produces a classifier with small empirical error. [sent-197, score-0.333]

73 Then, with high probability (with respect to the initial sampling of the reject region), a two-region composite classifier trained locally using LDA converges to zero training error. [sent-201, score-0.619]
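
The XOR example can be reproduced in a few lines; the sketch below uses scikit-learn's LDA and a hand-picked reject boundary (x1 > 0) purely for illustration, whereas the paper learns the reject classifier.

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

    rng = np.random.default_rng(0)
    centers = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]])
    labels = np.array([0, 0, 1, 1])              # XOR labeling of the four blobs
    X = np.vstack([c + 0.2 * rng.standard_normal((50, 2)) for c in centers])
    y = np.repeat(labels, 50)

    # A single global LDA cannot separate XOR (training error near 0.5) ...
    global_err = (LDA().fit(X, y).predict(X) != y).mean()

    # ... but one reject boundary splitting the plane into two regions,
    # each handled by its own LDA, drives the training error to zero.
    mask = X[:, 0] > 0
    f0 = LDA().fit(X[mask], y[mask])
    f1 = LDA().fit(X[~mask], y[~mask])
    pred = np.where(mask, f0.predict(X), f1.predict(X))
    local_err = (pred != y).mean()
    print(global_err, local_err)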

74 Although margin-based methods such as SVMs can be used, we primarily use relatively simple schemes such as LDA, logistic regression, and the averaged voted perceptron in our experiments. [sent-204, score-0.418]

75 We use each of these schemes for learning both reject and region-classifiers. [sent-205, score-0.258]

76 Computational Costs of LDA, Logistic Regression and Perceptron: Each LDA classifier is trained in O(nd²) computations, where n is the number of training observations and d is the dimension of the training data. [sent-207, score-0.208]

77 As a result, the total computation cost per iteration of the local linear classifier with LDA scales linearly with respect to the number of training samples, requiring O(nd²r) computations per iteration, where r is the number of classification regions. [sent-208, score-0.291]

78 Similarly, the computational cost of training a single linear classifier by logistic regression scales as O(ncd²) for a fixed number of iterations, with the local linear classifier training time scaling as O(rncd²) computations per iteration, where c is the number of classes. [sent-209, score-0.51]

79 A linear variant of the voted perceptron was implemented by taking the average of the weights generated by the unnormalized voted perceptron [15]. [sent-210, score-0.621]

80 Training each perceptron for a fixed number of epochs is extremely efficient, requiring only O(ndc) computations to train. [sent-211, score-0.22]

81 Therefore, training the local linear perceptron scales linearly with data size and dimension, requiring O(ndcr) computations per iteration. [sent-212, score-0.421]
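
A sketch of the averaged variant of the voted perceptron in the binary case, in the spirit of [15] (not the authors' exact implementation); each epoch is a single O(nd) pass, matching the linear scaling discussed above.

    import numpy as np

    def averaged_perceptron(X, y, epochs=10):
        # X: (n, d) features, y: labels in {-1, +1}.
        n, d = X.shape
        Xb = np.hstack([X, np.ones((n, 1))])   # append a constant for the bias
        w = np.zeros(d + 1)
        w_sum = np.zeros(d + 1)
        for _ in range(epochs):
            for i in range(n):
                if y[i] * (Xb[i] @ w) <= 0:    # mistake-driven update
                    w = w + y[i] * Xb[i]
                w_sum += w                     # running sum -> average of visited weights
        return w_sum / (epochs * n)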

82 3 Experimental Results Multiclass Classification: Experiments on six datasets from the UCI repository [16] were performed using the benchmark training and test splits associated with each data set, as shown in Table 1. [sent-213, score-0.21]

83 Although confidence intervals cannot be computed by multiple training/test splits, test set error bounds [17] show that with test data sets of these sizes, the difference between true error and empirical error is small with high probability. [sent-215, score-0.247]

84 Local linear classifiers were trained with LDA, logistic regression, and perceptron (mean of weights) used to learn local surrogates for the rejection and local classification problems. [sent-217, score-0.693]

85 The classifiers were initialized with 5 classification regions (r = 5), with the trained classifiers often reducing to fewer classification regions due to empty rejection regions. [sent-218, score-0.376]

86 Termination of the algorithm occurred when the rejection outputs, gk (x), and classification labels, F (x), remained consistent on the training data for two iterations. [sent-219, score-0.264]
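
The stopping rule can be implemented as a simple comparison of successive iterations; the sketch below assumes the reject outputs and predicted labels on the training data are stored as arrays after each pass.

    import numpy as np

    def has_converged(prev, curr):
        # prev, curr: tuples (reject_outputs, predicted_labels) from two
        # consecutive iterations, evaluated on the training data.
        return (np.array_equal(prev[0], curr[0])
                and np.array_equal(prev[1], curr[1]))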

87 Results were compared with Mixture Discriminant Analysis (MDA). (Figure 3: Histograms of classes over test data for the Optdigit dataset in the different partitions g1(x), . . . , g5(x) generated by our approach using the linear voted perceptron.) [sent-221, score-0.488]

88 The improved performance of local linear learning of comparable complexity justifies approximating these boundaries by piecewise linear functions. [sent-227, score-0.382]

89 Training each binary kernelized classifier is computationally intensive, and on weakly learnable data, boosting also allows for modeling of complex boundaries with arbitrarily small empirical error. [sent-229, score-0.362]

90 One-vs-All AdaBoost is trained using decision stumps as weak learners. [sent-232, score-0.259]

91 AdaBoost-SAMME and GD-MCBoost are trained using depth-2 decision trees as weak learners. [sent-233, score-0.296]

92 In 4 of the 6 datasets, local linear classification produced the lowest classification error on the test datasets, with optimal test errors within 0. [sent-282, score-0.255]

93 Also there is evidence that suggests that our scheme partitions multiclass problems into simpler subproblems. [sent-284, score-0.315]

94 We plotted histograms of the class labels for the Optdigit dataset across the different regions using local perceptrons (Fig. [sent-285, score-0.274]

95 This can be contrasted with many state-of-the-art boosting techniques, such as [18], which attempt to optimize both the codewords for each class as well as the binary classification problems defining the codewords. [sent-289, score-0.283]

96 Robustness to Label Noise: Local linear classification trained using LDA, logistic regression, and averaged voted perceptron was tested in the presence of random label noise. [sent-292, score-0.555]

97 A randomly selected fraction of all training observations was given incorrect labels, and the classifiers were trained as described for the multiclass experiments. [sent-293, score-0.33]

98 For each label noise fraction, 100 randomly drawn training and test sets were used, and the average test error is shown in Fig. [sent-296, score-0.227]
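
The noise-injection step of this experiment is straightforward to reproduce; a hedged sketch (the exact flipping scheme used by the authors is not specified in the excerpt, so each selected label is simply replaced with a different class chosen at random):

    import numpy as np

    def flip_labels(y, fraction, seed=None):
        rng = np.random.default_rng(seed)
        y_noisy = y.copy()
        classes = np.unique(y)
        idx = rng.choice(len(y), size=int(round(fraction * len(y))), replace=False)
        for i in idx:
            y_noisy[i] = rng.choice(classes[classes != y[i]])
        return y_noisy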

99 For comparison, results are shown for classification trees trained according to Gini’s diversity index (GDI) [3], AdaBoost trained with stumps [2], and support vector machines trained on Gaussian radial basis function kernels. [sent-298, score-0.369]

100 In comparison, boosting and classification trees show sensitivity to label noise, with the test error increasing at a faster rate than LDA-trained local linear classification on both the Wisconsin Breast Cancer data and Vertebrae data. [sent-300, score-0.483]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ers', 0.363), ('classi', 0.335), ('partitioning', 0.28), ('er', 0.211), ('reject', 0.205), ('perceptron', 0.175), ('gk', 0.155), ('multiclass', 0.154), ('lda', 0.154), ('region', 0.145), ('xi', 0.142), ('decision', 0.132), ('regions', 0.127), ('boosting', 0.12), ('partitions', 0.116), ('gj', 0.113), ('surrogate', 0.107), ('voted', 0.107), ('fk', 0.104), ('yi', 0.099), ('adaboost', 0.098), ('local', 0.091), ('piecewise', 0.09), ('boundaries', 0.087), ('mda', 0.086), ('optdigit', 0.086), ('trees', 0.086), ('locally', 0.085), ('logistic', 0.083), ('fj', 0.08), ('erm', 0.079), ('trained', 0.078), ('discriminant', 0.077), ('surrogates', 0.074), ('gdi', 0.073), ('vertebrae', 0.073), ('lk', 0.07), ('global', 0.07), ('supervised', 0.066), ('training', 0.065), ('boston', 0.059), ('binary', 0.058), ('empirical', 0.058), ('cation', 0.058), ('linear', 0.057), ('wi', 0.056), ('xor', 0.056), ('perceptrons', 0.056), ('label', 0.055), ('schemes', 0.053), ('adaptively', 0.049), ('saberian', 0.049), ('stumps', 0.049), ('breast', 0.047), ('regression', 0.047), ('losses', 0.046), ('computations', 0.045), ('scheme', 0.045), ('rejection', 0.044), ('gini', 0.043), ('banana', 0.043), ('landsat', 0.043), ('pendigit', 0.043), ('shuttle', 0.043), ('splits', 0.042), ('uci', 0.042), ('wisconsin', 0.042), ('art', 0.042), ('error', 0.041), ('cancer', 0.041), ('greedily', 0.041), ('composite', 0.041), ('coordinate', 0.041), ('namely', 0.041), ('loss', 0.04), ('unions', 0.04), ('complex', 0.039), ('indicator', 0.038), ('kth', 0.038), ('robert', 0.037), ('repository', 0.035), ('descent', 0.035), ('datasets', 0.035), ('test', 0.033), ('consequently', 0.033), ('linearly', 0.033), ('incorrect', 0.033), ('wine', 0.033), ('optimize', 0.032), ('space', 0.032), ('train', 0.032), ('ck', 0.032), ('yoav', 0.031), ('attempt', 0.031), ('optimizing', 0.03), ('optimizes', 0.03), ('techniques', 0.029), ('fr', 0.029), ('gr', 0.029), ('isolet', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999994 200 nips-2012-Local Supervised Learning through Space Partitioning

Author: Joseph Wang, Venkatesh Saligrama

Abstract: We develop a novel approach for supervised learning based on adaptively partitioning the feature space into different regions and learning local region-specific classifiers. We formulate an empirical risk minimization problem that incorporates both partitioning and classification into a single global objective. We show that space partitioning can be equivalently reformulated as a supervised learning problem and consequently any discriminative learning method can be utilized in conjunction with our approach. Nevertheless, we consider locally linear schemes by learning linear partitions and linear region classifiers. Locally linear schemes can not only approximate complex decision boundaries and ensure low training error but also provide tight control on over-fitting and generalization error. We train locally linear classifiers by using LDA, logistic regression and perceptrons, and so our scheme is scalable to large data sizes and high dimensions. We present experimental results demonstrating improved performance over state-of-the-art classification techniques on benchmark datasets. We also show improved robustness to label noise.

2 0.22042973 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound

Author: Chi Jin, Liwei Wang

Abstract: Margin is one of the most important concepts in machine learning. Previous margin bounds, both for SVM and for boosting, are dimensionality independent. A major advantage of this dimensionality independency is that it can explain the excellent performance of SVM whose feature spaces are often of high or infinite dimension. In this paper we address the problem whether such dimensionality independency is intrinsic for the margin bounds. We prove a dimensionality dependent PAC-Bayes margin bound. The bound is monotone increasing with respect to the dimension when keeping all other factors fixed. We show that our bound is strictly sharper than a previously well-known PAC-Bayes margin bound if the feature space is of finite dimension; and the two bounds tend to be equivalent as the dimension goes to infinity. In addition, we show that the VC bound for linear classifiers can be recovered from our bound under mild conditions. We conduct extensive experiments on benchmark datasets and find that the new bound is useful for model selection and is usually significantly sharper than the dimensionality independent PAC-Bayes margin bound as well as the VC bound for linear classifiers.

3 0.17827564 67 nips-2012-Classification Calibration Dimension for General Multiclass Losses

Author: Harish G. Ramaswamy, Shivani Agarwal

Abstract: We study consistency properties of surrogate loss functions for general multiclass classification problems, defined by a general loss matrix. We extend the notion of classification calibration, which has been studied for binary and multiclass 0-1 classification problems (and for certain other specific learning problems), to the general multiclass setting, and derive necessary and sufficient conditions for a surrogate loss to be classification calibrated with respect to a loss matrix in this setting. We then introduce the notion of classification calibration dimension of a multiclass loss matrix, which measures the smallest ‘size’ of a prediction space for which it is possible to design a convex surrogate that is classification calibrated with respect to the loss matrix. We derive both upper and lower bounds on this quantity, and use these results to analyze various loss matrices. In particular, as one application, we provide a different route from the recent result of Duchi et al. (2010) for analyzing the difficulty of designing ‘low-dimensional’ convex surrogates that are consistent with respect to pairwise subset ranking losses. We anticipate the classification calibration dimension may prove to be a useful tool in the study and design of surrogate losses for general multiclass learning problems. 1

4 0.16309291 227 nips-2012-Multiclass Learning with Simplex Coding

Author: Youssef Mroueh, Tomaso Poggio, Lorenzo Rosasco, Jean-jeacques Slotine

Abstract: In this paper we discuss a novel framework for multiclass learning, defined by a suitable coding/decoding strategy, namely the simplex coding, that allows us to generalize to multiple classes a relaxation approach commonly used in binary classification. In this framework, we develop a relaxation error analysis that avoids constraints on the considered hypotheses class. Moreover, using this setting we derive the first provably consistent regularized method with training/tuning complexity that is independent to the number of classes. We introduce tools from convex analysis that can be used beyond the scope of this paper. 1

5 0.14588545 197 nips-2012-Learning with Recursive Perceptual Representations

Author: Oriol Vinyals, Yangqing Jia, Li Deng, Trevor Darrell

Abstract: Linear Support Vector Machines (SVMs) have become very popular in vision as part of state-of-the-art object recognition and other classification tasks but require high dimensional feature spaces for good performance. Deep learning methods can find more compact representations but current methods employ multilayer perceptrons that require solving a difficult, non-convex optimization problem. We propose a deep non-linear classifier whose layers are SVMs and which incorporates random projection as its core stacking element. Our method learns layers of linear SVMs recursively transforming the original data manifold through a random projection of the weak prediction computed from each layer. Our method scales as linear SVMs, does not rely on any kernel computations or nonconvex optimization, and exhibits better generalization ability than kernel-based SVMs. This is especially true when the number of training samples is smaller than the dimensionality of data, a common scenario in many real-world applications. The use of random projections is key to our method, as we show in the experiments section, in which we observe a consistent improvement over previous –often more complicated– methods on several vision and speech benchmarks. 1

6 0.14033276 226 nips-2012-Multiclass Learning Approaches: A Theoretical Comparison with Implications

7 0.13021903 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

8 0.12095276 279 nips-2012-Projection Retrieval for Classification

9 0.11190526 14 nips-2012-A P300 BCI for the Masses: Prior Information Enables Instant Unsupervised Spelling

10 0.11112892 271 nips-2012-Pointwise Tracking the Optimal Regression Function

11 0.11043017 361 nips-2012-Volume Regularization for Binary Classification

12 0.10860823 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing

13 0.10329921 261 nips-2012-Online allocation and homogeneous partitioning for piecewise constant mean-approximation

14 0.10280878 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs

15 0.098090857 97 nips-2012-Diffusion Decision Making for Adaptive k-Nearest Neighbor Classification

16 0.096709445 242 nips-2012-Non-linear Metric Learning

17 0.096157312 81 nips-2012-Context-Sensitive Decision Forests for Object Detection

18 0.09355513 48 nips-2012-Augmented-SVM: Automatic space partitioning for combining multiple non-linear dynamics

19 0.092519671 105 nips-2012-Dynamic Pruning of Factor Graphs for Maximum Marginal Prediction

20 0.091094092 117 nips-2012-Ensemble weighted kernel estimators for multivariate entropy estimation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.254), (1, 0.027), (2, -0.04), (3, -0.079), (4, 0.113), (5, -0.057), (6, 0.025), (7, 0.19), (8, -0.103), (9, -0.063), (10, 0.068), (11, 0.099), (12, 0.13), (13, 0.051), (14, -0.01), (15, -0.119), (16, -0.159), (17, 0.056), (18, -0.014), (19, 0.154), (20, -0.061), (21, 0.041), (22, -0.081), (23, 0.141), (24, -0.004), (25, -0.152), (26, -0.019), (27, -0.042), (28, 0.003), (29, 0.014), (30, -0.027), (31, 0.095), (32, 0.082), (33, -0.022), (34, -0.032), (35, 0.002), (36, -0.077), (37, 0.129), (38, -0.056), (39, 0.007), (40, -0.023), (41, -0.045), (42, -0.066), (43, -0.022), (44, 0.018), (45, 0.001), (46, -0.087), (47, -0.033), (48, 0.067), (49, 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98437518 200 nips-2012-Local Supervised Learning through Space Partitioning

Author: Joseph Wang, Venkatesh Saligrama

Abstract: We develop a novel approach for supervised learning based on adaptively partitioning the feature space into different regions and learning local region-specific classifiers. We formulate an empirical risk minimization problem that incorporates both partitioning and classification in to a single global objective. We show that space partitioning can be equivalently reformulated as a supervised learning problem and consequently any discriminative learning method can be utilized in conjunction with our approach. Nevertheless, we consider locally linear schemes by learning linear partitions and linear region classifiers. Locally linear schemes can not only approximate complex decision boundaries and ensure low training error but also provide tight control on over-fitting and generalization error. We train locally linear classifiers by using LDA, logistic regression and perceptrons, and so our scheme is scalable to large data sizes and high-dimensions. We present experimental results demonstrating improved performance over state of the art classification techniques on benchmark datasets. We also show improved robustness to label noise.

2 0.85796863 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound

Author: Chi Jin, Liwei Wang

Abstract: Margin is one of the most important concepts in machine learning. Previous margin bounds, both for SVM and for boosting, are dimensionality independent. A major advantage of this dimensionality independency is that it can explain the excellent performance of SVM whose feature spaces are often of high or infinite dimension. In this paper we address the problem whether such dimensionality independency is intrinsic for the margin bounds. We prove a dimensionality dependent PAC-Bayes margin bound. The bound is monotone increasing with respect to the dimension when keeping all other factors fixed. We show that our bound is strictly sharper than a previously well-known PAC-Bayes margin bound if the feature space is of finite dimension; and the two bounds tend to be equivalent as the dimension goes to infinity. In addition, we show that the VC bound for linear classifiers can be recovered from our bound under mild conditions. We conduct extensive experiments on benchmark datasets and find that the new bound is useful for model selection and is usually significantly sharper than the dimensionality independent PAC-Bayes margin bound as well as the VC bound for linear classifiers.

3 0.79071915 226 nips-2012-Multiclass Learning Approaches: A Theoretical Comparison with Implications

Author: Amit Daniely, Sivan Sabato, Shai S. Shwartz

Abstract: We theoretically analyze and compare the following five popular multiclass classification methods: One vs. All, All Pairs, Tree-based classifiers, Error Correcting Output Codes (ECOC) with randomly generated code matrices, and Multiclass SVM. In the first four methods, the classification is based on a reduction to binary classification. We consider the case where the binary classifier comes from a class of VC dimension d, and in particular from the class of halfspaces over Rd . We analyze both the estimation error and the approximation error of these methods. Our analysis reveals interesting conclusions of practical relevance, regarding the success of the different approaches under various conditions. Our proof technique employs tools from VC theory to analyze the approximation error of hypothesis classes. This is in contrast to most previous uses of VC theory, which only deal with estimation error. 1

4 0.74239707 279 nips-2012-Projection Retrieval for Classification

Author: Madalina Fiterau, Artur Dubrawski

Abstract: In many applications, classification systems often require human intervention in the loop. In such cases the decision process must be transparent and comprehensible, simultaneously requiring minimal assumptions on the underlying data distributions. To tackle this problem, we formulate an axis-aligned subspace-finding task under the assumption that query specific information dictates the complementary use of the subspaces. We develop a regression-based approach called RECIP that efficiently solves this problem by finding projections that minimize a nonparametric conditional entropy estimator. Experiments show that the method is accurate in identifying the informative projections of the dataset, picking the correct views to classify query points, and facilitates visual evaluation by users. 1 Introduction and problem statement In the domain of predictive analytics, many applications which keep human users in the loop require the use of simple classification models. Often, it is required that a test-point be ‘explained’ (classified) using a simple low-dimensional projection of the original feature space. This is a Projection Retrieval for Classification problem (PRC). The interaction with the user proceeds as follows: the user provides the system a query point; the system searches for a projection in which the point can be accurately classified; the system displays the classification result as well as an illustration of how the classification decision was reached in the selected projection. Solving the PRC problem is relevant in many practical applications. For instance, consider a nuclear threat detection system installed at a border check point. Vehicles crossing the border are scanned with sensors so that a large array of measurements of radioactivity and secondary contextual information is being collected. These observations are fed into a classification system that determines whether the scanned vehicle may carry a threat. Given the potentially devastating consequences of a false negative, a border control agent is requested to validate the prediction and decide whether to submit the vehicle for a costly further inspection. With the positive classification rate of the system under strict bounds because of limitations in the control process, the risk of false negatives is increased. Despite its crucial role, human intervention should only be withheld for cases in which there are reasons to doubt the validity of classification. In order for a user to attest the validity of a decision, the user must have a good understanding of the classification process, which happens more readily when the classifier only uses the original dataset features rather than combinations of them, and when the discrimination models are low-dimensional. In this context, we aim to learn a set of classifiers in low-dimensional subspaces and a decision function which selects the subspace under which a test point is to be classified. Assume we are given a dataset {(x1 , y1 ) . . . (xn , yn )} ∈ X n × {0, 1}n and a class of discriminators H. The model will contain a set Π of subspaces of X ; Π ⊆ Π, where Π is the set of all axis-aligned subspaces of the original feature space, the power set of the features. To each projection πi ∈ Π corresponds one discriminator from a given hypothesis space hi ∈ H. It will also contain a selection function g : X → Π × H, which yields, for a query point x, the projection/discriminator pair with which this point will be classified. 
The notation π(x) refers to the projection of the point x onto the subspace 1 π while h(π(x)) represents the predicted label for x. Formally, we describe the model class as Md = {Π = {π : π ∈ Π, dim(π) ≤ d}, H = {hi : hi ∈ H, h : πi → Y, ∀i = 1 . . . |Π|}, g ∈ {f : X → {1 . . . |Π|}} . where dim(π) presents the dimensionality of the subspace determined by the projection π. Note that only projections up to size d will be considered, where d is a parameter specific to the application. The set H contains one discriminator from the hypothesis class H for each projection. Intuitively, the aim is to minimize the expected classification error over Md , however, a notable modification is that the projection and, implicitly, the discriminator, are chosen according to the data point that needs to be classified. Given a query x in the space X , g(x) will yield the subspace πg(x) onto which the query is projected and the discriminator hg(x) for it. Distinct test points can be handled using different combinations of subspaces and discriminators. We consider models that minimize 0/1 loss. Hence, the PRC problem can be stated as follows: M ∗ = arg min EX ,Y y = hg(x) (πg(x) (x)) M ∈Md There are limitations to the type of selection function g that can be learned. A simple example for which g can be recovered is a set of signal readings x for which, if one of the readings xi exceeds a threshold ti , the label can be predicted just based on xi . A more complex one is a dataset containing regulatory variables, that is, for xi in the interval [ak , bk ] the label only depends on (x1 . . . xnk ) k k datasets that fall into the latter category fulfill what we call the Subspace-Separability Assumption. This paper proposes an algorithm called RECIP that solves the PRC problem for a class of nonparametric classifiers. We evaluate the method on artificial data to show that indeed it correctly identifies the underlying structure for data satisfying the Subspace-Separability Assumption. We show some case studies to illustrate how RECIP offers insight into applications requiring human intervention. The use of dimensionality reduction techniques is a common preprocessing step in applications where the use of simplified classification models is preferable. Methods that learn linear combinations of features, such as Linear Discriminant Analysis, are not quite appropriate for the task considered here, since we prefer to natively rely on the dimensions available in the original feature space. Feature selection methods, such as e.g. lasso, are suitable for identifying sets of relevant features, but do not consider interactions between them. Our work better fits the areas of class dependent feature selection and context specific classification, highly connected to the concept of Transductive Learning [6]. Other context-sensitive methods are Lazy and Data-Dependent Decision Trees, [5] and [10] respectively. In Ting et al [14], the Feating submodel selection relies on simple attribute splits followed by fitting local predictors, though the algorithm itself is substantially different. Obozinski et al present a subspace selection method in the context of multitask learning [11]. Go et al propose a joint method for feature selection and subspace learning [7], however, their classification model is not particularly query specific. Alternatively, algorithms that transform complex or unintelligible models with user-friendly equivalents have been proposed [3, 2, 1, 8]. 
Algorithms specifically designed to yield understandable models are a precious few. Here we note a rule learning method described in [12], even though the resulting rules can make visualization difficult, while itemset mining [9] is not specifically designed for classification. Unlike those approaches, our method is designed to retrieve subsets of the feature space designed for use in a way that is complementary to the basic task at hand (classification) while providing query-specific information. 2 Recovering informative projections with RECIP To solve PRC, we need means by which to ascertain which projections are useful in terms of discriminating data from the two classes. Since our model allows the use of distinct projections depending on the query point, it is expected that each projection would potentially benefit different areas of the feature space. A(π) refers to the area of the feature space where the projection π is selected. A(π) = {x ∈ X : πg(x) = π} The objective becomes min E(X ×Y) y = hg(x) (πg(x) (x)) M ∈Md = p(A(π))E y = hg(x) (πg(x) (x))|x ∈ A(π) min M ∈Md 2 π∈Π . The expected classification error over A(π) is linked to the conditional entropy of Y |X. Fano’s inequality provides a lower bound on the error while Feder and Merhav [4] derive a tight upper bound on the minimal error probability in terms of the entropy. This means that conditional entropy characterizes the potential of a subset of the feature space to separate data, which is more generic than simply quantifying classification accuracy for a specific discriminator. In view of this connection between classification accuracy and entropy, we adapt the objective to: p(A(π))H(Y |π(X); X ∈ A(π)) min M ∈Md (1) π∈Π The method we propose optimizes an empirical analog of (1) which we develop below and for which we will need the following result. Proposition 2.1. Given a continuous variable X ∈ X and a binary variable Y , where X is sampled from the mixture model f (x) = p(y = 0)f0 (x) + p(y = 1)f1 (x) = p0 f0 (x) + p1 f1 (x) , then H(Y |X) = −p0 log p0 − p1 log p1 − DKL (f0 ||f ) − DKL (f1 ||f ) . Next, we will use the nonparametric estimator presented in [13] for Tsallis α-divergence. Given samples Ui ∼ U, with i = 1, n and Vj ∼ V with j = 1, m, the divergence is estimated as follows: ˆ Tα (U ||V ) = 1 1 1−α n n (n − 1)νk (Ui , U \ ui )d mνk (Ui , V )d i=1 1−α B(k, α) − 1 , (2) where d is the dimensionality of the variables U and V and νk (z, Z) represents the distance from z ˆ to its k th nearest neighbor of the set of points Z. For α ≈ 1 and n → ∞, Tα (u||v) ≈ DKL (u||v). 2.1 Local estimators of entropy We will now plug (2) in the formula obtained by Proposition 2.1 to estimate the quantity (1). We use the notation X0 to represent the n0 samples from X which have the labels Y equal to 0, and X1 to represent the n1 samples from X which have the labels set to 1. Also, Xy(x) represents the set of samples that have labels equal to the label of x and X¬y(x) the data that have labels opposite to the label of x. 
ˆ ˆ x ˆ x H(Y |X; X ∈ A) = −H(p0 ) − H(p1 ) − T (f0 ||f x ) − T (f1 ||f x ) + C α≈1 ˆ H(Y |X; X ∈ A) ∝ 1 n0 + 1 n1 ∝ 1 n0 + 1 n1 ∝ 1 n n0 (n0 − 1)νk (xi , X0 \ xi )d nνk (xi , X \ xi )d 1−α I[xi ∈ A] (n1 − 1)νk (xi , X1 \ xi )d nνk (xi , X \ xi )d 1−α I[xi ∈ A] (n0 − 1)νk (xi , X0 \ xi )d nνk (xi , X1 \ xi )d 1−α I[xi ∈ A] (n1 − 1)νk (xi , X1 \ xi )d nνk (xi , X0 \ xi )d 1−α I[xi ∈ A] i=1 n1 i=1 n0 i=1 n1 i=1 n I[xi ∈ A] i=1 (n − 1)νk (xi , Xy(xi ) \ xi )d nνk (xi , X¬y(xi ) \ xi )d 1−α The estimator for the entropy of the data that is classified with projection π is as follows: ˆ H(Y |π(X); X ∈ A(π)) ∝ 1 n n I[xi ∈ A(π)] i=1 (n − 1)νk (π(xi ), π(Xy(xi ) ) \ π(xi ))d nνk (π(xi ), π(X¬y(xi ) \ xi ))d 1−α (3) From 3 and using the fact that I[xi ∈ A(π)] = I[πg(xi ) = π] for which we use the notation I[g(xi ) → π], we estimate the objective as min M ∈Md π∈Π 1 n n I[g(xi ) → π] i=1 (n − 1)νk (π(xi ), π(Xy(xi ) ) \ π(xi ))d nνk (π(xi ), π(X¬y(xi ) \ xi ))d 3 1−α (4) Therefore, the contribution of each data point to the objective corresponds to a distance ratio on the projection π ∗ where the class of the point is obtained with the highest confidence (data is separable in the neighborhood of the point). We start by computing the distance-based metric of each point on each projection of size up to d - there are d∗ such projections. This procedure yields an extended set of features Z, which we name local entropy estimates: Zij = νk (πj (xi ), πj (Xy(xi ) ) \ πj (xi )) νk (πj (xi ), πj (X¬y(xi ) ) \ πj (xi )) d(1−α) α≈1 j ∈ {1 . . . d∗ } (5) For each training data point, we compute the best distance ratio amid all the projections, which is simply Ti = minj∈[d∗ ] Zij . The objective can be then further rewritten as a function of the entropy estimates: n I[g(xi ) → πj ]Zij min M ∈Md (6) i=1 πj ∈Π From the definition of T, it is also clear that n n I[g(xi ) → πj ]Zij min M ∈Md 2.2 ≥ i=1 πj ∈Π Ti . (7) i=1 Projection selection as a combinatorial problem Considering form (6) of the objective, and given that the estimates Zij are constants, depending only on the training set, the projection retrieval problem is reduced to finding g for all training points, which will implicitly select the projection set of the model. Naturally, one might assume the bestperforming classification model is the one containing all the axis-aligned subspaces. This model achieves the lower bound (7) for the training set. However, the larger the set of projections, the more values the function g takes, and thus the problem of selecting the correct projection becomes more difficult. It becomes apparent that the number of projections should be somehow restricted to allow intepretability. Assuming a hard threshold of at most t projections, the optimization (6) becomes an entry selection problem over matrix Z where one value must be picked from each row under a limitation on the number of columns that can be used. This problem cannot be solved exactly in polynomial time. Instead, it can be formulated as an optimization problem under 1 constraints. 2.3 Projection retrieval through regularized regression To transform the projection retrieval to a regression problem we consider T, the minimum obtainable value of the entropy estimator for each point, as the output which the method needs to predict. Each row i of the parameter matrix B represents the degrees to which the entropy estimates on each projection contribute to the entropy estimator of point xi . 
Thus, the sum over each row of B is 1, and the regularization penalty applies to the number of non-zero columns in B. d∗ min ||T − (Z B B)J|Π|,1 ||2 2 +λ [Bi = 0] (8) i=1 subject to |Bk | 1 = 1 k = 1, n where (Z B)ij = Zij + Bij and J is a matrix of ones. The problem with this optimization is that it is not convex. A typical walk-around of this issue is to use the convex relaxation for Bi = 0, that is 1 norm. This would transform the penalized term d∗ d∗ n to i=1 |Bi | 1 . However, i=1 |Bi | 1 = k=1 |Bk | 1 = n , so this penalty really has no effect. An alternative mechanism to encourage the non-zero elements in B to populate a small number of columns is to add a penalty term in the form of Bδ, where δ is a d∗ -size column vector with each element representing the penalty for a column in B. With no prior information about which subspaces are more informative, δ starts as an all-1 vector. An initial value for B is obtained through the optimization (8). Since our goal is to handle data using a small number of projections, δ is then updated such that its value is lower for the denser columns in B. This update resembles the reweighing in adaptive lasso. The matrix B itself is updated, and this 2-step process continues until convergence of δ. Once δ converges, the projections corresponding to the non-zero columns of B are added to the model. The procedure is shown in Algorithm 1. 4 Algorithm 1: RECIP δ = [1 . . . 1] repeat |P I| b = arg minB ||T − i=1 < Z, B > ||2 + λ|Bδ| 1 2 subject to |Bk | 1 = 1 k = 1...n δk = |Bi | 1 i = . . . d∗ (update the differential penalty) δ δ = 1 − |δ| 1 until δ converges return Π = {πi ; |Bi | 1 > 0 ∀i = 1 . . . d∗ } 2.4 Lasso for projection selection We will compare our algorithm to lasso regularization that ranks the projections in terms of their potential for data separability. We write this as an 1 -penalized optimization on the extended feature set Z, with the objective T : minβ |T − Zβ|2 + λ|β| 1 . The lasso penalty to the coefficient vector encourages sparsity. For a high enough λ, the sparsity pattern in β is indicative of the usefulness of the projections. The lasso on entropy contributions was not found to perform well as it is not query specific and will find one projection for all data. We improved it by allowing it to iteratively find projections - this robust version offers increased performance by reweighting the data thus focusing on different subsets of it. Although better than running lasso on entropy contributions, the robust lasso does not match RECIP’s performance as the projections are selected gradually rather than jointly. Running the standard lasso on the original design matrix yields a set of relevant variables and it is not immediately clear how the solution would translate to the desired class. 2.5 The selection function Once the projections are selected, the second stage of the algorithm deals with assigning the projection with which to classify a particular query point. An immediate way of selecting the correct projection starts by computing the local entropy estimator for each subspace with each class assignment. Then, we may select the label/subspace combination that minimizes the empirical entropy. (i∗ , θ∗ ) = arg min i,θ 3 νk (πi (x), πi (Xθ )) νk (πi (x), πi (X¬θ )) dim(πi )(1−α) i = 1 . . . d∗ , α≈1 (9) Experimental results In this section we illustrate the capability of RECIP to retrieve informative projections of data and their use in support of interpreting results of classification. 
First, we analyze how well RECIP can identify subspaces in synthetic data whose distribution obeys the subspace separability assumption (3.1). As a point of reference, we also present classification accuracy results (3.2) for both the synthetic data and a few real-world sets. This is to quantify the extent of the trade-off between fidelity of attainable classifiers and desired informativeness of the projections chosen by RECIP. We expect RECIP’s classification performance to be slightly, but not substantially worse when compared to relevant classification algorithms trained to maximize classification accuracy. Finally, we present a few examples (3.3) of informative projections recovered from real-world data and their utility in explaining to human users the decision processes applied to query points. A set of artificial data used in our experiments contains q batches of data points, each of them made classifiable with high accuracy using one of available 2-dimensional subspaces (x1 , x2 ) with k ∈ k k {1 . . . q}. The data in batch k also have the property that x1 > tk . This is done such that the group a k point belongs to can be detected from x1 , thus x1 is a regulatory variable. We control the amount of k k noise added to thusly created synthetic data by varying the proportion of noisy data points in each batch. The results below are for datasets with 7 features each, with number of batches q ranging between 1 and 7. We kept the number of features specifically low in order to prevent excessive variation between any two sets generated this way, and to enable computing meaningful estimates of the expectation and variance of performance, while enabling creation of complicated data in which synthetic patterns may substantially overlap (using 7 features and 7 2-dimensional patterns implies that dimensions of at least 4 of the patterns will overlap). We implemented our method 5 to be scalable to the size and dimensionality of data and although for brevity we do not include a discussion of this topic here, we have successfully run RECIP against data with 100 features. The parameter α is a value close to 1, because the Tsallis divergence converges to the KL divergence as alpha approaches 1. For the experiments on real-world data, d was set to n (all projections were considered). For the artificial data experiments, we reported results for d = 2 as they do not change significantly for d >= 2 because this data was synthesized to contain bidimensional informative projections. In general, if d is too low, the correct full set of projections will not be found, but it may be recovered partially. If d is chosen too high, there is a risk that a given selected projection p will contain irrelevant features compared to the true projection p0 . However, this situation only occurs if the noise introduced by these features in the estimators makes the entropy contributions on p and p0 statistically indistinguishable for a large subset of data. The users will choose d according to the desired/acceptable complexity of the resulting model. If the results are to be visually interpreted by a human, values of 2 or 3 are reasonable for d. 3.1 Recovering informative projections Table 1 shows how well RECIP recovers the q subspaces corresponding to the synthesized batches of data. We measure precision (proportion of the recovered projections that are known to be informative), and recall (proportion of known informative projections that are recovered by the algorithm). 
in Table 1, rows correspond to the number of distinct synthetic batches injected in data, q, and subsequent columns correspond to the increasing amounts of noise in data. We note that the observed precision is nearly perfect: the algorithm makes only 2 mistakes over the entire set of experiments, and those occur for highly noisy setups. The recall is nearly perfect as long as there is little overlap among the dimensions, that is when the injections do not interfere with each other. As the number of projections increases, the chances for overlap among the affected features also increase, which makes the data more confusing resulting on a gradual drop of recall until only about 3 or 4 of the 7 known to be informative subspaces can be recovered. We have also used lasso as described in 2.4 in an attempt to recover projections. This setup only manages to recover one of the informative subspaces, regardless of how the regularization parameter is tuned. 3.2 Classification accuracy Table 2 shows the classification accuracy of RECIP, obtained using synthetic data. As expected, the observed performance is initially high when there are few known informative projections in data and it decreases as noise and ambiguity of the injected patterns increase. Most types of ensemble learners would use a voting scheme to arrive at the final classification of a testing sample, rather than use a model selection scheme. For this reason, we have also compared predictive accuracy revealed by RECIP against a method based on majority voting among multiple candidate subspaces. Table 4 shows that the accuracy of this technique is lower than the accuracy of RECIP, regardless of whether the informative projections are recovered by the algorithm or assumed to be known a priori. This confirms the intuition that a selection-based approach can be more effective than voting for data which satisfies the subspace separability assumption. For reference, we have also classified the synthetic data using K-Nearest-Neighbors algorithm using all available features at once. The results of that experiment are shown in Table 5. Since RECIP uses neighbor information, K-NN is conceptually the closest among the popular alternatives. Compared to RECIP, K-NN performs worse when there are fewer synthetic patterns injected in data to form informative projections. It is because some features used then by K-NN are noisy. As more features become informative, the K-NN accuracy improves. This example shows the benefit of a selective approach to feature space and using a subset of the most explanatory projections to support not only explanatory analyses but also classification tasks in such circumstances. 3.3 RECIP case studies using real-world data Table 3 summarizes the RECIP and K-NN performance on UCI datasets. We also test the methods using Cell dataset containing a set of measurements such as the area and perimeter biological cells with separate labels marking cells subjected to treatment and control cells. In Vowel data, the nearest-neighbor approach works exceptionally well, even outperforming random forests (0.94 accuracy), which is an indication that all features are jointly relevant. For d lower than the number of features, RECIP picks projections of only one feature, but if there is no such limitation, RECIP picks the space of all the features as informative. 6 Table 1: Projection recovery for artificial datasets with 1 . . . 7 informative features and noise level 0 . . . 0.2 in terms of mean and variance of Precision and Recall. 
Table 1: Projection recovery for artificial datasets with 1 to 7 informative features and noise levels 0 to 0.2, in terms of mean and variance of precision and recall (variance shown in parentheses). Mean and variance are obtained for each setting by repeating the experiment with datasets having different informative projections.

PRECISION, mean (variance)
q      noise 0          0.02             0.05             0.1              0.2
1      1 (0)            1 (0)            1 (0)            0.9286 (0.0306)  0.9286 (0.0306)
2-7    1 (0)            1 (0)            1 (0)            1 (0)            1 (0)

RECALL, mean (variance)
q      noise 0          0.02             0.05             0.1              0.2
1      1 (0)            1 (0)            1 (0)            1 (0)            1 (0)
2      1 (0)            1 (0)            1 (0)            1 (0)            1 (0)
3      1 (0)            1 (0)            0.9524 (0.0136)  0.9524 (0.0136)  1 (0)
4      0.9643 (0.0077)  0.9643 (0.0077)  0.9643 (0.0077)  0.9643 (0.0077)  0.9286 (0.0128)
5      0.7714 (0.0163)  0.7429 (0.0196)  0.8286 (0.0049)  0.8571 (0.0082)  0.7714 (0.0278)
6      0.6429 (0.0113)  0.6905 (0.0113)  0.6905 (0.0272)  0.6905 (0.0113)  0.6905 (0.0113)
7      0.6327 (0.0225)  0.5918 (0.0200)  0.5918 (0.0258)  0.5714 (0.0233)  0.5510 (0.0200)

Table 2: RECIP classification accuracy on artificial data, mean (variance).

USING RECIP-RECOVERED PROJECTIONS
q      noise 0          0.02             0.05             0.1              0.2
1      0.9751 (0.0000)  0.9731 (0.0000)  0.9686 (0.0000)  0.9543 (0.0008)  0.9420 (0.0007)
2      0.9333 (0.0001)  0.9297 (0.0001)  0.9227 (0.0001)  0.9067 (0.0001)  0.8946 (0.0001)
3      0.9053 (0.0004)  0.8967 (0.0005)  0.8764 (0.0016)  0.8640 (0.0028)  0.8618 (0.0007)
4      0.8725 (0.0020)  0.8685 (0.0020)  0.8589 (0.0019)  0.8454 (0.0025)  0.8187 (0.0032)
5      0.8113 (0.0042)  0.8009 (0.0044)  0.8105 (0.0033)  0.8105 (0.0036)  0.7782 (0.0044)
6      0.7655 (0.0025)  0.7739 (0.0021)  0.7669 (0.0026)  0.7632 (0.0025)  0.7511 (0.0027)
7      0.7534 (0.0034)  0.7399 (0.0040)  0.7347 (0.0042)  0.7278 (0.0042)  0.7205 (0.0045)

USING KNOWN PROJECTIONS
q      noise 0          0.02             0.05             0.1              0.2
1      0.9751 (0.0000)  0.9731 (0.0000)  0.9686 (0.0000)  0.9637 (0.0001)  0.9514 (0.0000)
2      0.9333 (0.0001)  0.9297 (0.0001)  0.9227 (0.0001)  0.9067 (0.0001)  0.8946 (0.0001)
3      0.9053 (0.0004)  0.8967 (0.0005)  0.8914 (0.0005)  0.8777 (0.0007)  0.8618 (0.0007)
4      0.8820 (0.0011)  0.8781 (0.0011)  0.8657 (0.0014)  0.8541 (0.0014)  0.8331 (0.0020)
5      0.8714 (0.0015)  0.8641 (0.0015)  0.8523 (0.0018)  0.8429 (0.0019)  0.8209 (0.0023)
6      0.8566 (0.0014)  0.8497 (0.0015)  0.8377 (0.0016)  0.8285 (0.0023)  0.8074 (0.0021)
7      0.8429 (0.0015)  0.8371 (0.0018)  0.8256 (0.0018)  0.8122 (0.0021)  0.7988 (0.0020)

Table 3: Accuracy of K-NN and RECIP.
Dataset               KNN      RECIP
Breast Cancer Wis     0.8415   0.8275
Breast Tissue         1.0000   1.0000
Cell                  0.7072   0.7640
MiniBOONE*            0.7896   0.7396
Spam                  0.7680   0.7680
Vowel                 0.9839   0.9839

In the Spam data, the two most informative projections are 'Capital Length Total' (CLT) / 'Capital Length Longest' (CLL) and CLT / 'Frequency of word your' (FWY). Figure 1 shows these two projections, with the dots representing training points: red dots are points labeled as spam, while blue ones are non-spam. The circles are query points that have been assigned to be classified with the projection in which they are plotted; green circles are correctly classified points, while the magenta circles (far fewer) are the incorrectly classified ones. Not only does the importance of text in capital letters make sense for a spam-filtering dataset, but the points that select those projections are almost flawlessly classified. Additionally, if a user needed to attest to the validity of the classification in the first plot, he or she would have no trouble seeing that the circled data points are located in a region predominantly populated with examples of spam, so any non-spam entry appears suspicious. Both of the magenta-colored cases fall into this category, and they can therefore be flagged for further investigation.
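A projection plot of the kind shown in Figure 1 below can be generated with a short matplotlib routine. The function here is only an illustrative sketch: the feature indices, label encoding, and styling are placeholders, not the code used to produce the actual figure.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_projection(X, y, i, j, feature_names, query_idx=None, query_correct=None):
    """Scatter a 2-D informative projection: red = spam, blue = non-spam.

    Optionally overlay query points as hollow circles, green where the
    projection-based prediction was correct and magenta where it was not,
    mirroring the annotation used in Figure 1.
    """
    spam, ham = (y == 1), (y == 0)
    plt.scatter(X[ham, i], X[ham, j], c="blue", s=8, label="non-spam")
    plt.scatter(X[spam, i], X[spam, j], c="red", s=8, label="spam")
    if query_idx is not None:
        edge = np.where(query_correct, "green", "magenta")
        plt.scatter(X[query_idx, i], X[query_idx, j], s=60,
                    facecolors="none", edgecolors=edge, label="query points")
    plt.xlabel(feature_names[i])
    plt.ylabel(feature_names[j])
    plt.legend()
    plt.show()
```

Calling this once per recovered projection, with each query point routed to the projection selected for it, would reproduce the layout of the two panels.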
Figure 1: Spam Dataset Selected Subspaces. The left panel plots Capital Run Length Longest against Capital Run Length Total; the right panel plots Frequency of word 'your' against Capital Run Length Total.

Table 4: Classification accuracy using RECIP-learned projections (or known projections, in the lower section) within a voting model instead of a selection model; mean accuracy with variance in parentheses.

VOTING ENSEMBLE, RECIP-LEARNED PROJECTIONS
q      noise 0          0.02             0.05             0.1              0.2
1      0.9751 (0.0000)  0.9731 (0.0000)  0.9686 (0.0000)  0.9317 (0.0070)  0.9226 (0.0053)
2      0.7360 (0.0002)  0.7354 (0.0002)  0.7331 (0.0001)  0.7303 (0.0002)  0.7257 (0.0001)
3      0.7290 (0.0002)  0.7266 (0.0002)  0.7163 (0.0008)  0.7166 (0.0006)  0.7212 (0.0002)
4      0.6934 (0.0008)  0.6931 (0.0008)  0.6932 (0.0008)  0.6904 (0.0008)  0.6867 (0.0009)
5      0.6715 (0.0013)  0.6602 (0.0014)  0.6745 (0.0013)  0.6688 (0.0014)  0.6581 (0.0013)
6      0.6410 (0.0008)  0.6541 (0.0007)  0.6460 (0.0010)  0.6529 (0.0006)  0.6512 (0.0005)
7      0.6392 (0.0009)  0.6342 (0.0011)  0.6268 (0.0012)  0.6251 (0.0012)  0.6294 (0.0012)

VOTING ENSEMBLE, KNOWN PROJECTIONS
q      noise 0          0.02             0.05             0.1              0.2
1      0.9751 (0.0000)  0.9731 (0.0000)  0.9686 (0.0000)  0.9637 (0.0001)  0.9514 (0.0000)
2      0.7360 (0.0002)  0.7354 (0.0002)  0.7331 (0.0001)  0.7303 (0.0002)  0.7257 (0.0001)
3      0.7409 (0.0010)  0.7385 (0.0012)  0.7390 (0.0010)  0.7353 (0.0011)  0.7325 (0.0010)
4      0.7110 (0.0041)  0.7109 (0.0041)  0.7083 (0.0042)  0.7067 (0.0042)  0.7035 (0.0043)
5      0.7077 (0.0015)  0.7070 (0.0015)  0.7050 (0.0015)  0.7034 (0.0016)  0.7008 (0.0016)
6      0.6816 (0.0008)  0.6807 (0.0008)  0.6801 (0.0008)  0.6790 (0.0008)  0.6747 (0.0009)
7      0.6787 (0.0008)  0.6783 (0.0009)  0.6772 (0.0009)  0.6767 (0.0008)  0.6722 (0.0008)

Table 5: Classification accuracy for artificial data with the K-Nearest Neighbors method, mean (variance).
q      noise 0          0.02             0.05             0.1              0.2
1      0.7909 (0.0002)  0.7843 (0.0002)  0.7747 (0.0002)  0.7652 (0.0002)  0.7412 (0.0002)
2      0.7940 (0.0001)  0.7911 (0.0001)  0.7861 (0.0001)  0.7790 (0.0001)  0.7655 (0.0001)
3      0.7964 (0.0000)  0.7939 (0.0001)  0.7901 (0.0001)  0.7854 (0.0000)  0.7756 (0.0000)
4      0.7990 (0.0001)  0.7972 (0.0001)  0.7942 (0.0001)  0.7904 (0.0001)  0.7828 (0.0001)
5      0.8038 (0.0001)  0.8024 (0.0001)  0.8002 (0.0001)  0.7970 (0.0001)  0.7905 (0.0001)
6      0.8043 (0.0001)  0.8032 (0.0001)  0.8015 (0.0001)  0.7987 (0.0001)  0.7930 (0.0001)
7      0.8054 (0.0001)  0.8044 (0.0001)  0.8028 (0.0001)  0.8004 (0.0001)  0.7955 (0.0001)

4 Conclusion

This paper considers the problem of Projection Recovery for Classification. It is relevant in applications where the decision process must be easy to understand in order to enable human interpretation of the results. We have developed a principled, regression-based algorithm designed to recover small sets of low-dimensional subspaces that support interpretability. It optimizes the selection using individual data-point-specific entropy estimators. In this context, the proposed algorithm follows the idea of transductive learning, and the role of the resulting projections bears resemblance to high-confidence regions known in conformal prediction models. Empirical results obtained using simulated and real-world data show the effectiveness of our method in finding informative projections that enable accurate classification while maintaining transparency of the underlying decision process.

Acknowledgments

This material is based upon work supported by the NSF under Grant No. IIS-0911032.

5 0.73444456 361 nips-2012-Volume Regularization for Binary Classification

Author: Koby Crammer, Tal Wagner

Abstract: We introduce a large-volume box classification for binary prediction, which maintains a subset of weight vectors, and specifically axis-aligned boxes. Our learning algorithm seeks for a box of large volume that contains “simple” weight vectors which most of are accurate on the training set. Two versions of the learning process are cast as convex optimization problems, and it is shown how to solve them efficiently. The formulation yields a natural PAC-Bayesian performance bound and it is shown to minimize a quantity directly aligned with it. The algorithm outperforms SVM and the recently proposed AROW algorithm on a majority of 30 NLP datasets and binarized USPS optical character recognition datasets. 1

6 0.72546178 227 nips-2012-Multiclass Learning with Simplex Coding

7 0.71908402 14 nips-2012-A P300 BCI for the Masses: Prior Information Enables Instant Unsupervised Spelling

8 0.70472515 28 nips-2012-A systematic approach to extracting semantic information from functional MRI data

9 0.69940764 130 nips-2012-Feature-aware Label Space Dimension Reduction for Multi-label Classification

10 0.69636887 67 nips-2012-Classification Calibration Dimension for General Multiclass Losses

11 0.66984785 197 nips-2012-Learning with Recursive Perceptual Representations

12 0.66923183 50 nips-2012-Bandit Algorithms boost Brain Computer Interfaces for motor-task selection of a brain-controlled button

13 0.64970809 48 nips-2012-Augmented-SVM: Automatic space partitioning for combining multiple non-linear dynamics

14 0.63459051 271 nips-2012-Pointwise Tracking the Optimal Regression Function

15 0.62076628 266 nips-2012-Patient Risk Stratification for Hospital-Associated C. diff as a Time-Series Classification Task

16 0.61886817 181 nips-2012-Learning Multiple Tasks using Shared Hypotheses

17 0.6150735 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

18 0.60393149 97 nips-2012-Diffusion Decision Making for Adaptive k-Nearest Neighbor Classification

19 0.60216588 207 nips-2012-Mandatory Leaf Node Prediction in Hierarchical Multilabel Classification

20 0.59024316 230 nips-2012-Multiple Choice Learning: Learning to Produce Multiple Structured Outputs


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.075), (13, 0.149), (17, 0.013), (21, 0.02), (38, 0.161), (42, 0.07), (54, 0.016), (55, 0.029), (74, 0.049), (76, 0.163), (80, 0.137), (92, 0.041)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.90192837 125 nips-2012-Factoring nonnegative matrices with linear programs

Author: Ben Recht, Christopher Re, Joel Tropp, Victor Bittorf

Abstract: This paper describes a new approach, based on linear programming, for computing nonnegative matrix factorizations (NMFs). The key idea is a data-driven model for the factorization where the most salient features in the data are used to express the remaining features. More precisely, given a data matrix X, the algorithm identifies a matrix C that satisfies X ≈ CX and some linear constraints. The constraints are chosen to ensure that the matrix C selects features; these features can then be used to find a low-rank NMF of X. A theoretical analysis demonstrates that this approach has guarantees similar to those of the recent NMF algorithm of Arora et al. (2012). In contrast with this earlier work, the proposed method extends to more general noise models and leads to efficient, scalable algorithms. Experiments with synthetic and real datasets provide evidence that the new approach is also superior in practice. An optimized C++ implementation can factor a multigigabyte matrix in a matter of minutes. 1

same-paper 2 0.89852548 200 nips-2012-Local Supervised Learning through Space Partitioning


3 0.89307213 361 nips-2012-Volume Regularization for Binary Classification


4 0.87185085 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

Author: Yi Wu, David P. Wipf

Abstract: Sparse linear (or generalized linear) models combine a standard likelihood function with a sparse prior on the unknown coefficients. These priors can conveniently be expressed as a maximization over zero-mean Gaussians with different variance hyperparameters. Standard MAP estimation (Type I) involves maximizing over both the hyperparameters and coefficients, while an empirical Bayesian alternative (Type II) first marginalizes the coefficients and then maximizes over the hyperparameters, leading to a tractable posterior approximation. The underlying cost functions can be related via a dual-space framework from [22], which allows both the Type I or Type II objectives to be expressed in either coefficient or hyperparmeter space. This perspective is useful because some analyses or extensions are more conducive to development in one space or the other. Herein we consider the estimation of a trade-off parameter balancing sparsity and data fit. As this parameter is effectively a variance, natural estimators exist by assessing the problem in hyperparameter (variance) space, transitioning natural ideas from Type II to solve what is much less intuitive for Type I. In contrast, for analyses of update rules and sparsity properties of local and global solutions, as well as extensions to more general likelihood models, we can leverage coefficient-space techniques developed for Type I and apply them to Type II. For example, this allows us to prove that Type II-inspired techniques can be successful recovering sparse coefficients when unfavorable restricted isometry properties (RIP) lead to failure of popular ℓ1 reconstructions. It also facilitates the analysis of Type II when non-Gaussian likelihood models lead to intractable integrations. 1

5 0.87105429 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

Author: Chong Wang, David M. Blei

Abstract: We present a truncation-free stochastic variational inference algorithm for Bayesian nonparametric models. While traditional variational inference algorithms require truncations for the model or the variational distribution, our method adapts model complexity on the fly. We studied our method with Dirichlet process mixture models and hierarchical Dirichlet process topic models on two large data sets. Our method performs better than previous stochastic variational inference algorithms. 1

6 0.87072784 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs

7 0.86997998 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

8 0.86391848 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders

9 0.86364734 292 nips-2012-Regularized Off-Policy TD-Learning

10 0.86099321 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models

11 0.86003011 19 nips-2012-A Spectral Algorithm for Latent Dirichlet Allocation

12 0.85945106 65 nips-2012-Cardinality Restricted Boltzmann Machines

13 0.85936606 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

14 0.85830772 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes

15 0.85809469 171 nips-2012-Latent Coincidence Analysis: A Hidden Variable Model for Distance Metric Learning

16 0.85770297 330 nips-2012-Supervised Learning with Similarity Functions

17 0.85721481 16 nips-2012-A Polynomial-time Form of Robust Regression

18 0.85714191 294 nips-2012-Repulsive Mixtures

19 0.8567971 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

20 0.85578191 199 nips-2012-Link Prediction in Graphs with Autoregressive Features