nips nips2012 nips2012-200 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Joseph Wang, Venkatesh Saligrama
Abstract: We develop a novel approach for supervised learning based on adaptively partitioning the feature space into different regions and learning local region-specific classifiers. We formulate an empirical risk minimization problem that incorporates both partitioning and classification in to a single global objective. We show that space partitioning can be equivalently reformulated as a supervised learning problem and consequently any discriminative learning method can be utilized in conjunction with our approach. Nevertheless, we consider locally linear schemes by learning linear partitions and linear region classifiers. Locally linear schemes can not only approximate complex decision boundaries and ensure low training error but also provide tight control on over-fitting and generalization error. We train locally linear classifiers by using LDA, logistic regression and perceptrons, and so our scheme is scalable to large data sizes and high-dimensions. We present experimental results demonstrating improved performance over state of the art classification techniques on benchmark datasets. We also show improved robustness to label noise.
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract We develop a novel approach for supervised learning based on adaptively partitioning the feature space into different regions and learning local region-specific classifiers. [sent-5, score-0.645]
2 We formulate an empirical risk minimization problem that incorporates both partitioning and classification in to a single global objective. [sent-6, score-0.408]
3 We show that space partitioning can be equivalently reformulated as a supervised learning problem and consequently any discriminative learning method can be utilized in conjunction with our approach. [sent-7, score-0.411]
4 Nevertheless, we consider locally linear schemes by learning linear partitions and linear region classifiers. [sent-8, score-0.57]
5 Locally linear schemes can not only approximate complex decision boundaries and ensure low training error but also provide tight control on over-fitting and generalization error. [sent-9, score-0.474]
6 We train locally linear classifiers by using LDA, logistic regression and perceptrons, and so our scheme is scalable to large data sizes and high-dimensions. [sent-10, score-0.349]
7 1 Introduction We develop a novel approach for supervised learning based on adaptively partitioning the feature space into different regions and learning local region classifiers. [sent-13, score-0.79]
8 Each reject classifier, gj , makes a binary decision and the observation is either classified by the associated region classifier, fj , or passed to the next reject classifier. [sent-17, score-0.938]
9 Each reject classifier, gj , thus partitions the feature space into regions. [sent-18, score-0.466]
10 The region classifier fj operates only on examples within the local region that is consistent with the reject classifier partitions. [sent-19, score-0.666]
11 We incorporate both feature space partitioning (reject classifiers) and region-specific classifiers into a single global empirical risk/loss function. [sent-20, score-0.44]
12 In this context we show that each step of the coordinate descent can be reformulated as a supervised learning problem that seeks to optimize a 0/1 empirical loss function. [sent-22, score-0.272]
13 This result is somewhat surprising in the context of partitioning and has broader implications. [sent-23, score-0.28]
14 First, we can now solve feature space partitioning through empirical risk function minimization(ERM) and so powerful existing methods including boosting, decision trees and kernel methods can be used in conjunction for training flexible partitioning classifiers. [sent-24, score-0.933]
15 Second, because data is usually locally “well-behaved,” simpler region-classifiers, such as linear classifiers, often suffice for controlling local empirical error. [sent-25, score-0.291]
16 Furthermore, since complex boundaries for partitions can be approximated by piecewise linear functions, feature spaces can be partitioned to arbitrary degree of precision using linear boundaries (reject classifiers). [sent-26, score-0.533]
17 Thus the combination of piecewise linear partitions along with linear region classifiers has the ability to adapt to complex data sets leading to low training error. [sent-27, score-0.569]
18 Reject Classifiers, gj (x), partition space and region classifiers, fj (x), are applied locally within the partitioned region. [sent-29, score-0.455]
19 We use linear perceptrons and logistic regression for training partitioning classifier and region classifiers. [sent-31, score-0.733]
20 Our scheme splits with 3 regions and does not overtrain unlike Adaboost. [sent-32, score-0.214]
21 number of linear partitions and linear region classifiers, since the VC dimension of such a structure is reasonably small. [sent-33, score-0.375]
22 1 (right) demonstrates the substantial benefits of our approach on the banana dataset[1] over competing methods such as boosting and decision trees, both of which evidently overtrain. [sent-36, score-0.295]
23 Limiting reject and region classifiers to linear methods has computational advantages as well. [sent-37, score-0.407]
24 Since the datasets are locally well-behaved we can locally train with linear discriminant analysis (LDA), logistic regression and variants of perceptrons. [sent-38, score-0.501]
25 Indeed, we present some evidence that shows that the partitioning step can adaptively cluster the dataset into groups and letting region classifiers to operate on simpler problems. [sent-42, score-0.474]
26 Additionally linear methods such as LDA, Logistic regression, and perceptron naturally extend to multi-class problems leading to computationally efficient and statistically meaningful results as evidenced on challenging datasets with performance improvements over state of the art techniques. [sent-43, score-0.309]
27 Boosting algorithms [2] learn complex decision boundaries characterized as a weighted linear combination of weak classifiers. [sent-46, score-0.315]
28 In contrast our method takes unions and intersections of simpler decision regions to learn more complex decision boundaries. [sent-47, score-0.47]
29 Decision trees are built by greedily partitioning the feature space [3]. [sent-49, score-0.439]
30 One main difference is that decision trees typically attempt to greedily minimize some loss or a heuristic, such as region purity or entropy, at each split/partition of the feature space. [sent-50, score-0.475]
31 Also decision trees typically split/partition a single feature/component resulting in unions of rectangularly shaped decision regions; in contrast we allow arbitrary partitions leading to complex decision regions. [sent-52, score-0.677]
32 In these methods a multiclass problem is decomposed into several binary problems using a code matrix and the predicted outcomes of these binary problems are fused to obtain multiclass labels. [sent-54, score-0.424]
33 3) that suggests that our space partitioning classifier groups/clusters multiple classes into different regions; nevertheless our formulation is different in that we do not explicitly code classes into different regions and our method does not require fusion of intermediate outcomes. [sent-57, score-0.439]
34 This is because we show that space partitioning itself can be re-formulated as a supervised learning problem. [sent-59, score-0.378]
35 Consequently, any 2 existing method, including boosting and decision trees, could be used as a method of choice for learning space partitioning and region-specific decision functions. [sent-60, score-0.696]
36 We use simple linear classifiers for partitioning and region-classifiers in many of our experiments. [sent-61, score-0.337]
37 A recently proposed alternative approach is to build a global classifier ignoring clusters of errors, and building separate classifiers in each error cluster region [11]. [sent-71, score-0.256]
38 This proposed approach greedily approximates a piecewise linear classifier in this manner, however fails to take into account the performance of the classifiers in the error cluster regions. [sent-72, score-0.229]
39 1 that adaptively partitions data into regions and fits simple classifiers within each region. [sent-85, score-0.292]
40 We predict the output for a test sample, x, based on the output of the trained simple classifier associated with the region x belongs to. [sent-86, score-0.256]
41 In the sequel we formulate space partitioning and region-classification into a single objective and show that space partitioning is equivalent to solving a binary classification problem with 0/1 empirical loss. [sent-88, score-0.74]
42 1 Binary Space Partitioning as Supervised Learning In this section we consider learning binary space partitioning for ease of exposition. [sent-90, score-0.37]
43 The empirical risk/loss associated with the binary space partitioned classifiers is given by: n n 1 1 R(g, f0 , f1 ) = ½{g(xi )=0} ½{f0 (xi )=yi } + ½{g(xi )=1} ½{f1 (xi )=yi } (1) n i=1 n i=1 Our goal is to minimize the empirical error jointly over the family of functions g(·) ∈ G and fi (·) ∈ F . [sent-94, score-0.247]
44 From the above equation, when the partitioning function g(·) is fixed, it is clear how one can view choice of classifiers f0 (·) and f1 (·) as ERM problems. [sent-95, score-0.28]
45 We observe several aspects of our proposed scheme: (1) Binary partitioning is a binary classification problem on the training set, (xi , 0 ), i = i 1, 2, . [sent-105, score-0.403]
46 (3) The partitioning error is zero on a training example xi with weight wi = 1 if we choose g(xi ) = 0 on examples where f0 (xi ) = yi . [sent-112, score-0.683]
47 In contrast if f0 (xi ) = yi the partitioning error can be reduced by choosing g(xi ) = 1, and thus rejecting the example from consideration by f0 . [sent-113, score-0.42]
48 1 is that we can now use powerful learning techniques such as decision trees, boosting and SVMs for learning space partitioning classifiers. [sent-116, score-0.593]
49 These surrogate losses are upper bounds for indicator losses and usually attempt to obtain large margins. [sent-121, score-0.268]
50 This is because even though each surrogate upper bounds indicator loss functions at each step, when put together they do not upper bound the global objective of Eq. [sent-124, score-0.255]
51 The empirical error can be upper 4 bounded: ½ g(x)=1 = ½− h(x)≤0 ≤ φ(− h(x)) We then form a global surrogate for the empirical loss function. [sent-132, score-0.374]
52 1 with surrogate functions, the global surrogate is given by: 1 ˆ R(g, f0 , f1 ) = n n φ (h(xi )) φ (yi f0 (xi )) + i=1 1 n n φ (−h(xi )) φ (yi f1 (xi )) , (3) i=1 which is an upper bound on Eq. [sent-134, score-0.284]
53 Optimizing the partitioning function g(·) can be posed as a supervised learning problem, resulting in the following lemma (see Supplementary for a proof): Lemma 2. [sent-136, score-0.346]
54 , fr ) = 1 n n L1 (xi , yi ) (5) i=1 In the expression above we have made the dependence on reject classifiers and region-classifiers explicit. [sent-146, score-0.333]
55 5 over all gj , fj by means of coordinate descent, namely, to optimize gk we hold fj , ∀j and gj , j = k fixed. [sent-148, score-0.614]
56 , r do Train region classifier fj (x) to optimize 0/1 empirical loss of Eq. [sent-153, score-0.355]
57 , 2, 1 do Train reject classifier gk (x) to optimize 0/1 empirical loss of Eq. [sent-158, score-0.49]
58 Consequently, we can also reduce this problem to a supervised n learning problem: 1 gk (·) = argmin wi ½{g(xi )= i } , (7) n g∈G i=1 where i = 0 if fk (xi ) = yi 1 if fk (xi ) = yi 1, 0, and wi = i = Lk+1 (xi , yi ), Ck (x) = 0 . [sent-163, score-0.838]
59 otherwise The composite classifier F (x) based on the reject and region classifiers can be written compactly as follows: F (x) = fs (x), s = min{j | gj (x) = 0} ∪ {r} (8) Observe that if the kth region classifier correctly classifies the example xi , i. [sent-164, score-0.829]
60 , fk (xi ) = yi then this would encourage gk (xi ) = 0. [sent-166, score-0.358]
61 This is because gk (xi ) = 1 would induce an increased cost in terms of increasing Lk+1 (xi , yi ). [sent-167, score-0.254]
62 Similarly, if the kth region classifier incorrectly classifies, namely, fk (xi ) = yi , the optimization would prefer gk (xi ) = 1. [sent-168, score-0.541]
63 Also note that if the kth region classifier loss as well as the subsequent stages are incorrect on an example are incorrect then the weight on that example is zero. [sent-169, score-0.289]
64 We can deal with minimizing indicator losses and resulting convergence issues by deriving a global surrogate as we did in Sec. [sent-171, score-0.261]
65 4 Local Linear Classification Linear classification is a natural method for learning local decision boundaries, with the global decision regions approximated by piecewise linear functions. [sent-176, score-0.699]
66 In local linear classification, local classifiers, f1 , f2 , . [sent-177, score-0.239]
67 Bias error (empirical error) can be made arbitrarily small by approximating the decision boundary by many local linear classifiers. [sent-185, score-0.321]
68 Variance error (classifier complexity) can be made small by Figure 2: Local LDA classification regions restricting the number of local linear classifiers used to for XOR data, the black line is reject classifier boundary. [sent-186, score-0.521]
69 8) with r − 1 linear classifiers gj and r linear classifiers fj in a d-dimensional space is bounded by 2(2r − 1) log(e(2r − 1))(d + 1). [sent-191, score-0.339]
70 The VC-dimension of local linear classifiers grows linearly with dimension and nearly linearly with respect to the number of regions. [sent-192, score-0.214]
71 In practice, few regions are necessary to achieve low training error as highly non-linear decision boundaries can be approximated well locally with linear boundaries. [sent-195, score-0.594]
72 Learning the local linear classifier with 2 regions using LDA produces a classifier with small empirical error. [sent-197, score-0.333]
73 Then with high probability (where probability is wrt initial sampling of reject region) a two region composite classifier trained locally using LDA converges to zero training error. [sent-201, score-0.619]
74 Although margin based methods such as SVMs can be used, we primarily use relatively simple schemes such as LDA, logistic regression, and average voted perceptron in our experiments. [sent-204, score-0.418]
75 We use each of these schemes for learning both reject and region-classifiers. [sent-205, score-0.258]
76 Computational Costs of LDA, Logistic Regression and Perceptron: Each LDA classifier is trained in O(nd2 ) computations, where n is the number of training observations and d is the dimension of the training data. [sent-207, score-0.208]
77 As a result, the total computation cost per iteration of the local linear classifier with LDA scales linearly with respect to the number of training samples, requiring O(nd2 r) computations per iteration, where r is the number of classification regions. [sent-208, score-0.291]
78 Similarly, the computational cost of training a single linear classifier by logistic regression scales O(ncd2 ) for a fixed number of iterations, with the local linear classifier training time scaling O(rncd2 ) computations per iteration, where c is the number of classes. [sent-209, score-0.51]
79 A linear variant of the voted perceptron was implemented by taking the average of the weights generated by the unnormalized voted perceptron [15]. [sent-210, score-0.621]
80 Training each perceptron for a fixed number of epochs is extremely efficient, requiring only O(ndc) computations to train. [sent-211, score-0.22]
81 Therefore, training local linear perceptron scales linearly with data size and dimensions, with O(ndcr) computations, per iteration. [sent-212, score-0.421]
82 3 Experimental Results Multiclass Classification:Experimental results on six datasets from the UCI repository [16] were performed using the benchmark training and test splits associated with each data set, as shown in Table 1. [sent-213, score-0.21]
83 Although confidence intervals cannot be computed by multiple training/test splits, test set error bounds [17] show that with test data sets of these sizes, the difference between true error and empirical error is small with high probability. [sent-215, score-0.247]
84 Local linear classifiers were trained with LDA, logistic regression, and perceptron (mean of weights) used to learn local surrogates for the rejection and local classification problems. [sent-217, score-0.693]
85 The classifiers were initialized with 5 classification regions (r = 5), with the trained classifiers often reducing to fewer classification regions due to empty rejection region. [sent-218, score-0.376]
86 Termination of the algorithm occurred when the rejection outputs, gk (x), and classification labels, F (x), remained consistent on the training data for two iterations. [sent-219, score-0.264]
87 Results were compared with Mixture Discriminant Analysis (MDA) g1(x) g2(x) g3(x) g4(x) g5(x) Figure 3: Histogram of classes over test data for the Optdigit dataset in different partitions generated by our approach using the linear voted perceptron . [sent-221, score-0.488]
88 The improved performance of local linear learning of comparable complexity justifies approximating these boundaries by piecewise linear functions. [sent-227, score-0.382]
89 Training each binary kernelized classifier is computationally intensive, and on weakly learnable data, boosting also allows for modeling of complex boundaries with arbitrarily small empirical error. [sent-229, score-0.362]
90 One vs All AdaBoostis trained using decision stumps as weak learners. [sent-232, score-0.259]
91 AdaBoost-SAMME and GD-MCBoost are trained using depth-2 decision trees as weak learners. [sent-233, score-0.296]
92 32% In 4 of the 6 datasets, local linear classification produced the lowest classification error on test datasets, with optimal test errors within 0. [sent-282, score-0.255]
93 Also there is evidence that suggests that our scheme partitions multiclass problems into simpler subproblems. [sent-284, score-0.315]
94 We plotted histogram output of class labels for Optdigit dataset across different regions using local perceptrons (Fig. [sent-285, score-0.274]
95 This can contrasted with many state of the art boosting techniques, such as [18], which attempt to optimize both the codewords for each class as well as the binary classification problems defining the codewords. [sent-289, score-0.283]
96 Robustness to Label Noise: Local linear classification trained using LDA, logistic regression, and averaged voted perceptron was tested in the presence of random label noise. [sent-292, score-0.555]
97 A randomly selected fraction of all training observations were given incorrect labels, and trained as described for the multiclass experiments. [sent-293, score-0.33]
98 For each label noise fraction, 100 randomly drawn training and test sets were used, and the average test error is shown in Fig. [sent-296, score-0.227]
99 For comparison, results are shown for classification trees trained according to Gini’s diversity index (GDI) [3], AdaBoost trained with stumps [2], and support vector machines trained on Gaussian radial basis function kernels. [sent-298, score-0.369]
100 In comparison, boosting and classification trees show sensitivity to label noise, with the test error increasing at a faster rate than LDA-trained local linear classification on both the Wisconsin Breast Cancer data and Vertebrae data. [sent-300, score-0.483]
wordName wordTfidf (topN-words)
[('ers', 0.363), ('classi', 0.335), ('partitioning', 0.28), ('er', 0.211), ('reject', 0.205), ('perceptron', 0.175), ('gk', 0.155), ('multiclass', 0.154), ('lda', 0.154), ('region', 0.145), ('xi', 0.142), ('decision', 0.132), ('regions', 0.127), ('boosting', 0.12), ('partitions', 0.116), ('gj', 0.113), ('surrogate', 0.107), ('voted', 0.107), ('fk', 0.104), ('yi', 0.099), ('adaboost', 0.098), ('local', 0.091), ('piecewise', 0.09), ('boundaries', 0.087), ('mda', 0.086), ('optdigit', 0.086), ('trees', 0.086), ('locally', 0.085), ('logistic', 0.083), ('fj', 0.08), ('erm', 0.079), ('trained', 0.078), ('discriminant', 0.077), ('surrogates', 0.074), ('gdi', 0.073), ('vertebrae', 0.073), ('lk', 0.07), ('global', 0.07), ('supervised', 0.066), ('training', 0.065), ('boston', 0.059), ('binary', 0.058), ('empirical', 0.058), ('cation', 0.058), ('linear', 0.057), ('wi', 0.056), ('xor', 0.056), ('perceptrons', 0.056), ('label', 0.055), ('schemes', 0.053), ('adaptively', 0.049), ('saberian', 0.049), ('stumps', 0.049), ('breast', 0.047), ('regression', 0.047), ('losses', 0.046), ('computations', 0.045), ('scheme', 0.045), ('rejection', 0.044), ('gini', 0.043), ('banana', 0.043), ('landsat', 0.043), ('pendigit', 0.043), ('shuttle', 0.043), ('splits', 0.042), ('uci', 0.042), ('wisconsin', 0.042), ('art', 0.042), ('error', 0.041), ('cancer', 0.041), ('greedily', 0.041), ('composite', 0.041), ('coordinate', 0.041), ('namely', 0.041), ('loss', 0.04), ('unions', 0.04), ('complex', 0.039), ('indicator', 0.038), ('kth', 0.038), ('robert', 0.037), ('repository', 0.035), ('descent', 0.035), ('datasets', 0.035), ('test', 0.033), ('consequently', 0.033), ('linearly', 0.033), ('incorrect', 0.033), ('wine', 0.033), ('optimize', 0.032), ('space', 0.032), ('train', 0.032), ('ck', 0.032), ('yoav', 0.031), ('attempt', 0.031), ('optimizing', 0.03), ('optimizes', 0.03), ('techniques', 0.029), ('fr', 0.029), ('gr', 0.029), ('isolet', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999994 200 nips-2012-Local Supervised Learning through Space Partitioning
Author: Joseph Wang, Venkatesh Saligrama
Abstract: We develop a novel approach for supervised learning based on adaptively partitioning the feature space into different regions and learning local region-specific classifiers. We formulate an empirical risk minimization problem that incorporates both partitioning and classification in to a single global objective. We show that space partitioning can be equivalently reformulated as a supervised learning problem and consequently any discriminative learning method can be utilized in conjunction with our approach. Nevertheless, we consider locally linear schemes by learning linear partitions and linear region classifiers. Locally linear schemes can not only approximate complex decision boundaries and ensure low training error but also provide tight control on over-fitting and generalization error. We train locally linear classifiers by using LDA, logistic regression and perceptrons, and so our scheme is scalable to large data sizes and high-dimensions. We present experimental results demonstrating improved performance over state of the art classification techniques on benchmark datasets. We also show improved robustness to label noise.
2 0.22042973 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound
Author: Chi Jin, Liwei Wang
Abstract: Margin is one of the most important concepts in machine learning. Previous margin bounds, both for SVM and for boosting, are dimensionality independent. A major advantage of this dimensionality independency is that it can explain the excellent performance of SVM whose feature spaces are often of high or infinite dimension. In this paper we address the problem whether such dimensionality independency is intrinsic for the margin bounds. We prove a dimensionality dependent PAC-Bayes margin bound. The bound is monotone increasing with respect to the dimension when keeping all other factors fixed. We show that our bound is strictly sharper than a previously well-known PAC-Bayes margin bound if the feature space is of finite dimension; and the two bounds tend to be equivalent as the dimension goes to infinity. In addition, we show that the VC bound for linear classifiers can be recovered from our bound under mild conditions. We conduct extensive experiments on benchmark datasets and find that the new bound is useful for model selection and is usually significantly sharper than the dimensionality independent PAC-Bayes margin bound as well as the VC bound for linear classifiers.
3 0.17827564 67 nips-2012-Classification Calibration Dimension for General Multiclass Losses
Author: Harish G. Ramaswamy, Shivani Agarwal
Abstract: We study consistency properties of surrogate loss functions for general multiclass classification problems, defined by a general loss matrix. We extend the notion of classification calibration, which has been studied for binary and multiclass 0-1 classification problems (and for certain other specific learning problems), to the general multiclass setting, and derive necessary and sufficient conditions for a surrogate loss to be classification calibrated with respect to a loss matrix in this setting. We then introduce the notion of classification calibration dimension of a multiclass loss matrix, which measures the smallest ‘size’ of a prediction space for which it is possible to design a convex surrogate that is classification calibrated with respect to the loss matrix. We derive both upper and lower bounds on this quantity, and use these results to analyze various loss matrices. In particular, as one application, we provide a different route from the recent result of Duchi et al. (2010) for analyzing the difficulty of designing ‘low-dimensional’ convex surrogates that are consistent with respect to pairwise subset ranking losses. We anticipate the classification calibration dimension may prove to be a useful tool in the study and design of surrogate losses for general multiclass learning problems. 1
4 0.16309291 227 nips-2012-Multiclass Learning with Simplex Coding
Author: Youssef Mroueh, Tomaso Poggio, Lorenzo Rosasco, Jean-jeacques Slotine
Abstract: In this paper we discuss a novel framework for multiclass learning, defined by a suitable coding/decoding strategy, namely the simplex coding, that allows us to generalize to multiple classes a relaxation approach commonly used in binary classification. In this framework, we develop a relaxation error analysis that avoids constraints on the considered hypotheses class. Moreover, using this setting we derive the first provably consistent regularized method with training/tuning complexity that is independent to the number of classes. We introduce tools from convex analysis that can be used beyond the scope of this paper. 1
5 0.14588545 197 nips-2012-Learning with Recursive Perceptual Representations
Author: Oriol Vinyals, Yangqing Jia, Li Deng, Trevor Darrell
Abstract: Linear Support Vector Machines (SVMs) have become very popular in vision as part of state-of-the-art object recognition and other classification tasks but require high dimensional feature spaces for good performance. Deep learning methods can find more compact representations but current methods employ multilayer perceptrons that require solving a difficult, non-convex optimization problem. We propose a deep non-linear classifier whose layers are SVMs and which incorporates random projection as its core stacking element. Our method learns layers of linear SVMs recursively transforming the original data manifold through a random projection of the weak prediction computed from each layer. Our method scales as linear SVMs, does not rely on any kernel computations or nonconvex optimization, and exhibits better generalization ability than kernel-based SVMs. This is especially true when the number of training samples is smaller than the dimensionality of data, a common scenario in many real-world applications. The use of random projections is key to our method, as we show in the experiments section, in which we observe a consistent improvement over previous –often more complicated– methods on several vision and speech benchmarks. 1
6 0.14033276 226 nips-2012-Multiclass Learning Approaches: A Theoretical Comparison with Implications
7 0.13021903 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models
8 0.12095276 279 nips-2012-Projection Retrieval for Classification
9 0.11190526 14 nips-2012-A P300 BCI for the Masses: Prior Information Enables Instant Unsupervised Spelling
10 0.11112892 271 nips-2012-Pointwise Tracking the Optimal Regression Function
11 0.11043017 361 nips-2012-Volume Regularization for Binary Classification
12 0.10860823 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing
13 0.10329921 261 nips-2012-Online allocation and homogeneous partitioning for piecewise constant mean-approximation
14 0.10280878 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs
15 0.098090857 97 nips-2012-Diffusion Decision Making for Adaptive k-Nearest Neighbor Classification
16 0.096709445 242 nips-2012-Non-linear Metric Learning
17 0.096157312 81 nips-2012-Context-Sensitive Decision Forests for Object Detection
18 0.09355513 48 nips-2012-Augmented-SVM: Automatic space partitioning for combining multiple non-linear dynamics
19 0.092519671 105 nips-2012-Dynamic Pruning of Factor Graphs for Maximum Marginal Prediction
20 0.091094092 117 nips-2012-Ensemble weighted kernel estimators for multivariate entropy estimation
topicId topicWeight
[(0, 0.254), (1, 0.027), (2, -0.04), (3, -0.079), (4, 0.113), (5, -0.057), (6, 0.025), (7, 0.19), (8, -0.103), (9, -0.063), (10, 0.068), (11, 0.099), (12, 0.13), (13, 0.051), (14, -0.01), (15, -0.119), (16, -0.159), (17, 0.056), (18, -0.014), (19, 0.154), (20, -0.061), (21, 0.041), (22, -0.081), (23, 0.141), (24, -0.004), (25, -0.152), (26, -0.019), (27, -0.042), (28, 0.003), (29, 0.014), (30, -0.027), (31, 0.095), (32, 0.082), (33, -0.022), (34, -0.032), (35, 0.002), (36, -0.077), (37, 0.129), (38, -0.056), (39, 0.007), (40, -0.023), (41, -0.045), (42, -0.066), (43, -0.022), (44, 0.018), (45, 0.001), (46, -0.087), (47, -0.033), (48, 0.067), (49, 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 0.98437518 200 nips-2012-Local Supervised Learning through Space Partitioning
Author: Joseph Wang, Venkatesh Saligrama
Abstract: We develop a novel approach for supervised learning based on adaptively partitioning the feature space into different regions and learning local region-specific classifiers. We formulate an empirical risk minimization problem that incorporates both partitioning and classification in to a single global objective. We show that space partitioning can be equivalently reformulated as a supervised learning problem and consequently any discriminative learning method can be utilized in conjunction with our approach. Nevertheless, we consider locally linear schemes by learning linear partitions and linear region classifiers. Locally linear schemes can not only approximate complex decision boundaries and ensure low training error but also provide tight control on over-fitting and generalization error. We train locally linear classifiers by using LDA, logistic regression and perceptrons, and so our scheme is scalable to large data sizes and high-dimensions. We present experimental results demonstrating improved performance over state of the art classification techniques on benchmark datasets. We also show improved robustness to label noise.
2 0.85796863 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound
Author: Chi Jin, Liwei Wang
Abstract: Margin is one of the most important concepts in machine learning. Previous margin bounds, both for SVM and for boosting, are dimensionality independent. A major advantage of this dimensionality independency is that it can explain the excellent performance of SVM whose feature spaces are often of high or infinite dimension. In this paper we address the problem whether such dimensionality independency is intrinsic for the margin bounds. We prove a dimensionality dependent PAC-Bayes margin bound. The bound is monotone increasing with respect to the dimension when keeping all other factors fixed. We show that our bound is strictly sharper than a previously well-known PAC-Bayes margin bound if the feature space is of finite dimension; and the two bounds tend to be equivalent as the dimension goes to infinity. In addition, we show that the VC bound for linear classifiers can be recovered from our bound under mild conditions. We conduct extensive experiments on benchmark datasets and find that the new bound is useful for model selection and is usually significantly sharper than the dimensionality independent PAC-Bayes margin bound as well as the VC bound for linear classifiers.
3 0.79071915 226 nips-2012-Multiclass Learning Approaches: A Theoretical Comparison with Implications
Author: Amit Daniely, Sivan Sabato, Shai S. Shwartz
Abstract: We theoretically analyze and compare the following five popular multiclass classification methods: One vs. All, All Pairs, Tree-based classifiers, Error Correcting Output Codes (ECOC) with randomly generated code matrices, and Multiclass SVM. In the first four methods, the classification is based on a reduction to binary classification. We consider the case where the binary classifier comes from a class of VC dimension d, and in particular from the class of halfspaces over Rd . We analyze both the estimation error and the approximation error of these methods. Our analysis reveals interesting conclusions of practical relevance, regarding the success of the different approaches under various conditions. Our proof technique employs tools from VC theory to analyze the approximation error of hypothesis classes. This is in contrast to most previous uses of VC theory, which only deal with estimation error. 1
4 0.74239707 279 nips-2012-Projection Retrieval for Classification
Author: Madalina Fiterau, Artur Dubrawski
Abstract: In many applications, classification systems often require human intervention in the loop. In such cases the decision process must be transparent and comprehensible, simultaneously requiring minimal assumptions on the underlying data distributions. To tackle this problem, we formulate an axis-aligned subspace-finding task under the assumption that query specific information dictates the complementary use of the subspaces. We develop a regression-based approach called RECIP that efficiently solves this problem by finding projections that minimize a nonparametric conditional entropy estimator. Experiments show that the method is accurate in identifying the informative projections of the dataset, picking the correct views to classify query points, and facilitates visual evaluation by users. 1 Introduction and problem statement In the domain of predictive analytics, many applications which keep human users in the loop require the use of simple classification models. Often, it is required that a test-point be ‘explained’ (classified) using a simple low-dimensional projection of the original feature space. This is a Projection Retrieval for Classification problem (PRC). The interaction with the user proceeds as follows: the user provides the system a query point; the system searches for a projection in which the point can be accurately classified; the system displays the classification result as well as an illustration of how the classification decision was reached in the selected projection. Solving the PRC problem is relevant in many practical applications. For instance, consider a nuclear threat detection system installed at a border check point. Vehicles crossing the border are scanned with sensors so that a large array of measurements of radioactivity and secondary contextual information is being collected. These observations are fed into a classification system that determines whether the scanned vehicle may carry a threat. Given the potentially devastating consequences of a false negative, a border control agent is requested to validate the prediction and decide whether to submit the vehicle for a costly further inspection. With the positive classification rate of the system under strict bounds because of limitations in the control process, the risk of false negatives is increased. Despite its crucial role, human intervention should only be withheld for cases in which there are reasons to doubt the validity of classification. In order for a user to attest the validity of a decision, the user must have a good understanding of the classification process, which happens more readily when the classifier only uses the original dataset features rather than combinations of them, and when the discrimination models are low-dimensional. In this context, we aim to learn a set of classifiers in low-dimensional subspaces and a decision function which selects the subspace under which a test point is to be classified. Assume we are given a dataset {(x1 , y1 ) . . . (xn , yn )} ∈ X n × {0, 1}n and a class of discriminators H. The model will contain a set Π of subspaces of X ; Π ⊆ Π, where Π is the set of all axis-aligned subspaces of the original feature space, the power set of the features. To each projection πi ∈ Π corresponds one discriminator from a given hypothesis space hi ∈ H. It will also contain a selection function g : X → Π × H, which yields, for a query point x, the projection/discriminator pair with which this point will be classified. The notation π(x) refers to the projection of the point x onto the subspace 1 π while h(π(x)) represents the predicted label for x. Formally, we describe the model class as Md = {Π = {π : π ∈ Π, dim(π) ≤ d}, H = {hi : hi ∈ H, h : πi → Y, ∀i = 1 . . . |Π|}, g ∈ {f : X → {1 . . . |Π|}} . where dim(π) presents the dimensionality of the subspace determined by the projection π. Note that only projections up to size d will be considered, where d is a parameter specific to the application. The set H contains one discriminator from the hypothesis class H for each projection. Intuitively, the aim is to minimize the expected classification error over Md , however, a notable modification is that the projection and, implicitly, the discriminator, are chosen according to the data point that needs to be classified. Given a query x in the space X , g(x) will yield the subspace πg(x) onto which the query is projected and the discriminator hg(x) for it. Distinct test points can be handled using different combinations of subspaces and discriminators. We consider models that minimize 0/1 loss. Hence, the PRC problem can be stated as follows: M ∗ = arg min EX ,Y y = hg(x) (πg(x) (x)) M ∈Md There are limitations to the type of selection function g that can be learned. A simple example for which g can be recovered is a set of signal readings x for which, if one of the readings xi exceeds a threshold ti , the label can be predicted just based on xi . A more complex one is a dataset containing regulatory variables, that is, for xi in the interval [ak , bk ] the label only depends on (x1 . . . xnk ) k k datasets that fall into the latter category fulfill what we call the Subspace-Separability Assumption. This paper proposes an algorithm called RECIP that solves the PRC problem for a class of nonparametric classifiers. We evaluate the method on artificial data to show that indeed it correctly identifies the underlying structure for data satisfying the Subspace-Separability Assumption. We show some case studies to illustrate how RECIP offers insight into applications requiring human intervention. The use of dimensionality reduction techniques is a common preprocessing step in applications where the use of simplified classification models is preferable. Methods that learn linear combinations of features, such as Linear Discriminant Analysis, are not quite appropriate for the task considered here, since we prefer to natively rely on the dimensions available in the original feature space. Feature selection methods, such as e.g. lasso, are suitable for identifying sets of relevant features, but do not consider interactions between them. Our work better fits the areas of class dependent feature selection and context specific classification, highly connected to the concept of Transductive Learning [6]. Other context-sensitive methods are Lazy and Data-Dependent Decision Trees, [5] and [10] respectively. In Ting et al [14], the Feating submodel selection relies on simple attribute splits followed by fitting local predictors, though the algorithm itself is substantially different. Obozinski et al present a subspace selection method in the context of multitask learning [11]. Go et al propose a joint method for feature selection and subspace learning [7], however, their classification model is not particularly query specific. Alternatively, algorithms that transform complex or unintelligible models with user-friendly equivalents have been proposed [3, 2, 1, 8]. Algorithms specifically designed to yield understandable models are a precious few. Here we note a rule learning method described in [12], even though the resulting rules can make visualization difficult, while itemset mining [9] is not specifically designed for classification. Unlike those approaches, our method is designed to retrieve subsets of the feature space designed for use in a way that is complementary to the basic task at hand (classification) while providing query-specific information. 2 Recovering informative projections with RECIP To solve PRC, we need means by which to ascertain which projections are useful in terms of discriminating data from the two classes. Since our model allows the use of distinct projections depending on the query point, it is expected that each projection would potentially benefit different areas of the feature space. A(π) refers to the area of the feature space where the projection π is selected. A(π) = {x ∈ X : πg(x) = π} The objective becomes min E(X ×Y) y = hg(x) (πg(x) (x)) M ∈Md = p(A(π))E y = hg(x) (πg(x) (x))|x ∈ A(π) min M ∈Md 2 π∈Π . The expected classification error over A(π) is linked to the conditional entropy of Y |X. Fano’s inequality provides a lower bound on the error while Feder and Merhav [4] derive a tight upper bound on the minimal error probability in terms of the entropy. This means that conditional entropy characterizes the potential of a subset of the feature space to separate data, which is more generic than simply quantifying classification accuracy for a specific discriminator. In view of this connection between classification accuracy and entropy, we adapt the objective to: p(A(π))H(Y |π(X); X ∈ A(π)) min M ∈Md (1) π∈Π The method we propose optimizes an empirical analog of (1) which we develop below and for which we will need the following result. Proposition 2.1. Given a continuous variable X ∈ X and a binary variable Y , where X is sampled from the mixture model f (x) = p(y = 0)f0 (x) + p(y = 1)f1 (x) = p0 f0 (x) + p1 f1 (x) , then H(Y |X) = −p0 log p0 − p1 log p1 − DKL (f0 ||f ) − DKL (f1 ||f ) . Next, we will use the nonparametric estimator presented in [13] for Tsallis α-divergence. Given samples Ui ∼ U, with i = 1, n and Vj ∼ V with j = 1, m, the divergence is estimated as follows: ˆ Tα (U ||V ) = 1 1 1−α n n (n − 1)νk (Ui , U \ ui )d mνk (Ui , V )d i=1 1−α B(k, α) − 1 , (2) where d is the dimensionality of the variables U and V and νk (z, Z) represents the distance from z ˆ to its k th nearest neighbor of the set of points Z. For α ≈ 1 and n → ∞, Tα (u||v) ≈ DKL (u||v). 2.1 Local estimators of entropy We will now plug (2) in the formula obtained by Proposition 2.1 to estimate the quantity (1). We use the notation X0 to represent the n0 samples from X which have the labels Y equal to 0, and X1 to represent the n1 samples from X which have the labels set to 1. Also, Xy(x) represents the set of samples that have labels equal to the label of x and X¬y(x) the data that have labels opposite to the label of x. ˆ ˆ x ˆ x H(Y |X; X ∈ A) = −H(p0 ) − H(p1 ) − T (f0 ||f x ) − T (f1 ||f x ) + C α≈1 ˆ H(Y |X; X ∈ A) ∝ 1 n0 + 1 n1 ∝ 1 n0 + 1 n1 ∝ 1 n n0 (n0 − 1)νk (xi , X0 \ xi )d nνk (xi , X \ xi )d 1−α I[xi ∈ A] (n1 − 1)νk (xi , X1 \ xi )d nνk (xi , X \ xi )d 1−α I[xi ∈ A] (n0 − 1)νk (xi , X0 \ xi )d nνk (xi , X1 \ xi )d 1−α I[xi ∈ A] (n1 − 1)νk (xi , X1 \ xi )d nνk (xi , X0 \ xi )d 1−α I[xi ∈ A] i=1 n1 i=1 n0 i=1 n1 i=1 n I[xi ∈ A] i=1 (n − 1)νk (xi , Xy(xi ) \ xi )d nνk (xi , X¬y(xi ) \ xi )d 1−α The estimator for the entropy of the data that is classified with projection π is as follows: ˆ H(Y |π(X); X ∈ A(π)) ∝ 1 n n I[xi ∈ A(π)] i=1 (n − 1)νk (π(xi ), π(Xy(xi ) ) \ π(xi ))d nνk (π(xi ), π(X¬y(xi ) \ xi ))d 1−α (3) From 3 and using the fact that I[xi ∈ A(π)] = I[πg(xi ) = π] for which we use the notation I[g(xi ) → π], we estimate the objective as min M ∈Md π∈Π 1 n n I[g(xi ) → π] i=1 (n − 1)νk (π(xi ), π(Xy(xi ) ) \ π(xi ))d nνk (π(xi ), π(X¬y(xi ) \ xi ))d 3 1−α (4) Therefore, the contribution of each data point to the objective corresponds to a distance ratio on the projection π ∗ where the class of the point is obtained with the highest confidence (data is separable in the neighborhood of the point). We start by computing the distance-based metric of each point on each projection of size up to d - there are d∗ such projections. This procedure yields an extended set of features Z, which we name local entropy estimates: Zij = νk (πj (xi ), πj (Xy(xi ) ) \ πj (xi )) νk (πj (xi ), πj (X¬y(xi ) ) \ πj (xi )) d(1−α) α≈1 j ∈ {1 . . . d∗ } (5) For each training data point, we compute the best distance ratio amid all the projections, which is simply Ti = minj∈[d∗ ] Zij . The objective can be then further rewritten as a function of the entropy estimates: n I[g(xi ) → πj ]Zij min M ∈Md (6) i=1 πj ∈Π From the definition of T, it is also clear that n n I[g(xi ) → πj ]Zij min M ∈Md 2.2 ≥ i=1 πj ∈Π Ti . (7) i=1 Projection selection as a combinatorial problem Considering form (6) of the objective, and given that the estimates Zij are constants, depending only on the training set, the projection retrieval problem is reduced to finding g for all training points, which will implicitly select the projection set of the model. Naturally, one might assume the bestperforming classification model is the one containing all the axis-aligned subspaces. This model achieves the lower bound (7) for the training set. However, the larger the set of projections, the more values the function g takes, and thus the problem of selecting the correct projection becomes more difficult. It becomes apparent that the number of projections should be somehow restricted to allow intepretability. Assuming a hard threshold of at most t projections, the optimization (6) becomes an entry selection problem over matrix Z where one value must be picked from each row under a limitation on the number of columns that can be used. This problem cannot be solved exactly in polynomial time. Instead, it can be formulated as an optimization problem under 1 constraints. 2.3 Projection retrieval through regularized regression To transform the projection retrieval to a regression problem we consider T, the minimum obtainable value of the entropy estimator for each point, as the output which the method needs to predict. Each row i of the parameter matrix B represents the degrees to which the entropy estimates on each projection contribute to the entropy estimator of point xi . Thus, the sum over each row of B is 1, and the regularization penalty applies to the number of non-zero columns in B. d∗ min ||T − (Z B B)J|Π|,1 ||2 2 +λ [Bi = 0] (8) i=1 subject to |Bk | 1 = 1 k = 1, n where (Z B)ij = Zij + Bij and J is a matrix of ones. The problem with this optimization is that it is not convex. A typical walk-around of this issue is to use the convex relaxation for Bi = 0, that is 1 norm. This would transform the penalized term d∗ d∗ n to i=1 |Bi | 1 . However, i=1 |Bi | 1 = k=1 |Bk | 1 = n , so this penalty really has no effect. An alternative mechanism to encourage the non-zero elements in B to populate a small number of columns is to add a penalty term in the form of Bδ, where δ is a d∗ -size column vector with each element representing the penalty for a column in B. With no prior information about which subspaces are more informative, δ starts as an all-1 vector. An initial value for B is obtained through the optimization (8). Since our goal is to handle data using a small number of projections, δ is then updated such that its value is lower for the denser columns in B. This update resembles the reweighing in adaptive lasso. The matrix B itself is updated, and this 2-step process continues until convergence of δ. Once δ converges, the projections corresponding to the non-zero columns of B are added to the model. The procedure is shown in Algorithm 1. 4 Algorithm 1: RECIP δ = [1 . . . 1] repeat |P I| b = arg minB ||T − i=1 < Z, B > ||2 + λ|Bδ| 1 2 subject to |Bk | 1 = 1 k = 1...n δk = |Bi | 1 i = . . . d∗ (update the differential penalty) δ δ = 1 − |δ| 1 until δ converges return Π = {πi ; |Bi | 1 > 0 ∀i = 1 . . . d∗ } 2.4 Lasso for projection selection We will compare our algorithm to lasso regularization that ranks the projections in terms of their potential for data separability. We write this as an 1 -penalized optimization on the extended feature set Z, with the objective T : minβ |T − Zβ|2 + λ|β| 1 . The lasso penalty to the coefficient vector encourages sparsity. For a high enough λ, the sparsity pattern in β is indicative of the usefulness of the projections. The lasso on entropy contributions was not found to perform well as it is not query specific and will find one projection for all data. We improved it by allowing it to iteratively find projections - this robust version offers increased performance by reweighting the data thus focusing on different subsets of it. Although better than running lasso on entropy contributions, the robust lasso does not match RECIP’s performance as the projections are selected gradually rather than jointly. Running the standard lasso on the original design matrix yields a set of relevant variables and it is not immediately clear how the solution would translate to the desired class. 2.5 The selection function Once the projections are selected, the second stage of the algorithm deals with assigning the projection with which to classify a particular query point. An immediate way of selecting the correct projection starts by computing the local entropy estimator for each subspace with each class assignment. Then, we may select the label/subspace combination that minimizes the empirical entropy. (i∗ , θ∗ ) = arg min i,θ 3 νk (πi (x), πi (Xθ )) νk (πi (x), πi (X¬θ )) dim(πi )(1−α) i = 1 . . . d∗ , α≈1 (9) Experimental results In this section we illustrate the capability of RECIP to retrieve informative projections of data and their use in support of interpreting results of classification. First, we analyze how well RECIP can identify subspaces in synthetic data whose distribution obeys the subspace separability assumption (3.1). As a point of reference, we also present classification accuracy results (3.2) for both the synthetic data and a few real-world sets. This is to quantify the extent of the trade-off between fidelity of attainable classifiers and desired informativeness of the projections chosen by RECIP. We expect RECIP’s classification performance to be slightly, but not substantially worse when compared to relevant classification algorithms trained to maximize classification accuracy. Finally, we present a few examples (3.3) of informative projections recovered from real-world data and their utility in explaining to human users the decision processes applied to query points. A set of artificial data used in our experiments contains q batches of data points, each of them made classifiable with high accuracy using one of available 2-dimensional subspaces (x1 , x2 ) with k ∈ k k {1 . . . q}. The data in batch k also have the property that x1 > tk . This is done such that the group a k point belongs to can be detected from x1 , thus x1 is a regulatory variable. We control the amount of k k noise added to thusly created synthetic data by varying the proportion of noisy data points in each batch. The results below are for datasets with 7 features each, with number of batches q ranging between 1 and 7. We kept the number of features specifically low in order to prevent excessive variation between any two sets generated this way, and to enable computing meaningful estimates of the expectation and variance of performance, while enabling creation of complicated data in which synthetic patterns may substantially overlap (using 7 features and 7 2-dimensional patterns implies that dimensions of at least 4 of the patterns will overlap). We implemented our method 5 to be scalable to the size and dimensionality of data and although for brevity we do not include a discussion of this topic here, we have successfully run RECIP against data with 100 features. The parameter α is a value close to 1, because the Tsallis divergence converges to the KL divergence as alpha approaches 1. For the experiments on real-world data, d was set to n (all projections were considered). For the artificial data experiments, we reported results for d = 2 as they do not change significantly for d >= 2 because this data was synthesized to contain bidimensional informative projections. In general, if d is too low, the correct full set of projections will not be found, but it may be recovered partially. If d is chosen too high, there is a risk that a given selected projection p will contain irrelevant features compared to the true projection p0 . However, this situation only occurs if the noise introduced by these features in the estimators makes the entropy contributions on p and p0 statistically indistinguishable for a large subset of data. The users will choose d according to the desired/acceptable complexity of the resulting model. If the results are to be visually interpreted by a human, values of 2 or 3 are reasonable for d. 3.1 Recovering informative projections Table 1 shows how well RECIP recovers the q subspaces corresponding to the synthesized batches of data. We measure precision (proportion of the recovered projections that are known to be informative), and recall (proportion of known informative projections that are recovered by the algorithm). in Table 1, rows correspond to the number of distinct synthetic batches injected in data, q, and subsequent columns correspond to the increasing amounts of noise in data. We note that the observed precision is nearly perfect: the algorithm makes only 2 mistakes over the entire set of experiments, and those occur for highly noisy setups. The recall is nearly perfect as long as there is little overlap among the dimensions, that is when the injections do not interfere with each other. As the number of projections increases, the chances for overlap among the affected features also increase, which makes the data more confusing resulting on a gradual drop of recall until only about 3 or 4 of the 7 known to be informative subspaces can be recovered. We have also used lasso as described in 2.4 in an attempt to recover projections. This setup only manages to recover one of the informative subspaces, regardless of how the regularization parameter is tuned. 3.2 Classification accuracy Table 2 shows the classification accuracy of RECIP, obtained using synthetic data. As expected, the observed performance is initially high when there are few known informative projections in data and it decreases as noise and ambiguity of the injected patterns increase. Most types of ensemble learners would use a voting scheme to arrive at the final classification of a testing sample, rather than use a model selection scheme. For this reason, we have also compared predictive accuracy revealed by RECIP against a method based on majority voting among multiple candidate subspaces. Table 4 shows that the accuracy of this technique is lower than the accuracy of RECIP, regardless of whether the informative projections are recovered by the algorithm or assumed to be known a priori. This confirms the intuition that a selection-based approach can be more effective than voting for data which satisfies the subspace separability assumption. For reference, we have also classified the synthetic data using K-Nearest-Neighbors algorithm using all available features at once. The results of that experiment are shown in Table 5. Since RECIP uses neighbor information, K-NN is conceptually the closest among the popular alternatives. Compared to RECIP, K-NN performs worse when there are fewer synthetic patterns injected in data to form informative projections. It is because some features used then by K-NN are noisy. As more features become informative, the K-NN accuracy improves. This example shows the benefit of a selective approach to feature space and using a subset of the most explanatory projections to support not only explanatory analyses but also classification tasks in such circumstances. 3.3 RECIP case studies using real-world data Table 3 summarizes the RECIP and K-NN performance on UCI datasets. We also test the methods using Cell dataset containing a set of measurements such as the area and perimeter biological cells with separate labels marking cells subjected to treatment and control cells. In Vowel data, the nearest-neighbor approach works exceptionally well, even outperforming random forests (0.94 accuracy), which is an indication that all features are jointly relevant. For d lower than the number of features, RECIP picks projections of only one feature, but if there is no such limitation, RECIP picks the space of all the features as informative. 6 Table 1: Projection recovery for artificial datasets with 1 . . . 7 informative features and noise level 0 . . . 0.2 in terms of mean and variance of Precision and Recall. Mean/var obtained for each setting by repeating the experiment with datasets with different informative projections. PRECISION 1 2 3 4 5 6 7 0 1 1 1 1 1 1 1 0.02 1 1 1 1 1 1 1 Mean 0.05 1 1 1 1 1 1 1 1 2 3 4 5 6 7 0 1 1 1 0.9643 0.7714 0.6429 0.6327 0.02 1 1 1 0.9643 0.7429 0.6905 0.5918 Mean 0.05 1 1 0.9524 0.9643 0.8286 0.6905 0.5918 0 0 0 0 0 0 0 0 0.02 0 0 0 0 0 0 0 Variance 0.05 0 0 0 0 0 0 0 0.1 0.0306 0 0 0 0 0 0 0.2 0.0306 0 0 0 0 0 0 0 0 0 0 0.0077 0.0163 0.0113 0.0225 0.02 0 0 0 0.0077 0.0196 0.0113 0.02 Variance 0.05 0 0 0.0136 0.0077 0.0049 0.0272 0.0258 0.1 0 0 0.0136 0.0077 0.0082 0.0113 0.0233 0.2 0 0 0 0.0128 0.0278 0.0113 0.02 0.1 0.0008 0.0001 0.0028 0.0025 0.0036 0.0025 0.0042 0.2 0.0007 0.0001 0.0007 0.0032 0.0044 0.0027 0.0045 0.1 0.0001 0.0001 0.0007 0.0014 0.0019 0.0023 0.0021 0.2 0.0000 0.0001 0.0007 0.0020 0.0023 0.0021 0.0020 0.1 0.9286 1 1 1 1 1 1 0.2 0.9286 1 1 1 1 1 1 RECALL 0.1 1 1 0.9524 0.9643 0.8571 0.6905 0.5714 0.2 1 1 1 0.9286 0.7714 0.6905 0.551 Table 2: RECIP Classification Accuracy on Artificial Data 1 2 3 4 5 6 7 0 0.9751 0.9333 0.9053 0.8725 0.8113 0.7655 0.7534 1 2 3 4 5 6 7 0 0.9751 0.9333 0.9053 0.8820 0.8714 0.8566 0.8429 CLASSIFICATION ACCURACY Mean Variance 0.02 0.05 0.1 0.2 0 0.02 0.05 0.9731 0.9686 0.9543 0.9420 0.0000 0.0000 0.0000 0.9297 0.9227 0.9067 0.8946 0.0001 0.0001 0.0001 0.8967 0.8764 0.8640 0.8618 0.0004 0.0005 0.0016 0.8685 0.8589 0.8454 0.8187 0.0020 0.0020 0.0019 0.8009 0.8105 0.8105 0.7782 0.0042 0.0044 0.0033 0.7739 0.7669 0.7632 0.7511 0.0025 0.0021 0.0026 0.7399 0.7347 0.7278 0.7205 0.0034 0.0040 0.0042 CLASSIFICATION ACCURACY - KNOWN PROJECTIONS Mean Variance 0.02 0.05 0.1 0.2 0 0.02 0.05 0.9731 0.9686 0.9637 0.9514 0.0000 0.0000 0.0000 0.9297 0.9227 0.9067 0.8946 0.0001 0.0001 0.0001 0.8967 0.8914 0.8777 0.8618 0.0004 0.0005 0.0005 0.8781 0.8657 0.8541 0.8331 0.0011 0.0011 0.0014 0.8641 0.8523 0.8429 0.8209 0.0015 0.0015 0.0018 0.8497 0.8377 0.8285 0.8074 0.0014 0.0015 0.0016 0.8371 0.8256 0.8122 0.7988 0.0015 0.0018 0.0018 Table 3: Accuracy of K-NN and RECIP Dataset KNN RECIP In Spam data, the two most informative projections are Breast Cancer Wis 0.8415 0.8275 ’Capital Length Total’ (CLT)/’Capital Length Longest’ Breast Tissue 1.0000 1.0000 (CLL) and CLT/’Frequency of word your’ (FWY). FigCell 0.7072 0.7640 ure 1 shows these two projections, with the dots repreMiniBOONE* 0.7896 0.7396 senting training points. The red dots represent points laSpam 0.7680 0.7680 Vowel 0.9839 0.9839 beled as spam while the blue ones are non-spam. The circles are query points that have been assigned to be classified with the projection in which they are plotted. The green circles are correctly classified points, while the magenta circles - far fewer - are the incorrectly classified ones. Not only does the importance of text in capital letters make sense for a spam filtering dataset, but the points that select those projections are almost flawlessly classified. Additionally, assuming the user would need to attest the validity of classification for the first plot, he/she would have no trouble seeing that the circled data points are located in a region predominantly populated with examples of spam, so any non-spam entry appears suspicious. Both of the magenta-colored cases fall into this category, and they can be therefore flagged for further investigation. 7 Informative Projection for the Spam dataset 2000 1500 1000 500 0 Informative Projection for the Spam dataset 12 Frequency of word ‘your‘ Capital Run Length Longest 2500 10 8 6 4 2 0 500 1000 1500 2000 2500 Capital Run Length Total 3000 0 3500 0 2000 4000 6000 8000 10000 12000 14000 Capital Run Length Total 16000 Figure 1: Spam Dataset Selected Subspaces Table 4: Classification accuracy using RECIP-learned projections - or known projections, in the lower section - within a voting model instead of a selection model 1 2 3 4 5 6 7 1 2 3 4 5 6 7 CLASSIFICATION ACCURACY - VOTING ENSEMBLE Mean Variance 0 0.02 0.05 0.1 0.2 0 0.02 0.05 0.1 0.9751 0.9731 0.9686 0.9317 0.9226 0.0000 0.0000 0.0000 0.0070 0.7360 0.7354 0.7331 0.7303 0.7257 0.0002 0.0002 0.0001 0.0002 0.7290 0.7266 0.7163 0.7166 0.7212 0.0002 0.0002 0.0008 0.0006 0.6934 0.6931 0.6932 0.6904 0.6867 0.0008 0.0008 0.0008 0.0008 0.6715 0.6602 0.6745 0.6688 0.6581 0.0013 0.0014 0.0013 0.0014 0.6410 0.6541 0.6460 0.6529 0.6512 0.0008 0.0007 0.0010 0.0006 0.6392 0.6342 0.6268 0.6251 0.6294 0.0009 0.0011 0.0012 0.0012 CLASSIFICATION ACCURACY - VOTING ENSEMBLE, KNOWN PROJECTIONS Mean Variance 0 0.02 0.05 0.1 0.2 0 0.02 0.05 0.1 0.9751 0.9731 0.9686 0.9637 0.9514 0.0000 0.0000 0.0000 0.0001 0.7360 0.7354 0.7331 0.7303 0.7257 0.0002 0.0002 0.0001 0.0002 0.7409 0.7385 0.7390 0.7353 0.7325 0.0010 0.0012 0.0010 0.0011 0.7110 0.7109 0.7083 0.7067 0.7035 0.0041 0.0041 0.0042 0.0042 0.7077 0.7070 0.7050 0.7034 0.7008 0.0015 0.0015 0.0015 0.0016 0.6816 0.6807 0.6801 0.6790 0.6747 0.0008 0.0008 0.0008 0.0008 0.6787 0.6783 0.6772 0.6767 0.6722 0.0008 0.0009 0.0009 0.0008 0.2 0.0053 0.0001 0.0002 0.0009 0.0013 0.0005 0.0012 0.2 0.0000 0.0001 0.0010 0.0043 0.0016 0.0009 0.0008 Table 5: Classification accuracy for artificial data with the K-Nearest Neighbors method CLASSIFICATION ACCURACY - KNN 1 2 3 4 5 6 7 4 0 0.7909 0.7940 0.7964 0.7990 0.8038 0.8043 0.8054 0.02 0.7843 0.7911 0.7939 0.7972 0.8024 0.8032 0.8044 Mean 0.05 0.7747 0.7861 0.7901 0.7942 0.8002 0.8015 0.8028 0.1 0.7652 0.7790 0.7854 0.7904 0.7970 0.7987 0.8004 0.2 0.7412 0.7655 0.7756 0.7828 0.7905 0.7930 0.7955 0 0.0002 0.0001 0.0000 0.0001 0.0001 0.0001 0.0001 0.02 0.0002 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 Variance 0.05 0.0002 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.1 0.0002 0.0001 0.0000 0.0001 0.0001 0.0001 0.0001 0.2 0.0002 0.0001 0.0000 0.0001 0.0001 0.0001 0.0001 Conclusion This paper considers the problem of Projection Recovery for Classification. It is relevant in applications where the decision process must be easy to understand in order to enable human interpretation of the results. We have developed a principled, regression-based algorithm designed to recover small sets of low-dimensional subspaces that support interpretability. It optimizes the selection using individual data-point-specific entropy estimators. In this context, the proposed algorithm follows the idea of transductive learning, and the role of the resulting projections bears resemblance to high confidence regions known in conformal prediction models. Empirical results obtained using simulated and real-world data show the effectiveness of our method in finding informative projections that enable accurate classification while maintaining transparency of the underlying decision process. Acknowledgments This material is based upon work supported by the NSF, under Grant No. IIS-0911032. 8 References [1] Mark W. Craven and Jude W. Shavlik. Extracting Tree-Structured Representations of Trained Networks. In David S. Touretzky, Michael C. Mozer, and Michael E. Hasselmo, editors, Advances in Neural Information Processing Systems, volume 8, pages 24–30. The MIT Press, 1996. [2] Pedro Domingos. Knowledge discovery via multiple models. Intelligent Data Analysis, 2:187–202, 1998. [3] Eulanda M. Dos Santos, Robert Sabourin, and Patrick Maupin. A dynamic overproduce-and-choose strategy for the selection of classifier ensembles. Pattern Recogn., 41:2993–3009, October 2008. [4] M. Feder and N. Merhav. Relations between entropy and error probability. Information Theory, IEEE Transactions on, 40(1):259–266, January 1994. [5] Jerome H. Friedman, Ron Kohavi, and Yeogirl Yun. Lazy decision trees, 1996. [6] A. Gammerman, V. Vovk, and V. Vapnik. Learning by transduction. In In Uncertainty in Artificial Intelligence, pages 148–155. Morgan Kaufmann, 1998. [7] Quanquan Gu, Zhenhui Li, and Jiawei Han. Joint feature selection and subspace learning, 2011. [8] Bing Liu, Minqing Hu, and Wynne Hsu. Intuitive representation of decision trees using general rules and exceptions. In Proceedings of Seventeeth National Conference on Artificial Intellgience (AAAI-2000), July 30 - Aug 3, 2000, pages 615–620, 2000. [9] Michael Mampaey, Nikolaj Tatti, and Jilles Vreeken. Tell me what i need to know: succinctly summarizing data with itemsets. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’11, pages 573–581, New York, NY, USA, 2011. ACM. [10] Mario Marchand and Marina Sokolova. Learning with decision lists of data-dependent features. JOURNAL OF MACHINE LEARNING REASEARCH, 6, 2005. [11] Guillaume Obozinski, Ben Taskar, and Michael I. Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20(2):231–252, April 2010. [12] Michael J. Pazzani, Subramani Mani, and W. Rodman Shankle. Beyond concise and colorful: Learning intelligible rules, 1997. [13] B. Poczos and J. Schneider. On the estimation of alpha-divergences. AISTATS, 2011. [14] Kai Ting, Jonathan Wells, Swee Tan, Shyh Teng, and Geoffrey Webb. Feature-subspace aggregating: ensembles for stable andunstable learners. Machine Learning, 82:375–397, 2011. 10.1007/s10994-0105224-5. 9
5 0.73444456 361 nips-2012-Volume Regularization for Binary Classification
Author: Koby Crammer, Tal Wagner
Abstract: We introduce a large-volume box classification for binary prediction, which maintains a subset of weight vectors, and specifically axis-aligned boxes. Our learning algorithm seeks for a box of large volume that contains “simple” weight vectors which most of are accurate on the training set. Two versions of the learning process are cast as convex optimization problems, and it is shown how to solve them efficiently. The formulation yields a natural PAC-Bayesian performance bound and it is shown to minimize a quantity directly aligned with it. The algorithm outperforms SVM and the recently proposed AROW algorithm on a majority of 30 NLP datasets and binarized USPS optical character recognition datasets. 1
6 0.72546178 227 nips-2012-Multiclass Learning with Simplex Coding
7 0.71908402 14 nips-2012-A P300 BCI for the Masses: Prior Information Enables Instant Unsupervised Spelling
8 0.70472515 28 nips-2012-A systematic approach to extracting semantic information from functional MRI data
9 0.69940764 130 nips-2012-Feature-aware Label Space Dimension Reduction for Multi-label Classification
10 0.69636887 67 nips-2012-Classification Calibration Dimension for General Multiclass Losses
11 0.66984785 197 nips-2012-Learning with Recursive Perceptual Representations
12 0.66923183 50 nips-2012-Bandit Algorithms boost Brain Computer Interfaces for motor-task selection of a brain-controlled button
13 0.64970809 48 nips-2012-Augmented-SVM: Automatic space partitioning for combining multiple non-linear dynamics
14 0.63459051 271 nips-2012-Pointwise Tracking the Optimal Regression Function
15 0.62076628 266 nips-2012-Patient Risk Stratification for Hospital-Associated C. diff as a Time-Series Classification Task
16 0.61886817 181 nips-2012-Learning Multiple Tasks using Shared Hypotheses
17 0.6150735 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models
18 0.60393149 97 nips-2012-Diffusion Decision Making for Adaptive k-Nearest Neighbor Classification
19 0.60216588 207 nips-2012-Mandatory Leaf Node Prediction in Hierarchical Multilabel Classification
20 0.59024316 230 nips-2012-Multiple Choice Learning: Learning to Produce Multiple Structured Outputs
topicId topicWeight
[(0, 0.075), (13, 0.149), (17, 0.013), (21, 0.02), (38, 0.161), (42, 0.07), (54, 0.016), (55, 0.029), (74, 0.049), (76, 0.163), (80, 0.137), (92, 0.041)]
simIndex simValue paperId paperTitle
1 0.90192837 125 nips-2012-Factoring nonnegative matrices with linear programs
Author: Ben Recht, Christopher Re, Joel Tropp, Victor Bittorf
Abstract: This paper describes a new approach, based on linear programming, for computing nonnegative matrix factorizations (NMFs). The key idea is a data-driven model for the factorization where the most salient features in the data are used to express the remaining features. More precisely, given a data matrix X, the algorithm identifies a matrix C that satisfies X ≈ CX and some linear constraints. The constraints are chosen to ensure that the matrix C selects features; these features can then be used to find a low-rank NMF of X. A theoretical analysis demonstrates that this approach has guarantees similar to those of the recent NMF algorithm of Arora et al. (2012). In contrast with this earlier work, the proposed method extends to more general noise models and leads to efficient, scalable algorithms. Experiments with synthetic and real datasets provide evidence that the new approach is also superior in practice. An optimized C++ implementation can factor a multigigabyte matrix in a matter of minutes. 1
same-paper 2 0.89852548 200 nips-2012-Local Supervised Learning through Space Partitioning
Author: Joseph Wang, Venkatesh Saligrama
Abstract: We develop a novel approach for supervised learning based on adaptively partitioning the feature space into different regions and learning local region-specific classifiers. We formulate an empirical risk minimization problem that incorporates both partitioning and classification in to a single global objective. We show that space partitioning can be equivalently reformulated as a supervised learning problem and consequently any discriminative learning method can be utilized in conjunction with our approach. Nevertheless, we consider locally linear schemes by learning linear partitions and linear region classifiers. Locally linear schemes can not only approximate complex decision boundaries and ensure low training error but also provide tight control on over-fitting and generalization error. We train locally linear classifiers by using LDA, logistic regression and perceptrons, and so our scheme is scalable to large data sizes and high-dimensions. We present experimental results demonstrating improved performance over state of the art classification techniques on benchmark datasets. We also show improved robustness to label noise.
3 0.89307213 361 nips-2012-Volume Regularization for Binary Classification
Author: Koby Crammer, Tal Wagner
Abstract: We introduce a large-volume box classification for binary prediction, which maintains a subset of weight vectors, and specifically axis-aligned boxes. Our learning algorithm seeks for a box of large volume that contains “simple” weight vectors which most of are accurate on the training set. Two versions of the learning process are cast as convex optimization problems, and it is shown how to solve them efficiently. The formulation yields a natural PAC-Bayesian performance bound and it is shown to minimize a quantity directly aligned with it. The algorithm outperforms SVM and the recently proposed AROW algorithm on a majority of 30 NLP datasets and binarized USPS optical character recognition datasets. 1
4 0.87185085 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model
Author: Yi Wu, David P. Wipf
Abstract: Sparse linear (or generalized linear) models combine a standard likelihood function with a sparse prior on the unknown coefficients. These priors can conveniently be expressed as a maximization over zero-mean Gaussians with different variance hyperparameters. Standard MAP estimation (Type I) involves maximizing over both the hyperparameters and coefficients, while an empirical Bayesian alternative (Type II) first marginalizes the coefficients and then maximizes over the hyperparameters, leading to a tractable posterior approximation. The underlying cost functions can be related via a dual-space framework from [22], which allows both the Type I or Type II objectives to be expressed in either coefficient or hyperparmeter space. This perspective is useful because some analyses or extensions are more conducive to development in one space or the other. Herein we consider the estimation of a trade-off parameter balancing sparsity and data fit. As this parameter is effectively a variance, natural estimators exist by assessing the problem in hyperparameter (variance) space, transitioning natural ideas from Type II to solve what is much less intuitive for Type I. In contrast, for analyses of update rules and sparsity properties of local and global solutions, as well as extensions to more general likelihood models, we can leverage coefficient-space techniques developed for Type I and apply them to Type II. For example, this allows us to prove that Type II-inspired techniques can be successful recovering sparse coefficients when unfavorable restricted isometry properties (RIP) lead to failure of popular ℓ1 reconstructions. It also facilitates the analysis of Type II when non-Gaussian likelihood models lead to intractable integrations. 1
5 0.87105429 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models
Author: Chong Wang, David M. Blei
Abstract: We present a truncation-free stochastic variational inference algorithm for Bayesian nonparametric models. While traditional variational inference algorithms require truncations for the model or the variational distribution, our method adapts model complexity on the fly. We studied our method with Dirichlet process mixture models and hierarchical Dirichlet process topic models on two large data sets. Our method performs better than previous stochastic variational inference algorithms. 1
6 0.87072784 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs
7 0.86997998 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models
8 0.86391848 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders
9 0.86364734 292 nips-2012-Regularized Off-Policy TD-Learning
10 0.86099321 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models
11 0.86003011 19 nips-2012-A Spectral Algorithm for Latent Dirichlet Allocation
12 0.85945106 65 nips-2012-Cardinality Restricted Boltzmann Machines
13 0.85936606 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration
14 0.85830772 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes
15 0.85809469 171 nips-2012-Latent Coincidence Analysis: A Hidden Variable Model for Distance Metric Learning
16 0.85770297 330 nips-2012-Supervised Learning with Similarity Functions
17 0.85721481 16 nips-2012-A Polynomial-time Form of Robust Regression
18 0.85714191 294 nips-2012-Repulsive Mixtures
19 0.8567971 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback
20 0.85578191 199 nips-2012-Link Prediction in Graphs with Autoregressive Features