nips nips2012 nips2012-279 knowledge-graph by maker-knowledge-mining

279 nips-2012-Projection Retrieval for Classification


Source: pdf

Author: Madalina Fiterau, Artur Dubrawski

Abstract: In many applications, classification systems often require human intervention in the loop. In such cases the decision process must be transparent and comprehensible, simultaneously requiring minimal assumptions on the underlying data distributions. To tackle this problem, we formulate an axis-aligned subspace-finding task under the assumption that query specific information dictates the complementary use of the subspaces. We develop a regression-based approach called RECIP that efficiently solves this problem by finding projections that minimize a nonparametric conditional entropy estimator. Experiments show that the method is accurate in identifying the informative projections of the dataset, picking the correct views to classify query points, and facilitates visual evaluation by users. 1 Introduction and problem statement In the domain of predictive analytics, many applications which keep human users in the loop require the use of simple classification models. Often, it is required that a test-point be ‘explained’ (classified) using a simple low-dimensional projection of the original feature space. This is a Projection Retrieval for Classification problem (PRC). The interaction with the user proceeds as follows: the user provides the system a query point; the system searches for a projection in which the point can be accurately classified; the system displays the classification result as well as an illustration of how the classification decision was reached in the selected projection. Solving the PRC problem is relevant in many practical applications. For instance, consider a nuclear threat detection system installed at a border check point. Vehicles crossing the border are scanned with sensors so that a large array of measurements of radioactivity and secondary contextual information is being collected. These observations are fed into a classification system that determines whether the scanned vehicle may carry a threat. Given the potentially devastating consequences of a false negative, a border control agent is requested to validate the prediction and decide whether to submit the vehicle for a costly further inspection. With the positive classification rate of the system under strict bounds because of limitations in the control process, the risk of false negatives is increased. Despite its crucial role, human intervention should only be withheld for cases in which there are reasons to doubt the validity of classification. In order for a user to attest the validity of a decision, the user must have a good understanding of the classification process, which happens more readily when the classifier only uses the original dataset features rather than combinations of them, and when the discrimination models are low-dimensional. In this context, we aim to learn a set of classifiers in low-dimensional subspaces and a decision function which selects the subspace under which a test point is to be classified. Assume we are given a dataset {(x1 , y1 ) . . . (xn , yn )} ∈ X n × {0, 1}n and a class of discriminators H. The model will contain a set Π of subspaces of X ; Π ⊆ Π, where Π is the set of all axis-aligned subspaces of the original feature space, the power set of the features. To each projection πi ∈ Π corresponds one discriminator from a given hypothesis space hi ∈ H. It will also contain a selection function g : X → Π × H, which yields, for a query point x, the projection/discriminator pair with which this point will be classified. The notation π(x) refers to the projection of the point x onto the subspace 1 π while h(π(x)) represents the predicted label for x. Formally, we describe the model class as Md = {Π = {π : π ∈ Π, dim(π) ≤ d}, H = {hi : hi ∈ H, h : πi → Y, ∀i = 1 . . . |Π|}, g ∈ {f : X → {1 . . . |Π|}} . where dim(π) presents the dimensionality of the subspace determined by the projection π. Note that only projections up to size d will be considered, where d is a parameter specific to the application. The set H contains one discriminator from the hypothesis class H for each projection. Intuitively, the aim is to minimize the expected classification error over Md , however, a notable modification is that the projection and, implicitly, the discriminator, are chosen according to the data point that needs to be classified. Given a query x in the space X , g(x) will yield the subspace πg(x) onto which the query is projected and the discriminator hg(x) for it. Distinct test points can be handled using different combinations of subspaces and discriminators. We consider models that minimize 0/1 loss. Hence, the PRC problem can be stated as follows: M ∗ = arg min EX ,Y y = hg(x) (πg(x) (x)) M ∈Md There are limitations to the type of selection function g that can be learned. A simple example for which g can be recovered is a set of signal readings x for which, if one of the readings xi exceeds a threshold ti , the label can be predicted just based on xi . A more complex one is a dataset containing regulatory variables, that is, for xi in the interval [ak , bk ] the label only depends on (x1 . . . xnk ) k k datasets that fall into the latter category fulfill what we call the Subspace-Separability Assumption. This paper proposes an algorithm called RECIP that solves the PRC problem for a class of nonparametric classifiers. We evaluate the method on artificial data to show that indeed it correctly identifies the underlying structure for data satisfying the Subspace-Separability Assumption. We show some case studies to illustrate how RECIP offers insight into applications requiring human intervention. The use of dimensionality reduction techniques is a common preprocessing step in applications where the use of simplified classification models is preferable. Methods that learn linear combinations of features, such as Linear Discriminant Analysis, are not quite appropriate for the task considered here, since we prefer to natively rely on the dimensions available in the original feature space. Feature selection methods, such as e.g. lasso, are suitable for identifying sets of relevant features, but do not consider interactions between them. Our work better fits the areas of class dependent feature selection and context specific classification, highly connected to the concept of Transductive Learning [6]. Other context-sensitive methods are Lazy and Data-Dependent Decision Trees, [5] and [10] respectively. In Ting et al [14], the Feating submodel selection relies on simple attribute splits followed by fitting local predictors, though the algorithm itself is substantially different. Obozinski et al present a subspace selection method in the context of multitask learning [11]. Go et al propose a joint method for feature selection and subspace learning [7], however, their classification model is not particularly query specific. Alternatively, algorithms that transform complex or unintelligible models with user-friendly equivalents have been proposed [3, 2, 1, 8]. Algorithms specifically designed to yield understandable models are a precious few. Here we note a rule learning method described in [12], even though the resulting rules can make visualization difficult, while itemset mining [9] is not specifically designed for classification. Unlike those approaches, our method is designed to retrieve subsets of the feature space designed for use in a way that is complementary to the basic task at hand (classification) while providing query-specific information. 2 Recovering informative projections with RECIP To solve PRC, we need means by which to ascertain which projections are useful in terms of discriminating data from the two classes. Since our model allows the use of distinct projections depending on the query point, it is expected that each projection would potentially benefit different areas of the feature space. A(π) refers to the area of the feature space where the projection π is selected. A(π) = {x ∈ X : πg(x) = π} The objective becomes min E(X ×Y) y = hg(x) (πg(x) (x)) M ∈Md = p(A(π))E y = hg(x) (πg(x) (x))|x ∈ A(π) min M ∈Md 2 π∈Π . The expected classification error over A(π) is linked to the conditional entropy of Y |X. Fano’s inequality provides a lower bound on the error while Feder and Merhav [4] derive a tight upper bound on the minimal error probability in terms of the entropy. This means that conditional entropy characterizes the potential of a subset of the feature space to separate data, which is more generic than simply quantifying classification accuracy for a specific discriminator. In view of this connection between classification accuracy and entropy, we adapt the objective to: p(A(π))H(Y |π(X); X ∈ A(π)) min M ∈Md (1) π∈Π The method we propose optimizes an empirical analog of (1) which we develop below and for which we will need the following result. Proposition 2.1. Given a continuous variable X ∈ X and a binary variable Y , where X is sampled from the mixture model f (x) = p(y = 0)f0 (x) + p(y = 1)f1 (x) = p0 f0 (x) + p1 f1 (x) , then H(Y |X) = −p0 log p0 − p1 log p1 − DKL (f0 ||f ) − DKL (f1 ||f ) . Next, we will use the nonparametric estimator presented in [13] for Tsallis α-divergence. Given samples Ui ∼ U, with i = 1, n and Vj ∼ V with j = 1, m, the divergence is estimated as follows: ˆ Tα (U ||V ) = 1 1 1−α n n (n − 1)νk (Ui , U \ ui )d mνk (Ui , V )d i=1 1−α B(k, α) − 1 , (2) where d is the dimensionality of the variables U and V and νk (z, Z) represents the distance from z ˆ to its k th nearest neighbor of the set of points Z. For α ≈ 1 and n → ∞, Tα (u||v) ≈ DKL (u||v). 2.1 Local estimators of entropy We will now plug (2) in the formula obtained by Proposition 2.1 to estimate the quantity (1). We use the notation X0 to represent the n0 samples from X which have the labels Y equal to 0, and X1 to represent the n1 samples from X which have the labels set to 1. Also, Xy(x) represents the set of samples that have labels equal to the label of x and X¬y(x) the data that have labels opposite to the label of x. ˆ ˆ x ˆ x H(Y |X; X ∈ A) = −H(p0 ) − H(p1 ) − T (f0 ||f x ) − T (f1 ||f x ) + C α≈1 ˆ H(Y |X; X ∈ A) ∝ 1 n0 + 1 n1 ∝ 1 n0 + 1 n1 ∝ 1 n n0 (n0 − 1)νk (xi , X0 \ xi )d nνk (xi , X \ xi )d 1−α I[xi ∈ A] (n1 − 1)νk (xi , X1 \ xi )d nνk (xi , X \ xi )d 1−α I[xi ∈ A] (n0 − 1)νk (xi , X0 \ xi )d nνk (xi , X1 \ xi )d 1−α I[xi ∈ A] (n1 − 1)νk (xi , X1 \ xi )d nνk (xi , X0 \ xi )d 1−α I[xi ∈ A] i=1 n1 i=1 n0 i=1 n1 i=1 n I[xi ∈ A] i=1 (n − 1)νk (xi , Xy(xi ) \ xi )d nνk (xi , X¬y(xi ) \ xi )d 1−α The estimator for the entropy of the data that is classified with projection π is as follows: ˆ H(Y |π(X); X ∈ A(π)) ∝ 1 n n I[xi ∈ A(π)] i=1 (n − 1)νk (π(xi ), π(Xy(xi ) ) \ π(xi ))d nνk (π(xi ), π(X¬y(xi ) \ xi ))d 1−α (3) From 3 and using the fact that I[xi ∈ A(π)] = I[πg(xi ) = π] for which we use the notation I[g(xi ) → π], we estimate the objective as min M ∈Md π∈Π 1 n n I[g(xi ) → π] i=1 (n − 1)νk (π(xi ), π(Xy(xi ) ) \ π(xi ))d nνk (π(xi ), π(X¬y(xi ) \ xi ))d 3 1−α (4) Therefore, the contribution of each data point to the objective corresponds to a distance ratio on the projection π ∗ where the class of the point is obtained with the highest confidence (data is separable in the neighborhood of the point). We start by computing the distance-based metric of each point on each projection of size up to d - there are d∗ such projections. This procedure yields an extended set of features Z, which we name local entropy estimates: Zij = νk (πj (xi ), πj (Xy(xi ) ) \ πj (xi )) νk (πj (xi ), πj (X¬y(xi ) ) \ πj (xi )) d(1−α) α≈1 j ∈ {1 . . . d∗ } (5) For each training data point, we compute the best distance ratio amid all the projections, which is simply Ti = minj∈[d∗ ] Zij . The objective can be then further rewritten as a function of the entropy estimates: n I[g(xi ) → πj ]Zij min M ∈Md (6) i=1 πj ∈Π From the definition of T, it is also clear that n n I[g(xi ) → πj ]Zij min M ∈Md 2.2 ≥ i=1 πj ∈Π Ti . (7) i=1 Projection selection as a combinatorial problem Considering form (6) of the objective, and given that the estimates Zij are constants, depending only on the training set, the projection retrieval problem is reduced to finding g for all training points, which will implicitly select the projection set of the model. Naturally, one might assume the bestperforming classification model is the one containing all the axis-aligned subspaces. This model achieves the lower bound (7) for the training set. However, the larger the set of projections, the more values the function g takes, and thus the problem of selecting the correct projection becomes more difficult. It becomes apparent that the number of projections should be somehow restricted to allow intepretability. Assuming a hard threshold of at most t projections, the optimization (6) becomes an entry selection problem over matrix Z where one value must be picked from each row under a limitation on the number of columns that can be used. This problem cannot be solved exactly in polynomial time. Instead, it can be formulated as an optimization problem under 1 constraints. 2.3 Projection retrieval through regularized regression To transform the projection retrieval to a regression problem we consider T, the minimum obtainable value of the entropy estimator for each point, as the output which the method needs to predict. Each row i of the parameter matrix B represents the degrees to which the entropy estimates on each projection contribute to the entropy estimator of point xi . Thus, the sum over each row of B is 1, and the regularization penalty applies to the number of non-zero columns in B. d∗ min ||T − (Z B B)J|Π|,1 ||2 2 +λ [Bi = 0] (8) i=1 subject to |Bk | 1 = 1 k = 1, n where (Z B)ij = Zij + Bij and J is a matrix of ones. The problem with this optimization is that it is not convex. A typical walk-around of this issue is to use the convex relaxation for Bi = 0, that is 1 norm. This would transform the penalized term d∗ d∗ n to i=1 |Bi | 1 . However, i=1 |Bi | 1 = k=1 |Bk | 1 = n , so this penalty really has no effect. An alternative mechanism to encourage the non-zero elements in B to populate a small number of columns is to add a penalty term in the form of Bδ, where δ is a d∗ -size column vector with each element representing the penalty for a column in B. With no prior information about which subspaces are more informative, δ starts as an all-1 vector. An initial value for B is obtained through the optimization (8). Since our goal is to handle data using a small number of projections, δ is then updated such that its value is lower for the denser columns in B. This update resembles the reweighing in adaptive lasso. The matrix B itself is updated, and this 2-step process continues until convergence of δ. Once δ converges, the projections corresponding to the non-zero columns of B are added to the model. The procedure is shown in Algorithm 1. 4 Algorithm 1: RECIP δ = [1 . . . 1] repeat |P I| b = arg minB ||T − i=1 < Z, B > ||2 + λ|Bδ| 1 2 subject to |Bk | 1 = 1 k = 1...n δk = |Bi | 1 i = . . . d∗ (update the differential penalty) δ δ = 1 − |δ| 1 until δ converges return Π = {πi ; |Bi | 1 > 0 ∀i = 1 . . . d∗ } 2.4 Lasso for projection selection We will compare our algorithm to lasso regularization that ranks the projections in terms of their potential for data separability. We write this as an 1 -penalized optimization on the extended feature set Z, with the objective T : minβ |T − Zβ|2 + λ|β| 1 . The lasso penalty to the coefficient vector encourages sparsity. For a high enough λ, the sparsity pattern in β is indicative of the usefulness of the projections. The lasso on entropy contributions was not found to perform well as it is not query specific and will find one projection for all data. We improved it by allowing it to iteratively find projections - this robust version offers increased performance by reweighting the data thus focusing on different subsets of it. Although better than running lasso on entropy contributions, the robust lasso does not match RECIP’s performance as the projections are selected gradually rather than jointly. Running the standard lasso on the original design matrix yields a set of relevant variables and it is not immediately clear how the solution would translate to the desired class. 2.5 The selection function Once the projections are selected, the second stage of the algorithm deals with assigning the projection with which to classify a particular query point. An immediate way of selecting the correct projection starts by computing the local entropy estimator for each subspace with each class assignment. Then, we may select the label/subspace combination that minimizes the empirical entropy. (i∗ , θ∗ ) = arg min i,θ 3 νk (πi (x), πi (Xθ )) νk (πi (x), πi (X¬θ )) dim(πi )(1−α) i = 1 . . . d∗ , α≈1 (9) Experimental results In this section we illustrate the capability of RECIP to retrieve informative projections of data and their use in support of interpreting results of classification. First, we analyze how well RECIP can identify subspaces in synthetic data whose distribution obeys the subspace separability assumption (3.1). As a point of reference, we also present classification accuracy results (3.2) for both the synthetic data and a few real-world sets. This is to quantify the extent of the trade-off between fidelity of attainable classifiers and desired informativeness of the projections chosen by RECIP. We expect RECIP’s classification performance to be slightly, but not substantially worse when compared to relevant classification algorithms trained to maximize classification accuracy. Finally, we present a few examples (3.3) of informative projections recovered from real-world data and their utility in explaining to human users the decision processes applied to query points. A set of artificial data used in our experiments contains q batches of data points, each of them made classifiable with high accuracy using one of available 2-dimensional subspaces (x1 , x2 ) with k ∈ k k {1 . . . q}. The data in batch k also have the property that x1 > tk . This is done such that the group a k point belongs to can be detected from x1 , thus x1 is a regulatory variable. We control the amount of k k noise added to thusly created synthetic data by varying the proportion of noisy data points in each batch. The results below are for datasets with 7 features each, with number of batches q ranging between 1 and 7. We kept the number of features specifically low in order to prevent excessive variation between any two sets generated this way, and to enable computing meaningful estimates of the expectation and variance of performance, while enabling creation of complicated data in which synthetic patterns may substantially overlap (using 7 features and 7 2-dimensional patterns implies that dimensions of at least 4 of the patterns will overlap). We implemented our method 5 to be scalable to the size and dimensionality of data and although for brevity we do not include a discussion of this topic here, we have successfully run RECIP against data with 100 features. The parameter α is a value close to 1, because the Tsallis divergence converges to the KL divergence as alpha approaches 1. For the experiments on real-world data, d was set to n (all projections were considered). For the artificial data experiments, we reported results for d = 2 as they do not change significantly for d >= 2 because this data was synthesized to contain bidimensional informative projections. In general, if d is too low, the correct full set of projections will not be found, but it may be recovered partially. If d is chosen too high, there is a risk that a given selected projection p will contain irrelevant features compared to the true projection p0 . However, this situation only occurs if the noise introduced by these features in the estimators makes the entropy contributions on p and p0 statistically indistinguishable for a large subset of data. The users will choose d according to the desired/acceptable complexity of the resulting model. If the results are to be visually interpreted by a human, values of 2 or 3 are reasonable for d. 3.1 Recovering informative projections Table 1 shows how well RECIP recovers the q subspaces corresponding to the synthesized batches of data. We measure precision (proportion of the recovered projections that are known to be informative), and recall (proportion of known informative projections that are recovered by the algorithm). in Table 1, rows correspond to the number of distinct synthetic batches injected in data, q, and subsequent columns correspond to the increasing amounts of noise in data. We note that the observed precision is nearly perfect: the algorithm makes only 2 mistakes over the entire set of experiments, and those occur for highly noisy setups. The recall is nearly perfect as long as there is little overlap among the dimensions, that is when the injections do not interfere with each other. As the number of projections increases, the chances for overlap among the affected features also increase, which makes the data more confusing resulting on a gradual drop of recall until only about 3 or 4 of the 7 known to be informative subspaces can be recovered. We have also used lasso as described in 2.4 in an attempt to recover projections. This setup only manages to recover one of the informative subspaces, regardless of how the regularization parameter is tuned. 3.2 Classification accuracy Table 2 shows the classification accuracy of RECIP, obtained using synthetic data. As expected, the observed performance is initially high when there are few known informative projections in data and it decreases as noise and ambiguity of the injected patterns increase. Most types of ensemble learners would use a voting scheme to arrive at the final classification of a testing sample, rather than use a model selection scheme. For this reason, we have also compared predictive accuracy revealed by RECIP against a method based on majority voting among multiple candidate subspaces. Table 4 shows that the accuracy of this technique is lower than the accuracy of RECIP, regardless of whether the informative projections are recovered by the algorithm or assumed to be known a priori. This confirms the intuition that a selection-based approach can be more effective than voting for data which satisfies the subspace separability assumption. For reference, we have also classified the synthetic data using K-Nearest-Neighbors algorithm using all available features at once. The results of that experiment are shown in Table 5. Since RECIP uses neighbor information, K-NN is conceptually the closest among the popular alternatives. Compared to RECIP, K-NN performs worse when there are fewer synthetic patterns injected in data to form informative projections. It is because some features used then by K-NN are noisy. As more features become informative, the K-NN accuracy improves. This example shows the benefit of a selective approach to feature space and using a subset of the most explanatory projections to support not only explanatory analyses but also classification tasks in such circumstances. 3.3 RECIP case studies using real-world data Table 3 summarizes the RECIP and K-NN performance on UCI datasets. We also test the methods using Cell dataset containing a set of measurements such as the area and perimeter biological cells with separate labels marking cells subjected to treatment and control cells. In Vowel data, the nearest-neighbor approach works exceptionally well, even outperforming random forests (0.94 accuracy), which is an indication that all features are jointly relevant. For d lower than the number of features, RECIP picks projections of only one feature, but if there is no such limitation, RECIP picks the space of all the features as informative. 6 Table 1: Projection recovery for artificial datasets with 1 . . . 7 informative features and noise level 0 . . . 0.2 in terms of mean and variance of Precision and Recall. Mean/var obtained for each setting by repeating the experiment with datasets with different informative projections. PRECISION 1 2 3 4 5 6 7 0 1 1 1 1 1 1 1 0.02 1 1 1 1 1 1 1 Mean 0.05 1 1 1 1 1 1 1 1 2 3 4 5 6 7 0 1 1 1 0.9643 0.7714 0.6429 0.6327 0.02 1 1 1 0.9643 0.7429 0.6905 0.5918 Mean 0.05 1 1 0.9524 0.9643 0.8286 0.6905 0.5918 0 0 0 0 0 0 0 0 0.02 0 0 0 0 0 0 0 Variance 0.05 0 0 0 0 0 0 0 0.1 0.0306 0 0 0 0 0 0 0.2 0.0306 0 0 0 0 0 0 0 0 0 0 0.0077 0.0163 0.0113 0.0225 0.02 0 0 0 0.0077 0.0196 0.0113 0.02 Variance 0.05 0 0 0.0136 0.0077 0.0049 0.0272 0.0258 0.1 0 0 0.0136 0.0077 0.0082 0.0113 0.0233 0.2 0 0 0 0.0128 0.0278 0.0113 0.02 0.1 0.0008 0.0001 0.0028 0.0025 0.0036 0.0025 0.0042 0.2 0.0007 0.0001 0.0007 0.0032 0.0044 0.0027 0.0045 0.1 0.0001 0.0001 0.0007 0.0014 0.0019 0.0023 0.0021 0.2 0.0000 0.0001 0.0007 0.0020 0.0023 0.0021 0.0020 0.1 0.9286 1 1 1 1 1 1 0.2 0.9286 1 1 1 1 1 1 RECALL 0.1 1 1 0.9524 0.9643 0.8571 0.6905 0.5714 0.2 1 1 1 0.9286 0.7714 0.6905 0.551 Table 2: RECIP Classification Accuracy on Artificial Data 1 2 3 4 5 6 7 0 0.9751 0.9333 0.9053 0.8725 0.8113 0.7655 0.7534 1 2 3 4 5 6 7 0 0.9751 0.9333 0.9053 0.8820 0.8714 0.8566 0.8429 CLASSIFICATION ACCURACY Mean Variance 0.02 0.05 0.1 0.2 0 0.02 0.05 0.9731 0.9686 0.9543 0.9420 0.0000 0.0000 0.0000 0.9297 0.9227 0.9067 0.8946 0.0001 0.0001 0.0001 0.8967 0.8764 0.8640 0.8618 0.0004 0.0005 0.0016 0.8685 0.8589 0.8454 0.8187 0.0020 0.0020 0.0019 0.8009 0.8105 0.8105 0.7782 0.0042 0.0044 0.0033 0.7739 0.7669 0.7632 0.7511 0.0025 0.0021 0.0026 0.7399 0.7347 0.7278 0.7205 0.0034 0.0040 0.0042 CLASSIFICATION ACCURACY - KNOWN PROJECTIONS Mean Variance 0.02 0.05 0.1 0.2 0 0.02 0.05 0.9731 0.9686 0.9637 0.9514 0.0000 0.0000 0.0000 0.9297 0.9227 0.9067 0.8946 0.0001 0.0001 0.0001 0.8967 0.8914 0.8777 0.8618 0.0004 0.0005 0.0005 0.8781 0.8657 0.8541 0.8331 0.0011 0.0011 0.0014 0.8641 0.8523 0.8429 0.8209 0.0015 0.0015 0.0018 0.8497 0.8377 0.8285 0.8074 0.0014 0.0015 0.0016 0.8371 0.8256 0.8122 0.7988 0.0015 0.0018 0.0018 Table 3: Accuracy of K-NN and RECIP Dataset KNN RECIP In Spam data, the two most informative projections are Breast Cancer Wis 0.8415 0.8275 ’Capital Length Total’ (CLT)/’Capital Length Longest’ Breast Tissue 1.0000 1.0000 (CLL) and CLT/’Frequency of word your’ (FWY). FigCell 0.7072 0.7640 ure 1 shows these two projections, with the dots repreMiniBOONE* 0.7896 0.7396 senting training points. The red dots represent points laSpam 0.7680 0.7680 Vowel 0.9839 0.9839 beled as spam while the blue ones are non-spam. The circles are query points that have been assigned to be classified with the projection in which they are plotted. The green circles are correctly classified points, while the magenta circles - far fewer - are the incorrectly classified ones. Not only does the importance of text in capital letters make sense for a spam filtering dataset, but the points that select those projections are almost flawlessly classified. Additionally, assuming the user would need to attest the validity of classification for the first plot, he/she would have no trouble seeing that the circled data points are located in a region predominantly populated with examples of spam, so any non-spam entry appears suspicious. Both of the magenta-colored cases fall into this category, and they can be therefore flagged for further investigation. 7 Informative Projection for the Spam dataset 2000 1500 1000 500 0 Informative Projection for the Spam dataset 12 Frequency of word ‘your‘ Capital Run Length Longest 2500 10 8 6 4 2 0 500 1000 1500 2000 2500 Capital Run Length Total 3000 0 3500 0 2000 4000 6000 8000 10000 12000 14000 Capital Run Length Total 16000 Figure 1: Spam Dataset Selected Subspaces Table 4: Classification accuracy using RECIP-learned projections - or known projections, in the lower section - within a voting model instead of a selection model 1 2 3 4 5 6 7 1 2 3 4 5 6 7 CLASSIFICATION ACCURACY - VOTING ENSEMBLE Mean Variance 0 0.02 0.05 0.1 0.2 0 0.02 0.05 0.1 0.9751 0.9731 0.9686 0.9317 0.9226 0.0000 0.0000 0.0000 0.0070 0.7360 0.7354 0.7331 0.7303 0.7257 0.0002 0.0002 0.0001 0.0002 0.7290 0.7266 0.7163 0.7166 0.7212 0.0002 0.0002 0.0008 0.0006 0.6934 0.6931 0.6932 0.6904 0.6867 0.0008 0.0008 0.0008 0.0008 0.6715 0.6602 0.6745 0.6688 0.6581 0.0013 0.0014 0.0013 0.0014 0.6410 0.6541 0.6460 0.6529 0.6512 0.0008 0.0007 0.0010 0.0006 0.6392 0.6342 0.6268 0.6251 0.6294 0.0009 0.0011 0.0012 0.0012 CLASSIFICATION ACCURACY - VOTING ENSEMBLE, KNOWN PROJECTIONS Mean Variance 0 0.02 0.05 0.1 0.2 0 0.02 0.05 0.1 0.9751 0.9731 0.9686 0.9637 0.9514 0.0000 0.0000 0.0000 0.0001 0.7360 0.7354 0.7331 0.7303 0.7257 0.0002 0.0002 0.0001 0.0002 0.7409 0.7385 0.7390 0.7353 0.7325 0.0010 0.0012 0.0010 0.0011 0.7110 0.7109 0.7083 0.7067 0.7035 0.0041 0.0041 0.0042 0.0042 0.7077 0.7070 0.7050 0.7034 0.7008 0.0015 0.0015 0.0015 0.0016 0.6816 0.6807 0.6801 0.6790 0.6747 0.0008 0.0008 0.0008 0.0008 0.6787 0.6783 0.6772 0.6767 0.6722 0.0008 0.0009 0.0009 0.0008 0.2 0.0053 0.0001 0.0002 0.0009 0.0013 0.0005 0.0012 0.2 0.0000 0.0001 0.0010 0.0043 0.0016 0.0009 0.0008 Table 5: Classification accuracy for artificial data with the K-Nearest Neighbors method CLASSIFICATION ACCURACY - KNN 1 2 3 4 5 6 7 4 0 0.7909 0.7940 0.7964 0.7990 0.8038 0.8043 0.8054 0.02 0.7843 0.7911 0.7939 0.7972 0.8024 0.8032 0.8044 Mean 0.05 0.7747 0.7861 0.7901 0.7942 0.8002 0.8015 0.8028 0.1 0.7652 0.7790 0.7854 0.7904 0.7970 0.7987 0.8004 0.2 0.7412 0.7655 0.7756 0.7828 0.7905 0.7930 0.7955 0 0.0002 0.0001 0.0000 0.0001 0.0001 0.0001 0.0001 0.02 0.0002 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 Variance 0.05 0.0002 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.1 0.0002 0.0001 0.0000 0.0001 0.0001 0.0001 0.0001 0.2 0.0002 0.0001 0.0000 0.0001 0.0001 0.0001 0.0001 Conclusion This paper considers the problem of Projection Recovery for Classification. It is relevant in applications where the decision process must be easy to understand in order to enable human interpretation of the results. We have developed a principled, regression-based algorithm designed to recover small sets of low-dimensional subspaces that support interpretability. It optimizes the selection using individual data-point-specific entropy estimators. In this context, the proposed algorithm follows the idea of transductive learning, and the role of the resulting projections bears resemblance to high confidence regions known in conformal prediction models. Empirical results obtained using simulated and real-world data show the effectiveness of our method in finding informative projections that enable accurate classification while maintaining transparency of the underlying decision process. Acknowledgments This material is based upon work supported by the NSF, under Grant No. IIS-0911032. 8 References [1] Mark W. Craven and Jude W. Shavlik. Extracting Tree-Structured Representations of Trained Networks. In David S. Touretzky, Michael C. Mozer, and Michael E. Hasselmo, editors, Advances in Neural Information Processing Systems, volume 8, pages 24–30. The MIT Press, 1996. [2] Pedro Domingos. Knowledge discovery via multiple models. Intelligent Data Analysis, 2:187–202, 1998. [3] Eulanda M. Dos Santos, Robert Sabourin, and Patrick Maupin. A dynamic overproduce-and-choose strategy for the selection of classifier ensembles. Pattern Recogn., 41:2993–3009, October 2008. [4] M. Feder and N. Merhav. Relations between entropy and error probability. Information Theory, IEEE Transactions on, 40(1):259–266, January 1994. [5] Jerome H. Friedman, Ron Kohavi, and Yeogirl Yun. Lazy decision trees, 1996. [6] A. Gammerman, V. Vovk, and V. Vapnik. Learning by transduction. In In Uncertainty in Artificial Intelligence, pages 148–155. Morgan Kaufmann, 1998. [7] Quanquan Gu, Zhenhui Li, and Jiawei Han. Joint feature selection and subspace learning, 2011. [8] Bing Liu, Minqing Hu, and Wynne Hsu. Intuitive representation of decision trees using general rules and exceptions. In Proceedings of Seventeeth National Conference on Artificial Intellgience (AAAI-2000), July 30 - Aug 3, 2000, pages 615–620, 2000. [9] Michael Mampaey, Nikolaj Tatti, and Jilles Vreeken. Tell me what i need to know: succinctly summarizing data with itemsets. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’11, pages 573–581, New York, NY, USA, 2011. ACM. [10] Mario Marchand and Marina Sokolova. Learning with decision lists of data-dependent features. JOURNAL OF MACHINE LEARNING REASEARCH, 6, 2005. [11] Guillaume Obozinski, Ben Taskar, and Michael I. Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20(2):231–252, April 2010. [12] Michael J. Pazzani, Subramani Mani, and W. Rodman Shankle. Beyond concise and colorful: Learning intelligible rules, 1997. [13] B. Poczos and J. Schneider. On the estimation of alpha-divergences. AISTATS, 2011. [14] Kai Ting, Jonathan Wells, Swee Tan, Shyh Teng, and Geoffrey Webb. Feature-subspace aggregating: ensembles for stable andunstable learners. Machine Learning, 82:375–397, 2011. 10.1007/s10994-0105224-5. 9

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract In many applications, classification systems often require human intervention in the loop. [sent-5, score-0.074]

2 To tackle this problem, we formulate an axis-aligned subspace-finding task under the assumption that query specific information dictates the complementary use of the subspaces. [sent-7, score-0.11]

3 We develop a regression-based approach called RECIP that efficiently solves this problem by finding projections that minimize a nonparametric conditional entropy estimator. [sent-8, score-0.46]

4 Experiments show that the method is accurate in identifying the informative projections of the dataset, picking the correct views to classify query points, and facilitates visual evaluation by users. [sent-9, score-0.632]

5 1 Introduction and problem statement In the domain of predictive analytics, many applications which keep human users in the loop require the use of simple classification models. [sent-10, score-0.072]

6 Often, it is required that a test-point be ‘explained’ (classified) using a simple low-dimensional projection of the original feature space. [sent-11, score-0.239]

7 For instance, consider a nuclear threat detection system installed at a border check point. [sent-15, score-0.079]

8 Vehicles crossing the border are scanned with sensors so that a large array of measurements of radioactivity and secondary contextual information is being collected. [sent-16, score-0.094]

9 These observations are fed into a classification system that determines whether the scanned vehicle may carry a threat. [sent-17, score-0.105]

10 Given the potentially devastating consequences of a false negative, a border control agent is requested to validate the prediction and decide whether to submit the vehicle for a costly further inspection. [sent-18, score-0.085]

11 Despite its crucial role, human intervention should only be withheld for cases in which there are reasons to doubt the validity of classification. [sent-20, score-0.121]

12 In this context, we aim to learn a set of classifiers in low-dimensional subspaces and a decision function which selects the subspace under which a test point is to be classified. [sent-22, score-0.342]

13 The model will contain a set Π of subspaces of X ; Π ⊆ Π, where Π is the set of all axis-aligned subspaces of the original feature space, the power set of the features. [sent-27, score-0.345]

14 To each projection πi ∈ Π corresponds one discriminator from a given hypothesis space hi ∈ H. [sent-28, score-0.342]

15 It will also contain a selection function g : X → Π × H, which yields, for a query point x, the projection/discriminator pair with which this point will be classified. [sent-29, score-0.236]

16 The notation π(x) refers to the projection of the point x onto the subspace 1 π while h(π(x)) represents the predicted label for x. [sent-30, score-0.35]

17 where dim(π) presents the dimensionality of the subspace determined by the projection π. [sent-38, score-0.295]

18 Note that only projections up to size d will be considered, where d is a parameter specific to the application. [sent-39, score-0.334]

19 The set H contains one discriminator from the hypothesis class H for each projection. [sent-40, score-0.105]

20 Intuitively, the aim is to minimize the expected classification error over Md , however, a notable modification is that the projection and, implicitly, the discriminator, are chosen according to the data point that needs to be classified. [sent-41, score-0.228]

21 Given a query x in the space X , g(x) will yield the subspace πg(x) onto which the query is projected and the discriminator hg(x) for it. [sent-42, score-0.42]

22 Distinct test points can be handled using different combinations of subspaces and discriminators. [sent-43, score-0.207]

23 Hence, the PRC problem can be stated as follows: M ∗ = arg min EX ,Y y = hg(x) (πg(x) (x)) M ∈Md There are limitations to the type of selection function g that can be learned. [sent-45, score-0.098]

24 A simple example for which g can be recovered is a set of signal readings x for which, if one of the readings xi exceeds a threshold ti , the label can be predicted just based on xi . [sent-46, score-0.496]

25 A more complex one is a dataset containing regulatory variables, that is, for xi in the interval [ak , bk ] the label only depends on (x1 . [sent-47, score-0.296]

26 Our work better fits the areas of class dependent feature selection and context specific classification, highly connected to the concept of Transductive Learning [6]. [sent-59, score-0.109]

27 In Ting et al [14], the Feating submodel selection relies on simple attribute splits followed by fitting local predictors, though the algorithm itself is substantially different. [sent-61, score-0.128]

28 Obozinski et al present a subspace selection method in the context of multitask learning [11]. [sent-62, score-0.196]

29 Go et al propose a joint method for feature selection and subspace learning [7], however, their classification model is not particularly query specific. [sent-63, score-0.345]

30 Unlike those approaches, our method is designed to retrieve subsets of the feature space designed for use in a way that is complementary to the basic task at hand (classification) while providing query-specific information. [sent-67, score-0.127]

31 2 Recovering informative projections with RECIP To solve PRC, we need means by which to ascertain which projections are useful in terms of discriminating data from the two classes. [sent-68, score-0.856]

32 Since our model allows the use of distinct projections depending on the query point, it is expected that each projection would potentially benefit different areas of the feature space. [sent-69, score-0.683]

33 A(π) refers to the area of the feature space where the projection π is selected. [sent-70, score-0.239]

34 The expected classification error over A(π) is linked to the conditional entropy of Y |X. [sent-72, score-0.126]

35 This means that conditional entropy characterizes the potential of a subset of the feature space to separate data, which is more generic than simply quantifying classification accuracy for a specific discriminator. [sent-74, score-0.233]

36 In view of this connection between classification accuracy and entropy, we adapt the objective to: p(A(π))H(Y |π(X); X ∈ A(π)) min M ∈Md (1) π∈Π The method we propose optimizes an empirical analog of (1) which we develop below and for which we will need the following result. [sent-75, score-0.096]

37 1 Local estimators of entropy We will now plug (2) in the formula obtained by Proposition 2. [sent-83, score-0.126]

38 Also, Xy(x) represents the set of samples that have labels equal to the label of x and X¬y(x) the data that have labels opposite to the label of x. [sent-86, score-0.11]

39 We start by computing the distance-based metric of each point on each projection of size up to d - there are d∗ such projections. [sent-88, score-0.228]

40 This procedure yields an extended set of features Z, which we name local entropy estimates: Zij = νk (πj (xi ), πj (Xy(xi ) ) \ πj (xi )) νk (πj (xi ), πj (X¬y(xi ) ) \ πj (xi )) d(1−α) α≈1 j ∈ {1 . [sent-89, score-0.178]

41 The objective can be then further rewritten as a function of the entropy estimates: n I[g(xi ) → πj ]Zij min M ∈Md (6) i=1 πj ∈Π From the definition of T, it is also clear that n n I[g(xi ) → πj ]Zij min M ∈Md 2. [sent-93, score-0.182]

42 However, the larger the set of projections, the more values the function g takes, and thus the problem of selecting the correct projection becomes more difficult. [sent-98, score-0.2]

43 It becomes apparent that the number of projections should be somehow restricted to allow intepretability. [sent-99, score-0.334]

44 Assuming a hard threshold of at most t projections, the optimization (6) becomes an entry selection problem over matrix Z where one value must be picked from each row under a limitation on the number of columns that can be used. [sent-100, score-0.104]

45 3 Projection retrieval through regularized regression To transform the projection retrieval to a regression problem we consider T, the minimum obtainable value of the entropy estimator for each point, as the output which the method needs to predict. [sent-104, score-0.484]

46 Each row i of the parameter matrix B represents the degrees to which the entropy estimates on each projection contribute to the entropy estimator of point xi . [sent-105, score-0.689]

47 Thus, the sum over each row of B is 1, and the regularization penalty applies to the number of non-zero columns in B. [sent-106, score-0.076]

48 An alternative mechanism to encourage the non-zero elements in B to populate a small number of columns is to add a penalty term in the form of Bδ, where δ is a d∗ -size column vector with each element representing the penalty for a column in B. [sent-112, score-0.118]

49 With no prior information about which subspaces are more informative, δ starts as an all-1 vector. [sent-113, score-0.153]

50 Once δ converges, the projections corresponding to the non-zero columns of B are added to the model. [sent-118, score-0.368]

51 4 Lasso for projection selection We will compare our algorithm to lasso regularization that ranks the projections in terms of their potential for data separability. [sent-133, score-0.686]

52 The lasso penalty to the coefficient vector encourages sparsity. [sent-135, score-0.124]

53 The lasso on entropy contributions was not found to perform well as it is not query specific and will find one projection for all data. [sent-137, score-0.518]

54 We improved it by allowing it to iteratively find projections - this robust version offers increased performance by reweighting the data thus focusing on different subsets of it. [sent-138, score-0.334]

55 Although better than running lasso on entropy contributions, the robust lasso does not match RECIP’s performance as the projections are selected gradually rather than jointly. [sent-139, score-0.624]

56 Running the standard lasso on the original design matrix yields a set of relevant variables and it is not immediately clear how the solution would translate to the desired class. [sent-140, score-0.11]

57 5 The selection function Once the projections are selected, the second stage of the algorithm deals with assigning the projection with which to classify a particular query point. [sent-142, score-0.714]

58 An immediate way of selecting the correct projection starts by computing the local entropy estimator for each subspace with each class assignment. [sent-143, score-0.455]

59 d∗ , α≈1 (9) Experimental results In this section we illustrate the capability of RECIP to retrieve informative projections of data and their use in support of interpreting results of classification. [sent-148, score-0.55]

60 First, we analyze how well RECIP can identify subspaces in synthetic data whose distribution obeys the subspace separability assumption (3. [sent-149, score-0.336]

61 As a point of reference, we also present classification accuracy results (3. [sent-151, score-0.096]

62 This is to quantify the extent of the trade-off between fidelity of attainable classifiers and desired informativeness of the projections chosen by RECIP. [sent-153, score-0.334]

63 3) of informative projections recovered from real-world data and their utility in explaining to human users the decision processes applied to query points. [sent-156, score-0.825]

64 A set of artificial data used in our experiments contains q batches of data points, each of them made classifiable with high accuracy using one of available 2-dimensional subspaces (x1 , x2 ) with k ∈ k k {1 . [sent-157, score-0.287]

65 We control the amount of k k noise added to thusly created synthetic data by varying the proportion of noisy data points in each batch. [sent-163, score-0.11]

66 The results below are for datasets with 7 features each, with number of batches q ranging between 1 and 7. [sent-164, score-0.118]

67 For the experiments on real-world data, d was set to n (all projections were considered). [sent-168, score-0.334]

68 For the artificial data experiments, we reported results for d = 2 as they do not change significantly for d >= 2 because this data was synthesized to contain bidimensional informative projections. [sent-169, score-0.22]

69 In general, if d is too low, the correct full set of projections will not be found, but it may be recovered partially. [sent-170, score-0.389]

70 If d is chosen too high, there is a risk that a given selected projection p will contain irrelevant features compared to the true projection p0 . [sent-171, score-0.452]

71 However, this situation only occurs if the noise introduced by these features in the estimators makes the entropy contributions on p and p0 statistically indistinguishable for a large subset of data. [sent-172, score-0.178]

72 1 Recovering informative projections Table 1 shows how well RECIP recovers the q subspaces corresponding to the synthesized batches of data. [sent-176, score-0.773]

73 We measure precision (proportion of the recovered projections that are known to be informative), and recall (proportion of known informative projections that are recovered by the algorithm). [sent-177, score-0.994]

74 in Table 1, rows correspond to the number of distinct synthetic batches injected in data, q, and subsequent columns correspond to the increasing amounts of noise in data. [sent-178, score-0.207]

75 As the number of projections increases, the chances for overlap among the affected features also increase, which makes the data more confusing resulting on a gradual drop of recall until only about 3 or 4 of the 7 known to be informative subspaces can be recovered. [sent-181, score-0.769]

76 This setup only manages to recover one of the informative subspaces, regardless of how the regularization parameter is tuned. [sent-184, score-0.188]

77 2 Classification accuracy Table 2 shows the classification accuracy of RECIP, obtained using synthetic data. [sent-186, score-0.186]

78 As expected, the observed performance is initially high when there are few known informative projections in data and it decreases as noise and ambiguity of the injected patterns increase. [sent-187, score-0.614]

79 Most types of ensemble learners would use a voting scheme to arrive at the final classification of a testing sample, rather than use a model selection scheme. [sent-188, score-0.18]

80 For this reason, we have also compared predictive accuracy revealed by RECIP against a method based on majority voting among multiple candidate subspaces. [sent-189, score-0.147]

81 Table 4 shows that the accuracy of this technique is lower than the accuracy of RECIP, regardless of whether the informative projections are recovered by the algorithm or assumed to be known a priori. [sent-190, score-0.713]

82 This confirms the intuition that a selection-based approach can be more effective than voting for data which satisfies the subspace separability assumption. [sent-191, score-0.212]

83 For reference, we have also classified the synthetic data using K-Nearest-Neighbors algorithm using all available features at once. [sent-192, score-0.102]

84 Compared to RECIP, K-NN performs worse when there are fewer synthetic patterns injected in data to form informative projections. [sent-195, score-0.33]

85 As more features become informative, the K-NN accuracy improves. [sent-197, score-0.12]

86 This example shows the benefit of a selective approach to feature space and using a subset of the most explanatory projections to support not only explanatory analyses but also classification tasks in such circumstances. [sent-198, score-0.441]

87 For d lower than the number of features, RECIP picks projections of only one feature, but if there is no such limitation, RECIP picks the space of all the features as informative. [sent-204, score-0.442]

88 Mean/var obtained for each setting by repeating the experiment with datasets with different informative projections. [sent-213, score-0.188]

89 0018 Table 3: Accuracy of K-NN and RECIP Dataset KNN RECIP In Spam data, the two most informative projections are Breast Cancer Wis 0. [sent-434, score-0.522]

90 9839 beled as spam while the blue ones are non-spam. [sent-448, score-0.141]

91 The circles are query points that have been assigned to be classified with the projection in which they are plotted. [sent-449, score-0.371]

92 Not only does the importance of text in capital letters make sense for a spam filtering dataset, but the points that select those projections are almost flawlessly classified. [sent-451, score-0.625]

93 Additionally, assuming the user would need to attest the validity of classification for the first plot, he/she would have no trouble seeing that the circled data points are located in a region predominantly populated with examples of spam, so any non-spam entry appears suspicious. [sent-452, score-0.173]

94 It is relevant in applications where the decision process must be easy to understand in order to enable human interpretation of the results. [sent-689, score-0.166]

95 We have developed a principled, regression-based algorithm designed to recover small sets of low-dimensional subspaces that support interpretability. [sent-690, score-0.183]

96 It optimizes the selection using individual data-point-specific entropy estimators. [sent-691, score-0.196]

97 In this context, the proposed algorithm follows the idea of transductive learning, and the role of the resulting projections bears resemblance to high confidence regions known in conformal prediction models. [sent-692, score-0.368]

98 Empirical results obtained using simulated and real-world data show the effectiveness of our method in finding informative projections that enable accurate classification while maintaining transparency of the underlying decision process. [sent-693, score-0.62]

99 Intuitive representation of decision trees using general rules and exceptions. [sent-731, score-0.118]

100 Joint covariate selection and joint subspace selection for multiple classification problems. [sent-742, score-0.235]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('recip', 0.605), ('projections', 0.334), ('projection', 0.2), ('informative', 0.188), ('subspaces', 0.153), ('xi', 0.151), ('spam', 0.141), ('classi', 0.136), ('prc', 0.132), ('entropy', 0.126), ('md', 0.123), ('capital', 0.121), ('query', 0.11), ('discriminator', 0.105), ('subspace', 0.095), ('zij', 0.091), ('classification', 0.088), ('lasso', 0.082), ('hg', 0.081), ('xy', 0.08), ('voting', 0.079), ('selection', 0.07), ('accuracy', 0.068), ('batches', 0.066), ('decision', 0.066), ('injected', 0.057), ('cation', 0.056), ('recovered', 0.055), ('dkl', 0.053), ('attest', 0.053), ('bi', 0.052), ('features', 0.052), ('border', 0.051), ('synthetic', 0.05), ('dim', 0.05), ('retrieval', 0.048), ('validity', 0.047), ('tsallis', 0.046), ('bk', 0.046), ('user', 0.044), ('vowel', 0.043), ('ting', 0.043), ('scanned', 0.043), ('overlap', 0.042), ('penalty', 0.042), ('readings', 0.04), ('variance', 0.04), ('human', 0.04), ('ui', 0.039), ('feature', 0.039), ('separability', 0.038), ('hi', 0.037), ('lazy', 0.037), ('regulatory', 0.037), ('michael', 0.036), ('feder', 0.035), ('dataset', 0.035), ('patterns', 0.035), ('estimator', 0.034), ('vehicle', 0.034), ('breast', 0.034), ('intervention', 0.034), ('transductive', 0.034), ('explanatory', 0.034), ('columns', 0.034), ('arti', 0.033), ('longest', 0.033), ('ti', 0.032), ('knn', 0.032), ('synthesized', 0.032), ('enable', 0.032), ('users', 0.032), ('circles', 0.032), ('ensemble', 0.031), ('pittsburgh', 0.031), ('proportion', 0.031), ('al', 0.031), ('cial', 0.03), ('designed', 0.03), ('obozinski', 0.029), ('table', 0.029), ('points', 0.029), ('transform', 0.028), ('system', 0.028), ('dots', 0.028), ('retrieve', 0.028), ('picks', 0.028), ('point', 0.028), ('relevant', 0.028), ('labels', 0.028), ('precision', 0.028), ('min', 0.028), ('label', 0.027), ('rules', 0.027), ('substantially', 0.027), ('length', 0.026), ('trees', 0.025), ('combinations', 0.025), ('estimates', 0.024), ('mellon', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 279 nips-2012-Projection Retrieval for Classification

Author: Madalina Fiterau, Artur Dubrawski

Abstract: In many applications, classification systems often require human intervention in the loop. In such cases the decision process must be transparent and comprehensible, simultaneously requiring minimal assumptions on the underlying data distributions. To tackle this problem, we formulate an axis-aligned subspace-finding task under the assumption that query specific information dictates the complementary use of the subspaces. We develop a regression-based approach called RECIP that efficiently solves this problem by finding projections that minimize a nonparametric conditional entropy estimator. Experiments show that the method is accurate in identifying the informative projections of the dataset, picking the correct views to classify query points, and facilitates visual evaluation by users. 1 Introduction and problem statement In the domain of predictive analytics, many applications which keep human users in the loop require the use of simple classification models. Often, it is required that a test-point be ‘explained’ (classified) using a simple low-dimensional projection of the original feature space. This is a Projection Retrieval for Classification problem (PRC). The interaction with the user proceeds as follows: the user provides the system a query point; the system searches for a projection in which the point can be accurately classified; the system displays the classification result as well as an illustration of how the classification decision was reached in the selected projection. Solving the PRC problem is relevant in many practical applications. For instance, consider a nuclear threat detection system installed at a border check point. Vehicles crossing the border are scanned with sensors so that a large array of measurements of radioactivity and secondary contextual information is being collected. These observations are fed into a classification system that determines whether the scanned vehicle may carry a threat. Given the potentially devastating consequences of a false negative, a border control agent is requested to validate the prediction and decide whether to submit the vehicle for a costly further inspection. With the positive classification rate of the system under strict bounds because of limitations in the control process, the risk of false negatives is increased. Despite its crucial role, human intervention should only be withheld for cases in which there are reasons to doubt the validity of classification. In order for a user to attest the validity of a decision, the user must have a good understanding of the classification process, which happens more readily when the classifier only uses the original dataset features rather than combinations of them, and when the discrimination models are low-dimensional. In this context, we aim to learn a set of classifiers in low-dimensional subspaces and a decision function which selects the subspace under which a test point is to be classified. Assume we are given a dataset {(x1 , y1 ) . . . (xn , yn )} ∈ X n × {0, 1}n and a class of discriminators H. The model will contain a set Π of subspaces of X ; Π ⊆ Π, where Π is the set of all axis-aligned subspaces of the original feature space, the power set of the features. To each projection πi ∈ Π corresponds one discriminator from a given hypothesis space hi ∈ H. It will also contain a selection function g : X → Π × H, which yields, for a query point x, the projection/discriminator pair with which this point will be classified. The notation π(x) refers to the projection of the point x onto the subspace 1 π while h(π(x)) represents the predicted label for x. Formally, we describe the model class as Md = {Π = {π : π ∈ Π, dim(π) ≤ d}, H = {hi : hi ∈ H, h : πi → Y, ∀i = 1 . . . |Π|}, g ∈ {f : X → {1 . . . |Π|}} . where dim(π) presents the dimensionality of the subspace determined by the projection π. Note that only projections up to size d will be considered, where d is a parameter specific to the application. The set H contains one discriminator from the hypothesis class H for each projection. Intuitively, the aim is to minimize the expected classification error over Md , however, a notable modification is that the projection and, implicitly, the discriminator, are chosen according to the data point that needs to be classified. Given a query x in the space X , g(x) will yield the subspace πg(x) onto which the query is projected and the discriminator hg(x) for it. Distinct test points can be handled using different combinations of subspaces and discriminators. We consider models that minimize 0/1 loss. Hence, the PRC problem can be stated as follows: M ∗ = arg min EX ,Y y = hg(x) (πg(x) (x)) M ∈Md There are limitations to the type of selection function g that can be learned. A simple example for which g can be recovered is a set of signal readings x for which, if one of the readings xi exceeds a threshold ti , the label can be predicted just based on xi . A more complex one is a dataset containing regulatory variables, that is, for xi in the interval [ak , bk ] the label only depends on (x1 . . . xnk ) k k datasets that fall into the latter category fulfill what we call the Subspace-Separability Assumption. This paper proposes an algorithm called RECIP that solves the PRC problem for a class of nonparametric classifiers. We evaluate the method on artificial data to show that indeed it correctly identifies the underlying structure for data satisfying the Subspace-Separability Assumption. We show some case studies to illustrate how RECIP offers insight into applications requiring human intervention. The use of dimensionality reduction techniques is a common preprocessing step in applications where the use of simplified classification models is preferable. Methods that learn linear combinations of features, such as Linear Discriminant Analysis, are not quite appropriate for the task considered here, since we prefer to natively rely on the dimensions available in the original feature space. Feature selection methods, such as e.g. lasso, are suitable for identifying sets of relevant features, but do not consider interactions between them. Our work better fits the areas of class dependent feature selection and context specific classification, highly connected to the concept of Transductive Learning [6]. Other context-sensitive methods are Lazy and Data-Dependent Decision Trees, [5] and [10] respectively. In Ting et al [14], the Feating submodel selection relies on simple attribute splits followed by fitting local predictors, though the algorithm itself is substantially different. Obozinski et al present a subspace selection method in the context of multitask learning [11]. Go et al propose a joint method for feature selection and subspace learning [7], however, their classification model is not particularly query specific. Alternatively, algorithms that transform complex or unintelligible models with user-friendly equivalents have been proposed [3, 2, 1, 8]. Algorithms specifically designed to yield understandable models are a precious few. Here we note a rule learning method described in [12], even though the resulting rules can make visualization difficult, while itemset mining [9] is not specifically designed for classification. Unlike those approaches, our method is designed to retrieve subsets of the feature space designed for use in a way that is complementary to the basic task at hand (classification) while providing query-specific information. 2 Recovering informative projections with RECIP To solve PRC, we need means by which to ascertain which projections are useful in terms of discriminating data from the two classes. Since our model allows the use of distinct projections depending on the query point, it is expected that each projection would potentially benefit different areas of the feature space. A(π) refers to the area of the feature space where the projection π is selected. A(π) = {x ∈ X : πg(x) = π} The objective becomes min E(X ×Y) y = hg(x) (πg(x) (x)) M ∈Md = p(A(π))E y = hg(x) (πg(x) (x))|x ∈ A(π) min M ∈Md 2 π∈Π . The expected classification error over A(π) is linked to the conditional entropy of Y |X. Fano’s inequality provides a lower bound on the error while Feder and Merhav [4] derive a tight upper bound on the minimal error probability in terms of the entropy. This means that conditional entropy characterizes the potential of a subset of the feature space to separate data, which is more generic than simply quantifying classification accuracy for a specific discriminator. In view of this connection between classification accuracy and entropy, we adapt the objective to: p(A(π))H(Y |π(X); X ∈ A(π)) min M ∈Md (1) π∈Π The method we propose optimizes an empirical analog of (1) which we develop below and for which we will need the following result. Proposition 2.1. Given a continuous variable X ∈ X and a binary variable Y , where X is sampled from the mixture model f (x) = p(y = 0)f0 (x) + p(y = 1)f1 (x) = p0 f0 (x) + p1 f1 (x) , then H(Y |X) = −p0 log p0 − p1 log p1 − DKL (f0 ||f ) − DKL (f1 ||f ) . Next, we will use the nonparametric estimator presented in [13] for Tsallis α-divergence. Given samples Ui ∼ U, with i = 1, n and Vj ∼ V with j = 1, m, the divergence is estimated as follows: ˆ Tα (U ||V ) = 1 1 1−α n n (n − 1)νk (Ui , U \ ui )d mνk (Ui , V )d i=1 1−α B(k, α) − 1 , (2) where d is the dimensionality of the variables U and V and νk (z, Z) represents the distance from z ˆ to its k th nearest neighbor of the set of points Z. For α ≈ 1 and n → ∞, Tα (u||v) ≈ DKL (u||v). 2.1 Local estimators of entropy We will now plug (2) in the formula obtained by Proposition 2.1 to estimate the quantity (1). We use the notation X0 to represent the n0 samples from X which have the labels Y equal to 0, and X1 to represent the n1 samples from X which have the labels set to 1. Also, Xy(x) represents the set of samples that have labels equal to the label of x and X¬y(x) the data that have labels opposite to the label of x. ˆ ˆ x ˆ x H(Y |X; X ∈ A) = −H(p0 ) − H(p1 ) − T (f0 ||f x ) − T (f1 ||f x ) + C α≈1 ˆ H(Y |X; X ∈ A) ∝ 1 n0 + 1 n1 ∝ 1 n0 + 1 n1 ∝ 1 n n0 (n0 − 1)νk (xi , X0 \ xi )d nνk (xi , X \ xi )d 1−α I[xi ∈ A] (n1 − 1)νk (xi , X1 \ xi )d nνk (xi , X \ xi )d 1−α I[xi ∈ A] (n0 − 1)νk (xi , X0 \ xi )d nνk (xi , X1 \ xi )d 1−α I[xi ∈ A] (n1 − 1)νk (xi , X1 \ xi )d nνk (xi , X0 \ xi )d 1−α I[xi ∈ A] i=1 n1 i=1 n0 i=1 n1 i=1 n I[xi ∈ A] i=1 (n − 1)νk (xi , Xy(xi ) \ xi )d nνk (xi , X¬y(xi ) \ xi )d 1−α The estimator for the entropy of the data that is classified with projection π is as follows: ˆ H(Y |π(X); X ∈ A(π)) ∝ 1 n n I[xi ∈ A(π)] i=1 (n − 1)νk (π(xi ), π(Xy(xi ) ) \ π(xi ))d nνk (π(xi ), π(X¬y(xi ) \ xi ))d 1−α (3) From 3 and using the fact that I[xi ∈ A(π)] = I[πg(xi ) = π] for which we use the notation I[g(xi ) → π], we estimate the objective as min M ∈Md π∈Π 1 n n I[g(xi ) → π] i=1 (n − 1)νk (π(xi ), π(Xy(xi ) ) \ π(xi ))d nνk (π(xi ), π(X¬y(xi ) \ xi ))d 3 1−α (4) Therefore, the contribution of each data point to the objective corresponds to a distance ratio on the projection π ∗ where the class of the point is obtained with the highest confidence (data is separable in the neighborhood of the point). We start by computing the distance-based metric of each point on each projection of size up to d - there are d∗ such projections. This procedure yields an extended set of features Z, which we name local entropy estimates: Zij = νk (πj (xi ), πj (Xy(xi ) ) \ πj (xi )) νk (πj (xi ), πj (X¬y(xi ) ) \ πj (xi )) d(1−α) α≈1 j ∈ {1 . . . d∗ } (5) For each training data point, we compute the best distance ratio amid all the projections, which is simply Ti = minj∈[d∗ ] Zij . The objective can be then further rewritten as a function of the entropy estimates: n I[g(xi ) → πj ]Zij min M ∈Md (6) i=1 πj ∈Π From the definition of T, it is also clear that n n I[g(xi ) → πj ]Zij min M ∈Md 2.2 ≥ i=1 πj ∈Π Ti . (7) i=1 Projection selection as a combinatorial problem Considering form (6) of the objective, and given that the estimates Zij are constants, depending only on the training set, the projection retrieval problem is reduced to finding g for all training points, which will implicitly select the projection set of the model. Naturally, one might assume the bestperforming classification model is the one containing all the axis-aligned subspaces. This model achieves the lower bound (7) for the training set. However, the larger the set of projections, the more values the function g takes, and thus the problem of selecting the correct projection becomes more difficult. It becomes apparent that the number of projections should be somehow restricted to allow intepretability. Assuming a hard threshold of at most t projections, the optimization (6) becomes an entry selection problem over matrix Z where one value must be picked from each row under a limitation on the number of columns that can be used. This problem cannot be solved exactly in polynomial time. Instead, it can be formulated as an optimization problem under 1 constraints. 2.3 Projection retrieval through regularized regression To transform the projection retrieval to a regression problem we consider T, the minimum obtainable value of the entropy estimator for each point, as the output which the method needs to predict. Each row i of the parameter matrix B represents the degrees to which the entropy estimates on each projection contribute to the entropy estimator of point xi . Thus, the sum over each row of B is 1, and the regularization penalty applies to the number of non-zero columns in B. d∗ min ||T − (Z B B)J|Π|,1 ||2 2 +λ [Bi = 0] (8) i=1 subject to |Bk | 1 = 1 k = 1, n where (Z B)ij = Zij + Bij and J is a matrix of ones. The problem with this optimization is that it is not convex. A typical walk-around of this issue is to use the convex relaxation for Bi = 0, that is 1 norm. This would transform the penalized term d∗ d∗ n to i=1 |Bi | 1 . However, i=1 |Bi | 1 = k=1 |Bk | 1 = n , so this penalty really has no effect. An alternative mechanism to encourage the non-zero elements in B to populate a small number of columns is to add a penalty term in the form of Bδ, where δ is a d∗ -size column vector with each element representing the penalty for a column in B. With no prior information about which subspaces are more informative, δ starts as an all-1 vector. An initial value for B is obtained through the optimization (8). Since our goal is to handle data using a small number of projections, δ is then updated such that its value is lower for the denser columns in B. This update resembles the reweighing in adaptive lasso. The matrix B itself is updated, and this 2-step process continues until convergence of δ. Once δ converges, the projections corresponding to the non-zero columns of B are added to the model. The procedure is shown in Algorithm 1. 4 Algorithm 1: RECIP δ = [1 . . . 1] repeat |P I| b = arg minB ||T − i=1 < Z, B > ||2 + λ|Bδ| 1 2 subject to |Bk | 1 = 1 k = 1...n δk = |Bi | 1 i = . . . d∗ (update the differential penalty) δ δ = 1 − |δ| 1 until δ converges return Π = {πi ; |Bi | 1 > 0 ∀i = 1 . . . d∗ } 2.4 Lasso for projection selection We will compare our algorithm to lasso regularization that ranks the projections in terms of their potential for data separability. We write this as an 1 -penalized optimization on the extended feature set Z, with the objective T : minβ |T − Zβ|2 + λ|β| 1 . The lasso penalty to the coefficient vector encourages sparsity. For a high enough λ, the sparsity pattern in β is indicative of the usefulness of the projections. The lasso on entropy contributions was not found to perform well as it is not query specific and will find one projection for all data. We improved it by allowing it to iteratively find projections - this robust version offers increased performance by reweighting the data thus focusing on different subsets of it. Although better than running lasso on entropy contributions, the robust lasso does not match RECIP’s performance as the projections are selected gradually rather than jointly. Running the standard lasso on the original design matrix yields a set of relevant variables and it is not immediately clear how the solution would translate to the desired class. 2.5 The selection function Once the projections are selected, the second stage of the algorithm deals with assigning the projection with which to classify a particular query point. An immediate way of selecting the correct projection starts by computing the local entropy estimator for each subspace with each class assignment. Then, we may select the label/subspace combination that minimizes the empirical entropy. (i∗ , θ∗ ) = arg min i,θ 3 νk (πi (x), πi (Xθ )) νk (πi (x), πi (X¬θ )) dim(πi )(1−α) i = 1 . . . d∗ , α≈1 (9) Experimental results In this section we illustrate the capability of RECIP to retrieve informative projections of data and their use in support of interpreting results of classification. First, we analyze how well RECIP can identify subspaces in synthetic data whose distribution obeys the subspace separability assumption (3.1). As a point of reference, we also present classification accuracy results (3.2) for both the synthetic data and a few real-world sets. This is to quantify the extent of the trade-off between fidelity of attainable classifiers and desired informativeness of the projections chosen by RECIP. We expect RECIP’s classification performance to be slightly, but not substantially worse when compared to relevant classification algorithms trained to maximize classification accuracy. Finally, we present a few examples (3.3) of informative projections recovered from real-world data and their utility in explaining to human users the decision processes applied to query points. A set of artificial data used in our experiments contains q batches of data points, each of them made classifiable with high accuracy using one of available 2-dimensional subspaces (x1 , x2 ) with k ∈ k k {1 . . . q}. The data in batch k also have the property that x1 > tk . This is done such that the group a k point belongs to can be detected from x1 , thus x1 is a regulatory variable. We control the amount of k k noise added to thusly created synthetic data by varying the proportion of noisy data points in each batch. The results below are for datasets with 7 features each, with number of batches q ranging between 1 and 7. We kept the number of features specifically low in order to prevent excessive variation between any two sets generated this way, and to enable computing meaningful estimates of the expectation and variance of performance, while enabling creation of complicated data in which synthetic patterns may substantially overlap (using 7 features and 7 2-dimensional patterns implies that dimensions of at least 4 of the patterns will overlap). We implemented our method 5 to be scalable to the size and dimensionality of data and although for brevity we do not include a discussion of this topic here, we have successfully run RECIP against data with 100 features. The parameter α is a value close to 1, because the Tsallis divergence converges to the KL divergence as alpha approaches 1. For the experiments on real-world data, d was set to n (all projections were considered). For the artificial data experiments, we reported results for d = 2 as they do not change significantly for d >= 2 because this data was synthesized to contain bidimensional informative projections. In general, if d is too low, the correct full set of projections will not be found, but it may be recovered partially. If d is chosen too high, there is a risk that a given selected projection p will contain irrelevant features compared to the true projection p0 . However, this situation only occurs if the noise introduced by these features in the estimators makes the entropy contributions on p and p0 statistically indistinguishable for a large subset of data. The users will choose d according to the desired/acceptable complexity of the resulting model. If the results are to be visually interpreted by a human, values of 2 or 3 are reasonable for d. 3.1 Recovering informative projections Table 1 shows how well RECIP recovers the q subspaces corresponding to the synthesized batches of data. We measure precision (proportion of the recovered projections that are known to be informative), and recall (proportion of known informative projections that are recovered by the algorithm). in Table 1, rows correspond to the number of distinct synthetic batches injected in data, q, and subsequent columns correspond to the increasing amounts of noise in data. We note that the observed precision is nearly perfect: the algorithm makes only 2 mistakes over the entire set of experiments, and those occur for highly noisy setups. The recall is nearly perfect as long as there is little overlap among the dimensions, that is when the injections do not interfere with each other. As the number of projections increases, the chances for overlap among the affected features also increase, which makes the data more confusing resulting on a gradual drop of recall until only about 3 or 4 of the 7 known to be informative subspaces can be recovered. We have also used lasso as described in 2.4 in an attempt to recover projections. This setup only manages to recover one of the informative subspaces, regardless of how the regularization parameter is tuned. 3.2 Classification accuracy Table 2 shows the classification accuracy of RECIP, obtained using synthetic data. As expected, the observed performance is initially high when there are few known informative projections in data and it decreases as noise and ambiguity of the injected patterns increase. Most types of ensemble learners would use a voting scheme to arrive at the final classification of a testing sample, rather than use a model selection scheme. For this reason, we have also compared predictive accuracy revealed by RECIP against a method based on majority voting among multiple candidate subspaces. Table 4 shows that the accuracy of this technique is lower than the accuracy of RECIP, regardless of whether the informative projections are recovered by the algorithm or assumed to be known a priori. This confirms the intuition that a selection-based approach can be more effective than voting for data which satisfies the subspace separability assumption. For reference, we have also classified the synthetic data using K-Nearest-Neighbors algorithm using all available features at once. The results of that experiment are shown in Table 5. Since RECIP uses neighbor information, K-NN is conceptually the closest among the popular alternatives. Compared to RECIP, K-NN performs worse when there are fewer synthetic patterns injected in data to form informative projections. It is because some features used then by K-NN are noisy. As more features become informative, the K-NN accuracy improves. This example shows the benefit of a selective approach to feature space and using a subset of the most explanatory projections to support not only explanatory analyses but also classification tasks in such circumstances. 3.3 RECIP case studies using real-world data Table 3 summarizes the RECIP and K-NN performance on UCI datasets. We also test the methods using Cell dataset containing a set of measurements such as the area and perimeter biological cells with separate labels marking cells subjected to treatment and control cells. In Vowel data, the nearest-neighbor approach works exceptionally well, even outperforming random forests (0.94 accuracy), which is an indication that all features are jointly relevant. For d lower than the number of features, RECIP picks projections of only one feature, but if there is no such limitation, RECIP picks the space of all the features as informative. 6 Table 1: Projection recovery for artificial datasets with 1 . . . 7 informative features and noise level 0 . . . 0.2 in terms of mean and variance of Precision and Recall. Mean/var obtained for each setting by repeating the experiment with datasets with different informative projections. PRECISION 1 2 3 4 5 6 7 0 1 1 1 1 1 1 1 0.02 1 1 1 1 1 1 1 Mean 0.05 1 1 1 1 1 1 1 1 2 3 4 5 6 7 0 1 1 1 0.9643 0.7714 0.6429 0.6327 0.02 1 1 1 0.9643 0.7429 0.6905 0.5918 Mean 0.05 1 1 0.9524 0.9643 0.8286 0.6905 0.5918 0 0 0 0 0 0 0 0 0.02 0 0 0 0 0 0 0 Variance 0.05 0 0 0 0 0 0 0 0.1 0.0306 0 0 0 0 0 0 0.2 0.0306 0 0 0 0 0 0 0 0 0 0 0.0077 0.0163 0.0113 0.0225 0.02 0 0 0 0.0077 0.0196 0.0113 0.02 Variance 0.05 0 0 0.0136 0.0077 0.0049 0.0272 0.0258 0.1 0 0 0.0136 0.0077 0.0082 0.0113 0.0233 0.2 0 0 0 0.0128 0.0278 0.0113 0.02 0.1 0.0008 0.0001 0.0028 0.0025 0.0036 0.0025 0.0042 0.2 0.0007 0.0001 0.0007 0.0032 0.0044 0.0027 0.0045 0.1 0.0001 0.0001 0.0007 0.0014 0.0019 0.0023 0.0021 0.2 0.0000 0.0001 0.0007 0.0020 0.0023 0.0021 0.0020 0.1 0.9286 1 1 1 1 1 1 0.2 0.9286 1 1 1 1 1 1 RECALL 0.1 1 1 0.9524 0.9643 0.8571 0.6905 0.5714 0.2 1 1 1 0.9286 0.7714 0.6905 0.551 Table 2: RECIP Classification Accuracy on Artificial Data 1 2 3 4 5 6 7 0 0.9751 0.9333 0.9053 0.8725 0.8113 0.7655 0.7534 1 2 3 4 5 6 7 0 0.9751 0.9333 0.9053 0.8820 0.8714 0.8566 0.8429 CLASSIFICATION ACCURACY Mean Variance 0.02 0.05 0.1 0.2 0 0.02 0.05 0.9731 0.9686 0.9543 0.9420 0.0000 0.0000 0.0000 0.9297 0.9227 0.9067 0.8946 0.0001 0.0001 0.0001 0.8967 0.8764 0.8640 0.8618 0.0004 0.0005 0.0016 0.8685 0.8589 0.8454 0.8187 0.0020 0.0020 0.0019 0.8009 0.8105 0.8105 0.7782 0.0042 0.0044 0.0033 0.7739 0.7669 0.7632 0.7511 0.0025 0.0021 0.0026 0.7399 0.7347 0.7278 0.7205 0.0034 0.0040 0.0042 CLASSIFICATION ACCURACY - KNOWN PROJECTIONS Mean Variance 0.02 0.05 0.1 0.2 0 0.02 0.05 0.9731 0.9686 0.9637 0.9514 0.0000 0.0000 0.0000 0.9297 0.9227 0.9067 0.8946 0.0001 0.0001 0.0001 0.8967 0.8914 0.8777 0.8618 0.0004 0.0005 0.0005 0.8781 0.8657 0.8541 0.8331 0.0011 0.0011 0.0014 0.8641 0.8523 0.8429 0.8209 0.0015 0.0015 0.0018 0.8497 0.8377 0.8285 0.8074 0.0014 0.0015 0.0016 0.8371 0.8256 0.8122 0.7988 0.0015 0.0018 0.0018 Table 3: Accuracy of K-NN and RECIP Dataset KNN RECIP In Spam data, the two most informative projections are Breast Cancer Wis 0.8415 0.8275 ’Capital Length Total’ (CLT)/’Capital Length Longest’ Breast Tissue 1.0000 1.0000 (CLL) and CLT/’Frequency of word your’ (FWY). FigCell 0.7072 0.7640 ure 1 shows these two projections, with the dots repreMiniBOONE* 0.7896 0.7396 senting training points. The red dots represent points laSpam 0.7680 0.7680 Vowel 0.9839 0.9839 beled as spam while the blue ones are non-spam. The circles are query points that have been assigned to be classified with the projection in which they are plotted. The green circles are correctly classified points, while the magenta circles - far fewer - are the incorrectly classified ones. Not only does the importance of text in capital letters make sense for a spam filtering dataset, but the points that select those projections are almost flawlessly classified. Additionally, assuming the user would need to attest the validity of classification for the first plot, he/she would have no trouble seeing that the circled data points are located in a region predominantly populated with examples of spam, so any non-spam entry appears suspicious. Both of the magenta-colored cases fall into this category, and they can be therefore flagged for further investigation. 7 Informative Projection for the Spam dataset 2000 1500 1000 500 0 Informative Projection for the Spam dataset 12 Frequency of word ‘your‘ Capital Run Length Longest 2500 10 8 6 4 2 0 500 1000 1500 2000 2500 Capital Run Length Total 3000 0 3500 0 2000 4000 6000 8000 10000 12000 14000 Capital Run Length Total 16000 Figure 1: Spam Dataset Selected Subspaces Table 4: Classification accuracy using RECIP-learned projections - or known projections, in the lower section - within a voting model instead of a selection model 1 2 3 4 5 6 7 1 2 3 4 5 6 7 CLASSIFICATION ACCURACY - VOTING ENSEMBLE Mean Variance 0 0.02 0.05 0.1 0.2 0 0.02 0.05 0.1 0.9751 0.9731 0.9686 0.9317 0.9226 0.0000 0.0000 0.0000 0.0070 0.7360 0.7354 0.7331 0.7303 0.7257 0.0002 0.0002 0.0001 0.0002 0.7290 0.7266 0.7163 0.7166 0.7212 0.0002 0.0002 0.0008 0.0006 0.6934 0.6931 0.6932 0.6904 0.6867 0.0008 0.0008 0.0008 0.0008 0.6715 0.6602 0.6745 0.6688 0.6581 0.0013 0.0014 0.0013 0.0014 0.6410 0.6541 0.6460 0.6529 0.6512 0.0008 0.0007 0.0010 0.0006 0.6392 0.6342 0.6268 0.6251 0.6294 0.0009 0.0011 0.0012 0.0012 CLASSIFICATION ACCURACY - VOTING ENSEMBLE, KNOWN PROJECTIONS Mean Variance 0 0.02 0.05 0.1 0.2 0 0.02 0.05 0.1 0.9751 0.9731 0.9686 0.9637 0.9514 0.0000 0.0000 0.0000 0.0001 0.7360 0.7354 0.7331 0.7303 0.7257 0.0002 0.0002 0.0001 0.0002 0.7409 0.7385 0.7390 0.7353 0.7325 0.0010 0.0012 0.0010 0.0011 0.7110 0.7109 0.7083 0.7067 0.7035 0.0041 0.0041 0.0042 0.0042 0.7077 0.7070 0.7050 0.7034 0.7008 0.0015 0.0015 0.0015 0.0016 0.6816 0.6807 0.6801 0.6790 0.6747 0.0008 0.0008 0.0008 0.0008 0.6787 0.6783 0.6772 0.6767 0.6722 0.0008 0.0009 0.0009 0.0008 0.2 0.0053 0.0001 0.0002 0.0009 0.0013 0.0005 0.0012 0.2 0.0000 0.0001 0.0010 0.0043 0.0016 0.0009 0.0008 Table 5: Classification accuracy for artificial data with the K-Nearest Neighbors method CLASSIFICATION ACCURACY - KNN 1 2 3 4 5 6 7 4 0 0.7909 0.7940 0.7964 0.7990 0.8038 0.8043 0.8054 0.02 0.7843 0.7911 0.7939 0.7972 0.8024 0.8032 0.8044 Mean 0.05 0.7747 0.7861 0.7901 0.7942 0.8002 0.8015 0.8028 0.1 0.7652 0.7790 0.7854 0.7904 0.7970 0.7987 0.8004 0.2 0.7412 0.7655 0.7756 0.7828 0.7905 0.7930 0.7955 0 0.0002 0.0001 0.0000 0.0001 0.0001 0.0001 0.0001 0.02 0.0002 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 Variance 0.05 0.0002 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.1 0.0002 0.0001 0.0000 0.0001 0.0001 0.0001 0.0001 0.2 0.0002 0.0001 0.0000 0.0001 0.0001 0.0001 0.0001 Conclusion This paper considers the problem of Projection Recovery for Classification. It is relevant in applications where the decision process must be easy to understand in order to enable human interpretation of the results. We have developed a principled, regression-based algorithm designed to recover small sets of low-dimensional subspaces that support interpretability. It optimizes the selection using individual data-point-specific entropy estimators. In this context, the proposed algorithm follows the idea of transductive learning, and the role of the resulting projections bears resemblance to high confidence regions known in conformal prediction models. Empirical results obtained using simulated and real-world data show the effectiveness of our method in finding informative projections that enable accurate classification while maintaining transparency of the underlying decision process. Acknowledgments This material is based upon work supported by the NSF, under Grant No. IIS-0911032. 8 References [1] Mark W. Craven and Jude W. Shavlik. Extracting Tree-Structured Representations of Trained Networks. In David S. Touretzky, Michael C. Mozer, and Michael E. Hasselmo, editors, Advances in Neural Information Processing Systems, volume 8, pages 24–30. The MIT Press, 1996. [2] Pedro Domingos. Knowledge discovery via multiple models. Intelligent Data Analysis, 2:187–202, 1998. [3] Eulanda M. Dos Santos, Robert Sabourin, and Patrick Maupin. A dynamic overproduce-and-choose strategy for the selection of classifier ensembles. Pattern Recogn., 41:2993–3009, October 2008. [4] M. Feder and N. Merhav. Relations between entropy and error probability. Information Theory, IEEE Transactions on, 40(1):259–266, January 1994. [5] Jerome H. Friedman, Ron Kohavi, and Yeogirl Yun. Lazy decision trees, 1996. [6] A. Gammerman, V. Vovk, and V. Vapnik. Learning by transduction. In In Uncertainty in Artificial Intelligence, pages 148–155. Morgan Kaufmann, 1998. [7] Quanquan Gu, Zhenhui Li, and Jiawei Han. Joint feature selection and subspace learning, 2011. [8] Bing Liu, Minqing Hu, and Wynne Hsu. Intuitive representation of decision trees using general rules and exceptions. In Proceedings of Seventeeth National Conference on Artificial Intellgience (AAAI-2000), July 30 - Aug 3, 2000, pages 615–620, 2000. [9] Michael Mampaey, Nikolaj Tatti, and Jilles Vreeken. Tell me what i need to know: succinctly summarizing data with itemsets. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’11, pages 573–581, New York, NY, USA, 2011. ACM. [10] Mario Marchand and Marina Sokolova. Learning with decision lists of data-dependent features. JOURNAL OF MACHINE LEARNING REASEARCH, 6, 2005. [11] Guillaume Obozinski, Ben Taskar, and Michael I. Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20(2):231–252, April 2010. [12] Michael J. Pazzani, Subramani Mani, and W. Rodman Shankle. Beyond concise and colorful: Learning intelligible rules, 1997. [13] B. Poczos and J. Schneider. On the estimation of alpha-divergences. AISTATS, 2011. [14] Kai Ting, Jonathan Wells, Swee Tan, Shyh Teng, and Geoffrey Webb. Feature-subspace aggregating: ensembles for stable andunstable learners. Machine Learning, 82:375–397, 2011. 10.1007/s10994-0105224-5. 9

2 0.12095276 200 nips-2012-Local Supervised Learning through Space Partitioning

Author: Joseph Wang, Venkatesh Saligrama

Abstract: We develop a novel approach for supervised learning based on adaptively partitioning the feature space into different regions and learning local region-specific classifiers. We formulate an empirical risk minimization problem that incorporates both partitioning and classification in to a single global objective. We show that space partitioning can be equivalently reformulated as a supervised learning problem and consequently any discriminative learning method can be utilized in conjunction with our approach. Nevertheless, we consider locally linear schemes by learning linear partitions and linear region classifiers. Locally linear schemes can not only approximate complex decision boundaries and ensure low training error but also provide tight control on over-fitting and generalization error. We train locally linear classifiers by using LDA, logistic regression and perceptrons, and so our scheme is scalable to large data sizes and high-dimensions. We present experimental results demonstrating improved performance over state of the art classification techniques on benchmark datasets. We also show improved robustness to label noise.

3 0.097229078 277 nips-2012-Probabilistic Low-Rank Subspace Clustering

Author: S. D. Babacan, Shinichi Nakajima, Minh Do

Abstract: In this paper, we consider the problem of clustering data points into lowdimensional subspaces in the presence of outliers. We pose the problem using a density estimation formulation with an associated generative model. Based on this probability model, we first develop an iterative expectation-maximization (EM) algorithm and then derive its global solution. In addition, we develop two Bayesian methods based on variational Bayesian (VB) approximation, which are capable of automatic dimensionality selection. While the first method is based on an alternating optimization scheme for all unknowns, the second method makes use of recent results in VB matrix factorization leading to fast and effective estimation. Both methods are extended to handle sparse outliers for robustness and can handle missing values. Experimental results suggest that proposed methods are very effective in subspace clustering and identifying outliers. 1

4 0.087050609 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing

Author: Ashish Kapoor, Raajay Viswanathan, Prateek Jain

Abstract: In this paper, we present a Bayesian framework for multilabel classiďŹ cation using compressed sensing. The key idea in compressed sensing for multilabel classiďŹ cation is to ďŹ rst project the label vector to a lower dimensional space using a random transformation and then learn regression functions over these projections. Our approach considers both of these components in a single probabilistic model, thereby jointly optimizing over compression as well as learning tasks. We then derive an efďŹ cient variational inference scheme that provides joint posterior distribution over all the unobserved labels. The two key beneďŹ ts of the model are that a) it can naturally handle datasets that have missing labels and b) it can also measure uncertainty in prediction. The uncertainty estimate provided by the model allows for active learning paradigms where an oracle provides information about labels that promise to be maximally informative for the prediction task. Our experiments show signiďŹ cant boost over prior methods in terms of prediction performance over benchmark datasets, both in the fully labeled and the missing labels case. Finally, we also highlight various useful active learning scenarios that are enabled by the probabilistic model. 1

5 0.086720601 119 nips-2012-Entropy Estimations Using Correlated Symmetric Stable Random Projections

Author: Ping Li, Cun-hui Zhang

Abstract: Methods for efficiently estimating Shannon entropy of data streams have important applications in learning, data mining, and network anomaly detections (e.g., the DDoS attacks). For nonnegative data streams, the method of Compressed Counting (CC) [11, 13] based on maximally-skewed stable random projections can provide accurate estimates of the Shannon entropy using small storage. However, CC is no longer applicable when entries of data streams can be below zero, which is a common scenario when comparing two streams. In this paper, we propose an algorithm for entropy estimation in general data streams which allow negative entries. In our method, the Shannon entropy is approximated by the finite difference of two correlated frequency moments estimated from correlated samples of symmetric stable random variables. Interestingly, the estimator for the moment we recommend for entropy estimation barely has bounded variance itself, whereas the common geometric mean estimator (which has bounded higher-order moments) is not sufficient for entropy estimation. Our experiments confirm that this method is able to well approximate the Shannon entropy using small storage.

6 0.083947256 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

7 0.082005195 97 nips-2012-Diffusion Decision Making for Adaptive k-Nearest Neighbor Classification

8 0.080765679 204 nips-2012-MAP Inference in Chains using Column Generation

9 0.079479203 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions

10 0.07580848 197 nips-2012-Learning with Recursive Perceptual Representations

11 0.074039489 117 nips-2012-Ensemble weighted kernel estimators for multivariate entropy estimation

12 0.070607997 74 nips-2012-Collaborative Gaussian Processes for Preference Learning

13 0.063646108 105 nips-2012-Dynamic Pruning of Factor Graphs for Maximum Marginal Prediction

14 0.062356111 242 nips-2012-Non-linear Metric Learning

15 0.06161163 144 nips-2012-Gradient-based kernel method for feature extraction and variable selection

16 0.060952332 57 nips-2012-Bayesian estimation of discrete entropy with mixtures of stick-breaking priors

17 0.059167687 290 nips-2012-Recovery of Sparse Probability Measures via Convex Programming

18 0.058608554 81 nips-2012-Context-Sensitive Decision Forests for Object Detection

19 0.057772789 304 nips-2012-Selecting Diverse Features via Spectral Regularization

20 0.057456002 32 nips-2012-Active Comparison of Prediction Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.203), (1, 0.023), (2, -0.019), (3, -0.055), (4, 0.046), (5, -0.007), (6, 0.006), (7, 0.104), (8, -0.019), (9, -0.032), (10, 0.001), (11, -0.002), (12, 0.03), (13, 0.014), (14, 0.004), (15, -0.007), (16, 0.013), (17, -0.049), (18, -0.029), (19, 0.065), (20, -0.016), (21, 0.065), (22, 0.015), (23, -0.006), (24, 0.004), (25, -0.087), (26, 0.064), (27, -0.062), (28, 0.115), (29, 0.008), (30, -0.07), (31, 0.049), (32, 0.081), (33, 0.04), (34, -0.123), (35, -0.018), (36, 0.064), (37, 0.113), (38, 0.028), (39, 0.058), (40, -0.064), (41, -0.101), (42, -0.015), (43, -0.04), (44, 0.126), (45, 0.028), (46, -0.006), (47, -0.005), (48, -0.011), (49, -0.046)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93425864 279 nips-2012-Projection Retrieval for Classification

Author: Madalina Fiterau, Artur Dubrawski

Abstract: In many applications, classification systems often require human intervention in the loop. In such cases the decision process must be transparent and comprehensible, simultaneously requiring minimal assumptions on the underlying data distributions. To tackle this problem, we formulate an axis-aligned subspace-finding task under the assumption that query specific information dictates the complementary use of the subspaces. We develop a regression-based approach called RECIP that efficiently solves this problem by finding projections that minimize a nonparametric conditional entropy estimator. Experiments show that the method is accurate in identifying the informative projections of the dataset, picking the correct views to classify query points, and facilitates visual evaluation by users. 1 Introduction and problem statement In the domain of predictive analytics, many applications which keep human users in the loop require the use of simple classification models. Often, it is required that a test-point be ‘explained’ (classified) using a simple low-dimensional projection of the original feature space. This is a Projection Retrieval for Classification problem (PRC). The interaction with the user proceeds as follows: the user provides the system a query point; the system searches for a projection in which the point can be accurately classified; the system displays the classification result as well as an illustration of how the classification decision was reached in the selected projection. Solving the PRC problem is relevant in many practical applications. For instance, consider a nuclear threat detection system installed at a border check point. Vehicles crossing the border are scanned with sensors so that a large array of measurements of radioactivity and secondary contextual information is being collected. These observations are fed into a classification system that determines whether the scanned vehicle may carry a threat. Given the potentially devastating consequences of a false negative, a border control agent is requested to validate the prediction and decide whether to submit the vehicle for a costly further inspection. With the positive classification rate of the system under strict bounds because of limitations in the control process, the risk of false negatives is increased. Despite its crucial role, human intervention should only be withheld for cases in which there are reasons to doubt the validity of classification. In order for a user to attest the validity of a decision, the user must have a good understanding of the classification process, which happens more readily when the classifier only uses the original dataset features rather than combinations of them, and when the discrimination models are low-dimensional. In this context, we aim to learn a set of classifiers in low-dimensional subspaces and a decision function which selects the subspace under which a test point is to be classified. Assume we are given a dataset {(x1 , y1 ) . . . (xn , yn )} ∈ X n × {0, 1}n and a class of discriminators H. The model will contain a set Π of subspaces of X ; Π ⊆ Π, where Π is the set of all axis-aligned subspaces of the original feature space, the power set of the features. To each projection πi ∈ Π corresponds one discriminator from a given hypothesis space hi ∈ H. It will also contain a selection function g : X → Π × H, which yields, for a query point x, the projection/discriminator pair with which this point will be classified. The notation π(x) refers to the projection of the point x onto the subspace 1 π while h(π(x)) represents the predicted label for x. Formally, we describe the model class as Md = {Π = {π : π ∈ Π, dim(π) ≤ d}, H = {hi : hi ∈ H, h : πi → Y, ∀i = 1 . . . |Π|}, g ∈ {f : X → {1 . . . |Π|}} . where dim(π) presents the dimensionality of the subspace determined by the projection π. Note that only projections up to size d will be considered, where d is a parameter specific to the application. The set H contains one discriminator from the hypothesis class H for each projection. Intuitively, the aim is to minimize the expected classification error over Md , however, a notable modification is that the projection and, implicitly, the discriminator, are chosen according to the data point that needs to be classified. Given a query x in the space X , g(x) will yield the subspace πg(x) onto which the query is projected and the discriminator hg(x) for it. Distinct test points can be handled using different combinations of subspaces and discriminators. We consider models that minimize 0/1 loss. Hence, the PRC problem can be stated as follows: M ∗ = arg min EX ,Y y = hg(x) (πg(x) (x)) M ∈Md There are limitations to the type of selection function g that can be learned. A simple example for which g can be recovered is a set of signal readings x for which, if one of the readings xi exceeds a threshold ti , the label can be predicted just based on xi . A more complex one is a dataset containing regulatory variables, that is, for xi in the interval [ak , bk ] the label only depends on (x1 . . . xnk ) k k datasets that fall into the latter category fulfill what we call the Subspace-Separability Assumption. This paper proposes an algorithm called RECIP that solves the PRC problem for a class of nonparametric classifiers. We evaluate the method on artificial data to show that indeed it correctly identifies the underlying structure for data satisfying the Subspace-Separability Assumption. We show some case studies to illustrate how RECIP offers insight into applications requiring human intervention. The use of dimensionality reduction techniques is a common preprocessing step in applications where the use of simplified classification models is preferable. Methods that learn linear combinations of features, such as Linear Discriminant Analysis, are not quite appropriate for the task considered here, since we prefer to natively rely on the dimensions available in the original feature space. Feature selection methods, such as e.g. lasso, are suitable for identifying sets of relevant features, but do not consider interactions between them. Our work better fits the areas of class dependent feature selection and context specific classification, highly connected to the concept of Transductive Learning [6]. Other context-sensitive methods are Lazy and Data-Dependent Decision Trees, [5] and [10] respectively. In Ting et al [14], the Feating submodel selection relies on simple attribute splits followed by fitting local predictors, though the algorithm itself is substantially different. Obozinski et al present a subspace selection method in the context of multitask learning [11]. Go et al propose a joint method for feature selection and subspace learning [7], however, their classification model is not particularly query specific. Alternatively, algorithms that transform complex or unintelligible models with user-friendly equivalents have been proposed [3, 2, 1, 8]. Algorithms specifically designed to yield understandable models are a precious few. Here we note a rule learning method described in [12], even though the resulting rules can make visualization difficult, while itemset mining [9] is not specifically designed for classification. Unlike those approaches, our method is designed to retrieve subsets of the feature space designed for use in a way that is complementary to the basic task at hand (classification) while providing query-specific information. 2 Recovering informative projections with RECIP To solve PRC, we need means by which to ascertain which projections are useful in terms of discriminating data from the two classes. Since our model allows the use of distinct projections depending on the query point, it is expected that each projection would potentially benefit different areas of the feature space. A(π) refers to the area of the feature space where the projection π is selected. A(π) = {x ∈ X : πg(x) = π} The objective becomes min E(X ×Y) y = hg(x) (πg(x) (x)) M ∈Md = p(A(π))E y = hg(x) (πg(x) (x))|x ∈ A(π) min M ∈Md 2 π∈Π . The expected classification error over A(π) is linked to the conditional entropy of Y |X. Fano’s inequality provides a lower bound on the error while Feder and Merhav [4] derive a tight upper bound on the minimal error probability in terms of the entropy. This means that conditional entropy characterizes the potential of a subset of the feature space to separate data, which is more generic than simply quantifying classification accuracy for a specific discriminator. In view of this connection between classification accuracy and entropy, we adapt the objective to: p(A(π))H(Y |π(X); X ∈ A(π)) min M ∈Md (1) π∈Π The method we propose optimizes an empirical analog of (1) which we develop below and for which we will need the following result. Proposition 2.1. Given a continuous variable X ∈ X and a binary variable Y , where X is sampled from the mixture model f (x) = p(y = 0)f0 (x) + p(y = 1)f1 (x) = p0 f0 (x) + p1 f1 (x) , then H(Y |X) = −p0 log p0 − p1 log p1 − DKL (f0 ||f ) − DKL (f1 ||f ) . Next, we will use the nonparametric estimator presented in [13] for Tsallis α-divergence. Given samples Ui ∼ U, with i = 1, n and Vj ∼ V with j = 1, m, the divergence is estimated as follows: ˆ Tα (U ||V ) = 1 1 1−α n n (n − 1)νk (Ui , U \ ui )d mνk (Ui , V )d i=1 1−α B(k, α) − 1 , (2) where d is the dimensionality of the variables U and V and νk (z, Z) represents the distance from z ˆ to its k th nearest neighbor of the set of points Z. For α ≈ 1 and n → ∞, Tα (u||v) ≈ DKL (u||v). 2.1 Local estimators of entropy We will now plug (2) in the formula obtained by Proposition 2.1 to estimate the quantity (1). We use the notation X0 to represent the n0 samples from X which have the labels Y equal to 0, and X1 to represent the n1 samples from X which have the labels set to 1. Also, Xy(x) represents the set of samples that have labels equal to the label of x and X¬y(x) the data that have labels opposite to the label of x. ˆ ˆ x ˆ x H(Y |X; X ∈ A) = −H(p0 ) − H(p1 ) − T (f0 ||f x ) − T (f1 ||f x ) + C α≈1 ˆ H(Y |X; X ∈ A) ∝ 1 n0 + 1 n1 ∝ 1 n0 + 1 n1 ∝ 1 n n0 (n0 − 1)νk (xi , X0 \ xi )d nνk (xi , X \ xi )d 1−α I[xi ∈ A] (n1 − 1)νk (xi , X1 \ xi )d nνk (xi , X \ xi )d 1−α I[xi ∈ A] (n0 − 1)νk (xi , X0 \ xi )d nνk (xi , X1 \ xi )d 1−α I[xi ∈ A] (n1 − 1)νk (xi , X1 \ xi )d nνk (xi , X0 \ xi )d 1−α I[xi ∈ A] i=1 n1 i=1 n0 i=1 n1 i=1 n I[xi ∈ A] i=1 (n − 1)νk (xi , Xy(xi ) \ xi )d nνk (xi , X¬y(xi ) \ xi )d 1−α The estimator for the entropy of the data that is classified with projection π is as follows: ˆ H(Y |π(X); X ∈ A(π)) ∝ 1 n n I[xi ∈ A(π)] i=1 (n − 1)νk (π(xi ), π(Xy(xi ) ) \ π(xi ))d nνk (π(xi ), π(X¬y(xi ) \ xi ))d 1−α (3) From 3 and using the fact that I[xi ∈ A(π)] = I[πg(xi ) = π] for which we use the notation I[g(xi ) → π], we estimate the objective as min M ∈Md π∈Π 1 n n I[g(xi ) → π] i=1 (n − 1)νk (π(xi ), π(Xy(xi ) ) \ π(xi ))d nνk (π(xi ), π(X¬y(xi ) \ xi ))d 3 1−α (4) Therefore, the contribution of each data point to the objective corresponds to a distance ratio on the projection π ∗ where the class of the point is obtained with the highest confidence (data is separable in the neighborhood of the point). We start by computing the distance-based metric of each point on each projection of size up to d - there are d∗ such projections. This procedure yields an extended set of features Z, which we name local entropy estimates: Zij = νk (πj (xi ), πj (Xy(xi ) ) \ πj (xi )) νk (πj (xi ), πj (X¬y(xi ) ) \ πj (xi )) d(1−α) α≈1 j ∈ {1 . . . d∗ } (5) For each training data point, we compute the best distance ratio amid all the projections, which is simply Ti = minj∈[d∗ ] Zij . The objective can be then further rewritten as a function of the entropy estimates: n I[g(xi ) → πj ]Zij min M ∈Md (6) i=1 πj ∈Π From the definition of T, it is also clear that n n I[g(xi ) → πj ]Zij min M ∈Md 2.2 ≥ i=1 πj ∈Π Ti . (7) i=1 Projection selection as a combinatorial problem Considering form (6) of the objective, and given that the estimates Zij are constants, depending only on the training set, the projection retrieval problem is reduced to finding g for all training points, which will implicitly select the projection set of the model. Naturally, one might assume the bestperforming classification model is the one containing all the axis-aligned subspaces. This model achieves the lower bound (7) for the training set. However, the larger the set of projections, the more values the function g takes, and thus the problem of selecting the correct projection becomes more difficult. It becomes apparent that the number of projections should be somehow restricted to allow intepretability. Assuming a hard threshold of at most t projections, the optimization (6) becomes an entry selection problem over matrix Z where one value must be picked from each row under a limitation on the number of columns that can be used. This problem cannot be solved exactly in polynomial time. Instead, it can be formulated as an optimization problem under 1 constraints. 2.3 Projection retrieval through regularized regression To transform the projection retrieval to a regression problem we consider T, the minimum obtainable value of the entropy estimator for each point, as the output which the method needs to predict. Each row i of the parameter matrix B represents the degrees to which the entropy estimates on each projection contribute to the entropy estimator of point xi . Thus, the sum over each row of B is 1, and the regularization penalty applies to the number of non-zero columns in B. d∗ min ||T − (Z B B)J|Π|,1 ||2 2 +λ [Bi = 0] (8) i=1 subject to |Bk | 1 = 1 k = 1, n where (Z B)ij = Zij + Bij and J is a matrix of ones. The problem with this optimization is that it is not convex. A typical walk-around of this issue is to use the convex relaxation for Bi = 0, that is 1 norm. This would transform the penalized term d∗ d∗ n to i=1 |Bi | 1 . However, i=1 |Bi | 1 = k=1 |Bk | 1 = n , so this penalty really has no effect. An alternative mechanism to encourage the non-zero elements in B to populate a small number of columns is to add a penalty term in the form of Bδ, where δ is a d∗ -size column vector with each element representing the penalty for a column in B. With no prior information about which subspaces are more informative, δ starts as an all-1 vector. An initial value for B is obtained through the optimization (8). Since our goal is to handle data using a small number of projections, δ is then updated such that its value is lower for the denser columns in B. This update resembles the reweighing in adaptive lasso. The matrix B itself is updated, and this 2-step process continues until convergence of δ. Once δ converges, the projections corresponding to the non-zero columns of B are added to the model. The procedure is shown in Algorithm 1. 4 Algorithm 1: RECIP δ = [1 . . . 1] repeat |P I| b = arg minB ||T − i=1 < Z, B > ||2 + λ|Bδ| 1 2 subject to |Bk | 1 = 1 k = 1...n δk = |Bi | 1 i = . . . d∗ (update the differential penalty) δ δ = 1 − |δ| 1 until δ converges return Π = {πi ; |Bi | 1 > 0 ∀i = 1 . . . d∗ } 2.4 Lasso for projection selection We will compare our algorithm to lasso regularization that ranks the projections in terms of their potential for data separability. We write this as an 1 -penalized optimization on the extended feature set Z, with the objective T : minβ |T − Zβ|2 + λ|β| 1 . The lasso penalty to the coefficient vector encourages sparsity. For a high enough λ, the sparsity pattern in β is indicative of the usefulness of the projections. The lasso on entropy contributions was not found to perform well as it is not query specific and will find one projection for all data. We improved it by allowing it to iteratively find projections - this robust version offers increased performance by reweighting the data thus focusing on different subsets of it. Although better than running lasso on entropy contributions, the robust lasso does not match RECIP’s performance as the projections are selected gradually rather than jointly. Running the standard lasso on the original design matrix yields a set of relevant variables and it is not immediately clear how the solution would translate to the desired class. 2.5 The selection function Once the projections are selected, the second stage of the algorithm deals with assigning the projection with which to classify a particular query point. An immediate way of selecting the correct projection starts by computing the local entropy estimator for each subspace with each class assignment. Then, we may select the label/subspace combination that minimizes the empirical entropy. (i∗ , θ∗ ) = arg min i,θ 3 νk (πi (x), πi (Xθ )) νk (πi (x), πi (X¬θ )) dim(πi )(1−α) i = 1 . . . d∗ , α≈1 (9) Experimental results In this section we illustrate the capability of RECIP to retrieve informative projections of data and their use in support of interpreting results of classification. First, we analyze how well RECIP can identify subspaces in synthetic data whose distribution obeys the subspace separability assumption (3.1). As a point of reference, we also present classification accuracy results (3.2) for both the synthetic data and a few real-world sets. This is to quantify the extent of the trade-off between fidelity of attainable classifiers and desired informativeness of the projections chosen by RECIP. We expect RECIP’s classification performance to be slightly, but not substantially worse when compared to relevant classification algorithms trained to maximize classification accuracy. Finally, we present a few examples (3.3) of informative projections recovered from real-world data and their utility in explaining to human users the decision processes applied to query points. A set of artificial data used in our experiments contains q batches of data points, each of them made classifiable with high accuracy using one of available 2-dimensional subspaces (x1 , x2 ) with k ∈ k k {1 . . . q}. The data in batch k also have the property that x1 > tk . This is done such that the group a k point belongs to can be detected from x1 , thus x1 is a regulatory variable. We control the amount of k k noise added to thusly created synthetic data by varying the proportion of noisy data points in each batch. The results below are for datasets with 7 features each, with number of batches q ranging between 1 and 7. We kept the number of features specifically low in order to prevent excessive variation between any two sets generated this way, and to enable computing meaningful estimates of the expectation and variance of performance, while enabling creation of complicated data in which synthetic patterns may substantially overlap (using 7 features and 7 2-dimensional patterns implies that dimensions of at least 4 of the patterns will overlap). We implemented our method 5 to be scalable to the size and dimensionality of data and although for brevity we do not include a discussion of this topic here, we have successfully run RECIP against data with 100 features. The parameter α is a value close to 1, because the Tsallis divergence converges to the KL divergence as alpha approaches 1. For the experiments on real-world data, d was set to n (all projections were considered). For the artificial data experiments, we reported results for d = 2 as they do not change significantly for d >= 2 because this data was synthesized to contain bidimensional informative projections. In general, if d is too low, the correct full set of projections will not be found, but it may be recovered partially. If d is chosen too high, there is a risk that a given selected projection p will contain irrelevant features compared to the true projection p0 . However, this situation only occurs if the noise introduced by these features in the estimators makes the entropy contributions on p and p0 statistically indistinguishable for a large subset of data. The users will choose d according to the desired/acceptable complexity of the resulting model. If the results are to be visually interpreted by a human, values of 2 or 3 are reasonable for d. 3.1 Recovering informative projections Table 1 shows how well RECIP recovers the q subspaces corresponding to the synthesized batches of data. We measure precision (proportion of the recovered projections that are known to be informative), and recall (proportion of known informative projections that are recovered by the algorithm). in Table 1, rows correspond to the number of distinct synthetic batches injected in data, q, and subsequent columns correspond to the increasing amounts of noise in data. We note that the observed precision is nearly perfect: the algorithm makes only 2 mistakes over the entire set of experiments, and those occur for highly noisy setups. The recall is nearly perfect as long as there is little overlap among the dimensions, that is when the injections do not interfere with each other. As the number of projections increases, the chances for overlap among the affected features also increase, which makes the data more confusing resulting on a gradual drop of recall until only about 3 or 4 of the 7 known to be informative subspaces can be recovered. We have also used lasso as described in 2.4 in an attempt to recover projections. This setup only manages to recover one of the informative subspaces, regardless of how the regularization parameter is tuned. 3.2 Classification accuracy Table 2 shows the classification accuracy of RECIP, obtained using synthetic data. As expected, the observed performance is initially high when there are few known informative projections in data and it decreases as noise and ambiguity of the injected patterns increase. Most types of ensemble learners would use a voting scheme to arrive at the final classification of a testing sample, rather than use a model selection scheme. For this reason, we have also compared predictive accuracy revealed by RECIP against a method based on majority voting among multiple candidate subspaces. Table 4 shows that the accuracy of this technique is lower than the accuracy of RECIP, regardless of whether the informative projections are recovered by the algorithm or assumed to be known a priori. This confirms the intuition that a selection-based approach can be more effective than voting for data which satisfies the subspace separability assumption. For reference, we have also classified the synthetic data using K-Nearest-Neighbors algorithm using all available features at once. The results of that experiment are shown in Table 5. Since RECIP uses neighbor information, K-NN is conceptually the closest among the popular alternatives. Compared to RECIP, K-NN performs worse when there are fewer synthetic patterns injected in data to form informative projections. It is because some features used then by K-NN are noisy. As more features become informative, the K-NN accuracy improves. This example shows the benefit of a selective approach to feature space and using a subset of the most explanatory projections to support not only explanatory analyses but also classification tasks in such circumstances. 3.3 RECIP case studies using real-world data Table 3 summarizes the RECIP and K-NN performance on UCI datasets. We also test the methods using Cell dataset containing a set of measurements such as the area and perimeter biological cells with separate labels marking cells subjected to treatment and control cells. In Vowel data, the nearest-neighbor approach works exceptionally well, even outperforming random forests (0.94 accuracy), which is an indication that all features are jointly relevant. For d lower than the number of features, RECIP picks projections of only one feature, but if there is no such limitation, RECIP picks the space of all the features as informative. 6 Table 1: Projection recovery for artificial datasets with 1 . . . 7 informative features and noise level 0 . . . 0.2 in terms of mean and variance of Precision and Recall. Mean/var obtained for each setting by repeating the experiment with datasets with different informative projections. PRECISION 1 2 3 4 5 6 7 0 1 1 1 1 1 1 1 0.02 1 1 1 1 1 1 1 Mean 0.05 1 1 1 1 1 1 1 1 2 3 4 5 6 7 0 1 1 1 0.9643 0.7714 0.6429 0.6327 0.02 1 1 1 0.9643 0.7429 0.6905 0.5918 Mean 0.05 1 1 0.9524 0.9643 0.8286 0.6905 0.5918 0 0 0 0 0 0 0 0 0.02 0 0 0 0 0 0 0 Variance 0.05 0 0 0 0 0 0 0 0.1 0.0306 0 0 0 0 0 0 0.2 0.0306 0 0 0 0 0 0 0 0 0 0 0.0077 0.0163 0.0113 0.0225 0.02 0 0 0 0.0077 0.0196 0.0113 0.02 Variance 0.05 0 0 0.0136 0.0077 0.0049 0.0272 0.0258 0.1 0 0 0.0136 0.0077 0.0082 0.0113 0.0233 0.2 0 0 0 0.0128 0.0278 0.0113 0.02 0.1 0.0008 0.0001 0.0028 0.0025 0.0036 0.0025 0.0042 0.2 0.0007 0.0001 0.0007 0.0032 0.0044 0.0027 0.0045 0.1 0.0001 0.0001 0.0007 0.0014 0.0019 0.0023 0.0021 0.2 0.0000 0.0001 0.0007 0.0020 0.0023 0.0021 0.0020 0.1 0.9286 1 1 1 1 1 1 0.2 0.9286 1 1 1 1 1 1 RECALL 0.1 1 1 0.9524 0.9643 0.8571 0.6905 0.5714 0.2 1 1 1 0.9286 0.7714 0.6905 0.551 Table 2: RECIP Classification Accuracy on Artificial Data 1 2 3 4 5 6 7 0 0.9751 0.9333 0.9053 0.8725 0.8113 0.7655 0.7534 1 2 3 4 5 6 7 0 0.9751 0.9333 0.9053 0.8820 0.8714 0.8566 0.8429 CLASSIFICATION ACCURACY Mean Variance 0.02 0.05 0.1 0.2 0 0.02 0.05 0.9731 0.9686 0.9543 0.9420 0.0000 0.0000 0.0000 0.9297 0.9227 0.9067 0.8946 0.0001 0.0001 0.0001 0.8967 0.8764 0.8640 0.8618 0.0004 0.0005 0.0016 0.8685 0.8589 0.8454 0.8187 0.0020 0.0020 0.0019 0.8009 0.8105 0.8105 0.7782 0.0042 0.0044 0.0033 0.7739 0.7669 0.7632 0.7511 0.0025 0.0021 0.0026 0.7399 0.7347 0.7278 0.7205 0.0034 0.0040 0.0042 CLASSIFICATION ACCURACY - KNOWN PROJECTIONS Mean Variance 0.02 0.05 0.1 0.2 0 0.02 0.05 0.9731 0.9686 0.9637 0.9514 0.0000 0.0000 0.0000 0.9297 0.9227 0.9067 0.8946 0.0001 0.0001 0.0001 0.8967 0.8914 0.8777 0.8618 0.0004 0.0005 0.0005 0.8781 0.8657 0.8541 0.8331 0.0011 0.0011 0.0014 0.8641 0.8523 0.8429 0.8209 0.0015 0.0015 0.0018 0.8497 0.8377 0.8285 0.8074 0.0014 0.0015 0.0016 0.8371 0.8256 0.8122 0.7988 0.0015 0.0018 0.0018 Table 3: Accuracy of K-NN and RECIP Dataset KNN RECIP In Spam data, the two most informative projections are Breast Cancer Wis 0.8415 0.8275 ’Capital Length Total’ (CLT)/’Capital Length Longest’ Breast Tissue 1.0000 1.0000 (CLL) and CLT/’Frequency of word your’ (FWY). FigCell 0.7072 0.7640 ure 1 shows these two projections, with the dots repreMiniBOONE* 0.7896 0.7396 senting training points. The red dots represent points laSpam 0.7680 0.7680 Vowel 0.9839 0.9839 beled as spam while the blue ones are non-spam. The circles are query points that have been assigned to be classified with the projection in which they are plotted. The green circles are correctly classified points, while the magenta circles - far fewer - are the incorrectly classified ones. Not only does the importance of text in capital letters make sense for a spam filtering dataset, but the points that select those projections are almost flawlessly classified. Additionally, assuming the user would need to attest the validity of classification for the first plot, he/she would have no trouble seeing that the circled data points are located in a region predominantly populated with examples of spam, so any non-spam entry appears suspicious. Both of the magenta-colored cases fall into this category, and they can be therefore flagged for further investigation. 7 Informative Projection for the Spam dataset 2000 1500 1000 500 0 Informative Projection for the Spam dataset 12 Frequency of word ‘your‘ Capital Run Length Longest 2500 10 8 6 4 2 0 500 1000 1500 2000 2500 Capital Run Length Total 3000 0 3500 0 2000 4000 6000 8000 10000 12000 14000 Capital Run Length Total 16000 Figure 1: Spam Dataset Selected Subspaces Table 4: Classification accuracy using RECIP-learned projections - or known projections, in the lower section - within a voting model instead of a selection model 1 2 3 4 5 6 7 1 2 3 4 5 6 7 CLASSIFICATION ACCURACY - VOTING ENSEMBLE Mean Variance 0 0.02 0.05 0.1 0.2 0 0.02 0.05 0.1 0.9751 0.9731 0.9686 0.9317 0.9226 0.0000 0.0000 0.0000 0.0070 0.7360 0.7354 0.7331 0.7303 0.7257 0.0002 0.0002 0.0001 0.0002 0.7290 0.7266 0.7163 0.7166 0.7212 0.0002 0.0002 0.0008 0.0006 0.6934 0.6931 0.6932 0.6904 0.6867 0.0008 0.0008 0.0008 0.0008 0.6715 0.6602 0.6745 0.6688 0.6581 0.0013 0.0014 0.0013 0.0014 0.6410 0.6541 0.6460 0.6529 0.6512 0.0008 0.0007 0.0010 0.0006 0.6392 0.6342 0.6268 0.6251 0.6294 0.0009 0.0011 0.0012 0.0012 CLASSIFICATION ACCURACY - VOTING ENSEMBLE, KNOWN PROJECTIONS Mean Variance 0 0.02 0.05 0.1 0.2 0 0.02 0.05 0.1 0.9751 0.9731 0.9686 0.9637 0.9514 0.0000 0.0000 0.0000 0.0001 0.7360 0.7354 0.7331 0.7303 0.7257 0.0002 0.0002 0.0001 0.0002 0.7409 0.7385 0.7390 0.7353 0.7325 0.0010 0.0012 0.0010 0.0011 0.7110 0.7109 0.7083 0.7067 0.7035 0.0041 0.0041 0.0042 0.0042 0.7077 0.7070 0.7050 0.7034 0.7008 0.0015 0.0015 0.0015 0.0016 0.6816 0.6807 0.6801 0.6790 0.6747 0.0008 0.0008 0.0008 0.0008 0.6787 0.6783 0.6772 0.6767 0.6722 0.0008 0.0009 0.0009 0.0008 0.2 0.0053 0.0001 0.0002 0.0009 0.0013 0.0005 0.0012 0.2 0.0000 0.0001 0.0010 0.0043 0.0016 0.0009 0.0008 Table 5: Classification accuracy for artificial data with the K-Nearest Neighbors method CLASSIFICATION ACCURACY - KNN 1 2 3 4 5 6 7 4 0 0.7909 0.7940 0.7964 0.7990 0.8038 0.8043 0.8054 0.02 0.7843 0.7911 0.7939 0.7972 0.8024 0.8032 0.8044 Mean 0.05 0.7747 0.7861 0.7901 0.7942 0.8002 0.8015 0.8028 0.1 0.7652 0.7790 0.7854 0.7904 0.7970 0.7987 0.8004 0.2 0.7412 0.7655 0.7756 0.7828 0.7905 0.7930 0.7955 0 0.0002 0.0001 0.0000 0.0001 0.0001 0.0001 0.0001 0.02 0.0002 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 Variance 0.05 0.0002 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.1 0.0002 0.0001 0.0000 0.0001 0.0001 0.0001 0.0001 0.2 0.0002 0.0001 0.0000 0.0001 0.0001 0.0001 0.0001 Conclusion This paper considers the problem of Projection Recovery for Classification. It is relevant in applications where the decision process must be easy to understand in order to enable human interpretation of the results. We have developed a principled, regression-based algorithm designed to recover small sets of low-dimensional subspaces that support interpretability. It optimizes the selection using individual data-point-specific entropy estimators. In this context, the proposed algorithm follows the idea of transductive learning, and the role of the resulting projections bears resemblance to high confidence regions known in conformal prediction models. Empirical results obtained using simulated and real-world data show the effectiveness of our method in finding informative projections that enable accurate classification while maintaining transparency of the underlying decision process. Acknowledgments This material is based upon work supported by the NSF, under Grant No. IIS-0911032. 8 References [1] Mark W. Craven and Jude W. Shavlik. Extracting Tree-Structured Representations of Trained Networks. In David S. Touretzky, Michael C. Mozer, and Michael E. Hasselmo, editors, Advances in Neural Information Processing Systems, volume 8, pages 24–30. The MIT Press, 1996. [2] Pedro Domingos. Knowledge discovery via multiple models. Intelligent Data Analysis, 2:187–202, 1998. [3] Eulanda M. Dos Santos, Robert Sabourin, and Patrick Maupin. A dynamic overproduce-and-choose strategy for the selection of classifier ensembles. Pattern Recogn., 41:2993–3009, October 2008. [4] M. Feder and N. Merhav. Relations between entropy and error probability. Information Theory, IEEE Transactions on, 40(1):259–266, January 1994. [5] Jerome H. Friedman, Ron Kohavi, and Yeogirl Yun. Lazy decision trees, 1996. [6] A. Gammerman, V. Vovk, and V. Vapnik. Learning by transduction. In In Uncertainty in Artificial Intelligence, pages 148–155. Morgan Kaufmann, 1998. [7] Quanquan Gu, Zhenhui Li, and Jiawei Han. Joint feature selection and subspace learning, 2011. [8] Bing Liu, Minqing Hu, and Wynne Hsu. Intuitive representation of decision trees using general rules and exceptions. In Proceedings of Seventeeth National Conference on Artificial Intellgience (AAAI-2000), July 30 - Aug 3, 2000, pages 615–620, 2000. [9] Michael Mampaey, Nikolaj Tatti, and Jilles Vreeken. Tell me what i need to know: succinctly summarizing data with itemsets. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’11, pages 573–581, New York, NY, USA, 2011. ACM. [10] Mario Marchand and Marina Sokolova. Learning with decision lists of data-dependent features. JOURNAL OF MACHINE LEARNING REASEARCH, 6, 2005. [11] Guillaume Obozinski, Ben Taskar, and Michael I. Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20(2):231–252, April 2010. [12] Michael J. Pazzani, Subramani Mani, and W. Rodman Shankle. Beyond concise and colorful: Learning intelligible rules, 1997. [13] B. Poczos and J. Schneider. On the estimation of alpha-divergences. AISTATS, 2011. [14] Kai Ting, Jonathan Wells, Swee Tan, Shyh Teng, and Geoffrey Webb. Feature-subspace aggregating: ensembles for stable andunstable learners. Machine Learning, 82:375–397, 2011. 10.1007/s10994-0105224-5. 9

2 0.64441633 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing

Author: Ashish Kapoor, Raajay Viswanathan, Prateek Jain

Abstract: In this paper, we present a Bayesian framework for multilabel classiďŹ cation using compressed sensing. The key idea in compressed sensing for multilabel classiďŹ cation is to ďŹ rst project the label vector to a lower dimensional space using a random transformation and then learn regression functions over these projections. Our approach considers both of these components in a single probabilistic model, thereby jointly optimizing over compression as well as learning tasks. We then derive an efďŹ cient variational inference scheme that provides joint posterior distribution over all the unobserved labels. The two key beneďŹ ts of the model are that a) it can naturally handle datasets that have missing labels and b) it can also measure uncertainty in prediction. The uncertainty estimate provided by the model allows for active learning paradigms where an oracle provides information about labels that promise to be maximally informative for the prediction task. Our experiments show signiďŹ cant boost over prior methods in terms of prediction performance over benchmark datasets, both in the fully labeled and the missing labels case. Finally, we also highlight various useful active learning scenarios that are enabled by the probabilistic model. 1

3 0.6358791 171 nips-2012-Latent Coincidence Analysis: A Hidden Variable Model for Distance Metric Learning

Author: Matthew Der, Lawrence K. Saul

Abstract: We describe a latent variable model for supervised dimensionality reduction and distance metric learning. The model discovers linear projections of high dimensional data that shrink the distance between similarly labeled inputs and expand the distance between differently labeled ones. The model’s continuous latent variables locate pairs of examples in a latent space of lower dimensionality. The model differs significantly from classical factor analysis in that the posterior distribution over these latent variables is not always multivariate Gaussian. Nevertheless we show that inference is completely tractable and derive an Expectation-Maximization (EM) algorithm for parameter estimation. We also compare the model to other approaches in distance metric learning. The model’s main advantage is its simplicity: at each iteration of the EM algorithm, the distance metric is re-estimated by solving an unconstrained least-squares problem. Experiments show that these simple updates are highly effective. 1

4 0.6281023 200 nips-2012-Local Supervised Learning through Space Partitioning

Author: Joseph Wang, Venkatesh Saligrama

Abstract: We develop a novel approach for supervised learning based on adaptively partitioning the feature space into different regions and learning local region-specific classifiers. We formulate an empirical risk minimization problem that incorporates both partitioning and classification in to a single global objective. We show that space partitioning can be equivalently reformulated as a supervised learning problem and consequently any discriminative learning method can be utilized in conjunction with our approach. Nevertheless, we consider locally linear schemes by learning linear partitions and linear region classifiers. Locally linear schemes can not only approximate complex decision boundaries and ensure low training error but also provide tight control on over-fitting and generalization error. We train locally linear classifiers by using LDA, logistic regression and perceptrons, and so our scheme is scalable to large data sizes and high-dimensions. We present experimental results demonstrating improved performance over state of the art classification techniques on benchmark datasets. We also show improved robustness to label noise.

5 0.62451017 97 nips-2012-Diffusion Decision Making for Adaptive k-Nearest Neighbor Classification

Author: Yung-kyun Noh, Frank Park, Daniel D. Lee

Abstract: This paper sheds light on some fundamental connections of the diffusion decision making model of neuroscience and cognitive psychology with k-nearest neighbor classification. We show that conventional k-nearest neighbor classification can be viewed as a special problem of the diffusion decision model in the asymptotic situation. By applying the optimal strategy associated with the diffusion decision model, an adaptive rule is developed for determining appropriate values of k in knearest neighbor classification. Making use of the sequential probability ratio test (SPRT) and Bayesian analysis, we propose five different criteria for adaptively acquiring nearest neighbors. Experiments with both synthetic and real datasets demonstrate the effectiveness of our classification criteria. 1

6 0.6131379 48 nips-2012-Augmented-SVM: Automatic space partitioning for combining multiple non-linear dynamics

7 0.60103905 223 nips-2012-Multi-criteria Anomaly Detection using Pareto Depth Analysis

8 0.57604122 119 nips-2012-Entropy Estimations Using Correlated Symmetric Stable Random Projections

9 0.54084331 105 nips-2012-Dynamic Pruning of Factor Graphs for Maximum Marginal Prediction

10 0.53473961 189 nips-2012-Learning from the Wisdom of Crowds by Minimax Entropy

11 0.53361094 175 nips-2012-Learning High-Density Regions for a Generalized Kolmogorov-Smirnov Test in High-Dimensional Data

12 0.5305649 32 nips-2012-Active Comparison of Prediction Models

13 0.5302645 130 nips-2012-Feature-aware Label Space Dimension Reduction for Multi-label Classification

14 0.52539712 17 nips-2012-A Scalable CUR Matrix Decomposition Algorithm: Lower Time Complexity and Tighter Bound

15 0.52401406 50 nips-2012-Bandit Algorithms boost Brain Computer Interfaces for motor-task selection of a brain-controlled button

16 0.52142853 361 nips-2012-Volume Regularization for Binary Classification

17 0.51980358 14 nips-2012-A P300 BCI for the Masses: Prior Information Enables Instant Unsupervised Spelling

18 0.51851773 242 nips-2012-Non-linear Metric Learning

19 0.51524013 230 nips-2012-Multiple Choice Learning: Learning to Produce Multiple Structured Outputs

20 0.51422399 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.042), (4, 0.175), (17, 0.022), (21, 0.035), (38, 0.099), (42, 0.045), (54, 0.056), (55, 0.037), (74, 0.066), (76, 0.16), (80, 0.132), (92, 0.042)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.87682569 140 nips-2012-Fusion with Diffusion for Robust Visual Tracking

Author: Yu Zhou, Xiang Bai, Wenyu Liu, Longin J. Latecki

Abstract: A weighted graph is used as an underlying structure of many algorithms like semisupervised learning and spectral clustering. If the edge weights are determined by a single similarity measure, then it hard if not impossible to capture all relevant aspects of similarity when using a single similarity measure. In particular, in the case of visual object matching it is beneficial to integrate different similarity measures that focus on different visual representations. In this paper, a novel approach to integrate multiple similarity measures is proposed. First pairs of similarity measures are combined with a diffusion process on their tensor product graph (TPG). Hence the diffused similarity of each pair of objects becomes a function of joint diffusion of the two original similarities, which in turn depends on the neighborhood structure of the TPG. We call this process Fusion with Diffusion (FD). However, a higher order graph like the TPG usually means significant increase in time complexity. This is not the case in the proposed approach. A key feature of our approach is that the time complexity of the diffusion on the TPG is the same as the diffusion process on each of the original graphs. Moreover, it is not necessary to explicitly construct the TPG in our framework. Finally all diffused pairs of similarity measures are combined as a weighted sum. We demonstrate the advantages of the proposed approach on the task of visual tracking, where different aspects of the appearance similarity between the target object in frame t − 1 and target object candidates in frame t are integrated. The obtained method is tested on several challenge video sequences and the experimental results show that it outperforms state-of-the-art tracking methods. 1

same-paper 2 0.86231738 279 nips-2012-Projection Retrieval for Classification

Author: Madalina Fiterau, Artur Dubrawski

Abstract: In many applications, classification systems often require human intervention in the loop. In such cases the decision process must be transparent and comprehensible, simultaneously requiring minimal assumptions on the underlying data distributions. To tackle this problem, we formulate an axis-aligned subspace-finding task under the assumption that query specific information dictates the complementary use of the subspaces. We develop a regression-based approach called RECIP that efficiently solves this problem by finding projections that minimize a nonparametric conditional entropy estimator. Experiments show that the method is accurate in identifying the informative projections of the dataset, picking the correct views to classify query points, and facilitates visual evaluation by users. 1 Introduction and problem statement In the domain of predictive analytics, many applications which keep human users in the loop require the use of simple classification models. Often, it is required that a test-point be ‘explained’ (classified) using a simple low-dimensional projection of the original feature space. This is a Projection Retrieval for Classification problem (PRC). The interaction with the user proceeds as follows: the user provides the system a query point; the system searches for a projection in which the point can be accurately classified; the system displays the classification result as well as an illustration of how the classification decision was reached in the selected projection. Solving the PRC problem is relevant in many practical applications. For instance, consider a nuclear threat detection system installed at a border check point. Vehicles crossing the border are scanned with sensors so that a large array of measurements of radioactivity and secondary contextual information is being collected. These observations are fed into a classification system that determines whether the scanned vehicle may carry a threat. Given the potentially devastating consequences of a false negative, a border control agent is requested to validate the prediction and decide whether to submit the vehicle for a costly further inspection. With the positive classification rate of the system under strict bounds because of limitations in the control process, the risk of false negatives is increased. Despite its crucial role, human intervention should only be withheld for cases in which there are reasons to doubt the validity of classification. In order for a user to attest the validity of a decision, the user must have a good understanding of the classification process, which happens more readily when the classifier only uses the original dataset features rather than combinations of them, and when the discrimination models are low-dimensional. In this context, we aim to learn a set of classifiers in low-dimensional subspaces and a decision function which selects the subspace under which a test point is to be classified. Assume we are given a dataset {(x1 , y1 ) . . . (xn , yn )} ∈ X n × {0, 1}n and a class of discriminators H. The model will contain a set Π of subspaces of X ; Π ⊆ Π, where Π is the set of all axis-aligned subspaces of the original feature space, the power set of the features. To each projection πi ∈ Π corresponds one discriminator from a given hypothesis space hi ∈ H. It will also contain a selection function g : X → Π × H, which yields, for a query point x, the projection/discriminator pair with which this point will be classified. The notation π(x) refers to the projection of the point x onto the subspace 1 π while h(π(x)) represents the predicted label for x. Formally, we describe the model class as Md = {Π = {π : π ∈ Π, dim(π) ≤ d}, H = {hi : hi ∈ H, h : πi → Y, ∀i = 1 . . . |Π|}, g ∈ {f : X → {1 . . . |Π|}} . where dim(π) presents the dimensionality of the subspace determined by the projection π. Note that only projections up to size d will be considered, where d is a parameter specific to the application. The set H contains one discriminator from the hypothesis class H for each projection. Intuitively, the aim is to minimize the expected classification error over Md , however, a notable modification is that the projection and, implicitly, the discriminator, are chosen according to the data point that needs to be classified. Given a query x in the space X , g(x) will yield the subspace πg(x) onto which the query is projected and the discriminator hg(x) for it. Distinct test points can be handled using different combinations of subspaces and discriminators. We consider models that minimize 0/1 loss. Hence, the PRC problem can be stated as follows: M ∗ = arg min EX ,Y y = hg(x) (πg(x) (x)) M ∈Md There are limitations to the type of selection function g that can be learned. A simple example for which g can be recovered is a set of signal readings x for which, if one of the readings xi exceeds a threshold ti , the label can be predicted just based on xi . A more complex one is a dataset containing regulatory variables, that is, for xi in the interval [ak , bk ] the label only depends on (x1 . . . xnk ) k k datasets that fall into the latter category fulfill what we call the Subspace-Separability Assumption. This paper proposes an algorithm called RECIP that solves the PRC problem for a class of nonparametric classifiers. We evaluate the method on artificial data to show that indeed it correctly identifies the underlying structure for data satisfying the Subspace-Separability Assumption. We show some case studies to illustrate how RECIP offers insight into applications requiring human intervention. The use of dimensionality reduction techniques is a common preprocessing step in applications where the use of simplified classification models is preferable. Methods that learn linear combinations of features, such as Linear Discriminant Analysis, are not quite appropriate for the task considered here, since we prefer to natively rely on the dimensions available in the original feature space. Feature selection methods, such as e.g. lasso, are suitable for identifying sets of relevant features, but do not consider interactions between them. Our work better fits the areas of class dependent feature selection and context specific classification, highly connected to the concept of Transductive Learning [6]. Other context-sensitive methods are Lazy and Data-Dependent Decision Trees, [5] and [10] respectively. In Ting et al [14], the Feating submodel selection relies on simple attribute splits followed by fitting local predictors, though the algorithm itself is substantially different. Obozinski et al present a subspace selection method in the context of multitask learning [11]. Go et al propose a joint method for feature selection and subspace learning [7], however, their classification model is not particularly query specific. Alternatively, algorithms that transform complex or unintelligible models with user-friendly equivalents have been proposed [3, 2, 1, 8]. Algorithms specifically designed to yield understandable models are a precious few. Here we note a rule learning method described in [12], even though the resulting rules can make visualization difficult, while itemset mining [9] is not specifically designed for classification. Unlike those approaches, our method is designed to retrieve subsets of the feature space designed for use in a way that is complementary to the basic task at hand (classification) while providing query-specific information. 2 Recovering informative projections with RECIP To solve PRC, we need means by which to ascertain which projections are useful in terms of discriminating data from the two classes. Since our model allows the use of distinct projections depending on the query point, it is expected that each projection would potentially benefit different areas of the feature space. A(π) refers to the area of the feature space where the projection π is selected. A(π) = {x ∈ X : πg(x) = π} The objective becomes min E(X ×Y) y = hg(x) (πg(x) (x)) M ∈Md = p(A(π))E y = hg(x) (πg(x) (x))|x ∈ A(π) min M ∈Md 2 π∈Π . The expected classification error over A(π) is linked to the conditional entropy of Y |X. Fano’s inequality provides a lower bound on the error while Feder and Merhav [4] derive a tight upper bound on the minimal error probability in terms of the entropy. This means that conditional entropy characterizes the potential of a subset of the feature space to separate data, which is more generic than simply quantifying classification accuracy for a specific discriminator. In view of this connection between classification accuracy and entropy, we adapt the objective to: p(A(π))H(Y |π(X); X ∈ A(π)) min M ∈Md (1) π∈Π The method we propose optimizes an empirical analog of (1) which we develop below and for which we will need the following result. Proposition 2.1. Given a continuous variable X ∈ X and a binary variable Y , where X is sampled from the mixture model f (x) = p(y = 0)f0 (x) + p(y = 1)f1 (x) = p0 f0 (x) + p1 f1 (x) , then H(Y |X) = −p0 log p0 − p1 log p1 − DKL (f0 ||f ) − DKL (f1 ||f ) . Next, we will use the nonparametric estimator presented in [13] for Tsallis α-divergence. Given samples Ui ∼ U, with i = 1, n and Vj ∼ V with j = 1, m, the divergence is estimated as follows: ˆ Tα (U ||V ) = 1 1 1−α n n (n − 1)νk (Ui , U \ ui )d mνk (Ui , V )d i=1 1−α B(k, α) − 1 , (2) where d is the dimensionality of the variables U and V and νk (z, Z) represents the distance from z ˆ to its k th nearest neighbor of the set of points Z. For α ≈ 1 and n → ∞, Tα (u||v) ≈ DKL (u||v). 2.1 Local estimators of entropy We will now plug (2) in the formula obtained by Proposition 2.1 to estimate the quantity (1). We use the notation X0 to represent the n0 samples from X which have the labels Y equal to 0, and X1 to represent the n1 samples from X which have the labels set to 1. Also, Xy(x) represents the set of samples that have labels equal to the label of x and X¬y(x) the data that have labels opposite to the label of x. ˆ ˆ x ˆ x H(Y |X; X ∈ A) = −H(p0 ) − H(p1 ) − T (f0 ||f x ) − T (f1 ||f x ) + C α≈1 ˆ H(Y |X; X ∈ A) ∝ 1 n0 + 1 n1 ∝ 1 n0 + 1 n1 ∝ 1 n n0 (n0 − 1)νk (xi , X0 \ xi )d nνk (xi , X \ xi )d 1−α I[xi ∈ A] (n1 − 1)νk (xi , X1 \ xi )d nνk (xi , X \ xi )d 1−α I[xi ∈ A] (n0 − 1)νk (xi , X0 \ xi )d nνk (xi , X1 \ xi )d 1−α I[xi ∈ A] (n1 − 1)νk (xi , X1 \ xi )d nνk (xi , X0 \ xi )d 1−α I[xi ∈ A] i=1 n1 i=1 n0 i=1 n1 i=1 n I[xi ∈ A] i=1 (n − 1)νk (xi , Xy(xi ) \ xi )d nνk (xi , X¬y(xi ) \ xi )d 1−α The estimator for the entropy of the data that is classified with projection π is as follows: ˆ H(Y |π(X); X ∈ A(π)) ∝ 1 n n I[xi ∈ A(π)] i=1 (n − 1)νk (π(xi ), π(Xy(xi ) ) \ π(xi ))d nνk (π(xi ), π(X¬y(xi ) \ xi ))d 1−α (3) From 3 and using the fact that I[xi ∈ A(π)] = I[πg(xi ) = π] for which we use the notation I[g(xi ) → π], we estimate the objective as min M ∈Md π∈Π 1 n n I[g(xi ) → π] i=1 (n − 1)νk (π(xi ), π(Xy(xi ) ) \ π(xi ))d nνk (π(xi ), π(X¬y(xi ) \ xi ))d 3 1−α (4) Therefore, the contribution of each data point to the objective corresponds to a distance ratio on the projection π ∗ where the class of the point is obtained with the highest confidence (data is separable in the neighborhood of the point). We start by computing the distance-based metric of each point on each projection of size up to d - there are d∗ such projections. This procedure yields an extended set of features Z, which we name local entropy estimates: Zij = νk (πj (xi ), πj (Xy(xi ) ) \ πj (xi )) νk (πj (xi ), πj (X¬y(xi ) ) \ πj (xi )) d(1−α) α≈1 j ∈ {1 . . . d∗ } (5) For each training data point, we compute the best distance ratio amid all the projections, which is simply Ti = minj∈[d∗ ] Zij . The objective can be then further rewritten as a function of the entropy estimates: n I[g(xi ) → πj ]Zij min M ∈Md (6) i=1 πj ∈Π From the definition of T, it is also clear that n n I[g(xi ) → πj ]Zij min M ∈Md 2.2 ≥ i=1 πj ∈Π Ti . (7) i=1 Projection selection as a combinatorial problem Considering form (6) of the objective, and given that the estimates Zij are constants, depending only on the training set, the projection retrieval problem is reduced to finding g for all training points, which will implicitly select the projection set of the model. Naturally, one might assume the bestperforming classification model is the one containing all the axis-aligned subspaces. This model achieves the lower bound (7) for the training set. However, the larger the set of projections, the more values the function g takes, and thus the problem of selecting the correct projection becomes more difficult. It becomes apparent that the number of projections should be somehow restricted to allow intepretability. Assuming a hard threshold of at most t projections, the optimization (6) becomes an entry selection problem over matrix Z where one value must be picked from each row under a limitation on the number of columns that can be used. This problem cannot be solved exactly in polynomial time. Instead, it can be formulated as an optimization problem under 1 constraints. 2.3 Projection retrieval through regularized regression To transform the projection retrieval to a regression problem we consider T, the minimum obtainable value of the entropy estimator for each point, as the output which the method needs to predict. Each row i of the parameter matrix B represents the degrees to which the entropy estimates on each projection contribute to the entropy estimator of point xi . Thus, the sum over each row of B is 1, and the regularization penalty applies to the number of non-zero columns in B. d∗ min ||T − (Z B B)J|Π|,1 ||2 2 +λ [Bi = 0] (8) i=1 subject to |Bk | 1 = 1 k = 1, n where (Z B)ij = Zij + Bij and J is a matrix of ones. The problem with this optimization is that it is not convex. A typical walk-around of this issue is to use the convex relaxation for Bi = 0, that is 1 norm. This would transform the penalized term d∗ d∗ n to i=1 |Bi | 1 . However, i=1 |Bi | 1 = k=1 |Bk | 1 = n , so this penalty really has no effect. An alternative mechanism to encourage the non-zero elements in B to populate a small number of columns is to add a penalty term in the form of Bδ, where δ is a d∗ -size column vector with each element representing the penalty for a column in B. With no prior information about which subspaces are more informative, δ starts as an all-1 vector. An initial value for B is obtained through the optimization (8). Since our goal is to handle data using a small number of projections, δ is then updated such that its value is lower for the denser columns in B. This update resembles the reweighing in adaptive lasso. The matrix B itself is updated, and this 2-step process continues until convergence of δ. Once δ converges, the projections corresponding to the non-zero columns of B are added to the model. The procedure is shown in Algorithm 1. 4 Algorithm 1: RECIP δ = [1 . . . 1] repeat |P I| b = arg minB ||T − i=1 < Z, B > ||2 + λ|Bδ| 1 2 subject to |Bk | 1 = 1 k = 1...n δk = |Bi | 1 i = . . . d∗ (update the differential penalty) δ δ = 1 − |δ| 1 until δ converges return Π = {πi ; |Bi | 1 > 0 ∀i = 1 . . . d∗ } 2.4 Lasso for projection selection We will compare our algorithm to lasso regularization that ranks the projections in terms of their potential for data separability. We write this as an 1 -penalized optimization on the extended feature set Z, with the objective T : minβ |T − Zβ|2 + λ|β| 1 . The lasso penalty to the coefficient vector encourages sparsity. For a high enough λ, the sparsity pattern in β is indicative of the usefulness of the projections. The lasso on entropy contributions was not found to perform well as it is not query specific and will find one projection for all data. We improved it by allowing it to iteratively find projections - this robust version offers increased performance by reweighting the data thus focusing on different subsets of it. Although better than running lasso on entropy contributions, the robust lasso does not match RECIP’s performance as the projections are selected gradually rather than jointly. Running the standard lasso on the original design matrix yields a set of relevant variables and it is not immediately clear how the solution would translate to the desired class. 2.5 The selection function Once the projections are selected, the second stage of the algorithm deals with assigning the projection with which to classify a particular query point. An immediate way of selecting the correct projection starts by computing the local entropy estimator for each subspace with each class assignment. Then, we may select the label/subspace combination that minimizes the empirical entropy. (i∗ , θ∗ ) = arg min i,θ 3 νk (πi (x), πi (Xθ )) νk (πi (x), πi (X¬θ )) dim(πi )(1−α) i = 1 . . . d∗ , α≈1 (9) Experimental results In this section we illustrate the capability of RECIP to retrieve informative projections of data and their use in support of interpreting results of classification. First, we analyze how well RECIP can identify subspaces in synthetic data whose distribution obeys the subspace separability assumption (3.1). As a point of reference, we also present classification accuracy results (3.2) for both the synthetic data and a few real-world sets. This is to quantify the extent of the trade-off between fidelity of attainable classifiers and desired informativeness of the projections chosen by RECIP. We expect RECIP’s classification performance to be slightly, but not substantially worse when compared to relevant classification algorithms trained to maximize classification accuracy. Finally, we present a few examples (3.3) of informative projections recovered from real-world data and their utility in explaining to human users the decision processes applied to query points. A set of artificial data used in our experiments contains q batches of data points, each of them made classifiable with high accuracy using one of available 2-dimensional subspaces (x1 , x2 ) with k ∈ k k {1 . . . q}. The data in batch k also have the property that x1 > tk . This is done such that the group a k point belongs to can be detected from x1 , thus x1 is a regulatory variable. We control the amount of k k noise added to thusly created synthetic data by varying the proportion of noisy data points in each batch. The results below are for datasets with 7 features each, with number of batches q ranging between 1 and 7. We kept the number of features specifically low in order to prevent excessive variation between any two sets generated this way, and to enable computing meaningful estimates of the expectation and variance of performance, while enabling creation of complicated data in which synthetic patterns may substantially overlap (using 7 features and 7 2-dimensional patterns implies that dimensions of at least 4 of the patterns will overlap). We implemented our method 5 to be scalable to the size and dimensionality of data and although for brevity we do not include a discussion of this topic here, we have successfully run RECIP against data with 100 features. The parameter α is a value close to 1, because the Tsallis divergence converges to the KL divergence as alpha approaches 1. For the experiments on real-world data, d was set to n (all projections were considered). For the artificial data experiments, we reported results for d = 2 as they do not change significantly for d >= 2 because this data was synthesized to contain bidimensional informative projections. In general, if d is too low, the correct full set of projections will not be found, but it may be recovered partially. If d is chosen too high, there is a risk that a given selected projection p will contain irrelevant features compared to the true projection p0 . However, this situation only occurs if the noise introduced by these features in the estimators makes the entropy contributions on p and p0 statistically indistinguishable for a large subset of data. The users will choose d according to the desired/acceptable complexity of the resulting model. If the results are to be visually interpreted by a human, values of 2 or 3 are reasonable for d. 3.1 Recovering informative projections Table 1 shows how well RECIP recovers the q subspaces corresponding to the synthesized batches of data. We measure precision (proportion of the recovered projections that are known to be informative), and recall (proportion of known informative projections that are recovered by the algorithm). in Table 1, rows correspond to the number of distinct synthetic batches injected in data, q, and subsequent columns correspond to the increasing amounts of noise in data. We note that the observed precision is nearly perfect: the algorithm makes only 2 mistakes over the entire set of experiments, and those occur for highly noisy setups. The recall is nearly perfect as long as there is little overlap among the dimensions, that is when the injections do not interfere with each other. As the number of projections increases, the chances for overlap among the affected features also increase, which makes the data more confusing resulting on a gradual drop of recall until only about 3 or 4 of the 7 known to be informative subspaces can be recovered. We have also used lasso as described in 2.4 in an attempt to recover projections. This setup only manages to recover one of the informative subspaces, regardless of how the regularization parameter is tuned. 3.2 Classification accuracy Table 2 shows the classification accuracy of RECIP, obtained using synthetic data. As expected, the observed performance is initially high when there are few known informative projections in data and it decreases as noise and ambiguity of the injected patterns increase. Most types of ensemble learners would use a voting scheme to arrive at the final classification of a testing sample, rather than use a model selection scheme. For this reason, we have also compared predictive accuracy revealed by RECIP against a method based on majority voting among multiple candidate subspaces. Table 4 shows that the accuracy of this technique is lower than the accuracy of RECIP, regardless of whether the informative projections are recovered by the algorithm or assumed to be known a priori. This confirms the intuition that a selection-based approach can be more effective than voting for data which satisfies the subspace separability assumption. For reference, we have also classified the synthetic data using K-Nearest-Neighbors algorithm using all available features at once. The results of that experiment are shown in Table 5. Since RECIP uses neighbor information, K-NN is conceptually the closest among the popular alternatives. Compared to RECIP, K-NN performs worse when there are fewer synthetic patterns injected in data to form informative projections. It is because some features used then by K-NN are noisy. As more features become informative, the K-NN accuracy improves. This example shows the benefit of a selective approach to feature space and using a subset of the most explanatory projections to support not only explanatory analyses but also classification tasks in such circumstances. 3.3 RECIP case studies using real-world data Table 3 summarizes the RECIP and K-NN performance on UCI datasets. We also test the methods using Cell dataset containing a set of measurements such as the area and perimeter biological cells with separate labels marking cells subjected to treatment and control cells. In Vowel data, the nearest-neighbor approach works exceptionally well, even outperforming random forests (0.94 accuracy), which is an indication that all features are jointly relevant. For d lower than the number of features, RECIP picks projections of only one feature, but if there is no such limitation, RECIP picks the space of all the features as informative. 6 Table 1: Projection recovery for artificial datasets with 1 . . . 7 informative features and noise level 0 . . . 0.2 in terms of mean and variance of Precision and Recall. Mean/var obtained for each setting by repeating the experiment with datasets with different informative projections. PRECISION 1 2 3 4 5 6 7 0 1 1 1 1 1 1 1 0.02 1 1 1 1 1 1 1 Mean 0.05 1 1 1 1 1 1 1 1 2 3 4 5 6 7 0 1 1 1 0.9643 0.7714 0.6429 0.6327 0.02 1 1 1 0.9643 0.7429 0.6905 0.5918 Mean 0.05 1 1 0.9524 0.9643 0.8286 0.6905 0.5918 0 0 0 0 0 0 0 0 0.02 0 0 0 0 0 0 0 Variance 0.05 0 0 0 0 0 0 0 0.1 0.0306 0 0 0 0 0 0 0.2 0.0306 0 0 0 0 0 0 0 0 0 0 0.0077 0.0163 0.0113 0.0225 0.02 0 0 0 0.0077 0.0196 0.0113 0.02 Variance 0.05 0 0 0.0136 0.0077 0.0049 0.0272 0.0258 0.1 0 0 0.0136 0.0077 0.0082 0.0113 0.0233 0.2 0 0 0 0.0128 0.0278 0.0113 0.02 0.1 0.0008 0.0001 0.0028 0.0025 0.0036 0.0025 0.0042 0.2 0.0007 0.0001 0.0007 0.0032 0.0044 0.0027 0.0045 0.1 0.0001 0.0001 0.0007 0.0014 0.0019 0.0023 0.0021 0.2 0.0000 0.0001 0.0007 0.0020 0.0023 0.0021 0.0020 0.1 0.9286 1 1 1 1 1 1 0.2 0.9286 1 1 1 1 1 1 RECALL 0.1 1 1 0.9524 0.9643 0.8571 0.6905 0.5714 0.2 1 1 1 0.9286 0.7714 0.6905 0.551 Table 2: RECIP Classification Accuracy on Artificial Data 1 2 3 4 5 6 7 0 0.9751 0.9333 0.9053 0.8725 0.8113 0.7655 0.7534 1 2 3 4 5 6 7 0 0.9751 0.9333 0.9053 0.8820 0.8714 0.8566 0.8429 CLASSIFICATION ACCURACY Mean Variance 0.02 0.05 0.1 0.2 0 0.02 0.05 0.9731 0.9686 0.9543 0.9420 0.0000 0.0000 0.0000 0.9297 0.9227 0.9067 0.8946 0.0001 0.0001 0.0001 0.8967 0.8764 0.8640 0.8618 0.0004 0.0005 0.0016 0.8685 0.8589 0.8454 0.8187 0.0020 0.0020 0.0019 0.8009 0.8105 0.8105 0.7782 0.0042 0.0044 0.0033 0.7739 0.7669 0.7632 0.7511 0.0025 0.0021 0.0026 0.7399 0.7347 0.7278 0.7205 0.0034 0.0040 0.0042 CLASSIFICATION ACCURACY - KNOWN PROJECTIONS Mean Variance 0.02 0.05 0.1 0.2 0 0.02 0.05 0.9731 0.9686 0.9637 0.9514 0.0000 0.0000 0.0000 0.9297 0.9227 0.9067 0.8946 0.0001 0.0001 0.0001 0.8967 0.8914 0.8777 0.8618 0.0004 0.0005 0.0005 0.8781 0.8657 0.8541 0.8331 0.0011 0.0011 0.0014 0.8641 0.8523 0.8429 0.8209 0.0015 0.0015 0.0018 0.8497 0.8377 0.8285 0.8074 0.0014 0.0015 0.0016 0.8371 0.8256 0.8122 0.7988 0.0015 0.0018 0.0018 Table 3: Accuracy of K-NN and RECIP Dataset KNN RECIP In Spam data, the two most informative projections are Breast Cancer Wis 0.8415 0.8275 ’Capital Length Total’ (CLT)/’Capital Length Longest’ Breast Tissue 1.0000 1.0000 (CLL) and CLT/’Frequency of word your’ (FWY). FigCell 0.7072 0.7640 ure 1 shows these two projections, with the dots repreMiniBOONE* 0.7896 0.7396 senting training points. The red dots represent points laSpam 0.7680 0.7680 Vowel 0.9839 0.9839 beled as spam while the blue ones are non-spam. The circles are query points that have been assigned to be classified with the projection in which they are plotted. The green circles are correctly classified points, while the magenta circles - far fewer - are the incorrectly classified ones. Not only does the importance of text in capital letters make sense for a spam filtering dataset, but the points that select those projections are almost flawlessly classified. Additionally, assuming the user would need to attest the validity of classification for the first plot, he/she would have no trouble seeing that the circled data points are located in a region predominantly populated with examples of spam, so any non-spam entry appears suspicious. Both of the magenta-colored cases fall into this category, and they can be therefore flagged for further investigation. 7 Informative Projection for the Spam dataset 2000 1500 1000 500 0 Informative Projection for the Spam dataset 12 Frequency of word ‘your‘ Capital Run Length Longest 2500 10 8 6 4 2 0 500 1000 1500 2000 2500 Capital Run Length Total 3000 0 3500 0 2000 4000 6000 8000 10000 12000 14000 Capital Run Length Total 16000 Figure 1: Spam Dataset Selected Subspaces Table 4: Classification accuracy using RECIP-learned projections - or known projections, in the lower section - within a voting model instead of a selection model 1 2 3 4 5 6 7 1 2 3 4 5 6 7 CLASSIFICATION ACCURACY - VOTING ENSEMBLE Mean Variance 0 0.02 0.05 0.1 0.2 0 0.02 0.05 0.1 0.9751 0.9731 0.9686 0.9317 0.9226 0.0000 0.0000 0.0000 0.0070 0.7360 0.7354 0.7331 0.7303 0.7257 0.0002 0.0002 0.0001 0.0002 0.7290 0.7266 0.7163 0.7166 0.7212 0.0002 0.0002 0.0008 0.0006 0.6934 0.6931 0.6932 0.6904 0.6867 0.0008 0.0008 0.0008 0.0008 0.6715 0.6602 0.6745 0.6688 0.6581 0.0013 0.0014 0.0013 0.0014 0.6410 0.6541 0.6460 0.6529 0.6512 0.0008 0.0007 0.0010 0.0006 0.6392 0.6342 0.6268 0.6251 0.6294 0.0009 0.0011 0.0012 0.0012 CLASSIFICATION ACCURACY - VOTING ENSEMBLE, KNOWN PROJECTIONS Mean Variance 0 0.02 0.05 0.1 0.2 0 0.02 0.05 0.1 0.9751 0.9731 0.9686 0.9637 0.9514 0.0000 0.0000 0.0000 0.0001 0.7360 0.7354 0.7331 0.7303 0.7257 0.0002 0.0002 0.0001 0.0002 0.7409 0.7385 0.7390 0.7353 0.7325 0.0010 0.0012 0.0010 0.0011 0.7110 0.7109 0.7083 0.7067 0.7035 0.0041 0.0041 0.0042 0.0042 0.7077 0.7070 0.7050 0.7034 0.7008 0.0015 0.0015 0.0015 0.0016 0.6816 0.6807 0.6801 0.6790 0.6747 0.0008 0.0008 0.0008 0.0008 0.6787 0.6783 0.6772 0.6767 0.6722 0.0008 0.0009 0.0009 0.0008 0.2 0.0053 0.0001 0.0002 0.0009 0.0013 0.0005 0.0012 0.2 0.0000 0.0001 0.0010 0.0043 0.0016 0.0009 0.0008 Table 5: Classification accuracy for artificial data with the K-Nearest Neighbors method CLASSIFICATION ACCURACY - KNN 1 2 3 4 5 6 7 4 0 0.7909 0.7940 0.7964 0.7990 0.8038 0.8043 0.8054 0.02 0.7843 0.7911 0.7939 0.7972 0.8024 0.8032 0.8044 Mean 0.05 0.7747 0.7861 0.7901 0.7942 0.8002 0.8015 0.8028 0.1 0.7652 0.7790 0.7854 0.7904 0.7970 0.7987 0.8004 0.2 0.7412 0.7655 0.7756 0.7828 0.7905 0.7930 0.7955 0 0.0002 0.0001 0.0000 0.0001 0.0001 0.0001 0.0001 0.02 0.0002 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 Variance 0.05 0.0002 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.1 0.0002 0.0001 0.0000 0.0001 0.0001 0.0001 0.0001 0.2 0.0002 0.0001 0.0000 0.0001 0.0001 0.0001 0.0001 Conclusion This paper considers the problem of Projection Recovery for Classification. It is relevant in applications where the decision process must be easy to understand in order to enable human interpretation of the results. We have developed a principled, regression-based algorithm designed to recover small sets of low-dimensional subspaces that support interpretability. It optimizes the selection using individual data-point-specific entropy estimators. In this context, the proposed algorithm follows the idea of transductive learning, and the role of the resulting projections bears resemblance to high confidence regions known in conformal prediction models. Empirical results obtained using simulated and real-world data show the effectiveness of our method in finding informative projections that enable accurate classification while maintaining transparency of the underlying decision process. Acknowledgments This material is based upon work supported by the NSF, under Grant No. IIS-0911032. 8 References [1] Mark W. Craven and Jude W. Shavlik. Extracting Tree-Structured Representations of Trained Networks. In David S. Touretzky, Michael C. Mozer, and Michael E. Hasselmo, editors, Advances in Neural Information Processing Systems, volume 8, pages 24–30. The MIT Press, 1996. [2] Pedro Domingos. Knowledge discovery via multiple models. Intelligent Data Analysis, 2:187–202, 1998. [3] Eulanda M. Dos Santos, Robert Sabourin, and Patrick Maupin. A dynamic overproduce-and-choose strategy for the selection of classifier ensembles. Pattern Recogn., 41:2993–3009, October 2008. [4] M. Feder and N. Merhav. Relations between entropy and error probability. Information Theory, IEEE Transactions on, 40(1):259–266, January 1994. [5] Jerome H. Friedman, Ron Kohavi, and Yeogirl Yun. Lazy decision trees, 1996. [6] A. Gammerman, V. Vovk, and V. Vapnik. Learning by transduction. In In Uncertainty in Artificial Intelligence, pages 148–155. Morgan Kaufmann, 1998. [7] Quanquan Gu, Zhenhui Li, and Jiawei Han. Joint feature selection and subspace learning, 2011. [8] Bing Liu, Minqing Hu, and Wynne Hsu. Intuitive representation of decision trees using general rules and exceptions. In Proceedings of Seventeeth National Conference on Artificial Intellgience (AAAI-2000), July 30 - Aug 3, 2000, pages 615–620, 2000. [9] Michael Mampaey, Nikolaj Tatti, and Jilles Vreeken. Tell me what i need to know: succinctly summarizing data with itemsets. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’11, pages 573–581, New York, NY, USA, 2011. ACM. [10] Mario Marchand and Marina Sokolova. Learning with decision lists of data-dependent features. JOURNAL OF MACHINE LEARNING REASEARCH, 6, 2005. [11] Guillaume Obozinski, Ben Taskar, and Michael I. Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20(2):231–252, April 2010. [12] Michael J. Pazzani, Subramani Mani, and W. Rodman Shankle. Beyond concise and colorful: Learning intelligible rules, 1997. [13] B. Poczos and J. Schneider. On the estimation of alpha-divergences. AISTATS, 2011. [14] Kai Ting, Jonathan Wells, Swee Tan, Shyh Teng, and Geoffrey Webb. Feature-subspace aggregating: ensembles for stable andunstable learners. Machine Learning, 82:375–397, 2011. 10.1007/s10994-0105224-5. 9

3 0.80227804 197 nips-2012-Learning with Recursive Perceptual Representations

Author: Oriol Vinyals, Yangqing Jia, Li Deng, Trevor Darrell

Abstract: Linear Support Vector Machines (SVMs) have become very popular in vision as part of state-of-the-art object recognition and other classification tasks but require high dimensional feature spaces for good performance. Deep learning methods can find more compact representations but current methods employ multilayer perceptrons that require solving a difficult, non-convex optimization problem. We propose a deep non-linear classifier whose layers are SVMs and which incorporates random projection as its core stacking element. Our method learns layers of linear SVMs recursively transforming the original data manifold through a random projection of the weak prediction computed from each layer. Our method scales as linear SVMs, does not rely on any kernel computations or nonconvex optimization, and exhibits better generalization ability than kernel-based SVMs. This is especially true when the number of training samples is smaller than the dimensionality of data, a common scenario in many real-world applications. The use of random projections is key to our method, as we show in the experiments section, in which we observe a consistent improvement over previous –often more complicated– methods on several vision and speech benchmarks. 1

4 0.79676652 168 nips-2012-Kernel Latent SVM for Visual Recognition

Author: Weilong Yang, Yang Wang, Arash Vahdat, Greg Mori

Abstract: Latent SVMs (LSVMs) are a class of powerful tools that have been successfully applied to many applications in computer vision. However, a limitation of LSVMs is that they rely on linear models. For many computer vision tasks, linear models are suboptimal and nonlinear models learned with kernels typically perform much better. Therefore it is desirable to develop the kernel version of LSVM. In this paper, we propose kernel latent SVM (KLSVM) – a new learning framework that combines latent SVMs and kernel methods. We develop an iterative training algorithm to learn the model parameters. We demonstrate the effectiveness of KLSVM using three different applications in visual recognition. Our KLSVM formulation is very general and can be applied to solve a wide range of applications in computer vision and machine learning. 1

5 0.79290295 188 nips-2012-Learning from Distributions via Support Measure Machines

Author: Krikamol Muandet, Kenji Fukumizu, Francesco Dinuzzo, Bernhard Schölkopf

Abstract: This paper presents a kernel-based discriminative learning framework on probability measures. Rather than relying on large collections of vectorial training examples, our framework learns using a collection of probability distributions that have been constructed to meaningfully represent training data. By representing these probability distributions as mean embeddings in the reproducing kernel Hilbert space (RKHS), we are able to apply many standard kernel-based learning techniques in straightforward fashion. To accomplish this, we construct a generalization of the support vector machine (SVM) called a support measure machine (SMM). Our analyses of SMMs provides several insights into their relationship to traditional SVMs. Based on such insights, we propose a flexible SVM (FlexSVM) that places different kernel functions on each training example. Experimental results on both synthetic and real-world data demonstrate the effectiveness of our proposed framework. 1

6 0.79152179 287 nips-2012-Random function priors for exchangeable arrays with applications to graphs and relational data

7 0.79116309 319 nips-2012-Sparse Prediction with the $k$-Support Norm

8 0.7902019 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines

9 0.78866035 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

10 0.78830057 209 nips-2012-Max-Margin Structured Output Regression for Spatio-Temporal Action Localization

11 0.7879554 200 nips-2012-Local Supervised Learning through Space Partitioning

12 0.78690892 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

13 0.78645325 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

14 0.78573883 171 nips-2012-Latent Coincidence Analysis: A Hidden Variable Model for Distance Metric Learning

15 0.78473085 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning

16 0.78396654 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

17 0.78304839 103 nips-2012-Distributed Probabilistic Learning for Camera Networks with Missing Data

18 0.78197902 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

19 0.78170502 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

20 0.78160244 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems