nips nips2012 nips2012-228 knowledge-graph by maker-knowledge-mining

228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing


Source: pdf

Author: Ashish Kapoor, Raajay Viswanathan, Prateek Jain

Abstract: In this paper, we present a Bayesian framework for multilabel classification using compressed sensing. The key idea in compressed sensing for multilabel classification is to first project the label vector to a lower dimensional space using a random transformation and then learn regression functions over these projections. Our approach considers both of these components in a single probabilistic model, thereby jointly optimizing over compression as well as learning tasks. We then derive an efficient variational inference scheme that provides joint posterior distribution over all the unobserved labels. The two key benefits of the model are that a) it can naturally handle datasets that have missing labels and b) it can also measure uncertainty in prediction. The uncertainty estimate provided by the model allows for active learning paradigms where an oracle provides information about labels that promise to be maximally informative for the prediction task. Our experiments show significant boost over prior methods in terms of prediction performance over benchmark datasets, both in the fully labeled and the missing labels case. Finally, we also highlight various useful active learning scenarios that are enabled by the probabilistic model. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract In this paper, we present a Bayesian framework for multilabel classification using compressed sensing. [sent-2, score-0.701]

2 The key idea in compressed sensing for multilabel classification is to first project the label vector to a lower dimensional space using a random transformation and then learn regression functions over these projections. [sent-3, score-1.175]

3 The two key benefits of the model are that a) it can naturally handle datasets that have missing labels and b) it can also measure uncertainty in prediction. [sent-6, score-0.655]

4 The uncertainty estimate provided by the model allows for active learning paradigms where an oracle provides information about labels that promise to be maximally informative for the prediction task. [sent-7, score-0.929]

5 Our experiments show significant boost over prior methods in terms of prediction performance over benchmark datasets, both in the fully labeled and the missing labels case. [sent-8, score-0.617]

6 Most of the existing multilabel methods learn a decision function or weight vector per label and then combine the decision functions in a certain manner to predict labels for an unseen point [3, 4, 2, 5, 6]. [sent-14, score-0.941]

7 To alleviate this problem, [1] proposed a compressed sensing (CS) based method that exploits the fact that usually the label vectors are very sparse, i. [sent-17, score-0.719]

8 For test points, the learnt regression function is applied in the reduced space and then standard recovery algorithms from CS literature are used to obtain sparse predicted labels [8, 9]. [sent-22, score-0.48]
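
To make the ML-CS pipeline above concrete, here is a minimal sketch (not the authors' implementation): a Gaussian random matrix compresses the labels, ridge regressors are fit to the compressed targets, and orthogonal matching pursuit stands in for CoSaMP at recovery time. All function names and hyperparameters are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge, OrthogonalMatchingPursuit

def mlcs_train(X, Y, K, seed=0):
    """Compress the L-dimensional labels to K dimensions, then regress X -> z."""
    rng = np.random.RandomState(seed)
    L = Y.shape[1]
    Phi = rng.randn(K, L) / np.sqrt(K)   # random projection matrix, K << L
    Z = Y @ Phi.T                        # compressed label vectors
    reg = Ridge(alpha=1.0).fit(X, Z)     # one linear regressor per compressed dimension
    return Phi, reg

def mlcs_predict(x, Phi, reg, sparsity=3):
    """Predict the compressed signal, then recover a sparse label vector."""
    z_hat = reg.predict(x.reshape(1, -1)).ravel()
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=sparsity)
    omp.fit(Phi, z_hat)                  # OMP used here in place of CoSaMP
    return omp.coef_                     # rank entries to pick the top-k labels
```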

9 Another limitation of this method is that the scheme does not directly apply when labels are missing, a common aspect in real-world web applications. [sent-24, score-0.386]

10 Finally, the method does not lend itself naturally to uncertainty analysis that can be used for active learning of labels. [sent-25, score-0.478]

11 In particular, we propose a joint probabilistic model that combines compressed sensing [10, 11] with a Bayesian learning model on the projected space. [sent-27, score-0.561]

12 First, the model naturally handles missing labels as the missing labels are modeled as random variables that can be marginalized out. [sent-33, score-1.142]

13 Thus, we can infer labels not only for the test point but also for all the missing labels in the training set. [sent-35, score-0.94]

14 Finally, the inferred posterior over labels provides an estimate of uncertainty, making the proposed method amenable to active learning. [sent-36, score-0.86]

15 Active learning is an important learning paradigm that has received a lot of attention due to the availability of large unlabeled data but paucity of labels over these data sets. [sent-37, score-0.434]

16 In the traditional active learning setting (for binary/multiclass classification), at each round the learner actively seeks labels for a selected unlabeled point and updates its models using the provided label. [sent-38, score-0.955]

17 Our proposed model naturally handles the active learning task as the variational inference procedure provides the required posteriors which can guide information acquisition. [sent-42, score-0.621]

18 Further, besides the traditional active learning scenario, where all the labels are revealed for a selected data point, the model leads to extensions of information foraging to more practical and novel scenarios. [sent-43, score-0.745]

19 For example, we introduce active diagnosis, where the algorithm only asks about labels for the test case that potentially can help with prediction over the rest of the unobserved tags. [sent-44, score-0.851]

20 Similarly, we can extend to a generalized active learning setting, where the method seeks answers to questions of the type: "does label 'A' exist in data point x?" [sent-45, score-0.607]

21 We also show that the proposed framework is robust to missing labels and actually outperforms 1-vs-all SVM with about 85-95% missing labels while using K = . [sent-51, score-1.026]

22 Finally, we demonstrate that our active learning strategies select significantly more informative labels/points than the random selection strategy. [sent-53, score-0.416]

23 2 Approach Assume that we are given a set of training data points X = {xi} with labels Y = {yi}, where each yi = [yi^1, . [sent-54, score-0.542]

24 . . , yi^L] ∈ {0, 1}^L is a multilabel binary vector of size L. [sent-56, score-0.487]

25 Further, we also seek an active learning procedure that would request as few labels as possible from an oracle to maximize classification rate over the test set. [sent-60, score-0.767]

26 If we treat each label independently then standard machine learning procedures could be used to train individual classifiers and this can even be extended to do active learning. [sent-61, score-0.53]

27 However, such procedures can be fairly expensive when the number of labels is huge. [sent-62, score-0.433]

28 Recent techniques in multilabel classification alleviate the problem of large output space [1, 18], but cannot handle the missing data cases. [sent-64, score-0.671]

29 We present a probabilistic graphical model that builds upon ideas of compressed sensing and utilizes statistical relations across the output space for prediction and active information acquisition. [sent-66, score-0.971]

30 The key idea in compressed sensing is to consider a linear transformation of the L dimensional label vector y to a K dimensional space z, where K ≪ L, via a random matrix Φ. [sent-67, score-0.761]

31 The efficiency of the classification system is improved by fitting regression functions to the compressed vectors z instead of the true label space. [sent-68, score-0.612]

32 The proposed framework considers Gaussian process priors over the compressed label space and has the capability to propagate uncertainties to the output label space by considering the constraints imposed by the random projection matrix. [sent-69, score-0.842]

33 1 A Model for Multilabel Classification with Bayesian Compressed Sensing We propose a model that simultaneously handles two key aspects: first is the task of compressing and recovering the label vector yi to and from the lower dimensional representation zi. [sent-72, score-0.605]

34 For every data point xi, the output labels yi influence the compressed latent vector zi via the random projection matrix Φ. [sent-76, score-1.096]

35 These compressed signals in turn also get influenced by the d-dimensional feature vector xi via the K different linear regression functions represented as a d × K matrix W. [sent-77, score-0.415]

36 One of the critical assumptions in compressed sensing is that the output label vector yi is sparse. [sent-81, score-1.05]

37 Intuitively, marginalizing the precision in the product of Gamma priors and the Gaussian likelihoods leads to a potential function on the labels that is a Student-t distribution and has a significant probability mass around zero. [sent-86, score-0.578]

38 Thus, the labels yi naturally tend to zero unless they need to explain observed data. [sent-87, score-0.542]
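
For reference, the standard form of this marginalization, assuming a Gam(α | a, b) prior and a zero-mean Gaussian potential N(y | 0, α⁻¹) (the paper's exact hyperparameters may differ):

```latex
\int_0^{\infty} \mathcal{N}\!\left(y \mid 0, \alpha^{-1}\right)\,
\mathrm{Gam}\!\left(\alpha \mid a, b\right)\, d\alpha
= \frac{b^{a}\,\Gamma\!\left(a + \tfrac{1}{2}\right)}{\sqrt{2\pi}\,\Gamma(a)}
  \left(b + \tfrac{y^{2}}{2}\right)^{-\left(a + \frac{1}{2}\right)}
```

which is a Student-t density in y with 2a degrees of freedom, sharply peaked at zero for small a and b.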

39 Finally, the conjugate-exponential form between the precisions αi and the output labels yi leads to an efficient inference procedure that we describe later in the paper. [sent-88, score-0.673]

40 Note that, for labeled training data xi ∈ XL all the labels yi are observed, while only some or none of the labels are observed for the partially labeled and test cases respectively. [sent-89, score-1.017]

41 Figure 1: A Bayesian model for multilabel classification via compressed sensing. [sent-161, score-0.701]

42 The input data is xi with multiple labels yi , which are fully observed for the case of fully labeled training data set L, partially observed for training data with missing labels P, or completely unobserved as in test data U. [sent-162, score-1.352]

43 The latent variables zi indicate the compressed label space, and αi with independent Gamma priors enforce the sparsity. [sent-163, score-0.706]

44 The parameters σ² and χ² denote noise parameters and determine how tight the relation is between the labels in the output space, the compressed space and the regression coefficients. [sent-169, score-0.812]
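
A rough sketch of the joint (unnormalized) log-potential implied by Figure 1, under an assumed Gaussian/Gamma parameterization; the variable names, hyperparameters a0 and b0, and the exact factorization are illustrative rather than the authors' exact model:

```python
import numpy as np

def log_joint(Y, Z, W, alpha, X, Phi, sigma2, chi2, a0=1e-2, b0=1e-2):
    """Unnormalized log-potential of the model sketched in Figure 1 (assumed form)."""
    lp = 0.0
    # compressed signal should agree with the projected labels: z_i ~ N(Phi y_i, sigma2 I)
    lp += -0.5 / sigma2 * np.sum((Z - Y @ Phi.T) ** 2)
    # ... and with the regression view: z_i ~ N(W^T x_i, chi2 I)
    lp += -0.5 / chi2 * np.sum((Z - X @ W) ** 2)
    # sparsity on labels: y_ij ~ N(0, 1/alpha_ij) with alpha_ij ~ Gamma(a0, b0)
    lp += 0.5 * np.sum(np.log(alpha)) - 0.5 * np.sum(alpha * Y ** 2)
    lp += np.sum((a0 - 1.0) * np.log(alpha) - b0 * alpha)
    return lp
```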

45 In summary, our model provides a powerful framework for modeling multilabel classification using compressive sensing. [sent-171, score-0.486]

46 The model promises statistical efficiency by jointly considering compressive sensing and regression within a single model. [sent-172, score-0.483]

47 Moreover, as we will see in the next section, this model allows efficient numerical procedures for inferring unobserved labels by resolving the constraints imposed by the potential functions and the observed data. [sent-173, score-0.548]

48 The model naturally handles the case of missing data (incomplete labels) by automatically marginalizing the unobserved data as a part of the inference mechanism. [sent-174, score-0.495]

49 Finally, the probabilistic nature of the approach provides us with valid probabilistic quantities that can be used to perform active selection of the unlabeled points. [sent-175, score-0.472]

50 2 Inference First, consider the simpler scenario where the training data set only consists of fully labeled instances XL with labels YL . [sent-177, score-0.504]

51 Along with these sparsity terms, the projection of the label space into the compressed space precludes usage of exact inference via a junction tree algorithm. [sent-181, score-0.577]

52 By doing the update on q(yi), the algorithm attempts to explain the compressed signal zi using the sparsity imposed by the precisions αi. [sent-191, score-0.617]

53 Similarly, by updating q(zi) and q(W), the inference procedure reasons about a compressed representation that is most efficient in terms of reconstruction. [sent-192, score-0.406]

54 By iterating between these updates the model consolidates information from the two key components, compressed sensing and regression, that constitute the system and is more effective than doing these tasks in isolation. [sent-193, score-0.537]
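
The coordinate updates described above can be organized as the following skeleton. All helper functions here are hypothetical placeholders standing in for the paper's derived update equations; only the loop structure is meant to be informative.

```python
def variational_inference(X, Y_obs, Phi, n_iters=50):
    """Mean-field loop over q(alpha), q(y), q(z), q(W); the update_* helpers are placeholders."""
    q_y, q_z, q_w, q_alpha = initialize(X, Y_obs, Phi)       # hypothetical initializer
    for _ in range(n_iters):
        q_alpha = update_q_alpha(q_y)                         # Gamma posteriors from E[y^2]
        q_y = update_q_y(q_z, q_alpha, Phi, observed=Y_obs)   # explain z under sparsity;
                                                              # observed labels stay clamped
        q_z = update_q_z(q_y, q_w, X, Phi)                    # reconcile Phi*E[y] with X*E[W]
        q_w = update_q_w(q_z, X)                              # Bayesian linear regression on E[z]
    return q_y, q_z, q_w, q_alpha
```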

55 Handling Missing Labels in Training Data: The proposed model and the inference procedure naturally handle the case of missing labels in the training set via the variational inference. [sent-199, score-0.759]

56 Let us consider a data point xp with a set of partially observed labels yp^o and unobserved labels yp^u. [sent-200, score-0.481]

57 Intuitively, the compressed signal zp now considers compatibility with the unobserved labels, while taking into account the observed labels, and in doing so effectively facilitates message passing between all the latent random variables. [sent-203, score-0.606]

58 Intuitively, the key idea is that the information about the training set is fully captured in the regression parameters; thus, the labels for the test point can be recovered simply by iteratively updating only q(y∗), q(z∗) and q(α∗). [sent-207, score-0.524]
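
Continuing the sketch above, test-time prediction would then reuse q(W) from training and iterate only the per-point factors; again, the helpers are hypothetical placeholders:

```python
def predict_labels(x_star, q_w, Phi, n_iters=20):
    """Infer the label posterior for a test point with q(W) held fixed (placeholder helpers)."""
    q_y, q_z, q_alpha = initialize_point(x_star, Phi)         # hypothetical initializer
    for _ in range(n_iters):
        q_alpha = update_q_alpha(q_y)
        q_y = update_q_y(q_z, q_alpha, Phi, observed=None)    # no labels observed at test time
        q_z = update_q_z(q_y, q_w, x_star, Phi)
    return q_y   # posterior over the unobserved labels of x_star
```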

59 3 Active Learning The main aim in active learning is to seek bits of information that would promise to enhance the discriminatory power of the framework the most. [sent-209, score-0.397]

60 When employed in a traditional classification setting, the active learning procedure boils down to the task of seeking the label for one of the unlabeled examples that promises to be most informative, and then updating the classification model by incorporating it into the existing training set. [sent-210, score-0.886]

61 However, multilabel classification enables richer forms of active information acquisition, which we describe below: • Traditional Active Learning: This is similar to the active learning scenario in traditional classification tasks. [sent-211, score-1.188]

62 In particular, the goal is to select an unlabeled sample for which all the labels will be revealed. [sent-212, score-0.434]

63 • Active Diagnosis: Given a test data point, at every iteration the active acquisition procedure seeks a label for the test point that is maximally informative for that point and promises to improve the prediction accuracy over the rest of the unknown labels. [sent-213, score-0.739]

64 Active diagnosis should be able to leverage the statistical dependency amongst the output label space, in order to ask for labels that are maximally informative. [sent-217, score-0.69]

65 A direct generalization of the above two paradigms is a setting in which the active learning procedure selects a label for one point in the training set. [sent-218, score-0.594]

66 Specifically, the key difference between this scenario and the traditional active learning is that only one label is chosen to be revealed for the selected data point instead of the entire set of labels. [sent-219, score-0.611]

67 Non-probabilistic classification schemes, such as SVMs, can handle traditional active learning by first establishing the confidence in the estimate of each label by using the distance from the classification boundary (margin) and then selecting the point that is closest to the margin. [sent-220, score-0.694]

68 However, it is fairly non-trivial to extend those approaches to tackle the active diagnosis and generalized information acquisition. [sent-221, score-0.453]

69 On the other hand the proposed Bayesian model provides a posterior distribution over the unknown class labels as well as other latent variables and can be used for active learning. [sent-222, score-0.813]

70 In particular, measures such as uncertainty or information gain can be used to guide the selective sampling procedure for active learning. [sent-223, score-0.508]

71 The uncertainty criterion seeks to select the labels that have the highest entropy, whereas the information gain criterion seeks to select a label that has the highest expected reduction in uncertainty over all the other unlabeled points or unknown labels. [sent-226, score-0.918]

72 Either of these criteria can be computed given the inferred posteriors; however we note that the information gain criterion is far more expensive to compute as it requires repeated inference by considering all possible labels for every unlabeled data point. [sent-227, score-0.572]

73 The uncertainty criterion, on the other hand, is very simple and often guides active learning with a reasonable amount of gain. [sent-228, score-0.439]

74 In this work we will consider uncertainty as the primary active learning criterion. [sent-229, score-0.407]

75 Finally, we'd like to point out that the different forms of active learning described above can naturally be addressed with these heuristics by appropriately choosing the set of possible candidates and the posterior distributions over which the entropy is measured. [sent-230, score-0.478]
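
As a concrete illustration of the uncertainty criterion under the different scenarios, the following sketch assumes the inferred posterior has been summarized as per-label Bernoulli marginals P(y_ij = 1); this summary is an assumption for illustration, since the model's posteriors carry more structure.

```python
import numpy as np

def bernoulli_entropy(p, eps=1e-12):
    """Entropy of Bernoulli marginals, elementwise."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

def select_query(label_marginals, mode="traditional"):
    """Pick the next oracle query by maximum entropy.
    label_marginals: (n_points, L) array of P(y_ij = 1) for the unlabeled pool."""
    H = bernoulli_entropy(label_marginals)
    if mode == "traditional":
        # reveal all labels of the single most uncertain point
        return int(np.argmax(H.sum(axis=1)))
    # active diagnosis / generalized setting: reveal one (point, label) pair
    return np.unravel_index(int(np.argmax(H)), H.shape)
```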

76 We also implemented the multilabel classification method based on compressed sensing (ML-CS) [1] with CoSaMP [8] being the underlying sparse vector recovery algorithm. [sent-238, score-0.948]

77 Such datasets generally tend to have only a few labels per data point and the compressed sensing methods can exploit this sparsity to their advantage. [sent-241, score-0.926]

78 Figure 3: Precision values obtained by various methods in retrieving 3 and 5 labels respectively. [sent-335, score-0.39]

79 The first column in each table shows K as a fraction of the number of labels L. [sent-336, score-0.387]

80 For each of the algorithms we recover the top 1, 3, 5 most likely positive labels and set remaining labels to be negative. [sent-339, score-0.71]
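
The evaluation described here can be read as precision at k; a small sketch of that metric under this reading (scores are each method's per-label confidences, Y_true the binary ground-truth matrix):

```python
import numpy as np

def precision_at_k(scores, Y_true, k):
    """Mean fraction of the k highest-scoring labels per point that are true positives."""
    topk = np.argsort(-scores, axis=1)[:, :k]          # indices of the k most likely labels
    hits = np.take_along_axis(Y_true, topk, axis=1)    # 1 where a predicted label is correct
    return hits.mean()
```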

81 It is clear from the figure that both BML-CS and ML-CS are significantly worse than 1-vs-all SVM when K is very small compared to the total number of labels L. [sent-349, score-0.386]

82 This is fairly intuitive: at higher recall rates the multilabel problem becomes harder, and hence our method requires more weight vectors to be learned per label. [sent-356, score-0.472]

83 2 Missing Labels Next, we conduct experiments for multilabel classification with missing labels. [sent-358, score-0.54]

84 Specifically, we remove a fixed fraction of training labels randomly from each dataset considered. [sent-359, score-0.427]

85 (b) Precision obtained after each round of active learning by the BML-CS-Active method and by the baseline random selection strategy over the RCV1 dataset. [sent-363, score-0.466]

86 (c) Precision after active learning, where one label per point is added to the training set, in comparison with random baseline on RCV1 dataset. [sent-364, score-0.646]

87 Since SVM cannot directly handle missing labels, we always set a missing label to be a negative label. [sent-369, score-0.504]

88 In contrast, our method can explicitly handle missing labels and can perform inference by marginalizing the unobserved tags. [sent-370, score-0.799]

89 As the number of positive labels is significantly smaller than the negative labels, when only a small fraction of labels are removed, both SVM and BML-CS obtain similar accuracies to the case where all the labels are present. [sent-371, score-1.176]

90 However, as the number of missing labels increases, there is a smooth dip in the precision of the two methods. [sent-372, score-0.66]

91 For each of the tasks, we use our Bayesian multilabel method to compute the posterior over the label vector. [sent-379, score-0.592]

92 For these experiments, we initialize both the methods with an initial labeled dataset of 100 points and then after each active learning round we seek all the labels for the selected training data point. [sent-385, score-0.859]

93 Figure 4 (b) compares the precisions obtained by the BML-CS-Active method with the precisions obtained by the baseline method after every active learning round. [sent-386, score-0.607]

94 After just 15 active learning rounds, our method is able to gain about 6% in accuracy, while the random selection method does not provide any gain in accuracy. [sent-387, score-0.457]

95 Active Diagnosis: In this type of active learning, we query one label for each of the training points in each round. [sent-388, score-0.533]

96 Figure 4 (c) plots the improvement in precision values with number of rounds of active learning, for estimating the top-1 label. [sent-390, score-0.545]

97 4 Conclusion and Future Work We presented a Bayesian framework for multilabel classification that uses compressive sensing. [sent-392, score-0.486]

98 The proposed framework jointly models the compressive sensing/reconstruction task with learning regression over the compressed space. [sent-393, score-0.535]

99 We present an efficient variational inference scheme that jointly resolves compressed sensing and regression tasks. [sent-394, score-0.708]

100 The resulting posterior distribution can further be used to perform different flavors of active learning. [sent-395, score-0.406]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('labels', 0.355), ('multilabel', 0.34), ('active', 0.339), ('compressed', 0.319), ('svm', 0.206), ('bml', 0.203), ('sensing', 0.187), ('cs', 0.165), ('missing', 0.158), ('label', 0.154), ('zi', 0.149), ('precision', 0.147), ('yi', 0.147), ('unobserved', 0.127), ('compressive', 0.104), ('classi', 0.082), ('fxi', 0.08), ('unlabeled', 0.079), ('precisions', 0.079), ('handles', 0.076), ('yu', 0.075), ('diagnosis', 0.073), ('regression', 0.069), ('gamma', 0.068), ('uncertainty', 0.068), ('posterior', 0.067), ('yp', 0.066), ('zp', 0.066), ('rounds', 0.059), ('bayesian', 0.054), ('latent', 0.052), ('promises', 0.052), ('traditional', 0.051), ('seeks', 0.051), ('inference', 0.05), ('accuracies', 0.048), ('baseline', 0.048), ('round', 0.048), ('informative', 0.046), ('labeled', 0.046), ('wi', 0.046), ('posteriors', 0.045), ('cosamp', 0.045), ('uo', 0.045), ('marginalizing', 0.044), ('jointly', 0.043), ('output', 0.042), ('compatibility', 0.042), ('selective', 0.042), ('cation', 0.042), ('obtains', 0.042), ('fairly', 0.041), ('update', 0.041), ('training', 0.04), ('naturally', 0.04), ('ml', 0.04), ('variational', 0.04), ('procedures', 0.037), ('dimensional', 0.037), ('xl', 0.036), ('ij', 0.036), ('spherical', 0.035), ('scenario', 0.035), ('maximally', 0.035), ('retrieving', 0.035), ('vs', 0.034), ('handle', 0.034), ('per', 0.033), ('yl', 0.033), ('fraction', 0.032), ('point', 0.032), ('priors', 0.032), ('criterion', 0.032), ('committee', 0.031), ('amongst', 0.031), ('tasks', 0.031), ('guide', 0.031), ('seek', 0.031), ('cantly', 0.031), ('method', 0.031), ('prediction', 0.03), ('projecting', 0.03), ('uncertainties', 0.03), ('positives', 0.029), ('paradigms', 0.029), ('recovery', 0.029), ('imposed', 0.029), ('projected', 0.028), ('alleviate', 0.028), ('rand', 0.028), ('considering', 0.028), ('partially', 0.028), ('fully', 0.028), ('gain', 0.028), ('space', 0.027), ('gaussian', 0.027), ('weight', 0.027), ('turn', 0.027), ('probabilistic', 0.027), ('promise', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999952 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing

Author: Ashish Kapoor, Raajay Viswanathan, Prateek Jain

Abstract: In this paper, we present a Bayesian framework for multilabel classification using compressed sensing. The key idea in compressed sensing for multilabel classification is to first project the label vector to a lower dimensional space using a random transformation and then learn regression functions over these projections. Our approach considers both of these components in a single probabilistic model, thereby jointly optimizing over compression as well as learning tasks. We then derive an efficient variational inference scheme that provides joint posterior distribution over all the unobserved labels. The two key benefits of the model are that a) it can naturally handle datasets that have missing labels and b) it can also measure uncertainty in prediction. The uncertainty estimate provided by the model allows for active learning paradigms where an oracle provides information about labels that promise to be maximally informative for the prediction task. Our experiments show significant boost over prior methods in terms of prediction performance over benchmark datasets, both in the fully labeled and the missing labels case. Finally, we also highlight various useful active learning scenarios that are enabled by the probabilistic model. 1

2 0.20850493 32 nips-2012-Active Comparison of Prediction Models

Author: Christoph Sawade, Niels Landwehr, Tobias Scheffer

Abstract: We address the problem of comparing the risks of two given predictive models—for instance, a baseline model and a challenger—as confidently as possible on a fixed labeling budget. This problem occurs whenever models cannot be compared on held-out training data, possibly because the training data are unavailable or do not reflect the desired test distribution. In this case, new test instances have to be drawn and labeled at a cost. We devise an active comparison method that selects instances according to an instrumental sampling distribution. We derive the sampling distribution that maximizes the power of a statistical test applied to the observed empirical risks, and thereby minimizes the likelihood of choosing the inferior model. Empirically, we investigate model selection problems on several classification and regression tasks and study the accuracy of the resulting p-values. 1

3 0.19667403 79 nips-2012-Compressive neural representation of sparse, high-dimensional probabilities

Author: Xaq Pitkow

Abstract: This paper shows how sparse, high-dimensional probability distributions could be represented by neurons with exponential compression. The representation is a novel application of compressive sensing to sparse probability distributions rather than to the usual sparse signals. The compressive measurements correspond to expected values of nonlinear functions of the probabilistically distributed variables. When these expected values are estimated by sampling, the quality of the compressed representation is limited only by the quality of sampling. Since the compression preserves the geometric structure of the space of sparse probability distributions, probabilistic computation can be performed in the compressed domain. Interestingly, functions satisfying the requirements of compressive sensing can be implemented as simple perceptrons. If we use perceptrons as a simple model of feedforward computation by neurons, these results show that the mean activity of a relatively small number of neurons can accurately represent a highdimensional joint distribution implicitly, even without accounting for any noise correlations. This comprises a novel hypothesis for how neurons could encode probabilities in the brain. 1

4 0.17694177 197 nips-2012-Learning with Recursive Perceptual Representations

Author: Oriol Vinyals, Yangqing Jia, Li Deng, Trevor Darrell

Abstract: Linear Support Vector Machines (SVMs) have become very popular in vision as part of state-of-the-art object recognition and other classification tasks but require high dimensional feature spaces for good performance. Deep learning methods can find more compact representations but current methods employ multilayer perceptrons that require solving a difficult, non-convex optimization problem. We propose a deep non-linear classifier whose layers are SVMs and which incorporates random projection as its core stacking element. Our method learns layers of linear SVMs recursively transforming the original data manifold through a random projection of the weak prediction computed from each layer. Our method scales as linear SVMs, does not rely on any kernel computations or nonconvex optimization, and exhibits better generalization ability than kernel-based SVMs. This is especially true when the number of training samples is smaller than the dimensionality of data, a common scenario in many real-world applications. The use of random projections is key to our method, as we show in the experiments section, in which we observe a consistent improvement over previous –often more complicated– methods on several vision and speech benchmarks. 1

5 0.14115936 207 nips-2012-Mandatory Leaf Node Prediction in Hierarchical Multilabel Classification

Author: Wei Bi, James T. Kwok

Abstract: In hierarchical classification, the prediction paths may be required to always end at leaf nodes. This is called mandatory leaf node prediction (MLNP) and is particularly useful when the leaf nodes have much stronger semantic meaning than the internal nodes. However, while there have been a lot of MLNP methods in hierarchical multiclass classification, performing MLNP in hierarchical multilabel classification is much more difficult. In this paper, we propose a novel MLNP algorithm that (i) considers the global hierarchy structure; and (ii) can be used on hierarchies of both trees and DAGs. We show that one can efficiently maximize the joint posterior probability of all the node labels by a simple greedy algorithm. Moreover, this can be further extended to the minimization of the expected symmetric loss. Experiments are performed on a number of real-world data sets with tree- and DAG-structured label hierarchies. The proposed method consistently outperforms other hierarchical and flat multilabel classification methods. 1

6 0.13345329 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization

7 0.11889498 278 nips-2012-Probabilistic n-Choose-k Models for Classification and Ranking

8 0.11090796 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

9 0.11021511 305 nips-2012-Selective Labeling via Error Bound Minimization

10 0.10860823 200 nips-2012-Local Supervised Learning through Space Partitioning

11 0.10750083 5 nips-2012-A Conditional Multinomial Mixture Model for Superset Label Learning

12 0.1004067 280 nips-2012-Proper losses for learning from partial labels

13 0.098142132 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

14 0.097032771 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

15 0.092136331 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function

16 0.091279767 10 nips-2012-A Linear Time Active Learning Algorithm for Link Classification

17 0.091200896 105 nips-2012-Dynamic Pruning of Factor Graphs for Maximum Marginal Prediction

18 0.090999231 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

19 0.088769548 168 nips-2012-Kernel Latent SVM for Visual Recognition

20 0.087050609 279 nips-2012-Projection Retrieval for Classification


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.253), (1, 0.052), (2, -0.029), (3, 0.002), (4, 0.011), (5, -0.071), (6, 0.043), (7, 0.124), (8, -0.026), (9, -0.07), (10, -0.076), (11, 0.095), (12, 0.079), (13, 0.06), (14, 0.019), (15, -0.042), (16, -0.033), (17, 0.001), (18, -0.037), (19, 0.012), (20, -0.073), (21, 0.188), (22, -0.061), (23, 0.103), (24, 0.052), (25, -0.021), (26, 0.023), (27, 0.006), (28, 0.028), (29, -0.048), (30, -0.272), (31, -0.122), (32, 0.062), (33, 0.035), (34, -0.256), (35, -0.154), (36, 0.062), (37, 0.088), (38, 0.094), (39, 0.004), (40, 0.026), (41, 0.005), (42, -0.052), (43, 0.006), (44, 0.019), (45, -0.041), (46, 0.023), (47, 0.05), (48, -0.008), (49, 0.083)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96752137 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing

Author: Ashish Kapoor, Raajay Viswanathan, Prateek Jain

Abstract: In this paper, we present a Bayesian framework for multilabel classification using compressed sensing. The key idea in compressed sensing for multilabel classification is to first project the label vector to a lower dimensional space using a random transformation and then learn regression functions over these projections. Our approach considers both of these components in a single probabilistic model, thereby jointly optimizing over compression as well as learning tasks. We then derive an efficient variational inference scheme that provides joint posterior distribution over all the unobserved labels. The two key benefits of the model are that a) it can naturally handle datasets that have missing labels and b) it can also measure uncertainty in prediction. The uncertainty estimate provided by the model allows for active learning paradigms where an oracle provides information about labels that promise to be maximally informative for the prediction task. Our experiments show significant boost over prior methods in terms of prediction performance over benchmark datasets, both in the fully labeled and the missing labels case. Finally, we also highlight various useful active learning scenarios that are enabled by the probabilistic model. 1

2 0.69878554 32 nips-2012-Active Comparison of Prediction Models

Author: Christoph Sawade, Niels Landwehr, Tobias Scheffer

Abstract: We address the problem of comparing the risks of two given predictive models—for instance, a baseline model and a challenger—as confidently as possible on a fixed labeling budget. This problem occurs whenever models cannot be compared on held-out training data, possibly because the training data are unavailable or do not reflect the desired test distribution. In this case, new test instances have to be drawn and labeled at a cost. We devise an active comparison method that selects instances according to an instrumental sampling distribution. We derive the sampling distribution that maximizes the power of a statistical test applied to the observed empirical risks, and thereby minimizes the likelihood of choosing the inferior model. Empirically, we investigate model selection problems on several classification and regression tasks and study the accuracy of the resulting p-values. 1

3 0.63283789 130 nips-2012-Feature-aware Label Space Dimension Reduction for Multi-label Classification

Author: Yao-nan Chen, Hsuan-tien Lin

Abstract: Label space dimension reduction (LSDR) is an efficient and effective paradigm for multi-label classification with many classes. Existing approaches to LSDR, such as compressive sensing and principal label space transformation, exploit only the label part of the dataset, but not the feature part. In this paper, we propose a novel approach to LSDR that considers both the label and the feature parts. The approach, called conditional principal label space transformation, is based on minimizing an upper bound of the popular Hamming loss. The minimization step of the approach can be carried out efficiently by a simple use of singular value decomposition. In addition, the approach can be extended to a kernelized version that allows the use of sophisticated feature combinations to assist LSDR. The experimental results verify that the proposed approach is more effective than existing ones to LSDR across many real-world datasets. 1

4 0.62343013 79 nips-2012-Compressive neural representation of sparse, high-dimensional probabilities

Author: Xaq Pitkow

Abstract: This paper shows how sparse, high-dimensional probability distributions could be represented by neurons with exponential compression. The representation is a novel application of compressive sensing to sparse probability distributions rather than to the usual sparse signals. The compressive measurements correspond to expected values of nonlinear functions of the probabilistically distributed variables. When these expected values are estimated by sampling, the quality of the compressed representation is limited only by the quality of sampling. Since the compression preserves the geometric structure of the space of sparse probability distributions, probabilistic computation can be performed in the compressed domain. Interestingly, functions satisfying the requirements of compressive sensing can be implemented as simple perceptrons. If we use perceptrons as a simple model of feedforward computation by neurons, these results show that the mean activity of a relatively small number of neurons can accurately represent a highdimensional joint distribution implicitly, even without accounting for any noise correlations. This comprises a novel hypothesis for how neurons could encode probabilities in the brain. 1

5 0.60201812 279 nips-2012-Projection Retrieval for Classification

Author: Madalina Fiterau, Artur Dubrawski

Abstract: In many applications, classification systems often require human intervention in the loop. In such cases the decision process must be transparent and comprehensible, simultaneously requiring minimal assumptions on the underlying data distributions. To tackle this problem, we formulate an axis-aligned subspace-finding task under the assumption that query specific information dictates the complementary use of the subspaces. We develop a regression-based approach called RECIP that efficiently solves this problem by finding projections that minimize a nonparametric conditional entropy estimator. Experiments show that the method is accurate in identifying the informative projections of the dataset, picking the correct views to classify query points, and facilitates visual evaluation by users. 1 Introduction and problem statement In the domain of predictive analytics, many applications which keep human users in the loop require the use of simple classification models. Often, it is required that a test-point be ‘explained’ (classified) using a simple low-dimensional projection of the original feature space. This is a Projection Retrieval for Classification problem (PRC). The interaction with the user proceeds as follows: the user provides the system a query point; the system searches for a projection in which the point can be accurately classified; the system displays the classification result as well as an illustration of how the classification decision was reached in the selected projection. Solving the PRC problem is relevant in many practical applications. For instance, consider a nuclear threat detection system installed at a border check point. Vehicles crossing the border are scanned with sensors so that a large array of measurements of radioactivity and secondary contextual information is being collected. These observations are fed into a classification system that determines whether the scanned vehicle may carry a threat. Given the potentially devastating consequences of a false negative, a border control agent is requested to validate the prediction and decide whether to submit the vehicle for a costly further inspection. With the positive classification rate of the system under strict bounds because of limitations in the control process, the risk of false negatives is increased. Despite its crucial role, human intervention should only be withheld for cases in which there are reasons to doubt the validity of classification. In order for a user to attest the validity of a decision, the user must have a good understanding of the classification process, which happens more readily when the classifier only uses the original dataset features rather than combinations of them, and when the discrimination models are low-dimensional. In this context, we aim to learn a set of classifiers in low-dimensional subspaces and a decision function which selects the subspace under which a test point is to be classified. Assume we are given a dataset {(x1 , y1 ) . . . (xn , yn )} ∈ X n × {0, 1}n and a class of discriminators H. The model will contain a set Π of subspaces of X ; Π ⊆ Π, where Π is the set of all axis-aligned subspaces of the original feature space, the power set of the features. To each projection πi ∈ Π corresponds one discriminator from a given hypothesis space hi ∈ H. It will also contain a selection function g : X → Π × H, which yields, for a query point x, the projection/discriminator pair with which this point will be classified. 
The notation π(x) refers to the projection of the point x onto the subspace 1 π while h(π(x)) represents the predicted label for x. Formally, we describe the model class as Md = {Π = {π : π ∈ Π, dim(π) ≤ d}, H = {hi : hi ∈ H, h : πi → Y, ∀i = 1 . . . |Π|}, g ∈ {f : X → {1 . . . |Π|}} . where dim(π) presents the dimensionality of the subspace determined by the projection π. Note that only projections up to size d will be considered, where d is a parameter specific to the application. The set H contains one discriminator from the hypothesis class H for each projection. Intuitively, the aim is to minimize the expected classification error over Md , however, a notable modification is that the projection and, implicitly, the discriminator, are chosen according to the data point that needs to be classified. Given a query x in the space X , g(x) will yield the subspace πg(x) onto which the query is projected and the discriminator hg(x) for it. Distinct test points can be handled using different combinations of subspaces and discriminators. We consider models that minimize 0/1 loss. Hence, the PRC problem can be stated as follows: M ∗ = arg min EX ,Y y = hg(x) (πg(x) (x)) M ∈Md There are limitations to the type of selection function g that can be learned. A simple example for which g can be recovered is a set of signal readings x for which, if one of the readings xi exceeds a threshold ti , the label can be predicted just based on xi . A more complex one is a dataset containing regulatory variables, that is, for xi in the interval [ak , bk ] the label only depends on (x1 . . . xnk ) k k datasets that fall into the latter category fulfill what we call the Subspace-Separability Assumption. This paper proposes an algorithm called RECIP that solves the PRC problem for a class of nonparametric classifiers. We evaluate the method on artificial data to show that indeed it correctly identifies the underlying structure for data satisfying the Subspace-Separability Assumption. We show some case studies to illustrate how RECIP offers insight into applications requiring human intervention. The use of dimensionality reduction techniques is a common preprocessing step in applications where the use of simplified classification models is preferable. Methods that learn linear combinations of features, such as Linear Discriminant Analysis, are not quite appropriate for the task considered here, since we prefer to natively rely on the dimensions available in the original feature space. Feature selection methods, such as e.g. lasso, are suitable for identifying sets of relevant features, but do not consider interactions between them. Our work better fits the areas of class dependent feature selection and context specific classification, highly connected to the concept of Transductive Learning [6]. Other context-sensitive methods are Lazy and Data-Dependent Decision Trees, [5] and [10] respectively. In Ting et al [14], the Feating submodel selection relies on simple attribute splits followed by fitting local predictors, though the algorithm itself is substantially different. Obozinski et al present a subspace selection method in the context of multitask learning [11]. Go et al propose a joint method for feature selection and subspace learning [7], however, their classification model is not particularly query specific. Alternatively, algorithms that transform complex or unintelligible models with user-friendly equivalents have been proposed [3, 2, 1, 8]. 
Algorithms specifically designed to yield understandable models are a precious few. Here we note a rule learning method described in [12], even though the resulting rules can make visualization difficult, while itemset mining [9] is not specifically designed for classification. Unlike those approaches, our method is designed to retrieve subsets of the feature space designed for use in a way that is complementary to the basic task at hand (classification) while providing query-specific information. 2 Recovering informative projections with RECIP To solve PRC, we need means by which to ascertain which projections are useful in terms of discriminating data from the two classes. Since our model allows the use of distinct projections depending on the query point, it is expected that each projection would potentially benefit different areas of the feature space. A(π) refers to the area of the feature space where the projection π is selected. A(π) = {x ∈ X : πg(x) = π} The objective becomes min E(X ×Y) y = hg(x) (πg(x) (x)) M ∈Md = p(A(π))E y = hg(x) (πg(x) (x))|x ∈ A(π) min M ∈Md 2 π∈Π . The expected classification error over A(π) is linked to the conditional entropy of Y |X. Fano’s inequality provides a lower bound on the error while Feder and Merhav [4] derive a tight upper bound on the minimal error probability in terms of the entropy. This means that conditional entropy characterizes the potential of a subset of the feature space to separate data, which is more generic than simply quantifying classification accuracy for a specific discriminator. In view of this connection between classification accuracy and entropy, we adapt the objective to: p(A(π))H(Y |π(X); X ∈ A(π)) min M ∈Md (1) π∈Π The method we propose optimizes an empirical analog of (1) which we develop below and for which we will need the following result. Proposition 2.1. Given a continuous variable X ∈ X and a binary variable Y , where X is sampled from the mixture model f (x) = p(y = 0)f0 (x) + p(y = 1)f1 (x) = p0 f0 (x) + p1 f1 (x) , then H(Y |X) = −p0 log p0 − p1 log p1 − DKL (f0 ||f ) − DKL (f1 ||f ) . Next, we will use the nonparametric estimator presented in [13] for Tsallis α-divergence. Given samples Ui ∼ U, with i = 1, n and Vj ∼ V with j = 1, m, the divergence is estimated as follows: ˆ Tα (U ||V ) = 1 1 1−α n n (n − 1)νk (Ui , U \ ui )d mνk (Ui , V )d i=1 1−α B(k, α) − 1 , (2) where d is the dimensionality of the variables U and V and νk (z, Z) represents the distance from z ˆ to its k th nearest neighbor of the set of points Z. For α ≈ 1 and n → ∞, Tα (u||v) ≈ DKL (u||v). 2.1 Local estimators of entropy We will now plug (2) in the formula obtained by Proposition 2.1 to estimate the quantity (1). We use the notation X0 to represent the n0 samples from X which have the labels Y equal to 0, and X1 to represent the n1 samples from X which have the labels set to 1. Also, Xy(x) represents the set of samples that have labels equal to the label of x and X¬y(x) the data that have labels opposite to the label of x. 
ˆ ˆ x ˆ x H(Y |X; X ∈ A) = −H(p0 ) − H(p1 ) − T (f0 ||f x ) − T (f1 ||f x ) + C α≈1 ˆ H(Y |X; X ∈ A) ∝ 1 n0 + 1 n1 ∝ 1 n0 + 1 n1 ∝ 1 n n0 (n0 − 1)νk (xi , X0 \ xi )d nνk (xi , X \ xi )d 1−α I[xi ∈ A] (n1 − 1)νk (xi , X1 \ xi )d nνk (xi , X \ xi )d 1−α I[xi ∈ A] (n0 − 1)νk (xi , X0 \ xi )d nνk (xi , X1 \ xi )d 1−α I[xi ∈ A] (n1 − 1)νk (xi , X1 \ xi )d nνk (xi , X0 \ xi )d 1−α I[xi ∈ A] i=1 n1 i=1 n0 i=1 n1 i=1 n I[xi ∈ A] i=1 (n − 1)νk (xi , Xy(xi ) \ xi )d nνk (xi , X¬y(xi ) \ xi )d 1−α The estimator for the entropy of the data that is classified with projection π is as follows: ˆ H(Y |π(X); X ∈ A(π)) ∝ 1 n n I[xi ∈ A(π)] i=1 (n − 1)νk (π(xi ), π(Xy(xi ) ) \ π(xi ))d nνk (π(xi ), π(X¬y(xi ) \ xi ))d 1−α (3) From 3 and using the fact that I[xi ∈ A(π)] = I[πg(xi ) = π] for which we use the notation I[g(xi ) → π], we estimate the objective as min M ∈Md π∈Π 1 n n I[g(xi ) → π] i=1 (n − 1)νk (π(xi ), π(Xy(xi ) ) \ π(xi ))d nνk (π(xi ), π(X¬y(xi ) \ xi ))d 3 1−α (4) Therefore, the contribution of each data point to the objective corresponds to a distance ratio on the projection π ∗ where the class of the point is obtained with the highest confidence (data is separable in the neighborhood of the point). We start by computing the distance-based metric of each point on each projection of size up to d - there are d∗ such projections. This procedure yields an extended set of features Z, which we name local entropy estimates: Zij = νk (πj (xi ), πj (Xy(xi ) ) \ πj (xi )) νk (πj (xi ), πj (X¬y(xi ) ) \ πj (xi )) d(1−α) α≈1 j ∈ {1 . . . d∗ } (5) For each training data point, we compute the best distance ratio amid all the projections, which is simply Ti = minj∈[d∗ ] Zij . The objective can be then further rewritten as a function of the entropy estimates: n I[g(xi ) → πj ]Zij min M ∈Md (6) i=1 πj ∈Π From the definition of T, it is also clear that n n I[g(xi ) → πj ]Zij min M ∈Md 2.2 ≥ i=1 πj ∈Π Ti . (7) i=1 Projection selection as a combinatorial problem Considering form (6) of the objective, and given that the estimates Zij are constants, depending only on the training set, the projection retrieval problem is reduced to finding g for all training points, which will implicitly select the projection set of the model. Naturally, one might assume the bestperforming classification model is the one containing all the axis-aligned subspaces. This model achieves the lower bound (7) for the training set. However, the larger the set of projections, the more values the function g takes, and thus the problem of selecting the correct projection becomes more difficult. It becomes apparent that the number of projections should be somehow restricted to allow intepretability. Assuming a hard threshold of at most t projections, the optimization (6) becomes an entry selection problem over matrix Z where one value must be picked from each row under a limitation on the number of columns that can be used. This problem cannot be solved exactly in polynomial time. Instead, it can be formulated as an optimization problem under 1 constraints. 2.3 Projection retrieval through regularized regression To transform the projection retrieval to a regression problem we consider T, the minimum obtainable value of the entropy estimator for each point, as the output which the method needs to predict. Each row i of the parameter matrix B represents the degrees to which the entropy estimates on each projection contribute to the entropy estimator of point xi . 
Thus, the sum over each row of B is 1, and the regularization penalty applies to the number of non-zero columns in B. d∗ min ||T − (Z B B)J|Π|,1 ||2 2 +λ [Bi = 0] (8) i=1 subject to |Bk | 1 = 1 k = 1, n where (Z B)ij = Zij + Bij and J is a matrix of ones. The problem with this optimization is that it is not convex. A typical walk-around of this issue is to use the convex relaxation for Bi = 0, that is 1 norm. This would transform the penalized term d∗ d∗ n to i=1 |Bi | 1 . However, i=1 |Bi | 1 = k=1 |Bk | 1 = n , so this penalty really has no effect. An alternative mechanism to encourage the non-zero elements in B to populate a small number of columns is to add a penalty term in the form of Bδ, where δ is a d∗ -size column vector with each element representing the penalty for a column in B. With no prior information about which subspaces are more informative, δ starts as an all-1 vector. An initial value for B is obtained through the optimization (8). Since our goal is to handle data using a small number of projections, δ is then updated such that its value is lower for the denser columns in B. This update resembles the reweighing in adaptive lasso. The matrix B itself is updated, and this 2-step process continues until convergence of δ. Once δ converges, the projections corresponding to the non-zero columns of B are added to the model. The procedure is shown in Algorithm 1. 4 Algorithm 1: RECIP δ = [1 . . . 1] repeat |P I| b = arg minB ||T − i=1 < Z, B > ||2 + λ|Bδ| 1 2 subject to |Bk | 1 = 1 k = 1...n δk = |Bi | 1 i = . . . d∗ (update the differential penalty) δ δ = 1 − |δ| 1 until δ converges return Π = {πi ; |Bi | 1 > 0 ∀i = 1 . . . d∗ } 2.4 Lasso for projection selection We will compare our algorithm to lasso regularization that ranks the projections in terms of their potential for data separability. We write this as an 1 -penalized optimization on the extended feature set Z, with the objective T : minβ |T − Zβ|2 + λ|β| 1 . The lasso penalty to the coefficient vector encourages sparsity. For a high enough λ, the sparsity pattern in β is indicative of the usefulness of the projections. The lasso on entropy contributions was not found to perform well as it is not query specific and will find one projection for all data. We improved it by allowing it to iteratively find projections - this robust version offers increased performance by reweighting the data thus focusing on different subsets of it. Although better than running lasso on entropy contributions, the robust lasso does not match RECIP’s performance as the projections are selected gradually rather than jointly. Running the standard lasso on the original design matrix yields a set of relevant variables and it is not immediately clear how the solution would translate to the desired class. 2.5 The selection function Once the projections are selected, the second stage of the algorithm deals with assigning the projection with which to classify a particular query point. An immediate way of selecting the correct projection starts by computing the local entropy estimator for each subspace with each class assignment. Then, we may select the label/subspace combination that minimizes the empirical entropy. (i∗ , θ∗ ) = arg min i,θ 3 νk (πi (x), πi (Xθ )) νk (πi (x), πi (X¬θ )) dim(πi )(1−α) i = 1 . . . d∗ , α≈1 (9) Experimental results In this section we illustrate the capability of RECIP to retrieve informative projections of data and their use in support of interpreting results of classification. 
First, we analyze how well RECIP can identify subspaces in synthetic data whose distribution obeys the subspace separability assumption (3.1). As a point of reference, we also present classification accuracy results (3.2) for both the synthetic data and a few real-world sets. This is to quantify the extent of the trade-off between fidelity of attainable classifiers and desired informativeness of the projections chosen by RECIP. We expect RECIP’s classification performance to be slightly, but not substantially worse when compared to relevant classification algorithms trained to maximize classification accuracy. Finally, we present a few examples (3.3) of informative projections recovered from real-world data and their utility in explaining to human users the decision processes applied to query points. A set of artificial data used in our experiments contains q batches of data points, each of them made classifiable with high accuracy using one of available 2-dimensional subspaces (x1 , x2 ) with k ∈ k k {1 . . . q}. The data in batch k also have the property that x1 > tk . This is done such that the group a k point belongs to can be detected from x1 , thus x1 is a regulatory variable. We control the amount of k k noise added to thusly created synthetic data by varying the proportion of noisy data points in each batch. The results below are for datasets with 7 features each, with number of batches q ranging between 1 and 7. We kept the number of features specifically low in order to prevent excessive variation between any two sets generated this way, and to enable computing meaningful estimates of the expectation and variance of performance, while enabling creation of complicated data in which synthetic patterns may substantially overlap (using 7 features and 7 2-dimensional patterns implies that dimensions of at least 4 of the patterns will overlap). We implemented our method 5 to be scalable to the size and dimensionality of data and although for brevity we do not include a discussion of this topic here, we have successfully run RECIP against data with 100 features. The parameter α is a value close to 1, because the Tsallis divergence converges to the KL divergence as alpha approaches 1. For the experiments on real-world data, d was set to n (all projections were considered). For the artificial data experiments, we reported results for d = 2 as they do not change significantly for d >= 2 because this data was synthesized to contain bidimensional informative projections. In general, if d is too low, the correct full set of projections will not be found, but it may be recovered partially. If d is chosen too high, there is a risk that a given selected projection p will contain irrelevant features compared to the true projection p0 . However, this situation only occurs if the noise introduced by these features in the estimators makes the entropy contributions on p and p0 statistically indistinguishable for a large subset of data. The users will choose d according to the desired/acceptable complexity of the resulting model. If the results are to be visually interpreted by a human, values of 2 or 3 are reasonable for d. 3.1 Recovering informative projections Table 1 shows how well RECIP recovers the q subspaces corresponding to the synthesized batches of data. We measure precision (proportion of the recovered projections that are known to be informative), and recall (proportion of known informative projections that are recovered by the algorithm). 
in Table 1, rows correspond to the number of distinct synthetic batches injected in data, q, and subsequent columns correspond to the increasing amounts of noise in data. We note that the observed precision is nearly perfect: the algorithm makes only 2 mistakes over the entire set of experiments, and those occur for highly noisy setups. The recall is nearly perfect as long as there is little overlap among the dimensions, that is when the injections do not interfere with each other. As the number of projections increases, the chances for overlap among the affected features also increase, which makes the data more confusing resulting on a gradual drop of recall until only about 3 or 4 of the 7 known to be informative subspaces can be recovered. We have also used lasso as described in 2.4 in an attempt to recover projections. This setup only manages to recover one of the informative subspaces, regardless of how the regularization parameter is tuned. 3.2 Classification accuracy Table 2 shows the classification accuracy of RECIP, obtained using synthetic data. As expected, the observed performance is initially high when there are few known informative projections in data and it decreases as noise and ambiguity of the injected patterns increase. Most types of ensemble learners would use a voting scheme to arrive at the final classification of a testing sample, rather than use a model selection scheme. For this reason, we have also compared predictive accuracy revealed by RECIP against a method based on majority voting among multiple candidate subspaces. Table 4 shows that the accuracy of this technique is lower than the accuracy of RECIP, regardless of whether the informative projections are recovered by the algorithm or assumed to be known a priori. This confirms the intuition that a selection-based approach can be more effective than voting for data which satisfies the subspace separability assumption. For reference, we have also classified the synthetic data using K-Nearest-Neighbors algorithm using all available features at once. The results of that experiment are shown in Table 5. Since RECIP uses neighbor information, K-NN is conceptually the closest among the popular alternatives. Compared to RECIP, K-NN performs worse when there are fewer synthetic patterns injected in data to form informative projections. It is because some features used then by K-NN are noisy. As more features become informative, the K-NN accuracy improves. This example shows the benefit of a selective approach to feature space and using a subset of the most explanatory projections to support not only explanatory analyses but also classification tasks in such circumstances. 3.3 RECIP case studies using real-world data Table 3 summarizes the RECIP and K-NN performance on UCI datasets. We also test the methods using Cell dataset containing a set of measurements such as the area and perimeter biological cells with separate labels marking cells subjected to treatment and control cells. In Vowel data, the nearest-neighbor approach works exceptionally well, even outperforming random forests (0.94 accuracy), which is an indication that all features are jointly relevant. For d lower than the number of features, RECIP picks projections of only one feature, but if there is no such limitation, RECIP picks the space of all the features as informative. 6 Table 1: Projection recovery for artificial datasets with 1 . . . 7 informative features and noise level 0 . . . 0.2 in terms of mean and variance of Precision and Recall. 
Table 1: Projection recovery for artificial datasets with 1 . . . 7 informative features and noise level 0 . . . 0.2, in terms of mean and variance of precision and recall. Mean/variance obtained for each setting by repeating the experiment with datasets with different informative projections. Columns: noise = 0, 0.02, 0.05, 0.1, 0.2.

PRECISION
      Mean                                      Variance
q=1   1       1       1       0.9286  0.9286    0       0       0       0.0306  0.0306
q=2   1       1       1       1       1         0       0       0       0       0
q=3   1       1       1       1       1         0       0       0       0       0
q=4   1       1       1       1       1         0       0       0       0       0
q=5   1       1       1       1       1         0       0       0       0       0
q=6   1       1       1       1       1         0       0       0       0       0
q=7   1       1       1       1       1         0       0       0       0       0

RECALL
      Mean                                      Variance
q=1   1       1       1       1       1         0       0       0       0       0
q=2   1       1       1       1       1         0       0       0       0       0
q=3   1       1       0.9524  0.9524  1         0       0       0.0136  0.0136  0
q=4   0.9643  0.9643  0.9643  0.9643  0.9286    0.0077  0.0077  0.0077  0.0077  0.0128
q=5   0.7714  0.7429  0.8286  0.8571  0.7714    0.0163  0.0196  0.0049  0.0082  0.0278
q=6   0.6429  0.6905  0.6905  0.6905  0.6905    0.0113  0.0113  0.0272  0.0113  0.0113
q=7   0.6327  0.5918  0.5918  0.5714  0.551     0.0225  0.02    0.0258  0.0233  0.02

Table 2: RECIP classification accuracy on artificial data. Columns: noise = 0, 0.02, 0.05, 0.1, 0.2.

CLASSIFICATION ACCURACY
      Mean                                      Variance
q=1   0.9751  0.9731  0.9686  0.9543  0.9420    0.0000  0.0000  0.0000  0.0008  0.0007
q=2   0.9333  0.9297  0.9227  0.9067  0.8946    0.0001  0.0001  0.0001  0.0001  0.0001
q=3   0.9053  0.8967  0.8764  0.8640  0.8618    0.0004  0.0005  0.0016  0.0028  0.0007
q=4   0.8725  0.8685  0.8589  0.8454  0.8187    0.0020  0.0020  0.0019  0.0025  0.0032
q=5   0.8113  0.8009  0.8105  0.8105  0.7782    0.0042  0.0044  0.0033  0.0036  0.0044
q=6   0.7655  0.7739  0.7669  0.7632  0.7511    0.0025  0.0021  0.0026  0.0025  0.0027
q=7   0.7534  0.7399  0.7347  0.7278  0.7205    0.0034  0.0040  0.0042  0.0042  0.0045

CLASSIFICATION ACCURACY - KNOWN PROJECTIONS
      Mean                                      Variance
q=1   0.9751  0.9731  0.9686  0.9637  0.9514    0.0000  0.0000  0.0000  0.0001  0.0000
q=2   0.9333  0.9297  0.9227  0.9067  0.8946    0.0001  0.0001  0.0001  0.0001  0.0001
q=3   0.9053  0.8967  0.8914  0.8777  0.8618    0.0004  0.0005  0.0005  0.0007  0.0007
q=4   0.8820  0.8781  0.8657  0.8541  0.8331    0.0011  0.0011  0.0014  0.0014  0.0020
q=5   0.8714  0.8641  0.8523  0.8429  0.8209    0.0015  0.0015  0.0018  0.0019  0.0023
q=6   0.8566  0.8497  0.8377  0.8285  0.8074    0.0014  0.0015  0.0016  0.0023  0.0021
q=7   0.8429  0.8371  0.8256  0.8122  0.7988    0.0015  0.0018  0.0018  0.0021  0.0020

Table 3: Accuracy of K-NN and RECIP.

Dataset              KNN      RECIP
Breast Cancer Wis    0.8415   0.8275
Breast Tissue        1.0000   1.0000
Cell                 0.7072   0.7640
MiniBOONE*           0.7896   0.7396
Spam                 0.7680   0.7680
Vowel                0.9839   0.9839

In Spam data, the two most informative projections are 'Capital Length Total' (CLT)/'Capital Length Longest' (CLL) and CLT/'Frequency of word your' (FWY). Figure 1 shows these two projections, with the dots representing training points. The red dots represent points labeled as spam while the blue ones are non-spam. The circles are query points that have been assigned to be classified with the projection in which they are plotted. The green circles are correctly classified points, while the magenta circles, far fewer, are the incorrectly classified ones. Not only does the importance of text in capital letters make sense for a spam filtering dataset, but the points that select those projections are almost flawlessly classified. Additionally, assuming the user would need to attest the validity of classification for the first plot, he/she would have no trouble seeing that the circled data points are located in a region predominantly populated with examples of spam, so any non-spam entry appears suspicious. Both of the magenta-colored cases fall into this category, and they can therefore be flagged for further investigation.
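The contrast between the selection model (Table 2) and the voting ensemble (Table 4 below), as well as the per-query assignment of projections described for Figure 1, can be illustrated with the following sketch. Using K-NN neighborhood purity as the per-projection confidence score is an assumption that stands in for RECIP's entropy-based criterion; the helper names and k are hypothetical.

```python
import numpy as np
from collections import Counter

def knn_vote(X, y, query, proj, k=5):
    """Majority label and neighborhood purity for a query, restricted to the
    features in `proj` (a small set of feature indices, e.g. a 2-D projection)."""
    idx = sorted(proj)
    d = np.linalg.norm(X[:, idx] - query[idx], axis=1)
    nbr = y[np.argsort(d)[:k]]
    label, count = Counter(nbr).most_common(1)[0]
    return label, count / k

def predict_selection(X, y, query, projections, k=5):
    """Selection model: classify in the single candidate projection whose
    neighborhood around the query is most confident (highest purity)."""
    best = max(projections, key=lambda p: knn_vote(X, y, query, p, k)[1])
    return knn_vote(X, y, query, best, k)[0], best

def predict_voting(X, y, query, projections, k=5):
    """Voting ensemble: every candidate projection votes with equal weight."""
    votes = [knn_vote(X, y, query, p, k)[0] for p in projections]
    return Counter(votes).most_common(1)[0][0]

# Example candidate 2-D projections over a 7-feature dataset:
# projections = [frozenset({0, 1}), frozenset({2, 3}), frozenset({4, 5})]
```

When only one candidate projection is clean for a given query, selection uses just that projection, whereas voting dilutes it with the noisy ones, which is the effect visible when comparing Table 2 with Table 4.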
[Figure 1: Spam Dataset Selected Subspaces. Two panels titled "Informative Projection for the Spam dataset": Capital Run Length Longest vs. Capital Run Length Total, and Frequency of word 'your' vs. Capital Run Length Total. Plot content not reproduced here.]

Table 4: Classification accuracy using RECIP-learned projections (upper section) or known projections (lower section) within a voting model instead of a selection model. Columns: noise = 0, 0.02, 0.05, 0.1, 0.2.

CLASSIFICATION ACCURACY - VOTING ENSEMBLE
      Mean                                      Variance
q=1   0.9751  0.9731  0.9686  0.9317  0.9226    0.0000  0.0000  0.0000  0.0070  0.0053
q=2   0.7360  0.7354  0.7331  0.7303  0.7257    0.0002  0.0002  0.0001  0.0002  0.0001
q=3   0.7290  0.7266  0.7163  0.7166  0.7212    0.0002  0.0002  0.0008  0.0006  0.0002
q=4   0.6934  0.6931  0.6932  0.6904  0.6867    0.0008  0.0008  0.0008  0.0008  0.0009
q=5   0.6715  0.6602  0.6745  0.6688  0.6581    0.0013  0.0014  0.0013  0.0014  0.0013
q=6   0.6410  0.6541  0.6460  0.6529  0.6512    0.0008  0.0007  0.0010  0.0006  0.0005
q=7   0.6392  0.6342  0.6268  0.6251  0.6294    0.0009  0.0011  0.0012  0.0012  0.0012

CLASSIFICATION ACCURACY - VOTING ENSEMBLE, KNOWN PROJECTIONS
      Mean                                      Variance
q=1   0.9751  0.9731  0.9686  0.9637  0.9514    0.0000  0.0000  0.0000  0.0001  0.0000
q=2   0.7360  0.7354  0.7331  0.7303  0.7257    0.0002  0.0002  0.0001  0.0002  0.0001
q=3   0.7409  0.7385  0.7390  0.7353  0.7325    0.0010  0.0012  0.0010  0.0011  0.0010
q=4   0.7110  0.7109  0.7083  0.7067  0.7035    0.0041  0.0041  0.0042  0.0042  0.0043
q=5   0.7077  0.7070  0.7050  0.7034  0.7008    0.0015  0.0015  0.0015  0.0016  0.0016
q=6   0.6816  0.6807  0.6801  0.6790  0.6747    0.0008  0.0008  0.0008  0.0008  0.0009
q=7   0.6787  0.6783  0.6772  0.6767  0.6722    0.0008  0.0009  0.0009  0.0008  0.0008

Table 5: Classification accuracy for artificial data with the K-Nearest Neighbors method. Columns: noise = 0, 0.02, 0.05, 0.1, 0.2.

CLASSIFICATION ACCURACY - KNN
      Mean                                      Variance
q=1   0.7909  0.7843  0.7747  0.7652  0.7412    0.0002  0.0002  0.0002  0.0002  0.0002
q=2   0.7940  0.7911  0.7861  0.7790  0.7655    0.0001  0.0001  0.0001  0.0001  0.0001
q=3   0.7964  0.7939  0.7901  0.7854  0.7756    0.0000  0.0001  0.0001  0.0000  0.0000
q=4   0.7990  0.7972  0.7942  0.7904  0.7828    0.0001  0.0001  0.0001  0.0001  0.0001
q=5   0.8038  0.8024  0.8002  0.7970  0.7905    0.0001  0.0001  0.0001  0.0001  0.0001
q=6   0.8043  0.8032  0.8015  0.7987  0.7930    0.0001  0.0001  0.0001  0.0001  0.0001
q=7   0.8054  0.8044  0.8028  0.8004  0.7955    0.0001  0.0001  0.0001  0.0001  0.0001

4 Conclusion

This paper considers the problem of Projection Recovery for Classification. It is relevant in applications where the decision process must be easy to understand in order to enable human interpretation of the results. We have developed a principled, regression-based algorithm designed to recover small sets of low-dimensional subspaces that support interpretability. It optimizes the selection using individual, data-point-specific entropy estimators. In this context, the proposed algorithm follows the idea of transductive learning, and the role of the resulting projections bears resemblance to high-confidence regions known from conformal prediction models. Empirical results obtained using simulated and real-world data show the effectiveness of our method in finding informative projections that enable accurate classification while maintaining transparency of the underlying decision process.

Acknowledgments
This material is based upon work supported by the NSF under Grant No. IIS-0911032.

6 0.59742033 207 nips-2012-Mandatory Leaf Node Prediction in Hierarchical Multilabel Classification

7 0.57791424 278 nips-2012-Probabilistic n-Choose-k Models for Classification and Ranking

8 0.56748664 305 nips-2012-Selective Labeling via Error Bound Minimization

9 0.5620957 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

10 0.55043381 58 nips-2012-Bayesian models for Large-scale Hierarchical Classification

11 0.53400975 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization

12 0.50783587 63 nips-2012-CPRL -- An Extension of Compressive Sensing to the Phase Retrieval Problem

13 0.50524515 200 nips-2012-Local Supervised Learning through Space Partitioning

14 0.50405192 197 nips-2012-Learning with Recursive Perceptual Representations

15 0.50154728 171 nips-2012-Latent Coincidence Analysis: A Hidden Variable Model for Distance Metric Learning

16 0.50003064 54 nips-2012-Bayesian Probabilistic Co-Subspace Addition

17 0.49905205 10 nips-2012-A Linear Time Active Learning Algorithm for Link Classification

18 0.48538154 34 nips-2012-Active Learning of Multi-Index Function Models

19 0.48236504 48 nips-2012-Augmented-SVM: Automatic space partitioning for combining multiple non-linear dynamics

20 0.47358501 280 nips-2012-Proper losses for learning from partial labels


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.027), (17, 0.013), (21, 0.019), (38, 0.094), (54, 0.022), (55, 0.016), (74, 0.033), (76, 0.108), (80, 0.546), (92, 0.033)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.96233881 204 nips-2012-MAP Inference in Chains using Column Generation

Author: David Belanger, Alexandre Passos, Sebastian Riedel, Andrew McCallum

Abstract: Linear chains and trees are basic building blocks in many applications of graphical models, and they admit simple exact maximum a-posteriori (MAP) inference algorithms based on message passing. However, in many cases this computation is prohibitively expensive, due to quadratic dependence on variables’ domain sizes. The standard algorithms are inefficient because they compute scores for hypotheses for which there is strong negative local evidence. For this reason there has been significant previous interest in beam search and its variants; however, these methods provide only approximate results. This paper presents new exact inference algorithms based on the combination of column generation and pre-computed bounds on terms of the model’s scoring function. While we do not improve worst-case performance, our method substantially speeds real-world, typical-case inference in chains and trees. Experiments show our method to be twice as fast as exact Viterbi for Wall Street Journal part-of-speech tagging and over thirteen times faster for a joint part-of-speech and named-entity-recognition task. Our algorithm is also extendable to new techniques for approximate inference, to faster 0/1 loss oracles, and new opportunities for connections between inference and learning. We encourage further exploration of high-level reasoning about the optimization problem implicit in dynamic programs. 1

2 0.9622153 314 nips-2012-Slice Normalized Dynamic Markov Logic Networks

Author: Tivadar Papai, Henry Kautz, Daniel Stefankovic

Abstract: Markov logic is a widely used tool in statistical relational learning, which uses a weighted first-order logic knowledge base to specify a Markov random field (MRF) or a conditional random field (CRF). In many applications, a Markov logic network (MLN) is trained in one domain, but used in a different one. This paper focuses on dynamic Markov logic networks, where the size of the discretized time-domain typically varies between training and testing. It has been previously pointed out that the marginal probabilities of truth assignments to ground atoms can change if one extends or reduces the domains of predicates in an MLN. We show that in addition to this problem, the standard way of unrolling a Markov logic theory into a MRF may result in time-inhomogeneity of the underlying Markov chain. Furthermore, even if these representational problems are not significant for a given domain, we show that the more practical problem of generating samples in a sequential conditional random field for the next slice relying on the samples from the previous slice has high computational cost in the general case, due to the need to estimate a normalization factor for each sample. We propose a new discriminative model, slice normalized dynamic Markov logic networks (SN-DMLN), that suffers from none of these issues. It supports efficient online inference, and can directly model influences between variables within a time slice that do not have a causal direction, in contrast with fully directed models (e.g., DBNs). Experimental results show an improvement in accuracy over previous approaches to online inference in dynamic Markov logic networks. 1

3 0.96184605 138 nips-2012-Fully Bayesian inference for neural models with negative-binomial spiking

Author: James Scott, Jonathan W. Pillow

Abstract: Characterizing the information carried by neural populations in the brain requires accurate statistical models of neural spike responses. The negative-binomial distribution provides a convenient model for over-dispersed spike counts, that is, responses with greater-than-Poisson variability. Here we describe a powerful data-augmentation framework for fully Bayesian inference in neural models with negative-binomial spiking. Our approach relies on a recently described latentvariable representation of the negative-binomial distribution, which equates it to a Polya-gamma mixture of normals. This framework provides a tractable, conditionally Gaussian representation of the posterior that can be used to design efficient EM and Gibbs sampling based algorithms for inference in regression and dynamic factor models. We apply the model to neural data from primate retina and show that it substantially outperforms Poisson regression on held-out data, and reveals latent structure underlying spike count correlations in simultaneously recorded spike trains. 1

same-paper 4 0.96101683 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing

Author: Ashish Kapoor, Raajay Viswanathan, Prateek Jain

Abstract: In this paper, we present a Bayesian framework for multilabel classification using compressed sensing. The key idea in compressed sensing for multilabel classification is to first project the label vector to a lower dimensional space using a random transformation and then learn regression functions over these projections. Our approach considers both of these components in a single probabilistic model, thereby jointly optimizing over compression as well as learning tasks. We then derive an efficient variational inference scheme that provides joint posterior distribution over all the unobserved labels. The two key benefits of the model are that a) it can naturally handle datasets that have missing labels and b) it can also measure uncertainty in prediction. The uncertainty estimate provided by the model allows for active learning paradigms where an oracle provides information about labels that promise to be maximally informative for the prediction task. Our experiments show significant boost over prior methods in terms of prediction performance over benchmark datasets, both in the fully labeled and the missing labels case. Finally, we also highlight various useful active learning scenarios that are enabled by the probabilistic model. 1

5 0.93521941 100 nips-2012-Discriminative Learning of Sum-Product Networks

Author: Robert Gens, Pedro Domingos

Abstract: Sum-product networks are a new deep architecture that can perform fast, exact inference on high-treewidth models. Only generative methods for training SPNs have been proposed to date. In this paper, we present the first discriminative training algorithms for SPNs, combining the high accuracy of the former with the representational power and tractability of the latter. We show that the class of tractable discriminative SPNs is broader than the class of tractable generative ones, and propose an efficient backpropagation-style algorithm for computing the gradient of the conditional log likelihood. Standard gradient descent suffers from the diffusion problem, but networks with many layers can be learned reliably using “hard” gradient descent, where marginal inference is replaced by MPE inference (i.e., inferring the most probable state of the non-evidence variables). The resulting updates have a simple and intuitive form. We test discriminative SPNs on standard image classification tasks. We obtain the best results to date on the CIFAR-10 dataset, using fewer features than prior methods with an SPN architecture that learns local image structure discriminatively. We also report the highest published test accuracy on STL-10 even though we only use the labeled portion of the dataset. 1

6 0.91984934 67 nips-2012-Classification Calibration Dimension for General Multiclass Losses

7 0.8192485 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems

8 0.79800516 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions

9 0.76429904 251 nips-2012-On Lifting the Gibbs Sampling Algorithm

10 0.76001757 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders

11 0.74738353 200 nips-2012-Local Supervised Learning through Space Partitioning

12 0.74582946 293 nips-2012-Relax and Randomize : From Value to Algorithms

13 0.74515444 171 nips-2012-Latent Coincidence Analysis: A Hidden Variable Model for Distance Metric Learning

14 0.72987932 207 nips-2012-Mandatory Leaf Node Prediction in Hierarchical Multilabel Classification

15 0.72196215 197 nips-2012-Learning with Recursive Perceptual Representations

16 0.71922827 279 nips-2012-Projection Retrieval for Classification

17 0.71522534 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization

18 0.70581323 315 nips-2012-Slice sampling normalized kernel-weighted completely random measure mixture models

19 0.69784492 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

20 0.69636691 168 nips-2012-Kernel Latent SVM for Visual Recognition