nips nips2012 nips2012-280 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jesús Cid-sueiro
Abstract: This paper discusses the problem of calibrating posterior class probabilities from partially labelled data. Each instance is assumed to be labelled as belonging to one of several candidate categories, at most one of them being true. We generalize the concept of proper loss to this scenario, we establish a necessary and sufficient condition for a loss function to be proper, and we show a direct procedure to construct a proper loss for partial labels from a conventional proper loss. The problem can be characterized by the mixing probability matrix relating the true class of the data and the observed labels. The full knowledge of this matrix is not required, and losses can be constructed that are proper for a wide set of mixing probability matrices. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Proper losses for learning from partial labels ´ Jesus Cid-Sueiro Department of Signal Theory and Communications Universidad Carlos III de Madrid Legans-Madrid, 28911 Spain jcid@tsc. [sent-1, score-0.63]
2 es Abstract This paper discusses the problem of calibrating posterior class probabilities from partially labelled data. [sent-3, score-0.447]
3 Each instance is assumed to be labelled as belonging to one of several candidate categories, at most one of them being true. [sent-4, score-0.197]
4 We generalize the concept of proper loss to this scenario, we establish a necessary and sufficient condition for a loss function to be proper, and we show a direct procedure to construct a proper loss for partial labels from a conventional proper loss. [sent-5, score-2.229]
5 The problem can be characterized by the mixing probability matrix relating the true class of the data and the observed labels. [sent-6, score-0.475]
6 The full knowledge of this matrix is not required, and losses can be constructed that are proper for a wide set of mixing probability matrices. [sent-7, score-1.075]
7 It arises in many different applications: Cour [1] cites some of them: picture collections containing several faces per image and a caption that only specifies who is in the picture but not which name matches which face, or video collections with labels taken from annotations. [sent-9, score-0.217]
8 In a partially labelled data set, each instance is assigned to a set of candidate categories, at most only one of them true. [sent-10, score-0.243]
9 Other related problems can be interpreted as particular forms of partial labelling: semisupervised learning, or hierarchical classification in databases where some instances could be labelled with respect to parent categories only. [sent-12, score-0.421]
10 It is also a particular case of the more general problems of learning from soft labels [4] or learning from measurements [5]. [sent-13, score-0.202]
11 Several algorithms have been proposed to deal with partial labelling [1] [2] [6] [7] [8]. [sent-14, score-0.25]
12 Though some theoretical work has been addressed in order to analyze the consistency of algorithms [1] or the information provided by uncertain data [8], little effort has been done to analyze the conditions under which the true class can be inferred from partial labels. [sent-15, score-0.261]
13 In this paper we address the problem of estimating posterior class probabilities from partially labelled data. [sent-16, score-0.45]
14 In particular, we obtain general conditions under which the posterior probability of the true class given the observation can be estimated from training data with ambiguous class labels. [sent-17, score-0.303]
15 To do so, we generalize the concept of proper losses to losses that are functions of ambiguous labels, and show that the capability to estimate posterior class probabilities using a given loss depends on the probability matrix relating the ambiguous labels with the true class of the data. [sent-18, score-1.932]
16 Each generalized proper loss can be characterized by the set (a convex polytope) of all admissible probability matrices. [sent-19, score-0.783]
17 Analyzing the structure of these losses is one of the main goals of this paper. [sent-20, score-0.301]
18 Up to our knowledge, the design of proper losses for learning from imperfect labels has not been addressed in the area of Statistical Learning. [sent-21, score-0.971]
19 3 generalizes proper losses to scenarios with ambiguous labels, Sec. [sent-24, score-0.913]
20 4 proposes a procedure to design proper losses for wide sets of mixing matrices, Sec. [sent-25, score-1.014]
21 We will use () to denote a loss based on partial labels, and ˜ to losses based on true labels. [sent-32, score-0.678]
22 The number of classes is c, and the number of possible partial label vectors is d ≤ 2c . [sent-34, score-0.315]
23 2 Learning from partial labels Let X be a sample set, Y = {ec , j = 0, 1, . [sent-36, score-0.329]
24 , c − 1}, a set of labels, and Z ⊂ {0, 1}c a set of j partial labels. [sent-39, score-0.166]
25 Partial label vector z ∈ Z is a noisy version of the true label y ∈ Y. [sent-41, score-0.323]
26 Several authors [1] [6] [7] [8] assume that the true label is always present in z, i. [sent-42, score-0.166]
27 , zj = 1 when yj = 1, but this assumption is not required in our setting, which admits noisy label scenarios (as, for instance, in [2]). [sent-44, score-0.328]
28 Without loss of generality, we assume that Z contains only partial labels with nonzero probability (i. [sent-45, score-0.54]
29 In general, we model the relationship between z and y through an arbitrary d × c conditional mixing probability matrix M(x) with components mij (x) = P {z = bi |yj = 1, x} (1) where bi ∈ Z is the i-th element of Z for some arbitrary ordering. [sent-48, score-0.672]
30 Note that, in general, the mixing matrix could depend on x, though a constant mixing matrix [2] [6] [7] [8] is a common assumption, as well as the statistical independence of the incorrect labels [6] [7] [8]. [sent-49, score-0.759]
31 To do so, a set of partially labelled samples, S = {(xk , zk ), k = 1, . [sent-52, score-0.243]
32 We will illustrate different partial label scenarios with a 3-class problem. [sent-57, score-0.374]
33 Consider that each column of MT corresponds to a label pattern (z0 , z1 , z2 ) following the ordering (0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0), (1, 0, 1), (0, 1, 1), (1, 1, 1) (e. [sent-58, score-0.179]
34 knowing the scenario and the value of parameters α, β and γ), we can estimate accurate posterior class probabilities from partially labelled data in all these cases, however, is it possible if α, β and γ are unknown? [sent-68, score-0.516]
35 In the positive case, no information is lost by the partial label process for infinite sample sizes. [sent-70, score-0.311]
36 In the negative case, some performance is lost as a consequence of the mixing process, that persists even for infinite sample sizes 1 2. [sent-71, score-0.266]
37 3 Inference through partial label probabilities If the mixing matrix is known, a conceptually simple strategy to solve the partial label problem consists of estimating posterior partial label probabilities, using them to estimate posterior class probabilities and predict y. [sent-72, score-1.484]
38 Thus, a first condition to estimate η from p given M is that the conditional mixing matrix has a left inverse (i. [sent-74, score-0.298]
39 There are some trivial cases where the mixing matrix has no pseudoinverse (for instance, if P {z|y, x} = P {z|x}, all rows in M(x) are equal, and MT (x)M(x) is a rank 1 matrix, which has no inverse), but these are degenerate cases of no practical interest. [sent-77, score-0.364]
40 3 Loss functions for posterior probability estimation The estimation of posterior probabilities from labelled data is a well known problem in statistics and machine learning, that has received some recent attention in the machine learning literature [9] [10]. [sent-82, score-0.478]
41 ˆ In order to estimate posteriors from labelled data, a loss function ˜(y, η ) is required such that η is ˆ a member of arg minη Ey { ˜(y, η )}. [sent-83, score-0.394]
42 Losses satisfying this property are said to be Fisher consistent ˆ and are known as proper scoring rules. [sent-84, score-0.545]
43 A loss is strictly proper if η is the only member of this set. [sent-85, score-0.761]
44 ˆ A loss is regular if it is finite for any y, except possibly that ˜(y, η ) = ∞ if yj = 1 and ηj = 0. [sent-86, score-0.24]
45 1 A regular scoring rule ˜ : Z × Pc → R is (strictly) proper if and only if ˜(y, η ) = h(ˆ ) + g(ˆ )(y − η ) ˆ ˆ η η (4) ˆ where h is a (strictly) concave function and g(ˆ ) is a supergradient of h at the point η , for all η ˆ η ∈ Pc . [sent-88, score-0.649]
46 1 If the sample size is large (in particular for scenarios C and D), one could think of simply ignoring samples with imperfect labels, and training the classifier with the samples whose class is known. [sent-89, score-0.193]
47 η In order to deal with partial labels, we generalize proper losses as follows Definition Let y and z be random vectors taking values in Y and Z, respectively. [sent-92, score-1.016]
48 A scoring rule ˆ (z, η ) is proper to estimate η (with components ηj = P {yj = 1}) from z if ˆ η ∈ arg min Ez { (z, η )} (5) ˆ η It is strictly proper if η is the only member of this set. [sent-93, score-1.213]
49 This generalized family of proper scoring rules can be characterized by the following. [sent-94, score-0.604]
50 2 Scoring rule (z, η ) is (strictly) proper to estimate η from z if and only if the equivalent loss ˜(y, η ) = yT MT l(ˆ ), ˆ η (6) ˆ where l(ˆ ) is a vector with components i (ˆ ) = (bi , η ) and bi is the i-th element in Z (according η η to some arbitrary ordering), is (strictly) proper. [sent-96, score-0.825]
51 (7) is (strictly) proper with Note that, defining vector ˜ η ) with components ˜j (ˆ ) = ˜(ec , η ), we can write l(ˆ η j ˆ ˜ η ) = MT l(ˆ ) l(ˆ η (8) We will use this vector representation of losses extensively in the following. [sent-98, score-0.794]
52 2 states that the proper character of a loss for estimating η from z depends on M. [sent-101, score-0.641]
53 For this ˆ reason, in the following we will say that (z, η ) is M-proper if it is proper to estimate η from z. [sent-102, score-0.451]
54 4 Proper losses for sets of mixing matrices Eq. [sent-103, score-0.619]
55 For any given M and any given equivalent loss ˜ η ), there is an uncountable number of losses l(ˆ ) l(ˆ η satisfying (8). [sent-106, score-0.487]
56 Example Let ˜ be an arbitrary proper loss for a 3-class problem. [sent-107, score-0.636]
57 ˆ Also, for any (z, η ), there are different mixing matrices such that the equivalent loss is the same. [sent-109, score-0.479]
58 In general, if l(ˆ ) is M-proper and N-proper with equivalent loss ˜ η ), then it is also Q-proper with η l(ˆ the same equivalent loss, for any Q in the form Q = M(I − D) + ND (13) where D is a diagonal nonnegative matrix (note that Q is a probability matrix, because QT 1d = 1c ). [sent-111, score-0.297]
59 Example Assuming diagonal D, if M and N are the mixing matrices defined in (11) and (12), respectively, the loss (9) is Q-proper for any mixing matrix Q in the form (13). [sent-113, score-0.752]
60 1 Building proper losses from ambiguity sets The ambiguity on M for a given loss l(ˆ ) can be used to deal with the second problem mentioned η in Sec. [sent-116, score-1.36]
61 3: in general, the mixing matrix may be unknown, or, even if it is known, it may depend on the observation, x. [sent-118, score-0.298]
62 Thus, we need a procedure to design losses that are proper for a wide family of mixing matrices. [sent-119, score-0.989]
63 In general, given a set of mixing matrices, Q, we will say that is Q-proper if it is M-proper for any M ∈ Q The following result provides a way to construct a proper loss conventional proper loss ˜. [sent-120, score-1.492]
64 1 For 0 ≤ j ≤ c − 1, let Vj = {vi ∈ Pd , 1 ≤ i ≤ nj } be a set of nj > 0 probability c−1 vectors with dimension d, such that j=0 nj = d and span(∪c−1 Vj ) = Rd and let Q = {M ∈ j=0 M : Mec ∈ span(Vj ) ∩ Pd }. [sent-122, score-0.469]
65 j ˆ ˆ Then, for any (strictly) proper loss ˜(y, η ), there exists a loss (z, η ) which is (strictly) Q-proper. [sent-123, score-0.773]
66 Therefore, it is also proper for any affine combination of matrices in R inside Pd . [sent-130, score-0.507]
67 1 shows that we can construct proper losses for learning from partial labels by specifying the points of sets Vj , j = 0, . [sent-136, score-1.137]
68 Each of these sets defines an ambiguity set Aj = span(Vj ) ∩ Pd which represents all admissible conditional distributions for P (z|yj = 1). [sent-140, score-0.334]
69 If the columns of the true mixing matrix M are members of the ambiguity set, the resulting loss can be used to estimate posterior class probabilities from the observed partial labels. [sent-141, score-1.099]
70 Thus, a general procedure to design a loss function for learning from partial labels is: ˆ 1. [sent-142, score-0.49]
71 Define the ambiguity sets by choosing, for each class j, a set Vj of nj linearly independent basis vectors for each class. [sent-144, score-0.51]
72 Construct binary matrix U with uji = 1 if the i-th column of V is in Vj , and uji = 0 otherwise. [sent-149, score-0.268]
73 These vertices must have a set of at least nj − 1 zero components which cannot be a set of zeros in any other vertex. [sent-152, score-0.218]
74 This has two consequences: (1) we can define the ambiguity sets from these vertices, and (2), the choice is not unique, because the number of vertices can be higher than nj − 1. [sent-153, score-0.396]
75 ˆ If proper loss ˜(y, η ) is non degenerate, Q contains all mixing matrices for which a loss is proper: ˆ Theorem 4. [sent-154, score-1.091]
76 Proof Since the columns of V are in the ambiguity sets and form a basis of Rd , span(∪c−1 Aj ) = j=0 c Rd . [sent-158, score-0.31]
77 2 Virtual labels The analysis above shows a procedure to construct proper losses from ambiguity sets. [sent-164, score-1.141]
78 The main result of this section is to show that (16) is actually a universal representation, in the sense that any proper loss can be represented in this form, and we generalize the Savage’s representation providing and explicit formula for Q-proper losses. [sent-165, score-0.645]
79 3 Scoring rule (z, η ) is (strictly) Q-proper for some matrix set Q with equivalent loss ˜(y, η ) if and only if ˆ ˆ ˆ (z, η ) = h(ˆ ) + g(ˆ )T (UT V−1 z − η ). [sent-167, score-0.273]
80 Moreover, the ambiguity set of class j is Aj = span(Vj ), where Vj is the set of all columns in V such that uji = 1. [sent-169, score-0.378]
81 Comparing (4) with (17), the effect of imperfect labelling becomes clear: the unknown true label y ˜ is replaced by a virtual label y = UT V−1 z, which is a linear combination of the partial labels. [sent-171, score-0.615]
82 5 Estimation Errors If the true mixing matrix M is not in Q, a Q-proper loss may fail to estimate η. [sent-177, score-0.509]
83 Then ˆ L(η, η ) = η T NT (VT )−1 U˜ η ) + η T˜ η ) l(ˆ l(ˆ (19) Example The effect of a bad choice of the ambiguity set can be illustrated using the loss in (9) in ˆ two cases: ˜j (ˆ ) = ec − η 2 (the square error) and ˜j (ˆ ) = − ln(ˆj ) (the cross entropy). [sent-181, score-0.558]
84 As we l η l η η j have discussed before, loss (9) is proper for any scenario where label z contains the true class and possibly another class taken at random. [sent-182, score-0.935]
85 Let us assume that the true mixing matrix is M= 0. [sent-183, score-0.348]
86 2 T (20) (were each column of MT corresponds to a label vector (z0 , z1 , z2 ) following the ordering (1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0), (1, 0, 1), (0, 1, 1). [sent-192, score-0.179]
87 1 shows the expected loss in (19) for the square ˆ error (left) and the cross entropy (right), as a function of η over the probability simplex P3 , for η = (0. [sent-194, score-0.291]
88 Since M ∈ Q, the estimated posterior minimizing expected loss, η ∗ (which / ˆ is unique because both losses are strictly proper), does not coincide with the true posterior. [sent-198, score-0.533]
89 ˆ Figure 1: Square loss (left) and cross entropy (right) in the probability simplex, as a function of η for η = (0. [sent-199, score-0.242]
90 4)T ˆ It is important to note that the minimum η ∗ does not depend on the choice of the cost and, thus, the estimation error is invariant to the choice of the strict proper loss (though this could be not true when η is estimated from an empirical distribution). [sent-202, score-0.689]
91 This is because, using (19) and noting that the expected proper loss is . [sent-203, score-0.612]
92 (23) If ˜ is proper but not strictly proper, the minimum may be not unique. [sent-205, score-0.564]
93 Therefore, those values T −1 of η with η and U V Mη in the same decision region are not influenced by a bad choice of the ambiguity set. [sent-207, score-0.22]
94 Therefore, a wrong choice of the ambiguity set always changes the boundary decision. [sent-209, score-0.195]
95 Summarizing, the ambiguity set for probability estimation is not larger than that for classification. [sent-210, score-0.247]
96 6 Conclusions In this paper we have generalized proper losses to deal with scenarios with partial labels. [sent-211, score-1.042]
97 Proper losses based on partial labels can be designed to cope with different mixing matrices. [sent-212, score-0.867]
98 We have also generalized the Savage’s representation of proper losses to obtain an explicit expression for proper losses as a function of a concave generator. [sent-213, score-1.535]
99 3 ˆ ˆ Let us assume that (z, η ) is (strictly) Q-proper for some matrix set Q with equivalent loss ˜(y, η ). [sent-215, score-0.247]
100 Raftery, “Strictly proper scoring rules, prediction, and estimation,” Journal of the American Statistical Association, vol. [sent-309, score-0.545]
wordName wordTfidf (topN-words)
[('proper', 0.451), ('losses', 0.301), ('mt', 0.28), ('mixing', 0.237), ('labelled', 0.197), ('ambiguity', 0.195), ('partial', 0.166), ('labels', 0.163), ('loss', 0.161), ('vj', 0.156), ('pd', 0.156), ('ec', 0.151), ('savage', 0.142), ('nj', 0.137), ('vt', 0.123), ('label', 0.116), ('admissible', 0.114), ('strictly', 0.113), ('span', 0.105), ('aj', 0.097), ('bi', 0.096), ('scoring', 0.094), ('scenarios', 0.092), ('uji', 0.087), ('ey', 0.084), ('yj', 0.079), ('ambiguous', 0.069), ('posterior', 0.069), ('mij', 0.067), ('probabilities', 0.064), ('matrix', 0.061), ('imperfect', 0.056), ('matrices', 0.056), ('ut', 0.055), ('mec', 0.054), ('knowing', 0.053), ('labelling', 0.052), ('columns', 0.051), ('true', 0.05), ('simplex', 0.049), ('qj', 0.048), ('ez', 0.048), ('cour', 0.047), ('imprecise', 0.047), ('djj', 0.047), ('supergradient', 0.047), ('partially', 0.046), ('class', 0.045), ('zt', 0.044), ('mq', 0.044), ('elicitation', 0.044), ('components', 0.042), ('scenario', 0.042), ('pseudoinverse', 0.041), ('noisy', 0.041), ('soft', 0.039), ('vertices', 0.039), ('basis', 0.039), ('member', 0.036), ('linearly', 0.036), ('pc', 0.036), ('qi', 0.034), ('column', 0.033), ('vectors', 0.033), ('generalize', 0.033), ('characterized', 0.032), ('deal', 0.032), ('boldface', 0.032), ('construct', 0.031), ('qt', 0.031), ('concave', 0.031), ('virtual', 0.03), ('entropy', 0.03), ('ordering', 0.03), ('polytope', 0.03), ('situation', 0.029), ('categories', 0.029), ('lost', 0.029), ('semisupervised', 0.029), ('estimating', 0.029), ('unknown', 0.029), ('proof', 0.029), ('rules', 0.027), ('picture', 0.027), ('consequences', 0.027), ('estimation', 0.027), ('discusses', 0.026), ('cross', 0.026), ('rule', 0.026), ('contains', 0.025), ('theorem', 0.025), ('bad', 0.025), ('equivalent', 0.025), ('sets', 0.025), ('probability', 0.025), ('degenerate', 0.025), ('wj', 0.025), ('relating', 0.025), ('arbitrary', 0.024), ('sapp', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000007 280 nips-2012-Proper losses for learning from partial labels
Author: Jesús Cid-sueiro
Abstract: This paper discusses the problem of calibrating posterior class probabilities from partially labelled data. Each instance is assumed to be labelled as belonging to one of several candidate categories, at most one of them being true. We generalize the concept of proper loss to this scenario, we establish a necessary and sufficient condition for a loss function to be proper, and we show a direct procedure to construct a proper loss for partial labels from a conventional proper loss. The problem can be characterized by the mixing probability matrix relating the true class of the data and the observed labels. The full knowledge of this matrix is not required, and losses can be constructed that are proper for a wide set of mixing probability matrices. 1
2 0.1360103 67 nips-2012-Classification Calibration Dimension for General Multiclass Losses
Author: Harish G. Ramaswamy, Shivani Agarwal
Abstract: We study consistency properties of surrogate loss functions for general multiclass classification problems, defined by a general loss matrix. We extend the notion of classification calibration, which has been studied for binary and multiclass 0-1 classification problems (and for certain other specific learning problems), to the general multiclass setting, and derive necessary and sufficient conditions for a surrogate loss to be classification calibrated with respect to a loss matrix in this setting. We then introduce the notion of classification calibration dimension of a multiclass loss matrix, which measures the smallest ‘size’ of a prediction space for which it is possible to design a convex surrogate that is classification calibrated with respect to the loss matrix. We derive both upper and lower bounds on this quantity, and use these results to analyze various loss matrices. In particular, as one application, we provide a different route from the recent result of Duchi et al. (2010) for analyzing the difficulty of designing ‘low-dimensional’ convex surrogates that are consistent with respect to pairwise subset ranking losses. We anticipate the classification calibration dimension may prove to be a useful tool in the study and design of surrogate losses for general multiclass learning problems. 1
3 0.12940398 283 nips-2012-Putting Bayes to sleep
Author: Dmitry Adamskiy, Manfred K. Warmuth, Wouter M. Koolen
Abstract: We consider sequential prediction algorithms that are given the predictions from a set of models as inputs. If the nature of the data is changing over time in that different models predict well on different segments of the data, then adaptivity is typically achieved by mixing into the weights in each round a bit of the initial prior (kind of like a weak restart). However, what if the favored models in each segment are from a small subset, i.e. the data is likely to be predicted well by models that predicted well before? Curiously, fitting such “sparse composite models” is achieved by mixing in a bit of all the past posteriors. This self-referential updating method is rather peculiar, but it is efficient and gives superior performance on many natural data sets. Also it is important because it introduces a long-term memory: any model that has done well in the past can be recovered quickly. While Bayesian interpretations can be found for mixing in a bit of the initial prior, no Bayesian interpretation is known for mixing in past posteriors. We build atop the “specialist” framework from the online learning literature to give the Mixing Past Posteriors update a proper Bayesian foundation. We apply our method to a well-studied multitask learning problem and obtain a new intriguing efficient update that achieves a significantly better bound. 1
4 0.12170276 169 nips-2012-Label Ranking with Partial Abstention based on Thresholded Probabilistic Models
Author: Weiwei Cheng, Willem Waegeman, Volkmar Welker, Eyke Hüllermeier
Abstract: Several machine learning methods allow for abstaining from uncertain predictions. While being common for settings like conventional classification, abstention has been studied much less in learning to rank. We address abstention for the label ranking setting, allowing the learner to declare certain pairs of labels as being incomparable and, thus, to predict partial instead of total orders. In our method, such predictions are produced via thresholding the probabilities of pairwise preferences between labels, as induced by a predicted probability distribution on the set of all rankings. We formally analyze this approach for the Mallows and the Plackett-Luce model, showing that it produces proper partial orders as predictions and characterizing the expressiveness of the induced class of partial orders. These theoretical results are complemented by experiments demonstrating the practical usefulness of the approach. 1
5 0.1074124 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)
Author: Nicolò Cesa-bianchi, Pierre Gaillard, Gabor Lugosi, Gilles Stoltz
Abstract: Mirror descent with an entropic regularizer is known to achieve shifting regret bounds that are logarithmic in the dimension. This is done using either a carefully designed projection or by a weight sharing technique. Via a novel unified analysis, we show that these two approaches deliver essentially equivalent bounds on a notion of regret generalizing shifting, adaptive, discounted, and other related regrets. Our analysis also captures and extends the generalized weight sharing technique of Bousquet and Warmuth, and can be refined in several ways, including improvements for small losses and adaptive tuning of parameters. 1
6 0.10682786 227 nips-2012-Multiclass Learning with Simplex Coding
7 0.1004067 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing
8 0.095511489 16 nips-2012-A Polynomial-time Form of Robust Regression
9 0.090831511 5 nips-2012-A Conditional Multinomial Mixture Model for Superset Label Learning
10 0.087281741 278 nips-2012-Probabilistic n-Choose-k Models for Classification and Ranking
11 0.080718786 199 nips-2012-Link Prediction in Graphs with Autoregressive Features
12 0.080671996 138 nips-2012-Fully Bayesian inference for neural models with negative-binomial spiking
13 0.076576971 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback
14 0.075783253 111 nips-2012-Efficient Sampling for Bipartite Matching Problems
15 0.073006414 246 nips-2012-Nonparametric Max-Margin Matrix Factorization for Collaborative Prediction
16 0.072979711 359 nips-2012-Variational Inference for Crowdsourcing
17 0.072469972 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models
18 0.066672415 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions
19 0.0624387 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs
20 0.061005183 293 nips-2012-Relax and Randomize : From Value to Algorithms
topicId topicWeight
[(0, 0.182), (1, 0.021), (2, 0.057), (3, 0.02), (4, 0.05), (5, -0.039), (6, 0.024), (7, 0.11), (8, -0.028), (9, 0.055), (10, 0.014), (11, 0.056), (12, -0.006), (13, -0.01), (14, 0.054), (15, -0.022), (16, 0.036), (17, -0.035), (18, -0.004), (19, -0.004), (20, -0.059), (21, 0.076), (22, -0.117), (23, 0.115), (24, -0.055), (25, 0.054), (26, -0.062), (27, -0.032), (28, -0.058), (29, -0.036), (30, -0.011), (31, -0.002), (32, 0.11), (33, 0.037), (34, 0.054), (35, -0.062), (36, 0.052), (37, -0.018), (38, -0.066), (39, 0.004), (40, 0.017), (41, -0.032), (42, 0.038), (43, -0.1), (44, 0.044), (45, -0.062), (46, 0.085), (47, 0.094), (48, 0.057), (49, 0.048)]
simIndex simValue paperId paperTitle
same-paper 1 0.96535832 280 nips-2012-Proper losses for learning from partial labels
Author: Jesús Cid-sueiro
Abstract: This paper discusses the problem of calibrating posterior class probabilities from partially labelled data. Each instance is assumed to be labelled as belonging to one of several candidate categories, at most one of them being true. We generalize the concept of proper loss to this scenario, we establish a necessary and sufficient condition for a loss function to be proper, and we show a direct procedure to construct a proper loss for partial labels from a conventional proper loss. The problem can be characterized by the mixing probability matrix relating the true class of the data and the observed labels. The full knowledge of this matrix is not required, and losses can be constructed that are proper for a wide set of mixing probability matrices. 1
2 0.73257977 67 nips-2012-Classification Calibration Dimension for General Multiclass Losses
Author: Harish G. Ramaswamy, Shivani Agarwal
Abstract: We study consistency properties of surrogate loss functions for general multiclass classification problems, defined by a general loss matrix. We extend the notion of classification calibration, which has been studied for binary and multiclass 0-1 classification problems (and for certain other specific learning problems), to the general multiclass setting, and derive necessary and sufficient conditions for a surrogate loss to be classification calibrated with respect to a loss matrix in this setting. We then introduce the notion of classification calibration dimension of a multiclass loss matrix, which measures the smallest ‘size’ of a prediction space for which it is possible to design a convex surrogate that is classification calibrated with respect to the loss matrix. We derive both upper and lower bounds on this quantity, and use these results to analyze various loss matrices. In particular, as one application, we provide a different route from the recent result of Duchi et al. (2010) for analyzing the difficulty of designing ‘low-dimensional’ convex surrogates that are consistent with respect to pairwise subset ranking losses. We anticipate the classification calibration dimension may prove to be a useful tool in the study and design of surrogate losses for general multiclass learning problems. 1
3 0.6454258 227 nips-2012-Multiclass Learning with Simplex Coding
Author: Youssef Mroueh, Tomaso Poggio, Lorenzo Rosasco, Jean-jeacques Slotine
Abstract: In this paper we discuss a novel framework for multiclass learning, defined by a suitable coding/decoding strategy, namely the simplex coding, that allows us to generalize to multiple classes a relaxation approach commonly used in binary classification. In this framework, we develop a relaxation error analysis that avoids constraints on the considered hypotheses class. Moreover, using this setting we derive the first provably consistent regularized method with training/tuning complexity that is independent to the number of classes. We introduce tools from convex analysis that can be used beyond the scope of this paper. 1
4 0.60330468 217 nips-2012-Mixability in Statistical Learning
Author: Tim V. Erven, Peter Grünwald, Mark D. Reid, Robert C. Williamson
Abstract: Statistical learning and sequential prediction are two different but related formalisms to study the quality of predictions. Mapping out their relations and transferring ideas is an active area of investigation. We provide another piece of the puzzle by showing that an important concept in sequential prediction, the mixability of a loss, has a natural counterpart in the statistical setting, which we call stochastic mixability. Just as ordinary mixability characterizes fast rates for the worst-case regret in sequential prediction, stochastic mixability characterizes fast rates in statistical learning. We show that, in the special case of log-loss, stochastic mixability reduces to a well-known (but usually unnamed) martingale condition, which is used in existing convergence theorems for minimum description length and Bayesian inference. In the case of 0/1-loss, it reduces to the margin condition of Mammen and Tsybakov, and in the case that the model under consideration contains all possible predictors, it is equivalent to ordinary mixability. 1
5 0.57017922 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs
Author: Aharon Birnbaum, Shai S. Shwartz
Abstract: Given α, ϵ, we study the time complexity required to improperly learn a halfspace with misclassification error rate of at most (1 + α) L∗ + ϵ, where L∗ is the γ γ optimal γ-margin error rate. For α = 1/γ, polynomial time and sample complexity is achievable using the hinge-loss. For α = 0, Shalev-Shwartz et al. [2011] showed that poly(1/γ) time is impossible, while learning is possible in ˜ time exp(O(1/γ)). An immediate question, which this paper tackles, is what is achievable if α ∈ (0, 1/γ). We derive positive results interpolating between the polynomial time for α = 1/γ and the exponential time for α = 0. In particular, we show that there are cases in which α = o(1/γ) but the problem is still solvable in polynomial time. Our results naturally extend to the adversarial online learning model and to the PAC learning with malicious noise model. 1
6 0.56938905 130 nips-2012-Feature-aware Label Space Dimension Reduction for Multi-label Classification
7 0.56238288 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification
8 0.54161596 283 nips-2012-Putting Bayes to sleep
9 0.54000324 278 nips-2012-Probabilistic n-Choose-k Models for Classification and Ranking
10 0.53795737 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models
11 0.52544916 323 nips-2012-Statistical Consistency of Ranking Methods in A Rank-Differentiable Probability Space
12 0.52160341 16 nips-2012-A Polynomial-time Form of Robust Regression
13 0.51389694 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)
14 0.51119977 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders
15 0.50600457 169 nips-2012-Label Ranking with Partial Abstention based on Thresholded Probabilistic Models
16 0.49357039 29 nips-2012-Accelerated Training for Matrix-norm Regularization: A Boosting Approach
17 0.46690553 361 nips-2012-Volume Regularization for Binary Classification
18 0.45533109 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing
19 0.44563961 246 nips-2012-Nonparametric Max-Margin Matrix Factorization for Collaborative Prediction
20 0.43896824 34 nips-2012-Active Learning of Multi-Index Function Models
topicId topicWeight
[(0, 0.038), (21, 0.037), (38, 0.098), (39, 0.016), (41, 0.172), (42, 0.022), (53, 0.013), (54, 0.044), (55, 0.025), (74, 0.045), (76, 0.2), (80, 0.174), (92, 0.034)]
simIndex simValue paperId paperTitle
1 0.94468743 72 nips-2012-Cocktail Party Processing via Structured Prediction
Author: Yuxuan Wang, Deliang Wang
Abstract: While human listeners excel at selectively attending to a conversation in a cocktail party, machine performance is still far inferior by comparison. We show that the cocktail party problem, or the speech separation problem, can be effectively approached via structured prediction. To account for temporal dynamics in speech, we employ conditional random fields (CRFs) to classify speech dominance within each time-frequency unit for a sound mixture. To capture complex, nonlinear relationship between input and output, both state and transition feature functions in CRFs are learned by deep neural networks. The formulation of the problem as classification allows us to directly optimize a measure that is well correlated with human speech intelligibility. The proposed system substantially outperforms existing ones in a variety of noises.
2 0.93510985 345 nips-2012-Topic-Partitioned Multinetwork Embeddings
Author: Peter Krafft, Juston Moore, Bruce Desmarais, Hanna M. Wallach
Abstract: We introduce a new Bayesian admixture model intended for exploratory analysis of communication networks—specifically, the discovery and visualization of topic-specific subnetworks in email data sets. Our model produces principled visualizations of email networks, i.e., visualizations that have precise mathematical interpretations in terms of our model and its relationship to the observed data. We validate our modeling assumptions by demonstrating that our model achieves better link prediction performance than three state-of-the-art network models and exhibits topic coherence comparable to that of latent Dirichlet allocation. We showcase our model’s ability to discover and visualize topic-specific communication patterns using a new email data set: the New Hanover County email network. We provide an extensive analysis of these communication patterns, leading us to recommend our model for any exploratory analysis of email networks or other similarly-structured communication data. Finally, we advocate for principled visualization as a primary objective in the development of new network models. 1
same-paper 3 0.88441503 280 nips-2012-Proper losses for learning from partial labels
Author: Jesús Cid-sueiro
Abstract: This paper discusses the problem of calibrating posterior class probabilities from partially labelled data. Each instance is assumed to be labelled as belonging to one of several candidate categories, at most one of them being true. We generalize the concept of proper loss to this scenario, we establish a necessary and sufficient condition for a loss function to be proper, and we show a direct procedure to construct a proper loss for partial labels from a conventional proper loss. The problem can be characterized by the mixing probability matrix relating the true class of the data and the observed labels. The full knowledge of this matrix is not required, and losses can be constructed that are proper for a wide set of mixing probability matrices. 1
4 0.84879321 279 nips-2012-Projection Retrieval for Classification
Author: Madalina Fiterau, Artur Dubrawski
Abstract: In many applications, classification systems often require human intervention in the loop. In such cases the decision process must be transparent and comprehensible, simultaneously requiring minimal assumptions on the underlying data distributions. To tackle this problem, we formulate an axis-aligned subspace-finding task under the assumption that query specific information dictates the complementary use of the subspaces. We develop a regression-based approach called RECIP that efficiently solves this problem by finding projections that minimize a nonparametric conditional entropy estimator. Experiments show that the method is accurate in identifying the informative projections of the dataset, picking the correct views to classify query points, and facilitates visual evaluation by users. 1 Introduction and problem statement In the domain of predictive analytics, many applications which keep human users in the loop require the use of simple classification models. Often, it is required that a test-point be ‘explained’ (classified) using a simple low-dimensional projection of the original feature space. This is a Projection Retrieval for Classification problem (PRC). The interaction with the user proceeds as follows: the user provides the system a query point; the system searches for a projection in which the point can be accurately classified; the system displays the classification result as well as an illustration of how the classification decision was reached in the selected projection. Solving the PRC problem is relevant in many practical applications. For instance, consider a nuclear threat detection system installed at a border check point. Vehicles crossing the border are scanned with sensors so that a large array of measurements of radioactivity and secondary contextual information is being collected. These observations are fed into a classification system that determines whether the scanned vehicle may carry a threat. Given the potentially devastating consequences of a false negative, a border control agent is requested to validate the prediction and decide whether to submit the vehicle for a costly further inspection. With the positive classification rate of the system under strict bounds because of limitations in the control process, the risk of false negatives is increased. Despite its crucial role, human intervention should only be withheld for cases in which there are reasons to doubt the validity of classification. In order for a user to attest the validity of a decision, the user must have a good understanding of the classification process, which happens more readily when the classifier only uses the original dataset features rather than combinations of them, and when the discrimination models are low-dimensional. In this context, we aim to learn a set of classifiers in low-dimensional subspaces and a decision function which selects the subspace under which a test point is to be classified. Assume we are given a dataset {(x1 , y1 ) . . . (xn , yn )} ∈ X n × {0, 1}n and a class of discriminators H. The model will contain a set Π of subspaces of X ; Π ⊆ Π, where Π is the set of all axis-aligned subspaces of the original feature space, the power set of the features. To each projection πi ∈ Π corresponds one discriminator from a given hypothesis space hi ∈ H. It will also contain a selection function g : X → Π × H, which yields, for a query point x, the projection/discriminator pair with which this point will be classified. The notation π(x) refers to the projection of the point x onto the subspace 1 π while h(π(x)) represents the predicted label for x. Formally, we describe the model class as Md = {Π = {π : π ∈ Π, dim(π) ≤ d}, H = {hi : hi ∈ H, h : πi → Y, ∀i = 1 . . . |Π|}, g ∈ {f : X → {1 . . . |Π|}} . where dim(π) presents the dimensionality of the subspace determined by the projection π. Note that only projections up to size d will be considered, where d is a parameter specific to the application. The set H contains one discriminator from the hypothesis class H for each projection. Intuitively, the aim is to minimize the expected classification error over Md , however, a notable modification is that the projection and, implicitly, the discriminator, are chosen according to the data point that needs to be classified. Given a query x in the space X , g(x) will yield the subspace πg(x) onto which the query is projected and the discriminator hg(x) for it. Distinct test points can be handled using different combinations of subspaces and discriminators. We consider models that minimize 0/1 loss. Hence, the PRC problem can be stated as follows: M ∗ = arg min EX ,Y y = hg(x) (πg(x) (x)) M ∈Md There are limitations to the type of selection function g that can be learned. A simple example for which g can be recovered is a set of signal readings x for which, if one of the readings xi exceeds a threshold ti , the label can be predicted just based on xi . A more complex one is a dataset containing regulatory variables, that is, for xi in the interval [ak , bk ] the label only depends on (x1 . . . xnk ) k k datasets that fall into the latter category fulfill what we call the Subspace-Separability Assumption. This paper proposes an algorithm called RECIP that solves the PRC problem for a class of nonparametric classifiers. We evaluate the method on artificial data to show that indeed it correctly identifies the underlying structure for data satisfying the Subspace-Separability Assumption. We show some case studies to illustrate how RECIP offers insight into applications requiring human intervention. The use of dimensionality reduction techniques is a common preprocessing step in applications where the use of simplified classification models is preferable. Methods that learn linear combinations of features, such as Linear Discriminant Analysis, are not quite appropriate for the task considered here, since we prefer to natively rely on the dimensions available in the original feature space. Feature selection methods, such as e.g. lasso, are suitable for identifying sets of relevant features, but do not consider interactions between them. Our work better fits the areas of class dependent feature selection and context specific classification, highly connected to the concept of Transductive Learning [6]. Other context-sensitive methods are Lazy and Data-Dependent Decision Trees, [5] and [10] respectively. In Ting et al [14], the Feating submodel selection relies on simple attribute splits followed by fitting local predictors, though the algorithm itself is substantially different. Obozinski et al present a subspace selection method in the context of multitask learning [11]. Go et al propose a joint method for feature selection and subspace learning [7], however, their classification model is not particularly query specific. Alternatively, algorithms that transform complex or unintelligible models with user-friendly equivalents have been proposed [3, 2, 1, 8]. Algorithms specifically designed to yield understandable models are a precious few. Here we note a rule learning method described in [12], even though the resulting rules can make visualization difficult, while itemset mining [9] is not specifically designed for classification. Unlike those approaches, our method is designed to retrieve subsets of the feature space designed for use in a way that is complementary to the basic task at hand (classification) while providing query-specific information. 2 Recovering informative projections with RECIP To solve PRC, we need means by which to ascertain which projections are useful in terms of discriminating data from the two classes. Since our model allows the use of distinct projections depending on the query point, it is expected that each projection would potentially benefit different areas of the feature space. A(π) refers to the area of the feature space where the projection π is selected. A(π) = {x ∈ X : πg(x) = π} The objective becomes min E(X ×Y) y = hg(x) (πg(x) (x)) M ∈Md = p(A(π))E y = hg(x) (πg(x) (x))|x ∈ A(π) min M ∈Md 2 π∈Π . The expected classification error over A(π) is linked to the conditional entropy of Y |X. Fano’s inequality provides a lower bound on the error while Feder and Merhav [4] derive a tight upper bound on the minimal error probability in terms of the entropy. This means that conditional entropy characterizes the potential of a subset of the feature space to separate data, which is more generic than simply quantifying classification accuracy for a specific discriminator. In view of this connection between classification accuracy and entropy, we adapt the objective to: p(A(π))H(Y |π(X); X ∈ A(π)) min M ∈Md (1) π∈Π The method we propose optimizes an empirical analog of (1) which we develop below and for which we will need the following result. Proposition 2.1. Given a continuous variable X ∈ X and a binary variable Y , where X is sampled from the mixture model f (x) = p(y = 0)f0 (x) + p(y = 1)f1 (x) = p0 f0 (x) + p1 f1 (x) , then H(Y |X) = −p0 log p0 − p1 log p1 − DKL (f0 ||f ) − DKL (f1 ||f ) . Next, we will use the nonparametric estimator presented in [13] for Tsallis α-divergence. Given samples Ui ∼ U, with i = 1, n and Vj ∼ V with j = 1, m, the divergence is estimated as follows: ˆ Tα (U ||V ) = 1 1 1−α n n (n − 1)νk (Ui , U \ ui )d mνk (Ui , V )d i=1 1−α B(k, α) − 1 , (2) where d is the dimensionality of the variables U and V and νk (z, Z) represents the distance from z ˆ to its k th nearest neighbor of the set of points Z. For α ≈ 1 and n → ∞, Tα (u||v) ≈ DKL (u||v). 2.1 Local estimators of entropy We will now plug (2) in the formula obtained by Proposition 2.1 to estimate the quantity (1). We use the notation X0 to represent the n0 samples from X which have the labels Y equal to 0, and X1 to represent the n1 samples from X which have the labels set to 1. Also, Xy(x) represents the set of samples that have labels equal to the label of x and X¬y(x) the data that have labels opposite to the label of x. ˆ ˆ x ˆ x H(Y |X; X ∈ A) = −H(p0 ) − H(p1 ) − T (f0 ||f x ) − T (f1 ||f x ) + C α≈1 ˆ H(Y |X; X ∈ A) ∝ 1 n0 + 1 n1 ∝ 1 n0 + 1 n1 ∝ 1 n n0 (n0 − 1)νk (xi , X0 \ xi )d nνk (xi , X \ xi )d 1−α I[xi ∈ A] (n1 − 1)νk (xi , X1 \ xi )d nνk (xi , X \ xi )d 1−α I[xi ∈ A] (n0 − 1)νk (xi , X0 \ xi )d nνk (xi , X1 \ xi )d 1−α I[xi ∈ A] (n1 − 1)νk (xi , X1 \ xi )d nνk (xi , X0 \ xi )d 1−α I[xi ∈ A] i=1 n1 i=1 n0 i=1 n1 i=1 n I[xi ∈ A] i=1 (n − 1)νk (xi , Xy(xi ) \ xi )d nνk (xi , X¬y(xi ) \ xi )d 1−α The estimator for the entropy of the data that is classified with projection π is as follows: ˆ H(Y |π(X); X ∈ A(π)) ∝ 1 n n I[xi ∈ A(π)] i=1 (n − 1)νk (π(xi ), π(Xy(xi ) ) \ π(xi ))d nνk (π(xi ), π(X¬y(xi ) \ xi ))d 1−α (3) From 3 and using the fact that I[xi ∈ A(π)] = I[πg(xi ) = π] for which we use the notation I[g(xi ) → π], we estimate the objective as min M ∈Md π∈Π 1 n n I[g(xi ) → π] i=1 (n − 1)νk (π(xi ), π(Xy(xi ) ) \ π(xi ))d nνk (π(xi ), π(X¬y(xi ) \ xi ))d 3 1−α (4) Therefore, the contribution of each data point to the objective corresponds to a distance ratio on the projection π ∗ where the class of the point is obtained with the highest confidence (data is separable in the neighborhood of the point). We start by computing the distance-based metric of each point on each projection of size up to d - there are d∗ such projections. This procedure yields an extended set of features Z, which we name local entropy estimates: Zij = νk (πj (xi ), πj (Xy(xi ) ) \ πj (xi )) νk (πj (xi ), πj (X¬y(xi ) ) \ πj (xi )) d(1−α) α≈1 j ∈ {1 . . . d∗ } (5) For each training data point, we compute the best distance ratio amid all the projections, which is simply Ti = minj∈[d∗ ] Zij . The objective can be then further rewritten as a function of the entropy estimates: n I[g(xi ) → πj ]Zij min M ∈Md (6) i=1 πj ∈Π From the definition of T, it is also clear that n n I[g(xi ) → πj ]Zij min M ∈Md 2.2 ≥ i=1 πj ∈Π Ti . (7) i=1 Projection selection as a combinatorial problem Considering form (6) of the objective, and given that the estimates Zij are constants, depending only on the training set, the projection retrieval problem is reduced to finding g for all training points, which will implicitly select the projection set of the model. Naturally, one might assume the bestperforming classification model is the one containing all the axis-aligned subspaces. This model achieves the lower bound (7) for the training set. However, the larger the set of projections, the more values the function g takes, and thus the problem of selecting the correct projection becomes more difficult. It becomes apparent that the number of projections should be somehow restricted to allow intepretability. Assuming a hard threshold of at most t projections, the optimization (6) becomes an entry selection problem over matrix Z where one value must be picked from each row under a limitation on the number of columns that can be used. This problem cannot be solved exactly in polynomial time. Instead, it can be formulated as an optimization problem under 1 constraints. 2.3 Projection retrieval through regularized regression To transform the projection retrieval to a regression problem we consider T, the minimum obtainable value of the entropy estimator for each point, as the output which the method needs to predict. Each row i of the parameter matrix B represents the degrees to which the entropy estimates on each projection contribute to the entropy estimator of point xi . Thus, the sum over each row of B is 1, and the regularization penalty applies to the number of non-zero columns in B. d∗ min ||T − (Z B B)J|Π|,1 ||2 2 +λ [Bi = 0] (8) i=1 subject to |Bk | 1 = 1 k = 1, n where (Z B)ij = Zij + Bij and J is a matrix of ones. The problem with this optimization is that it is not convex. A typical walk-around of this issue is to use the convex relaxation for Bi = 0, that is 1 norm. This would transform the penalized term d∗ d∗ n to i=1 |Bi | 1 . However, i=1 |Bi | 1 = k=1 |Bk | 1 = n , so this penalty really has no effect. An alternative mechanism to encourage the non-zero elements in B to populate a small number of columns is to add a penalty term in the form of Bδ, where δ is a d∗ -size column vector with each element representing the penalty for a column in B. With no prior information about which subspaces are more informative, δ starts as an all-1 vector. An initial value for B is obtained through the optimization (8). Since our goal is to handle data using a small number of projections, δ is then updated such that its value is lower for the denser columns in B. This update resembles the reweighing in adaptive lasso. The matrix B itself is updated, and this 2-step process continues until convergence of δ. Once δ converges, the projections corresponding to the non-zero columns of B are added to the model. The procedure is shown in Algorithm 1. 4 Algorithm 1: RECIP δ = [1 . . . 1] repeat |P I| b = arg minB ||T − i=1 < Z, B > ||2 + λ|Bδ| 1 2 subject to |Bk | 1 = 1 k = 1...n δk = |Bi | 1 i = . . . d∗ (update the differential penalty) δ δ = 1 − |δ| 1 until δ converges return Π = {πi ; |Bi | 1 > 0 ∀i = 1 . . . d∗ } 2.4 Lasso for projection selection We will compare our algorithm to lasso regularization that ranks the projections in terms of their potential for data separability. We write this as an 1 -penalized optimization on the extended feature set Z, with the objective T : minβ |T − Zβ|2 + λ|β| 1 . The lasso penalty to the coefficient vector encourages sparsity. For a high enough λ, the sparsity pattern in β is indicative of the usefulness of the projections. The lasso on entropy contributions was not found to perform well as it is not query specific and will find one projection for all data. We improved it by allowing it to iteratively find projections - this robust version offers increased performance by reweighting the data thus focusing on different subsets of it. Although better than running lasso on entropy contributions, the robust lasso does not match RECIP’s performance as the projections are selected gradually rather than jointly. Running the standard lasso on the original design matrix yields a set of relevant variables and it is not immediately clear how the solution would translate to the desired class. 2.5 The selection function Once the projections are selected, the second stage of the algorithm deals with assigning the projection with which to classify a particular query point. An immediate way of selecting the correct projection starts by computing the local entropy estimator for each subspace with each class assignment. Then, we may select the label/subspace combination that minimizes the empirical entropy. (i∗ , θ∗ ) = arg min i,θ 3 νk (πi (x), πi (Xθ )) νk (πi (x), πi (X¬θ )) dim(πi )(1−α) i = 1 . . . d∗ , α≈1 (9) Experimental results In this section we illustrate the capability of RECIP to retrieve informative projections of data and their use in support of interpreting results of classification. First, we analyze how well RECIP can identify subspaces in synthetic data whose distribution obeys the subspace separability assumption (3.1). As a point of reference, we also present classification accuracy results (3.2) for both the synthetic data and a few real-world sets. This is to quantify the extent of the trade-off between fidelity of attainable classifiers and desired informativeness of the projections chosen by RECIP. We expect RECIP’s classification performance to be slightly, but not substantially worse when compared to relevant classification algorithms trained to maximize classification accuracy. Finally, we present a few examples (3.3) of informative projections recovered from real-world data and their utility in explaining to human users the decision processes applied to query points. A set of artificial data used in our experiments contains q batches of data points, each of them made classifiable with high accuracy using one of available 2-dimensional subspaces (x1 , x2 ) with k ∈ k k {1 . . . q}. The data in batch k also have the property that x1 > tk . This is done such that the group a k point belongs to can be detected from x1 , thus x1 is a regulatory variable. We control the amount of k k noise added to thusly created synthetic data by varying the proportion of noisy data points in each batch. The results below are for datasets with 7 features each, with number of batches q ranging between 1 and 7. We kept the number of features specifically low in order to prevent excessive variation between any two sets generated this way, and to enable computing meaningful estimates of the expectation and variance of performance, while enabling creation of complicated data in which synthetic patterns may substantially overlap (using 7 features and 7 2-dimensional patterns implies that dimensions of at least 4 of the patterns will overlap). We implemented our method 5 to be scalable to the size and dimensionality of data and although for brevity we do not include a discussion of this topic here, we have successfully run RECIP against data with 100 features. The parameter α is a value close to 1, because the Tsallis divergence converges to the KL divergence as alpha approaches 1. For the experiments on real-world data, d was set to n (all projections were considered). For the artificial data experiments, we reported results for d = 2 as they do not change significantly for d >= 2 because this data was synthesized to contain bidimensional informative projections. In general, if d is too low, the correct full set of projections will not be found, but it may be recovered partially. If d is chosen too high, there is a risk that a given selected projection p will contain irrelevant features compared to the true projection p0 . However, this situation only occurs if the noise introduced by these features in the estimators makes the entropy contributions on p and p0 statistically indistinguishable for a large subset of data. The users will choose d according to the desired/acceptable complexity of the resulting model. If the results are to be visually interpreted by a human, values of 2 or 3 are reasonable for d. 3.1 Recovering informative projections Table 1 shows how well RECIP recovers the q subspaces corresponding to the synthesized batches of data. We measure precision (proportion of the recovered projections that are known to be informative), and recall (proportion of known informative projections that are recovered by the algorithm). in Table 1, rows correspond to the number of distinct synthetic batches injected in data, q, and subsequent columns correspond to the increasing amounts of noise in data. We note that the observed precision is nearly perfect: the algorithm makes only 2 mistakes over the entire set of experiments, and those occur for highly noisy setups. The recall is nearly perfect as long as there is little overlap among the dimensions, that is when the injections do not interfere with each other. As the number of projections increases, the chances for overlap among the affected features also increase, which makes the data more confusing resulting on a gradual drop of recall until only about 3 or 4 of the 7 known to be informative subspaces can be recovered. We have also used lasso as described in 2.4 in an attempt to recover projections. This setup only manages to recover one of the informative subspaces, regardless of how the regularization parameter is tuned. 3.2 Classification accuracy Table 2 shows the classification accuracy of RECIP, obtained using synthetic data. As expected, the observed performance is initially high when there are few known informative projections in data and it decreases as noise and ambiguity of the injected patterns increase. Most types of ensemble learners would use a voting scheme to arrive at the final classification of a testing sample, rather than use a model selection scheme. For this reason, we have also compared predictive accuracy revealed by RECIP against a method based on majority voting among multiple candidate subspaces. Table 4 shows that the accuracy of this technique is lower than the accuracy of RECIP, regardless of whether the informative projections are recovered by the algorithm or assumed to be known a priori. This confirms the intuition that a selection-based approach can be more effective than voting for data which satisfies the subspace separability assumption. For reference, we have also classified the synthetic data using K-Nearest-Neighbors algorithm using all available features at once. The results of that experiment are shown in Table 5. Since RECIP uses neighbor information, K-NN is conceptually the closest among the popular alternatives. Compared to RECIP, K-NN performs worse when there are fewer synthetic patterns injected in data to form informative projections. It is because some features used then by K-NN are noisy. As more features become informative, the K-NN accuracy improves. This example shows the benefit of a selective approach to feature space and using a subset of the most explanatory projections to support not only explanatory analyses but also classification tasks in such circumstances. 3.3 RECIP case studies using real-world data Table 3 summarizes the RECIP and K-NN performance on UCI datasets. We also test the methods using Cell dataset containing a set of measurements such as the area and perimeter biological cells with separate labels marking cells subjected to treatment and control cells. In Vowel data, the nearest-neighbor approach works exceptionally well, even outperforming random forests (0.94 accuracy), which is an indication that all features are jointly relevant. For d lower than the number of features, RECIP picks projections of only one feature, but if there is no such limitation, RECIP picks the space of all the features as informative. 6 Table 1: Projection recovery for artificial datasets with 1 . . . 7 informative features and noise level 0 . . . 0.2 in terms of mean and variance of Precision and Recall. Mean/var obtained for each setting by repeating the experiment with datasets with different informative projections. PRECISION 1 2 3 4 5 6 7 0 1 1 1 1 1 1 1 0.02 1 1 1 1 1 1 1 Mean 0.05 1 1 1 1 1 1 1 1 2 3 4 5 6 7 0 1 1 1 0.9643 0.7714 0.6429 0.6327 0.02 1 1 1 0.9643 0.7429 0.6905 0.5918 Mean 0.05 1 1 0.9524 0.9643 0.8286 0.6905 0.5918 0 0 0 0 0 0 0 0 0.02 0 0 0 0 0 0 0 Variance 0.05 0 0 0 0 0 0 0 0.1 0.0306 0 0 0 0 0 0 0.2 0.0306 0 0 0 0 0 0 0 0 0 0 0.0077 0.0163 0.0113 0.0225 0.02 0 0 0 0.0077 0.0196 0.0113 0.02 Variance 0.05 0 0 0.0136 0.0077 0.0049 0.0272 0.0258 0.1 0 0 0.0136 0.0077 0.0082 0.0113 0.0233 0.2 0 0 0 0.0128 0.0278 0.0113 0.02 0.1 0.0008 0.0001 0.0028 0.0025 0.0036 0.0025 0.0042 0.2 0.0007 0.0001 0.0007 0.0032 0.0044 0.0027 0.0045 0.1 0.0001 0.0001 0.0007 0.0014 0.0019 0.0023 0.0021 0.2 0.0000 0.0001 0.0007 0.0020 0.0023 0.0021 0.0020 0.1 0.9286 1 1 1 1 1 1 0.2 0.9286 1 1 1 1 1 1 RECALL 0.1 1 1 0.9524 0.9643 0.8571 0.6905 0.5714 0.2 1 1 1 0.9286 0.7714 0.6905 0.551 Table 2: RECIP Classification Accuracy on Artificial Data 1 2 3 4 5 6 7 0 0.9751 0.9333 0.9053 0.8725 0.8113 0.7655 0.7534 1 2 3 4 5 6 7 0 0.9751 0.9333 0.9053 0.8820 0.8714 0.8566 0.8429 CLASSIFICATION ACCURACY Mean Variance 0.02 0.05 0.1 0.2 0 0.02 0.05 0.9731 0.9686 0.9543 0.9420 0.0000 0.0000 0.0000 0.9297 0.9227 0.9067 0.8946 0.0001 0.0001 0.0001 0.8967 0.8764 0.8640 0.8618 0.0004 0.0005 0.0016 0.8685 0.8589 0.8454 0.8187 0.0020 0.0020 0.0019 0.8009 0.8105 0.8105 0.7782 0.0042 0.0044 0.0033 0.7739 0.7669 0.7632 0.7511 0.0025 0.0021 0.0026 0.7399 0.7347 0.7278 0.7205 0.0034 0.0040 0.0042 CLASSIFICATION ACCURACY - KNOWN PROJECTIONS Mean Variance 0.02 0.05 0.1 0.2 0 0.02 0.05 0.9731 0.9686 0.9637 0.9514 0.0000 0.0000 0.0000 0.9297 0.9227 0.9067 0.8946 0.0001 0.0001 0.0001 0.8967 0.8914 0.8777 0.8618 0.0004 0.0005 0.0005 0.8781 0.8657 0.8541 0.8331 0.0011 0.0011 0.0014 0.8641 0.8523 0.8429 0.8209 0.0015 0.0015 0.0018 0.8497 0.8377 0.8285 0.8074 0.0014 0.0015 0.0016 0.8371 0.8256 0.8122 0.7988 0.0015 0.0018 0.0018 Table 3: Accuracy of K-NN and RECIP Dataset KNN RECIP In Spam data, the two most informative projections are Breast Cancer Wis 0.8415 0.8275 ’Capital Length Total’ (CLT)/’Capital Length Longest’ Breast Tissue 1.0000 1.0000 (CLL) and CLT/’Frequency of word your’ (FWY). FigCell 0.7072 0.7640 ure 1 shows these two projections, with the dots repreMiniBOONE* 0.7896 0.7396 senting training points. The red dots represent points laSpam 0.7680 0.7680 Vowel 0.9839 0.9839 beled as spam while the blue ones are non-spam. The circles are query points that have been assigned to be classified with the projection in which they are plotted. The green circles are correctly classified points, while the magenta circles - far fewer - are the incorrectly classified ones. Not only does the importance of text in capital letters make sense for a spam filtering dataset, but the points that select those projections are almost flawlessly classified. Additionally, assuming the user would need to attest the validity of classification for the first plot, he/she would have no trouble seeing that the circled data points are located in a region predominantly populated with examples of spam, so any non-spam entry appears suspicious. Both of the magenta-colored cases fall into this category, and they can be therefore flagged for further investigation. 7 Informative Projection for the Spam dataset 2000 1500 1000 500 0 Informative Projection for the Spam dataset 12 Frequency of word ‘your‘ Capital Run Length Longest 2500 10 8 6 4 2 0 500 1000 1500 2000 2500 Capital Run Length Total 3000 0 3500 0 2000 4000 6000 8000 10000 12000 14000 Capital Run Length Total 16000 Figure 1: Spam Dataset Selected Subspaces Table 4: Classification accuracy using RECIP-learned projections - or known projections, in the lower section - within a voting model instead of a selection model 1 2 3 4 5 6 7 1 2 3 4 5 6 7 CLASSIFICATION ACCURACY - VOTING ENSEMBLE Mean Variance 0 0.02 0.05 0.1 0.2 0 0.02 0.05 0.1 0.9751 0.9731 0.9686 0.9317 0.9226 0.0000 0.0000 0.0000 0.0070 0.7360 0.7354 0.7331 0.7303 0.7257 0.0002 0.0002 0.0001 0.0002 0.7290 0.7266 0.7163 0.7166 0.7212 0.0002 0.0002 0.0008 0.0006 0.6934 0.6931 0.6932 0.6904 0.6867 0.0008 0.0008 0.0008 0.0008 0.6715 0.6602 0.6745 0.6688 0.6581 0.0013 0.0014 0.0013 0.0014 0.6410 0.6541 0.6460 0.6529 0.6512 0.0008 0.0007 0.0010 0.0006 0.6392 0.6342 0.6268 0.6251 0.6294 0.0009 0.0011 0.0012 0.0012 CLASSIFICATION ACCURACY - VOTING ENSEMBLE, KNOWN PROJECTIONS Mean Variance 0 0.02 0.05 0.1 0.2 0 0.02 0.05 0.1 0.9751 0.9731 0.9686 0.9637 0.9514 0.0000 0.0000 0.0000 0.0001 0.7360 0.7354 0.7331 0.7303 0.7257 0.0002 0.0002 0.0001 0.0002 0.7409 0.7385 0.7390 0.7353 0.7325 0.0010 0.0012 0.0010 0.0011 0.7110 0.7109 0.7083 0.7067 0.7035 0.0041 0.0041 0.0042 0.0042 0.7077 0.7070 0.7050 0.7034 0.7008 0.0015 0.0015 0.0015 0.0016 0.6816 0.6807 0.6801 0.6790 0.6747 0.0008 0.0008 0.0008 0.0008 0.6787 0.6783 0.6772 0.6767 0.6722 0.0008 0.0009 0.0009 0.0008 0.2 0.0053 0.0001 0.0002 0.0009 0.0013 0.0005 0.0012 0.2 0.0000 0.0001 0.0010 0.0043 0.0016 0.0009 0.0008 Table 5: Classification accuracy for artificial data with the K-Nearest Neighbors method CLASSIFICATION ACCURACY - KNN 1 2 3 4 5 6 7 4 0 0.7909 0.7940 0.7964 0.7990 0.8038 0.8043 0.8054 0.02 0.7843 0.7911 0.7939 0.7972 0.8024 0.8032 0.8044 Mean 0.05 0.7747 0.7861 0.7901 0.7942 0.8002 0.8015 0.8028 0.1 0.7652 0.7790 0.7854 0.7904 0.7970 0.7987 0.8004 0.2 0.7412 0.7655 0.7756 0.7828 0.7905 0.7930 0.7955 0 0.0002 0.0001 0.0000 0.0001 0.0001 0.0001 0.0001 0.02 0.0002 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 Variance 0.05 0.0002 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.1 0.0002 0.0001 0.0000 0.0001 0.0001 0.0001 0.0001 0.2 0.0002 0.0001 0.0000 0.0001 0.0001 0.0001 0.0001 Conclusion This paper considers the problem of Projection Recovery for Classification. It is relevant in applications where the decision process must be easy to understand in order to enable human interpretation of the results. We have developed a principled, regression-based algorithm designed to recover small sets of low-dimensional subspaces that support interpretability. It optimizes the selection using individual data-point-specific entropy estimators. In this context, the proposed algorithm follows the idea of transductive learning, and the role of the resulting projections bears resemblance to high confidence regions known in conformal prediction models. Empirical results obtained using simulated and real-world data show the effectiveness of our method in finding informative projections that enable accurate classification while maintaining transparency of the underlying decision process. Acknowledgments This material is based upon work supported by the NSF, under Grant No. IIS-0911032. 8 References [1] Mark W. Craven and Jude W. Shavlik. Extracting Tree-Structured Representations of Trained Networks. In David S. Touretzky, Michael C. Mozer, and Michael E. Hasselmo, editors, Advances in Neural Information Processing Systems, volume 8, pages 24–30. The MIT Press, 1996. [2] Pedro Domingos. Knowledge discovery via multiple models. Intelligent Data Analysis, 2:187–202, 1998. [3] Eulanda M. Dos Santos, Robert Sabourin, and Patrick Maupin. A dynamic overproduce-and-choose strategy for the selection of classifier ensembles. Pattern Recogn., 41:2993–3009, October 2008. [4] M. Feder and N. Merhav. Relations between entropy and error probability. Information Theory, IEEE Transactions on, 40(1):259–266, January 1994. [5] Jerome H. Friedman, Ron Kohavi, and Yeogirl Yun. Lazy decision trees, 1996. [6] A. Gammerman, V. Vovk, and V. Vapnik. Learning by transduction. In In Uncertainty in Artificial Intelligence, pages 148–155. Morgan Kaufmann, 1998. [7] Quanquan Gu, Zhenhui Li, and Jiawei Han. Joint feature selection and subspace learning, 2011. [8] Bing Liu, Minqing Hu, and Wynne Hsu. Intuitive representation of decision trees using general rules and exceptions. In Proceedings of Seventeeth National Conference on Artificial Intellgience (AAAI-2000), July 30 - Aug 3, 2000, pages 615–620, 2000. [9] Michael Mampaey, Nikolaj Tatti, and Jilles Vreeken. Tell me what i need to know: succinctly summarizing data with itemsets. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’11, pages 573–581, New York, NY, USA, 2011. ACM. [10] Mario Marchand and Marina Sokolova. Learning with decision lists of data-dependent features. JOURNAL OF MACHINE LEARNING REASEARCH, 6, 2005. [11] Guillaume Obozinski, Ben Taskar, and Michael I. Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20(2):231–252, April 2010. [12] Michael J. Pazzani, Subramani Mani, and W. Rodman Shankle. Beyond concise and colorful: Learning intelligible rules, 1997. [13] B. Poczos and J. Schneider. On the estimation of alpha-divergences. AISTATS, 2011. [14] Kai Ting, Jonathan Wells, Swee Tan, Shyh Teng, and Geoffrey Webb. Feature-subspace aggregating: ensembles for stable andunstable learners. Machine Learning, 82:375–397, 2011. 10.1007/s10994-0105224-5. 9
5 0.84610599 315 nips-2012-Slice sampling normalized kernel-weighted completely random measure mixture models
Author: Nicholas Foti, Sinead Williamson
Abstract: A number of dependent nonparametric processes have been proposed to model non-stationary data with unknown latent dimensionality. However, the inference algorithms are often slow and unwieldy, and are in general highly specific to a given model formulation. In this paper, we describe a large class of dependent nonparametric processes, including several existing models, and present a slice sampler that allows efficient inference across this class of models. 1
6 0.84541535 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems
7 0.84082681 197 nips-2012-Learning with Recursive Perceptual Representations
8 0.83312833 171 nips-2012-Latent Coincidence Analysis: A Hidden Variable Model for Distance Metric Learning
9 0.83298522 200 nips-2012-Local Supervised Learning through Space Partitioning
10 0.8321867 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions
12 0.82711172 251 nips-2012-On Lifting the Gibbs Sampling Algorithm
13 0.82651293 103 nips-2012-Distributed Probabilistic Learning for Camera Networks with Missing Data
14 0.82361037 198 nips-2012-Learning with Target Prior
15 0.82354647 188 nips-2012-Learning from Distributions via Support Measure Machines
16 0.82350731 168 nips-2012-Kernel Latent SVM for Visual Recognition
17 0.82291919 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization
18 0.82230842 74 nips-2012-Collaborative Gaussian Processes for Preference Learning
19 0.82168233 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models
20 0.82078642 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model