nips nips2012 nips2012-319 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Andreas Argyriou, Rina Foygel, Nathan Srebro
Abstract: We derive a novel norm that corresponds to the tightest convex relaxation of sparsity combined with an $\ell_2$ penalty. We show that this new k-support norm provides a tighter relaxation than the elastic net and can thus be advantageous in sparse prediction problems. We also bound the looseness of the elastic net, thus shedding new light on it and providing justification for its use.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We derive a novel norm that corresponds to the tightest convex relaxation of sparsity combined with an $\ell_2$ penalty. [sent-4, score-0.711]
2 We show that this new k-support norm provides a tighter relaxation than the elastic net and can thus be advantageous in sparse prediction problems. [sent-5, score-1.484]
3 We also bound the looseness of the elastic net, thus shedding new light on it and providing justification for its use. [sent-6, score-0.669]
4 1 Introduction Regularizing with the $\ell_1$ norm, when we expect a sparse solution to a regression problem, is often justified by $\|w\|_1$ being the “convex envelope” of $\|w\|_0$ (the number of non-zero coordinates of a vector $w \in \mathbb{R}^d$). [sent-7, score-0.134]
5 That is, $\|w\|_1$ is the tightest convex lower bound on $\|w\|_0$. [sent-8, score-0.263]
6 But we must be careful with this statement—for sparse vectors with large entries, $\|w\|_0$ can be small while $\|w\|_1$ is large. [sent-9, score-0.056]
7 In order to discuss convex lower bounds on $\|w\|_0$, we must impose some scale constraint. [sent-10, score-0.093]
8 A more accurate statement is that $\|w\|_1 \leq \|w\|_\infty \|w\|_0$, and so, when the magnitudes of entries in $w$ are bounded by 1, then $\|w\|_1 \leq \|w\|_0$, and indeed it is the largest such convex lower bound. [sent-11, score-0.118]
9 Viewed as a convex outer relaxation, $S_k^{(\infty)} := \{w : \|w\|_0 \leq k,\ \|w\|_\infty \leq 1\} \subseteq \{w : \|w\|_1 \leq k\}$. [sent-12, score-0.141]
10 Intersecting the right-hand side with the $\ell_\infty$ unit ball, we get the tightest convex outer bound (convex hull) of $S_k^{(\infty)}$: $\{w : \|w\|_1 \leq k,\ \|w\|_\infty \leq 1\} = \mathrm{conv}(S_k^{(\infty)})$. [sent-13, score-0.355]
11 However, in our view, this relationship between $\|w\|_1$ and $\|w\|_0$ yields disappointing learning guarantees, and does not appropriately capture the success of the $\ell_1$ norm as a surrogate for sparsity. [sent-14, score-0.284]
12 Perhaps a better reason for the $\ell_1$ norm being a good surrogate for sparsity is that, not only do we expect the magnitude of each entry of $w$ to be bounded, but we further expect $\|w\|_2$ to be small. [sent-16, score-0.395]
13 In a regression setting, with a vector of features $x$, this can be justified when $E[(x^\top w)^2]$ is bounded (a reasonable assumption) and the features are not too correlated—see, e.g. [sent-17, score-0.152]
14 In any case, we have $\|w\|_1 \leq \|w\|_2 \sqrt{\|w\|_0}$, and so if we are interested in predictors with bounded $\ell_2$ norm, we can motivate the $\ell_1$ norm through the following relaxation of sparsity, where the scale is now set by the $\ell_2$ norm: $\{w : \|w\|_0 \leq k,\ \|w\|_2 \leq B\} \subseteq \{w : \|w\|_1 \leq B\sqrt{k}\}$. [sent-22, score-0.421]
15 The sample complexity when using the relaxation now scales as $O(k \log d)$. [sent-23, score-0.208]
16 Our starting point is then that of combining sparsity and $\ell_2$ regularization, and learning a sparse predictor with small $\ell_2$ norm. [sent-25, score-0.22]
17 As discussed above, the class $\{\|w\|_1 \leq \sqrt{k}\}$ (corresponding to the standard Lasso) provides a convex relaxation of $S_k^{(2)}$. [sent-27, score-0.231]
18 But clearly we can get a tighter relaxation by keeping the $\ell_2$ constraint: $\mathrm{conv}(S_k^{(2)}) \subseteq \{w : \|w\|_1 \leq \sqrt{k},\ \|w\|_2 \leq 1\} \subseteq \{w : \|w\|_1 \leq \sqrt{k}\}$. (1) [sent-28, score-0.253]
19 In this paper, we ask whether the elastic net is the tightest convex relaxation to sparsity plus $\ell_2$ (that is, to $S_k^{(2)}$) or whether a tighter, and better, convex relaxation is possible. [sent-30, score-1.612]
20 We consider the convex hull (tightest convex outer bound) of $S_k^{(2)}$, $C_k := \mathrm{conv}(S_k^{(2)}) = \mathrm{conv}\{w : \|w\|_0 \leq k,\ \|w\|_2 \leq 1\}$. (2) [sent-32, score-0.355]
21 We study the gauge function associated with this convex set, that is, the norm whose unit ball is given by (2), which we call the k-support norm. [sent-33, score-0.415]
22 We show that, for $k > 1$, this is indeed a tighter convex relaxation than the elastic net (that is, both inequalities in (1) are in fact strict inequalities), and is therefore a better convex constraint than the elastic net when seeking a sparse, low $\ell_2$-norm linear predictor. [sent-34, score-2.339]
23 We thus advocate using it as a replacement for the elastic net. [sent-35, score-0.617]
24 However, we also show that the gap between the elastic net and the k-support norm is at most a factor of $\sqrt{2}$, corresponding to a factor of two difference in the sample complexity. [sent-36, score-1.188]
25 Thus, our work can also be interpreted as justifying the use of the elastic net, viewing it as a fairly good approximation to the tightest possible convex relaxation of sparsity intersected with an $\ell_2$ constraint. [sent-37, score-1.096]
26 Still, even a factor of two should not necessarily be ignored and, as we show in our experiments, using the tighter k-support norm can indeed be beneficial. [sent-38, score-0.403]
27 To better understand the k-support norm, we show in Section 2 that it can also be described as the group lasso with overlaps norm [10] corresponding to all $\binom{d}{k}$ subsets of $k$ features. [sent-39, score-0.391]
28 Despite the exponential number of groups in this description, we show that the k-support norm can be calculated efficiently in time $O(d \log d)$ and that its dual is given simply by the $\ell_2$ norm of the $k$ largest entries. [sent-40, score-0.489]
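As a concrete illustration of the dual description just given (the dual norm is the $\ell_2$ norm of the $k$ largest-magnitude entries), here is a minimal numpy sketch; it is illustrative code written for this note, not code from the paper.

```python
import numpy as np

def k_support_dual_norm(u, k):
    """Dual of the k-support norm: the l2 norm of the k largest-magnitude entries of u."""
    top_k = np.sort(np.abs(np.asarray(u, dtype=float)))[::-1][:k]
    return float(np.linalg.norm(top_k))

# For k = 1 this reduces to the l-infinity norm (dual of l1); for k = d it is the l2 norm.
u = np.array([3.0, -1.0, 0.5, 2.0])
print(k_support_dual_norm(u, 2))   # sqrt(3^2 + 2^2)
```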
29 In particular, in many applications, when a group of variables is highly correlated, the Lasso may prefer a sparse solution, but we might gain more predictive accuracy by including all the correlated variables in our model. [sent-43, score-0.185]
30 individual features are bounded), the sample complexity when using only $\|w\|_2 \leq B$ (without a sparsity or $\ell_1$ constraint) scales as $O(B^2 d)$. [sent-47, score-0.187]
31 That is, even after identifying the correct support, we still need a sample complexity that scales with $B^2$. [sent-48, score-0.07]
32 The elastic net can be viewed as a trade-off between $\ell_1$ regularization (the Lasso) and $\ell_2$ regularization (Ridge regression [9]), depending on the relative values of $\lambda_1$ and $\lambda_2$. [sent-50, score-1.119]
33 The pairwise elastic net (PEN) [13] is a penalty function that accounts for similarity among features: $\|w\|^{PEN}_R = \|w\|_2^2 + \|w\|_1^2 - |w|^\top R |w|$, where $R \in [0,1]^{p \times p}$ is a matrix with $R_{jk}$ measuring similarity between features $X_j$ and $X_k$. [sent-53, score-0.981]
34 The trace Lasso [6] is a second method proposed to handle correlations within $X$, defined by $\|w\|^{\mathrm{trace}}_X = \|X\,\mathrm{diag}(w)\|_*$, where $\|\cdot\|_*$ denotes the matrix trace-norm (the sum of the singular values) and promotes a low-rank solution. [sent-54, score-0.108]
35 If the features are all identical, then both penalties are equivalent to Ridge regression (penalizing $\|w\|_2$). [sent-56, score-0.088]
36 This result (the closed form in (4)) shows that $\|\cdot\|^{sp}_k$ trades off between the $\ell_1$ and $\ell_2$ norms in a way that favors sparse vectors but allows for cardinality larger than $k$. [sent-59, score-0.318]
37 It combines the uniform shrinkage of an $\ell_2$ penalty for the largest components with the sparse shrinkage of an $\ell_1$ penalty for the smallest components. [sent-60, score-0.24]
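To make this trade-off concrete, the sketch below computes the k-support norm through the closed form that (4) refers to, assuming it has the shape reported in the paper: for the unique $r \in \{0, \dots, k-1\}$ with $|w|^\downarrow_{k-r-1} > \frac{1}{r+1}\sum_{i=k-r}^{d}|w|^\downarrow_i \geq |w|^\downarrow_{k-r}$, the norm is $\big(\sum_{i=1}^{k-r-1}(|w|^\downarrow_i)^2 + \frac{1}{r+1}\big(\sum_{i=k-r}^{d}|w|^\downarrow_i\big)^2\big)^{1/2}$. Treat it as a sketch under that assumption rather than reference code.

```python
import numpy as np

def k_support_norm(w, k):
    """k-support norm via the assumed closed form of (4): l2 behavior on the largest
    k-r-1 entries combined with a scaled l1 term on the remaining tail."""
    z = np.sort(np.abs(np.asarray(w, dtype=float)))[::-1]   # |w| sorted non-increasingly
    d = z.size
    assert 1 <= k <= d
    for r in range(k):
        tail = z[k - r - 1:].sum()                           # sum_{i=k-r}^{d} |w|_i (1-based)
        head = np.inf if k - r - 1 == 0 else z[k - r - 2]    # |w|_{k-r-1}, with |w|_0 := +inf
        if head > tail / (r + 1) >= z[k - r - 1]:
            return float(np.sqrt(np.sum(z[:k - r - 1] ** 2) + tail ** 2 / (r + 1)))
    raise RuntimeError("no admissible r found; check the input")

# Sanity checks: k = 1 recovers the l1 norm, k = d recovers the l2 norm.
w = np.array([3.0, -2.0, 0.5, 0.1])
print(k_support_norm(w, 1), np.abs(w).sum())
print(k_support_norm(w, 4), np.linalg.norm(w))
```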
38 We have $\frac{1}{2}(\|w\|^{sp}_k)^2 = \max\big\{ \langle u, w \rangle - \frac{1}{2}(\|u\|^{(2)}_{(k)})^2 : u \in \mathbb{R}^d \big\} = \max\big\{ \sum_{i=1}^{k-1} \alpha_i |w|^\downarrow_i + \alpha_k \sum_{i=k}^{d} |w|^\downarrow_i - \frac{1}{2} \sum_{i=1}^{k} \alpha_i^2 : \alpha_1 \geq \cdots \geq \alpha_k \geq 0 \big\}$, where $|w|^\downarrow$ is $|w|$ sorted in non-increasing order; let $A_r := \sum_{i=k-r}^{d} |w|^\downarrow_i$. [sent-64, score-0.182]
39 3 Learning with the k-support norm We thus propose using learning rules with k-support norm regularization. [sent-100, score-0.464]
40 These are appropriate when we would like to learn a sparse predictor that also has low $\ell_2$ norm, and are especially relevant when features might be correlated (that is, in almost all learning tasks) but the correlation structure is not known in advance. [sent-101, score-0.245]
41 E.g., for squared error regression problems we have: $\min\big\{ \frac{1}{2}\|Xw - y\|_2^2 + \frac{\lambda}{2} (\|w\|^{sp}_k)^2 : w \in \mathbb{R}^d \big\}$ (5) with $\lambda > 0$ a regularization parameter and $k \in \{1, \dots, d\}$. [sent-104, score-0.343]
42 As typical in regularization-based methods, both λ and k can be selected by cross validation [8]. [sent-108, score-0.049]
43 Despite the relationship to $S_k^{(2)}$, the parameter $k$ does not necessarily correspond to the sparsity of the actual minimizer of (5), and should be chosen via cross-validation rather than set to the desired sparsity. [sent-109, score-0.107]
44 3 Relation to the Elastic Net Recall that the elastic net with penalty parameters $\lambda_1$ and $\lambda_2$ selects a vector of coefficients given by $\hat{w} = \arg\min_w \frac{1}{2}\|Xw - y\|_2^2 + \lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2$. [sent-110, score-0.942]
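For reference, an elastic net of this penalized form can be fit with scikit-learn; the sketch below assumes scikit-learn's documented objective $\frac{1}{2n}\|y - Xw\|_2^2 + \alpha\rho\|w\|_1 + \frac{\alpha(1-\rho)}{2}\|w\|_2^2$ (with $\rho$ the `l1_ratio`), which gives the mapping $\alpha\rho = \lambda_1/n$ and $\alpha(1-\rho)/2 = \lambda_2/n$; the parameter names are scikit-learn's, not the paper's.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def fit_elastic_net(X, y, lam1, lam2):
    """Fit argmin_w 0.5*||Xw - y||^2 + lam1*||w||_1 + lam2*||w||_2^2 by translating
    (lam1, lam2) into scikit-learn's (alpha, l1_ratio); assumes lam1 + lam2 > 0."""
    n = X.shape[0]
    alpha = (lam1 + 2.0 * lam2) / n
    l1_ratio = lam1 / (lam1 + 2.0 * lam2)
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, fit_intercept=False, max_iter=10000)
    model.fit(X, y)
    return model.coef_
```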
45 Now for any $w \neq \hat{w}$, if $\|w\|^{el}_k \leq \|\hat{w}\|^{el}_k$, then $\|w\|_p \leq \|\hat{w}\|_p$ for $p = 1, 2$. [sent-113, score-0.384]
46 This proves that, for some constraint parameter $B$, $\hat{w} = \arg\min\big\{ \frac{1}{2}\|Xw - y\|_2^2 : \|w\|^{el}_k \leq B \big\}$. [sent-115, score-0.246]
47 Like the k-support norm, the elastic net interpolates between the $\ell_1$ and $\ell_2$ norms. [sent-116, score-0.902]
48 In fact, when $k$ is an integer, any $k$-sparse unit vector $w \in \mathbb{R}^d$ must lie in the unit ball of $\|\cdot\|^{el}_k$. [sent-117, score-0.326]
49 Since the k-support norm gives the convex hull of all $k$-sparse unit vectors, this immediately implies that $\|w\|^{el}_k \leq \|w\|^{sp}_k$ for all $w \in \mathbb{R}^d$. [sent-118, score-0.789]
50 The difference between the two is illustrated in Figure 1, where we see that the k-support norm is more “rounded”. [sent-120, score-0.232]
51 To see an example where the two norms are not equal, we set $d = 1 + k^2$ for some large $k$, and let $w = (k^{1.5}, 1, \dots, 1)^\top$. [sent-121, score-0.08]
52 Considering $u = (\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2k}}, \dots, \frac{1}{\sqrt{2k}})^\top$, we have $\|u\|^{(2)}_{(k)} < 1$, and recalling that this norm is dual to the k-support norm: $\|w\|^{sp}_k \geq \langle u, w \rangle = \frac{k^{1.5}}{\sqrt{2}} + \frac{k^2}{\sqrt{2k}} = \sqrt{2}\, k^{1.5}$. [sent-132, score-0.257]
53 In this example, we see that the two norms can differ by as much as a factor of $\sqrt{2}$. [sent-135, score-0.107]
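The chain $\|w\|^{el}_k \leq \|w\|^{sp}_k \leq \sqrt{2}\,\|w\|^{el}_k$ implied by the discussion above can be spot-checked numerically with the norm sketch given earlier; here the elastic-net-ball norm is taken to be $\|w\|^{el}_k = \max\{\|w\|_2, \|w\|_1/\sqrt{k}\}$, i.e. the gauge of the constraint set $\{\|w\|_1 \leq \sqrt{k}, \|w\|_2 \leq 1\}$, which is an assumption about how the paper defines it.

```python
import numpy as np
# reuses k_support_norm from the earlier sketch

def elastic_ball_norm(w, k):
    """Gauge of {||w||_1 <= sqrt(k), ||w||_2 <= 1}: the max of the two rescaled norms."""
    w = np.asarray(w, dtype=float)
    return max(float(np.linalg.norm(w)), float(np.abs(w).sum()) / np.sqrt(k))

rng = np.random.default_rng(0)
k, d = 5, 30
for _ in range(5):
    w = rng.standard_normal(d) * (rng.random(d) < 0.3)   # a sparse-ish random vector
    if not np.any(w):
        continue
    ratio = k_support_norm(w, k) / elastic_ball_norm(w, k)
    print(round(ratio, 3))   # empirically lies in [1, sqrt(2)]
```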
54 First, since $\|\cdot\|^{el}_k$ is a maximum over the $\ell_1$ and $\ell_2$ norms, its dual is given by $\|u\|^{(el)*}_k := \inf\big\{ \|a\|_2 + \sqrt{k}\,\|u - a\|_\infty : a \in \mathbb{R}^d \big\}$. Now take any $u \in \mathbb{R}^d$. [sent-142, score-0.217]
55 Furthermore, this yields a strict inequality, because if $u_1 > u_{k+1}$, the next-to-last inequality is strict, while if $u_1 = \cdots = u_{k+1}$, then the last inequality is strict. [sent-155, score-0.105]
56 4 Optimization Solving the optimization problem (5) efficiently can be done with a first-order proximal algorithm. [sent-156, score-0.085]
57 Proximal methods – see [1, 4, 14, 18, 19] and references therein – are used to solve composite problems of the form $\min\{f(x) + \omega(x) : x \in \mathbb{R}^d\}$, where the loss function $f(x)$ and the regularizer $\omega(x)$ are convex functions, and $f$ is smooth with an $L$-Lipschitz gradient. [sent-157, score-0.119]
58 These methods require fast computation of the gradient $\nabla f$ and the proximity operator $\mathrm{prox}_\omega(x) := \arg\min\big\{ \frac{1}{2}\|u - x\|_2^2 + \omega(u) : u \in \mathbb{R}^d \big\}$. [sent-158, score-0.054]
59 To obtain a proximal method for k-support regularization, it suffices to compute the proximity map of $g = \frac{1}{2\beta}(\|\cdot\|^{sp}_k)^2$, for any $\beta > 0$ (in particular, for problem (5), $\beta$ corresponds to $\frac{L}{\lambda}$). [sent-159, score-0.321]
60 Algorithm 1: Input $v \in \mathbb{R}^d$; Output $q = \mathrm{prox}_{\frac{1}{2\beta}(\|\cdot\|^{sp}_k)^2}(v)$. Find $r \in \{0, \dots, k-1\}$ and $\ell \in \{k, \dots, d\}$ such that [sent-162, score-0.243]
61 $\frac{1}{\beta+1} z_{k-r-1} > \frac{T_{r,\ell}}{\ell - k + (\beta+1)r + \beta + 1} \geq \frac{1}{\beta+1} z_{k-r}$ and $z_\ell > \frac{T_{r,\ell}}{\ell - k + (\beta+1)r + \beta + 1} \geq z_{\ell+1}$, where $z := |v|^\downarrow$, $z_0 := +\infty$, $z_{d+1} := -\infty$, $T_{r,\ell} := \sum_{i=k-r}^{\ell} z_i$; then set $q_i \leftarrow \frac{\beta}{\beta+1} z_i$ if $i = 1, \dots, k-r-1$, [sent-168, score-0.128]
62 $q_i \leftarrow z_i - \frac{T_{r,\ell}}{\ell - k + (\beta+1)r + \beta + 1}$ if $i = k-r, \dots, \ell$, and $q_i \leftarrow 0$ if $i = \ell+1, \dots, d$. [sent-171, score-0.156]
63 Since the k-support norm is sign and permutation invariant, $\mathrm{prox}_g(v)$ has the same ordering and signs as $v$. [sent-180, score-0.087]
64 Hence, without loss of generality, we may assume that $v_1 \geq \cdots \geq v_d \geq 0$ and require that $q_1 \geq \cdots \geq q_d \geq 0$, which follows from inequality (7) and the fact that $z$ is ordered. [sent-181, score-0.059]
65 Now, $q = \mathrm{prox}_g(v)$ is equivalent to $\beta z - \beta q = \beta v - \beta q \in \partial \frac{1}{2}(\|\cdot\|^{sp}_k)^2(q)$. [sent-182, score-0.236]
66 Indeed, $A_r$ corresponds to $\sum_{i=k-r}^{d} q_i = \sum_{i=k-r}^{\ell} \big(z_i - \frac{T_{r,\ell}}{\ell - k + (\beta+1)r + \beta + 1}\big) = T_{r,\ell} - \frac{(\ell - k + r + 1)\,T_{r,\ell}}{\ell - k + (\beta+1)r + \beta + 1} = \frac{(r+1)\beta\, T_{r,\ell}}{\ell - k + (\beta+1)r + \beta + 1}$, and (4) is equivalent to condition (7). [sent-185, score-0.156]
67 For $i \geq \ell + 1$, since $q_i = 0$, we only need $\beta z_i - \beta q_i \leq \frac{1}{r+1} A_r$, which is true by (8). [sent-188, score-0.092]
68 We can now apply a standard accelerated proximal method, such as FISTA [1], to (5), at each iteration using the gradient of the loss and performing a prox step using Algorithm 1. [sent-189, score-0.172]
69 The FISTA guarantee ensures us that, with appropriate step sizes, after $T$ such iterations, we have: $\frac{1}{2}\|Xw_T - y\|_2^2 + \frac{\lambda}{2}(\|w_T\|^{sp}_k)^2 \leq \frac{1}{2}\|Xw^* - y\|_2^2 + \frac{\lambda}{2}(\|w^*\|^{sp}_k)^2 + \frac{2L\|w^* - w_1\|_2^2}{(T+1)^2}$. [sent-190, score-0.364]
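A minimal FISTA skeleton of the kind described here is sketched below; for problem (5) the prox step would be Algorithm 1 applied to $\frac{1}{2\beta}(\|\cdot\|^{sp}_k)^2$ with $\beta = L/\lambda$, but since that routine is specific to the paper, the runnable demo plugs in plain $\ell_1$ soft-thresholding (i.e., it solves a Lasso problem) purely as a stand-in.

```python
import numpy as np

def fista(grad_f, prox, L, w0, n_iter=500):
    """Accelerated proximal gradient (FISTA) for min_w f(w) + omega(w).
    `prox(v, t)` must return prox_{t*omega}(v); L is a Lipschitz constant of grad f."""
    w_prev = w0.copy()
    w = w0.copy()
    y = w0.copy()
    t = 1.0
    for _ in range(n_iter):
        w_prev = w
        w = prox(y - grad_f(y) / L, 1.0 / L)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = w + ((t - 1.0) / t_next) * (w - w_prev)
        t = t_next
    return w

# Stand-in demo on a least-squares loss; for (5), replace `soft_threshold` with
# the prox of (lam/2)*(||.||_k^sp)^2 computed by Algorithm 1.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))
w_true = (rng.random(20) < 0.2).astype(float)
y_obs = X @ w_true + 0.1 * rng.standard_normal(50)
lam = 0.5
L = np.linalg.norm(X, 2) ** 2      # squared spectral norm bounds the Hessian X^T X
grad = lambda w: X.T @ (X @ w - y_obs)
soft_threshold = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - lam * t, 0.0)
w_hat = fista(grad, soft_threshold, L, np.zeros(20))
```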
70 5 Empirical Comparisons Our theoretical analysis indicates that the k-support norm and the elastic net differ by at most a factor of $\sqrt{2}$, corresponding to at most a factor of two difference in their sample complexities and generalization guarantees. [sent-191, score-1.188]
71 We thus do not expect huge differences between their actual performances, but would still like to see whether the tighter relaxation of the k-support norm does yield some gains. [sent-192, score-0.514]
72 A total of 50 data sets were created in this way, each containing 50 training points, 50 validation points and 350 test points. [sent-212, score-0.049]
73 The goal is to achieve good prediction performance on the test data. [sent-213, score-0.041]
74 We compared the k-support norm with Lasso and the elastic net. [sent-214, score-0.849]
75 We considered $k \in \{1, \dots, d\}$ for k-support norm regularization, and $\lambda = 10^i$ for $i \in \{-15, \dots,$ [sent-218, score-0.232]
76 $5\}$, for the regularization parameter of Lasso and of k-support regularization, and the same range for the $\lambda_1, \lambda_2$ of the elastic net. [sent-221, score-0.785]
77 For each method, the optimal set of parameters was selected based on mean squared error on the validation set. [sent-222, score-0.077]
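A hypothetical version of this validation-based selection loop is sketched below, using the elastic net fit from the earlier sketch as the model being tuned; the data and the (narrower) grid are made up for illustration.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
X_tr, X_val = rng.standard_normal((50, 20)), rng.standard_normal((50, 20))
w_true = np.zeros(20); w_true[:3] = 1.0
y_tr = X_tr @ w_true + 0.1 * rng.standard_normal(50)
y_val = X_val @ w_true + 0.1 * rng.standard_normal(50)

grid = [10.0 ** i for i in range(-6, 3)]          # a narrower grid than the paper's 10^-15..10^5
best = None
for lam1, lam2 in product(grid, grid):
    w = fit_elastic_net(X_tr, y_tr, lam1, lam2)   # from the earlier elastic net sketch
    mse = float(np.mean((X_val @ w - y_val) ** 2))
    if best is None or mse < best[0]:
        best = (mse, lam1, lam2)
print(best)
```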
78 Whereas all three methods distinguish the 15 relevant variables, the elastic net result varies less within these variables. [sent-226, score-0.902]
79 There are 9 variables and 462 examples, and the response is presence/absence of coronary heart disease. [sent-228, score-0.073]
80 Table 1: Mean squared errors and classification accuracy for the synthetic data (median over 50 repetitions), SA heart data (median over 50 replications) and for the “20 newsgroups” data set. [sent-229, score-0.162]
81 (SE = standard error) Columns of Table 1: Method (Lasso, Elastic net, k-support), Synthetic MSE (SE), SA heart MSE (SE), 20 newsgroups accuracy. [sent-230, score-0.285]
82 We normalized the data so that each predictor variable has zero mean and unit variance. [sent-254, score-0.13]
83 For each method, parameters were selected using the validation data. [sent-256, score-0.049]
84 20 Newsgroups This is a binary classification version of 20 newsgroups created in [12] which can be found in the LIBSVM data repository. [sent-259, score-0.089]
85 We randomly split the data into a training, a validation and a test set of sizes 14000,1000 and 4996, respectively. [sent-265, score-0.049]
86 We found that k-support regularization gave improved prediction accuracy over both other methods. [sent-267, score-0.157]
87 6 Summary We introduced the k-support norm as the tightest convex relaxation of sparsity plus $\ell_2$ regularization, and showed that it is tighter than the elastic net by exactly a factor of $\sqrt{2}$. [sent-268, score-1.755]
88 In our view, this sheds light on the elastic net as a close approximation to this tightest possible convex relaxation, and motivates using the k-support norm when a tighter relaxation is sought. [sent-269, score-1.65]
89 We note that the k-support norm has better prediction properties, but not necessarily better sparsity-inducing properties, as evident from its more rounded unit ball. [sent-271, score-0.411]
90 It is well understood that there is often a tradeoff between sparsity and good prediction, and that even if the population optimal predictor is sparse, a denser predictor often yields better predictive performance [3, 10, 21]. [sent-272, score-0.283]
91 For example, in the presence of correlated features, it is often beneficial to include several highly correlated features rather than a single representative feature. [sent-273, score-0.167]
92 This is exactly the behavior encouraged by $\ell_2$ norm regularization, and the elastic net is already known to yield less sparse (but more predictive) solutions. [sent-274, score-1.19]
93 The k-support norm goes a step further in this direction, often yielding solutions that are even less sparse (but more predictive) compared to the elastic net. [sent-275, score-0.905]
94 Nevertheless, it is interesting to consider whether compressed sensing results, where $\ell_1$ regularization is of course central, can be refined by using the k-support norm, which might be able to handle more correlation structure within the set of features. [sent-276, score-0.084]
95 Acknowledgements The construction showing that the gap between the elastic net and the k-overlap norm can be as large as $\sqrt{2}$ is due to joint work with Ohad Shamir. [sent-277, score-1.134]
96 tw/∼cjlin/libsvmtools/datasets/ Regarding other sparse prediction methods, we did not manage to compare with OSCAR, due to memory limitations, or to PEN or trace Lasso, which do not have code available online. [sent-293, score-0.151]
97 Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. [sent-299, score-0.075]
98 Trace lasso: a trace norm regularization for correlated designs. [sent-319, score-0.434]
99 Exploiting covariate similarity in sparse regression via the pairwise elastic net. [sent-380, score-0.722]
100 Approximation accuracy, gradient methods, and error bound for structured convex optimization. [sent-407, score-0.093]
wordName wordTfidf (topN-words)
[('elastic', 0.617), ('net', 0.285), ('norm', 0.232), ('el', 0.192), ('sp', 0.182), ('tightest', 0.17), ('lasso', 0.159), ('relaxation', 0.138), ('xw', 0.133), ('tighter', 0.115), ('sk', 0.112), ('ar', 0.109), ('convex', 0.093), ('qi', 0.092), ('oscar', 0.092), ('newsgroups', 0.089), ('predictor', 0.086), ('proximal', 0.085), ('rd', 0.085), ('regularization', 0.084), ('pen', 0.081), ('norms', 0.08), ('sparsity', 0.078), ('se', 0.078), ('mse', 0.078), ('conv', 0.075), ('tr', 0.073), ('uk', 0.073), ('heart', 0.073), ('zi', 0.064), ('correlated', 0.064), ('prox', 0.061), ('sparse', 0.056), ('proximity', 0.054), ('trace', 0.054), ('proxg', 0.054), ('shrinkage', 0.052), ('regression', 0.049), ('validation', 0.049), ('outer', 0.048), ('foygel', 0.047), ('rina', 0.047), ('ball', 0.046), ('hull', 0.046), ('ridge', 0.045), ('unit', 0.044), ('scales', 0.041), ('prediction', 0.041), ('sridharan', 0.041), ('penalty', 0.04), ('rounded', 0.04), ('fista', 0.04), ('features', 0.039), ('strict', 0.037), ('penalizing', 0.035), ('inequality', 0.034), ('obozinski', 0.034), ('srebro', 0.034), ('signs', 0.033), ('predictive', 0.033), ('inequalities', 0.033), ('zk', 0.032), ('accuracy', 0.032), ('justi', 0.03), ('complexity', 0.029), ('expect', 0.029), ('necessarily', 0.029), ('synthetic', 0.029), ('median', 0.028), ('proves', 0.028), ('squared', 0.028), ('factor', 0.027), ('surrogate', 0.027), ('lorbert', 0.027), ('mol', 0.027), ('reorder', 0.027), ('duals', 0.027), ('nati', 0.027), ('rinafb', 0.027), ('vito', 0.027), ('shedding', 0.027), ('centrale', 0.027), ('grave', 0.027), ('hardy', 0.027), ('predictors', 0.026), ('constraint', 0.026), ('composite', 0.026), ('accelerated', 0.026), ('dual', 0.025), ('bounded', 0.025), ('sparsityinducing', 0.025), ('disappointing', 0.025), ('replications', 0.025), ('nonorthogonal', 0.025), ('xwt', 0.025), ('keerthi', 0.025), ('qd', 0.025), ('african', 0.025), ('looseness', 0.025), ('conform', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 319 nips-2012-Sparse Prediction with the $k$-Support Norm
Author: Andreas Argyriou, Rina Foygel, Nathan Srebro
Abstract: We derive a novel norm that corresponds to the tightest convex relaxation of sparsity combined with an 2 penalty. We show that this new k-support norm provides a tighter relaxation than the elastic net and can thus be advantageous in in sparse prediction problems. We also bound the looseness of the elastic net, thus shedding new light on it and providing justification for its use. 1
2 0.24253729 208 nips-2012-Matrix reconstruction with the local max norm
Author: Rina Foygel, Nathan Srebro, Ruslan Salakhutdinov
Abstract: We introduce a new family of matrix norms, the “local max” norms, generalizing existing methods such as the max norm, the trace norm (nuclear norm), and the weighted or smoothed weighted trace norms, which have been extensively used in the literature as regularizers for matrix reconstruction problems. We show that this new family can be used to interpolate between the (weighted or unweighted) trace norm and the more conservative max norm. We test this interpolation on simulated data and on the large-scale Netflix and MovieLens ratings data, and find improved accuracy relative to the existing matrix norms. We also provide theoretical results showing learning guarantees for some of the new norms. 1
3 0.15363544 312 nips-2012-Simultaneously Leveraging Output and Task Structures for Multiple-Output Regression
Author: Piyush Rai, Abhishek Kumar, Hal Daume
Abstract: Multiple-output regression models require estimating multiple parameters, one for each output. Structural regularization is usually employed to improve parameter estimation in such models. In this paper, we present a multiple-output regression model that leverages the covariance structure of the latent model parameters as well as the conditional covariance structure of the observed outputs. This is in contrast with existing methods that usually take into account only one of these structures. More importantly, unlike some of the other existing methods, none of these structures need be known a priori in our model, and are learned from the data. Several previously proposed structural regularization based multiple-output regression models turn out to be special cases of our model. Moreover, in addition to being a rich model for multiple-output regression, our model can also be used in estimating the graphical model structure of a set of variables (multivariate outputs) conditioned on another set of variables (inputs). Experimental results on both synthetic and real datasets demonstrate the effectiveness of our method. 1
4 0.15250382 247 nips-2012-Nonparametric Reduced Rank Regression
Author: Rina Foygel, Michael Horrell, Mathias Drton, John D. Lafferty
Abstract: We propose an approach to multivariate nonparametric regression that generalizes reduced rank regression for linear models. An additive model is estimated for each dimension of a q-dimensional response, with a shared p-dimensional predictor variable. To control the complexity of the model, we employ a functional form of the Ky-Fan or nuclear norm, resulting in a set of function estimates that have low rank. Backfitting algorithms are derived and justified using a nonparametric form of the nuclear norm subdifferential. Oracle inequalities on excess risk are derived that exhibit the scaling behavior of the procedure in the high dimensional setting. The methods are illustrated on gene expression data. 1
5 0.14932236 64 nips-2012-Calibrated Elastic Regularization in Matrix Completion
Author: Tingni Sun, Cun-hui Zhang
Abstract: This paper concerns the problem of matrix completion, which is to estimate a matrix from observations in a small subset of indices. We propose a calibrated spectrum elastic net method with a sum of the nuclear and Frobenius penalties and develop an iterative algorithm to solve the convex minimization problem. The iterative algorithm alternates between imputing the missing entries in the incomplete matrix by the current guess and estimating the matrix by a scaled soft-thresholding singular value decomposition of the imputed matrix until the resulting matrix converges. A calibration step follows to correct the bias caused by the Frobenius penalty. Under proper coherence conditions and for suitable penalties levels, we prove that the proposed estimator achieves an error bound of nearly optimal order and in proportion to the noise level. This provides a unified analysis of the noisy and noiseless matrix completion problems. Simulation results are presented to compare our proposal with previous ones. 1
6 0.1442789 164 nips-2012-Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation
7 0.11938702 29 nips-2012-Accelerated Training for Matrix-norm Regularization: A Boosting Approach
8 0.10902901 27 nips-2012-A quasi-Newton proximal splitting method
9 0.10431816 199 nips-2012-Link Prediction in Graphs with Autoregressive Features
10 0.10306867 282 nips-2012-Proximal Newton-type methods for convex optimization
11 0.098034933 6 nips-2012-A Convex Formulation for Learning Scale-Free Networks via Submodular Relaxation
12 0.097443849 290 nips-2012-Recovery of Sparse Probability Measures via Convex Programming
13 0.086525261 16 nips-2012-A Polynomial-time Form of Robust Regression
14 0.083945014 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance
15 0.082379699 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods
16 0.076670401 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model
17 0.075833201 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models
18 0.074028097 304 nips-2012-Selecting Diverse Features via Spectral Regularization
19 0.070440397 86 nips-2012-Convex Multi-view Subspace Learning
20 0.069096506 330 nips-2012-Supervised Learning with Similarity Functions
topicId topicWeight
[(0, 0.194), (1, 0.055), (2, 0.117), (3, -0.108), (4, 0.102), (5, 0.12), (6, -0.024), (7, 0.011), (8, 0.049), (9, -0.043), (10, -0.002), (11, 0.053), (12, -0.076), (13, 0.049), (14, 0.082), (15, 0.027), (16, 0.09), (17, 0.019), (18, 0.037), (19, -0.107), (20, 0.003), (21, 0.103), (22, -0.018), (23, -0.004), (24, -0.1), (25, -0.09), (26, 0.072), (27, 0.021), (28, -0.011), (29, -0.029), (30, -0.034), (31, -0.031), (32, -0.035), (33, 0.039), (34, 0.007), (35, 0.226), (36, 0.046), (37, -0.068), (38, -0.094), (39, -0.079), (40, -0.098), (41, 0.012), (42, 0.075), (43, 0.089), (44, 0.085), (45, 0.123), (46, -0.054), (47, 0.015), (48, 0.061), (49, -0.003)]
simIndex simValue paperId paperTitle
same-paper 1 0.95621717 319 nips-2012-Sparse Prediction with the $k$-Support Norm
Author: Andreas Argyriou, Rina Foygel, Nathan Srebro
Abstract: We derive a novel norm that corresponds to the tightest convex relaxation of sparsity combined with an 2 penalty. We show that this new k-support norm provides a tighter relaxation than the elastic net and can thus be advantageous in in sparse prediction problems. We also bound the looseness of the elastic net, thus shedding new light on it and providing justification for its use. 1
2 0.86292636 208 nips-2012-Matrix reconstruction with the local max norm
Author: Rina Foygel, Nathan Srebro, Ruslan Salakhutdinov
Abstract: We introduce a new family of matrix norms, the “local max” norms, generalizing existing methods such as the max norm, the trace norm (nuclear norm), and the weighted or smoothed weighted trace norms, which have been extensively used in the literature as regularizers for matrix reconstruction problems. We show that this new family can be used to interpolate between the (weighted or unweighted) trace norm and the more conservative max norm. We test this interpolation on simulated data and on the large-scale Netflix and MovieLens ratings data, and find improved accuracy relative to the existing matrix norms. We also provide theoretical results showing learning guarantees for some of the new norms. 1
3 0.77276987 247 nips-2012-Nonparametric Reduced Rank Regression
Author: Rina Foygel, Michael Horrell, Mathias Drton, John D. Lafferty
Abstract: We propose an approach to multivariate nonparametric regression that generalizes reduced rank regression for linear models. An additive model is estimated for each dimension of a q-dimensional response, with a shared p-dimensional predictor variable. To control the complexity of the model, we employ a functional form of the Ky-Fan or nuclear norm, resulting in a set of function estimates that have low rank. Backfitting algorithms are derived and justified using a nonparametric form of the nuclear norm subdifferential. Oracle inequalities on excess risk are derived that exhibit the scaling behavior of the procedure in the high dimensional setting. The methods are illustrated on gene expression data. 1
4 0.69974792 29 nips-2012-Accelerated Training for Matrix-norm Regularization: A Boosting Approach
Author: Xinhua Zhang, Dale Schuurmans, Yao-liang Yu
Abstract: Sparse learning models typically combine a smooth loss with a nonsmooth penalty, such as trace norm. Although recent developments in sparse approximation have offered promising solution methods, current approaches either apply only to matrix-norm constrained problems or provide suboptimal convergence rates. In this paper, we propose a boosting method for regularized learning that guarantees accuracy within O(1/ ) iterations. Performance is further accelerated by interlacing boosting with fixed-rank local optimization—exploiting a simpler local objective than previous work. The proposed method yields state-of-the-art performance on large-scale problems. We also demonstrate an application to latent multiview learning for which we provide the first efficient weak-oracle. 1
5 0.66322434 86 nips-2012-Convex Multi-view Subspace Learning
Author: Martha White, Xinhua Zhang, Dale Schuurmans, Yao-liang Yu
Abstract: Subspace learning seeks a low dimensional representation of data that enables accurate reconstruction. However, in many applications, data is obtained from multiple sources rather than a single source (e.g. an object might be viewed by cameras at different angles, or a document might consist of text and images). The conditional independence of separate sources imposes constraints on their shared latent representation, which, if respected, can improve the quality of a learned low dimensional representation. In this paper, we present a convex formulation of multi-view subspace learning that enforces conditional independence while reducing dimensionality. For this formulation, we develop an efficient algorithm that recovers an optimal data reconstruction by exploiting an implicit convex regularizer, then recovers the corresponding latent representation and reconstruction model, jointly and optimally. Experiments illustrate that the proposed method produces high quality results. 1
6 0.59151196 221 nips-2012-Multi-Stage Multi-Task Feature Learning
7 0.56562155 64 nips-2012-Calibrated Elastic Regularization in Matrix Completion
8 0.56508923 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance
9 0.55345631 212 nips-2012-Minimax Multi-Task Learning and a Generalized Loss-Compositional Paradigm for MTL
10 0.55213976 312 nips-2012-Simultaneously Leveraging Output and Task Structures for Multiple-Output Regression
11 0.53107679 16 nips-2012-A Polynomial-time Form of Robust Regression
12 0.52395231 244 nips-2012-Nonconvex Penalization Using Laplace Exponents and Concave Conjugates
14 0.51327145 199 nips-2012-Link Prediction in Graphs with Autoregressive Features
15 0.50282007 76 nips-2012-Communication-Efficient Algorithms for Statistical Optimization
16 0.50274575 125 nips-2012-Factoring nonnegative matrices with linear programs
17 0.49506187 164 nips-2012-Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation
18 0.48584071 327 nips-2012-Structured Learning of Gaussian Graphical Models
19 0.48161814 254 nips-2012-On the Sample Complexity of Robust PCA
20 0.46440923 290 nips-2012-Recovery of Sparse Probability Measures via Convex Programming
topicId topicWeight
[(0, 0.057), (4, 0.146), (21, 0.041), (38, 0.192), (42, 0.027), (53, 0.016), (54, 0.017), (55, 0.074), (74, 0.031), (76, 0.178), (80, 0.095), (92, 0.043)]
simIndex simValue paperId paperTitle
same-paper 1 0.91850227 319 nips-2012-Sparse Prediction with the $k$-Support Norm
Author: Andreas Argyriou, Rina Foygel, Nathan Srebro
Abstract: We derive a novel norm that corresponds to the tightest convex relaxation of sparsity combined with an 2 penalty. We show that this new k-support norm provides a tighter relaxation than the elastic net and can thus be advantageous in in sparse prediction problems. We also bound the looseness of the elastic net, thus shedding new light on it and providing justification for its use. 1
2 0.90854031 140 nips-2012-Fusion with Diffusion for Robust Visual Tracking
Author: Yu Zhou, Xiang Bai, Wenyu Liu, Longin J. Latecki
Abstract: A weighted graph is used as an underlying structure of many algorithms like semisupervised learning and spectral clustering. If the edge weights are determined by a single similarity measure, then it hard if not impossible to capture all relevant aspects of similarity when using a single similarity measure. In particular, in the case of visual object matching it is beneficial to integrate different similarity measures that focus on different visual representations. In this paper, a novel approach to integrate multiple similarity measures is proposed. First pairs of similarity measures are combined with a diffusion process on their tensor product graph (TPG). Hence the diffused similarity of each pair of objects becomes a function of joint diffusion of the two original similarities, which in turn depends on the neighborhood structure of the TPG. We call this process Fusion with Diffusion (FD). However, a higher order graph like the TPG usually means significant increase in time complexity. This is not the case in the proposed approach. A key feature of our approach is that the time complexity of the diffusion on the TPG is the same as the diffusion process on each of the original graphs. Moreover, it is not necessary to explicitly construct the TPG in our framework. Finally all diffused pairs of similarity measures are combined as a weighted sum. We demonstrate the advantages of the proposed approach on the task of visual tracking, where different aspects of the appearance similarity between the target object in frame t − 1 and target object candidates in frame t are integrated. The obtained method is tested on several challenge video sequences and the experimental results show that it outperforms state-of-the-art tracking methods. 1
3 0.87765306 279 nips-2012-Projection Retrieval for Classification
Author: Madalina Fiterau, Artur Dubrawski
Abstract: In many applications, classification systems often require human intervention in the loop. In such cases the decision process must be transparent and comprehensible, simultaneously requiring minimal assumptions on the underlying data distributions. To tackle this problem, we formulate an axis-aligned subspace-finding task under the assumption that query specific information dictates the complementary use of the subspaces. We develop a regression-based approach called RECIP that efficiently solves this problem by finding projections that minimize a nonparametric conditional entropy estimator. Experiments show that the method is accurate in identifying the informative projections of the dataset, picking the correct views to classify query points, and facilitates visual evaluation by users. 1 Introduction and problem statement In the domain of predictive analytics, many applications which keep human users in the loop require the use of simple classification models. Often, it is required that a test-point be ‘explained’ (classified) using a simple low-dimensional projection of the original feature space. This is a Projection Retrieval for Classification problem (PRC). The interaction with the user proceeds as follows: the user provides the system a query point; the system searches for a projection in which the point can be accurately classified; the system displays the classification result as well as an illustration of how the classification decision was reached in the selected projection. Solving the PRC problem is relevant in many practical applications. For instance, consider a nuclear threat detection system installed at a border check point. Vehicles crossing the border are scanned with sensors so that a large array of measurements of radioactivity and secondary contextual information is being collected. These observations are fed into a classification system that determines whether the scanned vehicle may carry a threat. Given the potentially devastating consequences of a false negative, a border control agent is requested to validate the prediction and decide whether to submit the vehicle for a costly further inspection. With the positive classification rate of the system under strict bounds because of limitations in the control process, the risk of false negatives is increased. Despite its crucial role, human intervention should only be withheld for cases in which there are reasons to doubt the validity of classification. In order for a user to attest the validity of a decision, the user must have a good understanding of the classification process, which happens more readily when the classifier only uses the original dataset features rather than combinations of them, and when the discrimination models are low-dimensional. In this context, we aim to learn a set of classifiers in low-dimensional subspaces and a decision function which selects the subspace under which a test point is to be classified. Assume we are given a dataset {(x1 , y1 ) . . . (xn , yn )} ∈ X n × {0, 1}n and a class of discriminators H. The model will contain a set Π of subspaces of X ; Π ⊆ Π, where Π is the set of all axis-aligned subspaces of the original feature space, the power set of the features. To each projection πi ∈ Π corresponds one discriminator from a given hypothesis space hi ∈ H. It will also contain a selection function g : X → Π × H, which yields, for a query point x, the projection/discriminator pair with which this point will be classified. 
The notation π(x) refers to the projection of the point x onto the subspace 1 π while h(π(x)) represents the predicted label for x. Formally, we describe the model class as Md = {Π = {π : π ∈ Π, dim(π) ≤ d}, H = {hi : hi ∈ H, h : πi → Y, ∀i = 1 . . . |Π|}, g ∈ {f : X → {1 . . . |Π|}} . where dim(π) presents the dimensionality of the subspace determined by the projection π. Note that only projections up to size d will be considered, where d is a parameter specific to the application. The set H contains one discriminator from the hypothesis class H for each projection. Intuitively, the aim is to minimize the expected classification error over Md , however, a notable modification is that the projection and, implicitly, the discriminator, are chosen according to the data point that needs to be classified. Given a query x in the space X , g(x) will yield the subspace πg(x) onto which the query is projected and the discriminator hg(x) for it. Distinct test points can be handled using different combinations of subspaces and discriminators. We consider models that minimize 0/1 loss. Hence, the PRC problem can be stated as follows: M ∗ = arg min EX ,Y y = hg(x) (πg(x) (x)) M ∈Md There are limitations to the type of selection function g that can be learned. A simple example for which g can be recovered is a set of signal readings x for which, if one of the readings xi exceeds a threshold ti , the label can be predicted just based on xi . A more complex one is a dataset containing regulatory variables, that is, for xi in the interval [ak , bk ] the label only depends on (x1 . . . xnk ) k k datasets that fall into the latter category fulfill what we call the Subspace-Separability Assumption. This paper proposes an algorithm called RECIP that solves the PRC problem for a class of nonparametric classifiers. We evaluate the method on artificial data to show that indeed it correctly identifies the underlying structure for data satisfying the Subspace-Separability Assumption. We show some case studies to illustrate how RECIP offers insight into applications requiring human intervention. The use of dimensionality reduction techniques is a common preprocessing step in applications where the use of simplified classification models is preferable. Methods that learn linear combinations of features, such as Linear Discriminant Analysis, are not quite appropriate for the task considered here, since we prefer to natively rely on the dimensions available in the original feature space. Feature selection methods, such as e.g. lasso, are suitable for identifying sets of relevant features, but do not consider interactions between them. Our work better fits the areas of class dependent feature selection and context specific classification, highly connected to the concept of Transductive Learning [6]. Other context-sensitive methods are Lazy and Data-Dependent Decision Trees, [5] and [10] respectively. In Ting et al [14], the Feating submodel selection relies on simple attribute splits followed by fitting local predictors, though the algorithm itself is substantially different. Obozinski et al present a subspace selection method in the context of multitask learning [11]. Go et al propose a joint method for feature selection and subspace learning [7], however, their classification model is not particularly query specific. Alternatively, algorithms that transform complex or unintelligible models with user-friendly equivalents have been proposed [3, 2, 1, 8]. 
Algorithms specifically designed to yield understandable models are a precious few. Here we note a rule learning method described in [12], even though the resulting rules can make visualization difficult, while itemset mining [9] is not specifically designed for classification. Unlike those approaches, our method is designed to retrieve subsets of the feature space designed for use in a way that is complementary to the basic task at hand (classification) while providing query-specific information. 2 Recovering informative projections with RECIP To solve PRC, we need means by which to ascertain which projections are useful in terms of discriminating data from the two classes. Since our model allows the use of distinct projections depending on the query point, it is expected that each projection would potentially benefit different areas of the feature space. A(π) refers to the area of the feature space where the projection π is selected. A(π) = {x ∈ X : πg(x) = π} The objective becomes min E(X ×Y) y = hg(x) (πg(x) (x)) M ∈Md = p(A(π))E y = hg(x) (πg(x) (x))|x ∈ A(π) min M ∈Md 2 π∈Π . The expected classification error over A(π) is linked to the conditional entropy of Y |X. Fano’s inequality provides a lower bound on the error while Feder and Merhav [4] derive a tight upper bound on the minimal error probability in terms of the entropy. This means that conditional entropy characterizes the potential of a subset of the feature space to separate data, which is more generic than simply quantifying classification accuracy for a specific discriminator. In view of this connection between classification accuracy and entropy, we adapt the objective to: p(A(π))H(Y |π(X); X ∈ A(π)) min M ∈Md (1) π∈Π The method we propose optimizes an empirical analog of (1) which we develop below and for which we will need the following result. Proposition 2.1. Given a continuous variable X ∈ X and a binary variable Y , where X is sampled from the mixture model f (x) = p(y = 0)f0 (x) + p(y = 1)f1 (x) = p0 f0 (x) + p1 f1 (x) , then H(Y |X) = −p0 log p0 − p1 log p1 − DKL (f0 ||f ) − DKL (f1 ||f ) . Next, we will use the nonparametric estimator presented in [13] for Tsallis α-divergence. Given samples Ui ∼ U, with i = 1, n and Vj ∼ V with j = 1, m, the divergence is estimated as follows: ˆ Tα (U ||V ) = 1 1 1−α n n (n − 1)νk (Ui , U \ ui )d mνk (Ui , V )d i=1 1−α B(k, α) − 1 , (2) where d is the dimensionality of the variables U and V and νk (z, Z) represents the distance from z ˆ to its k th nearest neighbor of the set of points Z. For α ≈ 1 and n → ∞, Tα (u||v) ≈ DKL (u||v). 2.1 Local estimators of entropy We will now plug (2) in the formula obtained by Proposition 2.1 to estimate the quantity (1). We use the notation X0 to represent the n0 samples from X which have the labels Y equal to 0, and X1 to represent the n1 samples from X which have the labels set to 1. Also, Xy(x) represents the set of samples that have labels equal to the label of x and X¬y(x) the data that have labels opposite to the label of x. 
ˆ ˆ x ˆ x H(Y |X; X ∈ A) = −H(p0 ) − H(p1 ) − T (f0 ||f x ) − T (f1 ||f x ) + C α≈1 ˆ H(Y |X; X ∈ A) ∝ 1 n0 + 1 n1 ∝ 1 n0 + 1 n1 ∝ 1 n n0 (n0 − 1)νk (xi , X0 \ xi )d nνk (xi , X \ xi )d 1−α I[xi ∈ A] (n1 − 1)νk (xi , X1 \ xi )d nνk (xi , X \ xi )d 1−α I[xi ∈ A] (n0 − 1)νk (xi , X0 \ xi )d nνk (xi , X1 \ xi )d 1−α I[xi ∈ A] (n1 − 1)νk (xi , X1 \ xi )d nνk (xi , X0 \ xi )d 1−α I[xi ∈ A] i=1 n1 i=1 n0 i=1 n1 i=1 n I[xi ∈ A] i=1 (n − 1)νk (xi , Xy(xi ) \ xi )d nνk (xi , X¬y(xi ) \ xi )d 1−α The estimator for the entropy of the data that is classified with projection π is as follows: ˆ H(Y |π(X); X ∈ A(π)) ∝ 1 n n I[xi ∈ A(π)] i=1 (n − 1)νk (π(xi ), π(Xy(xi ) ) \ π(xi ))d nνk (π(xi ), π(X¬y(xi ) \ xi ))d 1−α (3) From 3 and using the fact that I[xi ∈ A(π)] = I[πg(xi ) = π] for which we use the notation I[g(xi ) → π], we estimate the objective as min M ∈Md π∈Π 1 n n I[g(xi ) → π] i=1 (n − 1)νk (π(xi ), π(Xy(xi ) ) \ π(xi ))d nνk (π(xi ), π(X¬y(xi ) \ xi ))d 3 1−α (4) Therefore, the contribution of each data point to the objective corresponds to a distance ratio on the projection π ∗ where the class of the point is obtained with the highest confidence (data is separable in the neighborhood of the point). We start by computing the distance-based metric of each point on each projection of size up to d - there are d∗ such projections. This procedure yields an extended set of features Z, which we name local entropy estimates: Zij = νk (πj (xi ), πj (Xy(xi ) ) \ πj (xi )) νk (πj (xi ), πj (X¬y(xi ) ) \ πj (xi )) d(1−α) α≈1 j ∈ {1 . . . d∗ } (5) For each training data point, we compute the best distance ratio amid all the projections, which is simply Ti = minj∈[d∗ ] Zij . The objective can be then further rewritten as a function of the entropy estimates: n I[g(xi ) → πj ]Zij min M ∈Md (6) i=1 πj ∈Π From the definition of T, it is also clear that n n I[g(xi ) → πj ]Zij min M ∈Md 2.2 ≥ i=1 πj ∈Π Ti . (7) i=1 Projection selection as a combinatorial problem Considering form (6) of the objective, and given that the estimates Zij are constants, depending only on the training set, the projection retrieval problem is reduced to finding g for all training points, which will implicitly select the projection set of the model. Naturally, one might assume the bestperforming classification model is the one containing all the axis-aligned subspaces. This model achieves the lower bound (7) for the training set. However, the larger the set of projections, the more values the function g takes, and thus the problem of selecting the correct projection becomes more difficult. It becomes apparent that the number of projections should be somehow restricted to allow intepretability. Assuming a hard threshold of at most t projections, the optimization (6) becomes an entry selection problem over matrix Z where one value must be picked from each row under a limitation on the number of columns that can be used. This problem cannot be solved exactly in polynomial time. Instead, it can be formulated as an optimization problem under 1 constraints. 2.3 Projection retrieval through regularized regression To transform the projection retrieval to a regression problem we consider T, the minimum obtainable value of the entropy estimator for each point, as the output which the method needs to predict. Each row i of the parameter matrix B represents the degrees to which the entropy estimates on each projection contribute to the entropy estimator of point xi . 
Thus, the sum over each row of B is 1, and the regularization penalty applies to the number of non-zero columns in B. d∗ min ||T − (Z B B)J|Π|,1 ||2 2 +λ [Bi = 0] (8) i=1 subject to |Bk | 1 = 1 k = 1, n where (Z B)ij = Zij + Bij and J is a matrix of ones. The problem with this optimization is that it is not convex. A typical walk-around of this issue is to use the convex relaxation for Bi = 0, that is 1 norm. This would transform the penalized term d∗ d∗ n to i=1 |Bi | 1 . However, i=1 |Bi | 1 = k=1 |Bk | 1 = n , so this penalty really has no effect. An alternative mechanism to encourage the non-zero elements in B to populate a small number of columns is to add a penalty term in the form of Bδ, where δ is a d∗ -size column vector with each element representing the penalty for a column in B. With no prior information about which subspaces are more informative, δ starts as an all-1 vector. An initial value for B is obtained through the optimization (8). Since our goal is to handle data using a small number of projections, δ is then updated such that its value is lower for the denser columns in B. This update resembles the reweighing in adaptive lasso. The matrix B itself is updated, and this 2-step process continues until convergence of δ. Once δ converges, the projections corresponding to the non-zero columns of B are added to the model. The procedure is shown in Algorithm 1. 4 Algorithm 1: RECIP δ = [1 . . . 1] repeat |P I| b = arg minB ||T − i=1 < Z, B > ||2 + λ|Bδ| 1 2 subject to |Bk | 1 = 1 k = 1...n δk = |Bi | 1 i = . . . d∗ (update the differential penalty) δ δ = 1 − |δ| 1 until δ converges return Π = {πi ; |Bi | 1 > 0 ∀i = 1 . . . d∗ } 2.4 Lasso for projection selection We will compare our algorithm to lasso regularization that ranks the projections in terms of their potential for data separability. We write this as an 1 -penalized optimization on the extended feature set Z, with the objective T : minβ |T − Zβ|2 + λ|β| 1 . The lasso penalty to the coefficient vector encourages sparsity. For a high enough λ, the sparsity pattern in β is indicative of the usefulness of the projections. The lasso on entropy contributions was not found to perform well as it is not query specific and will find one projection for all data. We improved it by allowing it to iteratively find projections - this robust version offers increased performance by reweighting the data thus focusing on different subsets of it. Although better than running lasso on entropy contributions, the robust lasso does not match RECIP’s performance as the projections are selected gradually rather than jointly. Running the standard lasso on the original design matrix yields a set of relevant variables and it is not immediately clear how the solution would translate to the desired class. 2.5 The selection function Once the projections are selected, the second stage of the algorithm deals with assigning the projection with which to classify a particular query point. An immediate way of selecting the correct projection starts by computing the local entropy estimator for each subspace with each class assignment. Then, we may select the label/subspace combination that minimizes the empirical entropy. (i∗ , θ∗ ) = arg min i,θ 3 νk (πi (x), πi (Xθ )) νk (πi (x), πi (X¬θ )) dim(πi )(1−α) i = 1 . . . d∗ , α≈1 (9) Experimental results In this section we illustrate the capability of RECIP to retrieve informative projections of data and their use in support of interpreting results of classification. 
First, we analyze how well RECIP can identify subspaces in synthetic data whose distribution obeys the subspace separability assumption (3.1). As a point of reference, we also present classification accuracy results (3.2) for both the synthetic data and a few real-world sets. This is to quantify the extent of the trade-off between fidelity of attainable classifiers and desired informativeness of the projections chosen by RECIP. We expect RECIP’s classification performance to be slightly, but not substantially worse when compared to relevant classification algorithms trained to maximize classification accuracy. Finally, we present a few examples (3.3) of informative projections recovered from real-world data and their utility in explaining to human users the decision processes applied to query points. A set of artificial data used in our experiments contains q batches of data points, each of them made classifiable with high accuracy using one of available 2-dimensional subspaces (x1 , x2 ) with k ∈ k k {1 . . . q}. The data in batch k also have the property that x1 > tk . This is done such that the group a k point belongs to can be detected from x1 , thus x1 is a regulatory variable. We control the amount of k k noise added to thusly created synthetic data by varying the proportion of noisy data points in each batch. The results below are for datasets with 7 features each, with number of batches q ranging between 1 and 7. We kept the number of features specifically low in order to prevent excessive variation between any two sets generated this way, and to enable computing meaningful estimates of the expectation and variance of performance, while enabling creation of complicated data in which synthetic patterns may substantially overlap (using 7 features and 7 2-dimensional patterns implies that dimensions of at least 4 of the patterns will overlap). We implemented our method 5 to be scalable to the size and dimensionality of data and although for brevity we do not include a discussion of this topic here, we have successfully run RECIP against data with 100 features. The parameter α is a value close to 1, because the Tsallis divergence converges to the KL divergence as alpha approaches 1. For the experiments on real-world data, d was set to n (all projections were considered). For the artificial data experiments, we reported results for d = 2 as they do not change significantly for d >= 2 because this data was synthesized to contain bidimensional informative projections. In general, if d is too low, the correct full set of projections will not be found, but it may be recovered partially. If d is chosen too high, there is a risk that a given selected projection p will contain irrelevant features compared to the true projection p0 . However, this situation only occurs if the noise introduced by these features in the estimators makes the entropy contributions on p and p0 statistically indistinguishable for a large subset of data. The users will choose d according to the desired/acceptable complexity of the resulting model. If the results are to be visually interpreted by a human, values of 2 or 3 are reasonable for d. 3.1 Recovering informative projections Table 1 shows how well RECIP recovers the q subspaces corresponding to the synthesized batches of data. We measure precision (proportion of the recovered projections that are known to be informative), and recall (proportion of known informative projections that are recovered by the algorithm). 
in Table 1, rows correspond to the number of distinct synthetic batches injected in data, q, and subsequent columns correspond to the increasing amounts of noise in data. We note that the observed precision is nearly perfect: the algorithm makes only 2 mistakes over the entire set of experiments, and those occur for highly noisy setups. The recall is nearly perfect as long as there is little overlap among the dimensions, that is when the injections do not interfere with each other. As the number of projections increases, the chances for overlap among the affected features also increase, which makes the data more confusing resulting on a gradual drop of recall until only about 3 or 4 of the 7 known to be informative subspaces can be recovered. We have also used lasso as described in 2.4 in an attempt to recover projections. This setup only manages to recover one of the informative subspaces, regardless of how the regularization parameter is tuned. 3.2 Classification accuracy Table 2 shows the classification accuracy of RECIP, obtained using synthetic data. As expected, the observed performance is initially high when there are few known informative projections in data and it decreases as noise and ambiguity of the injected patterns increase. Most types of ensemble learners would use a voting scheme to arrive at the final classification of a testing sample, rather than use a model selection scheme. For this reason, we have also compared predictive accuracy revealed by RECIP against a method based on majority voting among multiple candidate subspaces. Table 4 shows that the accuracy of this technique is lower than the accuracy of RECIP, regardless of whether the informative projections are recovered by the algorithm or assumed to be known a priori. This confirms the intuition that a selection-based approach can be more effective than voting for data which satisfies the subspace separability assumption. For reference, we have also classified the synthetic data using K-Nearest-Neighbors algorithm using all available features at once. The results of that experiment are shown in Table 5. Since RECIP uses neighbor information, K-NN is conceptually the closest among the popular alternatives. Compared to RECIP, K-NN performs worse when there are fewer synthetic patterns injected in data to form informative projections. It is because some features used then by K-NN are noisy. As more features become informative, the K-NN accuracy improves. This example shows the benefit of a selective approach to feature space and using a subset of the most explanatory projections to support not only explanatory analyses but also classification tasks in such circumstances. 3.3 RECIP case studies using real-world data Table 3 summarizes the RECIP and K-NN performance on UCI datasets. We also test the methods using Cell dataset containing a set of measurements such as the area and perimeter biological cells with separate labels marking cells subjected to treatment and control cells. In Vowel data, the nearest-neighbor approach works exceptionally well, even outperforming random forests (0.94 accuracy), which is an indication that all features are jointly relevant. For d lower than the number of features, RECIP picks projections of only one feature, but if there is no such limitation, RECIP picks the space of all the features as informative. 6 Table 1: Projection recovery for artificial datasets with 1 . . . 7 informative features and noise level 0 . . . 0.2 in terms of mean and variance of Precision and Recall. 
Table 1: Projection recovery for artificial datasets with 1 . . . 7 informative features and noise levels 0 . . . 0.2, in terms of the mean and variance of precision and recall. The mean and variance are obtained for each setting by repeating the experiment with datasets containing different informative projections.

PRECISION (mean)
q \ noise      0        0.02     0.05     0.1      0.2
1              1        1        1        0.9286   0.9286
2              1        1        1        1        1
3              1        1        1        1        1
4              1        1        1        1        1
5              1        1        1        1        1
6              1        1        1        1        1
7              1        1        1        1        1

PRECISION (variance)
q \ noise      0        0.02     0.05     0.1      0.2
1              0        0        0        0.0306   0.0306
2              0        0        0        0        0
3              0        0        0        0        0
4              0        0        0        0        0
5              0        0        0        0        0
6              0        0        0        0        0
7              0        0        0        0        0

RECALL (mean)
q \ noise      0        0.02     0.05     0.1      0.2
1              1        1        1        1        1
2              1        1        1        1        1
3              1        1        0.9524   0.9524   1
4              0.9643   0.9643   0.9643   0.9643   0.9286
5              0.7714   0.7429   0.8286   0.8571   0.7714
6              0.6429   0.6905   0.6905   0.6905   0.6905
7              0.6327   0.5918   0.5918   0.5714   0.551

RECALL (variance)
q \ noise      0        0.02     0.05     0.1      0.2
1              0        0        0        0        0
2              0        0        0        0        0
3              0        0        0.0136   0.0136   0
4              0.0077   0.0077   0.0077   0.0077   0.0128
5              0.0163   0.0196   0.0049   0.0082   0.0278
6              0.0113   0.0113   0.0272   0.0113   0.0113
7              0.0225   0.02     0.0258   0.0233   0.02

Table 2: RECIP classification accuracy on artificial data.

CLASSIFICATION ACCURACY (mean)
q \ noise      0        0.02     0.05     0.1      0.2
1              0.9751   0.9731   0.9686   0.9543   0.9420
2              0.9333   0.9297   0.9227   0.9067   0.8946
3              0.9053   0.8967   0.8764   0.8640   0.8618
4              0.8725   0.8685   0.8589   0.8454   0.8187
5              0.8113   0.8009   0.8105   0.8105   0.7782
6              0.7655   0.7739   0.7669   0.7632   0.7511
7              0.7534   0.7399   0.7347   0.7278   0.7205

CLASSIFICATION ACCURACY (variance)
q \ noise      0        0.02     0.05     0.1      0.2
1              0.0000   0.0000   0.0000   0.0008   0.0007
2              0.0001   0.0001   0.0001   0.0001   0.0001
3              0.0004   0.0005   0.0016   0.0028   0.0007
4              0.0020   0.0020   0.0019   0.0025   0.0032
5              0.0042   0.0044   0.0033   0.0036   0.0044
6              0.0025   0.0021   0.0026   0.0025   0.0027
7              0.0034   0.0040   0.0042   0.0042   0.0045

CLASSIFICATION ACCURACY - KNOWN PROJECTIONS (mean)
q \ noise      0        0.02     0.05     0.1      0.2
1              0.9751   0.9731   0.9686   0.9637   0.9514
2              0.9333   0.9297   0.9227   0.9067   0.8946
3              0.9053   0.8967   0.8914   0.8777   0.8618
4              0.8820   0.8781   0.8657   0.8541   0.8331
5              0.8714   0.8641   0.8523   0.8429   0.8209
6              0.8566   0.8497   0.8377   0.8285   0.8074
7              0.8429   0.8371   0.8256   0.8122   0.7988

CLASSIFICATION ACCURACY - KNOWN PROJECTIONS (variance)
q \ noise      0        0.02     0.05     0.1      0.2
1              0.0000   0.0000   0.0000   0.0001   0.0000
2              0.0001   0.0001   0.0001   0.0001   0.0001
3              0.0004   0.0005   0.0005   0.0007   0.0007
4              0.0011   0.0011   0.0014   0.0014   0.0020
5              0.0015   0.0015   0.0018   0.0019   0.0023
6              0.0014   0.0015   0.0016   0.0023   0.0021
7              0.0015   0.0018   0.0018   0.0021   0.0020

Table 3: Accuracy of K-NN and RECIP.
Dataset              KNN      RECIP
Breast Cancer Wis    0.8415   0.8275
Breast Tissue        1.0000   1.0000
Cell                 0.7072   0.7640
MiniBOONE*           0.7896   0.7396
Spam                 0.7680   0.7680
Vowel                0.9839   0.9839

In the Spam data, the two most informative projections are 'Capital Length Total' (CLT)/'Capital Length Longest' (CLL) and CLT/'Frequency of word your' (FWY). Figure 1 shows these two projections, with the dots representing training points. The red dots represent points labeled as spam, while the blue ones are non-spam. The circles are query points that have been assigned to be classified with the projection in which they are plotted. The green circles are correctly classified points, while the magenta circles, far fewer in number, are the incorrectly classified ones. Not only does the importance of text in capital letters make sense for a spam filtering dataset, but the points that select those projections are almost flawlessly classified. Additionally, if the user needed to attest to the validity of the classification in the first plot, he/she would have no trouble seeing that the circled data points are located in a region predominantly populated with examples of spam, so any non-spam entry appears suspicious. Both of the magenta-colored cases fall into this category, and they can therefore be flagged for further investigation.
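The selection model illustrated above (each query is classified only in the projection chosen for it) and the voting baseline of Table 4 can be contrasted with a short sketch. The per-point scores driving the selection are computed here with a simple plug-in label entropy of each query's neighborhood; this stands in for, and is not equivalent to, the α-divergence-based estimators used by RECIP, and all function names are illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors

def neighborhood_entropy(X_train, y_train, Xq, projection, k=10):
    """Label entropy of each query's k nearest training neighbors in one 2-D projection."""
    cols = list(projection)
    nn = NearestNeighbors(n_neighbors=k).fit(X_train[:, cols])
    _, idx = nn.kneighbors(Xq[:, cols])
    out = np.empty(len(Xq))
    for i, neigh in enumerate(idx):
        p = np.bincount(y_train[neigh]) / k
        p = p[p > 0]
        out[i] = -(p * np.log(p)).sum()        # low entropy = label-pure neighborhood
    return out

def selection_predict(X_train, y_train, Xq, projections, k=10):
    """Selection model: classify each query only in its lowest-entropy projection."""
    scores = np.column_stack([neighborhood_entropy(X_train, y_train, Xq, p, k)
                              for p in projections])
    choice = scores.argmin(axis=1)
    preds = np.empty(len(Xq), dtype=int)
    for j, p in enumerate(projections):
        idx = np.where(choice == j)[0]
        if idx.size:
            clf = KNeighborsClassifier(n_neighbors=k).fit(X_train[:, list(p)], y_train)
            preds[idx] = clf.predict(Xq[np.ix_(idx, list(p))])
    return preds, choice

def voting_predict(X_train, y_train, Xq, projections, k=10):
    """Voting baseline: majority vote over per-projection k-NN classifiers (binary labels)."""
    votes = np.stack([KNeighborsClassifier(n_neighbors=k)
                      .fit(X_train[:, list(p)], y_train)
                      .predict(Xq[:, list(p)]) for p in projections])
    return (votes.mean(axis=0) >= 0.5).astype(int)
```

Returning the `choice` index per query mirrors how Figure 1 is used: each query is displayed in the projection selected for it, so a human can verify the decision visually.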
Figure 1: Spam dataset selected subspaces. Left panel: Capital Run Length Longest vs. Capital Run Length Total; right panel: Frequency of word 'your' vs. Capital Run Length Total.

Table 4: Classification accuracy using RECIP-learned projections (upper section) or known projections (lower section) within a voting model instead of a selection model.

CLASSIFICATION ACCURACY - VOTING ENSEMBLE (mean)
q \ noise      0        0.02     0.05     0.1      0.2
1              0.9751   0.9731   0.9686   0.9317   0.9226
2              0.7360   0.7354   0.7331   0.7303   0.7257
3              0.7290   0.7266   0.7163   0.7166   0.7212
4              0.6934   0.6931   0.6932   0.6904   0.6867
5              0.6715   0.6602   0.6745   0.6688   0.6581
6              0.6410   0.6541   0.6460   0.6529   0.6512
7              0.6392   0.6342   0.6268   0.6251   0.6294

CLASSIFICATION ACCURACY - VOTING ENSEMBLE (variance)
q \ noise      0        0.02     0.05     0.1      0.2
1              0.0000   0.0000   0.0000   0.0070   0.0053
2              0.0002   0.0002   0.0001   0.0002   0.0001
3              0.0002   0.0002   0.0008   0.0006   0.0002
4              0.0008   0.0008   0.0008   0.0008   0.0009
5              0.0013   0.0014   0.0013   0.0014   0.0013
6              0.0008   0.0007   0.0010   0.0006   0.0005
7              0.0009   0.0011   0.0012   0.0012   0.0012

CLASSIFICATION ACCURACY - VOTING ENSEMBLE, KNOWN PROJECTIONS (mean)
q \ noise      0        0.02     0.05     0.1      0.2
1              0.9751   0.9731   0.9686   0.9637   0.9514
2              0.7360   0.7354   0.7331   0.7303   0.7257
3              0.7409   0.7385   0.7390   0.7353   0.7325
4              0.7110   0.7109   0.7083   0.7067   0.7035
5              0.7077   0.7070   0.7050   0.7034   0.7008
6              0.6816   0.6807   0.6801   0.6790   0.6747
7              0.6787   0.6783   0.6772   0.6767   0.6722

CLASSIFICATION ACCURACY - VOTING ENSEMBLE, KNOWN PROJECTIONS (variance)
q \ noise      0        0.02     0.05     0.1      0.2
1              0.0000   0.0000   0.0000   0.0001   0.0000
2              0.0002   0.0002   0.0001   0.0002   0.0001
3              0.0010   0.0012   0.0010   0.0011   0.0010
4              0.0041   0.0041   0.0042   0.0042   0.0043
5              0.0015   0.0015   0.0015   0.0016   0.0016
6              0.0008   0.0008   0.0008   0.0008   0.0009
7              0.0008   0.0009   0.0009   0.0008   0.0008

Table 5: Classification accuracy for artificial data with the K-Nearest Neighbors method.

CLASSIFICATION ACCURACY - KNN (mean)
q \ noise      0        0.02     0.05     0.1      0.2
1              0.7909   0.7843   0.7747   0.7652   0.7412
2              0.7940   0.7911   0.7861   0.7790   0.7655
3              0.7964   0.7939   0.7901   0.7854   0.7756
4              0.7990   0.7972   0.7942   0.7904   0.7828
5              0.8038   0.8024   0.8002   0.7970   0.7905
6              0.8043   0.8032   0.8015   0.7987   0.7930
7              0.8054   0.8044   0.8028   0.8004   0.7955

CLASSIFICATION ACCURACY - KNN (variance)
q \ noise      0        0.02     0.05     0.1      0.2
1              0.0002   0.0002   0.0002   0.0002   0.0002
2              0.0001   0.0001   0.0001   0.0001   0.0001
3              0.0000   0.0001   0.0001   0.0000   0.0000
4              0.0001   0.0001   0.0001   0.0001   0.0001
5              0.0001   0.0001   0.0001   0.0001   0.0001
6              0.0001   0.0001   0.0001   0.0001   0.0001
7              0.0001   0.0001   0.0001   0.0001   0.0001

4 Conclusion

This paper considers the problem of Projection Recovery for Classification. It is relevant in applications where the decision process must be easy to understand in order to enable human interpretation of the results. We have developed a principled, regression-based algorithm designed to recover small sets of low-dimensional subspaces that support interpretability. It optimizes the selection using individual data-point-specific entropy estimators. In this context, the proposed algorithm follows the idea of transductive learning, and the role of the resulting projections bears resemblance to the high-confidence regions known from conformal prediction models. Empirical results obtained using simulated and real-world data show the effectiveness of our method in finding informative projections that enable accurate classification while maintaining transparency of the underlying decision process.

Acknowledgments

This material is based upon work supported by the NSF under Grant No. IIS-0911032.
4 0.87637043 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model
Author: Yi Wu, David P. Wipf
Abstract: Sparse linear (or generalized linear) models combine a standard likelihood function with a sparse prior on the unknown coefficients. These priors can conveniently be expressed as a maximization over zero-mean Gaussians with different variance hyperparameters. Standard MAP estimation (Type I) involves maximizing over both the hyperparameters and coefficients, while an empirical Bayesian alternative (Type II) first marginalizes the coefficients and then maximizes over the hyperparameters, leading to a tractable posterior approximation. The underlying cost functions can be related via a dual-space framework from [22], which allows both the Type I and Type II objectives to be expressed in either coefficient or hyperparameter space. This perspective is useful because some analyses or extensions are more conducive to development in one space or the other. Herein we consider the estimation of a trade-off parameter balancing sparsity and data fit. As this parameter is effectively a variance, natural estimators exist by assessing the problem in hyperparameter (variance) space, transitioning natural ideas from Type II to solve what is much less intuitive for Type I. In contrast, for analyses of update rules and sparsity properties of local and global solutions, as well as extensions to more general likelihood models, we can leverage coefficient-space techniques developed for Type I and apply them to Type II. For example, this allows us to prove that Type II-inspired techniques can be successful in recovering sparse coefficients when unfavorable restricted isometry properties (RIP) lead to failure of popular ℓ1 reconstructions. It also facilitates the analysis of Type II when non-Gaussian likelihood models lead to intractable integrations. 1
5 0.87160879 215 nips-2012-Minimizing Uncertainty in Pipelines
Author: Nilesh Dalvi, Aditya Parameswaran, Vibhor Rastogi
Abstract: In this paper, we consider the problem of debugging large pipelines by human labeling. We represent the execution of a pipeline using a directed acyclic graph of AND and OR nodes, where each node represents a data item produced by some operator in the pipeline. We assume that each operator assigns a confidence to each of its output data. We want to reduce the uncertainty in the output by issuing queries to a human, where a query consists of checking if a given data item is correct. In this paper, we consider the problem of asking the optimal set of queries to minimize the resulting output uncertainty. We perform a detailed evaluation of the complexity of the problem for various classes of graphs. We give efficient algorithms for the problem for trees, and show that, for a general dag, the problem is intractable. 1
6 0.86953437 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance
7 0.86909187 361 nips-2012-Volume Regularization for Binary Classification
8 0.86907792 199 nips-2012-Link Prediction in Graphs with Autoregressive Features
9 0.86876392 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes
10 0.86731648 27 nips-2012-A quasi-Newton proximal splitting method
11 0.86676949 179 nips-2012-Learning Manifolds with K-Means and K-Flats
12 0.86645699 335 nips-2012-The Bethe Partition Function of Log-supermodular Graphical Models
13 0.86637092 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models
14 0.86550033 16 nips-2012-A Polynomial-time Form of Robust Regression
15 0.8649497 312 nips-2012-Simultaneously Leveraging Output and Task Structures for Multiple-Output Regression
16 0.86467391 212 nips-2012-Minimax Multi-Task Learning and a Generalized Loss-Compositional Paradigm for MTL
17 0.86439306 117 nips-2012-Ensemble weighted kernel estimators for multivariate entropy estimation
18 0.86422127 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration
19 0.86392051 284 nips-2012-Q-MKL: Matrix-induced Regularization in Multi-Kernel Learning with Applications to Neuroimaging
20 0.86302358 6 nips-2012-A Convex Formulation for Learning Scale-Free Networks via Submodular Relaxation