jmlr jmlr2011 jmlr2011-67 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Tony Jebara
Abstract: A multitask learning framework is developed for discriminative classification and regression where multiple large-margin linear classifiers are estimated for different prediction problems. These classifiers operate in a common input space but are coupled as they recover an unknown shared representation. A maximum entropy discrimination (MED) framework is used to derive the multitask algorithm which involves only convex optimization problems that are straightforward to implement. Three multitask scenarios are described. The first multitask method produces multiple support vector machines that learn a shared sparse feature selection over the input space. The second multitask method produces multiple support vector machines that learn a shared conic kernel combination. The third multitask method produces a pooled classifier as well as adaptively specialized individual classifiers. Furthermore, extensions to regression, graphical model structure estimation and other sparse methods are discussed. The maximum entropy optimization problems are implemented via a sequential quadratic programming method which leverages recent progress in fast SVM solvers. Fast monotonic convergence bounds are provided by bounding the MED sparsifying cost function with a quadratic function and ensuring only a constant factor runtime increase above standard independent SVM solvers. Results are shown on multitask data sets and favor multitask learning over single-task or tabula rasa methods. Keywords: meta-learning, support vector machines, feature selection, kernel selection, maximum entropy, large margin, Bayesian methods, variational bounds, classification, regression, Lasso, graphical model structure estimation, quadratic programming, convex programming
Reference: text
sentIndex sentText sentNum sentScore
1 Department of Computer Science, Columbia University, New York, NY 10027, USA. Editor: Francis Bach. Abstract: A multitask learning framework is developed for discriminative classification and regression where multiple large-margin linear classifiers are estimated for different prediction problems. [sent-3, score-0.439]
2 A maximum entropy discrimination (MED) framework is used to derive the multitask algorithm which involves only convex optimization problems that are straightforward to implement. [sent-5, score-0.492]
3 The first multitask method produces multiple support vector machines that learn a shared sparse feature selection over the input space. [sent-7, score-0.657]
4 The second multitask method produces multiple support vector machines that learn a shared conic kernel combination. [sent-8, score-0.638]
5 The third multitask method produces a pooled classifier as well as adaptively specialized individual classifiers. [sent-9, score-0.51]
6 Results are shown on multitask data sets and favor multitask learning over single-task or tabula rasa methods. [sent-13, score-0.912]
7 A more realistic, multitask learning approach is to combine data from multiple smaller sources and synergistically leverage heterogeneous labeling or annotation efforts. [sent-18, score-0.402]
8 Meanwhile, multitask learning (or inductive transfer) uses the collection of data sets simultaneously to exploit the related nature of the problems. [sent-25, score-0.402]
9 For example, a multitask learning approach may involve algorithms that discover shared representations that are useful across several data sets and tasks. [sent-26, score-0.459]
10 This article explores maximum entropy discrimination approaches to multitask problems and is organized as follows. [sent-29, score-0.544]
11 Section 2 reviews previous work in multitask learning, support vector machine feature selection and support vector machine kernel selection. [sent-30, score-0.592]
12 Section 3 sets up the general multitask problem as learning from data that has been sampled from a set of generative models that are dependent given data observations yet become independent given a shared representation. [sent-31, score-0.459]
13 Section 9 illustrates the corresponding derivations in a multitask (scalar) regression setting. [sent-37, score-0.439]
14 Previous Work: Since this article combines three research areas, we review previous work in multitask learning, support vector machine (SVM) feature selection, and SVM kernel selection. [sent-44, score-0.644]
15 Instead, multitask learning couples multiple models and their individual training sets and tasks. [sent-47, score-0.45]
16 Early implementations of multitask learning primarily investigated neural network or nearest neighbor learners (Thrun, 1995; Baxter, 1995; Caruana, 1997). [sent-49, score-0.402]
17 In addition to neural approaches, Bayesian methods have been explored that implement multitask learning by assuming dependencies between the various models and tasks (Heskes, 1998, 2004). [sent-50, score-0.491]
18 In addition, some theoretical arguments for the benefits of multitask learning have been made (Baxter, 2000) showing that the average error of M tasks can potentially decrease inversely with M. [sent-52, score-0.462]
19 Concurrently, kernel methods (Schölkopf and Smola, 2001) and large-margin support vector machines are highly successful in single-task settings and are good candidates for multitask extensions. [sent-54, score-0.535]
20 While multiclass variants of binary classifiers have been extensively explored (Crammer and Singer, 2001), multitask classification differs in that it often involves distinct sets of input data for each task. [sent-55, score-0.453]
21 This article focuses on multitask extensions of both feature selection and kernel selection with support vector machines. [sent-62, score-0.707]
22 , 1999), to sparse SVMs (Jebara and Jaakkola, 2000) and to multitask SVMs (Jebara, 2003, 2004). [sent-64, score-0.428]
23 This maximum entropy framework led to one of the first convex large margin multitask classification approaches (Jebara, 2004). [sent-65, score-0.471]
24 Convexity was subsequently explored in other multitask frameworks (Argyriou et al. [sent-66, score-0.431]
25 The present article extends the derivations in the maximum entropy discrimination multitask approach, provides a straightforward iterative quadratic programming implementation and uses tighter bounds for improved runtime efficiency. [sent-68, score-0.65]
26 Other related multitask SVM approaches have also been promising including novel kernel construction techniques to couple tasks (Evgeniou et al. [sent-69, score-0.546]
27 These permit standard SVM learning algorithms to perform multitask learning while the multitask issues are handled primarily by the kernel itself. [sent-71, score-0.888]
28 , 2006) for multitask learning with margin-based predictors and provide interesting worst-case guarantees. [sent-73, score-0.402]
29 Extensions to handle unlabeled data in multitask settings have also been promising (Ando and Zhang, 2005) and enjoyed theoretical generalization guarantees. [sent-74, score-0.424]
30 An alternative perspective to multitask feature and kernel selection can be explored by performing joint covariate or subspace selection for multiple classification problems (Obozinski et al. [sent-75, score-0.684]
31 Furthermore, feature selection and kernel selection can be seen as sparsity inducing methods. [sent-77, score-0.29]
32 The multitask extension to such sparse estimation techniques is known as the Group Lasso and allows sparsity to be explored over predefined subsets of variables (Turlach et al. [sent-80, score-0.494]
33 This article provides another contact point between sparsity, large margins, multitask learning and kernel selection. [sent-88, score-0.538]
34 The next sections formulate the general probabilistic setup for such multitask problems and convert traditional Bayesian solutions into a discriminative large-margin setting using the maximum entropy framework (Jaakkola et al. [sent-89, score-0.448]
35 Multitask Learning: The general multitask learning setup is as follows. [sent-92, score-0.402]
36 In this section and in Section 4, it will be helpful to take a Bayesian perspective to the multitask problem although this perspective is not strictly necessary in subsequent sections. [sent-111, score-0.402]
37 The more general multitask learning assumption is that there exist dependencies between the tasks. [sent-119, score-0.402]
38 In terms of a directed acyclic graph where the joint probability density function factorizes as a product of nodes given their parents, the following dependency structure emerges in the (simplest) case of multitask learning with two models: Θ1 → D1 ← s → D2 ← Θ2 . [sent-147, score-0.402]
39 We explore the following scenarios: • Feature Selection: Consider M individual models Θm = {θm , bm } which are linear classifiers where θm ∈ RD and bm ∈ R. [sent-154, score-0.354]
40 Θm = {θm,1 , . . . , θm,D , bm } where each model Θm consists of D linear classifiers in D different Hilbert spaces and one scalar bm ∈ R. [sent-159, score-0.393]
41 The following sections detail these multitask learning scenarios and show how we can learn discriminative classifiers (that predict outputs accurately and with large margin) from multiple tasks. [sent-166, score-0.402]
42 Previous approaches (Heskes, 2004) followed such a Bayesian treatment for multitask learning and obtained promising results. [sent-177, score-0.402]
43 Assume that the predictive distribution is log-linear as follows: $p(y|x,\Theta_m) \propto \exp\big(y\,(\langle x, \theta_m\rangle + b_m)\big)$. [sent-230, score-0.292]
44 The objective function we need to maximize is the negative logarithm of the partition function: $J(\lambda) = -\log \int p(\Theta)\,\exp\big(\sum_{m=1}^{M}\sum_{t=1}^{T_m}\big[\lambda_{m,t}\, y_{m,t}(\langle x_{m,t},\theta_m\rangle + b_m) - \gamma\lambda_{m,t}\big]\big)\, d\Theta$. [sent-236, score-0.374]
45 Therefore, we can obtain each $\hat{b}_m$ by solving $y_{m,t} = \sum_{\tau=1}^{T_m}\lambda_{m,\tau}\, y_{m,\tau}\, k(x_{m,t}, x_{m,\tau}) + \hat{b}_m$ for any datum t which has a corresponding Lagrange multiplier (once the dual program halts) that satisfies $\lambda_{m,t} > 0$. [sent-258, score-0.466]
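As a hedged illustration (not code from the paper), this bias-recovery step can be sketched in a few lines of numpy; the names lam, y, K and the threshold tol are hypothetical, and averaging over several qualifying points is often more robust in practice than using a single one.

    import numpy as np

    def recover_bias(lam, y, K, tol=1e-8):
        # lam: (T,) dual variables for task m; y: (T,) labels in {-1, +1}
        # K: (T, T) Gram matrix k(x_t, x_tau) over task m's training points
        active = np.where(lam > tol)[0]      # any datum with lambda_{m,t} > 0 qualifies
        if active.size == 0:
            return 0.0
        t = active[0]
        # y_t = sum_tau lam_tau * y_tau * k(x_t, x_tau) + b_m, so solve for b_m
        return y[t] - np.sum(lam * y * K[t, :])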
46 Therefore, to obtain multitask learning, we will need some shared representation s to couple the learning problems and give rise to a non-factorized posterior over models. [sent-268, score-0.521]
47 The MED solution is then $p(\Theta|\mathcal{D}) = \frac{1}{Z(\lambda)}\, p(\Theta)\prod_{m=1}^{M}\prod_{t=1}^{T_m}\exp\big(\lambda_{m,t}\, y_{m,t}\big(\sum_{d=1}^{D} s(d)\, x_{m,t}(d)\,\theta_m(d) + b_m\big) - \gamma\lambda_{m,t}\big)$. [sent-296, score-0.292]
48 , Θm as well as summing over all binary settings of s, which yields $Z(\lambda) = \int p(\Theta)\exp\big(\sum_{m=1}^{M}\sum_{t=1}^{T_m}\lambda_{m,t}\, y_{m,t}\big(\sum_{d=1}^{D} s(d)\, x_{m,t}(d)\,\theta_m(d) + b_m\big) - \gamma\lambda_{m,t}\big)\, d\Theta = \exp\big(\sum_{m}\tfrac{\sigma^2}{2}\big(\sum_{t}\lambda_{m,t} y_{m,t}\big)^2 - \sum_{t}\gamma\lambda_{m,t}\big)\prod_{d=1}^{D}\big(1-\rho+\rho\, e^{\frac{1}{2}\sum_{m}(\sum_{t}\lambda_{m,t} y_{m,t} x_{m,t}(d))^2}\big)$. [sent-300, score-0.451]
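For concreteness, here is a minimal numpy sketch of the resulting objective J(λ) = −log Z(λ), under the closed form of Z(λ) reconstructed above; the function and argument names (lams, ys, Xs, gamma, rho, sigma2) are illustrative assumptions rather than the paper's code.

    import numpy as np

    def feature_selection_objective(lams, ys, Xs, gamma=1.0, rho=0.1, sigma2=1.0):
        # lams, ys, Xs: per-task lists of (T_m,) duals, (T_m,) labels and (T_m, D) inputs
        D = Xs[0].shape[1]
        J = 0.0
        w2 = np.zeros(D)                        # accumulates sum_m (sum_t lam*y*x(d))^2
        for lam, y, X in zip(lams, ys, Xs):
            J += gamma * lam.sum()              # margin reward term
            J -= 0.5 * sigma2 * (lam @ y) ** 2  # bias prior term; sigma -> inf forces sum_t lam*y = 0
            w2 += (X.T @ (lam * y)) ** 2
        J -= np.log(1.0 - rho + rho * np.exp(0.5 * w2)).sum()
        return J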
49 The predictive distribution for multitask kernel selection is then given by the following log-linear model (for the m'th task): $p(y|x,\Theta_m, s) \propto \exp\big(y\big(\sum_{d=1}^{D} s(d)\,\langle\theta_{m,d},\phi_d(x)\rangle + b_m\big)\big)$. [sent-380, score-0.841]
50 This multitask kernel selection objective function J(λ) is the following convex program: $\max_{\lambda}\ \sum_{m=1}^{M}\sum_{t=1}^{T_m}\gamma\lambda_{m,t} + D\log(\alpha+1) - \sum_{d=1}^{D}\log\big(\alpha + e^{\frac{1}{2}\sum_{m=1}^{M}\sum_{t=1}^{T_m}\sum_{\tau=1}^{T_m}\lambda_{m,t}\lambda_{m,\tau} y_{m,t} y_{m,\tau} k_d(x_{m,t},x_{m,\tau})}\big)$ s.t. [sent-383, score-0.748]
51 The weights for each kernel are recovered as: $\hat{s}(d) = \big(1 + \alpha\exp\big(-\tfrac{1}{2}\sum_{m=1}^{M}\sum_{t=1}^{T_m}\sum_{\tau=1}^{T_m}\lambda_{m,t}\lambda_{m,\tau} y_{m,t} y_{m,\tau} k_d(x_{m,t},x_{m,\tau})\big)\big)^{-1}$. [sent-387, score-0.337]
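A small numpy sketch of this recovery step (a sketch only; Ks[d][m] is assumed to hold the d'th base Gram matrix on task m's data, and all names are illustrative):

    import numpy as np

    def kernel_weights(lams, ys, Ks, alpha=1.0):
        # posterior switch probabilities s_hat(d) for each of the D base kernels
        D = len(Ks)
        s_hat = np.zeros(D)
        for d in range(D):
            q = 0.0
            for lam, y, K in zip(lams, ys, Ks[d]):
                v = lam * y
                q += v @ K @ v   # sum_{t,tau} lam_t lam_tau y_t y_tau k_d(x_t, x_tau) for one task
            s_hat[d] = 1.0 / (1.0 + alpha * np.exp(-0.5 * q))
        return s_hat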
52 Independent Kernel Selection: It is possible to break the above multitask framework by allowing each task to select a combination of kernels independently. [sent-405, score-0.471]
53 Consider constructing a base distance metric $\Delta_d(x,\bar{x})$ from each base kernel $k_d(x,\bar{x})$ as follows: $\Delta_d(x,\bar{x}) = k_d(x,x) - 2\,k_d(x,\bar{x}) + k_d(\bar{x},\bar{x})$. [sent-422, score-0.435]
54 Given this multitask kernel selection framework, it is possible to use the above formula to perform multitask metric learning. [sent-423, score-0.402]
55 Thus, metric learning can be performed using the multitask kernel selection setup. [sent-430, score-0.549]
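As a sketch of the idea (the weighted combination Δ(x, x̄) = Σd ŝ(d) Δd(x, x̄) is an assumption inferred from the surrounding text; Kd denotes a precomputed base Gram matrix over a pooled set of points):

    import numpy as np

    def base_distance(Kd, i, j):
        # distance induced by one base kernel: k(x, x) - 2 k(x, xbar) + k(xbar, xbar)
        return Kd[i, i] - 2.0 * Kd[i, j] + Kd[j, j]

    def learned_distance(Ks, s_hat, i, j):
        # multitask metric as the s_hat-weighted sum of base distances
        return sum(s * base_distance(Kd, i, j) for s, Kd in zip(s_hat, Ks))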
56 Shared Classifiers and Adaptive Pooling: Another interesting multitask learning approach involves shared classifiers or shared models. [sent-435, score-0.516]
57 This setup is clarified by the following log-linear predictive distribution for the m'th task: $p(y|m,x,\Theta,s) \propto \exp\big(y\,(s(m)\,\langle\theta_m,\phi_m(x)\rangle + \langle\theta,\phi(x)\rangle + b_m)\big)$. [sent-449, score-0.292]
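A hedged numpy sketch of this decision rule with explicit linear feature maps (with general φm the dot products would be replaced by kernel expansions; all names are illustrative):

    import numpy as np

    def adaptive_pooling_score(x, m, theta_shared, thetas, b, s):
        # score for task m: s(m) * <theta_m, x> + <theta, x> + b_m; predict with its sign
        # theta_shared: (D,), thetas: (M, D), b: (M,), s: (M,) switch values in [0, 1]
        return s[m] * thetas[m] @ x + theta_shared @ x + b[m]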
58 Regression: It is easy to convert multitask feature selection, kernel selection and pooling problems to a regression setup where outputs are scalars ym,t ∈ R. [sent-476, score-0.806]
59 While this article will only show experiments with classification problems, the multitask regression setting is briefly summarized here for completeness. [sent-477, score-0.491]
60 It is straightforward to adapt this regression problem to multitask kernel selection (which once again subsumes feature selection if we select $k_d(x,\bar{x}) = x(d)\,\bar{x}(d)$). [sent-490, score-0.809]
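A one-line illustration of that reduction: with kd(x, x̄) = x(d) x̄(d), each base Gram matrix is the rank-one outer product of a single input coordinate (a sketch; names are illustrative).

    import numpy as np

    def feature_selection_base_kernels(X):
        # X: (T, D) inputs; returns one (T, T) rank-one Gram matrix per feature d
        T, D = X.shape
        return [np.outer(X[:, d], X[:, d]) for d in range(D)]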
61 For brevity, we focus on the multitask kernel selection problem which strictly subsumes multitask feature selection. [sent-513, score-0.994]
62 Recall the kernel selection optimization: $\max_{\lambda}\ J(\lambda) = \sum_{m=1}^{M}\sum_{t=1}^{T_m}\gamma\lambda_{m,t} + D\log(\alpha+1) - \sum_{d=1}^{D}\log\big(\alpha + e^{\frac{1}{2}\sum_{m=1}^{M}\sum_{t=1}^{T_m}\sum_{\tau=1}^{T_m}\lambda_{m,t}\lambda_{m,\tau} y_{m,t} y_{m,\tau} k_d(x_{m,t},x_{m,\tau})}\big)$ s.t. [sent-515, score-0.32]
63 We apply Theorem 1 in the Appendix after a simple change of variables, $u = H_d^{1/2}\lambda$ and $v = H_d^{1/2}\tilde{\lambda}$, which gives: $\log\big(\alpha + \exp\tfrac{\lambda^\top H_d\lambda}{2}\big) \le \log\big(\alpha + \exp\tfrac{\tilde{\lambda}^\top H_d\tilde{\lambda}}{2}\big) + \frac{\exp(\tilde{\lambda}^\top H_d\tilde{\lambda}/2)}{\alpha + \exp(\tilde{\lambda}^\top H_d\tilde{\lambda}/2)}\,\tilde{\lambda}^\top H_d(\lambda-\tilde{\lambda}) + \tfrac{1}{2}(\lambda-\tilde{\lambda})^\top\big(G_d\, H_d\tilde{\lambda}\tilde{\lambda}^\top H_d + H_d\big)(\lambda-\tilde{\lambda})$. [sent-558, score-0.342]
64 This provides a simple iterative algorithm for multitask learning which builds on current SVM solvers. [sent-587, score-0.402]
65 , M do: Step 3a: Set $g_d = \alpha\exp\big(-\tfrac{1}{2}\sum_{m=1}^{M}\sum_{t=1}^{T_m}\sum_{\tau=1}^{T_m}\lambda_{m,t}\lambda_{m,\tau} y_{m,t} y_{m,\tau} k_d(x_{m,t},x_{m,\tau})\big)$ for all d. [sent-613, score-0.31]
66 In summary, solving multitask feature or kernel selection is only a constant factor more computational effort than solving M independent support vector machines. [sent-656, score-0.592]
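To make the flavor of the iteration concrete, here is a minimal sketch under the assumption that each sequential quadratic programming step reduces to independent per-task SVM solves with the ŝ-weighted effective kernel Σd ŝ(d) kd; the fixed iteration count, the use of scikit-learn's precomputed-kernel SVC and all names are illustrative rather than the author's implementation (the kernel_weights helper is the one sketched earlier).

    import numpy as np
    from sklearn.svm import SVC

    def multitask_kernel_selection(Ks, ys, C=1.0, alpha=1.0, n_iter=20):
        # Ks[d][m]: (T_m, T_m) Gram matrix of base kernel d on task m; ys[m]: labels in {-1, +1}
        D, M = len(Ks), len(ys)
        s_hat = np.full(D, 0.5)                  # uninformative initial switch values
        lams = [np.zeros(len(y)) for y in ys]
        for _ in range(n_iter):
            for m in range(M):
                K_eff = sum(s_hat[d] * Ks[d][m] for d in range(D))
                svm = SVC(C=C, kernel="precomputed").fit(K_eff, ys[m])
                lam = np.zeros(len(ys[m]))
                lam[svm.support_] = np.abs(svm.dual_coef_.ravel())  # |y_t * lambda_t| = lambda_t
                lams[m] = lam
            s_hat = kernel_weights(lams, ys, Ks, alpha=alpha)        # re-estimate the switches
        return s_hat, lams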
67 Experiments: To evaluate the multitask learning framework, we considered UCI data as well as the Land Mine data set, which was developed and investigated in previous work (Xue et al. [sent-659, score-0.402]
68 The classification accuracy of standard support vector machines learned independently is compared to the accuracy of the multitask kernel selection procedure described in Section 6 and Section 7. [sent-661, score-0.576]
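For reference, a small sketch of the evaluation summary assumed here (the per-task mean of AUCs, often abbreviated MAUC; the paper's exact averaging may differ), using scikit-learn's roc_auc_score:

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def mean_auc(scores_per_task, labels_per_task):
        # scores: real-valued decision values per task; labels: {-1, +1} (or {0, 1}) per task
        aucs = [roc_auc_score(y, s) for s, y in zip(scores_per_task, labels_per_task)]
        return float(np.mean(aucs)), aucs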
69 [Figure 2: Feature selection and kernel selection on the Landmine data set; panel (a) feature and RBF kernel selection, panel (b) feature, polynomial and RBF kernel selection; horizontal axis: Training Set Size per Task (50 to 250).] [sent-685, score-0.504]
70 C and α (or, equivalently, ρ) for the multitask learner. [sent-692, score-0.402]
71 Both the independent SVMs and the multitask feature selection approach were evaluated by training on various numbers of examples (from 20 to 200) for each task, and the remaining examples (with labels kept unobserved) were split in half for cross-validation and testing. [sent-696, score-0.508]
72 Cross-validation was used to select a value of C for the independent SVMs and values of α and C for the multitask feature selection SVMs. [sent-700, score-0.508]
73 Figure 1 shows the MAUC performance of the independent SVMs versus the multitask SVMs with averages and standard deviations across 5 folds. [sent-701, score-0.402]
74 There is a clear and statistically significant advantage (under a paired t-test) for multitask learning over independent SVM classification. [sent-702, score-0.402]
75 Both independent SVM learning and the multitask kernel selection approach were evaluated by training on various numbers of examples (20, 40, . [sent-705, score-0.549]
76 Cross-validation was used to select a value of C for the independent SVM approach and to select values for C and α for the multitask kernel selection SVM. [sent-715, score-0.549]
77 Figure 2(a) shows the performance of the independent SVMs versus the multitask SVMs as an average and standard deviation of MAUC across 5 folds. [sent-716, score-0.402]
78 Tabula rasa learning obtains lower accuracy in general, while multitask learning improves accuracy at all sizes of the training data set with statistical significance (a paired t-test produced a p-value below 0. [sent-717, score-0.456]
79 Figure 2(b) summarizes the results which again demonstrate an advantage for the multitask setup. [sent-720, score-0.402]
80 Since the SVMs were warm-started at their previous solutions, sweeping across a range of α values in the multitask sparsity approach (after starting from an initial SVM solution) never required more than 50 times the run time of the initial SVM solution. [sent-725, score-0.439]
81 Thus, empirically, the multitask sparsity framework, while sweeping over the full regularization path over α, incurs a constant factor (under 50) increase in the computational effort over independent SVM learning. [sent-726, score-0.439]
82 The Heart data set was changed into a multitask data set by dividing the data into ten different tasks based on the age of the patient. [sent-730, score-0.462]
83 Figure 3 shows the average test AUC results across 100 folds using independent SVMs, pooling and adaptive pooling for various values of α (after cross-validation only over the value of C for all methods). [sent-740, score-0.326]
84 The Figure reveals an advantage for adaptive pooling compared to full pooling and independent learning. [sent-741, score-0.326]
85 Graphical Model Structure Estimation: The multitask sparse discrimination framework is a general tool for large margin classification since most elements of $\hat{s}$ become vanishingly small (at appropriate settings of ρ or α). [sent-763, score-0.517]
86 The multitask MED approach can potentially circumvent this ad hoc AND/OR step by forcing all sparse predictors to agree on a single undirected edge connectivity matrix E from the outset. [sent-790, score-0.428]
87 Thus, consistency of the edges used by the sparse prediction is enforced up-front in a multitask setting instead of resorting to a post-processing (i. [sent-802, score-0.428]
88 Taking σ → ∞ and J(λ) = − log(Z(λ)) produces (up to an additive constant) the dual program: $\max_{\lambda}\ \sum_{m=1}^{D}\sum_{t=1}^{T}\lambda_{m,t} - \sum_{m=1}^{D}\sum_{d=m+1}^{D}\log\big(\alpha + e^{\frac{1}{2}[(\sum_{t}\lambda_{m,t} y_{m,t} x_t(d))^2 + (\sum_{t}\lambda_{d,t} y_{d,t} x_t(m))^2]}\big)$ s.t. [sent-806, score-0.291]
89 These switch configurations essentially identify the network structure and are obtained from expected s(m, d) values under the posterior p(Θ|D) as follows: $\hat{s}(m,d) = \big(1 + \alpha\exp\big(-\tfrac{1}{2}\big[(\sum_{t}\lambda_{m,t} y_{m,t} x_t(d))^2 + (\sum_{t}\lambda_{d,t} y_{d,t} x_t(m))^2\big]\big)\big)^{-1}$. [sent-815, score-0.349]
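A numpy sketch of this edge-recovery step (a sketch only; lams[m] and ys[m] are assumed to hold the dual variables and targets of the sparse predictor for node m, and X is the (T, D) data matrix):

    import numpy as np

    def edge_probabilities(lams, ys, X, alpha=1.0):
        T, D = X.shape
        # W[m, d] = sum_t lam_{m,t} * y_{m,t} * x_t(d)
        W = np.array([X.T @ (lams[m] * ys[m]) for m in range(D)])
        S = np.zeros((D, D))
        for m in range(D):
            for d in range(m + 1, D):
                q = W[m, d] ** 2 + W[d, m] ** 2
                S[m, d] = S[d, m] = 1.0 / (1.0 + alpha * np.exp(-0.5 * q))
        return S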
90 Discussion: A multitask learning framework was developed for support vector machines and large-margin linear classifiers. [sent-835, score-0.429]
91 dual optimization for both multitask feature selection and multitask kernel selection problems. [sent-856, score-1.081]
92 The MED multitask framework potentially allows flexible exploration of sparsity structure over different groups of variables and is reminiscent of Group Lasso methods (Yuan and Lin, 2006; Bach, 2008). [sent-859, score-0.439]
93 Experiments on real world data sets show that MED multitask learning is advantageous over single-task or tabula rasa learning. [sent-860, score-0.51]
94 In future work, it would be interesting to investigate theoretical generalization guarantees for multitask sparse MED. [sent-861, score-0.428]
95 Since generalization guarantees in multitask settings have already been provided for other algorithms (Ando and Zhang, 2005; Maurer, 2006, 2009), this may be a fruitful line of work. [sent-863, score-0.424]
96 Bounding the Logistic-Quadratic Function. Theorem 1: For all $u \in \mathbb{R}^D$, $\log\big(\alpha + \exp\tfrac{u^\top u}{2}\big)$ is bounded above by $\log\big(\alpha + \exp\tfrac{v^\top v}{2}\big) + \frac{v^\top(u-v)}{1+\alpha\exp(-\frac{v^\top v}{2})} + \tfrac{1}{2}(u-v)^\top\big(I + G\, v v^\top\big)(u-v)$ for the scalar term $G = \tanh\big(\tfrac{1}{2}\log(\alpha\exp(-v^\top v/2))\big)\,\big/\,\log(\alpha\exp(-v^\top v/2))$. [sent-876, score-0.381]
97 Recall the following inequality (Jaakkola and Jordan, 2000) which holds for any choice of ξ ∈ R: $\log\big(\exp(-\tfrac{\sqrt{\chi}}{2}) + \exp(\tfrac{\sqrt{\chi}}{2})\big) \le \log\big(\exp(-\tfrac{\xi}{2}) + \exp(\tfrac{\xi}{2})\big) + \frac{\tanh(\xi/2)}{4\xi}\,(\chi - \xi^2)$. [sent-893, score-0.572]
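A quick numerical spot-check of this classical bound, written in the equivalent parameterization where x² plays the role of χ (a sketch, not code from the paper):

    import numpy as np

    def jj_gap(x, xi):
        # right-hand side minus left-hand side of the bound; should be >= 0 for xi != 0
        lhs = np.logaddexp(-x / 2.0, x / 2.0)
        rhs = np.logaddexp(-xi / 2.0, xi / 2.0) + np.tanh(xi / 2.0) / (4.0 * xi) * (x**2 - xi**2)
        return rhs - lhs

    rng = np.random.default_rng(0)
    x, xi = 5.0 * rng.standard_normal(1000), 5.0 * rng.standard_normal(1000)
    assert np.all(jj_gap(x, xi) >= -1e-9)   # the bound holds and is tight at x = +/- xi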
98 Choose ξ = log ϕ (or, equivalently, ξ = − log ϕ) and rewrite the bound as $\frac{\tanh(\frac{1}{2}\log\varphi)}{4\log\varphi}\,\big(\chi - (\log\varphi)^2\big) \ \ge\ \log\big(\exp(-\tfrac{\sqrt{\chi}}{2}) + \exp(\tfrac{\sqrt{\chi}}{2})\big) - \log\big(\varphi^{1/2} + \varphi^{-1/2}\big)$. [sent-894, score-0.51]
99 , D terms produces the overall variational upper bound $U_i(\lambda) = D\log(\alpha+1) - \sum_{d=1}^{D}\log\big(\alpha + \exp\tfrac{1}{2}\lambda_i^\top H_d\lambda_i\big) - \sum_{d=1}^{D}\frac{\exp\frac{1}{2}\lambda_i^\top H_d\lambda_i}{\alpha + \exp\frac{1}{2}\lambda_i^\top H_d\lambda_i}\,\tfrac{1}{2}\lambda_i^\top H_d\lambda_i + \sum_{d=1}^{D}\frac{\exp\frac{1}{2}\lambda_i^\top H_d\lambda_i}{\alpha + \exp\frac{1}{2}\lambda_i^\top H_d\lambda_i}\,\tfrac{1}{2}\lambda^\top H_d\lambda + \lambda^\top\mathbf{1}$. [sent-921, score-0.67]
100 While ℓMED is not identical to the Elastic Net regularization, the similarity warrants further exploration and may be useful in group Lasso and multitask settings (Turlach et al. [sent-1007, score-0.424]
wordName wordTfidf (topN-words)
[('tm', 0.455), ('hd', 0.403), ('multitask', 0.402), ('med', 0.287), ('bm', 0.177), ('pooling', 0.15), ('ebara', 0.144), ('iscrimination', 0.136), ('ultitask', 0.136), ('kd', 0.117), ('tanh', 0.116), ('ntropy', 0.116), ('exp', 0.115), ('aximum', 0.095), ('ym', 0.093), ('parsity', 0.089), ('xt', 0.086), ('svm', 0.085), ('kernel', 0.084), ('gd', 0.078), ('auc', 0.075), ('quadratic', 0.073), ('svms', 0.07), ('selection', 0.063), ('posterior', 0.062), ('tasks', 0.06), ('jebara', 0.059), ('shared', 0.057), ('communal', 0.056), ('log', 0.056), ('ui', 0.054), ('rasa', 0.054), ('tabula', 0.054), ('article', 0.052), ('xm', 0.052), ('jaakkola', 0.05), ('lagrange', 0.05), ('couples', 0.048), ('entropy', 0.046), ('discrimination', 0.044), ('th', 0.044), ('classi', 0.044), ('feature', 0.043), ('specialized', 0.042), ('datum', 0.041), ('scalar', 0.039), ('produces', 0.039), ('regression', 0.037), ('sparsity', 0.037), ('task', 0.036), ('ers', 0.036), ('lasso', 0.034), ('az', 0.034), ('kernels', 0.033), ('programming', 0.033), ('mauc', 0.032), ('li', 0.03), ('km', 0.03), ('rd', 0.029), ('yt', 0.029), ('multipliers', 0.029), ('explored', 0.029), ('conic', 0.029), ('dm', 0.028), ('primal', 0.028), ('pooled', 0.027), ('scalars', 0.027), ('machines', 0.027), ('adaptive', 0.026), ('objective', 0.026), ('sparse', 0.026), ('sequential', 0.026), ('pl', 0.025), ('pu', 0.025), ('resembles', 0.025), ('vv', 0.024), ('multiplier', 0.024), ('switches', 0.024), ('coupling', 0.024), ('ising', 0.024), ('normalizer', 0.024), ('sqp', 0.024), ('dual', 0.024), ('margin', 0.023), ('elastic', 0.023), ('sup', 0.023), ('program', 0.023), ('wd', 0.022), ('koh', 0.022), ('wainwright', 0.022), ('permits', 0.022), ('binary', 0.022), ('graphical', 0.022), ('settings', 0.022), ('bd', 0.021), ('hilbert', 0.021), ('recovered', 0.021), ('er', 0.02), ('curve', 0.02), ('kkt', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999911 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
2 0.069722287 14 jmlr-2011-Better Algorithms for Benign Bandits
Author: Elad Hazan, Satyen Kale
Abstract: The online multi-armed bandit problem and its generalizations are repeated decision making problems, where the goal is to select one of several possible decisions in every round, and incur a cost associated with the decision, in such a way that the total cost incurred over all iterations is close to the cost of the best fixed decision in hindsight. The difference in these costs is known as the regret of the algorithm. The term bandit refers to the setting where one only obtains the cost of the decision used in a given iteration and no other information. A very general form of this problem is the non-stochastic bandit linear optimization problem, where the set of decisions is a convex set in some√ Euclidean space, and the cost functions are linear. ˜ Only recently an efficient algorithm attaining O( T ) regret was discovered in this setting. In this paper we propose a new algorithm for the bandit linear optimization problem which √ ˜ obtains a tighter regret bound of O( Q), where Q is the total variation in the cost functions. This regret bound, previously conjectured to hold in the full information case, shows that it is possible to incur much less regret in a slowly changing environment even in the bandit setting. Our algorithm is efficient and applies several new ideas to bandit optimization such as reservoir sampling. Keywords: multi-armed bandit, regret minimization, online learning
3 0.065082327 66 jmlr-2011-Multiple Kernel Learning Algorithms
Author: Mehmet Gönen, Ethem Alpaydın
Abstract: In recent years, several methods have been proposed to combine multiple kernels instead of using a single one. These different kernels may correspond to using different notions of similarity or may be using information coming from multiple sources (different representations or different feature subsets). In trying to organize and highlight the similarities and differences between them, we give a taxonomy of and review several multiple kernel learning algorithms. We perform experiments on real data sets for better illustration and comparison of existing algorithms. We see that though there may not be large differences in terms of accuracy, there is difference between them in complexity as given by the number of stored support vectors, the sparsity of the solution as given by the number of used kernels, and training time complexity. We see that overall, using multiple kernels instead of a single one is useful and believe that combining kernels in a nonlinear or data-dependent way seems more promising than linear combination in fusing information provided by simple linear kernels, whereas linear methods are more reasonable when combining complex Gaussian kernels. Keywords: support vector machines, kernel machines, multiple kernel learning
4 0.065041423 105 jmlr-2011-lp-Norm Multiple Kernel Learning
Author: Marius Kloft, Ulf Brefeld, Sören Sonnenburg, Alexander Zien
Abstract: Learning linear combinations of multiple kernels is an appealing strategy when the right choice of features is unknown. Previous approaches to multiple kernel learning (MKL) promote sparse kernel combinations to support interpretability and scalability. Unfortunately, this ℓ1 -norm MKL is rarely observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures that generalize well, we extend MKL to arbitrary norms. We devise new insights on the connection between several existing MKL formulations and develop two efficient interleaved optimization strategies for arbitrary norms, that is ℓ p -norms with p ≥ 1. This interleaved optimization is much faster than the commonly used wrapper approaches, as demonstrated on several data sets. A theoretical analysis and an experiment on controlled artificial data shed light on the appropriateness of sparse, non-sparse and ℓ∞ -norm MKL in various scenarios. Importantly, empirical applications of ℓ p -norm MKL to three real-world problems from computational biology show that non-sparse MKL achieves accuracies that surpass the state-of-the-art. Data sets, source code to reproduce the experiments, implementations of the algorithms, and further information are available at http://doc.ml.tu-berlin.de/nonsparse_mkl/. Keywords: multiple kernel learning, learning kernels, non-sparse, support vector machine, convex conjugate, block coordinate descent, large scale optimization, bioinformatics, generalization bounds, Rademacher complexity ∗. Also at Machine Learning Group, Technische Universit¨ t Berlin, 10587 Berlin, Germany. a †. Parts of this work were done while SS was at the Friedrich Miescher Laboratory, Max Planck Society, 72076 T¨ bingen, Germany. u ‡. Most contributions by AZ were done at the Fraunhofer Institute FIRST, 12489 Berlin, Germany. c 2011 Marius Kloft, Ulf Brefeld, S¨ ren Sonnenburg and Alexander Zien. o K LOFT, B REFELD , S ONNENBURG AND Z IEN
5 0.062470168 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
Author: John Duchi, Elad Hazan, Yoram Singer
Abstract: We present a new family of subgradient methods that dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning. Metaphorically, the adaptation allows us to find needles in haystacks in the form of very predictive but rarely seen features. Our paradigm stems from recent advances in stochastic optimization and online learning which employ proximal functions to control the gradient steps of the algorithm. We describe and analyze an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight. We give several efficient algorithms for empirical risk minimization problems with common and important regularization functions and domain constraints. We experimentally study our theoretical analysis and show that adaptive subgradient methods outperform state-of-the-art, yet non-adaptive, subgradient algorithms. Keywords: subgradient methods, adaptivity, online learning, stochastic convex optimization
6 0.061248247 97 jmlr-2011-Union Support Recovery in Multi-task Learning
7 0.054497652 101 jmlr-2011-Variable Sparsity Kernel Learning
8 0.053427052 55 jmlr-2011-Learning Multi-modal Similarity
9 0.04557623 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
10 0.045034379 48 jmlr-2011-Kernel Analysis of Deep Networks
11 0.044218492 57 jmlr-2011-Learning a Robust Relevance Model for Search Using Kernel Methods
12 0.042938638 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes
13 0.042308986 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
14 0.041847609 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
15 0.041106988 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels
16 0.039813895 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms
17 0.039506074 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
18 0.038726859 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
19 0.036431469 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach
20 0.03596339 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models
topicId topicWeight
[(0, 0.211), (1, 0.007), (2, 0.054), (3, -0.075), (4, 0.06), (5, 0.015), (6, 0.042), (7, -0.014), (8, -0.048), (9, -0.16), (10, 0.034), (11, -0.028), (12, 0.017), (13, -0.088), (14, -0.003), (15, 0.017), (16, -0.076), (17, -0.043), (18, 0.006), (19, 0.091), (20, -0.102), (21, 0.072), (22, 0.015), (23, 0.102), (24, 0.02), (25, 0.01), (26, -0.051), (27, -0.074), (28, -0.018), (29, -0.006), (30, 0.013), (31, 0.08), (32, -0.062), (33, 0.035), (34, 0.0), (35, 0.068), (36, -0.276), (37, 0.159), (38, -0.014), (39, 0.113), (40, 0.04), (41, 0.018), (42, -0.055), (43, 0.09), (44, -0.203), (45, -0.16), (46, -0.23), (47, -0.241), (48, 0.202), (49, 0.071)]
simIndex simValue paperId paperTitle
same-paper 1 0.93531477 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
2 0.40929973 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes
Author: Mauricio A. Álvarez, Neil D. Lawrence
Abstract: Recently there has been an increasing interest in regression methods that deal with multiple outputs. This has been motivated partly by frameworks like multitask learning, multisensor networks or structured output data. From a Gaussian processes perspective, the problem reduces to specifying an appropriate covariance function that, whilst being positive semi-definite, captures the dependencies between all the data points and across all the outputs. One approach to account for non-trivial correlations between outputs employs convolution processes. Under a latent function interpretation of the convolution transform we establish dependencies between output variables. The main drawbacks of this approach are the associated computational and storage demands. In this paper we address these issues. We present different efficient approximations for dependent output Gaussian processes constructed through the convolution formalism. We exploit the conditional independencies present naturally in the model. This leads to a form of the covariance similar in spirit to the so called PITC and FITC approximations for a single output. We show experimental results with synthetic and real data, in particular, we show results in school exams score prediction, pollution prediction and gene expression data. Keywords: Gaussian processes, convolution processes, efficient approximations, multitask learning, structured outputs, multivariate processes
3 0.34072161 61 jmlr-2011-Logistic Stick-Breaking Process
Author: Lu Ren, Lan Du, Lawrence Carin, David Dunson
Abstract: A logistic stick-breaking process (LSBP) is proposed for non-parametric clustering of general spatially- or temporally-dependent data, imposing the belief that proximate data are more likely to be clustered together. The sticks in the LSBP are realized via multiple logistic regression functions, with shrinkage priors employed to favor contiguous and spatially localized segments. The LSBP is also extended for the simultaneous processing of multiple data sets, yielding a hierarchical logistic stick-breaking process (H-LSBP). The model parameters (atoms) within the H-LSBP are shared across the multiple learning tasks. Efficient variational Bayesian inference is derived, and comparisons are made to related techniques in the literature. Experimental analysis is performed for audio waveforms and images, and it is demonstrated that for segmentation applications the LSBP yields generally homogeneous segments with sharp boundaries. Keywords: Bayesian, nonparametric, dependent, hierarchical models, segmentation
4 0.31841308 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
Author: Mark D. Reid, Robert C. Williamson
Abstract: We unify f -divergences, Bregman divergences, surrogate regret bounds, proper scoring rules, cost curves, ROC-curves and statistical information. We do this by systematically studying integral and variational representations of these objects and in so doing identify their representation primitives which all are related to cost-sensitive binary classification. As well as developing relationships between generative and discriminative views of learning, the new machinery leads to tight and more general surrogate regret bounds and generalised Pinsker inequalities relating f -divergences to variational divergence. The new viewpoint also illuminates existing algorithms: it provides a new derivation of Support Vector Machines in terms of divergences and relates maximum mean discrepancy to Fisher linear discriminants. Keywords: classification, loss functions, divergence, statistical information, regret bounds
5 0.31503144 6 jmlr-2011-A Simpler Approach to Matrix Completion
Author: Benjamin Recht
Abstract: This paper provides the best bounds to date on the number of randomly sampled entries required to reconstruct an unknown low-rank matrix. These results improve on prior work by Cand` s and e Recht (2009), Cand` s and Tao (2009), and Keshavan et al. (2009). The reconstruction is accome plished by minimizing the nuclear norm, or sum of the singular values, of the hidden matrix subject to agreement with the provided entries. If the underlying matrix satisfies a certain incoherence condition, then the number of entries required is equal to a quadratic logarithmic factor times the number of parameters in the singular value decomposition. The proof of this assertion is short, self contained, and uses very elementary analysis. The novel techniques herein are based on recent work in quantum information theory. Keywords: matrix completion, low-rank matrices, convex optimization, nuclear norm minimization, random matrices, operator Chernoff bound, compressed sensing
6 0.30360419 66 jmlr-2011-Multiple Kernel Learning Algorithms
7 0.29478768 97 jmlr-2011-Union Support Recovery in Multi-task Learning
8 0.29118726 105 jmlr-2011-lp-Norm Multiple Kernel Learning
9 0.29110193 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
10 0.28888389 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership
11 0.27358362 48 jmlr-2011-Kernel Analysis of Deep Networks
12 0.26560032 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
13 0.25830311 14 jmlr-2011-Better Algorithms for Benign Bandits
14 0.25668246 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels
15 0.25551155 62 jmlr-2011-MSVMpack: A Multi-Class Support Vector Machine Package
16 0.25276268 51 jmlr-2011-Laplacian Support Vector Machines Trained in the Primal
17 0.24740882 59 jmlr-2011-Learning with Structured Sparsity
18 0.24275725 55 jmlr-2011-Learning Multi-modal Similarity
19 0.24178723 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation
20 0.24118288 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
topicId topicWeight
[(4, 0.065), (6, 0.011), (9, 0.048), (10, 0.032), (24, 0.057), (31, 0.103), (32, 0.035), (41, 0.024), (60, 0.019), (65, 0.011), (66, 0.33), (70, 0.017), (71, 0.012), (73, 0.054), (78, 0.079), (86, 0.011), (90, 0.017)]
simIndex simValue paperId paperTitle
1 0.97183424 92 jmlr-2011-The Stationary Subspace Analysis Toolbox
Author: Jan Saputra Müller, Paul von Bünau, Frank C. Meinecke, Franz J. Király, Klaus-Robert Müller
Abstract: The Stationary Subspace Analysis (SSA) algorithm linearly factorizes a high-dimensional time series into stationary and non-stationary components. The SSA Toolbox is a platform-independent efficient stand-alone implementation of the SSA algorithm with a graphical user interface written in Java, that can also be invoked from the command line and from Matlab. The graphical interface guides the user through the whole process; data can be imported and exported from comma separated values (CSV) and Matlab’s .mat files. Keywords: non-stationarities, blind source separation, dimensionality reduction, unsupervised learning
same-paper 2 0.75348276 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
3 0.48316938 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
Author: Bruno Pelletier, Pierre Pudlo
Abstract: Following Hartigan (1975), a cluster is defined as a connected component of the t-level set of the underlying density, that is, the set of points for which the density is greater than t. A clustering algorithm which combines a density estimate with spectral clustering techniques is proposed. Our algorithm is composed of two steps. First, a nonparametric density estimate is used to extract the data points for which the estimated density takes a value greater than t. Next, the extracted points are clustered based on the eigenvectors of a graph Laplacian matrix. Under mild assumptions, we prove the almost sure convergence in operator norm of the empirical graph Laplacian operator associated with the algorithm. Furthermore, we give the typical behavior of the representation of the data set into the feature space, which establishes the strong consistency of our proposed algorithm. Keywords: spectral clustering, graph, unsupervised classification, level sets, connected components
4 0.46107936 48 jmlr-2011-Kernel Analysis of Deep Networks
Author: Grégoire Montavon, Mikio L. Braun, Klaus-Robert Müller
Abstract: When training deep networks it is common knowledge that an efficient and well generalizing representation of the problem is formed. In this paper we aim to elucidate what makes the emerging representation successful. We analyze the layer-wise evolution of the representation in a deep network by building a sequence of deeper and deeper kernels that subsume the mapping performed by more and more layers of the deep network and measuring how these increasingly complex kernels fit the learning problem. We observe that deep networks create increasingly better representations of the learning problem and that the structure of the deep network controls how fast the representation of the task is formed layer after layer. Keywords: deep networks, kernel principal component analysis, representations
5 0.44924936 91 jmlr-2011-The Sample Complexity of Dictionary Learning
Author: Daniel Vainsencher, Shie Mannor, Alfred M. Bruckstein
Abstract: A large set of signals can sometimes be described sparsely using a dictionary, that is, every element can be represented as a linear combination of few elements from the dictionary. Algorithms for various signal processing applications, including classification, denoising and signal separation, learn a dictionary from a given set of signals to be represented. Can we expect that the error in representing by such a dictionary a previously unseen signal from the same source will be of similar magnitude as those for the given examples? We assume signals are generated from a fixed distribution, and study these questions from a statistical learning theory perspective. We develop generalization bounds on the quality of the learned dictionary for two types of constraints on the coefficient selection, as measured by the expected L2 error in representation when the dictionary is used. For the case of l1 regularized coefficient selection we provide a generalnp ln(mλ)/m , where n is the dimension, p is the number of ization bound of the order of O elements in the dictionary, λ is a bound on the l1 norm of the coefficient vector and m is the number of samples, which complements existing results. For the case of representing a new signal as a combination of at most k dictionary elements, we provide a bound of the order O( np ln(mk)/m) under an assumption on the closeness to orthogonality of the dictionary (low Babel function). We further show that this assumption holds for most dictionaries in high dimensions in a strong probabilistic sense. Our results also include bounds that converge as 1/m, not previously known for this problem. We provide similar results in a general setting using kernels with weak smoothness requirements. Keywords: dictionary learning, generalization bound, sparse representation
6 0.44025269 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
7 0.43837276 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
8 0.43806759 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
9 0.43538305 12 jmlr-2011-Bayesian Co-Training
10 0.43501866 66 jmlr-2011-Multiple Kernel Learning Algorithms
11 0.43499503 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes
12 0.43495581 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
13 0.42996931 62 jmlr-2011-MSVMpack: A Multi-Class Support Vector Machine Package
14 0.42933345 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
15 0.42887789 59 jmlr-2011-Learning with Structured Sparsity
16 0.42828485 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
17 0.42583805 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
18 0.42571911 16 jmlr-2011-Clustering Algorithms for Chains
19 0.4244926 102 jmlr-2011-Waffles: A Machine Learning Toolkit
20 0.42441976 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments