jmlr jmlr2010 jmlr2010-66 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Giovanni Cavallanti, Nicolò Cesa-Bianchi, Claudio Gentile
Abstract: We introduce new Perceptron-based algorithms for the online multitask binary classification problem. Under suitable regularity conditions, our algorithms are shown to improve on their baselines by a factor proportional to the number of tasks. We achieve these improvements using various types of regularization that bias our algorithms towards specific notions of task relatedness. More specifically, similarity among tasks is either measured in terms of the geometric closeness of the task reference vectors or as a function of the dimension of their spanned subspace. In addition to adapting to the online setting a mix of known techniques, such as the multitask kernels of Evgeniou et al., our analysis also introduces a matrix-based multitask extension of the p-norm Perceptron, which is used to implement spectral co-regularization. Experiments on real-world data sets complement and support our theoretical findings. Keywords: mistake bounds, perceptron algorithm, multitask learning, spectral regularization
Reference: text
sentIndex sentText sentNum sentScore
1 DICOM, Università dell'Insubria, via Mazzini 5, 21100 Varese, Italy. Editor: Manfred Warmuth. Abstract: We introduce new Perceptron-based algorithms for the online multitask binary classification problem. [sent-7, score-0.771]
2 In addition to adapting to the online setting a mix of known techniques, such as the multitask kernels of Evgeniou et al. [sent-11, score-0.771]
3 , our analysis also introduces a matrix-based multitask extension of the p-norm Perceptron, which is used to implement spectral co-regularization. [sent-12, score-0.75]
4 Keywords: mistake bounds, perceptron algorithm, multitask learning, spectral regularization [sent-14, score-1.169]
5 In multitask classification an online linear classifier (such as the Perceptron algorithm) learns from examples associated with K > 1 different binary classification tasks. [sent-23, score-0.771]
6 Our investigation considers two variants of the online multitask protocol: (1) at each time step the learner acts on a single adversarially chosen task; (2) all tasks are simultaneously processed at each time step. [sent-27, score-0.992]
7 The multitask classifiers we study here manage, under certain assumptions, to improve on the cumulative regret achieved by a natural baseline algorithm through the information acquired and shared across different tasks. [sent-30, score-0.791]
8 We first define a natural extension of the p-norm Perceptron algorithm to a certain class of matrix norms, and then provide a mistake bound analysis for the multitask learning problem depending on spectral relations among different tasks. [sent-47, score-0.934]
9 The above is possible as long as the example vectors observed at each time step are unrelated, while the sequences of multitask data are well predicted by a set of highly related linear classifiers. [sent-50, score-0.755]
10 First, we provide theoretical guarantees in the form of mistake bounds for various algorithms operating within the online multitask protocol. [sent-53, score-0.88]
11 In this context, the notion of relatedness among tasks follows from the specific choice of a matrix parameter that essentially defines the so-called multitask kernel. [sent-60, score-0.959]
12 In the simultaneous task setting, we introduce and analyze a matrix-based multitask extension of the p-norm Perceptron algorithm (Grove et al. [sent-62, score-0.836]
13 In particular, we show that on a text categorization problem, where each task requires detecting the topic of a news item, a large multitask performance increase is attainable whenever the target topics are related. [sent-65, score-0.799]
14 The multitask Perceptron algorithm is presented in Section 3 where we also discuss the role of the multitask feature map and show (Section 4) that it can be used to turn online classifiers into multitask classifiers. [sent-69, score-2.223]
15 We detail the matrix-based approach to the simultaneous multitask learning framework in Section 5. [sent-70, score-0.763]
16 (2005) a batch multitask learning problem is defined as a regularized optimization problem and the notion of multitask kernel is introduced. [sent-76, score-1.474]
17 (2007, 2008) build on this formalization to simultaneously learn a multitask classifier and the underlying spectral dependencies among tasks. [sent-79, score-0.75]
18 A different approach is discussed in Ando and Zhang (2005) where a structural risk minimization method is presented and multitask relations are established by enforcing predictive functions for the different tasks to belong to the same hypothesis set. [sent-82, score-0.842]
19 In the context of online learning, multitask problems have been studied in Abernethy et al. [sent-84, score-0.771]
20 Whereas these studies consider a multitask protocol in which a single task is acted upon at each time step (what we call in this paper the adversarially chosen task protocol), the work of Lugosi et al. [sent-88, score-1.014]
21 Online linear multitask algorithms for the simultaneous task setting have been studied in Dekel et al. [sent-91, score-0.836]
22 (2007), where the separate learning tasks are collectively dealt with through a common multitask loss. [sent-92, score-0.842]
23 Online matrix approaches to the multitask and the related multiview learning problems were considered in various works. [sent-97, score-0.781]
24 When dealing with the trace norm regularizer, their algorithms could be generalized to our simultaneous multitask framework to obtain mistake bounds comparable to ours. [sent-100, score-0.92]
25 (2008) consider multitask problems in the restricted expert setting, where task relatedness is enforced by a group norm regularization. [sent-103, score-0.889]
26 The multitask instance is φt⊤ = (0, . . . , 0, xt⊤, 0, . . . , 0), where xt⊤ is preceded by (it − 1)d zeros and followed by (K − it)d zeros (1). Within this protocol (studied in Sections 3 and 4) we use φt or (xt, it) interchangeably when referring to a multitask instance. [sent-136, score-0.868]
27 To remain consistent with the notation used for multitask instances, we introduce the “compound” reference task vector u⊤ = (u1⊤, . . . , uK⊤). [sent-139, score-0.834]
28 The Multitask Perceptron Algorithm We first introduce a simple multitask version of the Perceptron algorithm for the protocol described in the previous section. [sent-158, score-0.785]
29 The pseudocode for the multitask Perceptron algorithm using a generic interaction matrix A is given in Figure 1. [sent-170, score-0.834]
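To make the structure of Figure 1 concrete, here is a minimal Python sketch of the multitask Perceptron with a generic K x K interaction matrix A. It is an illustration rather than the paper's code: the example interface (x_t, i_t, y_t) and the variable names are assumptions, and it relies on the compound update v ← v + yt A⊗−1 φt quoted later in this summary.

```python
import numpy as np

def multitask_perceptron(examples, K, d, A):
    """Multitask Perceptron with a K x K interaction matrix A (illustrative sketch).

    examples: iterable of (x_t, i_t, y_t) with x_t in R^d, task index
              i_t in {0, ..., K-1} and label y_t in {-1, +1}.
    """
    A_inv = np.linalg.inv(A)
    W = np.zeros((K, d))      # row j is the j-th task weight; stacking rows gives w in R^{Kd}
    mistakes = 0
    for x_t, i_t, y_t in examples:
        y_hat = 1 if W[i_t] @ x_t >= 0 else -1   # sgn(w^T phi_t): only block i_t is active
        if y_hat != y_t:
            mistakes += 1
            # compound update w <- w + y_t (A^{-1} kron I_d) phi_t:
            # block j receives y_t * (A^{-1})_{j, i_t} * x_t, so every task shares the update
            W += y_t * np.outer(A_inv[:, i_t], x_t)
    return W, mistakes
```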
30 We specify the online multitask problem by the sequence (φ1 , y1 ), (φ2 , y2 ), . [sent-201, score-0.771]
31 Theorem 1 The number of mistakes m made by the multitask Perceptron algorithm in Figure 1, run with an interaction matrix A on any finite multitask sequence of examples (φ1 , y1 ), (φ2 , y2 ), . [sent-205, score-1.663]
32 Theorem 1 is readily proven by using the fact that the multitask Perceptron is a specific instance of the kernel Perceptron algorithm, for example, Freund and Schapire (1999), using the so-called linear multitask kernel introduced in Evgeniou et al. [sent-215, score-1.514]
33 The kernel used by the multitask Perceptron is thus defined by K((xs, is), (xt, it)) = ⟨ψ(xs, is), ψ(xt, it)⟩H = φs⊤ A⊗−1 φt. [sent-223, score-0.748]
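Because φs and φt are zero outside blocks is and it, the kernel above collapses to a single entry of A−1 times an ordinary inner product. A small hedged sketch (the function name and interface are illustrative; a nonlinear base kernel on the raw instances could replace the dot product, in line with the kernel Perceptron view):

```python
import numpy as np

def multitask_kernel(x_s, i_s, x_t, i_t, A_inv):
    """K((x_s, i_s), (x_t, i_t)) = phi_s^T (A^{-1} kron I_d) phi_t  (illustrative sketch).

    The block structure of phi_s and phi_t reduces the Kronecker product to a
    single scalar entry of A^{-1} times an inner product between the raw instances.
    """
    return A_inv[i_s, i_t] * (x_s @ x_t)
```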
34 1 Pairwise Distance Interaction Matrix The first choice of A we consider is the following simple update step (corresponding to the multitask Perceptron example we made at the beginning of this section). [sent-237, score-0.753]
35 Corollary 3 The number of mistakes m made by the multitask Perceptron algorithm in Figure 1, run with the interaction matrix (4) on any finite multitask sequence of examples (φ1 , y1 ), (φ2 , y2 ), . [sent-289, score-1.663]
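For illustration, a sketch of one natural candidate for the pairwise-distance interaction matrix: the identity plus the Laplacian of the complete graph over the K tasks, A = (K+1)I − 11⊤. This specific form and its constants are an assumption to be checked against equation (4); under it, a mistake on task it is shared with weight 2/(K+1) on task it and weight 1/(K+1) on every other task.

```python
import numpy as np

def pairwise_distance_matrix(K):
    """Candidate pairwise-distance interaction matrix (constants are an assumption):
    identity plus the Laplacian of the complete graph on K tasks, A = (K+1) I - 1 1^T."""
    return (K + 1) * np.eye(K) - np.ones((K, K))

K = 4
A_inv = np.linalg.inv(pairwise_distance_matrix(K))
# Under this choice A^{-1} = (I + 1 1^T) / (K+1): a mistake on task i is propagated
# with rate 2/(K+1) to task i and rate 1/(K+1) to every other task.
print(np.round((K + 1) * A_inv, 6))   # 2 on the diagonal, 1 off the diagonal
```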
36 Corollary 4 The number of mistakes m made by the multitask Perceptron algorithm in Figure 1, run with the interaction matrix (5) on any finite multitask sequence of examples (φ1 , y1 ), (φ2 , y2 ), . [sent-361, score-1.663]
37 In this case the multitask algorithm becomes essentially equivalent to an algorithm that, before learning starts, chooses one task at random and keeps referring all instance vectors xt to that task (somehow implementing the fact that now the information conveyed by it can be disregarded). [sent-377, score-1.002]
38 Corollary 5 The number of mistakes m made by the multitask Perceptron algorithm in Figure 1, run with the interaction matrix I + L on any finite multitask sequence of examples (φ1 , y1 ), (φ2 , y2 ), . [sent-385, score-1.663]
39 Turning Perceptron-like Algorithms Into Multitask Classifiers We now show how to obtain multitask versions of well-known classifiers by using the multitask kernel mapping detailed in Section 3. [sent-425, score-1.474]
40 If a mistake occurs at time t, vs−1 ∈ RKd is updated using the multitask Perceptron rule vs = vs−1 + yt A⊗−1 φt. [sent-432, score-1.092]
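A hedged sketch of one such round follows. The link function w = ∇(½‖v‖p²) is the standard p-norm Perceptron mapping from the literature cited above; whether the multitask variant premultiplies instances by A⊗−1 or A⊗−1/2, and the exact normalization, should be read off the paper (Remark 7 below touches on this distinction).

```python
import numpy as np

def pnorm_link(v, p):
    """w = grad(0.5 * ||v||_p^2): w_j = sign(v_j) |v_j|^(p-1) / ||v||_p^(p-2)."""
    norm = np.linalg.norm(v, ord=p)
    if norm == 0.0:
        return np.zeros_like(v)
    return np.sign(v) * np.abs(v) ** (p - 1) / norm ** (p - 2)

def pnorm_multitask_round(v, x_t, i_t, y_t, A_inv, d, p):
    """One mistake-driven round of the p-norm multitask Perceptron (sketch)."""
    K = A_inv.shape[0]
    phi_t = np.zeros(K * d)
    phi_t[i_t * d:(i_t + 1) * d] = x_t           # multitask instance phi_t
    w = pnorm_link(v, p)                         # map the primal vector to a weight vector
    y_hat = 1 if w @ phi_t >= 0 else -1
    if y_hat != y_t:                             # v_s = v_{s-1} + y_t * A_kron^{-1} phi_t
        v = v + y_t * (np.kron(A_inv, np.eye(d)) @ phi_t)
    return v, y_hat
```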
41 We are now ready to state the mistake bound for the multitask p-norm Perceptron algorithm. [sent-433, score-0.855]
42 Theorem 6 The number of mistakes m made by the p-norm multitask Perceptron, run with the pairwise distance matrix (4) and p = 2 ln max{K, d}, on any finite multitask sequence of examples (φ1 , y1 ), (φ2 , y2 ), . [sent-435, score-1.627]
43 Remark 7 Note that for p = 2 our p-norm variant of the multitask Perceptron algorithm does not reduce to the multitask Perceptron of Figure 1. [sent-477, score-1.452]
44 One then obtains a proper p-norm generalization of the multitask Perceptron algorithm by running the standard p-norm Perceptron on such multitask instances. [sent-479, score-1.452]
45 For example, when p is chosen as in Theorem 6 and all task vectors are equal, then multitask instances of the form A⊗−1/2 φt yield a bound K times worse than (10), which is obtained with instances of the form A⊗−1 φt. [sent-481, score-0.848]
46 The algorithm, described in Figure 2, maintains in its internal state a matrix S (initialized to the empty matrix) and a multitask Perceptron weight vector v (initialized to the zero vector). [sent-486, score-0.836]
47 Theorem 8 The number of mistakes m made by the multitask Second Order Perceptron algorithm in Figure 2, run with an interaction matrix A on any finite multitask sequence of examples (φ1 , y1 ), (φ2 , y2 ), . [sent-495, score-1.663]
48 In this case, however, the interaction matrix A also plays a role in the scale of the eigenvalues of the resulting multitask Gram matrix. [sent-512, score-0.834]
49 Putting things together, unlike the first-order Perceptron, the gain factor achieved by a multitask second-order Perceptron over the K independent tasks bound is about √K. [sent-516, score-1.172]
50 Implementing the Multitask Second-Order Perceptron in Dual Form. It is easy to see that the second-order multitask Perceptron can be run in dual form by maintaining K classifiers that share the same set of support vectors. [sent-519, score-0.747]
51 The Simultaneous Multitask Protocol: Preliminaries The multitask kernel-based regularization approach adopted in the previous sections is not the only way to design algorithms for the multiple tasks scenario. [sent-534, score-0.842]
52 We therefore extend the traditional online classification protocol to a fully simultaneous multitask environment where at each time step t the learner observes exactly K instance vectors xi,t ∈ Rd , i = 1, . [sent-538, score-0.936]
53 In the next section we show that simultaneous multitask learning algorithms can be designed in such a way that the cumulative number of mistakes is, in certain relevant cases, provably smaller than that of K independent Perceptron algorithms. [sent-560, score-0.888]
54 Our potential-based matrix algorithm for classification shown in Figure 3 generalizes the classical potential-based algorithms operating on vectors to simultaneous multitask problems with matrix examples. [sent-594, score-0.902]
55 Figure 3: The potential-based matrix algorithm for the simultaneous multitask setting. [sent-635, score-0.818]
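A rough sketch of one simultaneous round, assuming the potential is the squared Schatten 2p-norm of the d x K primal matrix V, so that the weight matrix is W = ∇(½‖V‖²₂ₚ) = V(V⊤V)^(p−1)/‖V‖₂ₚ^(2p−2). The per-task weighting τt of the updates discussed in Remark 12 below is omitted: the sketch simply adds y_{i,t} x_{i,t} to the columns of mistaken tasks. For p = 1 it reduces to K independent Perceptrons, matching the baseline mentioned after Theorem 9.

```python
import numpy as np

def schatten_2p_link(V, p):
    """W = grad(0.5 * ||V||_{2p}^2) for the Schatten 2p-norm (integer p >= 1, sketch):
    W = V (V^T V)^(p-1) / ||V||_{2p}^(2p-2); for p = 1 this is simply W = V."""
    G = V.T @ V                                            # K x K
    norm_2p = np.trace(np.linalg.matrix_power(G, p)) ** (1.0 / (2 * p))
    if norm_2p == 0.0:
        return np.zeros_like(V)
    return V @ np.linalg.matrix_power(G, p - 1) / norm_2p ** (2 * p - 2)

def matrix_multitask_round(V, X_t, y_t, p):
    """One simultaneous round: X_t is d x K (one instance per task), y_t in {-1,+1}^K.
    Mistake-driven sketch; the paper's tau_t weighting of the updates is omitted."""
    W = schatten_2p_link(V, p)
    y_hat = np.where((W * X_t).sum(axis=0) >= 0, 1, -1)    # per-task predictions
    mistaken = (y_hat != y_t).astype(float)
    V = V + X_t * (y_t * mistaken)                         # update only mistaken columns
    return V, y_hat
```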
56 1 Analysis of Potential-based Matrix Classifiers We now develop a general analysis of potential-based matrix algorithms for (simultaneous) multitask classification. [sent-640, score-0.781]
57 Equation (18) is our general starting point for analyzing potential-based matrix multitask algorithms. [sent-661, score-0.781]
58 Theorem 9 The overall number of mistakes µ made by the 2p-norm matrix multitask Perceptron (with p a positive integer) run on finite sequences of examples (xi,1 , yi,1 ), (xi,2 , yi,2 ), · · · ∈ Rd×K × {−1, +1}, for i = 1, . [sent-677, score-0.884]
59 (19) It is easy to see that for p = q = 1 the 2p-norm matrix multitask Perceptron decomposes into K independent Perceptrons, which is our baseline. [sent-692, score-0.781]
60 These terms play an essential role in assessing the potential advantage of the 2p-norm matrix multitask Perceptron. [sent-702, score-0.781]
61 Remark 12 The 2p-norm matrix multitask Perceptron algorithm updates the primal vector associated with a given task whenever an example for that task is wrongly predicted. [sent-709, score-0.956]
62 Since each entry of τt depends on all the K hinge losses suffered by the 2p-norm matrix multitask Perceptron algorithm at time t, the update now acts so as to favor certain tasks over the others according to the shared loss induced by the norm ‖·‖. [sent-719, score-0.924]
63 Implementation in Dual Form. As for the algorithms in previous sections, the 2p-norm matrix multitask Perceptron algorithm can also be implemented in dual variables. [sent-729, score-0.802]
64 This allows us to turn our 2p-norm matrix multitask Perceptron into a kernel-based algorithm, and repeat the analysis given here using a standard RKHS formalism—see Warmuth (2009) for a more general treatment of kernelizable matrix learning algorithms. [sent-735, score-0.836]
65 Since we are more interested in the multitask kernel for sequential learning problems than in the nature of the underlying classifiers, we restricted the experiments for the adversarially chosen task model to the multitask Perceptron algorithm of Section 3. [sent-738, score-0.755]
66 In particular, we compare the performance of the multitask Perceptron algorithm with parameter b > 0 to that of the same algorithm run with b = 0, which is our multitask baseline. [sent-739, score-1.452]
67 We also evaluated the 2p-norm matrix multitask Perceptron algorithm under a similar experimental setting, reporting the achieved performance for different values of the parameter p. [sent-742, score-0.781]
68 Finally, we provide experimental evidence of the effectiveness of the 2p-norm matrix multitask Perceptron algorithm when applied to a learning problem which requires the simultaneous processing of a significant number of tasks. [sent-743, score-0.818]
69 In our initial experiments, we empirically evaluated the multitask kernel using a collection of data sets derived from the first 160,000 newswire stories in the Reuters Corpus Volume 1 (RCV1, for details see NIST, 2004). [sent-744, score-0.748]
70 In fact, in order to evaluate the performance of our multitask algorithms in the presence of increasing levels of correlation among target tasks, we derived from RCV1 a collection of data sets where each example is associated with one task among a set of predefined tasks. [sent-746, score-0.799]
71 We generated eight multitask data sets (D1 through D8) in such a way that tasks in different data sets have different levels of correlation, from almost uncorrelated (D1) to completely overlapped (D8). [sent-747, score-0.87]
72 Once the eight sets of four tasks have been chosen, we generated labels for the corresponding multitask examples as follows. [sent-769, score-0.842]
73 A multitask example, defined as a set of four (instance, binary label) pairs, was then derived by replacing, for each of the four RCV1 examples, the original RCV1 categories with −1 if the intersection between the associated task and the categories was empty (i. [sent-771, score-0.905]
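A small sketch of the labeling scheme just described, with a hypothetical data layout (TF-IDF vectors paired with RCV1 category sets; each task given as a set of categories):

```python
def make_multitask_example(rcv1_items, tasks):
    """Build one multitask example from four RCV1 items (hypothetical data layout).

    rcv1_items: list of four (tfidf_vector, category_set) pairs.
    tasks:      list of four tasks, each given as a set of RCV1 categories.
    The label is -1 when the intersection between the task and the item's
    categories is empty, and +1 otherwise.
    """
    return [(x, 1 if (task & categories) else -1)
            for (x, categories), task in zip(rcv1_items, tasks)]
```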
74 Table 1: Online training error rates made by the baseline b = 0 on consecutive sequences of multitask examples after a single pass on the data sets D1-D8. [sent-839, score-0.769]
75 Figure 5 shows the fraction of wrongly predicted examples during the online training of the multitask Perceptron algorithm with interaction matrix (5), when the task it is chosen either randomly (left) or in an adversarial manner (right). [sent-843, score-1.01]
76 Recall that b = 0 amounts to running four independent Perceptron algorithms, while b = 4 corresponds to running the multitask Perceptron algorithm with the interaction matrix (4). [sent-848, score-0.834]
77 Figure 5 confirms that multitask algorithms get more and more competitive as tasks get closer (since tasks are uncorrelated in D1 and totally overlapped in D8). [sent-851, score-0.986]
78 We report the extent to which, during a single pass over the entire data set, the fraction of training mistakes made by the multitask Perceptron algorithm (with b = 1, 4, 8, 16) exceeds the one achieved by the baseline b = 0, represented here as a horizontal line. [sent-972, score-0.901]
79 On the x-axis are the multitask data sets whose tasks have different levels of correlations, from low (D1) to high (D8). [sent-973, score-0.842]
80 Table 2: Fractions of training errors made by the baseline p = 1 on consecutive sequences of multitask examples after a single pass on the data sets D1-D8. [sent-1008, score-0.788]
81 overall better performance than the multitask baseline b = 0, with a higher cumulative error only on the first data set. [sent-1028, score-0.791]
82 We then evaluated the 2p-norm matrix multitask Perceptron algorithm on the same data set. [sent-1029, score-0.781]
83 This allows us to make the results achieved by the 2p-norm matrix multitask Perceptron algorithm somehow comparable (i. [sent-1032, score-0.781]
84 total number of binary labels received) with the ones achieved by the multitask Perceptron algorithm under the random task selection model (the four plots on the left in Figure 5). [sent-1035, score-0.799]
85 In Figure 6 we report the (differential) fractions of online training mistakes made by the algorithm on the subsequences 1-2,500, 2,501-5,000, 5,001-7,500 and 7,501-10,000 of multitask examples. [sent-1036, score-0.924]
86 As expected, the closer the tasks get to each other, the smaller the number of wrongly predicted labels output by the 2p-norm matrix multitask Perceptron algorithm and the larger the gap from the baseline. [sent-1038, score-0.926]
87 In particular, it can be observed that even in D1 the performance of the 2p-norm matrix multitask Perceptron algorithm is no worse than the one achieved by the baseline. [sent-1039, score-0.781]
88 As a further assessment, we evaluated the empirical performance of the p-norm matrix multitask algorithm on the ECML/PKDD 2006 Discovery Challenge spam data set (for details, see ...). [sent-1041, score-0.781]
89 [Figure 6, first panel: fraction of errors recorded from the 1st to the 2,500th multitask example, data sets D1-D8.] [sent-1047, score-0.785]
90 [Figure 6, second panel: fraction of errors recorded from the 2,501st to the 5,000th multitask example, data sets D1-D8.] [sent-1053, score-0.785]
91 [Figure 6, third panel: fraction of errors recorded from the 5,001st to the 7,500th multitask example, data sets D1-D8.] [sent-1059, score-0.785]
92 [Figure 6, fourth panel: fraction of errors recorded from the 7,501st to the 10,000th multitask example.] Figure 6: Online behavior of the 2p-norm matrix multitask Perceptron algorithm. [sent-1065, score-1.566]
93 Again, we report the extent to which, during a single pass over the entire data set, the fraction of training mistakes made by the 2p-norm matrix multitask Perceptron algorithm (with p = 2, 3, 4, 5) exceeds the one achieved by the baseline p = 1, represented here as a horizontal line. [sent-1066, score-0.956]
94 On the x-axis are the multitask data sets whose tasks have different levels of correlations, from low (D1) to high (D8). [sent-1067, score-0.842]
95 Table 3 shows that the 2p-norm matrix multitask algorithm exploits underlying latent relations and manages to achieve a significant decrease in both the training and test errors. [sent-1076, score-0.781]
96 Whereas these are average values, the advantage of the 2p-norm matrix multitask algorithm is still significant even when the standard deviation is factored in. [sent-1080, score-0.781]
97 In other words, if on a given fold the p-norm matrix multitask Perceptron algorithm with p > 1 made a larger number of mistakes than its average, the same held true for the baseline. [sent-1082, score-0.884]
98 Conclusions and Open Problems We have studied the problem of sequential multitask learning using two different approaches to formalize the notion of task relatedness: via the Euclidean distance between task vectors, or via a unitarily invariant norm applied to the matrix of task vectors. [sent-1088, score-1.1]
99 These two approaches naturally correspond to two different online multitask protocols: one where a single task is selected at each time step, and one where the learner operates simultaneously on all tasks at each time step. [sent-1089, score-0.982]
100 Our multitask regularization techniques rely on the fact that different tasks need to be embedded, either naturally or through some sensible preprocessing, in the same d-dimensional space. [sent-1108, score-0.842]
wordName wordTfidf (topN-words)
[('multitask', 0.726), ('perceptron', 0.31), ('vec', 0.159), ('rkd', 0.158), ('vs', 0.145), ('mt', 0.128), ('ss', 0.123), ('avallanti', 0.117), ('entile', 0.117), ('esa', 0.117), ('ianchi', 0.117), ('ultitask', 0.117), ('tasks', 0.116), ('yt', 0.112), ('mistake', 0.109), ('mistakes', 0.103), ('adversarially', 0.083), ('xt', 0.083), ('lgorithms', 0.078), ('ws', 0.077), ('tr', 0.073), ('nline', 0.073), ('task', 0.073), ('inear', 0.069), ('schatten', 0.065), ('relatedness', 0.062), ('protocol', 0.059), ('matrix', 0.055), ('interaction', 0.053), ('unitarily', 0.053), ('lassification', 0.053), ('categories', 0.053), ('ui', 0.052), ('tk', 0.049), ('online', 0.045), ('compound', 0.043), ('delta', 0.043), ('baseline', 0.043), ('xbg', 0.041), ('ik', 0.04), ('recorded', 0.04), ('evgeniou', 0.037), ('simultaneous', 0.037), ('reference', 0.035), ('rd', 0.033), ('vm', 0.031), ('cg', 0.03), ('fraction', 0.029), ('wrongly', 0.029), ('vectors', 0.029), ('norm', 0.028), ('norms', 0.028), ('herbster', 0.028), ('mistaken', 0.028), ('overlapped', 0.028), ('omnipress', 0.027), ('id', 0.027), ('update', 0.027), ('subsequences', 0.026), ('warmuth', 0.024), ('spectral', 0.024), ('fractions', 0.024), ('gentile', 0.024), ('magnus', 0.024), ('learner', 0.022), ('kernel', 0.022), ('cumulative', 0.022), ('perceptrons', 0.021), ('grove', 0.021), ('dekel', 0.021), ('dual', 0.021), ('dsi', 0.021), ('holder', 0.021), ('winnow', 0.021), ('remark', 0.02), ('bound', 0.02), ('sgn', 0.02), ('rkhs', 0.02), ('trace', 0.02), ('var', 0.02), ('hg', 0.02), ('tp', 0.02), ('invariant', 0.019), ('errors', 0.019), ('progressively', 0.018), ('argyriou', 0.018), ('tq', 0.018), ('kakade', 0.018), ('def', 0.018), ('instance', 0.018), ('gram', 0.018), ('littlestone', 0.018), ('kd', 0.018), ('multilabel', 0.018), ('wit', 0.018), ('cavallanti', 0.018), ('claudio', 0.018), ('gauge', 0.018), ('neudecker', 0.018), ('pairwise', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 66 jmlr-2010-Linear Algorithms for Online Multitask Classification
Author: Giovanni Cavallanti, Nicolò Cesa-Bianchi, Claudio Gentile
Abstract: We introduce new Perceptron-based algorithms for the online multitask binary classification problem. Under suitable regularity conditions, our algorithms are shown to improve on their baselines by a factor proportional to the number of tasks. We achieve these improvements using various types of regularization that bias our algorithms towards specific notions of task relatedness. More specifically, similarity among tasks is either measured in terms of the geometric closeness of the task reference vectors or as a function of the dimension of their spanned subspace. In addition to adapting to the online setting a mix of known techniques, such as the multitask kernels of Evgeniou et al., our analysis also introduces a matrix-based multitask extension of the p-norm Perceptron, which is used to implement spectral co-regularization. Experiments on real-world data sets complement and support our theoretical findings. Keywords: mistake bounds, perceptron algorithm, multitask learning, spectral regularization
2 0.10447041 44 jmlr-2010-Graph Kernels
Author: S.V.N. Vishwanathan, Nicol N. Schraudolph, Risi Kondor, Karsten M. Borgwardt
Abstract: We present a unified framework to study graph kernels, special cases of which include the random walk (Gärtner et al., 2003; Borgwardt et al., 2005) and marginalized (Kashima et al., 2003, 2004; Mahé et al., 2004) graph kernels. Through reduction to a Sylvester equation we improve the time complexity of kernel computation between unlabeled graphs with n vertices from O(n6) to O(n3). We find a spectral decomposition approach even more efficient when computing entire kernel matrices. For labeled graphs we develop conjugate gradient and fixed-point methods that take O(dn3) time per iteration, where d is the size of the label set. By extending the necessary linear algebra to Reproducing Kernel Hilbert Spaces (RKHS) we obtain the same result for d-dimensional edge kernels, and O(n4) in the infinite-dimensional case; on sparse graphs these algorithms only take O(n2) time per iteration in all cases. Experiments on graphs from bioinformatics and other application domains show that these techniques can speed up computation of the kernel by an order of magnitude or more. We also show that certain rational kernels (Cortes et al., 2002, 2003, 2004) when specialized to graphs reduce to our random walk graph kernel. Finally, we relate our framework to R-convolution kernels (Haussler, 1999) and provide a kernel that is close to the optimal assignment kernel of Fröhlich et al. (2006) yet provably positive semi-definite. Keywords: linear algebra in RKHS, Sylvester equations, spectral decomposition, bioinformatics, rational kernels, transducers, semirings, random walks
3 0.08852721 84 jmlr-2010-On Spectral Learning
Author: Andreas Argyriou, Charles A. Micchelli, Massimiliano Pontil
Abstract: In this paper, we study the problem of learning a matrix W from a set of linear measurements. Our formulation consists in solving an optimization problem which involves regularization with a spectral penalty term. That is, the penalty term is a function of the spectrum of the covariance of W. Instances of this problem in machine learning include multi-task learning, collaborative filtering and multi-view learning, among others. Our goal is to elucidate the form of the optimal solution of spectral learning. The theory of spectral learning relies on the von Neumann characterization of orthogonally invariant norms and their association with symmetric gauge functions. Using this tool we formulate a representer theorem for spectral regularization and specify it to several useful examples, such as Schatten p-norms, trace norm and spectral norm, which should prove useful in applications. Keywords: kernel methods, matrix learning, minimal norm interpolation, multi-task learning, orthogonally invariant norms, regularization
4 0.049411088 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs
Author: Garvesh Raskutti, Martin J. Wainwright, Bin Yu
Abstract: Methods based on ℓ1 -relaxation, such as basis pursuit and the Lasso, are very popular for sparse regression in high dimensions. The conditions for success of these methods are now well-understood: (1) exact recovery in the noiseless setting is possible if and only if the design matrix X satisfies the restricted nullspace property, and (2) the squared ℓ2 -error of a Lasso estimate decays at the minimax optimal rate (k log p)/n, where k is the sparsity of the p-dimensional regression problem with additive Gaussian noise, whenever the design satisfies a restricted eigenvalue condition. The key issue is thus to determine when the design matrix X satisfies these desirable properties. Thus far, there have been numerous results showing that the restricted isometry property, which implies both the restricted nullspace and eigenvalue conditions, is satisfied when all entries of X are independent and identically distributed (i.i.d.), or the rows are unitary. This paper proves directly that the restricted nullspace and eigenvalue conditions hold with high probability for quite general classes of Gaussian matrices for which the predictors may be highly dependent, and hence restricted isometry conditions can be violated with high probability. In this way, our results extend the attractive theoretical guarantees on ℓ1 -relaxations to a much broader class of problems than the case of completely independent or unitary designs. Keywords: Lasso, basis pursuit, random matrix theory, Gaussian comparison inequality, concentration of measure
Author: Dapo Omidiran, Martin J. Wainwright
Abstract: We consider the problem of high-dimensional variable selection: given n noisy observations of a k-sparse vector β∗ ∈ R p , estimate the subset of non-zero entries of β∗ . A significant body of work has studied behavior of ℓ1 -relaxations when applied to random measurement matrices that are dense (e.g., Gaussian, Bernoulli). In this paper, we analyze sparsified measurement ensembles, and consider the trade-off between measurement sparsity, as measured by the fraction γ of nonzero entries, and the statistical efficiency, as measured by the minimal number of observations n required for correct variable selection with probability converging to one. Our main result is to prove that it is possible to let the fraction on non-zero entries γ → 0 at some rate, yielding measurement matrices with a vanishing fraction of non-zeros per row, while retaining the same statistical efficiency as dense ensembles. A variety of simulation results confirm the sharpness of our theoretical predictions. Keywords: variable selection, sparse random projections, high-dimensional statistics, Lasso, consistency, ℓ1 -regularization
6 0.040714327 65 jmlr-2010-Learning Translation Invariant Kernels for Classification
7 0.039679047 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
8 0.038984511 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding
9 0.037354745 80 jmlr-2010-On-Line Sequential Bin Packing
10 0.036509976 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models
11 0.036164951 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning
12 0.035812806 76 jmlr-2010-Message-passing for Graph-structured Linear Programs: Proximal Methods and Rounding Schemes
13 0.034121376 83 jmlr-2010-On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation
14 0.034015395 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM
15 0.033929206 75 jmlr-2010-Mean Field Variational Approximation for Continuous-Time Bayesian Networks
16 0.033637047 82 jmlr-2010-On Learning with Integral Operators
17 0.032626379 72 jmlr-2010-Matrix Completion from Noisy Entries
18 0.032621101 23 jmlr-2010-Classification with Incomplete Data Using Dirichlet Process Priors
19 0.03220642 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
20 0.02818094 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization
topicId topicWeight
[(0, -0.151), (1, -0.067), (2, 0.039), (3, 0.0), (4, -0.026), (5, -0.004), (6, -0.098), (7, -0.088), (8, -0.009), (9, -0.008), (10, 0.021), (11, 0.034), (12, 0.098), (13, -0.011), (14, 0.019), (15, -0.004), (16, 0.081), (17, 0.068), (18, 0.033), (19, 0.115), (20, -0.092), (21, -0.009), (22, 0.062), (23, -0.008), (24, 0.005), (25, -0.092), (26, 0.047), (27, -0.037), (28, 0.061), (29, 0.129), (30, 0.068), (31, 0.054), (32, 0.063), (33, 0.065), (34, -0.087), (35, -0.318), (36, 0.182), (37, -0.032), (38, 0.159), (39, -0.357), (40, -0.065), (41, -0.052), (42, 0.018), (43, -0.171), (44, 0.202), (45, 0.109), (46, -0.284), (47, 0.11), (48, 0.071), (49, 0.094)]
simIndex simValue paperId paperTitle
same-paper 1 0.95890605 66 jmlr-2010-Linear Algorithms for Online Multitask Classification
Author: Giovanni Cavallanti, Nicolò Cesa-Bianchi, Claudio Gentile
Abstract: We introduce new Perceptron-based algorithms for the online multitask binary classification problem. Under suitable regularity conditions, our algorithms are shown to improve on their baselines by a factor proportional to the number of tasks. We achieve these improvements using various types of regularization that bias our algorithms towards specific notions of task relatedness. More specifically, similarity among tasks is either measured in terms of the geometric closeness of the task reference vectors or as a function of the dimension of their spanned subspace. In addition to adapting to the online setting a mix of known techniques, such as the multitask kernels of Evgeniou et al., our analysis also introduces a matrix-based multitask extension of the p-norm Perceptron, which is used to implement spectral co-regularization. Experiments on real-world data sets complement and support our theoretical findings. Keywords: mistake bounds, perceptron algorithm, multitask learning, spectral regularization
2 0.46685648 84 jmlr-2010-On Spectral Learning
Author: Andreas Argyriou, Charles A. Micchelli, Massimiliano Pontil
Abstract: In this paper, we study the problem of learning a matrix W from a set of linear measurements. Our formulation consists in solving an optimization problem which involves regularization with a spectral penalty term. That is, the penalty term is a function of the spectrum of the covariance of W. Instances of this problem in machine learning include multi-task learning, collaborative filtering and multi-view learning, among others. Our goal is to elucidate the form of the optimal solution of spectral learning. The theory of spectral learning relies on the von Neumann characterization of orthogonally invariant norms and their association with symmetric gauge functions. Using this tool we formulate a representer theorem for spectral regularization and specify it to several useful examples, such as Schatten p-norms, trace norm and spectral norm, which should prove useful in applications. Keywords: kernel methods, matrix learning, minimal norm interpolation, multi-task learning, orthogonally invariant norms, regularization
3 0.31801403 37 jmlr-2010-Evolving Static Representations for Task Transfer
Author: Phillip Verbancsics, Kenneth O. Stanley
Abstract: An important goal for machine learning is to transfer knowledge between tasks. For example, learning to play RoboCup Keepaway should contribute to learning the full game of RoboCup soccer. Previous approaches to transfer in Keepaway have focused on transforming the original representation to fit the new task. In contrast, this paper explores the idea that transfer is most effective if the representation is designed to be the same even across different tasks. To demonstrate this point, a bird’s eye view (BEV) representation is introduced that can represent different tasks on the same two-dimensional map. For example, both the 3 vs. 2 and 4 vs. 3 Keepaway tasks can be represented on the same BEV. Yet the problem is that a raw two-dimensional map is high-dimensional and unstructured. This paper shows how this problem is addressed naturally by an idea from evolutionary computation called indirect encoding, which compresses the representation by exploiting its geometry. The result is that the BEV learns a Keepaway policy that transfers without further learning or manipulation. It also facilitates transferring knowledge learned in a different domain, Knight Joust, into Keepaway. Finally, the indirect encoding of the BEV means that its geometry can be changed without altering the solution. Thus static representations facilitate several kinds of transfer.
4 0.29998815 44 jmlr-2010-Graph Kernels
Author: S.V.N. Vishwanathan, Nicol N. Schraudolph, Risi Kondor, Karsten M. Borgwardt
Abstract: We present a unified framework to study graph kernels, special cases of which include the random walk (Gärtner et al., 2003; Borgwardt et al., 2005) and marginalized (Kashima et al., 2003, 2004; Mahé et al., 2004) graph kernels. Through reduction to a Sylvester equation we improve the time complexity of kernel computation between unlabeled graphs with n vertices from O(n6) to O(n3). We find a spectral decomposition approach even more efficient when computing entire kernel matrices. For labeled graphs we develop conjugate gradient and fixed-point methods that take O(dn3) time per iteration, where d is the size of the label set. By extending the necessary linear algebra to Reproducing Kernel Hilbert Spaces (RKHS) we obtain the same result for d-dimensional edge kernels, and O(n4) in the infinite-dimensional case; on sparse graphs these algorithms only take O(n2) time per iteration in all cases. Experiments on graphs from bioinformatics and other application domains show that these techniques can speed up computation of the kernel by an order of magnitude or more. We also show that certain rational kernels (Cortes et al., 2002, 2003, 2004) when specialized to graphs reduce to our random walk graph kernel. Finally, we relate our framework to R-convolution kernels (Haussler, 1999) and provide a kernel that is close to the optimal assignment kernel of Fröhlich et al. (2006) yet provably positive semi-definite. Keywords: linear algebra in RKHS, Sylvester equations, spectral decomposition, bioinformatics, rational kernels, transducers, semirings, random walks
5 0.26044399 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
Author: Pannagadatta K. Shivaswamy, Tony Jebara
Abstract: Leading classification methods such as support vector machines (SVMs) and their counterparts achieve strong generalization performance by maximizing the margin of separation between data classes. While the maximum margin approach has achieved promising performance, this article identifies its sensitivity to affine transformations of the data and to directions with large data spread. Maximum margin solutions may be misled by the spread of data and preferentially separate classes along large spread directions. This article corrects these weaknesses by measuring margin not in the absolute sense but rather only relative to the spread of data in any projection direction. Maximum relative margin corresponds to a data-dependent regularization on the classification function while maximum absolute margin corresponds to an ℓ2 norm constraint on the classification function. Interestingly, the proposed improvements only require simple extensions to existing maximum margin formulations and preserve the computational efficiency of SVMs. Through the maximization of relative margin, surprising performance gains are achieved on real-world problems such as digit, text classification and on several other benchmark data sets. In addition, risk bounds are derived for the new formulation based on Rademacher averages. Keywords: support vector machines, kernel methods, large margin, Rademacher complexity
6 0.25557312 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking
7 0.23139273 80 jmlr-2010-On-Line Sequential Bin Packing
8 0.21719711 34 jmlr-2010-Erratum: SGDQN is Less Careful than Expected
9 0.21361879 23 jmlr-2010-Classification with Incomplete Data Using Dirichlet Process Priors
10 0.20685029 65 jmlr-2010-Learning Translation Invariant Kernels for Classification
11 0.17776394 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs
12 0.17193837 22 jmlr-2010-Classification Using Geometric Level Sets
13 0.16781306 24 jmlr-2010-Collective Inference for Extraction MRFs Coupled with Symmetric Clique Potentials
14 0.16754791 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding
15 0.1623791 32 jmlr-2010-Efficient Algorithms for Conditional Independence Inference
16 0.15678065 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization
17 0.15353444 115 jmlr-2010-Using Contextual Representations to Efficiently Learn Context-Free Languages
18 0.14761482 82 jmlr-2010-On Learning with Integral Operators
19 0.14502966 98 jmlr-2010-Regularized Discriminant Analysis, Ridge Regression and Beyond
20 0.14144461 54 jmlr-2010-Information Retrieval Perspective to Nonlinear Dimensionality Reduction for Data Visualization
topicId topicWeight
[(1, 0.01), (3, 0.025), (4, 0.014), (8, 0.016), (15, 0.023), (21, 0.019), (24, 0.017), (32, 0.046), (36, 0.041), (37, 0.061), (51, 0.019), (75, 0.157), (81, 0.02), (85, 0.081), (89, 0.331), (96, 0.011), (97, 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.70278138 66 jmlr-2010-Linear Algorithms for Online Multitask Classification
Author: Giovanni Cavallanti, Nicolò Cesa-Bianchi, Claudio Gentile
Abstract: We introduce new Perceptron-based algorithms for the online multitask binary classification problem. Under suitable regularity conditions, our algorithms are shown to improve on their baselines by a factor proportional to the number of tasks. We achieve these improvements using various types of regularization that bias our algorithms towards specific notions of task relatedness. More specifically, similarity among tasks is either measured in terms of the geometric closeness of the task reference vectors or as a function of the dimension of their spanned subspace. In addition to adapting to the online setting a mix of known techniques, such as the multitask kernels of Evgeniou et al., our analysis also introduces a matrix-based multitask extension of the p-norm Perceptron, which is used to implement spectral co-regularization. Experiments on real-world data sets complement and support our theoretical findings. Keywords: mistake bounds, perceptron algorithm, multitask learning, spectral regularization
2 0.50719696 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
Author: Shiliang Sun, John Shawe-Taylor
Abstract: In this paper, we propose a general framework for sparse semi-supervised learning, which concerns using a small portion of unlabeled data and a few labeled data to represent target functions and thus has the merit of accelerating function evaluations when predicting the output of a new example. This framework makes use of Fenchel-Legendre conjugates to rewrite a convex insensitive loss involving a regularization with unlabeled data, and is applicable to a family of semi-supervised learning methods such as multi-view co-regularized least squares and single-view Laplacian support vector machines (SVMs). As an instantiation of this framework, we propose sparse multi-view SVMs which use a squared ε-insensitive loss. The resultant optimization is an inf-sup problem and the optimal solutions have arguably saddle-point properties. We present a globally optimal iterative algorithm to optimize the problem. We give the margin bound on the generalization error of the sparse multi-view SVMs, and derive the empirical Rademacher complexity for the induced function class. Experiments on artificial and real-world data show their effectiveness. We further give a sequential training approach to show their possibility and potential for uses in large-scale problems and provide encouraging experimental results indicating the efficacy of the margin bound and empirical Rademacher complexity on characterizing the roles of unlabeled data for semi-supervised learning. Keywords: semi-supervised learning, Fenchel-Legendre conjugate, representer theorem, multiview regularization, support vector machine, statistical learning theory
3 0.5050087 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
Author: Ming Yuan
Abstract: This paper considers the problem of estimating a high dimensional inverse covariance matrix that can be well approximated by “sparse” matrices. Taking advantage of the connection between multivariate linear regression and entries of the inverse covariance matrix, we propose an estimating procedure that can effectively exploit such “sparsity”. The proposed method can be computed using linear programming and therefore has the potential to be used in very high dimensional problems. Oracle inequalities are established for the estimation error in terms of several operator norms, showing that the method is adaptive to different types of sparsity of the problem. Keywords: covariance selection, Dantzig selector, Gaussian graphical model, inverse covariance matrix, Lasso, linear programming, oracle inequality, sparsity
4 0.50387764 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
Author: Pannagadatta K. Shivaswamy, Tony Jebara
Abstract: Leading classification methods such as support vector machines (SVMs) and their counterparts achieve strong generalization performance by maximizing the margin of separation between data classes. While the maximum margin approach has achieved promising performance, this article identifies its sensitivity to affine transformations of the data and to directions with large data spread. Maximum margin solutions may be misled by the spread of data and preferentially separate classes along large spread directions. This article corrects these weaknesses by measuring margin not in the absolute sense but rather only relative to the spread of data in any projection direction. Maximum relative margin corresponds to a data-dependent regularization on the classification function while maximum absolute margin corresponds to an ℓ2 norm constraint on the classification function. Interestingly, the proposed improvements only require simple extensions to existing maximum margin formulations and preserve the computational efficiency of SVMs. Through the maximization of relative margin, surprising performance gains are achieved on real-world problems such as digit, text classification and on several other benchmark data sets. In addition, risk bounds are derived for the new formulation based on Rademacher averages. Keywords: support vector machines, kernel methods, large margin, Rademacher complexity
5 0.50000423 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
Author: Pinar Donmez, Guy Lebanon, Krishnakumar Balasubramanian
Abstract: Estimating the error rates of classifiers or regression models is a fundamental task in machine learning which has thus far been studied exclusively using supervised learning techniques. We propose a novel unsupervised framework for estimating these error rates using only unlabeled data and mild assumptions. We prove consistency results for the framework and demonstrate its practical applicability on both synthetic and real world data. Keywords: classification and regression, maximum likelihood, latent variable models
6 0.49984151 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
7 0.49832651 63 jmlr-2010-Learning Instance-Specific Predictive Models
8 0.49642342 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking
9 0.49424064 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
10 0.49363658 9 jmlr-2010-An Efficient Explanation of Individual Classifications using Game Theory
11 0.49261379 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values
12 0.49184078 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization
13 0.49129957 102 jmlr-2010-Semi-Supervised Novelty Detection
14 0.49085921 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices
15 0.49066171 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning
16 0.49042511 17 jmlr-2010-Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing
17 0.49020559 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding
18 0.49003044 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
19 0.48964259 109 jmlr-2010-Stochastic Composite Likelihood
20 0.48860717 84 jmlr-2010-On Spectral Learning