jmlr jmlr2009 jmlr2009-51 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Brian Kulis, Mátyás A. Sustik, Inderjit S. Dhillon
Abstract: In this paper, we study low-rank matrix nearness problems, with a focus on learning low-rank positive semidefinite (kernel) matrices for machine learning applications. We propose efficient algorithms that scale linearly in the number of data points and quadratically in the rank of the input matrix. Existing algorithms for learning kernel matrices often scale poorly, with running times that are cubic in the number of data points. We employ Bregman matrix divergences as the measures of nearness—these divergences are natural for learning low-rank kernels since they preserve rank as well as positive semidefiniteness. Special cases of our framework yield faster algorithms for various existing learning problems, and experimental results demonstrate that our algorithms can effectively learn both low-rank and full-rank kernel matrices. Keywords: kernel methods, Bregman divergences, convex optimization, kernel learning, matrix nearness
Reference: text
sentIndex sentText sentNum sentScore
1 We employ Bregman matrix divergences as the measures of nearness—these divergences are natural for learning low-rank kernels since they preserve rank as well as positive semidefiniteness. [sent-13, score-0.447]
2 Keywords: kernel methods, Bregman divergences, convex optimization, kernel learning, matrix nearness 1. [sent-15, score-0.323]
3 A number of factors affect the choice of the particular divergence measure used: the data may be intrinsically suited for a specific divergence measure, an algorithm may be faster or easier to implement for some measure, or analysis may be simpler for a particular measure. [sent-17, score-0.586]
4 When measuring the divergence between matrices, similar issues must be considered. [sent-21, score-0.293]
5 If the matrices are positive semidefinite (i.e., they have non-negative eigenvalues), then one may want to choose a divergence measure that is well-suited to such matrices; positive semidefinite (PSD) matrices arise frequently in machine learning, in particular with kernel methods. [sent-25, score-0.443]
6 This paper focuses on kernel learning using two divergence measures between PSD matrices— the LogDet divergence and the von Neumann divergence. [sent-33, score-0.894]
7 Our kernel learning goal is to find a PSD matrix which is as close as possible (under the LogDet or von Neumann divergence) to some input PSD matrix, but which additionally satisfies linear equality or inequality constraints. [sent-34, score-0.393]
8 For example, the LogDet divergence has a scale-invariance property which makes it particularly well-suited to machine learning applications. [sent-37, score-0.293]
9 Second, these divergences arise naturally in several problems; for example, the LogDet divergence arises in problems involving the differential relative entropy between Gaussian distributions while the von Neumann divergence arises in quantum computing and problems such as online PCA. [sent-38, score-1.008]
10 That is, the LogDet divergence between two matrices is finite if and only if their range spaces are identical, and a similar property holds for the von Neumann divergence. [sent-40, score-0.6]
11 The efficiency of the algorithms arises from new results developed in this paper which demonstrate that Bregman projections for these matrix divergences can be computed in time that scales quadratically with the rank of the input matrix. [sent-43, score-0.336]
12 Given an n × n kernel matrix K, if the matrix is of low rank, say r < n, we can represent the kernel matrix in terms of a factorization K = GGT , with G an n × r matrix. [sent-68, score-0.449]
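As a minimal numerical sketch of this factored representation (the sizes and variable names below are illustrative, not taken from the paper's experiments), storing the n × r factor G in place of the dense kernel reduces memory from n² to nr entries:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 200, 10

# A rank-r kernel stored through its factor: K = G G^T with G of size n x r.
G = rng.standard_normal((n, r))
K = G @ G.T

assert K.shape == (n, n)
assert np.linalg.matrix_rank(K) <= r
# G needs n*r = 2000 numbers; the dense kernel needs n^2 = 40000.
assert G.size == n * r and K.size == n * n
```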
13 (2004) have studied transductive learning of the kernel matrix and multiple kernel learning using semidefinite programming. [sent-84, score-0.317]
14 (2005), who learn a (full-rank) kernel matrix using von Neumann divergence under linear constraints. [sent-89, score-0.686]
15 ; we use exact instead of approximate projections to speed up convergence, and we consider algorithms for the LogDet divergence in addition to the von Neumann divergence. [sent-92, score-0.547]
16 In this paper, we substantially expand on the analysis of Bregman matrix divergences, giving a formal treatment of low-rank Bregman matrix divergences and proving several new properties. [sent-97, score-0.324]
17 We introduce Bregman matrix divergences—the matrix divergence measures considered in this paper—and discuss their properties. [sent-101, score-0.463]
18 The Bregman vector divergence (Bregman, 1967) with respect to ϕ is defined as Dϕ (x, y) = ϕ(x) − ϕ(y) − (x − y)T ∇ϕ(y). [sent-106, score-0.293]
19 For example, if ϕ(x) = xT x, then the resulting Bregman divergence is Dϕ (x, y) = ‖x − y‖². [sent-107, score-0.293]
20 Another example is ϕ(x) = ∑i (xi log xi − xi ), for which the resulting Bregman divergence is the (unnormalized) relative entropy Dϕ (x, y) = KL(x, y) = ∑i (xi log(xi /yi ) − xi + yi ). [sent-108, score-0.414]
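The two examples above can be checked directly from the definition; the sketch below assumes nothing beyond the formulas just given (the helper name bregman is ours):

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """Generic Bregman divergence D_phi(x, y) = phi(x) - phi(y) - <x - y, grad phi(y)>."""
    return phi(x) - phi(y) - (x - y) @ grad_phi(y)

x = np.array([0.5, 1.0, 2.0])
y = np.array([1.0, 1.0, 1.0])

# phi(x) = x^T x  ->  squared Euclidean distance
sq = bregman(lambda v: v @ v, lambda v: 2 * v, x, y)
assert np.isclose(sq, np.sum((x - y) ** 2))

# phi(x) = sum_i (x_i log x_i - x_i)  ->  unnormalized relative entropy (KL)
kl = bregman(lambda v: np.sum(v * np.log(v) - v),
             lambda v: np.log(v), x, y)
assert np.isclose(kl, np.sum(x * np.log(x / y) - x + y))
```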
21 Given a strictly convex, differentiable function φ : S n → R, the Bregman matrix divergence is defined to be Dφ (X,Y ) = φ(X) − φ(Y ) − tr((∇φ(Y ))T (X −Y )), where tr(A) denotes the trace of matrix A. [sent-112, score-0.463]
22 The resulting Bregman divergence is DvN (X,Y ) = tr(X log X − X logY − X +Y ). (1) [sent-120, score-0.341]
23 If X = V ΛV T is the eigendecomposition of the positive definite matrix X, the matrix logarithm can be written as V log ΛV T , where log Λ is the diagonal matrix whose entries contain the logarithm of the eigenvalues. [sent-121, score-0.404]
24 This divergence is also called quantum relative entropy, and is used in quantum information theory (Nielsen and Chuang, 2000). [sent-124, score-0.357]
25 Another important matrix divergence arises by taking the Burg entropy of the eigenvalues, that is, φ(X) = − ∑i log λi , or equivalently as φ(X) = − log det X. [sent-125, score-0.551]
26 The resulting Bregman divergence over positive definite matrices is Dℓd (X,Y ) = tr(XY −1 ) − log det(XY −1 ) − n, (2) and is commonly called the LogDet divergence (we called it the Burg matrix divergence in Kulis et al.). [sent-126, score-1.065]
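A straightforward (unoptimized) way to evaluate both matrix divergences for full-rank positive definite inputs is via eigendecompositions; the function names below are ours, and the last line checks the scale-invariance property of the LogDet divergence:

```python
import numpy as np

def von_neumann_div(X, Y):
    """D_vN(X, Y) = tr(X log X - X log Y - X + Y), logs via eigendecomposition."""
    lx, Vx = np.linalg.eigh(X)
    ly, Vy = np.linalg.eigh(Y)
    logX = Vx @ np.diag(np.log(lx)) @ Vx.T
    logY = Vy @ np.diag(np.log(ly)) @ Vy.T
    return np.trace(X @ logX - X @ logY - X + Y)

def logdet_div(X, Y):
    """D_ld(X, Y) = tr(X Y^{-1}) - log det(X Y^{-1}) - n."""
    n = X.shape[0]
    XYinv = X @ np.linalg.inv(Y)
    return np.trace(XYinv) - np.linalg.slogdet(XYinv)[1] - n

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)); X = A @ A.T + np.eye(4)
B = rng.standard_normal((4, 4)); Y = B @ B.T + np.eye(4)

# Both divergences are nonnegative and vanish when X == Y.
assert von_neumann_div(X, Y) > 0 and np.isclose(von_neumann_div(X, X), 0)
assert logdet_div(X, Y) > 0 and np.isclose(logdet_div(X, X), 0, atol=1e-8)

# Scale-invariance holds for the LogDet divergence.
assert np.isclose(logdet_div(3 * X, 3 * Y), logdet_div(X, Y))
```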
27 Examples include the scale-invariance of the LogDet divergence (Dℓd (X,Y ) = Dℓd (αX, αY )) and, more generally, transformation-invariance (Dℓd (X,Y ) = Dℓd (M T XM, M T Y M) for any square, non-singular M), which are useful properties for learning algorithms. [sent-135, score-0.293]
28 Typically kernel learning algorithms work in the transductive setting, meaning that all of the data is given up-front, with some of it labeled and some unlabeled, and one learns a kernel matrix over all the data points. [sent-140, score-0.317]
29 Furthermore, both the LogDet divergence and the von Neumann divergence have precedence in a number of areas. [sent-147, score-0.797]
30 The von Neumann divergence is used in quantum information theory (Nielsen and Chuang, 2000), and has been employed in machine learning for online principal component analysis (Warmuth and Kuzmin, 2006). [sent-148, score-0.536]
31 The LogDet divergence is called Stein’s loss in the statistics literature, where it has been used as a measure of distance between covariance matrices (James and Stein, 1961). [sent-149, score-0.375]
32 In general, every such ϕ defines a Bregman matrix divergence over real, symmetric matrices via the eigenvalue mapping. [sent-153, score-0.431]
33 Consider a spectral Bregman matrix divergence Dφ (X,Y ), where φ = ϕ ◦ λ. [sent-156, score-0.378]
34 i, j Note that each of the divergences discussed earlier—the squared Frobenius divergence, the LogDet divergence, and the von Neumann divergence—arise from separable convex functions. [sent-180, score-0.365]
35 = ϕn ) and the corollary below follows: Corollary 2 Given X = V ΛV T and Y = UΘU T , the squared Frobenius, von Neumann and LogDet divergences satisfy: Dφ (X,Y ) = ∑i, j (viT u j )2 Dϕ (λi , θ j ). (3) [sent-184, score-0.365]
36 This formula highlights the connection between a separable spectral matrix divergence and the corresponding vector divergence. [sent-185, score-0.378]
37 It also illustrates how the divergence relates to the geometrical properties of the argument matrices as represented by the eigenvectors. [sent-186, score-0.346]
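Formula (3) can be verified numerically for the LogDet divergence, whose scalar counterpart is the Burg divergence Dϕ (λ, θ) = λ/θ − log(λ/θ) − 1; this small check is ours, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5)); X = A @ A.T + np.eye(5)
B = rng.standard_normal((5, 5)); Y = B @ B.T + np.eye(5)

lam, V = np.linalg.eigh(X)      # X = V diag(lam) V^T
theta, U = np.linalg.eigh(Y)    # Y = U diag(theta) U^T

# Spectral form of (3): sum_{i,j} (v_i^T u_j)^2 * D(lam_i, theta_j)
# with the scalar Burg divergence D(l, t) = l/t - log(l/t) - 1.
d = lam[:, None] / theta[None, :]
spectral = np.sum((V.T @ U) ** 2 * (d - np.log(d) - 1))

# Direct form of (2): tr(X Y^{-1}) - log det(X Y^{-1}) - n.
XYinv = X @ np.linalg.inv(Y)
direct = np.trace(XYinv) - np.linalg.slogdet(XYinv)[1] - 5

assert np.isclose(spectral, direct)
```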
38 However, when the rank of X0 does not exceed r, this problem turns out to be convex when we use the von Neumann and LogDet matrix divergences. [sent-192, score-0.35]
39 Another advantage of using the von Neumann and LogDet divergences is that the algorithms used to solve the minimization problem implicitly maintain the positive semidefiniteness constraint, so no extra work needs to be done to enforce positive semidefiniteness. [sent-195, score-0.365]
40 In the case where Ai = zi ziT , by calculating the gradient for the LogDet and von Neumann matrix divergences, respectively, (7) simplifies to: Xt+1 = (Xt−1 − αi zi ziT )−1 and Xt+1 = exp(log(Xt ) + αi zi ziT ), (10) subject to ziT Xt+1 zi = bi . [sent-236, score-0.449]
41 As we will see, for the von Neumann and LogDet divergences, these projections can be computed very efficiently (and are thus more desirable than other methods that involve multiple constraints at a time). [sent-238, score-0.335]
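For a single rank-one equality constraint z T Xz = b under the LogDet divergence, the projection even admits a closed form: combining the update in (10) with the Sherman–Morrison identity and solving the constraint for the step size gives X′ = X + β Xzz T X with β = (b − p)/p², where p = z T Xz. This derivation of β is ours (a sketch consistent with the paper's update family, written densely for clarity; the paper's algorithms instead work on factors):

```python
import numpy as np

def logdet_projection(X, z, b):
    """Exact Bregman projection of X onto {X' : z^T X' z = b} under the
    LogDet divergence.  Via Sherman-Morrison, X' = X + beta * X z z^T X
    with beta = (b - p)/p^2 and p = z^T X z, so the constraint holds exactly."""
    Xz = X @ z
    p = z @ Xz
    beta = (b - p) / p**2
    return X + beta * np.outer(Xz, Xz)

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4)); X = A @ A.T + np.eye(4)
z = rng.standard_normal(4)
b = 2.0

Xp = logdet_projection(X, z, b)
assert np.isclose(z @ Xp @ z, b)            # constraint satisfied exactly
assert np.all(np.linalg.eigvalsh(Xp) > 0)   # positive definiteness preserved (b > 0)
```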
42 Bregman Divergences for Rank-Deficient Matrices As given in (1) and (2), the von Neumann and LogDet divergences are undefined for low-rank matrices. [sent-240, score-0.365]
43 For the von Neumann divergence the situation is somewhat better, since one can define φ(X) = tr(X log X − X) via continuity for rank-deficient matrices. [sent-242, score-0.552]
44 More specifically, for the von Neumann divergence we have: Dϕ (λi , θ j ) = λi log λi − λi log θ j − λi + θ j . [sent-251, score-0.6]
45 For finiteness of the corresponding matrix divergence we require that viT u j = 0 whenever Dϕ (λi , θ j ) is infinite, so that a cancellation occurs (via appropriate continuity arguments) and the divergence remains finite. [sent-255, score-0.671]
46 This leads to properties of rank-deficient X and Y that are required for the matrix divergence to be finite. [sent-256, score-0.378]
47 Observation 1 The von Neumann divergence DvN (X,Y ) is finite iff range(X) ⊆ range(Y ). [sent-259, score-0.504]
48 Thus, if viT u j = 0 when θ j = 0, then λi (viT u j )2 log θ j = 0 (using 0 log 0 = 0), and the divergence is finite. [sent-262, score-0.389]
49 Observation 2 The LogDet divergence D d (X,Y ) is finite iff range(X) = range(Y ). [sent-265, score-0.293]
50 If we now revisit our optimization problem formulated in (4), where we aim to minimize Dφ (X, X0 ), we see that the LogDet and von Neumann divergences naturally maintain rank constraints if the problem is feasible. [sent-271, score-0.5]
51 2 Formalization Via Restrictions on the Range Space The above section demonstrated informally that for Dφ (X,Y ) to be finite the range spaces of X and Y must be equal for the LogDet divergence, and the range space of X must be contained in the range space of Y for the von Neumann divergence. [sent-275, score-0.34]
52 We are now ready to extend the domain of the von Neumann and LogDet divergences to low-rank matrices. [sent-284, score-0.365]
53 Let W be an n × r column orthogonal matrix such that range(Y ) ⊆ range(W ) and define: Dφ (X,Y ) = Dφ,W (X,Y ) = Dφ (W T XW,W T YW ), (13) where Dφ is either the von Neumann or the LogDet divergence. [sent-286, score-0.336]
54 The first equality is by Definition 4, and the third follows from the fact that the von Neumann and the LogDet divergences are invariant under any orthogonal similarity transformation (see Proposition 11 in Appendix A). [sent-291, score-0.405]
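Definition (13) is easy to implement for the LogDet case: compress both matrices onto an orthonormal basis W of the shared range space and apply the full-rank formula, and the invariance to the choice of basis can then be checked numerically. The code below is our sketch:

```python
import numpy as np

def restricted_logdet(X, Y, W):
    """LogDet divergence for rank-deficient X, Y, computed after compressing
    onto a column-orthogonal basis W with range(X), range(Y) contained in range(W)."""
    Xw, Yw = W.T @ X @ W, W.T @ Y @ W
    M = Xw @ np.linalg.inv(Yw)
    return np.trace(M) - np.linalg.slogdet(M)[1] - W.shape[1]

rng = np.random.default_rng(3)
n, r = 6, 2
G = rng.standard_normal((n, r))
H = G @ rng.standard_normal((r, r))      # generically the same range space as G
X, Y = G @ G.T, H @ H.T                  # rank-r PSD matrices

W, _ = np.linalg.qr(G)                   # orthonormal basis for range(X) = range(Y)
d1 = restricted_logdet(X, Y, W)

# Invariance under the choice of basis: rotating W gives the same value.
Q, _ = np.linalg.qr(rng.standard_normal((r, r)))
d2 = restricted_logdet(X, Y, W @ Q)
assert np.isclose(d1, d2)
assert d1 > 0
```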
55 In fact, in the case of the LogDet divergence we have invariance under any invertible congruence transformation, see Proposition 12 in Appendix A. [sent-307, score-0.293]
56 If we assume that the optimization problem (4) has a (rank-deficient) solution with finite divergence measure, then the corresponding full-rank optimization problem (14) also has a solution. [sent-327, score-0.293]
57 In the case of the von Neumann divergence this leads to the update X̂t+1 = exp(log(X̂t ) + αÂ). [sent-333, score-0.547]
58 If we choose W = Vt from the reduced eigendecomposition Xt = Vt Λt VtT , then the update is written as: Xt+1 = Vt exp(log(Λt ) + αVtT AVt )VtT , (15) which we will use in the von Neumann kernel learning algorithm. [sent-336, score-0.404]
59 1 LogDet Divergence We first develop a cyclic projection algorithm to solve (5) when the matrix divergence is the LogDet divergence. [sent-346, score-0.44]
60 1 M ATRIX U PDATES Consider minimizing D d (X, X0 ), the LogDet divergence between X and X0 . [sent-349, score-0.293]
61 As given in (16), the projection update rule for low-rank X0 and a rank-one constraint matrix A = zz T is: Xt+1 = Vt ((VtT Xt Vt )−1 − α(VtT zz T Vt ))−1 VtT , where the eigendecomposition of Xt is Vt Λt VtT . [sent-350, score-0.576]
62 The LogDet update produces: Xt+1 = GGT + βGGT zz T GGT = G(I + βGT zz T G)GT . [sent-369, score-0.333]
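The identity Xt+1 = G(I + βGT zz T G)GT suggests updating only the factor: multiply G on the right by a Cholesky factor of the small r × r matrix I + βvv T with v = GT z. This is a simplified sketch of the idea (the paper instead carries the update on the r × r matrix B, and a dedicated rank-one Cholesky update would lower the cost further):

```python
import numpy as np

def factored_logdet_update(G, z, beta):
    """One rank-preserving LogDet update on the factor G, where X = G G^T.
    Since X' = G (I + beta v v^T) G^T with v = G^T z, it suffices to multiply
    G by a Cholesky factor of the small r x r matrix I + beta v v^T --
    O(n r^2) work instead of O(n^2) on the full matrix."""
    v = G.T @ z                                        # r-vector
    L = np.linalg.cholesky(np.eye(len(v)) + beta * np.outer(v, v))
    return G @ L

rng = np.random.default_rng(4)
n, r = 50, 3
G = rng.standard_normal((n, r))
z = rng.standard_normal(n)
beta = 0.1

Gp = factored_logdet_update(G, z, beta)
X = G @ G.T
Xp_direct = X + beta * (X @ np.outer(z, z) @ X)        # dense update from the text
assert np.allclose(Gp @ Gp.T, Xp_direct)
assert np.linalg.matrix_rank(Gp @ Gp.T) == r           # rank preserved
```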
63 The matrix I + βGT zz T G is then I + βBT GT zz T G0 B. [sent-395, score-0.375]
64 This strategy leads to the following algorithm: Algorithm 2 Learning a low-rank kernel matrix in LogDet divergence under distance constraints. [sent-427, score-0.504]
65 Input: G0 : input kernel factor matrix, that is, X0 = G0 G0T ; r: desired rank; {Ai } (i = 1, . . . , c): distance constraints, where Ai = (ei1 − ei2 )(ei1 − ei2 )T . Output: G: output low-rank kernel factor matrix, that is, X = GGT . 1: Set B = Ir , i = 1, and ν j = 0 for all constraints j. [sent-428, score-0.304]
66 2 Von Neumann Divergence In this section we develop a cyclic projection algorithm to solve (5) when the matrix divergence is the von Neumann divergence. [sent-447, score-0.651]
67 1 M ATRIX U PDATES Consider minimizing DvN (X, X0 ), the von Neumann divergence between X and X0 . [sent-450, score-0.504]
68 Recall the projection update (15) for constraint i: Xt+1 = Vt exp(log(Λt ) + αVtT zz T Vt )VtT . [sent-451, score-0.293]
69 The matrix update becomes Xt+1 = Vt Wt exp(log Λt + αWtT VtT zz T Vt Wt )WtT VtT , yielding the following formulae: Vt+1 = Vt , Wt+1 = Wt Ut , Λt+1 = exp(Θt+1 ), where log Λt + αWtT VtT zz T Vt Wt = Ut Θt+1UtT . [sent-457, score-0.466]
70 In the von Neumann case and in the presence of distance constraints, the problem amounts to finding the unique root of the function f (α) = z T Vt Wt exp(Θt + αWtT VtT zz T Vt Wt )WtT VtT z − b. [sent-467, score-0.385]
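In the full-rank case the same root-finding problem reads f (α) = z T exp(log Xt + αzz T )z − b, with f increasing in α. The sketch below replaces Brent's method with plain bisection to stay dependency-free; everything else is our illustrative scaffolding:

```python
import numpy as np

def expm_sym(S):
    """Matrix exponential of a symmetric matrix via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * np.exp(w)) @ V.T

def vn_projection(X, z, b, tol=1e-10):
    """Von Neumann Bregman projection of full-rank X onto {X' : z^T X' z = b}:
    X' = exp(log X + alpha z z^T), alpha the unique root of the increasing
    function f(alpha) = z^T X'(alpha) z - b (bisection instead of Brent)."""
    w, V = np.linalg.eigh(X)
    logX = (V * np.log(w)) @ V.T
    f = lambda a: z @ expm_sym(logX + a * np.outer(z, z)) @ z - b
    lo, hi = -1.0, 1.0
    while f(lo) > 0: lo *= 2           # grow the bracket until it contains the root
    while f(hi) < 0: hi *= 2
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    return expm_sym(logX + 0.5 * (lo + hi) * np.outer(z, z))

rng = np.random.default_rng(5)
A = rng.standard_normal((3, 3))
X = A @ A.T + np.eye(3)
z = rng.standard_normal(3)
b = 1.5 * (z @ X @ z)                  # target 50% above the current value

Xp = vn_projection(X, z, b)
assert abs(z @ Xp @ z - b) < 1e-6           # constraint met
assert np.all(np.linalg.eigvalsh(Xp) > 0)   # positive definiteness is automatic
```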
71 One natural choice to find the root of f (α) is to apply Brent's general root finding method (Brent, 1973). Algorithm 3 Learning a low-rank kernel matrix in von Neumann matrix divergence under distance constraints. [sent-470, score-0.8]
72 3 K ERNEL L EARNING A LGORITHM The algorithm for distance inequality constraints using the von Neumann divergence is given as Algorithm 3. [sent-486, score-0.614]
73 Note that the asymptotic running time of this algorithm is the same as the LogDet divergence algorithm, although the root finding step makes Algorithm 3 slightly slower than Algorithm 2. [sent-488, score-0.293]
74 First, though we are able to learn low-rank kernel matrices in our framework, the initial kernel matrix must be lowrank. [sent-491, score-0.332]
75 When learning a kernel matrix minimizing either the LogDet divergence or the von Neumann divergence, the range space restriction implies that the learned kernel K has the form K = G0 BBT G0T , where B is r × r and the input kernel is K0 = G0 G0T . [sent-502, score-0.923]
76 1 S LACK VARIABLES In many cases, especially if the number of constraints is large, no feasible solution will exist to the Bregman divergence optimization problem given in (4). [sent-515, score-0.374]
77 We add a new vector variable b with coordinates representing perturbed right hand sides of the linear constraints, and use the corresponding vector divergence to measure the deviation from the original constraints described by the input vector b0 . [sent-518, score-0.374]
78 The γ > 0 parameter governs the tradeoff between satisfying the constraints and minimizing the divergence between X and X0 . [sent-521, score-0.374]
79 (23) In particular, for the LogDet divergence we arrive at the following updates: Xt+1 = (Xt−1 − αi Ai )−1 , eiT bt+1 = γ eiT bt / (γ + αi eiT bt ), with αi chosen so that tr(Xt+1 Ai ) − eiT bt+1 = 0. [sent-526, score-0.401]
80 In the case of the von Neumann divergence the update rules turn out to be: log Xt+1 = log Xt + αi Ai , eiT bt+1 = eiT bt e−αi /γ , with αi chosen so that tr(Xt+1 Ai ) − eiT bt+1 = 0. [sent-529, score-0.697]
81 The scale invariance of the LogDet divergence implies that Dℓd (b, b0 ) = Dℓd (b/b0 , 1), where the division is elementwise; therefore we implicitly measure “relative” error, that is, the magnitudes of the coordinates of b0 are naturally taken into account. [sent-531, score-0.293]
82 For the von Neumann divergence scale invariance does not hold, but one may alternatively use DvN (b/b0 , 1) as the penalty function. [sent-532, score-0.504]
83 As with the DefiniteBoost algorithm, the projection is not an exact Bregman projection; however, the projection can be computed in the same way as in our von Neumann kernel learning projection. [sent-557, score-0.432]
84 Our algorithms from this paper give new divergence measures and methods for finding low-rank correlation matrices. [sent-562, score-0.293]
85 4 Connections to Semidefinite Programming In this section, we present a connection between minimizing the LogDet divergence and the solution to a semidefinite programming problem (SDP). [sent-571, score-0.293]
86 Therefore L† e = 0, which further implies eT L† e = 0 ⇒ eT Xe = 0 by the fact that the LogDet divergence preserves the range space of L † . [sent-587, score-0.336]
87 Thus, the null space restriction in LogDet divergence naturally yields the constraint e T Xe = 0 and so the min balanced cut SDP problem on A is equivalent (for sufficiently small ε) to finding the nearest correlation matrix to εL† under the LogDet divergence. [sent-588, score-0.449]
88 5 Changing the Range Space The key property of the LogDet and von Neumann divergences which allow the algorithms to learn low-rank kernel matrices is their range space preservation property, discussed in Section 4. [sent-594, score-0.558]
89 (2007) discussed how one can generalize to new data points with the LogDet divergence even after applying such a kernel function, making this approach practical in many situations. [sent-602, score-0.39]
90 In this case, we generate constraints from some target kernel matrix, and judge performance on the learned kernel matrix as we provide an increasing number of constraints. [sent-637, score-0.36]
91 In this case, we took our initial low-rank kernel to be a linear kernel over the original data matrix and we added constraints of the form d(i, j) ≤ (1 − ε)b for same class pairs and d(i, j) ≥ (1 + ε)b for different class pairs (b is the original distance and ε = . [sent-643, score-0.389]
93 For example, starting with the scaled identity as the initial kernel matrix and 100 constraints, it took our von Neumann algorithm only 11 cycles to converge, whereas it took 3220 cycles for the DefiniteBoost algorithm to converge. [sent-702, score-0.509]
94 For the LogDet and the von Neumann exact projection algorithms, the number of cycles required for convergence never exceeded 600 on runs of up to 1000 constraints on GyrB. [sent-707, score-0.432]
95 We stress that other methods for learning kernel matrices, such as using the squared Frobenius norm in place of the LogDet or von Neumann divergences, would lead to learning full-rank matrices and would require significantly more memory and computational resources. [sent-727, score-0.361]
96 Conclusions In this paper, we have developed algorithms for using Bregman matrix divergences for low-rank matrix nearness problems. [sent-765, score-0.368]
97 In particular, we have developed a framework for using the LogDet divergence and the von Neumann divergence when the initial matrices are low-rank; this is achieved via a restriction of the range space of the matrices. [sent-766, score-0.893]
98 We discussed how the LogDet divergence exhibits connections to Mahalanobis metric learning and semidefinite programming, and there is significant ongoing work in these areas. [sent-772, score-0.322]
99 Furthermore, this proposition may be extended to the case when X and Y are positive semidefinite, using the definition of the LogDet divergence for rankdeficient matrices. [sent-787, score-0.293]
100 The starting (identity) matrix satisfies the linear constraint, but a single projection step (to the corresponding equality constraint) without corrections will produce a suboptimal matrix, regardless of the divergence used. [sent-798, score-0.488]
wordName wordTfidf (topN-words)
[('logdet', 0.517), ('neumann', 0.336), ('divergence', 0.293), ('xt', 0.235), ('von', 0.211), ('vtt', 0.195), ('bregman', 0.193), ('divergences', 0.154), ('zz', 0.145), ('tr', 0.121), ('hillon', 0.113), ('ulis', 0.113), ('ustik', 0.113), ('semide', 0.109), ('ivergences', 0.107), ('regman', 0.107), ('atrix', 0.107), ('vt', 0.098), ('kernel', 0.097), ('vit', 0.096), ('ank', 0.091), ('matrix', 0.085), ('kulis', 0.082), ('constraints', 0.081), ('niteboost', 0.069), ('ernel', 0.063), ('projection', 0.062), ('zit', 0.059), ('cycles', 0.058), ('dvn', 0.057), ('ai', 0.056), ('rank', 0.054), ('bt', 0.054), ('weinberger', 0.054), ('eigendecomposition', 0.053), ('matrices', 0.053), ('det', 0.052), ('ggt', 0.05), ('nmi', 0.05), ('corrections', 0.048), ('log', 0.048), ('xai', 0.048), ('tsuda', 0.047), ('gt', 0.047), ('xy', 0.046), ('bi', 0.045), ('nearness', 0.044), ('sustik', 0.044), ('wtt', 0.044), ('projections', 0.043), ('range', 0.043), ('dhillon', 0.043), ('constraint', 0.043), ('update', 0.043), ('orthogonal', 0.04), ('clustering', 0.04), ('transductive', 0.038), ('nursery', 0.038), ('psd', 0.037), ('xr', 0.035), ('eigenvalues', 0.035), ('ur', 0.034), ('lv', 0.034), ('ww', 0.034), ('yw', 0.032), ('vr', 0.032), ('quantum', 0.032), ('spambase', 0.032), ('digits', 0.031), ('aiw', 0.031), ('qqt', 0.031), ('lx', 0.031), ('frobenius', 0.031), ('earning', 0.03), ('ut', 0.03), ('qt', 0.029), ('metric', 0.029), ('ir', 0.029), ('distance', 0.029), ('null', 0.028), ('zi', 0.027), ('multiplication', 0.026), ('nite', 0.026), ('eigendecompositions', 0.025), ('gyrb', 0.025), ('xvv', 0.025), ('entropy', 0.025), ('decompositions', 0.025), ('cholesky', 0.024), ('mahalanobis', 0.024), ('davis', 0.024), ('wt', 0.022), ('pdate', 0.022), ('dual', 0.022), ('censor', 0.021), ('xw', 0.021), ('eigenvector', 0.021), ('slack', 0.021), ('convergence', 0.02), ('xv', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
Author: Brian Kulis, Mátyás A. Sustik, Inderjit S. Dhillon
Abstract: In this paper, we study low-rank matrix nearness problems, with a focus on learning low-rank positive semidefinite (kernel) matrices for machine learning applications. We propose efficient algorithms that scale linearly in the number of data points and quadratically in the rank of the input matrix. Existing algorithms for learning kernel matrices often scale poorly, with running times that are cubic in the number of data points. We employ Bregman matrix divergences as the measures of nearness—these divergences are natural for learning low-rank kernels since they preserve rank as well as positive semidefiniteness. Special cases of our framework yield faster algorithms for various existing learning problems, and experimental results demonstrate that our algorithms can effectively learn both low-rank and full-rank kernel matrices. Keywords: kernel methods, Bregman divergences, convex optimization, kernel learning, matrix nearness
2 0.14164476 9 jmlr-2009-Analysis of Perceptron-Based Active Learning
Author: Sanjoy Dasgupta, Adam Tauman Kalai, Claire Monteleoni
Abstract: We start by showing that in an active learning setting, the Perceptron algorithm needs Ω(1/ε2 ) labels to learn linear separators within generalization error ε. We then present a simple active learning algorithm for this problem, which combines a modification of the Perceptron update with an adaptive filtering rule for deciding which points to query. For data distributed uniformly over the unit sphere, we show that our algorithm reaches generalization error ε after asking for just Õ(d log(1/ε)) labels. This exponential improvement over the usual sample complexity of supervised learning had previously been demonstrated only for the computationally more complex query-by-committee algorithm. Keywords: active learning, perceptron, label complexity bounds, online learning
3 0.13950096 40 jmlr-2009-Identification of Recurrent Neural Networks by Bayesian Interrogation Techniques
Author: Barnabás Póczos, András Lőrincz
Abstract: We introduce novel online Bayesian methods for the identification of a family of noisy recurrent neural networks (RNNs). We present Bayesian active learning techniques for stimulus selection given past experiences. In particular, we consider the unknown parameters as stochastic variables and use A-optimality and D-optimality principles to choose optimal stimuli. We derive myopic cost functions in order to maximize the information gain concerning network parameters at each time step. We also derive the A-optimal and D-optimal estimations of the additive noise that perturbs the dynamical system of the RNN. Here we investigate myopic as well as non-myopic estimations, and study the problem of simultaneous estimation of both the system parameters and the noise. Employing conjugate priors our derivations remain approximation-free and give rise to simple update rules for the online learning of the parameters. The efficiency of our method is demonstrated for a number of selected cases, including the task of controlled independent component analysis. Keywords: active learning, system identification, online Bayesian learning, A-optimality, Doptimality, infomax control, optimal design
4 0.12318099 90 jmlr-2009-Structure Spaces
Author: Brijnesh J. Jain, Klaus Obermayer
Abstract: Finite structures such as point patterns, strings, trees, and graphs occur as ”natural” representations of structured data in different application areas of machine learning. We develop the theory of structure spaces and derive geometrical and analytical concepts such as the angle between structures and the derivative of functions on structures. In particular, we show that the gradient of a differentiable structural function is a well-defined structure pointing in the direction of steepest ascent. Exploiting the properties of structure spaces, it will turn out that a number of problems in structural pattern recognition such as central clustering or learning in structured output spaces can be formulated as optimization problems with cost functions that are locally Lipschitz. Hence, methods from nonsmooth analysis are applicable to optimize those cost functions. Keywords: graphs, graph matching, learning in structured domains, nonsmooth optimization
Author: Wenxin Jiang
Abstract: The statistical learning theory of risk minimization depends heavily on probability bounds for uniform deviations of the empirical risks. Classical probability bounds using Hoeffding’s inequality cannot accommodate more general situations with unbounded loss and dependent data. The current paper introduces an inequality that extends Hoeffding’s inequality to handle these more general situations. We will apply this inequality to provide probability bounds for uniform deviations in a very general framework, which can involve discrete decision rules, unbounded loss, and a dependence structure that can be more general than either martingale or strong mixing. We will consider two examples with high dimensional predictors: autoregression (AR) with ℓ1 -loss, and ARX model with variable selection for sign classification, which uses both lagged responses and exogenous predictors. Keywords: dependence, empirical risk, probability bound, unbounded loss, uniform deviation
6 0.091364376 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent
7 0.084763832 13 jmlr-2009-Bounded Kernel-Based Online Learning
8 0.082941718 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification
9 0.075805098 23 jmlr-2009-Discriminative Learning Under Covariate Shift
10 0.072973616 61 jmlr-2009-Nonextensive Information Theoretic Kernels on Measures
11 0.066404395 100 jmlr-2009-When Is There a Representer Theorem? Vector Versus Matrix Regularizers
12 0.06252607 49 jmlr-2009-Learning Permutations with Exponential Weights
13 0.062492728 29 jmlr-2009-Estimating Labels from Label Proportions
14 0.061210021 2 jmlr-2009-A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization
15 0.048997857 30 jmlr-2009-Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods
16 0.044722021 33 jmlr-2009-Exploring Strategies for Training Deep Neural Networks
17 0.044573151 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting
18 0.043754779 88 jmlr-2009-Stable and Efficient Gaussian Process Calculations
19 0.042292323 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms
20 0.039082408 71 jmlr-2009-Perturbation Corrections in Approximate Inference: Mixture Modelling Applications
topicId topicWeight
[(0, 0.266), (1, 0.145), (2, -0.034), (3, -0.018), (4, 0.015), (5, -0.096), (6, 0.017), (7, -0.096), (8, -0.239), (9, -0.196), (10, 0.145), (11, 0.156), (12, 0.022), (13, 0.012), (14, -0.009), (15, 0.07), (16, 0.097), (17, 0.015), (18, 0.068), (19, 0.064), (20, 0.056), (21, 0.053), (22, -0.143), (23, -0.048), (24, -0.079), (25, 0.005), (26, -0.022), (27, -0.018), (28, 0.101), (29, 0.17), (30, 0.079), (31, -0.014), (32, 0.031), (33, -0.013), (34, 0.15), (35, -0.13), (36, -0.011), (37, -0.052), (38, -0.075), (39, 0.012), (40, -0.025), (41, -0.015), (42, 0.126), (43, 0.004), (44, 0.001), (45, -0.064), (46, -0.006), (47, 0.098), (48, 0.035), (49, 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 0.94916654 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
Author: Brian Kulis, Mátyás A. Sustik, Inderjit S. Dhillon
Abstract: In this paper, we study low-rank matrix nearness problems, with a focus on learning low-rank positive semidefinite (kernel) matrices for machine learning applications. We propose efficient algorithms that scale linearly in the number of data points and quadratically in the rank of the input matrix. Existing algorithms for learning kernel matrices often scale poorly, with running times that are cubic in the number of data points. We employ Bregman matrix divergences as the measures of nearness—these divergences are natural for learning low-rank kernels since they preserve rank as well as positive semidefiniteness. Special cases of our framework yield faster algorithms for various existing learning problems, and experimental results demonstrate that our algorithms can effectively learn both low-rank and full-rank kernel matrices. Keywords: kernel methods, Bregman divergences, convex optimization, kernel learning, matrix nearness
2 0.59460294 40 jmlr-2009-Identification of Recurrent Neural Networks by Bayesian Interrogation Techniques
Author: Barnabás Póczos, András Lőrincz
Abstract: We introduce novel online Bayesian methods for the identification of a family of noisy recurrent neural networks (RNNs). We present Bayesian active learning techniques for stimulus selection given past experiences. In particular, we consider the unknown parameters as stochastic variables and use A-optimality and D-optimality principles to choose optimal stimuli. We derive myopic cost functions in order to maximize the information gain concerning network parameters at each time step. We also derive the A-optimal and D-optimal estimations of the additive noise that perturbs the dynamical system of the RNN. Here we investigate myopic as well as non-myopic estimations, and study the problem of simultaneous estimation of both the system parameters and the noise. Employing conjugate priors our derivations remain approximation-free and give rise to simple update rules for the online learning of the parameters. The efficiency of our method is demonstrated for a number of selected cases, including the task of controlled independent component analysis. Keywords: active learning, system identification, online Bayesian learning, A-optimality, Doptimality, infomax control, optimal design
3 0.50292087 90 jmlr-2009-Structure Spaces
Author: Brijnesh J. Jain, Klaus Obermayer
Abstract: Finite structures such as point patterns, strings, trees, and graphs occur as "natural" representations of structured data in different application areas of machine learning. We develop the theory of structure spaces and derive geometrical and analytical concepts such as the angle between structures and the derivative of functions on structures. In particular, we show that the gradient of a differentiable structural function is a well-defined structure pointing in the direction of steepest ascent. Exploiting the properties of structure spaces, it will turn out that a number of problems in structural pattern recognition such as central clustering or learning in structured output spaces can be formulated as optimization problems with cost functions that are locally Lipschitz. Hence, methods from nonsmooth analysis are applicable to optimize those cost functions. Keywords: graphs, graph matching, learning in structured domains, nonsmooth optimization
4 0.48418269 9 jmlr-2009-Analysis of Perceptron-Based Active Learning
Author: Sanjoy Dasgupta, Adam Tauman Kalai, Claire Monteleoni
Abstract: We start by showing that in an active learning setting, the Perceptron algorithm needs Ω(1/ε²) labels to learn linear separators within generalization error ε. We then present a simple active learning algorithm for this problem, which combines a modification of the Perceptron update with an adaptive filtering rule for deciding which points to query. For data distributed uniformly over the unit sphere, we show that our algorithm reaches generalization error ε after asking for just Õ(d log(1/ε)) labels. This exponential improvement over the usual sample complexity of supervised learning had previously been demonstrated only for the computationally more complex query-by-committee algorithm. Keywords: active learning, perceptron, label complexity bounds, online learning
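The modified Perceptron update this abstract alludes to replaces the additive mistake update with a reflection, w ← w − 2(w·x)x, which preserves the norm of w when the examples lie on the unit sphere. A minimal sketch of that single step (the filtering/query rule of the full algorithm is omitted here):

```python
import numpy as np

def modified_perceptron_step(w, x, y):
    """One step of a modified Perceptron update: on a mistake
    (y * <w, x> <= 0), reflect w across the hyperplane orthogonal to x,
    i.e. w <- w - 2 <w, x> x. For unit-norm x this preserves ||w||."""
    if y * np.dot(w, x) <= 0:
        w = w - 2.0 * np.dot(w, x) * x
    return w
```

Because the update is a reflection, a single step both corrects the mistake on (x, y) and keeps the hypothesis on the unit sphere, which is what makes the analysis over uniformly distributed data tractable.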
5 0.4669472 61 jmlr-2009-Nonextensive Information Theoretic Kernels on Measures
Author: André F. T. Martins, Noah A. Smith, Eric P. Xing, Pedro M. Q. Aguiar, Mário A. T. Figueiredo
Abstract: Positive definite kernels on probability measures have been recently applied to classification problems involving text, images, and other types of structured data. Some of these kernels are related to classic information theoretic quantities, such as (Shannon's) mutual information and the Jensen-Shannon (JS) divergence. Meanwhile, there have been recent advances in nonextensive generalizations of Shannon's information theory. This paper bridges these two trends by introducing nonextensive information theoretic kernels on probability measures, based on new JS-type divergences. These new divergences result from extending the two building blocks of the classical JS divergence: convexity and Shannon's entropy. The notion of convexity is extended to the wider concept of q-convexity, for which we prove a Jensen q-inequality. Based on this inequality, we introduce Jensen-Tsallis (JT) q-differences, a nonextensive generalization of the JS divergence, and define a k-th order JT q-difference between stochastic processes. We then define a new family of nonextensive mutual information kernels, which allow weights to be assigned to their arguments, and which includes the Boolean, JS, and linear kernels as particular cases. Nonextensive string kernels are also defined that generalize the p-spectrum kernel. We illustrate the performance of these kernels on text categorization tasks, in which documents are modeled both as bags of words and as sequences of characters. Keywords: positive definite kernels, nonextensive information theory, Tsallis entropy, Jensen-Shannon divergence, string kernels ∗. Earlier versions of this work appeared in Martins et al. (2008a) and Martins et al. (2008b). †. Also at Instituto de Telecomunicações, Instituto Superior Técnico, Lisboa, Portugal. © 2009 André F. T. Martins, Noah A. Smith, Eric P. Xing, Pedro M. Q. Aguiar and Mário A. T. Figueiredo.
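The two building blocks named in this abstract, the Jensen-Shannon divergence and the Tsallis q-entropy, are short to compute directly. A hedged NumPy sketch (function names are my own; the paper's JT q-differences combine these ingredients in a more general way):

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy in bits, with the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def js_divergence(p, q):
    """Jensen-Shannon divergence JS(p, q) = H((p+q)/2) - (H(p) + H(q))/2.
    Symmetric, nonnegative, and bounded by 1 bit."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return shannon_entropy(m) - 0.5 * (shannon_entropy(p) + shannon_entropy(q))

def tsallis_entropy(p, q_param):
    """Tsallis q-entropy S_q(p) = (1 - sum_i p_i^q) / (q - 1); recovers
    Shannon entropy (in nats) in the limit q -> 1."""
    p = np.asarray(p, dtype=float)
    return (1.0 - np.sum(p ** q_param)) / (q_param - 1.0)
```

Setting q near 1 in the Tsallis entropy reproduces the Shannon case, which is why the JT q-differences generalize (rather than replace) the JS divergence.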
6 0.35459623 49 jmlr-2009-Learning Permutations with Exponential Weights
7 0.35290384 88 jmlr-2009-Stable and Efficient Gaussian Process Calculations
8 0.34406319 65 jmlr-2009-On Uniform Deviations of General Empirical Risks with Unboundedness, Dependence, and High Dimensionality
9 0.32061613 13 jmlr-2009-Bounded Kernel-Based Online Learning
10 0.31815851 2 jmlr-2009-A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization
11 0.31724089 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification
12 0.31071061 29 jmlr-2009-Estimating Labels from Label Proportions
13 0.28805009 23 jmlr-2009-Discriminative Learning Under Covariate Shift
14 0.28715914 100 jmlr-2009-When Is There a Representer Theorem? Vector Versus Matrix Regularizers
15 0.28292099 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent
16 0.22743481 80 jmlr-2009-Reproducing Kernel Banach Spaces for Machine Learning
17 0.21121243 7 jmlr-2009-An Analysis of Convex Relaxations for MAP Estimation of Discrete MRFs
18 0.20245284 36 jmlr-2009-Fourier Theoretic Probabilistic Inference over Permutations
19 0.19810574 59 jmlr-2009-Nearest Neighbor Clustering: A Baseline Method for Consistent Clustering with Arbitrary Objective Functions
20 0.19768515 71 jmlr-2009-Perturbation Corrections in Approximate Inference: Mixture Modelling Applications
topicId topicWeight
[(8, 0.013), (11, 0.037), (19, 0.02), (21, 0.018), (26, 0.011), (35, 0.385), (38, 0.022), (47, 0.017), (52, 0.044), (55, 0.068), (58, 0.029), (66, 0.112), (68, 0.057), (90, 0.056), (96, 0.037)]
simIndex simValue paperId paperTitle
same-paper 1 0.70776904 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
2 0.68322611 79 jmlr-2009-Reinforcement Learning in Finite MDPs: PAC Analysis
Author: Alexander L. Strehl, Lihong Li, Michael L. Littman
Abstract: We study the problem of learning near-optimal behavior in finite Markov Decision Processes (MDPs) with a polynomial number of samples. These "PAC-MDP" algorithms include the well-known E3 and R-MAX algorithms as well as the more recent Delayed Q-learning algorithm. We summarize the current state-of-the-art by presenting bounds for the problem in a unified theoretical framework. A more refined analysis for upper and lower bounds is presented to yield insight into the differences between the model-free Delayed Q-learning and the model-based R-MAX. Keywords: reinforcement learning, Markov decision processes, PAC-MDP, exploration, sample complexity
3 0.46389255 75 jmlr-2009-Provably Efficient Learning with Typed Parametric Models
Author: Emma Brunskill, Bethany R. Leffler, Lihong Li, Michael L. Littman, Nicholas Roy
Abstract: To quickly achieve good performance, reinforcement-learning algorithms for acting in large continuous-valued domains must use a representation that is both sufficiently powerful to capture important domain characteristics, and yet simultaneously allows generalization, or sharing, among experiences. Our algorithm balances this tradeoff by using a stochastic, switching, parametric dynamics representation. We argue that this model characterizes a number of significant, real-world domains, such as robot navigation across varying terrain. We prove that this representational assumption allows our algorithm to be probably approximately correct with a sample complexity that scales polynomially with all problem-specific quantities including the state-space dimension. We also explicitly incorporate the error introduced by approximate planning in our sample complexity bounds, in contrast to prior Probably Approximately Correct (PAC) Markov Decision Processes (MDP) approaches, which typically assume the estimated MDP can be solved exactly. Our experimental results on constructing plans for driving to work using real car trajectory data, as well as a small robot experiment on navigating varying terrain, demonstrate that our dynamics representation enables us to capture real-world dynamics in a sufficient manner to produce good performance. Keywords: reinforcement learning, provably efficient learning
4 0.40185001 57 jmlr-2009-Multi-task Reinforcement Learning in Partially Observable Stochastic Environments
Author: Hui Li, Xuejun Liao, Lawrence Carin
Abstract: We consider the problem of multi-task reinforcement learning (MTRL) in multiple partially observable stochastic environments. We introduce the regionalized policy representation (RPR) to characterize the agent’s behavior in each environment. The RPR is a parametric model of the conditional distribution over current actions given the history of past actions and observations; the agent’s choice of actions is directly based on this conditional distribution, without an intervening model to characterize the environment itself. We propose off-policy batch algorithms to learn the parameters of the RPRs, using episodic data collected when following a behavior policy, and show their linkage to policy iteration. We employ the Dirichlet process as a nonparametric prior over the RPRs across multiple environments. The intrinsic clustering property of the Dirichlet process imposes sharing of episodes among similar environments, which effectively reduces the number of episodes required for learning a good policy in each environment, when data sharing is appropriate. The number of distinct RPRs and the associated clusters (the sharing patterns) are automatically discovered by exploiting the episodic data as well as the nonparametric nature of the Dirichlet process. We demonstrate the effectiveness of the proposed RPR as well as the RPR-based MTRL framework on various problems, including grid-world navigation and multi-aspect target classification. The experimental results show that the RPR is a competitive reinforcement learning algorithm in partially observable domains, and the MTRL consistently achieves better performance than single task reinforcement learning. Keywords: reinforcement learning, partially observable Markov decision processes, multi-task learning, Dirichlet processes, regionalized policy representation
5 0.39523745 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting
Author: John Duchi, Yoram Singer
Abstract: We describe, analyze, and experiment with a framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This view yields a simple yet effective algorithm that can be used for batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1. We derive concrete and very simple algorithms for minimization of loss functions with ℓ1, ℓ2, ℓ2² (squared ℓ2), and ℓ∞ regularization. We also show how to construct efficient algorithms for mixed-norm ℓ1/ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in a series of experiments with synthetic and natural data sets. Keywords: subgradient methods, group sparsity, online learning, convex optimization
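For the ℓ1 case, the two-phase scheme this abstract describes has a closed-form second phase: a gradient step followed by soft-thresholding. A minimal sketch of one such forward-backward iteration (the function name is my own; the paper covers several other regularizers and an online setting):

```python
import numpy as np

def fobos_l1_step(w, grad, eta, lam):
    """One forward-backward splitting step with l1 regularization:
    a gradient step, then the soft-thresholding proximal step that
    solves min_u 0.5 ||u - v||^2 + eta * lam * ||u||_1 in closed form."""
    v = w - eta * grad  # phase 1: unconstrained gradient step
    # phase 2: soft-thresholding, which zeroes small coordinates
    return np.sign(v) * np.maximum(np.abs(v) - eta * lam, 0.0)
```

The soft-thresholding step is what produces exact zeros in the iterate, which is how the framework yields sparse solutions under ℓ1 regularization.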
6 0.37921301 9 jmlr-2009-Analysis of Perceptron-Based Active Learning
7 0.37553778 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
8 0.37523785 82 jmlr-2009-Robustness and Regularization of Support Vector Machines
10 0.3674978 22 jmlr-2009-Deterministic Error Analysis of Support Vector Regression and Related Regularized Kernel Methods
11 0.36718965 13 jmlr-2009-Bounded Kernel-Based Online Learning
12 0.36701009 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression
13 0.36484137 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification
14 0.36325231 94 jmlr-2009-The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs
15 0.36164731 100 jmlr-2009-When Is There a Representer Theorem? Vector Versus Matrix Regularizers
16 0.35955206 71 jmlr-2009-Perturbation Corrections in Approximate Inference: Mixture Modelling Applications
17 0.3590934 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost
18 0.35822135 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models
19 0.35784146 18 jmlr-2009-Consistency and Localizability
20 0.35777137 38 jmlr-2009-Hash Kernels for Structured Data