jmlr jmlr2010 jmlr2010-87 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro
Abstract: Sparse coding—that is, modelling data vectors as sparse linear combinations of basis elements—is widely used in machine learning, neuroscience, signal processing, and statistics. This paper focuses on the large-scale matrix factorization problem that consists of learning the basis set in order to adapt it to specific data. Variations of this problem include dictionary learning in signal processing, non-negative matrix factorization and sparse principal component analysis. In this paper, we propose to address these tasks with a new online optimization algorithm, based on stochastic approximations, which scales up gracefully to large data sets with millions of training samples, and extends naturally to various matrix factorization formulations, making it suitable for a wide range of learning problems. A proof of convergence is presented, along with experiments with natural images and genomic data demonstrating that it leads to state-of-the-art performance in terms of speed and optimization for both small and large data sets. Keywords: basis pursuit, dictionary learning, matrix factorization, online learning, sparse coding, sparse principal component analysis, stochastic approximations, stochastic optimization, nonnegative matrix factorization
Reference: text
sentIndex sentText sentNum sentScore
1 This paper focuses on the large-scale matrix factorization problem that consists of learning the basis set in order to adapt it to specific data. [sent-9, score-0.198]
2 Variations of this problem include dictionary learning in signal processing, non-negative matrix factorization and sparse principal component analysis. [sent-10, score-0.93]
3 Keywords: basis pursuit, dictionary learning, matrix factorization, online learning, sparse coding, sparse principal component analysis, stochastic approximations, stochastic optimization, nonnegative matrix factorization 1. [sent-13, score-1.138]
4 In machine learning and statistics, slightly different matrix factorization problems are formulated in order to obtain a few interpretable basis elements from a set of data vectors. [sent-23, score-0.198]
5 This includes non-negative matrix factorization and its variants (Lee and Seung, 2001; Hoyer, 2002, 2004; Lin, 2007), and sparse principal component analysis (Zou et al. [sent-24, score-0.348]
6 As shown in this paper, these problems have strong similarities; even though we first focus on the problem of dictionary learning, the algorithm we propose is able to address all of them. [sent-28, score-0.527]
7 Addressing this challenge and designing a generic algorithm which is capable of efficiently handling various matrix factorization problems is the topic of this paper. [sent-30, score-0.198]
8 We say that it admits a sparse approximation over a dictionary D in Rm×k , with k columns referred to as atoms, when one can find a linear combination of a “few” atoms from D that is “close” to the signal x. [sent-32, score-0.721]
9 Experiments have shown that modelling a signal with such a sparse decomposition (sparse coding) is very effective in many signal processing applications (Chen et al. [sent-33, score-0.216]
10 However, learning the dictionary instead of using off-the-shelf bases has been shown to dramatically improve signal reconstruction (Elad and Aharon, 2006). [sent-36, score-0.582]
11 Although some of the learned dictionary elements may sometimes “look like” wavelets (or Gabor filters), they are tuned to the input images or signals, leading to much better results in practice. [sent-37, score-0.59]
12 Most recent algorithms for dictionary learning (Olshausen and Field, 1997; Engan et al. [sent-38, score-0.527]
13 For example, first-order stochastic gradient descent with projections on the constraint set (Kushner and Yin, 2003) is sometimes used for dictionary learning (see Aharon and Elad, 2008; Kavukcuoglu et al. [sent-49, score-0.675]
14 We show in this paper that it is possible to go further and exploit the specific structure of sparse coding in the design of an optimization procedure tuned to this problem, with low memory consumption and lower computational cost than classical batch algorithms. [sent-51, score-0.366]
15 The paper is structured as follows: Section 2 presents the dictionary learning problem. [sent-53, score-0.527]
16 Note that the terminology “basis” is slightly abusive here since the elements of the dictionary are not necessarily linearly independent and the set can be overcomplete—that is, have more elements than the signal dimension. [sent-56, score-0.582]
17 • As shown experimentally in Section 6, our algorithm is significantly faster than previous approaches to dictionary learning on both small and large data sets of natural images. [sent-61, score-0.527]
18 To demonstrate that it is adapted to difficult, large-scale image-processing tasks, we learn a dictionary on a 12-Megapixel photograph and use it for inpainting—that is, filling some holes in the image. [sent-62, score-0.527]
19 • We show in Sections 5 and 6 that our approach is suitable to large-scale matrix factorization problems such as non-negative matrix factorization and sparse principal component analysis, while being still effective on small data sets. [sent-63, score-0.546]
20 • To extend our algorithm to several matrix factorization problems, we propose in Appendix B efficient procedures for projecting onto two convex sets, which can be useful for other applications that are beyond the scope of this paper. [sent-64, score-0.231]
21 For a sequence of vectors (or matrices) x_t and scalars u_t, we write x_t = O(u_t) when there exists a constant K > 0 so that, for all t, ||x_t||_2 ≤ K u_t. [sent-74, score-0.261]
22 Problem Statement: Classical dictionary learning techniques for sparse representation (Olshausen and Field, 1997; Engan et al. [sent-82, score-0.633]
23 , 2007), we define ℓ(x, D) as the optimal value of the ℓ1 sparse coding problem: ℓ(x, D) ≜ min_{α∈R^k} (1/2) ||x − Dα||_2^2 + λ ||α||_1,  (2) where λ is a regularization parameter. [sent-95, score-0.262]
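The sparse coding step in Eq. (2) is a Lasso problem; the paper relies on a LARS/homotopy solver, but as a minimal hedged sketch, a proximal-gradient (ISTA) solver is enough to illustrate it. The function names and the fixed iteration count are illustrative assumptions, not the paper's implementation; any Lasso solver (LARS, coordinate descent) could be substituted.

```python
import numpy as np

def soft_threshold(v, tau):
    """Elementwise soft-thresholding, the proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def sparse_code_ista(x, D, lam, n_iter=200):
    """Approximately solve Eq. (2): min_alpha 0.5*||x - D @ alpha||_2^2 + lam*||alpha||_1."""
    alpha = np.zeros(D.shape[1])
    step = 1.0 / (np.linalg.norm(D, 2) ** 2 + 1e-12)  # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = D.T @ (D @ alpha - x)                  # gradient of the smooth quadratic part
        alpha = soft_threshold(alpha - step * grad, lam * step)
    return alpha
```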
24 It can be rewritten as a joint optimization problem with respect to the dictionary D and the coefficients α = [α_1, . . . , α_n]. [sent-110, score-0.527]
25 This can be rewritten as a matrix factorization problem with a sparsity penalty: min_{D∈C, α∈R^{k×n}} (1/2) ||X − Dα||_F^2 + λ ||α||_{1,1}. [sent-114, score-0.256]
26 In the case of dictionary learning, the classical projected first-order stochastic gradient descent algorithm (as used by Aharon and Elad 2008; Kavukcuoglu et al. [sent-137, score-0.681]
27 2008 for instance) consists of a sequence of updates of D: D_t = Π_C[ D_{t−1} − δ_t ∇_D ℓ(x_t, D_{t−1}) ], where D_t is the estimate of the optimal dictionary at iteration t, δ_t is the gradient step, Π_C is the orthogonal projector onto C, and the vectors x_t are i.i.d. samples. [sent-138, score-0.646]
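For comparison, this projected first-order stochastic gradient update can be sketched as follows; it assumes the standard fact that the gradient of ℓ(x, D) with respect to D is −(x − Dα*)α*ᵀ at the sparse code α*, reuses the sparse_code_ista sketch given earlier, and the step-size choice is left to the caller as an illustrative parameter.

```python
import numpy as np

def project_columns_l2_ball(D):
    """Project each column of D onto the unit l2-ball (the constraint set C)."""
    return D / np.maximum(np.linalg.norm(D, axis=0), 1.0)

def projected_sgd_step(D, x, lam, step):
    """One update D_t = Pi_C[D_{t-1} - step * grad_D l(x_t, D_{t-1})]."""
    alpha = sparse_code_ista(x, D, lam)          # sparse code of x on the current dictionary
    grad = -np.outer(x - D @ alpha, alpha)       # gradient of l(x, D) with respect to D
    return project_columns_l2_ball(D - step * grad)
```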
28 Note that first-order stochastic gradient descent has also been used for other matrix factorization problems (see Koren et al. [sent-148, score-0.318]
29 (2007), we have preferred to use the convex ℓ1 norm, which has empirically proven to be better behaved in general than the ℓ0 pseudo-norm for dictionary learning. [sent-155, score-0.527]
30 Online Dictionary Learning: We present in this section the basic components of our online algorithm for dictionary learning (Sections 3. [sent-161, score-0.568]
31 One key aspect of our convergence analysis will be to show that f̂_t(D_t) and f_t(D_t) converge almost surely to the same limit, and thus that f̂_t acts as a surrogate for f_t. [sent-174, score-0.36]
32 Eq. (2) with a fixed dictionary is an ℓ1-regularized linear least-squares problem. [sent-178, score-0.527]
33 When the columns of the dictionary have low correlation, we have observed that these simple methods are very efficient. [sent-181, score-0.527]
34 3 Dictionary Update Our algorithm for updating the dictionary uses block-coordinate descent with warm restarts (see Bertsekas, 1999). [sent-189, score-0.607]
35 One of its main advantages is that it is parameter-free and does not require any learning-rate tuning (see Algorithm 1, online dictionary learning). [sent-190, score-0.527]
36 After a few iterations of our algorithm, using the value of D_{t−1} as a warm restart for computing D_t becomes effective, and a single iteration of Algorithm 2 has empirically been found to be sufficient to achieve convergence of the dictionary update step. [sent-221, score-0.557]
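A hedged sketch of what such a warm-restarted dictionary update could look like is given below, using the column update u_j = d_j + (1/A[j,j])(b_j − D a_j) followed by re-projection onto the unit ℓ2-ball that is described later in the text; here A and B are the statistics accumulated from past sparse codes, and the handling of near-zero A[j,j] is an assumption.

```python
import numpy as np

def dictionary_update(D, A, B, n_iter=1, eps=1e-8):
    """Warm-restarted block-coordinate descent over the columns of D (sketch of Algorithm 2).
    A accumulates alpha @ alpha.T and B accumulates x @ alpha.T over past samples."""
    D = D.copy()
    for _ in range(n_iter):                      # one pass is usually enough with a warm restart
        for j in range(D.shape[1]):
            if A[j, j] < eps:                    # atom never used so far: leave it unchanged here
                continue
            u = D[:, j] + (B[:, j] - D @ A[:, j]) / A[j, j]
            D[:, j] = u / max(np.linalg.norm(u), 1.0)   # project back onto the unit l2-ball
    return D
```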
37 Mini-Batch Extension: In practice, we can also improve the convergence speed of our algorithm by drawing η > 1 signals at each iteration instead of a single one, which is a classical heuristic in stochastic gradient descent algorithms. [sent-265, score-0.195]
38 An initialization of the form A0 = t0 I and B0 = t0 D0 with t0 ≥ 0 also slows down the first steps of our algorithm by forcing the solution of the dictionary update to stay close to D0 . [sent-278, score-0.527]
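Putting the pieces together, a sketch of the online loop is shown below: it draws η signals per iteration (η > 1 being the mini-batch extension above), accumulates A and B starting from A_0 = t_0 I and B_0 = t_0 D_0, and refines the dictionary with the warm-restarted update. It reuses the sparse_code_ista and dictionary_update sketches given earlier; the initialization of D from random training signals and the fixed iteration budget are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def online_dictionary_learning(X, k, lam, n_iter=1000, eta=1, t0=0.0, seed=0):
    """Sketch of the online dictionary learning loop (Algorithm 1 in spirit)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    D = X[:, rng.choice(n, size=k, replace=False)].astype(float).copy()
    D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)   # start with unit-norm columns
    A = t0 * np.eye(k)                                   # A_0 = t_0 * I
    B = t0 * D.copy()                                    # B_0 = t_0 * D_0
    for _ in range(n_iter):
        for _ in range(eta):                             # eta > 1: mini-batch extension
            x = X[:, rng.integers(n)]                    # draw one training signal
            alpha = sparse_code_ista(x, D, lam)          # sparse coding step (Eq. 2)
            A += np.outer(alpha, alpha)
            B += np.outer(x, alpha)
        D = dictionary_update(D, A, B)                   # warm-restarted update
    return D
```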
39 Purging the Dictionary from Unused Atoms: Every dictionary learning technique sometimes encounters situations where some of the dictionary atoms are never (or very seldom) used, which typically happens with a very bad initialization. [sent-282, score-1.087]
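The extracted text does not spell out the purging rule itself; a common heuristic, sketched here purely as an assumption, is to track how often each atom receives a nonzero coefficient and to replace rarely used atoms with randomly drawn, renormalized training signals.

```python
import numpy as np

def purge_unused_atoms(D, X, usage_counts, rng, min_usage=1):
    """Replace atoms that were (almost) never selected by random training signals."""
    for j in range(D.shape[1]):
        if usage_counts[j] < min_usage:
            x = X[:, rng.integers(X.shape[1])].astype(float)
            D[:, j] = x / max(np.linalg.norm(x), 1e-12)   # re-projected onto the unit sphere
            usage_counts[j] = 0
    return D
```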
40 For these two reasons, it cannot be used for the dictionary learning problem, but nevertheless it shares some similarities with our algorithm, which we illustrate with the example of a different problem. [sent-290, score-0.527]
41 Suppose that two major modifications are brought to our original formulation: (i) the vectors αt are independent of the dictionary D—that is, they are drawn at the same time as xt ; (ii) the optimization is unconstrained—that is, C = Rm×k . [sent-291, score-0.582]
42 This setting leads to the least-squares estimation problem min_{D∈R^{m×k}} E_{(x,α)}[ ||x − Dα||_2^2 ],  (8) which is of course different from the original dictionary learning formulation. [sent-292, score-0.556]
43 This hypothesis is verified experimentally in practice after a few iterations of the algorithm when the initial dictionary is reasonable, consisting for example of a few elements from the training set, or any common dictionary, such as the DCT (bases of cosine products) or wavelets (Mallat, 1999). [sent-307, score-0.557]
44 (C) A particular sufficient condition for the uniqueness of the sparse coding solution is satisfied. [sent-310, score-0.233]
45 It is of course easy to build a dictionary D for which this assumption fails. [sent-321, score-0.527]
46 Since the surrogate f̂_t upper-bounds the empirical cost f_t, we also have f_t(D_t) − f̂_t(D_t) ≤ 0. [sent-399, score-0.34]
47 Taking the expectation of Eq. (14) conditioned on F_t gives the bound E[u_{t+1} − u_t | F_t] ≤ ( E[ℓ(x_{t+1}, D_t) | F_t] − f_t(D_t) ) / (t+1) = ( f(D_t) − f_t(D_t) ) / (t+1) ≤ ||f − f_t||_∞ / (t+1). For a specific matrix D, the central-limit theorem states that E[ √t ( f(D) − f_t(D) ) ] is bounded. [sent-401, score-0.57]
48 Therefore, Lemma 7 applies and there exists a constant κ > 0 such that E[ E[u_{t+1} − u_t | F_t]^+ ] ≤ κ / t^{3/2}. [sent-408, score-0.302]
49 Therefore, defining δ_t as in Theorem 6, we have ∑_{t=1}^∞ E[δ_t (u_{t+1} − u_t)] = ∑_{t=1}^∞ E[ E[u_{t+1} − u_t | F_t]^+ ] < +∞. [sent-409, score-0.302]
50 Thus, we can apply Theorem 6, which proves that u_t converges almost surely and that ∑_{t=1}^∞ |E[u_{t+1} − u_t | F_t]| < +∞ a.s. [sent-410, score-0.349]
51 Under assumptions (A) to (C), the distance between Dt and the set of stationary points of the dictionary learning problem converges almost surely to 0 when t tends to infinity. [sent-427, score-0.574]
52 Since f̂_t upper-bounds f_t on R^{m×k}, for all t, f̂_t(D_t + U) ≥ f_t(D_t + U). [sent-432, score-0.293]
53 We first present different possible regularization terms for α and D, which can be used with our algorithm, and then detail some specific cases such as non-negative matrix factorization, sparse principal component analysis, constrained sparse coding, and simultaneous sparse coding. [sent-443, score-0.4]
54 However, as with any classical dictionary learning technique exploiting non-convex regularizers (e. [sent-456, score-0.601]
55 For the dictionary learning problem, we have considered an ℓ2 -regularization on D by forcing its columns to have less than unit ℓ2 -norm. [sent-466, score-0.527]
56 We have shown that with this constraint set, the dictionary update step can be solved efficiently using a block-coordinate descent approach. [sent-467, score-0.605]
57 Updating the j-th column of D, while keeping the other ones fixed, is solved by orthogonally projecting the vector u_j = d_j + (1/A[j, j])(b_j − D a_j) onto the constraint set C, which in the classical dictionary learning case amounts to a projection of u_j onto the ℓ2-ball. [sent-468, score-0.589]
58 These constraints induce sparsity in the dictionary D (in addition to the sparsity-inducing regularizer on the vectors α_i). [sent-482, score-0.556]
59 Here, γ is a new parameter controlling the sparsity of the dictionary D. [sent-484, score-0.556]
60 The combination of ℓ1 and ℓ2 constraints has also been proposed recently for the problem of matrix factorization by Witten et al. [sent-489, score-0.198]
61 When one is looking for a dictionary whose columns are sparse and piecewise-constant, a fused lasso regularization can be used. [sent-493, score-0.801]
62 The second one is a homotopy method, which solves the projection on the fused lasso constraint set in O(ks), where s is the number of piecewise-constant parts in the solution. [sent-507, score-0.244]
63 This method also solves efficiently the fused lasso signal approximation problem presented in Friedman et al. [sent-508, score-0.223]
64 Now that we have presented a few possible regularizers for α and D that can be used within our algorithm, we focus on a few classical problems which can be formulated as dictionary learning problems with specific combinations of such regularizers. [sent-515, score-0.601]
65 , x_n] in R^{m×n}, Lee and Seung (2001) have proposed the non-negative matrix factorization (NMF) problem, which consists of minimizing the following cost: min_{D∈C, α∈R^{k×n}} ∑_{i=1}^n (1/2) ||x_i − Dα_i||_2^2 s.t. D ≥ 0 and α_i ≥ 0. [sent-520, score-0.227]
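How the sparse coding and projection steps could be adapted to this non-negative setting is sketched below, consistent with the constraints mentioned in the surrounding text; the projected-ISTA solver and the clamp-then-rescale projection are illustrative stand-ins for the procedures actually used in the paper.

```python
import numpy as np

def nn_sparse_code(x, D, lam, n_iter=200):
    """min_{alpha >= 0} 0.5*||x - D @ alpha||_2^2 + lam * sum(alpha), by projected ISTA."""
    alpha = np.zeros(D.shape[1])
    step = 1.0 / (np.linalg.norm(D, 2) ** 2 + 1e-12)
    for _ in range(n_iter):
        grad = D.T @ (D @ alpha - x)
        alpha = np.maximum(alpha - step * (grad + lam), 0.0)   # prox of lam*sum + non-negativity
    return alpha

def project_nonneg_unit_ball(d):
    """Project a dictionary column onto {d : d >= 0, ||d||_2 <= 1}."""
    d = np.maximum(d, 0.0)
    return d / max(np.linalg.norm(d), 1.0)
```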
66 As for dictionary learning, classical approaches for addressing this problem are batch algorithms, such as the multiplicative update rules of Lee and Seung (2001), or the projected gradient descent algorithm of Lin (2007). [sent-525, score-0.741]
67 The only difference with the dictionary learning problem is that non-negativity constraints are imposed on D and the vectors αi . [sent-530, score-0.527]
68 As detailed above, our dictionary update procedure amounts to successive orthogonal projections of the vectors u_j onto the constraint set. [sent-548, score-0.555]
69 Constrained Sparse Coding: Constrained sparse coding problems are often encountered in the literature, and lead to different loss functions such as ℓ′(x, D) = min_{α∈R^k} ||x − Dα||_2^2 s.t. ||α||_1 ≤ T.  (15) [sent-554, score-0.262]
70 Using Eq. (15) in the sparse coding step leads to the minimization of the expected cost min_{D∈C} E_x[ℓ′(x, D)]. [sent-563, score-0.233]
71 Suppose one wants to obtain sparse decompositions of the signals on the dictionary D that share the same active set (non-zero coefficients). [sent-581, score-0.674]
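A shared active set can be encouraged by replacing the ℓ1 penalty with a grouped ℓ1/ℓ2 penalty on the rows of the coefficient matrix; the sketch below uses proximal gradient with row-wise group soft-thresholding, which is one standard way to do this and is offered here only as an illustration of the idea, not as the paper's ℓ′′′ formulation.

```python
import numpy as np

def group_soft_threshold(A, tau):
    """Row-wise l2 shrinkage: the proximal operator of tau * sum_i ||A[i, :]||_2."""
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return A * scale

def simultaneous_sparse_code(X, D, lam, n_iter=200):
    """min_A 0.5*||X - D @ A||_F^2 + lam * sum_i ||A[i, :]||_2 (shared active set)."""
    A = np.zeros((D.shape[1], X.shape[1]))
    step = 1.0 / (np.linalg.norm(D, 2) ** 2 + 1e-12)
    for _ in range(n_iter):
        grad = D.T @ (D @ A - X)
        A = group_soft_threshold(A - step * grad, lam * step)
    return A
```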
72 , n, which have bounded size and are independent from each other and identically distributed, one can learn an adapted dictionary by solving the optimization problem min_{D∈C} (1/n) ∑_{i=1}^n ℓ′′′(X_i, D). [sent-595, score-0.527]
73 The extension of our algorithm to this case is relatively easy, computing at each sparse coding step a matrix of coefficients α, and keeping the updates of At and Bt unchanged. [sent-599, score-0.271]
74 Experimental Validation: In this section, we present experiments on natural images and genomic data to demonstrate the efficiency of our method for dictionary learning, non-negative matrix factorization, and sparse principal component analysis. [sent-607, score-0.796]
75 We have also implemented a first-order stochastic gradient descent algorithm that shares most of its code with our algorithm, except for the dictionary update step. [sent-626, score-0.647]
76 Non-Negative Matrix Factorization and Non-Negative Sparse Coding: In this section, we compare our method with the classical algorithm of Lee and Seung (2001) for NMF and the non-negative sparse coding algorithm of Hoyer (2002) for NNSC. [sent-683, score-0.267]
77 Figure 1 (left): Comparison between our method and the batch approach for dictionary learning; the x-axis is time (in seconds). [sent-731, score-0.626]
78 • Data set F is composed of n = 100,000 natural image patches of size m = 16 × 16 pixels from the Pascal VOC’06 image database (Everingham et al. [sent-740, score-0.208]
79 Second, we run one sparse coding step over all the input vectors to obtain α. [sent-758, score-0.233]
80 Faces and Natural Patches: In this section, we compare qualitatively the results obtained by PCA, NMF, our dictionary learning, and our sparse principal component analysis algorithm on the data sets used in Section 6. [sent-771, score-0.677]
81 For dictionary learning, PCA and SPCA, the input vectors are first centered and normalized to have a unit norm. [sent-773, score-0.527]
82 The parameter λ for dictionary learning and SPCA was set so that the decomposition of each input signal has approximately 10 nonzero coefficients. [sent-775, score-0.582]
83 On the other hand, the dictionary learning technique is able to learn localized features on data set F, and SPCA is the only tested method that allows controlling the level of sparsity among the learned matrices. [sent-821, score-0.556]
84 (2009), this method can benefit from sparse regularizers such as the ℓ1 norm for the gene expression measurements and a fused lasso for the CGH arrays, which are classical choices used for these data. [sent-837, score-0.348]
85 Figure 3: Results obtained by PCA (a), NMF (c), dictionary learning (e), and SPCA with τ = 70% (b), τ = 30% (d), and τ = 10% (f) for data set D. [sent-864, score-0.527]
86 Figure 4: Results obtained by PCA (a), NMF (c), dictionary learning (e), and SPCA with τ = 70% (b), τ = 30% (d), and τ = 10% (f) for data set E. [sent-865, score-0.527]
87 Figure 5: Results obtained by PCA (a), NMF (c), dictionary learning (e), and SPCA with τ = 70% (b), τ = 30% (d), and τ = 10% (f) for data set F. [sent-866, score-0.527]
88 Using a multi-threaded version of our implementation, we have learned a dictionary with 256 elements from the roughly 7 × 10^6 undamaged 12 × 12 color patches in the image with two epochs in about 8 minutes on a 2. [sent-903, score-0.644]
89 Once the dictionary has been learned, the text is removed using the sparse coding technique for inpainting of Mairal et al. [sent-905, score-0.813]
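A simplified sketch of the patch-level inpainting idea follows: sparse-code a damaged patch using only its observed pixels, then fill the missing ones from the reconstruction Dα. It reuses the soft_threshold helper defined earlier; the masked-ISTA formulation is an illustrative stand-in for the cited inpainting technique, not a reproduction of it.

```python
import numpy as np

def inpaint_patch(x, mask, D, lam, n_iter=200):
    """Fill missing pixels (mask == 0) of a vectorized patch x using dictionary D."""
    alpha = np.zeros(D.shape[1])
    step = 1.0 / (np.linalg.norm(D, 2) ** 2 + 1e-12)
    for _ in range(n_iter):
        residual = mask * (D @ alpha - x)            # fit only the observed pixels
        alpha = soft_threshold(alpha - step * (D.T @ residual), lam * step)
    reconstruction = D @ alpha
    return np.where(mask > 0, x, reconstruction)     # keep observed pixels, fill the rest
```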
90 Indeed, to the best of our knowledge, this is the first time that dictionary learning is used for image restoration on such large-scale data. [sent-909, score-0.587]
91 Conclusion: We have introduced in this paper a new stochastic online algorithm for learning dictionaries adapted to sparse coding tasks, and proven its convergence. [sent-913, score-0.374]
92 Moreover, we have extended it to other matrix factorization problems such as non-negative matrix factorization, and we have proposed a formulation for sparse principal component analysis which can be solved efficiently using our method. [sent-918, score-0.386]
93 Beyond this, we plan to use the proposed learning framework for sparse coding in computationally demanding video restoration tasks (Protter and Elad, 2009), with dynamic data sets whose size is not fixed, and extending this framework to different loss functions (Mairal et al. [sent-921, score-0.264]
94 If for all t, u_t ≥ 0 and ∑_{t=1}^∞ E[δ_t (u_{t+1} − u_t)] < ∞, then u_t is a quasi-martingale and converges almost surely. [sent-941, score-0.453]
95 As for the dictionary learning problem, a simple modification to Algorithm 3 allows us to handle the non-negative case, replacing the scalars |b[ j]| by max(b[ j], 0) in the algorithm. [sent-1007, score-0.527]
96 (2007), the fused lasso signal approximation problem P(γ_1, γ_2, γ_3): min_{u∈R^m} (1/2) ||b − u||_2^2 + γ_1 ||u||_1 + γ_2 FL(u) + (γ_3/2) ||u||_2^2.  (20) (Algorithm 3: Efficient projection on the elastic-net constraint.) [sent-1011, score-0.252]
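Algorithm 3 itself is not reproduced in this summary; as a hedged illustration, its γ = 0 special case reduces to Euclidean projection onto an ℓ1-ball, for which the standard sort-based routine below can be used (it assumes a radius τ > 0).

```python
import numpy as np

def project_l1_ball(b, tau):
    """Euclidean projection of b onto the set {u : ||u||_1 <= tau}, assuming tau > 0."""
    if np.abs(b).sum() <= tau:
        return b.copy()
    mags = np.sort(np.abs(b))[::-1]                   # sorted magnitudes, descending
    cssv = np.cumsum(mags)
    # Largest index rho such that mags[rho] > (cssv[rho] - tau) / (rho + 1).
    rho = np.nonzero(mags * np.arange(1, len(b) + 1) > (cssv - tau))[0][-1]
    theta = (cssv[rho] - tau) / (rho + 1.0)
    return np.sign(b) * np.maximum(np.abs(b) - theta, 0.0)
```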
97 Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis. [sent-1278, score-0.198]
98 Fast inference in sparse coding algorithms with applications to object recognition. [sent-1391, score-0.233]
99 Learning multiscale sparse representations for image and video restoration. [sent-1474, score-0.197]
100 Linear spatial pyramid matching using sparse coding for image classification. [sent-1655, score-0.293]
wordName wordTfidf (topN-words)
[('dictionary', 0.527), ('dt', 0.383), ('airal', 0.186), ('apiro', 0.186), ('oding', 0.177), ('factorization', 0.16), ('mairal', 0.159), ('ut', 0.151), ('spca', 0.143), ('ft', 0.133), ('coding', 0.127), ('bach', 0.126), ('nline', 0.123), ('aharon', 0.115), ('nmf', 0.113), ('sparse', 0.106), ('atrix', 0.1), ('batch', 0.099), ('fused', 0.098), ('bt', 0.094), ('witten', 0.082), ('parse', 0.082), ('rm', 0.08), ('cgh', 0.071), ('lasso', 0.07), ('seung', 0.064), ('lee', 0.064), ('engan', 0.062), ('dictionaries', 0.061), ('earning', 0.061), ('image', 0.06), ('zou', 0.059), ('elad', 0.058), ('patches', 0.057), ('xt', 0.055), ('tt', 0.055), ('signal', 0.055), ('rk', 0.053), ('inpainting', 0.053), ('ponce', 0.053), ('hoyer', 0.053), ('descent', 0.05), ('genomic', 0.048), ('homotopy', 0.048), ('surrogate', 0.047), ('surely', 0.047), ('pca', 0.046), ('sapiro', 0.045), ('bonnans', 0.044), ('obozinski', 0.044), ('pmd', 0.044), ('principal', 0.044), ('online', 0.041), ('signals', 0.041), ('regularizers', 0.04), ('stochastic', 0.039), ('matrix', 0.038), ('seconds', 0.037), ('jenatton', 0.035), ('nnsc', 0.035), ('denoising', 0.035), ('classical', 0.034), ('duchi', 0.033), ('atoms', 0.033), ('onto', 0.033), ('bottou', 0.033), ('images', 0.033), ('objective', 0.032), ('video', 0.031), ('composed', 0.031), ('fl', 0.031), ('gradient', 0.031), ('tr', 0.03), ('kushner', 0.03), ('raina', 0.03), ('wavelets', 0.03), ('warm', 0.03), ('willow', 0.03), ('vaart', 0.03), ('min', 0.029), ('aspremont', 0.029), ('sparsity', 0.029), ('overcomplete', 0.028), ('millions', 0.028), ('constraint', 0.028), ('turlach', 0.027), ('lipschitz', 0.027), ('damaged', 0.027), ('dat', 0.027), ('everingham', 0.027), ('fisk', 0.027), ('harchaoui', 0.027), ('kavukcuoglu', 0.027), ('maculan', 0.027), ('mallat', 0.027), ('normale', 0.027), ('rieure', 0.027), ('shapiro', 0.027), ('upperbounds', 0.027), ('zass', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999678 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding
Author: Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro
Abstract: Sparse coding—that is, modelling data vectors as sparse linear combinations of basis elements—is widely used in machine learning, neuroscience, signal processing, and statistics. This paper focuses on the large-scale matrix factorization problem that consists of learning the basis set in order to adapt it to specific data. Variations of this problem include dictionary learning in signal processing, non-negative matrix factorization and sparse principal component analysis. In this paper, we propose to address these tasks with a new online optimization algorithm, based on stochastic approximations, which scales up gracefully to large data sets with millions of training samples, and extends naturally to various matrix factorization formulations, making it suitable for a wide range of learning problems. A proof of convergence is presented, along with experiments with natural images and genomic data demonstrating that it leads to state-of-the-art performance in terms of speed and optimization for both small and large data sets. Keywords: basis pursuit, dictionary learning, matrix factorization, online learning, sparse coding, sparse principal component analysis, stochastic approximations, stochastic optimization, nonnegative matrix factorization
2 0.12627336 4 jmlr-2010-A Generalized Path Integral Control Approach to Reinforcement Learning
Author: Evangelos Theodorou, Jonas Buchli, Stefan Schaal
Abstract: With the goal to generate more scalable algorithms with higher efficiency and fewer open parameters, reinforcement learning (RL) has recently moved towards combining classical techniques from optimal control and dynamic programming with modern learning techniques from statistical estimation theory. In this vein, this paper suggests to use the framework of stochastic optimal control with path integrals to derive a novel approach to RL with parameterized policies. While solidly grounded in value function estimation and optimal control based on the stochastic Hamilton-JacobiBellman (HJB) equations, policy improvements can be transformed into an approximation problem of a path integral which has no open algorithmic parameters other than the exploration noise. The resulting algorithm can be conceived of as model-based, semi-model-based, or even model free, depending on how the learning problem is structured. The update equations have no danger of numerical instabilities as neither matrix inversions nor gradient learning rates are required. Our new algorithm demonstrates interesting similarities with previous RL research in the framework of probability matching and provides intuition why the slightly heuristically motivated probability matching approach can actually perform well. Empirical evaluations demonstrate significant performance improvements over gradient-based policy learning and scalability to high-dimensional control problems. Finally, a learning experiment on a simulated 12 degree-of-freedom robot dog illustrates the functionality of our algorithm in a complex robot learning scenario. We believe that Policy Improvement with Path Integrals (PI2 ) offers currently one of the most efficient, numerically robust, and easy to implement algorithms for RL based on trajectory roll-outs. Keywords: stochastic optimal control, reinforcement learning, parameterized policies
3 0.097429402 75 jmlr-2010-Mean Field Variational Approximation for Continuous-Time Bayesian Networks
Author: Ido Cohn, Tal El-Hay, Nir Friedman, Raz Kupferman
Abstract: Continuous-time Bayesian networks is a natural structured representation language for multicomponent stochastic processes that evolve continuously over time. Despite the compact representation provided by this language, inference in such models is intractable even in relatively simple structured networks. We introduce a mean field variational approximation in which we use a product of inhomogeneous Markov processes to approximate a joint distribution over trajectories. This variational approach leads to a globally consistent distribution, which can be efficiently queried. Additionally, it provides a lower bound on the probability of observations, thus making it attractive for learning tasks. Here we describe the theoretical foundations for the approximation, an efficient implementation that exploits the wide range of highly optimized ordinary differential equations (ODE) solvers, experimentally explore characterizations of processes for which this approximation is suitable, and show applications to a large-scale real-world inference problem. Keywords: continuous time Markov processes, continuous time Bayesian networks, variational approximations, mean field approximation
4 0.080389164 17 jmlr-2010-Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing
Author: Ryo Yoshida, Mike West
Abstract: We describe a class of sparse latent factor models, called graphical factor models (GFMs), and relevant sparse learning algorithms for posterior mode estimation. Linear, Gaussian GFMs have sparse, orthogonal factor loadings matrices, that, in addition to sparsity of the implied covariance matrices, also induce conditional independence structures via zeros in the implied precision matrices. We describe the models and their use for robust estimation of sparse latent factor structure and data/signal reconstruction. We develop computational algorithms for model exploration and posterior mode search, addressing the hard combinatorial optimization involved in the search over a huge space of potential sparse configurations. A mean-field variational technique coupled with annealing is developed to successively generate “artificial” posterior distributions that, at the limiting temperature in the annealing schedule, define required posterior modes in the GFM parameter space. Several detailed empirical studies and comparisons to related approaches are discussed, including analyses of handwritten digit image and cancer gene expression data. Keywords: annealing, graphical factor models, variational mean-field method, MAP estimation, sparse factor analysis, gene expression profiling
5 0.078194603 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization
Author: Tong Zhang
Abstract: We consider learning formulations with non-convex objective functions that often occur in practical applications. There are two approaches to this problem: • Heuristic methods such as gradient descent that only find a local minimum. A drawback of this approach is the lack of theoretical guarantee showing that the local minimum gives a good solution. • Convex relaxation such as L1 -regularization that solves the problem under some conditions. However it often leads to a sub-optimal solution in reality. This paper tries to remedy the above gap between theory and practice. In particular, we present a multi-stage convex relaxation scheme for solving problems with non-convex objective functions. For learning formulations with sparse regularization, we analyze the behavior of a specific multistage relaxation scheme. Under appropriate conditions, we show that the local solution obtained by this procedure is superior to the global solution of the standard L1 convex relaxation for learning sparse targets. Keywords: sparsity, non-convex optimization, convex relaxation, multi-stage convex relaxation
6 0.07437814 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm
7 0.072468415 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
8 0.070443749 43 jmlr-2010-Generalized Power Method for Sparse Principal Component Analysis
9 0.069689885 38 jmlr-2010-Expectation Truncation and the Benefits of Preselection In Training Generative Models
10 0.067944951 45 jmlr-2010-High-dimensional Variable Selection with Sparse Random Projections: Measurement Sparsity and Statistical Efficiency
11 0.064078704 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs
12 0.063862167 31 jmlr-2010-Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization
13 0.062377207 3 jmlr-2010-A Fast Hybrid Algorithm for Large-Scalel1-Regularized Logistic Regression
14 0.059145164 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
15 0.053527042 52 jmlr-2010-Incremental Sigmoid Belief Networks for Grammar Learning
16 0.052943937 50 jmlr-2010-Image Denoising with Kernels Based on Natural Image Relations
17 0.052216895 57 jmlr-2010-Iterative Scaling and Coordinate Descent Methods for Maximum Entropy Models
18 0.052113257 84 jmlr-2010-On Spectral Learning
19 0.051123925 72 jmlr-2010-Matrix Completion from Noisy Entries
20 0.047326002 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values
topicId topicWeight
[(0, -0.226), (1, -0.088), (2, 0.034), (3, -0.081), (4, 0.034), (5, -0.06), (6, 0.076), (7, -0.153), (8, 0.112), (9, -0.117), (10, 0.067), (11, 0.077), (12, 0.041), (13, 0.069), (14, 0.056), (15, 0.002), (16, 0.107), (17, 0.048), (18, -0.216), (19, -0.171), (20, -0.074), (21, 0.007), (22, -0.26), (23, -0.194), (24, 0.022), (25, -0.036), (26, -0.045), (27, -0.168), (28, 0.017), (29, 0.062), (30, -0.003), (31, 0.078), (32, 0.145), (33, 0.181), (34, -0.058), (35, -0.001), (36, -0.015), (37, 0.015), (38, 0.134), (39, 0.082), (40, -0.109), (41, 0.044), (42, 0.048), (43, 0.027), (44, -0.004), (45, 0.012), (46, 0.036), (47, -0.01), (48, 0.03), (49, -0.042)]
simIndex simValue paperId paperTitle
same-paper 1 0.94199055 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding
Author: Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro
Abstract: Sparse coding—that is, modelling data vectors as sparse linear combinations of basis elements—is widely used in machine learning, neuroscience, signal processing, and statistics. This paper focuses on the large-scale matrix factorization problem that consists of learning the basis set in order to adapt it to specific data. Variations of this problem include dictionary learning in signal processing, non-negative matrix factorization and sparse principal component analysis. In this paper, we propose to address these tasks with a new online optimization algorithm, based on stochastic approximations, which scales up gracefully to large data sets with millions of training samples, and extends naturally to various matrix factorization formulations, making it suitable for a wide range of learning problems. A proof of convergence is presented, along with experiments with natural images and genomic data demonstrating that it leads to state-of-the-art performance in terms of speed and optimization for both small and large data sets. Keywords: basis pursuit, dictionary learning, matrix factorization, online learning, sparse coding, sparse principal component analysis, stochastic approximations, stochastic optimization, nonnegative matrix factorization
2 0.71222085 4 jmlr-2010-A Generalized Path Integral Control Approach to Reinforcement Learning
Author: Evangelos Theodorou, Jonas Buchli, Stefan Schaal
Abstract: With the goal to generate more scalable algorithms with higher efficiency and fewer open parameters, reinforcement learning (RL) has recently moved towards combining classical techniques from optimal control and dynamic programming with modern learning techniques from statistical estimation theory. In this vein, this paper suggests to use the framework of stochastic optimal control with path integrals to derive a novel approach to RL with parameterized policies. While solidly grounded in value function estimation and optimal control based on the stochastic Hamilton-JacobiBellman (HJB) equations, policy improvements can be transformed into an approximation problem of a path integral which has no open algorithmic parameters other than the exploration noise. The resulting algorithm can be conceived of as model-based, semi-model-based, or even model free, depending on how the learning problem is structured. The update equations have no danger of numerical instabilities as neither matrix inversions nor gradient learning rates are required. Our new algorithm demonstrates interesting similarities with previous RL research in the framework of probability matching and provides intuition why the slightly heuristically motivated probability matching approach can actually perform well. Empirical evaluations demonstrate significant performance improvements over gradient-based policy learning and scalability to high-dimensional control problems. Finally, a learning experiment on a simulated 12 degree-of-freedom robot dog illustrates the functionality of our algorithm in a complex robot learning scenario. We believe that Policy Improvement with Path Integrals (PI2 ) offers currently one of the most efficient, numerically robust, and easy to implement algorithms for RL based on trajectory roll-outs. Keywords: stochastic optimal control, reinforcement learning, parameterized policies
3 0.46347994 17 jmlr-2010-Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing
Author: Ryo Yoshida, Mike West
Abstract: We describe a class of sparse latent factor models, called graphical factor models (GFMs), and relevant sparse learning algorithms for posterior mode estimation. Linear, Gaussian GFMs have sparse, orthogonal factor loadings matrices, that, in addition to sparsity of the implied covariance matrices, also induce conditional independence structures via zeros in the implied precision matrices. We describe the models and their use for robust estimation of sparse latent factor structure and data/signal reconstruction. We develop computational algorithms for model exploration and posterior mode search, addressing the hard combinatorial optimization involved in the search over a huge space of potential sparse configurations. A mean-field variational technique coupled with annealing is developed to successively generate “artificial” posterior distributions that, at the limiting temperature in the annealing schedule, define required posterior modes in the GFM parameter space. Several detailed empirical studies and comparisons to related approaches are discussed, including analyses of handwritten digit image and cancer gene expression data. Keywords: annealing, graphical factor models, variational mean-field method, MAP estimation, sparse factor analysis, gene expression profiling
4 0.46009898 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm
Author: Dotan Di Castro, Ron Meir
Abstract: Actor-Critic based approaches were among the first to address reinforcement learning in a general setting. Recently, these algorithms have gained renewed interest due to their generality, good convergence properties, and possible biological relevance. In this paper, we introduce an online temporal difference based actor-critic algorithm which is proved to converge to a neighborhood of a local maximum of the average reward. Linear function approximation is used by the critic in order estimate the value function, and the temporal difference signal, which is passed from the critic to the actor. The main distinguishing feature of the present convergence proof is that both the actor and the critic operate on a similar time scale, while in most current convergence proofs they are required to have very different time scales in order to converge. Moreover, the same temporal difference signal is used to update the parameters of both the actor and the critic. A limitation of the proposed approach, compared to results available for two time scale convergence, is that convergence is guaranteed only to a neighborhood of an optimal value, rather to an optimal value itself. The single time scale and identical temporal difference signal used by the actor and the critic, may provide a step towards constructing more biologically realistic models of reinforcement learning in the brain. Keywords: actor critic, single time scale convergence, temporal difference
5 0.42846173 43 jmlr-2010-Generalized Power Method for Sparse Principal Component Analysis
Author: Michel Journée, Yurii Nesterov, Peter Richtárik, Rodolphe Sepulchre
Abstract: In this paper we develop a new approach to sparse principal component analysis (sparse PCA). We propose two single-unit and two block optimization formulations of the sparse PCA problem, aimed at extracting a single sparse dominant principal component of a data matrix, or more components at once, respectively. While the initial formulations involve nonconvex functions, and are therefore computationally intractable, we rewrite them into the form of an optimization program involving maximization of a convex function on a compact set. The dimension of the search space is decreased enormously if the data matrix has many more columns (variables) than rows. We then propose and analyze a simple gradient method suited for the task. It appears that our algorithm has best convergence properties in the case when either the objective function or the feasible set are strongly convex, which is the case with our single-unit formulations and can be enforced in the block case. Finally, we demonstrate numerically on a set of random and gene expression test problems that our approach outperforms existing algorithms both in quality of the obtained solution and in computational speed. Keywords: sparse PCA, power method, gradient ascent, strongly convex sets, block algorithms
6 0.36660802 38 jmlr-2010-Expectation Truncation and the Benefits of Preselection In Training Generative Models
7 0.35819581 50 jmlr-2010-Image Denoising with Kernels Based on Natural Image Relations
8 0.33550021 75 jmlr-2010-Mean Field Variational Approximation for Continuous-Time Bayesian Networks
9 0.32355419 35 jmlr-2010-Error-Correcting Output Codes Library
10 0.31099772 32 jmlr-2010-Efficient Algorithms for Conditional Independence Inference
11 0.28385493 57 jmlr-2010-Iterative Scaling and Coordinate Descent Methods for Maximum Entropy Models
12 0.2807413 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization
13 0.2779263 3 jmlr-2010-A Fast Hybrid Algorithm for Large-Scalel1-Regularized Logistic Regression
14 0.27016699 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs
15 0.26994556 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking
16 0.26605856 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
17 0.26151514 84 jmlr-2010-On Spectral Learning
18 0.25641757 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
19 0.25193939 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
20 0.2463742 52 jmlr-2010-Incremental Sigmoid Belief Networks for Grammar Learning
topicId topicWeight
[(3, 0.024), (4, 0.013), (8, 0.029), (15, 0.016), (21, 0.033), (24, 0.013), (32, 0.075), (33, 0.011), (36, 0.033), (37, 0.067), (75, 0.178), (81, 0.015), (83, 0.321), (85, 0.067), (96, 0.016)]
simIndex simValue paperId paperTitle
1 0.86170071 27 jmlr-2010-Consistent Nonparametric Tests of Independence
Author: Arthur Gretton, László Györfi
Abstract: Three simple and explicit procedures for testing the independence of two multi-dimensional random variables are described. Two of the associated test statistics (L1 , log-likelihood) are defined when the empirical distribution of the variables is restricted to finite partitions. A third test statistic is defined as a kernel-based independence measure. Two kinds of tests are provided. Distributionfree strong consistent tests are derived on the basis of large deviation bounds on the test statistics: these tests make almost surely no Type I or Type II error after a random sample size. Asymptotically α-level tests are obtained from the limiting distribution of the test statistics. For the latter tests, the Type I error converges to a fixed non-zero value α, and the Type II error drops to zero, for increasing sample size. All tests reject the null hypothesis of independence if the test statistics become large. The performance of the tests is evaluated experimentally on benchmark data. Keywords: hypothesis test, independence, L1, log-likelihood, kernel methods, distribution-free consistent test
same-paper 2 0.70603698 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding
Author: Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro
Abstract: Sparse coding—that is, modelling data vectors as sparse linear combinations of basis elements—is widely used in machine learning, neuroscience, signal processing, and statistics. This paper focuses on the large-scale matrix factorization problem that consists of learning the basis set in order to adapt it to specific data. Variations of this problem include dictionary learning in signal processing, non-negative matrix factorization and sparse principal component analysis. In this paper, we propose to address these tasks with a new online optimization algorithm, based on stochastic approximations, which scales up gracefully to large data sets with millions of training samples, and extends naturally to various matrix factorization formulations, making it suitable for a wide range of learning problems. A proof of convergence is presented, along with experiments with natural images and genomic data demonstrating that it leads to state-of-the-art performance in terms of speed and optimization for both small and large data sets. Keywords: basis pursuit, dictionary learning, matrix factorization, online learning, sparse coding, sparse principal component analysis, stochastic approximations, stochastic optimization, nonnegative matrix factorization
3 0.55851275 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
Author: Ming Yuan
Abstract: This paper considers the problem of estimating a high dimensional inverse covariance matrix that can be well approximated by “sparse” matrices. Taking advantage of the connection between multivariate linear regression and entries of the inverse covariance matrix, we propose an estimating procedure that can effectively exploit such “sparsity”. The proposed method can be computed using linear programming and therefore has the potential to be used in very high dimensional problems. Oracle inequalities are established for the estimation error in terms of several operator norms, showing that the method is adaptive to different types of sparsity of the problem. Keywords: covariance selection, Dantzig selector, Gaussian graphical model, inverse covariance matrix, Lasso, linear programming, oracle inequality, sparsity
4 0.55477357 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
Author: Shiliang Sun, John Shawe-Taylor
Abstract: In this paper, we propose a general framework for sparse semi-supervised learning, which concerns using a small portion of unlabeled data and a few labeled data to represent target functions and thus has the merit of accelerating function evaluations when predicting the output of a new example. This framework makes use of Fenchel-Legendre conjugates to rewrite a convex insensitive loss involving a regularization with unlabeled data, and is applicable to a family of semi-supervised learning methods such as multi-view co-regularized least squares and single-view Laplacian support vector machines (SVMs). As an instantiation of this framework, we propose sparse multi-view SVMs which use a squared ε-insensitive loss. The resultant optimization is an inf-sup problem and the optimal solutions have arguably saddle-point properties. We present a globally optimal iterative algorithm to optimize the problem. We give the margin bound on the generalization error of the sparse multi-view SVMs, and derive the empirical Rademacher complexity for the induced function class. Experiments on artificial and real-world data show their effectiveness. We further give a sequential training approach to show their possibility and potential for uses in large-scale problems and provide encouraging experimental results indicating the efficacy of the margin bound and empirical Rademacher complexity on characterizing the roles of unlabeled data for semi-supervised learning. Keywords: semi-supervised learning, Fenchel-Legendre conjugate, representer theorem, multiview regularization, support vector machine, statistical learning theory
5 0.55090582 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
Author: Pannagadatta K. Shivaswamy, Tony Jebara
Abstract: Leading classification methods such as support vector machines (SVMs) and their counterparts achieve strong generalization performance by maximizing the margin of separation between data classes. While the maximum margin approach has achieved promising performance, this article identifies its sensitivity to affine transformations of the data and to directions with large data spread. Maximum margin solutions may be misled by the spread of data and preferentially separate classes along large spread directions. This article corrects these weaknesses by measuring margin not in the absolute sense but rather only relative to the spread of data in any projection direction. Maximum relative margin corresponds to a data-dependent regularization on the classification function while maximum absolute margin corresponds to an ℓ2 norm constraint on the classification function. Interestingly, the proposed improvements only require simple extensions to existing maximum margin formulations and preserve the computational efficiency of SVMs. Through the maximization of relative margin, surprising performance gains are achieved on real-world problems such as digit, text classification and on several other benchmark data sets. In addition, risk bounds are derived for the new formulation based on Rademacher averages. Keywords: support vector machines, kernel methods, large margin, Rademacher complexity
6 0.54859954 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
7 0.54722083 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
8 0.5470984 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
9 0.54597664 82 jmlr-2010-On Learning with Integral Operators
10 0.54471081 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices
11 0.54413295 63 jmlr-2010-Learning Instance-Specific Predictive Models
12 0.54313248 43 jmlr-2010-Generalized Power Method for Sparse Principal Component Analysis
13 0.54309827 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values
14 0.54176688 109 jmlr-2010-Stochastic Composite Likelihood
15 0.54146957 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking
16 0.54107654 66 jmlr-2010-Linear Algorithms for Online Multitask Classification
17 0.53944254 17 jmlr-2010-Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing
18 0.53939307 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
19 0.53813374 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning
20 0.53683269 102 jmlr-2010-Semi-Supervised Novelty Detection