nips nips2013 nips2013-215 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yao-Liang Yu
Abstract: The proximal map is the key step in gradient-type algorithms, which have become prevalent in large-scale high-dimensional problems. For simple functions this proximal map is available in closed-form while for more complicated functions it can become highly nontrivial. Motivated by the need of combining regularizers to simultaneously induce different types of structures, this paper initiates a systematic investigation of when the proximal map of a sum of functions decomposes into the composition of the proximal maps of the individual summands. We not only unify a few known results scattered in the literature but also discover several new decompositions obtained almost effortlessly from our theory. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract The proximal map is the key step in gradient-type algorithms, which have become prevalent in large-scale high-dimensional problems. [sent-3, score-0.275]
2 For simple functions this proximal map is available in closed-form while for more complicated functions it can become highly nontrivial. [sent-4, score-0.291]
3 Motivated by the need of combining regularizers to simultaneously induce different types of structures, this paper initiates a systematic investigation of when the proximal map of a sum of functions decomposes into the composition of the proximal maps of the individual summands. [sent-5, score-0.536]
4 As real data become more and more complex, different types of regularizers, usually nonsmooth functions, have been designed. [sent-9, score-0.045]
5 Since many interesting regularizers are nonsmooth, they are harder to optimize numerically, especially in large-scale high-dimensional settings. [sent-11, score-0.039]
6 Thanks to recent advances [3–5], gradient-type algorithms have been generalized to take nonsmooth regularizers explicitly into account. [sent-12, score-0.068]
7 The key step of such gradient-type algorithms is to compute the proximal map (of the nonsmooth regularizer), which is available in closed-form for some specific regularizers. [sent-14, score-0.272]
8 However, the proximal map becomes highly nontrivial when we start to combine regularizers. [sent-15, score-0.261]
9 The main goal of this paper is to systematically investigate when the proximal map of a sum of functions decomposes into the composition of the proximal maps of the individual functions, which we simply term prox-decomposition. [sent-16, score-0.497]
10 Our motivation comes from a few known decomposition results scattered in the literature [6–8], all in the form of our interest. [sent-17, score-0.049]
11 After setting the context in Section 2, we motivate the decomposition rule with some justifications, as well as some cautionary results. [sent-20, score-0.056]
12 In Section 3.1, we study how “invariance” of the subdifferential of one function would lead to nontrivial prox-decompositions. [sent-22, score-0.141]
13 The generalization to cone invariance is considered in Section 3.4. [sent-25, score-0.082]
14 The symbol Id stands for the identity map and the extended real line R ∪ {∞} is denoted as R̄. [sent-29, score-0.064]
15 Throughout the paper we denote ∂f (x) as the subdifferential of the function f at point x. [sent-30, score-0.123]
16 For any closed convex proper function f : H → R̄, we define its Moreau envelop as [11] Mf(y) = min_{x∈H} ½‖x − y‖² + f(x), ∀y ∈ H, (1) and the related proximal map Pf(y) = argmin_{x∈H} ½‖x − y‖² + f(x). (2) [sent-33, score-0.316]
17 When f = ιC is the indicator of some closed convex set C, the proximal map reduces to the usual projection. [sent-36, score-0.295]
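To make definitions (1) and (2) and the projection remark concrete, here is a minimal sketch (our own illustration, not code from the paper; it assumes NumPy is available and the values of y and lam are hypothetical): the proximal map of λ‖·‖_1 is coordinatewise soft-thresholding, the proximal map of a box indicator is the Euclidean projection, and a brute-force grid search confirms the definition on a 1-d instance.

```python
import numpy as np

def prox_l1(y, lam):
    # Proximal map of f(x) = lam * ||x||_1: coordinatewise soft-thresholding.
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def prox_box(y, lo, hi):
    # Proximal map of the indicator of the box [lo, hi]^d: the Euclidean projection.
    return np.clip(y, lo, hi)

# Sanity check against definition (2): Pf(y) minimizes 0.5*||x - y||^2 + f(x).
# Brute-force a 1-d instance on a fine grid (y and lam are illustrative values).
y, lam = 1.3, 0.5
grid = np.linspace(-3.0, 3.0, 200001)
objective = 0.5 * (grid - y) ** 2 + lam * np.abs(grid)
assert abs(grid[np.argmin(objective)] - prox_l1(np.array([y]), lam)[0]) < 1e-3
print(prox_l1(np.array([y]), lam), prox_box(np.array([y]), 0.0, 1.0))
```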
18 P_{f+g} = Pf ◦ Pg = Pg ◦ Pf, (5) where f, g ∈ Γ0, the set of all closed convex proper functions on H, and f ◦ g denotes the mapping composition. [sent-44, score-0.068]
19 Moreau [11] proved that P : H → H is a proximal map iff it is nonexpansive and it is the subdifferential of some convex function in Γ0. [sent-54, score-0.452]
20 Although the latter condition in general is not easy to verify, it reduces to P being monotonically increasing when H = R (note that P must be continuous). [sent-55, score-0.045]
21 In a general Hilbert space H, we again easily conclude that the composition Pf ◦ Pg is always a nonexpansion, which means that it is “close” to being a proximal map. [sent-57, score-0.211]
22 This justifies the composition Pf ◦ Pg as a candidate for the decomposition of P_{f+g}. [sent-58, score-0.063]
23 The proximal maps in this case are simply projections: Pf(x) = ((x1 + x2)/2, (x1 + x2)/2) and Pg(x) = (x1, 0). [sent-65, score-0.194]
24 We easily verify that the inequality ‖Pf(Pg(x)) − Pf(Pg(y))‖² ≤ ⟨Pf(Pg(x)) − Pf(Pg(y)), x − y⟩ is not always true, a contradiction if Pf ◦ Pg were a proximal map [11, Eq. [sent-67, score-0.264]
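The failure is easy to reproduce numerically; a small sketch following the projections stated above (our own code, assuming NumPy; the specific points x and y are illustrative):

```python
import numpy as np

# f = indicator of the diagonal {x1 = x2}, g = indicator of the axis {x2 = 0}.
Pf = lambda x: np.array([(x[0] + x[1]) / 2.0, (x[0] + x[1]) / 2.0])  # project onto the diagonal
Pg = lambda x: np.array([x[0], 0.0])                                 # project onto the axis
T = lambda x: Pf(Pg(x))                                              # candidate composition

x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])
lhs = np.sum((T(x) - T(y)) ** 2)      # ||T(x) - T(y)||^2 = 0.5
rhs = np.dot(T(x) - T(y), x - y)      # <T(x) - T(y), x - y> = 0.0
print(lhs <= rhs)                     # False: firm nonexpansiveness fails, so T is no proximal map

# T also differs from the true prox of f + g: the two sets intersect only at 0,
# so P_{f+g} is identically 0, while T(x) = (x1/2, x1/2) in general.
print(T(x))
```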
25 Nevertheless, as we will see, the equality in (5) does hold in many scenarios, and an interesting theory can be suitably developed. [sent-76, score-0.039]
26 Using the first order optimality condition and the definition of the proximal map (2), we have 0 ∈ P_{f+g}(y) − y + ∂(f + g)(P_{f+g}(y)), 0 ∈ Pg(y) − y + ∂g(Pg(y)), and 0 ∈ Pf(Pg(y)) − Pg(y) + ∂f(Pf(Pg(y))). [sent-80, score-0.288]
27 Then by (9) and the subdifferential rule ∂(f + g) ⊇ ∂f + ∂g we verify that Pf(Pg(y)) satisfies (6), hence P_{f+g} = Pf ◦ Pg follows since the proximal map is single-valued. [sent-85, score-0.404]
28 We note that a special form of our sufficient condition has appeared in the proof of [8, Theorem 1], whose main result also follows immediately from our Theorem 4 below. [sent-86, score-0.085]
29 The condition ∂(g1 + g2) = ∂g1 + ∂g2 in Proposition 2 is purely technical; it is satisfied when, say, g1 is continuous at a single, arbitrary point in dom g1 ∩ dom g2. [sent-91, score-0.311]
30 In fact, they are the only differentiable functions that are s. [sent-115, score-0.05]
31 Footnote 2: A gauge is a positively homogeneous convex function that vanishes at the origin. [sent-131, score-0.113]
32 On the other hand, if f, g are differentiable, then we actually have equality in (10), which is clearly necessary in this case. [sent-137, score-0.045]
33 Since convex functions are almost everywhere differentiable (in the interior of their domain), we expect the sufficient condition (10) to be necessary “almost everywhere” too. [sent-138, score-0.121]
34 Thus we see that the key for the decomposition (5) to hold is to let the proximal map of f and the subdifferential of g “interact well” in the sense of (10). [sent-139, score-0.411]
35 The second follows from the optimality condition Pf = (Id + ∂f)^{−1}. [sent-145, score-0.045]
36 Therefore some properties of the proximal map will transfer to some properties of the function f itself, and vice versa. [sent-148, score-0.243]
37 Pf(Ux) = UPf(x) for all unitary U iff f(Ux) = f(x) for all unitary U; [sent-153, score-0.071]
38 iii). Pf(Qx) = QPf(x) for all permutation Q (under some fixed basis) iff f is permutation invariant, that is, f(Qx) = f(x) for all permutation Q. [sent-154, score-0.126]
39 In the following, we will put some invariance assumptions on the subdifferential of g and accordingly find the right family of f whose proximal map “respects” that invariance. [sent-155, score-0.422]
40 This way we will meet (10) by construction and therefore effortlessly have the decomposition (5). [sent-156, score-0.05]
41 3.2 No Invariance. To begin with, consider first the trivial case where no invariance on the subdifferential of g is assumed. [sent-158, score-0.193]
42 P_{f+g} = Pf ◦ Pg for all g ∈ Γ0 if and only if either • dim(H) ≥ 2 and f ≡ c or f = ι_{w} + c for some c ∈ R and w ∈ H; or • dim(H) = 1 and f = ι_C + c for some closed and convex set C and c ∈ R. [sent-163, score-0.052]
43 Since x ∈ dom f is arbitrary, f is constant on its domain. [sent-168, score-0.139]
44 We consider the other case where dim(H) ≥ 2 and dom f contains at least two points. [sent-170, score-0.139]
45 If dom f ≠ H, there exists z ∉ dom f such that Pf(z) = y for some y ∈ dom f, and a closed convex set C with C ∩ dom f ≠ ∅ and y ∉ C ∋ z. [sent-171, score-0.608]
46 Letting g = ι_C, we obtain P_{f+g}(z) ∈ C ∩ dom f while Pf(Pg(z)) = Pf(z) = y ∉ C, a contradiction. [sent-172, score-0.139]
47 Observe that the decomposition (5) is not symmetric in f and g, which is also reflected in the next result, Theorem 3. [sent-173, score-0.048]
48 P_{f+g} = Pf ◦ Pg for all f ∈ Γ0 iff g is a continuous affine function. [sent-175, score-0.039]
49 Naturally, the next step is to put invariance assumptions on the subdifferential of g, effectively restricting the function class of g. [sent-183, score-0.179]
50 3.3 Scaling Invariance. The first invariance property we consider is scaling-invariance. [sent-186, score-0.056]
51 What kind of convex functions have their subdifferential invariant to (positive) scaling? [sent-187, score-0.188]
52 Assuming 0 ∈ dom g, by simple integration, g(tx) − g(0) = ∫_0^t ⟨∂g(sx), x⟩ ds = t · [g(x) − g(0)], where the last equality follows from the scaling invariance of the subdifferential of g. [sent-188, score-0.356]
53 (automatically 0 ∈ dom g), then from definition we verify that ∂g is scaling-invariant. [sent-194, score-0.16]
54 Consequently, the right function class for f is to have the proximal map Pf(x) = λ · x for some λ ∈ [0, 1] that may depend on x as well (see Footnote 3). [sent-198, score-0.243]
55 When dim(H) = 1, ii) is equivalent to requiring f to attain its minimum at 0, in which case the implication ii) =⇒ iv), under the redundant condition that f is differentiable, was proved by Combettes and Pesquet [14, Proposition 3. [sent-212, score-0.058]
56 Therefore there exists λ ∈ [0, 1] such that λu minimizes the Moreau envelop Mf, hence we have Pf(u) = λu due to uniqueness. [sent-220, score-0.038]
57 iii) =⇒ iv): Note first that iii) implies 0 ∈ ∂f (0), therefore 0 ∈ dom f . [sent-221, score-0.154]
58 Since the subdifferential of κ is scaling-invariant, iii) implies the sufficient condition (10) hence iv). [sent-222, score-0.188]
59 iv) =⇒ iii): Fix y and construct the gauge function κ(z) = 0 if z = λ · y for some λ ≥ 0, and κ(z) = ∞ otherwise. [sent-223, score-0.066]
60 On the other hand, M_{f+κ}(y) = min_x ½‖x − y‖² + f(x) + κ(x) = min_{λ≥0} ½‖λy − y‖² + f(λy). (Footnote 3: note that λ ≤ 1 is necessary since any proximal map is nonexpansive.) [sent-225, score-0.045]
61 iii) =⇒ ii): First note that iii) implies that Pf (0) = 0 hence 0 ∈ ∂f (0), in particular, 0 ∈ dom f . [sent-231, score-0.171]
62 Using the first order optimality condition for the proximal map we have 0 ∈ λx − x + ∂f(λx), that is, (1/λ − 1)y ∈ ∂f(y) for each y ∈ ran(Pf) = H due to our assumption dom f = H. [sent-238, score-0.427]
63 For the case when dom f ⊂ H, we consider the proximal average [16] g = A(f, q) = [(½(f* + q)* + ¼ q)* − q]*, (12) where q = ½‖·‖².
64 Importantly, since q is defined on the whole space, the proximal average g has full domain too [16, Corollary 4.
65 It is easy to check that i) is preserved under taking the Fenchel conjugation (note that the convexity of f implies that of h). [sent-245, score-0.041]
66 Let dim(H) ≥ 2, C ⊆ H be a closed convex set that contains the origin. [sent-252, score-0.052]
67 Then the projection onto C is simply a shrinkage towards the origin iff C is a ball (of the norm ‖·‖). [sent-253, score-0.069]
68 In many applications, in addition to the regularizer κ (usually a gauge), one adds the ℓ2 regularizer λq, either for stability, a grouping effect, or strong convexity. [sent-257, score-0.064]
69 This incurs no computational cost in the sense of computing the proximal map: we easily compute that P_{λq} = (1/(λ+1)) Id. [sent-258, score-0.179]
70 By Theorem 4, for any gauge κ, P_{κ+λq} = (1/(λ+1)) Pκ, whence it is also clear that adding an extra ℓ2 regularizer tends to double “shrink” the solution. [sent-259, score-0.098]
71 In particular, letting H = R^d and taking κ = ‖·‖_1 (the sum of absolute values), we recover the proximal map for the elastic-net regularizer [17]. [sent-260, score-0.286]
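A hedged sketch of this special case (our own code with illustrative weights lam1 and lam2, assuming NumPy): soft-thresholding followed by the scaling Id/(1+λ) reproduces the elastic-net proximal map, and a brute-force grid check on one coordinate agrees.

```python
import numpy as np

def soft_threshold(y, lam):
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def prox_elastic_net(y, lam1, lam2):
    # Prox of lam1*||x||_1 + (lam2/2)*||x||^2 as the composition P_{lam2 q} o P_{lam1 ||.||_1}.
    return soft_threshold(y, lam1) / (1.0 + lam2)

# Brute-force check of one coordinate against the definition of the proximal map.
y, lam1, lam2 = 2.0, 0.7, 0.3
grid = np.linspace(-4.0, 4.0, 400001)
obj = 0.5 * (grid - y) ** 2 + lam1 * np.abs(grid) + 0.5 * lam2 * grid ** 2
assert abs(grid[np.argmin(obj)] - prox_elastic_net(np.array([y]), lam1, lam2)[0]) < 1e-3
```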
72 function f ∈ Γ0 satisfies any item of Theorem 4 iff it is a positive multiple of the norm ‖·‖. [sent-271, score-0.069]
73 convex function f that satisfies P_{f+κ} = Pf ◦ Pκ for all gauges κ. [sent-275, score-0.092]
74 Then P_{∑_{i=1}^m ‖·‖_{g_i}} = P_{‖·‖_{g_1}} ◦ ··· ◦ P_{‖·‖_{g_m}}, where we arrange the groups so that g_i ⊂ g_j =⇒ i > j, and the notation ‖·‖_{g_i} denotes the Hilbertian norm restricted to the coordinates indexed by the group g_i. [sent-283, score-0.342]
75 · gi denotes the m Proof: Let f = · g1 and κ = i=2 · gi . [sent-284, score-0.14]
76 By the tree-structured assumption we can partition κ = κ1 + κ2 , where gi ⊂ g1 for all gi appearing in κ1 while gj ∩ g1 = ∅ for all gj appearing in κ2 . [sent-288, score-0.224]
77 On the other hand, due to the non-overlapping property, nothing will be affected by adding κ2, thus P_{∑_{i=1}^m ‖·‖_{g_i}} = P_{‖·‖_{g_1}} ◦ P_{∑_{i=2}^m ‖·‖_{g_i}}. [sent-291, score-0.14]
78 We can clearly iterate the argument to unravel the proximal map as claimed. [sent-292, score-0.263]
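As an illustration of Corollary 3, the following sketch (our own code, not the paper's implementation; the toy groups and weights are hypothetical) composes the group-wise block soft-thresholding operators, processing every group before any group that contains it, which realizes the stated ordering g_i ⊂ g_j =⇒ i > j since the right-most factors of the composition are applied first.

```python
import numpy as np

def block_soft_threshold(x, group, lam):
    # Prox of lam * ||x_group||_2; identity on the remaining coordinates.
    z = x.copy()
    norm = np.linalg.norm(x[group])
    z[group] = 0.0 if norm <= lam else (1.0 - lam / norm) * x[group]
    return z

def prox_tree_group_lasso(y, groups, weights):
    # Compose the group-wise proximal maps, always processing a group before any
    # group that contains it; sorting by group size realizes this for a tree,
    # since a strict subset has strictly fewer coordinates.
    x = y.copy()
    for g, lam in sorted(zip(groups, weights), key=lambda gw: len(gw[0])):
        x = block_soft_threshold(x, list(g), lam)
    return x

# Toy tree-structured groups (hypothetical): root {0,1,2} with children {1} and {2}.
y = np.array([3.0, 2.0, -0.5])
print(prox_tree_group_lasso(y, [{0, 1, 2}, {1}, {2}], [1.0, 1.0, 1.0]))
```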
79 For notational clarity, we have chosen not to incorporate weights in the sum of group seminorms: such weights can be absorbed into the seminorms and the corollary clearly remains intact. [sent-293, score-0.088]
80 Our proof also reveals the fundamental reason why Corollary 3 is true: the ℓ2 norm admits the decomposition (5) for any gauge g! [sent-294, score-0.153]
81 3.4 Cone Invariance. In the previous subsection, we restricted the subdifferential of g to be constant along each ray. [sent-297, score-0.123]
82 Specifically, consider the gauge function κ(x) = max_{j∈J} ⟨a_j, x⟩, (13) where J is a finite index set and each a_j ∈ H. [sent-299, score-0.116]
83 Such polyhedral gauge functions have become extremely important due to the work of Chandrasekaran et al. [sent-300, score-0.115]
84 Define the polyhedral cones Kj = {x ∈ H : aj , x = κ(x)}. [sent-302, score-0.042]
85 (15) In other words, each cone Kj is “fixed” under the proximal map of f . [sent-305, score-0.269]
86 Denote by E a collection of pairs (m, n), and define the total variation norm ‖x‖_tv = ∑_{{m,n}∈E} w_{m,n} |x_m − x_n|, where w_{m,n} ≥ 0. [sent-308, score-0.068]
87 Then for any permutation invariant function f, P_{f + ‖·‖_tv} = Pf ◦ P_{‖·‖_tv}. [sent-309, score-0.094]
88 Since f is permutation invariant, its proximal map Pf (x) maintains the order of x, hence we establish (15). [sent-312, score-0.289]
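Corollary 4 can be checked numerically on a single edge; in the sketch below (our own code, assuming NumPy and SciPy; the closed form for the two-point TV prox comes from a short mean/difference calculation, not from the paper), composing soft-thresholding with the TV proximal map matches a direct minimization of the summed objective.

```python
import numpy as np
from scipy.optimize import minimize

lam, w = 0.5, 0.4                     # illustrative weights
y = np.array([3.0, 1.0])

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_tv_pair(v, t):
    # Prox of t*|v1 - v2| (single-edge TV): the mean is kept and the
    # difference is soft-thresholded at 2*t (mean/difference calculation).
    m, d = v.mean(), soft(v[0] - v[1], 2.0 * t)
    return np.array([m + d / 2.0, m - d / 2.0])

composed = soft(prox_tv_pair(y, w), lam)   # Pf o P_tv with f = lam*||.||_1 (permutation invariant)

obj = lambda x: 0.5 * np.sum((x - y) ** 2) + lam * np.sum(np.abs(x)) + w * np.abs(x[0] - x[1])
direct = minimize(obj, y, method="Nelder-Mead", options={"xatol": 1e-8, "fatol": 1e-12}).x
print(composed, direct)                    # both close to [2.1, 0.9]
```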
89 We call the permutation invariant function f symmetric if ∀x, f(|x|) = f(x), where |·| denotes the componentwise absolute value. [sent-320, score-0.069]
90 The proof for the next corollary is almost the same as that of Corollary 4, except that we also use the fact sign([Pf (x)]m ) = sign(xm ) for symmetric functions. [sent-321, score-0.111]
91 As in Corollary 4, define the norm ‖x‖_oct = ∑_{{m,n}∈E} w_{m,n} max{|x_m|, |x_n|}. [sent-323, score-0.112]
92 Then for any symmetric function f, P_{f + ‖·‖_oct} = Pf ◦ P_{‖·‖_oct}. [sent-324, score-0.181]
93 Footnote 4: All we need is the weaker condition: for all {m, n} ∈ E, x_m ≥ x_n =⇒ [Pf(x)]_m ≥ [Pf(x)]_n. [sent-325, score-0.071]
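A similar numerical check for Corollary 5 on a single edge (again our own sketch, assuming NumPy and SciPy; the max-norm prox is obtained here via the Moreau decomposition and an ℓ1-ball projection, which is our choice and not the algorithm of [22]):

```python
import numpy as np
from scipy.optimize import minimize

lam, w = 0.5, 1.0                     # illustrative weights
y = np.array([3.0, 1.0])

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proj_l1_ball(v, r):
    # Euclidean projection onto {z : ||z||_1 <= r} (standard sort-based routine).
    if np.sum(np.abs(v)) <= r:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > css - r)[0][-1]
    theta = (css[rho] - r) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def prox_oct_pair(v, t):
    # Prox of t*max(|v1|, |v2|): the max-norm is the support function of the
    # l1 ball, so by Moreau decomposition the prox is v minus a projection.
    return v - proj_l1_ball(v, t)

composed = soft(prox_oct_pair(y, w), lam)  # Pf o P_oct with f = lam*||.||_1 (symmetric)

obj = lambda x: 0.5 * np.sum((x - y) ** 2) + lam * np.sum(np.abs(x)) + w * np.max(np.abs(x))
direct = minimize(obj, y, method="Nelder-Mead", options={"xatol": 1e-8, "fatol": 1e-12}).x
print(composed, direct)                    # both close to [1.5, 0.5]
```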
94 This norm · oct is proposed in [21] for feature grouping. [sent-327, score-0.112]
95 The proximal map P · oct is derived in [22], which turns out to be another decomposition result. [sent-329, score-0.356]
96 The proof is by recursion: Write · oct = f + g, where f = κ|I| . [sent-337, score-0.108]
97 Note that the subdifferential of g depends only on the ordering and sign of the first |I| − 1 coordinates while the proximal map of f preserves the ordering and sign of the first |I| − 1 coordinates (due to symmetry). [sent-338, score-0.456]
98 If we pre-sort x, the individual proximal maps Pκi (x) become easy to compute sequentially and we recover the algorithm in [22] with some bookkeeping. [sent-339, score-0.233]
99 As in Corollary 3, let G ⊆ 2^I be a collection of tree-structured groups; then P_{∑_{i=1}^m ‖·‖_{g_i,k}} = P_{‖·‖_{g_1,k}} ◦ ··· ◦ P_{‖·‖_{g_m,k}}, where we arrange the groups so that g_i ⊂ g_j =⇒ i > j, and ‖x‖_{g_i,k} = ∑_{j=1}^k |x_{g_i}|_{[j]} is the sum of the k (absolute-value) largest elements in the group g_i, i.e.
100 Therefore · as symmetric and the rest follows the proof of Corollary 5. [sent-349, score-0.043]
wordName wordTfidf (topN-words)
[('pf', 0.81), ('pg', 0.399), ('proximal', 0.179), ('dom', 0.139), ('subdifferential', 0.123), ('dim', 0.114), ('oct', 0.082), ('mf', 0.077), ('gi', 0.07), ('corollary', 0.068), ('gauge', 0.066), ('map', 0.064), ('kf', 0.059), ('invariance', 0.056), ('ph', 0.055), ('iii', 0.05), ('moreau', 0.045), ('kj', 0.042), ('iv', 0.041), ('hilbertian', 0.041), ('xm', 0.04), ('proposition', 0.039), ('regularizers', 0.039), ('iff', 0.039), ('ii', 0.038), ('differentiable', 0.034), ('condition', 0.033), ('composition', 0.032), ('regularizer', 0.032), ('decomposition', 0.031), ('gj', 0.03), ('norm', 0.03), ('permutation', 0.029), ('nonsmooth', 0.029), ('pq', 0.026), ('appeared', 0.026), ('cone', 0.026), ('sign', 0.026), ('fix', 0.026), ('closed', 0.026), ('proof', 0.026), ('convex', 0.026), ('id', 0.025), ('cautionary', 0.025), ('pnf', 0.025), ('aj', 0.025), ('implication', 0.025), ('equality', 0.025), ('theorem', 0.024), ('invariant', 0.023), ('verify', 0.021), ('tv', 0.021), ('sx', 0.021), ('envelop', 0.021), ('arrange', 0.021), ('qx', 0.021), ('nonexpansive', 0.021), ('homogeneous', 0.021), ('clearly', 0.02), ('fenchel', 0.019), ('yaoliang', 0.019), ('effortlessly', 0.019), ('remark', 0.019), ('satis', 0.019), ('coordinates', 0.019), ('pmf', 0.018), ('nontrivial', 0.018), ('scattered', 0.018), ('symmetric', 0.017), ('hence', 0.017), ('xn', 0.017), ('polyhedral', 0.017), ('become', 0.016), ('unitary', 0.016), ('groups', 0.016), ('prevalent', 0.016), ('gm', 0.016), ('falling', 0.016), ('functions', 0.016), ('ei', 0.015), ('maps', 0.015), ('implies', 0.015), ('convexity', 0.014), ('es', 0.014), ('trivial', 0.014), ('weaker', 0.014), ('hold', 0.014), ('ds', 0.013), ('everywhere', 0.012), ('decomposes', 0.012), ('appearing', 0.012), ('numerically', 0.012), ('formula', 0.012), ('easy', 0.012), ('optimality', 0.012), ('af', 0.012), ('nition', 0.012), ('orthonormal', 0.012), ('importantly', 0.012), ('recover', 0.011)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 215 nips-2013-On Decomposing the Proximal Map
Author: Yao-Liang Yu
Abstract: The proximal map is the key step in gradient-type algorithms, which have become prevalent in large-scale high-dimensional problems. For simple functions this proximal map is available in closed-form while for more complicated functions it can become highly nontrivial. Motivated by the need of combining regularizers to simultaneously induce different types of structures, this paper initiates a systematic investigation of when the proximal map of a sum of functions decomposes into the composition of the proximal maps of the individual summands. We not only unify a few known results scattered in the literature but also discover several new decompositions obtained almost effortlessly from our theory. 1
2 0.28971586 56 nips-2013-Better Approximation and Faster Algorithm Using the Proximal Average
Author: Yao-Liang Yu
Abstract: It is a common practice to approximate “complicated” functions with more friendly ones. In large-scale machine learning applications, nonsmooth losses/regularizers that entail great computational challenges are usually approximated by smooth functions. We re-examine this powerful methodology and point out a nonsmooth approximation which simply pretends the linearity of the proximal map. The new approximation is justified using a recent convex analysis tool— proximal average, and yields a novel proximal gradient algorithm that is strictly better than the one based on smoothing, without incurring any extra overhead. Numerical experiments conducted on two important applications, overlapping group lasso and graph-guided fused lasso, corroborate the theoretical claims. 1
3 0.1037762 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA
Author: Vincent Q. Vu, Juhee Cho, Jing Lei, Karl Rohe
Abstract: We propose a novel convex relaxation of sparse principal subspace estimation based on the convex hull of rank-d projection matrices (the Fantope). The convex problem can be solved efficiently using alternating direction method of multipliers (ADMM). We establish a near-optimal convergence rate, in terms of the sparsity, ambient dimension, and sample size, for estimation of the principal subspace of a general covariance matrix without assuming the spiked covariance model. In the special case of d = 1, our result implies the near-optimality of DSPCA (d’Aspremont et al. [1]) even when the solution is not rank 1. We also provide a general theoretical framework for analyzing the statistical properties of the method for arbitrary input matrices that extends the applicability and provable guarantees to a wide array of settings. We demonstrate this with an application to Kendall’s tau correlation matrices and transelliptical component analysis. 1
4 0.078594781 222 nips-2013-On the Linear Convergence of the Proximal Gradient Method for Trace Norm Regularization
Author: Ke Hou, Zirui Zhou, Anthony Man-Cho So, Zhi-Quan Luo
Abstract: Motivated by various applications in machine learning, the problem of minimizing a convex smooth loss function with trace norm regularization has received much attention lately. Currently, a popular method for solving such problem is the proximal gradient method (PGM), which is known to have a sublinear rate of convergence. In this paper, we show that for a large class of loss functions, the convergence rate of the PGM is in fact linear. Our result is established without any strong convexity assumption on the loss function. A key ingredient in our proof is a new Lipschitzian error bound for the aforementioned trace norm–regularized problem, which may be of independent interest.
5 0.074286282 291 nips-2013-Sensor Selection in High-Dimensional Gaussian Trees with Nuisances
Author: Daniel S. Levine, Jonathan P. How
Abstract: We consider the sensor selection problem on multivariate Gaussian distributions where only a subset of latent variables is of inferential interest. For pairs of vertices connected by a unique path in the graph, we show that there exist decompositions of nonlocal mutual information into local information measures that can be computed efficiently from the output of message passing algorithms. We integrate these decompositions into a computationally efficient greedy selector where the computational expense of quantification can be distributed across nodes in the network. Experimental results demonstrate the comparative efficiency of our algorithms for sensor selection in high-dimensional distributions. We additionally derive an online-computable performance bound based on augmentations of the relevant latent variable set that, when such a valid augmentation exists, is applicable for any distribution with nuisances. 1
6 0.062755518 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models
7 0.058943778 72 nips-2013-Convex Calibrated Surrogates for Low-Rank Loss Matrices with Applications to Subset Ranking Losses
8 0.056949258 249 nips-2013-Polar Operators for Structured Sparse Estimation
9 0.054732662 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting
10 0.051691238 147 nips-2013-Lasso Screening Rules via Dual Polytope Projection
11 0.046056911 268 nips-2013-Reflection methods for user-friendly submodular optimization
12 0.042880118 128 nips-2013-Generalized Method-of-Moments for Rank Aggregation
13 0.040598962 313 nips-2013-Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization
14 0.040073585 91 nips-2013-Dirty Statistical Models
15 0.036699496 202 nips-2013-Multiclass Total Variation Clustering
16 0.03538752 218 nips-2013-On Sampling from the Gibbs Distribution with Random Maximum A-Posteriori Perturbations
17 0.035188917 301 nips-2013-Sparse Additive Text Models with Low Rank Background
18 0.03312847 170 nips-2013-Learning with Invariance via Linear Functionals on Reproducing Kernel Hilbert Space
19 0.033046279 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints
20 0.031031936 39 nips-2013-Approximate Gaussian process inference for the drift function in stochastic differential equations
topicId topicWeight
[(0, 0.08), (1, 0.026), (2, 0.056), (3, 0.048), (4, 0.003), (5, 0.045), (6, -0.07), (7, -0.025), (8, 0.002), (9, -0.031), (10, 0.008), (11, 0.041), (12, -0.005), (13, -0.104), (14, -0.059), (15, -0.006), (16, 0.028), (17, -0.01), (18, 0.044), (19, 0.016), (20, -0.014), (21, 0.057), (22, 0.078), (23, 0.076), (24, -0.007), (25, -0.093), (26, 0.083), (27, 0.028), (28, 0.054), (29, 0.221), (30, -0.17), (31, 0.078), (32, -0.036), (33, -0.08), (34, -0.127), (35, -0.112), (36, 0.016), (37, 0.141), (38, 0.051), (39, 0.039), (40, 0.056), (41, -0.016), (42, -0.019), (43, -0.03), (44, -0.008), (45, 0.09), (46, 0.016), (47, -0.142), (48, -0.139), (49, 0.074)]
simIndex simValue paperId paperTitle
same-paper 1 0.97042179 215 nips-2013-On Decomposing the Proximal Map
Author: Yao-Liang Yu
Abstract: The proximal map is the key step in gradient-type algorithms, which have become prevalent in large-scale high-dimensional problems. For simple functions this proximal map is available in closed-form while for more complicated functions it can become highly nontrivial. Motivated by the need of combining regularizers to simultaneously induce different types of structures, this paper initiates a systematic investigation of when the proximal map of a sum of functions decomposes into the composition of the proximal maps of the individual summands. We not only unify a few known results scattered in the literature but also discover several new decompositions obtained almost effortlessly from our theory. 1
2 0.87277132 56 nips-2013-Better Approximation and Faster Algorithm Using the Proximal Average
Author: Yao-Liang Yu
Abstract: It is a common practice to approximate “complicated” functions with more friendly ones. In large-scale machine learning applications, nonsmooth losses/regularizers that entail great computational challenges are usually approximated by smooth functions. We re-examine this powerful methodology and point out a nonsmooth approximation which simply pretends the linearity of the proximal map. The new approximation is justified using a recent convex analysis tool— proximal average, and yields a novel proximal gradient algorithm that is strictly better than the one based on smoothing, without incurring any extra overhead. Numerical experiments conducted on two important applications, overlapping group lasso and graph-guided fused lasso, corroborate the theoretical claims. 1
3 0.65185547 249 nips-2013-Polar Operators for Structured Sparse Estimation
Author: Xinhua Zhang, Yao-Liang Yu, Dale Schuurmans
Abstract: Structured sparse estimation has become an important technique in many areas of data analysis. Unfortunately, these estimators normally create computational difficulties that entail sophisticated algorithms. Our first contribution is to uncover a rich class of structured sparse regularizers whose polar operator can be evaluated efficiently. With such an operator, a simple conditional gradient method can then be developed that, when combined with smoothing and local optimization, significantly reduces training time vs. the state of the art. We also demonstrate a new reduction of polar to proximal maps that enables more efficient latent fused lasso. 1
4 0.44222721 222 nips-2013-On the Linear Convergence of the Proximal Gradient Method for Trace Norm Regularization
Author: Ke Hou, Zirui Zhou, Anthony Man-Cho So, Zhi-Quan Luo
Abstract: Motivated by various applications in machine learning, the problem of minimizing a convex smooth loss function with trace norm regularization has received much attention lately. Currently, a popular method for solving such problem is the proximal gradient method (PGM), which is known to have a sublinear rate of convergence. In this paper, we show that for a large class of loss functions, the convergence rate of the PGM is in fact linear. Our result is established without any strong convexity assumption on the loss function. A key ingredient in our proof is a new Lipschitzian error bound for the aforementioned trace norm–regularized problem, which may be of independent interest.
5 0.43225023 202 nips-2013-Multiclass Total Variation Clustering
Author: Xavier Bresson, Thomas Laurent, David Uminsky, James von Brecht
Abstract: Ideas from the image processing literature have recently motivated a new set of clustering algorithms that rely on the concept of total variation. While these algorithms perform well for bi-partitioning tasks, their recursive extensions yield unimpressive results for multiclass clustering tasks. This paper presents a general framework for multiclass total variation clustering that does not rely on recursion. The results greatly outperform previous total variation algorithms and compare well with state-of-the-art NMF approaches. 1
6 0.43139893 313 nips-2013-Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization
7 0.43068504 72 nips-2013-Convex Calibrated Surrogates for Low-Rank Loss Matrices with Applications to Subset Ranking Losses
8 0.41588658 328 nips-2013-The Total Variation on Hypergraphs - Learning on Hypergraphs Revisited
9 0.37665302 128 nips-2013-Generalized Method-of-Moments for Rank Aggregation
10 0.36491278 268 nips-2013-Reflection methods for user-friendly submodular optimization
11 0.31176627 131 nips-2013-Geometric optimisation on positive definite matrices for elliptically contoured distributions
12 0.27116504 301 nips-2013-Sparse Additive Text Models with Low Rank Background
13 0.26862407 152 nips-2013-Learning Efficient Random Maximum A-Posteriori Predictors with Non-Decomposable Loss Functions
14 0.26109275 265 nips-2013-Reconciling "priors" & "priors" without prejudice?
15 0.26067236 321 nips-2013-Supervised Sparse Analysis and Synthesis Operators
16 0.25391906 52 nips-2013-Bayesian inference as iterated random functions with applications to sequential inference in graphical models
17 0.24960621 170 nips-2013-Learning with Invariance via Linear Functionals on Reproducing Kernel Hilbert Space
18 0.24318898 257 nips-2013-Projected Natural Actor-Critic
19 0.22987956 303 nips-2013-Sparse Overlapping Sets Lasso for Multitask Learning and its Application to fMRI Analysis
20 0.22560917 117 nips-2013-Fast Algorithms for Gaussian Noise Invariant Independent Component Analysis
topicId topicWeight
[(16, 0.02), (33, 0.117), (34, 0.072), (41, 0.04), (49, 0.034), (56, 0.106), (70, 0.065), (85, 0.03), (89, 0.038), (93, 0.333), (95, 0.023)]
simIndex simValue paperId paperTitle
1 0.97316766 339 nips-2013-Understanding Dropout
Author: Pierre Baldi, Peter J. Sadowski
Abstract: Dropout is a relatively new algorithm for training neural networks which relies on stochastically “dropping out” neurons during training in order to avoid the co-adaptation of feature detectors. We introduce a general formalism for studying dropout on either units or connections, with arbitrary probability values, and use it to analyze the averaging and regularizing properties of dropout in both linear and non-linear networks. For deep neural networks, the averaging properties of dropout are characterized by three recursive equations, including the approximation of expectations by normalized weighted geometric means. We provide estimates and bounds for these approximations and corroborate the results with simulations. Among other results, we also show how dropout performs stochastic gradient descent on a regularized error function. 1
2 0.94268233 65 nips-2013-Compressive Feature Learning
Author: Hristo S. Paskov, Robert West, John C. Mitchell, Trevor Hastie
Abstract: This paper addresses the problem of unsupervised feature learning for text data. Our method is grounded in the principle of minimum description length and uses a dictionary-based compression scheme to extract a succinct feature set. Specifically, our method finds a set of word k-grams that minimizes the cost of reconstructing the text losslessly. We formulate document compression as a binary optimization task and show how to solve it approximately via a sequence of reweighted linear programs that are efficient to solve and parallelizable. As our method is unsupervised, features may be extracted once and subsequently used in a variety of tasks. We demonstrate the performance of these features over a range of scenarios including unsupervised exploratory analysis and supervised text categorization. Our compressed feature space is two orders of magnitude smaller than the full k-gram space and matches the text categorization accuracy achieved in the full feature space. This dimensionality reduction not only results in faster training times, but it can also help elucidate structure in unsupervised learning tasks and reduce the amount of training data necessary for supervised learning. 1
3 0.9403491 211 nips-2013-Non-Linear Domain Adaptation with Boosting
Author: Carlos J. Becker, Christos M. Christoudias, Pascal Fua
Abstract: A common assumption in machine vision is that the training and test samples are drawn from the same distribution. However, there are many problems when this assumption is grossly violated, as in bio-medical applications where different acquisitions can generate drastic variations in the appearance of the data due to changing experimental conditions. This problem is accentuated with 3D data, for which annotation is very time-consuming, limiting the amount of data that can be labeled in new acquisitions for training. In this paper we present a multitask learning algorithm for domain adaptation based on boosting. Unlike previous approaches that learn task-specific decision boundaries, our method learns a single decision boundary in a shared feature space, common to all tasks. We use the boosting-trick to learn a non-linear mapping of the observations in each task, with no need for specific a-priori knowledge of its global analytical form. This yields a more parameter-free domain adaptation approach that successfully leverages learning on new tasks where labeled data is scarce. We evaluate our approach on two challenging bio-medical datasets and achieve a significant improvement over the state of the art. 1
4 0.92074543 146 nips-2013-Large Scale Distributed Sparse Precision Estimation
Author: Huahua Wang, Arindam Banerjee, Cho-Jui Hsieh, Pradeep Ravikumar, Inderjit Dhillon
Abstract: We consider the problem of sparse precision matrix estimation in high dimensions using the CLIME estimator, which has several desirable theoretical properties. We present an inexact alternating direction method of multiplier (ADMM) algorithm for CLIME, and establish rates of convergence for both the objective and optimality conditions. Further, we develop a large scale distributed framework for the computations, which scales to millions of dimensions and trillions of parameters, using hundreds of cores. The proposed framework solves CLIME in columnblocks and only involves elementwise operations and parallel matrix multiplications. We evaluate our algorithm on both shared-memory and distributed-memory architectures, which can use block cyclic distribution of data and parameters to achieve load balance and improve the efficiency in the use of memory hierarchies. Experimental results show that our algorithm is substantially more scalable than state-of-the-art methods and scales almost linearly with the number of cores. 1
5 0.90272045 172 nips-2013-Learning word embeddings efficiently with noise-contrastive estimation
Author: Andriy Mnih, Koray Kavukcuoglu
Abstract: Continuous-valued word embeddings learned by neural language models have recently been shown to capture semantic and syntactic information about words very well, setting performance records on several word similarity tasks. The best results are obtained by learning high-dimensional embeddings from very large quantities of data, which makes scalability of the training method a critical factor. We propose a simple and scalable new approach to learning word embeddings based on training log-bilinear models with noise-contrastive estimation. Our approach is simpler, faster, and produces better results than the current state-of-theart method. We achieve results comparable to the best ones reported, which were obtained on a cluster, using four times less data and more than an order of magnitude less computing time. We also investigate several model types and find that the embeddings learned by the simpler models perform at least as well as those learned by the more complex ones. 1
same-paper 6 0.83406574 215 nips-2013-On Decomposing the Proximal Map
7 0.76991987 12 nips-2013-A Novel Two-Step Method for Cross Language Representation Learning
8 0.75225264 99 nips-2013-Dropout Training as Adaptive Regularization
9 0.73753273 94 nips-2013-Distributed $k$-means and $k$-median Clustering on General Topologies
10 0.73035735 82 nips-2013-Decision Jungles: Compact and Rich Models for Classification
11 0.71632689 30 nips-2013-Adaptive dropout for training deep neural networks
12 0.70414633 276 nips-2013-Reshaping Visual Datasets for Domain Adaptation
13 0.69834977 251 nips-2013-Predicting Parameters in Deep Learning
14 0.69763696 96 nips-2013-Distributed Representations of Words and Phrases and their Compositionality
15 0.69384974 69 nips-2013-Context-sensitive active sensing in humans
16 0.69149983 5 nips-2013-A Deep Architecture for Matching Short Texts
17 0.67904007 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization
18 0.67512423 64 nips-2013-Compete to Compute
19 0.66529727 263 nips-2013-Reasoning With Neural Tensor Networks for Knowledge Base Completion
20 0.65814263 183 nips-2013-Mapping paradigm ontologies to and from the brain