jmlr jmlr2012 jmlr2012-64 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Sangkyun Lee, Stephen J. Wright
Abstract: Iterative methods that calculate their steps from approximate subgradient directions have proved to be useful for stochastic learning problems over large and streaming data sets. When the objective consists of a loss function plus a nonsmooth regularization term, the solution often lies on a lowdimensional manifold of parameter space along which the regularizer is smooth. (When an ℓ1 regularizer is used to induce sparsity in the solution, for example, this manifold is defined by the set of nonzero components of the parameter vector.) This paper shows that a regularized dual averaging algorithm can identify this manifold, with high probability, before reaching the solution. This observation motivates an algorithmic strategy in which, once an iterate is suspected of lying on an optimal or near-optimal manifold, we switch to a “local phase” that searches in this manifold, thus converging rapidly to a near-optimal point. Computational results are presented to verify the identification property and to illustrate the effectiveness of this approach. Keywords: regularization, dual averaging, partly smooth manifold, manifold identification
Reference: text
sentIndex sentText sentNum sentScore
1 When the objective consists of a loss function plus a nonsmooth regularization term, the solution often lies on a low-dimensional manifold of parameter space along which the regularizer is smooth. [sent-8, score-0.296]
2 (When an ℓ1 regularizer is used to induce sparsity in the solution, for example, this manifold is defined by the set of nonzero components of the parameter vector. [sent-9, score-0.324]
3 Keywords: regularization, dual averaging, partly smooth manifold, manifold identification. [sent-13, score-0.311]
4 This structure is characterized by a manifold, and identification of structure corresponds to identifying the manifold on which the solution lies. [sent-23, score-0.221]
5 The regularization function Ψ : Rn → R ∪ {+∞} is assumed to be a closed proper convex function whose effective domain (denoted by dom Ψ) is closed, and there is an open neighborhood O of dom Ψ that is contained in the domain of F(·, ξ), for all ξ ∈ Ξ. [sent-27, score-0.259]
6 Xiao (2010) recently developed the regularized dual averaging (RDA) method, in which the smooth term is approximated by an averaged gradient in the subproblem at each iteration, while the regularization term appears explicitly. [sent-41, score-0.304]
7 Specifically, the RDA subproblem is w_{t+1} = arg min_{w∈R^n} { ⟨ḡ_t, w⟩ + Ψ(w) + (β_t/t) h(w) }, (5) where ḡ_t = (1/t) ∑_{j=1}^t g_j and g_j ∈ ∂F(w_j; ξ_j). [sent-43, score-0.351]
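As a concrete special case (an illustration, not a claim about the paper's exact setup): when Ψ(w) = λ‖w‖₁ and the prox-function is h(w) = (1/2)‖w‖₂², the subproblem (5) separates by coordinate and has a closed-form soft-thresholding solution:

```latex
% Coordinate-wise closed form of subproblem (5) when
% \Psi(w)=\lambda\|w\|_1 and h(w)=\tfrac{1}{2}\|w\|_2^2 (illustrative assumption):
w_{t+1,i} \;=\; -\frac{t}{\beta_t}\,
  \operatorname{sign}(\bar g_{t,i})\,\max\!\bigl(|\bar g_{t,i}|-\lambda,\;0\bigr),
\qquad i = 1,\dots,n .
```

In particular, component i of w_{t+1} is zero exactly when |ḡ_{t,i}| ≤ λ, so the sparsity pattern of the RDA iterates is controlled by the averaged gradient rather than by any single noisy sample.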
8 A characteristic of problems with nonsmooth regularizers is that the solution often lies on a manifold of low dimension. [sent-47, score-0.242]
9 When a reliable method for identifying an optimal (or near-optimal) manifold is available, we have the possibility of invoking an algorithm that searches just in the low-dimensional space defined by this manifold— possibly a very different algorithm from the one used on the full space. [sent-49, score-0.241]
10 A second motivation for aiming to identify the optimal manifold is that for problems of large dimension, it may be quite expensive even to store a single iterate wt , whereas an iterate whose structure is similar to that of the solution w∗ may be stored economically. [sent-53, score-0.534]
11 (To take the case of ℓ1 regularization again, we would need to store only the nonzero elements of wt and their locations. [sent-54, score-0.261]
12 We test this approach on ℓ1 -regularized logistic regression problems, in which an LPS-based algorithm is used to explore the near-optimal manifold identified by RDA. [sent-57, score-0.221]
13 In constrained optimization, the optimal manifold is typically a face or edge of the feasible set that contains the solution. [sent-60, score-0.241]
14 In nonsmooth optimization, the optimal manifold is a smooth surface passing through the optimum, parameterizable by relatively few variables, such that the restriction of the objective function to the manifold is smooth. [sent-61, score-0.507]
15 In either case, when a certain nondegeneracy condition is satisfied, this manifold may be identified without knowing the solution, usually as a by-product of an algorithm for solving the problem. [sent-62, score-0.319]
16 Lewis and Wright (2008) analyze a framework for composite minimization (which uses a subproblem in which f is replaced by an exact linearization around wt ) and prove identification results. [sent-63, score-0.273]
17 We mention a few relevant developments here, and discuss their manifold identification properties. [sent-67, score-0.221]
18 The stochastic gradient descent (SGD) method generates its iterates by stepping in the direction of a subgradient estimate and then projecting onto W, as follows: wt+1 = ΠW (wt − αt (gt + κt )) , t = 1, 2, . [sent-72, score-0.251]
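A minimal sketch of this update for the ℓ1-regularized case with W = R^n (so the projection is the identity); the function name and the random choice of subgradient at zero components are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def sgd_step(w, g, lam, alpha, rng):
    """One SGD update w_{t+1} = Pi_W(w_t - alpha_t*(g_t + kappa_t)) with
    W = R^n and kappa_t a subgradient of lam*||w||_1 at w_t."""
    kappa = lam * np.sign(w)
    zero = (w == 0.0)
    # at a zero component any value in [-lam, lam] is a valid subgradient
    kappa[zero] = lam * rng.uniform(-1.0, 1.0, size=int(zero.sum()))
    return w - alpha * (g + kappa)
```

Because the step subtracts a generally dense noisy vector from w_t, a component that is exactly zero almost never stays exactly zero, which is the lack of manifold identification discussed next.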
19 As discussed by Xiao (2010), the SGD method does not exploit the problem structure that is present in the regularizer ψ, and there is no reason to believe that these algorithms have good manifold identification properties. [sent-83, score-0.252]
20 When ψ(·) = ‖·‖_1, for instance, there is no particular reason for the iterates w_t to be sparse. [sent-84, score-0.339]
21 Though equal in expectation, gt and ∇ f (xt ) may be far apart, so that even if a careful choice of κt is made at each iteration, the updates may have the cumulative effect of destroying sparsity in the iterates wt . [sent-85, score-0.457]
22 Variants of SGD for the general convex case often work with averaged primal iterates, of the form w̄_t := (∑_{j=1}^t ν_j w_j) / (∑_{j=1}^t ν_j), (6) where the ν_j are nonnegative weights (see, for example, Nemirovski et al. [sent-86, score-0.302]
23 For ℓ1 regularization, we can still expect the averaged iterates w̄_t to be at least as dense as the “raw” iterates w_t. [sent-89, score-0.709]
24 (2009), which generate their iterates by setting g_t = ∇f(w_t) and solving w_{t+1} = arg min_{w∈R^n} { ⟨g_t, w⟩ + Ψ(w) + (1/(2α_t)) ‖w − w_t‖² }, (7) for some parameter α_t > 0 (which plays a step-length-like role). [sent-96, score-0.6]
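For Ψ(w) = λ‖w‖₁ this subproblem again has a closed form, namely soft-thresholding of the forward step w_t − α_t g_t; a sketch with hypothetical names:

```python
import numpy as np

def soft_threshold(z, tau):
    """Componentwise argmin_w 0.5*(w - z)**2 + tau*|w|."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def forward_backward_step(w, g, lam, alpha):
    """One step of subproblem (7) with Psi(w) = lam*||w||_1:
    argmin_w <g, w> + lam*||w||_1 + (1/(2*alpha))*||w - w_t||^2."""
    return soft_threshold(w - alpha * g, alpha * lam)
```

Since g here is a single-sample gradient estimate, the thresholded point can easily acquire or lose nonzero components from one iteration to the next, which is the instability relative to the optimal manifold discussed below.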
25 , 2011) also has the subproblem (7) at its core, but embeds it in a strategy for generating three sequences of iterates (rather than the single sequence {wt }), extending an idea of Nesterov (2004). [sent-99, score-0.246]
26 , 2009) solves a subproblem like (7) on some iterations; on other iterations it simply steps in the direction gt of the latest gradient estimate for f , ignoring the regularization term. [sent-101, score-0.258]
27 The inaccuracy of this gradient estimate may cause the iterates to step away from the optimal manifold, even from an iterate wt that is close to the solution w∗ . [sent-106, score-0.441]
28 In contrast, the dual average ḡ_t used by the RDA subproblem stabilizes around ∇f(w∗) as the iterates converge to w∗, allowing identification results to be derived from analysis like that of Hare and Lewis (2004) and Lewis and Wright (2008), suitably generalized to the stochastic setting. [sent-108, score-0.441]
29 Considering again ℓ1 regularization, we observe that if component i of any iterate wt is nonzero, then component i of all averaged iterates at subsequent iterations will also be nonzero (unless some fortuitous cancellation occurs). [sent-110, score-0.48]
30 In the current paper, we facilitate the use of the raw iterates wt by RDA by adding two assumptions. [sent-112, score-0.339]
31 Together, these conditions ensure that w∗ is a strong minimizer of φ, so convergence of φ(wt ) to φ(w∗ ) forces convergence of wt to w∗ . [sent-115, score-0.287]
32 The iterate wt produced by the algorithm depends on ξ1 , ξ2 , . [sent-142, score-0.238]
33 , ξt−1 but not on ξt , ξt+1 , · · · ; we sometimes emphasize this fact by writing wt = wt (ξ[t−1] ). [sent-145, score-0.366]
34 Given that w_t = w_t(ξ_{[t−1]}), this implies E[∇F(w_t; ξ_t)] = E[ E[∇F(w_t; ξ_t) | ξ_{[t−1]}] ] = E[∇f(w_t)]. [sent-152, score-0.366]
35 The nondegeneracy assumption (13) is common in manifold identification analyses. [sent-170, score-0.319]
36 Definition 3 (Manifold) A set M ⊂ R^n is a manifold about z̄ ∈ M if it can be described locally by C^p functions (p ≥ 2) with linearly independent gradients. [sent-174, score-0.265]
37 We refer to M as the active manifold at the point z̄, and as the optimal manifold when z̄ = w∗, where w∗ is a solution of (1). [sent-180, score-0.462]
38 A natural definition of the active manifold M is thus M = {z ∈ R^n | z_i = 0 for all i ∈ Z}. [sent-192, score-0.24]
39 The restriction of Ψ to this manifold thus has the following form for all z ∈ M near z̄: −∑_{i∈N} z_i + ∑_{i∈P} z_i, so that (i) of Definition 4 is satisfied. [sent-194, score-0.288]
40 5 Strong Minimizer Properties We assume that w∗ is a locally strong minimizer of φ relative to the optimal manifold M with modulus c_M, that is, there exist c_M > 0 and r_M > 0 such that {w ∈ R^n | ‖w − w∗‖ ≤ r_M} ⊂ O and φ|_M(w) ≥ φ|_M(w∗) + c_M ‖w − w∗‖², for all w ∈ M with ‖w − w∗‖ ≤ r_M. (14) [sent-201, score-0.462]
41 Together with a nondegeneracy assumption (13), these conditions imply that w∗ is a locally strong minimizer of φ(w) (without restriction to the optimal manifold), as we now prove. [sent-202, score-0.266]
42 Theorem 5 (Strong Minimizer for General Convex Case) Suppose that φ is partly smooth at w∗ relative to the optimal manifold M, that w∗ is a locally strong minimizer of φ|M with the modulus cM > 0 and radius rM > 0 defined in (14), and that the nondegeneracy condition (13) holds at w∗ . [sent-203, score-0.619]
43 Then there exist c̄ ∈ (0, c_M] and r̄ ∈ (0, r_M] such that φ(w) − φ(w∗) ≥ c̄ ‖w − w∗‖², for all w with ‖w − w∗‖ ≤ r̄, (15) that is, w∗ is a locally strong minimizer of φ, without restriction to the manifold M. [sent-204, score-0.369]
44 If φ is strongly convex on dom Ψ with modulus σ > 0, then w∗ is a globally strong minimizer of (1) with modulus min(c̄, σ/2), that is, φ(w) ≥ φ(w∗) + min(c̄, σ/2) ‖w − w∗‖², for all w ∈ O. [sent-216, score-0.447]
45 Assumption 2 The function φ is partly smooth at its minimizer w∗ relative to the optimal manifold M and w∗ is a locally strong minimizer of φ|M as defined in (14). [sent-222, score-0.52]
46 We then analyze the properties of the averaged gradient; this analysis forms the basis of the manifold identification result in Section 4. [sent-226, score-0.252]
47 As introduced in (5), the prox-function h : Rn → R∪ {+∞} is proper, strongly convex on dom Ψ, and subdifferentiable on ri dom Ψ. [sent-229, score-0.288]
48 Input: • a prox-function h(w) that is strongly convex on dom Ψ and also satisfies arg min h(w) ∈ arg min Ψ(w). [sent-241, score-0.247]
49 do Sample ξ_t from Ξ and compute a gradient g_t = ∇F(w_t; ξ_t); Update the average gradient: ḡ_t = ((t − 1)/t) ḡ_{t−1} + (1/t) g_t. [sent-246, score-0.263]
50 Compute the next iterate: w_{t+1} = arg min_{w∈R^n} { ⟨ḡ_t, w⟩ + Ψ(w) + (β_t/t) h(w) }. [sent-247, score-0.261]
51 As the objective function in the subproblem (18) is strongly convex when βt > 0 or when Ψ(·) is strongly convex (at least one of which we assume for the analysis below), wt+1 is uniquely defined by (18). [sent-254, score-0.302]
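A compact sketch of the loop above for Ψ(w) = λ‖w‖₁ and h(w) = (1/2)‖w‖₂²; the specific choice β_t = γ√t and the gradient oracle grad_sample are assumptions made here for illustration, not taken from the paper:

```python
import numpy as np

def rda_l1(grad_sample, n, lam, gamma, T, seed=0):
    """Regularized dual averaging for Psi(w) = lam*||w||_1, h(w) = 0.5*||w||^2.
    grad_sample(w, rng) returns a stochastic gradient of f at w."""
    rng = np.random.default_rng(seed)
    w = np.zeros(n)        # w_1 minimizes both h and Psi
    gbar = np.zeros(n)     # running average of the sampled gradients
    for t in range(1, T + 1):
        g = grad_sample(w, rng)
        gbar = (t - 1) / t * gbar + g / t        # average-gradient update
        beta = gamma * np.sqrt(t)                # assumed form of the choice (20)
        # closed form of the subproblem: soft-thresholding of the dual average
        w = -(t / beta) * np.sign(gbar) * np.maximum(np.abs(gbar) - lam, 0.0)
    return w
```

The support of w at iteration t is determined by which components of the averaged gradient ḡ_t exceed λ in magnitude, which is the mechanism that the identification analysis exploits.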
52 , w_t generated by the stochastic RDA algorithm with {β_t} chosen by (20), we have (1/t) ∑_{j=1}^t E[ I(‖w_j − w∗‖ ≤ r̄) ‖w_j − w∗‖² ] ≤ (1/c̄)(γD² + G²/γ) t^{−1/2}, (24) and (1/t) ∑_{j=1}^t E[ I(‖w_j − w∗‖ > r̄) ‖w_j − w∗‖ ] ≤ (1/(c̄ r̄))(γD² + G²/γ) t^{−1/2}. (25) [sent-278, score-0.269]
53 When Ψ(w) is strongly convex with the modulus σ > 0, then with {β_t} chosen by (21) we have (1/t) ∑_{j=1}^t E‖w_j − w∗‖² ≤ (6 + ln t) G²/(σ² t). (26) [sent-279, score-0.302]
54 , w_t generated by the stochastic RDA algorithm with the choice (20) for {β_t}, we have (1/t) ∑_{j=1}^t E‖w_j − w∗‖ ≤ µ t^{−1/4}, (27) for all t ≥ t̂, where the constants t̂ and µ are defined in (28) in terms of γ, D, G, c̄, and r̄. [sent-287, score-0.229]
55 When Ψ(w) is strongly convex with the modulus σ > 0, then with the choice (21) for {β_t} we have (1/t) ∑_{j=1}^t E‖w_j − w∗‖ ≤ µ′ [(6 + ln t)/t]^{1/2}, (29) for the constant µ′ defined by µ′ := G/σ. [sent-288, score-0.302]
56 3 Stochastic Behavior of the Dual Average We now study the convergence properties of the dual average ḡ_t, showing that the probability that ḡ_t lies outside any given ball around ∇f(w∗) goes to zero as t increases. [sent-291, score-0.267]
57 When Ψ is strongly convex and the choice (21) is used for {β_t}, we have P(‖ḡ_t − ∇f(w∗)‖ ≥ ε) ≤ 2n exp(−ε²t/(32n²G²)) + 2µ′ε^{−1}L [(6 + ln t)/t]^{1/2}, ∀t ≥ 1, (32) where µ′ is defined in (30). [sent-294, score-0.347]
58 When Ψ is strongly convex and the choice (21) is used for {β_t}, and provided that ε ∈ (0, 4nG√(2/e)] and t ≥ t̄′ with t̄′ := 32n²G²ε^{−2} max{ −W(−ε²/(32n²G²)), −2 ln(µ′L√6/(nε)) }, (38) the bound (32) simplifies to P(‖ḡ_t − ∇f(w∗)‖ ≥ ε) ≤ 4µ′ε^{−1}L [(6 + ln t)/t]^{1/2} for all such t. [sent-311, score-0.381]
59 Thus for t ≥ t̄′, we have from (42) and (38) that −ε²t/(32n²G²) + (1/2) ln t ≤ −ε²t/(64n²G²) ≤ ln(µ′L√6/(εn)) ≤ ln(µ′L√(6 + ln t)/(εn)). [sent-320, score-0.314]
60 Our analysis is based upon the properties of the dual average discussed in the previous section and on basic results for manifold identification. [sent-327, score-0.252]
61 Specifically, we make use of a result of Hare and Lewis (2004), which states that when a sequence of points approaches a limit lying on an optimal nondegenerate manifold and the subgradients at these points approach zero, then all sufficiently advanced members of the sequence lie on the manifold. [sent-328, score-0.241]
62 We identify subsequences of the RDA sequence that lie on the optimal manifold with increasing likelihood as the iteration counter grows. [sent-329, score-0.259]
63 Theorem 15 Suppose that φ is partly smooth at the minimizer w∗ relative to the optimal manifold M and that the nondegeneracy condition (13) holds. [sent-382, score-0.47]
64 The next theorem is our main result, showing that the RDA algorithm identifies the optimal manifold with increasing probability as iterations proceed. [sent-393, score-0.259]
65 }, while Theorem 16 states that for all sufficiently large j ∈ S, w j lies on the optimal manifold with probability approaching one as j increases. [sent-435, score-0.241]
66 The major reason is that each iteration uses a “raw” sampled gradient g_t of f, rather than the averaged (and thus smoothed) approximate gradient ḡ_t of RDA. [sent-439, score-0.339]
67 Thus, as we see now, even when the current iterate w_t is optimal, the subproblem may step away from this point, and away from the optimal manifold. [sent-440, score-0.348]
68 With these definitions, the solution is w∗ = 0 and the optimal manifold is zero-dimensional: M = {0}. [sent-445, score-0.241]
69 Setting w_t = w∗ = 0, the subproblem (7) is w_{t+1} = arg min_w { ξ_t w + |w| + (1/(2α_t)) w² }, where ξ_t is selected uniformly at random from [−2, 2]. [sent-449, score-0.298]
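Working out this scalar subproblem explicitly (a short calculation consistent with the display above) gives

```latex
% Closed form of the scalar subproblem above (derived, not quoted):
w_{t+1} \;=\; -\alpha_t\,\operatorname{sign}(\xi_t)\,\max\!\bigl(|\xi_t| - 1,\;0\bigr),
```

which is nonzero whenever |ξ_t| > 1. For ξ_t uniform on [−2, 2] this event has probability 1/2, so even when started at the solution w∗ = 0 the update (7) steps off the optimal manifold M = {0} on roughly half of the iterations.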
70 Enhancing the Regularized Dual Averaging Algorithm Here we present a simple optimization strategy motivated by our analysis above, in which the RDA method gives way to a local phase after a near-optimal manifold is identified. [sent-456, score-0.301]
71 This algorithm starts with RDA steps until it identifies a near-optimal manifold, then searches this manifold using a different optimization strategy, possibly better suited to lower-dimensional spaces and possibly with better local convergence properties. [sent-458, score-0.239]
72 To decide when to make the switch to the local phase, we use a simple heuristic inspired by Theorems 16 and 17: if the past τ consecutive iterates have been on the same manifold M, we take M to be approximately optimal. [sent-462, score-0.395]
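A sketch of this heuristic for the ℓ1 case, where “same manifold” means “same set of nonzero components”; the window bookkeeping below is an illustrative assumption rather than the paper's exact implementation:

```python
import numpy as np
from collections import deque

def make_manifold_monitor(tau):
    """Return a callable that is fed iterates one at a time and reports True
    once the last tau consecutive iterates share the same nonzero pattern."""
    recent = deque(maxlen=tau)

    def update(w, tol=0.0):
        support = frozenset(np.flatnonzero(np.abs(w) > tol).tolist())
        recent.append(support)
        return len(recent) == tau and len(set(recent)) == 1

    return update
```

In an RDA+-style loop one would call this monitor after every RDA step and, the first time it returns True, freeze the current support as the candidate manifold M and switch to the local phase restricted to M.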
73 ) The optimal manifold corresponds (near w∗ ) to the points in Rn that have the same nonzero patterns as w∗ . [sent-468, score-0.296]
74 Input: • a prox-function h(w) that is strongly convex on dom Ψ and also satisfies arg min_{w∈R^n} h(w) ∈ arg min_{w∈R^n} Ψ(w) and sup_{b∈∂h(w)} ‖b‖ ≤ η ‖w − w_1‖ for all w ∈ dom Ψ, where w_1 ∈ arg min_{w∈R^n} Ψ(w). [sent-480, score-0.363]
75 We use Mk to denote a subset of the restricted manifold M, again defined by the indices of the components in which we consider a move at iteration k. [sent-505, score-0.256]
76 We then tabulate how many iterations of RDA are required before it generates a point in the optimal manifold M containing w∗ . [sent-529, score-0.241]
77 We also check when the iterates of RDA reach a modest superset of the optimal manifold—a “2×” superset composed of the points in R^n having the same sign pattern for the active components in M, and up to twice as many nonzeros as the points in M. [sent-530, score-0.318]
78 The median number of iterations required to identify the optimal manifold M, and the number required to identify a 2× superset, are presented, along with the standard deviations over the 100 tests (in parentheses). [sent-572, score-0.241]
79 , n, where gt is a sampled approximation to the gradient of f at wt , obtained from a single training example. [sent-584, score-0.328]
80 ) For LPS and the local phase of RDA+ , we compute a reduced Newton step on the current active manifold only when the number of nonzeros falls below 200. [sent-593, score-0.366]
81 For SGD, TG, and RDA, we keep track not only of the primal iteration sequence {w_t}, but also of the primal averages w̄_T := (1/T) ∑_{t=1}^T w_t, where T for each algorithm denotes the iteration number where the algorithm is stopped. [sent-595, score-0.287]
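For reference, the primal average can be maintained incrementally by the standard identity

```latex
% Incremental form of the primal average (standard identity):
\bar w_T \;=\; \frac{1}{T}\sum_{t=1}^{T} w_t
        \;=\; \frac{T-1}{T}\,\bar w_{T-1} \;+\; \frac{1}{T}\,w_T ,
```

which also makes explicit why a component that becomes nonzero in any single iterate remains nonzero in every later average, barring exact cancellation.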
82 RDA tends to produce much sparser iterates with less fluctuation than SGD and TG, but it fails to reduce the number of nonzeros to the smallest number identified by RDA+ in the given time, for the values λ = 1. [sent-613, score-0.221]
83 Figure 2 shows similar plots but for the primal averaged iterates w̄_t. [sent-619, score-0.422]
84 We present the results for the iterates without averaging in the three plots on the left, and those for the primal-averaged iterates (for algorithms SGD, TG, and RDA) in the plots on the right. [sent-625, score-0.406]
85 The averaged iterates of SGD and TG show smaller test error for λ ≥ 1 than others, but they require a large number of nonzero components, despite the strong regularization imposed. [sent-638, score-0.297]
86 Conclusion We have shown that under assumptions of nondegeneracy and strong local minimization at the optimum, the (non-averaged) iterates generated by the RDA algorithm identify the optimal manifold with probability approaching one as iterations proceed. [sent-645, score-0.565]
87 Convergence is measured in terms of the optimality measure (left) and the number of nonzero components (excluding intercepts) in the iterates (right). [sent-653, score-0.287]
88 Convergence is measured in terms of the optimality measure (left) and the number of nonzero components (excluding intercepts) in the iterates (right). [sent-662, score-0.287]
89 (Figure: left panel, iterates without averaging; right panel, primal averages; curves for SGD, TG, RDA, RDA+, and LPS (baseline); vertical axis, optimality on a log scale.) [sent-666, score-0.248]
90 The plots on the left show the results for the iterates without averaging, and those on the right show averaged primal iterates for algorithms SGD, TG, and RDA, and non-averaged iterates for RDA+ and LPS. [sent-680, score-0.551]
91 Third, it is likely that the nondegeneracy assumption can be weakened, in which case a “super-manifold” of the optimal manifold would be identifiable. [sent-701, score-0.339]
92 5, based on results from manifold analysis and other elementary arguments. [sent-709, score-0.221]
93 We first state an elementary result on manifold characterization, which is proved by Vaisman (1984, Sections 1. [sent-712, score-0.221]
94 2), shows how perturbations from a point at which the objective function is partly smooth can be decomposed according to the manifold characterization above. [sent-719, score-0.28]
95 Lemma 20 Consider a function ϕ : R^n → R ∪ {+∞}, a point z̄ ∈ R^n, and a manifold M containing z̄ such that ϕ is partly smooth at z̄ with respect to M. [sent-724, score-0.28]
96 Recall the following assumptions: (i) φ is partly smooth at w∗ relative to the optimal manifold M. [sent-728, score-0.3]
97 (ii) w∗ is a locally strong minimizer of φ|M with modulus cM > 0 and radius rM > 0, and (iii) the nondegeneracy condition (13) holds at w∗ . [sent-729, score-0.319]
98 For the minimizer w∗ of (1) and the optimal manifold M containing w∗, we consider the mappings H and G, the matrix Y, and the point ȳ ∈ R^{n−k} satisfying Lemma 18 and Lemma 19, associated with z̄ = w∗ ∈ R^n. [sent-730, score-0.313]
99 We have thus shown that w∗ indeed is a local strong minimizer of φ, without the restriction to the manifold M, with modulus c̄ := (1/2) min(ε/2, c_M) and radius r̄. [sent-741, score-0.416]
100 For the strongly convex case, we have from Cauchy-Schwarz and Jensen’s inequalities that (1/t) ∑_{j=1}^t E‖w_j − w∗‖ ≤ (1/t) ∑_{j=1}^t (E‖w_j − w∗‖²)^{1/2} ≤ ( (1/t) ∑_{j=1}^t E‖w_j − w∗‖² )^{1/2}. [sent-765, score-0.226]
wordName wordTfidf (topN-words)
[('rda', 0.718), ('manifold', 0.221), ('lps', 0.212), ('wt', 0.183), ('tg', 0.161), ('iterates', 0.156), ('sgd', 0.156), ('dentification', 0.144), ('egularized', 0.123), ('lnt', 0.123), ('gt', 0.118), ('tochastic', 0.111), ('anifold', 0.111), ('nondegeneracy', 0.098), ('xiao', 0.097), ('dom', 0.091), ('subproblem', 0.09), ('nline', 0.081), ('modulus', 0.073), ('ee', 0.073), ('minimizer', 0.072), ('wright', 0.066), ('nonzeros', 0.065), ('rn', 0.062), ('phase', 0.062), ('nnzs', 0.06), ('optimality', 0.059), ('averaging', 0.058), ('cm', 0.055), ('iterate', 0.055), ('nonzero', 0.055), ('convex', 0.054), ('zk', 0.054), ('hare', 0.053), ('nemirovski', 0.052), ('strongly', 0.052), ('fd', 0.051), ('lewis', 0.047), ('earning', 0.047), ('stochastic', 0.046), ('ttg', 0.045), ('unbiasedness', 0.045), ('identi', 0.044), ('locally', 0.044), ('regret', 0.043), ('wj', 0.04), ('dist', 0.038), ('intercepts', 0.038), ('partly', 0.035), ('ln', 0.034), ('primal', 0.034), ('tj', 0.033), ('card', 0.033), ('strong', 0.032), ('regularizer', 0.031), ('averaged', 0.031), ('dual', 0.031), ('stc', 0.03), ('superset', 0.03), ('duchi', 0.03), ('event', 0.03), ('near', 0.029), ('gradient', 0.027), ('mnist', 0.026), ('arg', 0.025), ('nm', 0.025), ('mk', 0.025), ('manifolds', 0.024), ('smooth', 0.024), ('nesterov', 0.023), ('regularization', 0.023), ('runtime', 0.023), ('lipschitz', 0.023), ('safeguarding', 0.023), ('sangkyun', 0.023), ('zj', 0.023), ('dk', 0.023), ('subgradient', 0.022), ('nonsmooth', 0.021), ('solutions', 0.021), ('assumptions', 0.02), ('digits', 0.02), ('shi', 0.02), ('regularized', 0.02), ('online', 0.02), ('smoothness', 0.02), ('optimal', 0.02), ('switching', 0.019), ('zi', 0.019), ('iteration', 0.018), ('theorem', 0.018), ('plots', 0.018), ('lemma', 0.018), ('lt', 0.018), ('local', 0.018), ('st', 0.018), ('inequality', 0.018), ('dortmund', 0.017), ('convexity', 0.017), ('components', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
Author: Sangkyun Lee, Stephen J. Wright
Abstract: Iterative methods that calculate their steps from approximate subgradient directions have proved to be useful for stochastic learning problems over large and streaming data sets. When the objective consists of a loss function plus a nonsmooth regularization term, the solution often lies on a lowdimensional manifold of parameter space along which the regularizer is smooth. (When an ℓ1 regularizer is used to induce sparsity in the solution, for example, this manifold is defined by the set of nonzero components of the parameter vector.) This paper shows that a regularized dual averaging algorithm can identify this manifold, with high probability, before reaching the solution. This observation motivates an algorithmic strategy in which, once an iterate is suspected of lying on an optimal or near-optimal manifold, we switch to a “local phase” that searches in this manifold, thus converging rapidly to a near-optimal point. Computational results are presented to verify the identification property and to illustrate the effectiveness of this approach. Keywords: regularization, dual averaging, partly smooth manifold, manifold identification
2 0.13732418 98 jmlr-2012-Regularized Bundle Methods for Convex and Non-Convex Risks
Author: Trinh Minh Tri Do, Thierry Artières
Abstract: Machine learning is most often cast as an optimization problem. Ideally, one expects a convex objective function to rely on efficient convex optimizers with nice guarantees such as no local optima. Yet, non-convexity is very frequent in practice and it may sometimes be inappropriate to look for convexity at any price. Alternatively one can decide not to limit a priori the modeling expressivity to models whose learning may be solved by convex optimization and rely on non-convex optimization algorithms. The main motivation of this work is to provide efficient and scalable algorithms for non-convex optimization. We focus on regularized unconstrained optimization problems which cover a large number of modern machine learning problems such as logistic regression, conditional random fields, large margin estimation, etc. We propose a novel algorithm for minimizing a regularized objective that is able to handle convex and non-convex, smooth and non-smooth risks. The algorithm is based on the cutting plane technique and on the idea of exploiting the regularization term in the objective function. It may be thought as a limited memory extension of convex regularized bundle methods for dealing with convex and non convex risks. In case the risk is convex the algorithm is proved to converge to a stationary solution with accuracy ε with a rate O(1/λε) where λ is the regularization parameter of the objective function under the assumption of a Lipschitz empirical risk. In case the risk is not convex getting such a proof is more difficult and requires a stronger and more disputable assumption. Yet we provide experimental results on artificial test problems, and on five standard and difficult machine learning problems that are cast as convex and non-convex optimization problems that show how our algorithm compares well in practice with state of the art optimization algorithms. Keywords: optimization, non-convex, non-smooth, cutting plane, bundle method, regularized risk
3 0.12315395 97 jmlr-2012-Regularization Techniques for Learning with Matrices
Author: Sham M. Kakade, Shai Shalev-Shwartz, Ambuj Tewari
Abstract: There is growing body of learning problems for which it is natural to organize the parameters into a matrix. As a result, it becomes easy to impose sophisticated prior knowledge by appropriately regularizing the parameters under some matrix norm. This work describes and analyzes a systematic method for constructing such matrix-based regularization techniques. In particular, we focus on how the underlying statistical properties of a given problem can help us decide which regularization function is appropriate. Our methodology is based on a known duality phenomenon: a function is strongly convex with respect to some norm if and only if its conjugate function is strongly smooth with respect to the dual norm. This result has already been found to be a key component in deriving and analyzing several learning algorithms. We demonstrate the potential of this framework by deriving novel generalization and regret bounds for multi-task learning, multi-class learning, and multiple kernel learning. Keywords: regularization, strong convexity, regret bounds, generalization bounds, multi-task learning, multi-class learning, multiple kernel learning
4 0.091394342 83 jmlr-2012-Online Learning in the Embedded Manifold of Low-rank Matrices
Author: Uri Shalit, Daphna Weinshall, Gal Chechik
Abstract: When learning models that are represented in matrix forms, enforcing a low-rank constraint can dramatically improve the memory and run time complexity, while providing a natural regularization of the model. However, naive approaches to minimizing functions over the set of low-rank matrices are either prohibitively time consuming (repeated singular value decomposition of the matrix) or numerically unstable (optimizing a factored representation of the low-rank matrix). We build on recent advances in optimization over manifolds, and describe an iterative online learning procedure, consisting of a gradient step, followed by a second-order retraction back to the manifold. While the ideal retraction is costly to compute, and so is the projection operator that approximates it, we describe another retraction that can be computed efficiently. It has run time and memory complexity of O ((n + m)k) for a rank-k matrix of dimension m × n, when using an online procedure with rank-one gradients. We use this algorithm, L ORETA, to learn a matrix-form similarity measure over pairs of documents represented as high dimensional vectors. L ORETA improves the mean average precision over a passive-aggressive approach in a factorized model, and also improves over a full model trained on pre-selected features using the same memory requirements. We further adapt L ORETA to learn positive semi-definite low-rank matrices, providing an online algorithm for low-rank metric learning. L ORETA also shows consistent improvement over standard weakly supervised methods in a large (1600 classes and 1 million images, using ImageNet) multi-label image classification task. Keywords: low rank, Riemannian manifolds, metric learning, retractions, multitask learning, online learning
5 0.089523792 68 jmlr-2012-Minimax Manifold Estimation
Author: Christopher Genovese, Marco Perone-Pacifico, Isabella Verdinelli, Larry Wasserman
Abstract: We find the minimax rate of convergence in Hausdorff distance for estimating a manifold M of dimension d embedded in RD given a noisy sample from the manifold. Under certain conditions, we show that the optimal rate of convergence is n−2/(2+d) . Thus, the minimax rate depends only on the dimension of the manifold, not on the dimension of the space in which M is embedded. Keywords: manifold learning, minimax estimation
6 0.08819823 74 jmlr-2012-Multi Kernel Learning with Online-Batch Optimization
7 0.08461637 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
8 0.083079569 50 jmlr-2012-Human Gesture Recognition on Product Manifolds
9 0.08238858 23 jmlr-2012-Breaking the Curse of Kernelization: Budgeted Stochastic Gradient Descent for Large-Scale SVM Training
10 0.072610982 84 jmlr-2012-Online Submodular Minimization
11 0.061991282 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
12 0.05540498 80 jmlr-2012-On Ranking and Generalization Bounds
13 0.050965279 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
14 0.050808609 25 jmlr-2012-Characterization and Greedy Learning of Interventional Markov Equivalence Classes of Directed Acyclic Graphs
15 0.049346864 22 jmlr-2012-Bounding the Probability of Error for High Precision Optical Character Recognition
16 0.04700559 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
17 0.033627834 82 jmlr-2012-On the Necessity of Irrelevant Variables
18 0.032331947 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming
19 0.031912763 112 jmlr-2012-Structured Sparsity via Alternating Direction Methods
20 0.030855522 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality
topicId topicWeight
[(0, -0.197), (1, -0.174), (2, -0.056), (3, 0.024), (4, 0.009), (5, 0.014), (6, 0.145), (7, -0.098), (8, 0.184), (9, -0.149), (10, 0.041), (11, 0.259), (12, -0.066), (13, 0.079), (14, 0.043), (15, -0.009), (16, 0.008), (17, 0.059), (18, -0.128), (19, -0.059), (20, 0.226), (21, 0.087), (22, -0.042), (23, 0.031), (24, -0.106), (25, 0.015), (26, 0.101), (27, 0.007), (28, -0.056), (29, -0.037), (30, -0.011), (31, 0.068), (32, 0.001), (33, -0.017), (34, 0.108), (35, 0.138), (36, -0.134), (37, -0.091), (38, 0.035), (39, 0.01), (40, -0.001), (41, 0.049), (42, 0.009), (43, -0.089), (44, -0.016), (45, -0.129), (46, -0.009), (47, 0.05), (48, 0.052), (49, -0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.93109977 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
2 0.63934743 98 jmlr-2012-Regularized Bundle Methods for Convex and Non-Convex Risks
3 0.5324235 83 jmlr-2012-Online Learning in the Embedded Manifold of Low-rank Matrices
4 0.44015336 97 jmlr-2012-Regularization Techniques for Learning with Matrices
5 0.43028972 68 jmlr-2012-Minimax Manifold Estimation
6 0.42832908 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
7 0.42157286 23 jmlr-2012-Breaking the Curse of Kernelization: Budgeted Stochastic Gradient Descent for Large-Scale SVM Training
8 0.37106067 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
9 0.35110423 50 jmlr-2012-Human Gesture Recognition on Product Manifolds
10 0.28614938 80 jmlr-2012-On Ranking and Generalization Bounds
11 0.28094846 74 jmlr-2012-Multi Kernel Learning with Online-Batch Optimization
12 0.25641975 22 jmlr-2012-Bounding the Probability of Error for High Precision Optical Character Recognition
13 0.24270019 25 jmlr-2012-Characterization and Greedy Learning of Interventional Markov Equivalence Classes of Directed Acyclic Graphs
14 0.24148069 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization
15 0.24018064 112 jmlr-2012-Structured Sparsity via Alternating Direction Methods
16 0.21727584 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
17 0.17141186 43 jmlr-2012-Fast Approximation of Matrix Coherence and Statistical Leverage
18 0.16765921 52 jmlr-2012-Iterative Reweighted Algorithms for Matrix Rank Minimization
19 0.16584407 84 jmlr-2012-Online Submodular Minimization
20 0.16502489 109 jmlr-2012-Stability of Density-Based Clustering
topicId topicWeight
[(7, 0.044), (21, 0.039), (26, 0.045), (27, 0.011), (29, 0.03), (35, 0.023), (37, 0.315), (49, 0.018), (56, 0.013), (64, 0.022), (75, 0.083), (77, 0.019), (79, 0.021), (92, 0.114), (96, 0.11)]
simIndex simValue paperId paperTitle
same-paper 1 0.69769901 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
2 0.52113885 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
Author: Matus Telgarsky
Abstract: Boosting combines weak learners into a predictor with low empirical risk. Its dual constructs a high entropy distribution upon which weak learners and training labels are uncorrelated. This manuscript studies this primal-dual relationship under a broad family of losses, including the exponential loss of AdaBoost and the logistic loss, revealing: • Weak learnability aids the whole loss family: for any ε > 0, O (ln(1/ε)) iterations suffice to produce a predictor with empirical risk ε-close to the infimum; • The circumstances granting the existence of an empirical risk minimizer may be characterized in terms of the primal and dual problems, yielding a new proof of the known rate O (ln(1/ε)); • Arbitrary instances may be decomposed into the above two, granting rate O (1/ε), with a matching lower bound provided for the logistic loss. Keywords: boosting, convex analysis, weak learnability, coordinate descent, maximum entropy
3 0.51689923 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
Author: Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, Lin Xiao
Abstract: Online prediction methods are typically presented as serial algorithms running on a single processor. However, in the age of web-scale prediction problems, it is increasingly common to encounter situations where a single processor cannot keep up with the high rate at which inputs arrive. In this work, we present the distributed mini-batch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms. We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. Moreover, our analysis explicitly takes into account communication latencies between nodes in the distributed environment. We show how our method can be used to solve the closely-related distributed stochastic optimization problem, achieving an asymptotically linear speed-up over multiple processors. Finally, we demonstrate the merits of our approach on a web-scale online prediction problem. Keywords: distributed computing, online learning, stochastic optimization, regret bounds, convex optimization
4 0.51444679 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
Author: Mehrdad Mahdavi, Rong Jin, Tianbao Yang
Abstract: In this paper we propose efficient algorithms for solving constrained online convex optimization problems. Our motivation stems from the observation that most algorithms proposed for online convex optimization require a projection onto the convex set K from which the decisions are made. While the projection is straightforward for simple shapes (e.g., Euclidean ball), for arbitrary complex sets it is the main computational challenge and may be inefficient in practice. In this paper, we consider an alternative online convex optimization problem. Instead of requiring that decisions belong to K for all rounds, we only require that the constraints, which define the set K , be satisfied in the long run. By turning the problem into an online convex-concave optimization problem, √ we propose an efficient algorithm which achieves O( T ) regret bound and O(T 3/4 ) bound on the violation of constraints. Then, we modify the algorithm in order to guarantee that the constraints are satisfied in the long run. This gain is achieved at the price of getting O(T 3/4 ) regret bound. Our second algorithm is based on the mirror prox method (Nemirovski, 2005) to solve variational inequalities which achieves O(T 2/3 ) bound for both regret and the violation of constraints when the domain K can be described by a finite number of linear constraints. Finally, we extend the results to the setting where we only have partial access to the convex set K and propose a multipoint bandit feedback algorithm with the same bounds in expectation as our first algorithm. Keywords: online convex optimization, convex-concave optimization, bandit feedback, variational inequality
5 0.50763923 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
Author: Ofer Dekel, Claudio Gentile, Karthik Sridharan
Abstract: We present a new online learning algorithm in the selective sampling framework, where labels must be actively queried before they are revealed. We prove bounds on the regret of our algorithm and on the number of labels it queries when faced with an adaptive adversarial strategy of generating the instances. Our bounds both generalize and strictly improve over previous bounds in similar settings. Additionally, our selective sampling algorithm can be converted into an efficient statistical active learning algorithm. We extend our algorithm and analysis to the multiple-teacher setting, where the algorithm can choose which subset of teachers to query for each label. Finally, we demonstrate the effectiveness of our techniques on a real-world Internet search problem. Keywords: online learning, regret, label-efficient, crowdsourcing
6 0.50573993 97 jmlr-2012-Regularization Techniques for Learning with Matrices
7 0.50436693 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
8 0.50044459 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class
9 0.49826598 13 jmlr-2012-Active Learning via Perfect Selective Classification
10 0.49649739 80 jmlr-2012-On Ranking and Generalization Bounds
11 0.49541789 73 jmlr-2012-Multi-task Regression using Minimal Penalties
12 0.49518836 7 jmlr-2012-A Multi-Stage Framework for Dantzig Selector and LASSO
13 0.49480313 111 jmlr-2012-Structured Sparsity and Generalization
14 0.49314207 84 jmlr-2012-Online Submodular Minimization
15 0.49167743 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
16 0.48953229 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality
17 0.4891727 103 jmlr-2012-Sampling Methods for the Nyström Method
18 0.48828036 16 jmlr-2012-Algorithms for Learning Kernels Based on Centered Alignment
19 0.48822653 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms
20 0.48753482 82 jmlr-2012-On the Necessity of Irrelevant Variables