nips nips2010 nips2010-63 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Alekh Agarwal, Martin J. Wainwright, John C. Duchi
Abstract: The goal of decentralized optimization over a network is to optimize a global objective formed by a sum of local (possibly nonsmooth) convex functions using only local computation and communication. We develop and analyze distributed algorithms based on dual averaging of subgradients, and provide sharp bounds on their convergence rates as a function of the network size and topology. Our analysis clearly separates the convergence of the optimization algorithm itself from the effects of communication constraints arising from the network structure. We show that the number of iterations required by our algorithm scales inversely in the spectral gap of the network. The sharpness of this prediction is confirmed both by theoretical lower bounds and simulations for various networks. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract The goal of decentralized optimization over a network is to optimize a global objective formed by a sum of local (possibly nonsmooth) convex functions using only local computation and communication. [sent-5, score-0.336]
2 We develop and analyze distributed algorithms based on dual averaging of subgradients, and provide sharp bounds on their convergence rates as a function of the network size and topology. [sent-6, score-0.942]
3 Our analysis clearly separates the convergence of the optimization algorithm itself from the effects of communication constraints arising from the network structure. [sent-7, score-0.373]
4 We show that the number of iterations required by our algorithm scales inversely in the spectral gap of the network. [sent-8, score-0.341]
5 Problems such as multi-agent coordination, estimation problems in sensor networks, and packet routing are all naturally cast as distributed convex minimization [1, 13, 24]. [sent-15, score-0.351]
6 In this paper, we provide a simple new subgradient algorithm for distributed constrained optimization of a convex function. [sent-18, score-0.52]
7 We refer to it as a dual averaging subgradient method, since it is based on maintaining and forming weighted averages of subgradients throughout the network. [sent-19, score-0.69]
8 This approach is essentially different from previously developed distributed subgradient methods [18, 15, 21, 11], and these differences facilitate our analysis of network scaling issues—how convergence rates depend on network size and topology. [sent-20, score-0.951]
9 Indeed, the second main contribution of this paper is a careful analysis that demonstrates a close link between convergence of the algorithm and the underlying spectral properties of the network. [sent-21, score-0.247]
10 The convergence rates for a different algorithm given by the papers [18, 15] grow exponentially in the number of nodes n in the network. [sent-22, score-0.305]
11 [21] provide a tighter analysis that yields convergence rates that scale cubically in the network size, but are independent of the network topology. [sent-24, score-0.474]
12 Consequently, their analysis does not capture the intuition that distributed algorithms should converge faster on “well-connected” networks—expander graphs being a prime example—than on poorly connected networks. [sent-25, score-0.274]
13 [11] analyze a low communication peer-to-peer protocol that attains rates dependent on network structure. [sent-29, score-0.329]
14 However, in their algorithm only one node has a current parameter value, while all nodes in our algorithm maintain good estimates of the optimum at all times. [sent-30, score-0.231]
15 Our development yields an algorithm with a convergence rate that scales inversely in the spectral gap of the network. [sent-33, score-0.34]
16 By exploiting known results on spectral gaps for graphs with n nodes, we show that our algorithm obtains an $\epsilon$-optimal solution in $O(n^2/\epsilon^2)$ iterations for a single cycle or path, $O(n/\epsilon^2)$ iterations for a two-dimensional grid, and $O(1/\epsilon^2)$ iterations for a bounded degree expander graph. [sent-34, score-0.874]
17 2 Problem set-up and algorithm In this section, we provide a formal statement of the distributed minimization problem and a description of the distributed dual averaging algorithm. [sent-36, score-0.844]
18 Distributed minimization: We consider an optimization problem based on functions that are distributed over a network. [sent-37, score-0.223]
19 Associated with each $i \in V$ is a convex function $f_i : \mathbb{R}^d \to \mathbb{R}$, and our overarching goal is to solve the constrained optimization problem $\min_{x \in \mathcal{X}} \frac{1}{n}\sum_{i=1}^{n} f_i(x)$, where $\mathcal{X}$ is a closed convex set. [sent-42, score-0.704]
20 Each function fi is convex and hence subdifferentiable, but need not be smooth. [sent-43, score-0.332]
21 Each node i ∈ V is associated with a separate agent, and each agent i maintains its own parameter vector xi ∈ Rd . [sent-45, score-0.285]
22 The graph G imposes communication constraints on the agents: in particular, agent i has local access to only the objective function fi and can communicate directly only with its immediate neighbors j ∈ N (i) := {j ∈ V | (i, j) ∈ E}. [sent-46, score-0.56]
23 Each function fi is the empirical loss over the subset of data assigned to processor i, and the average f is the empirical loss over the entire dataset. [sent-49, score-0.323]
24 We use cluster computing as our model, so each processor is a node in the cluster and the graph G contains edges between processors connected with small latencies; this setup avoids communication bottlenecks of architectures with a centralized master node. [sent-50, score-0.534]
25 Dual averaging: Our algorithm is based on a dual averaging algorithm [20] for minimization of a (potentially nonsmooth) convex function f subject to the constraint that x ∈ X . [sent-51, score-0.549]
26 The dual averaging scheme is based on a proximal function $\psi : \mathbb{R}^d \to \mathbb{R}$ assumed to be strongly convex with respect to a norm $\|\cdot\|$; more precisely, $\psi(y) \ge \psi(x) + \langle \nabla\psi(x), y - x\rangle + \frac{1}{2}\|x - y\|^2$ for all $x, y \in \mathcal{X}$. [sent-53, score-0.601]
27 Such proximal functions include the canonical quadratic $\psi(x) = \frac{1}{2}\|x\|_2^2$, which is strongly convex with respect to the $\ell_2$-norm, and the negative entropy $\psi(x) = \sum_{j=1}^{d} x_j \log x_j - x_j$, which is strongly convex with respect to the $\ell_1$-norm for x in the probability simplex. [sent-59, score-0.583]
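To make the updates below concrete, the following sketch (an illustration, not the authors' code) implements the composite projection $\Pi^{\psi}_{\mathcal{X}}(v, \alpha) = \arg\min_{x\in\mathcal{X}}\{\langle v, x\rangle + \frac{1}{\alpha}\psi(x)\}$ for these two proximal functions, assuming $\mathcal{X}$ is a Euclidean ball of radius R in the quadratic case and the probability simplex in the entropic case; the exact projection (3) referenced later is not reproduced in this excerpt.

```python
import numpy as np

def proj_quadratic(v, alpha, radius=1.0):
    """Pi^psi(v, alpha) for psi(x) = 0.5 * ||x||_2^2 with X a Euclidean ball of the
    given radius: the minimizer of <v, x> + ||x||^2 / (2 * alpha) over X is the
    Euclidean projection of -alpha * v onto the ball."""
    x = -alpha * v
    nrm = np.linalg.norm(x)
    return x * (radius / nrm) if nrm > radius else x

def proj_entropic(v, alpha):
    """Pi^psi(v, alpha) for the negative-entropy psi with X the probability simplex:
    the minimizer is proportional to exp(-alpha * v) (an exponentiated-gradient step)."""
    w = -alpha * v
    w -= w.max()              # numerical stabilization before exponentiating
    x = np.exp(w)
    return x / x.sum()
```

In the quadratic case the minimizer is simply the Euclidean projection of $-\alpha v$ onto the ball; in the entropic case it is a softmax-like exponentiated-gradient step.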
28 We assume that each function $f_i$ is L-Lipschitz with respect to the same norm $\|\cdot\|$—that is, $|f_i(x) - f_i(y)| \le L\|x - y\|$ for $x, y \in \mathcal{X}$. (1) [sent-60, score-0.522]
29 Many cost functions $f_i$ satisfy this type of Lipschitz condition, for instance, convex functions on a compact domain $\mathcal{X}$ or any polyhedral function on an arbitrary domain [8]. [sent-61, score-0.332]
30 The Lipschitz condition (1) implies that for any $x \in \mathcal{X}$ and any subgradient $g_i \in \partial f_i(x)$, we have $\|g_i\|_* \le L$, where $\|\cdot\|_*$ denotes the dual norm to $\|\cdot\|$, defined by $\|v\|_* := \sup_{\|u\|=1} \langle v, u\rangle$. [sent-62, score-0.928]
31 The dual averaging algorithm generates a sequence of iterates $\{x(t), z(t)\}_{t=0}^{\infty}$ contained within $\mathcal{X} \times \mathbb{R}^d$. [sent-63, score-0.473]
32 At time step t, the algorithm receives a subgradient $g(t) \in \partial f(x(t))$ and updates $z(t+1) = z(t) - g(t)$ and $x(t+1) = \Pi^{\psi}_{\mathcal{X}}(-z(t+1), \alpha(t))$, where $\{\alpha(t)\}_{t=0}^{\infty}$ is a sequence of positive stepsizes. [sent-64, score-0.302]
33 Intuitively, given the current iterate (x(t), z(t)), the next iterate x(t+1) is chosen to minimize an averaged first-order approximation to the function f, while the proximal function ψ and stepsize α(t) > 0 enforce that the iterates $\{x(t)\}_{t=0}^{\infty}$ do not oscillate wildly. [sent-66, score-0.253]
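A minimal sketch of the centralized update pair above, specialized (as an illustrative assumption, not the authors' exact setup) to the quadratic proximal function, a Euclidean-ball constraint, and $\alpha(t) = 1/\sqrt{t+1}$; the objective enters only through a caller-supplied subgradient oracle:

```python
import numpy as np

def dual_averaging(subgrad, T, d, radius=1.0):
    """Centralized dual averaging sketch: z(t+1) = z(t) - g(t) followed by the
    projection x(t+1) = Pi(-z(t+1), alpha(t)).  `subgrad(x)` must return an
    element of the subdifferential of f at x."""
    z = np.zeros(d)
    x = np.zeros(d)
    avg = np.zeros(d)
    for t in range(T):
        g = subgrad(x)
        z = z - g                          # accumulate negated subgradients
        alpha = 1.0 / np.sqrt(t + 1)
        x = alpha * z                      # unconstrained minimizer of <-z, x> + ||x||^2/(2*alpha)
        nrm = np.linalg.norm(x)
        if nrm > radius:                   # Euclidean projection onto X
            x = x * (radius / nrm)
        avg += x
    return avg / T                         # time-averaged iterate
```

For example, `dual_averaging(lambda x: np.sign(x - target), T=1000, d=len(target))` minimizes the nonsmooth objective $\|x - \text{target}\|_1$ over the unit ball (here `target` is a hypothetical vector chosen by the caller).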
34 Distributed dual averaging: Here we consider a novel extension of dual averaging to the distributed setting. [sent-69, score-0.834]
35 For all times t, each node i ∈ V maintains a pair of vectors (xi (t), zi (t)) ∈ X × Rd . [sent-70, score-0.33]
36 At iteration t, node i computes a subgradient gi (t) ∈ ∂fi (xi (t)) of the local function fi and receives {zj (t), j ∈ N (i)} from its neighbors. [sent-71, score-0.881]
37 Given a non-increasing sequence $\{\alpha(t)\}_{t=0}^{\infty}$ of positive stepsizes, each node $i \in V$ updates $z_i(t+1) = \sum_{j \in N(i)} P_{ji}\, z_j(t) - g_i(t)$ and $x_i(t+1) = \Pi^{\psi}_{\mathcal{X}}(-z_i(t+1), \alpha(t))$, (4) where the projection $\Pi^{\psi}_{\mathcal{X}}$ was defined in (3). [sent-75, score-0.903]
38 In words, node i computes the new dual parameter $z_i(t+1)$ from a weighted average of its own subgradient $g_i(t)$ and the parameters $\{z_j(t), j \in N(i)\}$ in its neighborhood; it then computes the local iterate $x_i(t+1)$ by a proximal projection. [sent-76, score-1.291]
39 We show convergence of the local sequence $\{x_i(t)\}_{t=1}^{\infty}$ to an optimum of the global objective via the local running average $\hat{x}_i(T) = \frac{1}{T}\sum_{t=1}^{T} x_i(t)$, which can evidently be computed in a decentralized manner. [sent-77, score-0.482]
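The distributed update (4) and the local running averages can be sketched as follows; the doubly stochastic matrix P, the Euclidean-ball constraint, the quadratic proximal function, and the $1/\sqrt{t}$ stepsize are illustrative assumptions rather than the authors' exact choices:

```python
import numpy as np

def distributed_dual_averaging(P, subgrads, T, d, radius=1.0):
    """Sketch of the distributed update (4) plus the local running averages.
    P is doubly stochastic; `subgrads[i](x)` returns an element of the
    subdifferential of the local function f_i at x."""
    n = P.shape[0]
    Z = np.zeros((n, d))                   # row i holds z_i(t)
    X = np.zeros((n, d))                   # row i holds x_i(t)
    S = np.zeros((n, d))                   # running sums for the local averages
    for t in range(T):
        G = np.stack([subgrads[i](X[i]) for i in range(n)])
        Z = P.T @ Z - G                    # z_i(t+1) = sum_{j in N(i)} P_ji z_j(t) - g_i(t)
        alpha = 1.0 / np.sqrt(t + 1)
        X = alpha * Z                      # quadratic-prox projection step, as in the sketch above
        nrm = np.linalg.norm(X, axis=1, keepdims=True)
        X = np.where(nrm > radius, X * (radius / np.maximum(nrm, 1e-12)), X)
        S += X
    return S / T                           # row i is the decentralized average hat{x}_i(T)
```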
40 Convergence of distributed dual averaging: We start with a result on the convergence of the distributed dual averaging algorithm that provides a decomposition of the error into an optimization term and the cost associated with network communication. [sent-80, score-1.296]
41 In order to state this theorem, we define the averaged dual variable $\bar{z}(t) := \frac{1}{n}\sum_{i=1}^{n} z_i(t)$, and we recall the local time-average $\hat{x}_i(T)$. [sent-81, score-0.566]
42 Given sequences $\{x_i(t)\}_{t=0}^{\infty}$ and $\{z_i(t)\}_{t=0}^{\infty}$ generated by the updates (4) with step size sequence $\{\alpha(t)\}_{t=0}^{\infty}$, for each node $i \in V$ and any $x^* \in \mathcal{X}$, we have $f(\hat{x}_i(T)) - f(x^*) \le \frac{1}{T\alpha(T)}\psi(x^*) + \frac{L^2}{2T}\sum_{t=1}^{T}\alpha(t-1) + \frac{3L}{T}\sum_{t=1}^{T}\alpha(t)\max_{j=1,\dots,n}\|\bar{z}(t) - z_j(t)\|_*$.
43 Theorem 1 guarantees that after T steps of the algorithm, every node i ∈ V has access to a locally defined quantity $\hat{x}_i(T)$ such that the difference $f(\hat{x}_i(T)) - f(x^*)$ is upper bounded by a sum of three terms. [sent-87, score-0.291]
44 The first two terms in the upper bound in the theorem are optimization error terms that are common to subgradient algorithms. [sent-88, score-0.385]
45 Thus, roughly, Theorem 1 ensures that as long as the bound on the deviation $\|\bar{z}(t) - z_i(t)\|_*$ is tight enough, for appropriately chosen α(t) (say $\alpha(t) \approx 1/\sqrt{t}$), the error of $\hat{x}_i(T)$ is small uniformly across all nodes i ∈ V. [sent-90, score-0.534]
46 Convergence rates and network topology: We now turn to an investigation of the effects of network topology on convergence rates. [sent-91, score-0.534]
47 In this section, we assume that the network topology is static and that communication occurs via a fixed doubly stochastic weight matrix P at every round. [sent-92, score-0.412]
48 As the following result shows, the convergence of our algorithm is controlled by the spectral gap γ(P ) := 1 − σ2 (P ) of P . [sent-94, score-0.301]
49 This theorem establishes a tight connection between the convergence rate of distributed subgradient methods and the spectral properties of the underlying network. [sent-104, score-0.73]
50 [11] establish rates for their Markov incremental gradient method (MIGD) of $\sqrt{n\Gamma_{ii}/T}$, where $\Gamma = (I - P + \mathbf{1}\mathbf{1}^{\top}/n)^{-1}$; performing an eigen-decomposition of the matrix $\Gamma$ shows that $n\Gamma_{ii}$ is always lower bounded by $1/(1 - \sigma_2(P))$, matching our bound in Theorem 2. [sent-107, score-0.226]
51 Using Theorem 2, one can derive explicit convergence rates for several classes of interesting networks, and Figure 1 illustrates four graph topologies of interest. [sent-108, score-0.366]
52 As a first example, the k-connected cycle in panel (a) is formed by placing n nodes on a circle and connecting each node to its k neighbors on the right and left. [sent-109, score-0.488]
53 The grid (panel (b)) is obtained by connecting nodes to their k nearest neighbors in axis-aligned directions. [sent-110, score-0.235]
54 In panel (c), we show a random geometric graph, constructed by placing nodes uniformly at random in $[0,1]^2$ and connecting any two nodes separated by a distance less than some radius r > 0. [sent-111, score-0.456]
55 Finally, panel (d) shows an instance of a bounded degree expander, which belongs to a special class of sparse graphs that have very good mixing properties [3]. [sent-113, score-0.349]
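For readers who want to experiment with these topologies, the following sketch builds adjacency matrices for the four families in Figure 1 using networkx (an assumed dependency; the constructions are illustrative and may differ in details from the authors' figures):

```python
import numpy as np
import networkx as nx   # assumed dependency for the graph constructions

def topology(kind, n, k=2, radius=0.3, degree=4, seed=0):
    """Adjacency matrices for the four graph families of Figure 1 (illustrative only)."""
    if kind == "cycle":          # k-connected cycle: each node linked to its k neighbors per side
        G = nx.circulant_graph(n, list(range(1, k + 1)))
    elif kind == "grid":         # two-dimensional grid on roughly sqrt(n) x sqrt(n) nodes
        side = int(round(np.sqrt(n)))
        G = nx.grid_2d_graph(side, side)
    elif kind == "geometric":    # random geometric graph: nodes in [0,1]^2, connect if distance < radius
        G = nx.random_geometric_graph(n, radius, seed=seed)
    elif kind == "expander":     # random degree-regular graph, an expander with high probability
        G = nx.random_regular_graph(degree, n, seed=seed)   # requires n * degree to be even
    else:
        raise ValueError(f"unknown topology: {kind}")
    return nx.to_numpy_array(G)
```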
56 For many random graph models, a typical sample is an expander with high probability. [sent-114, score-0.314]
57 In addition, there are several deterministic constructions of expanders that are degree regular. [sent-117, score-0.229]
58 In order to state explicit convergence rates, we need to specify a particular choice of the matrix P that respects the graph structure. [sent-119, score-0.222]
59 For each node $i \in V$, let $\delta_i = |N(i)| = \sum_{j=1}^{n} A_{ij}$ denote the degree of node i and define the diagonal matrix $D = \mathrm{diag}\{\delta_1, \dots, \delta_n\}$.
60 We state the results in terms of the optimization error achieved after T iterations and the number of iterations $T_G(\epsilon; n)$ required to achieve error $\epsilon$ for network type G with n nodes. [sent-127, score-0.378]
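The exact matrix $P_n(G)$ built from D and A is not reproduced in this excerpt; as one hedged illustration, the sketch below forms a standard doubly stochastic choice (a lazy max-degree random walk) and evaluates the spectral gap $\gamma(P) = 1 - \sigma_2(P)$ that controls the rates above.

```python
import numpy as np

def doubly_stochastic(A):
    """One standard doubly stochastic matrix respecting the graph structure: a lazy
    max-degree random walk P = I - (D - A) / (2 * delta_max).  This is an assumption
    for illustration; the paper's exact P_n(G) is not reproduced in this excerpt."""
    deg = A.sum(axis=1)
    D = np.diag(deg)
    return np.eye(len(A)) - (D - A) / (2.0 * deg.max())

def spectral_gap(P):
    """gamma(P) = 1 - sigma_2(P), with sigma_2 the second-largest singular value of P."""
    s = np.linalg.svd(P, compute_uv=False)
    return 1.0 - s[1]

# Up to constants, the predicted iteration count is T_G(eps; n) ~ 1 / (spectral_gap(P) * eps**2).
```

Combined with the topology sketch above, this reproduces the qualitative scalings behind Corollary 1: $\gamma(P) = \Theta(1/n^2)$ for the single cycle, $\Theta(1/n)$ for the two-dimensional grid, and $\Theta(1)$ for a bounded degree expander.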
61 By comparison, the results in the paper [11] give similar bounds for grids and cycles, but for d-dimensional grids we have $T_G(\epsilon; n) = O(n^{2/d}/\epsilon^2)$ while MIGD achieves $T_G(\epsilon; n) = O(n/\epsilon^2)$; for expanders and the complete graph MIGD achieves $T_G(\epsilon; n) = O(n/\epsilon^2)$. [sent-135, score-0.509]
62 Up to logarithmic factors, the optimization term in the convergence rate is always of the order $RL/\sqrt{T}$, while the remaining terms vary depending on the network topology. [sent-137, score-0.279]
63 On one hand, it is known that even for centralized optimization algorithms, any subgradient method requires at least $\Omega(1/\epsilon^2)$ iterations to achieve $\epsilon$-accuracy [19], so that the $1/\epsilon^2$ term is unavoidable. [sent-140, score-0.415]
64 The next proposition addresses the complementary issue, namely whether the inverse spectral gap term is unavoidable for the dual averaging algorithm. [sent-141, score-0.631]
65 For the quadratic proximal function $\psi(x) = \frac{1}{2}\|x\|_2^2$, the following result establishes a lower bound on the number of iterations in terms of graph topology and network structure: Proposition 1. [sent-142, score-0.617]
66 Consider the dual averaging algorithm (4) with quadratic proximal function and communication matrix Pn (G). [sent-143, score-0.669]
67 For any graph G with n nodes, the number of iterations $T_G(c; n)$ required to achieve a fixed accuracy c > 0 is lower bounded as $T_G(c; n) = \Omega\big(\frac{1}{1 - \sigma_2(P_n(G))}\big)$. [sent-144, score-0.27]
68 Indeed, in Section 5, we show that the theoretical scalings from Corollary 1—namely, quadratic, linear, and constant in network size n—are well-matched in simulations of our algorithm. [sent-147, score-0.22]
69 4 Proof sketches Setting up the analysis: Using techniques similar to some past work [18], we establish convergence via the two sequences $\bar{z}(t) := \frac{1}{n}\sum_{i=1}^{n} z_i(t)$ and $y(t) := \Pi^{\psi}_{\mathcal{X}}(-\bar{z}(t), \alpha(t))$. [sent-148, score-0.214]
70 The average sum of gradients $\bar{z}(t)$ evolves in a very simple way: in particular, we have $\bar{z}(t+1) = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{n} P_{ji}\big(z_j(t) - \bar{z}(t)\big) + \bar{z}(t) - \frac{1}{n}\sum_{j=1}^{n} g_j(t) = \bar{z}(t) - \frac{1}{n}\sum_{j=1}^{n} g_j(t)$, (6) where the second equality follows from the double-stochasticity of P. [sent-149, score-0.318]
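A quick numerical check of (6), under the same illustrative assumptions as the sketches above: because P is doubly stochastic, the mixing step leaves the average of the dual variables unchanged, so $\bar{z}(t)$ evolves exactly like a centralized accumulation of the average subgradient.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 3
A = (rng.random((n, n)) < 0.4).astype(float)     # a random symmetric adjacency matrix
A = np.triu(A, 1)
A = A + A.T
deg = A.sum(axis=1)
P = np.eye(n) - (np.diag(deg) - A) / (2.0 * max(deg.max(), 1.0))   # doubly stochastic

Z = rng.normal(size=(n, d))      # arbitrary dual variables z_i(t)
G = rng.normal(size=(n, d))      # arbitrary subgradients g_i(t)
Z_next = P.T @ Z - G             # the distributed update (4)

# Equation (6): the new average equals the old average minus the average subgradient.
assert np.allclose(Z_next.mean(axis=0), Z.mean(axis=0) - G.mean(axis=0))
```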
71 The simple evolution (6) of the averaged dual sequence allows us to avoid difficulties with the non-linearity of projection that have been challenging in earlier work. [sent-150, score-0.27]
72 Before proceeding with the proof of Theorem 1, we state a few useful results regarding the convergence of the standard dual averaging algorithm [20]. [sent-151, score-0.566]
73 Then for each $i \in V$ and any $x^* \in \mathcal{X}$, we have $\sum_{t=1}^{T}\big[f(x_i(t)) - f(x^*)\big] \le \sum_{t=1}^{T}\big[f(y(t)) - f(x^*)\big] + L\sum_{t=1}^{T}\alpha(t)\,\|\bar{z}(t) - z_i(t)\|_*$. [sent-158, score-0.214]
74 For any $x^* \in \mathcal{X}$, $\sum_{t=1}^{T} f(y(t)) - f(x^*) = \frac{1}{n}\sum_{t=1}^{T}\sum_{i=1}^{n}\big[f_i(x_i(t)) - f_i(x^*)\big] + \frac{1}{n}\sum_{t=1}^{T}\sum_{i=1}^{n}\big[f_i(y(t)) - f_i(x_i(t))\big] \le \frac{1}{n}\sum_{t=1}^{T}\sum_{i=1}^{n}\big[f_i(x_i(t)) - f_i(x^*)\big] + \frac{L}{n}\sum_{t=1}^{T}\sum_{i=1}^{n}\|y(t) - x_i(t)\|$, (7) by the L-Lipschitz continuity of the $f_i$. [sent-161, score-1.172]
75 Letting $g_i(t) \in \partial f_i(x_i(t))$ be a subgradient of $f_i$ at $x_i(t)$, $\frac{1}{n}\sum_{t=1}^{T}\sum_{i=1}^{n}\big[f_i(x_i(t)) - f_i(x^*)\big] \le \frac{1}{n}\sum_{t=1}^{T}\sum_{i=1}^{n}\big[\langle g_i(t), y(t) - x^*\rangle + \langle g_i(t), x_i(t) - y(t)\rangle\big]$. (8) [sent-162, score-1.982]
76 By definition of $\bar{z}(t)$ and $y(t)$, we have $y(t) = \arg\min_{x\in\mathcal{X}}\big\{\frac{1}{n}\sum_{s=1}^{t-1}\sum_{i=1}^{n}\langle g_i(s), x\rangle + \frac{1}{\alpha(t)}\psi(x)\big\}$. [sent-163, score-0.239]
77 Thus, we see that the first term in the decomposition (8) can be written in the same way as the bound in Lemma 2, and as a consequence, we have the bound $\frac{1}{n}\sum_{t=1}^{T}\sum_{i=1}^{n}\langle g_i(t), y(t) - x^*\rangle \le \frac{L^2}{2}\sum_{t=1}^{T}\alpha(t-1) + \frac{1}{\alpha(T)}\psi(x^*)$. [sent-164, score-0.327]
78 Since $\|g_i(t)\|_* \le L$ by assumption, we use the $\alpha$-Lipschitz continuity of the projection $\Pi^{\psi}_{\mathcal{X}}(\cdot, \alpha)$ [9, Theorem X. [sent-166, score-0.239]
79 1] to see that $\frac{L}{n}\sum_{t=1}^{T}\sum_{i=1}^{n}\|y(t) - x_i(t)\| + \frac{1}{n}\sum_{t=1}^{T}\sum_{i=1}^{n}\langle g_i(t), x_i(t) - y(t)\rangle \le \frac{2L}{n}\sum_{t=1}^{T}\sum_{i=1}^{n}\|y(t) - x_i(t)\| = \frac{2L}{n}\sum_{t=1}^{T}\sum_{i=1}^{n}\|\Pi^{\psi}_{\mathcal{X}}(-\bar{z}(t), \alpha(t)) - \Pi^{\psi}_{\mathcal{X}}(-z_i(t), \alpha(t))\| \le \frac{2L}{n}\sum_{t=1}^{T}\sum_{i=1}^{n}\alpha(t)\,\|\bar{z}(t) - z_i(t)\|_*$. (10) [sent-169, score-0.837]
80 Combining the bounds above, $\sum_{t=1}^{T}\big[f(x_i(t)) - f(x^*)\big]$ is upper bounded by $\frac{1}{\alpha(T)}\psi(x^*) + \frac{L^2}{2}\sum_{t=1}^{T}\alpha(t-1) + \frac{2L}{n}\sum_{t=1}^{T}\sum_{j=1}^{n}\alpha(t)\,\|\bar{z}(t) - z_j(t)\|_* + L\sum_{t=1}^{T}\alpha(t)\,\|\bar{z}(t) - z_i(t)\|_*$. (11) [sent-171, score-0.384]
81 We focus on controlling the network error term in Theorem 1, $L\sum_{t=1}^{T}\sum_{i=1}^{n}\alpha(t)\,\|\bar{z}(t) - z_i(t)\|_*$. [sent-184, score-0.356]
82 Then $z_i(t+1) = \sum_{j=1}^{n}[\Phi(t,s)]_{ji}\,z_j(s) - \sum_{r=s+1}^{t}\sum_{j=1}^{n}[\Phi(t,r)]_{ji}\,g_j(r-1) - g_i(t)$. (12) [sent-187, score-0.735]
83 Recalling that $z_i(0) = 0$, we use (12) to see that $z_i(t) - \bar{z}(t) = \sum_{s=1}^{t-1}\sum_{j=1}^{n}\big(1/n - [\Phi(t-1,s)]_{ji}\big)\,g_j(s-1) + \frac{1}{n}\sum_{j=1}^{n}\big(g_j(t-1) - g_i(t-1)\big)$. [sent-193, score-0.667]
84 5 Simulations In this section, we report experimental results on the network scaling behavior of the distributed dual averaging algorithm as a function of the graph structure and number of processors n. [sent-208, score-1.065]
85 For all experiments reported here, we consider distributed minimization of a sum of hinge losses. [sent-210, score-0.234]
86 Given the shorthand notation $[c]_+ := \max\{0, c\}$, the hinge loss associated with a linear classifier based on x is given by $f_i(x) = [1 - y_i\langle a_i, x\rangle]_+$. [sent-212, score-0.315]
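A sketch of this local objective and one of its subgradients (the tie-breaking choice exactly at the kink is an arbitrary but valid one):

```python
import numpy as np

def hinge_loss(x, a_i, y_i):
    """Local hinge loss f_i(x) = [1 - y_i * <a_i, x>]_+ for one example (a_i, y_i)."""
    return max(0.0, 1.0 - y_i * float(np.dot(a_i, x)))

def hinge_subgrad(x, a_i, y_i):
    """A subgradient of f_i at x: -y_i * a_i when the margin is violated, else 0
    (0 is also a valid choice exactly at the kink)."""
    if 1.0 - y_i * float(np.dot(a_i, x)) > 0.0:
        return -y_i * a_i
    return np.zeros_like(x)
```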
87 Plot of the function error versus the number of iterations for a grid graph. [sent-220, score-0.214]
88 Each plot shows the number of iterations required to reach a fixed accuracy (vertical axis) versus the network size n (horizontal axis). [sent-225, score-0.279]
89 Panels show the same plot for different graph topologies: (a) single cycle; (b) two-dimensional grid; and (c) bounded degree expander. [sent-226, score-0.24]
90 Figure 2 provides plots of the function error $\max_i\big[f(\hat{x}_i(T)) - f(x^*)\big]$ versus the number of iterations for grid graphs with a varying number of nodes n ∈ {225, 400, 625}. [sent-227, score-0.467]
91 In addition to demonstrating convergence, these plots also show how the convergence time scales as a function of the graph size. [sent-228, score-0.222]
92 In Figure 3, we compare the theoretical predictions of Corollary 1 with the actual behavior of dual subgradient averaging. [sent-230, score-0.523]
93 Each panel shows the function $T_G(\epsilon; n)$ versus the graph size n for the fixed value $\epsilon = 0.1$. [sent-231, score-0.268]
94 The three different panels correspond to different graph types: cycles (a), grids (b), and expanders (c). [sent-232, score-0.436]
95 In particular, panel (a) exhibits the quadratic scaling predicted for the cycle, panel (b) exhibits the linear scaling expected for the grid, and panel (c) shows that expander graphs have the desirable property of having constant network scaling. [sent-236, score-0.915]
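A rough sketch of how such a scaling experiment can be assembled from the pieces above; the data split, weight matrix, stepsize, constraint radius, and the externally supplied optimum value f_star are all illustrative assumptions, not the authors' experimental protocol:

```python
import numpy as np

def iterations_to_accuracy(A, data, labels, f_star, eps=0.1, radius=5.0, max_T=50000):
    """Sketch of the T_G(eps; n) measurement: run distributed dual averaging with local
    hinge losses on the graph with adjacency matrix A and return the first iteration at
    which max_i f(hat{x}_i(t)) - f_star <= eps.  f_star is assumed to be supplied by the
    caller (e.g. from a long centralized run)."""
    n, d = len(A), data.shape[1]
    deg = A.sum(axis=1)
    P = np.eye(n) - (np.diag(deg) - A) / (2.0 * deg.max())   # doubly stochastic weights
    blocks = np.array_split(np.arange(len(data)), n)         # examples assigned to each node

    def f(x):                                                # global average hinge loss
        return float(np.mean(np.maximum(0.0, 1.0 - labels * (data @ x))))

    Z = np.zeros((n, d)); X = np.zeros((n, d)); S = np.zeros((n, d))
    for t in range(1, max_T + 1):
        G = np.zeros((n, d))
        for i, idx in enumerate(blocks):                     # local subgradients g_i(t)
            viol = 1.0 - labels[idx] * (data[idx] @ X[i]) > 0
            if viol.any():
                G[i] = -(labels[idx][viol][:, None] * data[idx][viol]).sum(axis=0) / len(idx)
        Z = P.T @ Z - G                                      # the update (4)
        X = Z / np.sqrt(t)                                   # quadratic-prox step, alpha(t) = 1/sqrt(t)
        nrm = np.linalg.norm(X, axis=1, keepdims=True)
        X = np.where(nrm > radius, X * (radius / np.maximum(nrm, 1e-12)), X)
        S += X
        if max(f(S[i] / t) for i in range(n)) - f_star <= eps:
            return t
    return max_T
```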
96 6 Conclusions In this paper, we have developed and analyzed an efficient algorithm for distributed optimization based on dual averaging of subgradients. [sent-237, score-0.65]
97 Our results show an inverse scaling in the spectral gap of the graph, and we showed that this prediction is tight in general via a matching lower bound. [sent-239, score-0.305]
98 Dual averaging for distributed optimization: convergence analysis and network scaling. [sent-271, score-0.625]
99 A randomized incremental subgradient method for distributed optimization in networked systems. [sent-313, score-0.491]
100 Hitting times, commute distances, and the spectral gap for large random geometric graphs. [sent-378, score-0.242]
wordName wordTfidf (topN-words)
[('fi', 0.261), ('gi', 0.239), ('subgradient', 0.226), ('dual', 0.224), ('zi', 0.214), ('averaging', 0.203), ('expander', 0.189), ('distributed', 0.183), ('gj', 0.159), ('network', 0.142), ('tg', 0.138), ('xi', 0.128), ('expanders', 0.126), ('graph', 0.125), ('zj', 0.123), ('node', 0.116), ('spectral', 0.116), ('nodes', 0.115), ('pn', 0.106), ('panel', 0.104), ('proximal', 0.103), ('pij', 0.103), ('iterations', 0.098), ('convergence', 0.097), ('migd', 0.094), ('nedic', 0.094), ('stepsizes', 0.094), ('communication', 0.094), ('rates', 0.093), ('graphs', 0.091), ('gap', 0.088), ('processors', 0.086), ('decentralized', 0.083), ('doubly', 0.083), ('ji', 0.079), ('grid', 0.077), ('johansson', 0.076), ('theorem', 0.075), ('convex', 0.071), ('grids', 0.071), ('cycle', 0.069), ('degree', 0.068), ('rd', 0.068), ('scaling', 0.068), ('stepsize', 0.068), ('corollary', 0.067), ('rl', 0.065), ('processor', 0.062), ('topology', 0.06), ('panels', 0.058), ('cycles', 0.056), ('aa', 0.056), ('pji', 0.055), ('ai', 0.054), ('tx', 0.053), ('minimization', 0.051), ('centralized', 0.051), ('lemar', 0.051), ('topologies', 0.051), ('aij', 0.049), ('wireless', 0.047), ('maxi', 0.047), ('bounded', 0.047), ('sensor', 0.046), ('sequence', 0.046), ('quadratic', 0.045), ('excellent', 0.044), ('bound', 0.044), ('ram', 0.043), ('url', 0.043), ('connecting', 0.043), ('proof', 0.042), ('incremental', 0.042), ('placing', 0.041), ('agent', 0.041), ('iterate', 0.041), ('agreement', 0.041), ('symmetric', 0.04), ('optimization', 0.04), ('mixing', 0.039), ('receives', 0.039), ('theoretical', 0.039), ('versus', 0.039), ('simulations', 0.039), ('communicate', 0.039), ('nonsmooth', 0.039), ('inversely', 0.039), ('lemma', 0.038), ('geometric', 0.038), ('log', 0.037), ('subgradients', 0.037), ('updates', 0.037), ('agents', 0.036), ('regular', 0.035), ('nesterov', 0.034), ('behavior', 0.034), ('chains', 0.034), ('careful', 0.034), ('stochastic', 0.033), ('tight', 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 63 nips-2010-Distributed Dual Averaging In Networks
Author: Alekh Agarwal, Martin J. Wainwright, John C. Duchi
Abstract: The goal of decentralized optimization over a network is to optimize a global objective formed by a sum of local (possibly nonsmooth) convex functions using only local computation and communication. We develop and analyze distributed algorithms based on dual averaging of subgradients, and provide sharp bounds on their convergence rates as a function of the network size and topology. Our analysis clearly separates the convergence of the optimization algorithm itself from the effects of communication constraints arising from the network structure. We show that the number of iterations required by our algorithm scales inversely in the spectral gap of the network. The sharpness of this prediction is confirmed both by theoretical lower bounds and simulations for various networks. 1
2 0.15942694 170 nips-2010-Moreau-Yosida Regularization for Grouped Tree Structure Learning
Author: Jun Liu, Jieping Ye
Abstract: We consider the tree structured group Lasso where the structure over the features can be represented as a tree with leaf nodes as features and internal nodes as clusters of the features. The structured regularization with a pre-defined tree structure is based on a group-Lasso penalty, where one group is defined for each node in the tree. Such a regularization can help uncover the structured sparsity, which is desirable for applications with some meaningful tree structures on the features. However, the tree structured group Lasso is challenging to solve due to the complex regularization. In this paper, we develop an efficient algorithm for the tree structured group Lasso. One of the key steps in the proposed algorithm is to solve the Moreau-Yosida regularization associated with the grouped tree structure. The main technical contributions of this paper include (1) we show that the associated Moreau-Yosida regularization admits an analytical solution, and (2) we develop an efficient algorithm for determining the effective interval for the regularization parameter. Our experimental results on the AR and JAFFE face data sets demonstrate the efficiency and effectiveness of the proposed algorithm.
3 0.15591404 224 nips-2010-Regularized estimation of image statistics by Score Matching
Author: Diederik P. Kingma, Yann L. Cun
Abstract: Score Matching is a recently-proposed criterion for training high-dimensional density models for which maximum likelihood training is intractable. It has been applied to learning natural image statistics but has so-far been limited to simple models due to the difficulty of differentiating the loss with respect to the model parameters. We show how this differentiation can be automated with an extended version of the double-backpropagation algorithm. In addition, we introduce a regularization term for the Score Matching loss that enables its use for a broader range of problem by suppressing instabilities that occur with finite training sample sizes and quantized input values. Results are reported for image denoising and super-resolution.
4 0.14139533 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
Author: Alekh Agarwal, Sahand Negahban, Martin J. Wainwright
Abstract: Many statistical M -estimators are based on convex optimization problems formed by the weighted sum of a loss function with a norm-based regularizer. We analyze the convergence rates of first-order gradient methods for solving such problems within a high-dimensional framework that allows the data dimension d to grow with (and possibly exceed) the sample size n. This high-dimensional structure precludes the usual global assumptions— namely, strong convexity and smoothness conditions—that underlie classical optimization analysis. We define appropriately restricted versions of these conditions, and show that they are satisfied with high probability for various statistical models. Under these conditions, our theory guarantees that Nesterov’s first-order method [12] has a globally geometric rate of convergence up to the statistical precision of the model, meaning the typical Euclidean distance between the true unknown parameter θ∗ and the optimal solution θ. This globally linear rate is substantially faster than previous analyses of global convergence for specific methods that yielded only sublinear rates. Our analysis applies to a wide range of M -estimators and statistical models, including sparse linear regression using Lasso ( 1 regularized regression), group Lasso, block sparsity, and low-rank matrix recovery using nuclear norm regularization. Overall, this result reveals an interesting connection between statistical precision and computational efficiency in high-dimensional estimation. 1
5 0.12464343 243 nips-2010-Smoothness, Low Noise and Fast Rates
Author: Nathan Srebro, Karthik Sridharan, Ambuj Tewari
Abstract: We establish an excess risk bound of $\tilde{O}\big(H R_n^2 + \sqrt{H L^*}\,R_n\big)$ for ERM with an H-smooth loss function and a hypothesis class with Rademacher complexity $R_n$, where $L^*$ is the best risk achievable by the hypothesis class. For typical hypothesis classes where $R_n = \sqrt{R/n}$, this translates to a learning rate of $\tilde{O}(RH/n)$ in the separable ($L^* = 0$) case and $\tilde{O}\big(RH/n + \sqrt{L^* RH/n}\big)$ more generally. We also provide similar guarantees for online and stochastic convex optimization of a smooth non-negative objective. 1
6 0.12384148 270 nips-2010-Tight Sample Complexity of Large-Margin Learning
7 0.11680659 13 nips-2010-A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction
8 0.11308835 181 nips-2010-Network Flow Algorithms for Structured Sparsity
9 0.11108281 117 nips-2010-Identifying graph-structured activation patterns in networks
10 0.11065662 85 nips-2010-Exact learning curves for Gaussian process regression on large random graphs
11 0.10749649 148 nips-2010-Learning Networks of Stochastic Differential Equations
12 0.10436456 202 nips-2010-Parallelized Stochastic Gradient Descent
13 0.10355358 163 nips-2010-Lower Bounds on Rate of Convergence of Cutting Plane Methods
14 0.10305879 30 nips-2010-An Inverse Power Method for Nonlinear Eigenproblems with Applications in 1-Spectral Clustering and Sparse PCA
15 0.10061869 214 nips-2010-Probabilistic Belief Revision with Structural Constraints
16 0.098679736 151 nips-2010-Learning from Candidate Labeling Sets
17 0.09839011 150 nips-2010-Learning concept graphs from text with stick-breaking priors
18 0.096343115 162 nips-2010-Link Discovery using Graph Feature Tracking
19 0.095300518 32 nips-2010-Approximate Inference by Compilation to Arithmetic Circuits
20 0.092692681 12 nips-2010-A Primal-Dual Algorithm for Group Sparse Regularization with Overlapping Groups
topicId topicWeight
[(0, 0.288), (1, 0.055), (2, 0.154), (3, 0.103), (4, -0.019), (5, -0.059), (6, 0.001), (7, -0.004), (8, 0.059), (9, 0.07), (10, -0.089), (11, -0.056), (12, -0.068), (13, 0.097), (14, -0.034), (15, -0.117), (16, 0.021), (17, -0.106), (18, -0.062), (19, -0.021), (20, 0.101), (21, -0.066), (22, 0.024), (23, -0.041), (24, 0.06), (25, 0.118), (26, 0.096), (27, -0.065), (28, 0.091), (29, -0.038), (30, -0.024), (31, 0.048), (32, 0.154), (33, -0.089), (34, -0.071), (35, 0.015), (36, -0.083), (37, 0.064), (38, -0.032), (39, -0.004), (40, -0.083), (41, -0.001), (42, 0.11), (43, 0.099), (44, 0.141), (45, 0.053), (46, -0.06), (47, -0.089), (48, 0.044), (49, 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.96067953 63 nips-2010-Distributed Dual Averaging In Networks
Author: Alekh Agarwal, Martin J. Wainwright, John C. Duchi
Abstract: The goal of decentralized optimization over a network is to optimize a global objective formed by a sum of local (possibly nonsmooth) convex functions using only local computation and communication. We develop and analyze distributed algorithms based on dual averaging of subgradients, and provide sharp bounds on their convergence rates as a function of the network size and topology. Our analysis clearly separates the convergence of the optimization algorithm itself from the effects of communication constraints arising from the network structure. We show that the number of iterations required by our algorithm scales inversely in the spectral gap of the network. The sharpness of this prediction is confirmed both by theoretical lower bounds and simulations for various networks. 1
2 0.65540212 190 nips-2010-On the Convexity of Latent Social Network Inference
Author: Seth Myers, Jure Leskovec
Abstract: In many real-world scenarios, it is nearly impossible to collect explicit social network data. In such cases, whole networks must be inferred from underlying observations. Here, we formulate the problem of inferring latent social networks based on network diffusion or disease propagation data. We consider contagions propagating over the edges of an unobserved social network, where we only observe the times when nodes became infected, but not who infected them. Given such node infection times, we then identify the optimal network that best explains the observed data. We present a maximum likelihood approach based on convex programming with a l1 -like penalty term that encourages sparsity. Experiments on real and synthetic data reveal that our method near-perfectly recovers the underlying network structure as well as the parameters of the contagion propagation model. Moreover, our approach scales well as it can infer optimal networks of thousands of nodes in a matter of minutes.
3 0.63399297 126 nips-2010-Inference with Multivariate Heavy-Tails in Linear Models
Author: Danny Bickson, Carlos Guestrin
Abstract: Heavy-tailed distributions naturally occur in many real life problems. Unfortunately, it is typically not possible to compute inference in closed-form in graphical models which involve such heavy-tailed distributions. In this work, we propose a novel simple linear graphical model for independent latent random variables, called linear characteristic model (LCM), defined in the characteristic function domain. Using stable distributions, a heavy-tailed family of distributions which is a generalization of Cauchy, Lévy and Gaussian distributions, we show for the first time, how to compute both exact and approximate inference in such a linear multivariate graphical model. LCMs are not limited to stable distributions, in fact LCMs are always defined for any random variables (discrete, continuous or a mixture of both). We provide a realistic problem from the field of computer networks to demonstrate the applicability of our construction. Other potential application is iterative decoding of linear channels with non-Gaussian noise. 1
4 0.62937135 224 nips-2010-Regularized estimation of image statistics by Score Matching
Author: Diederik P. Kingma, Yann L. Cun
Abstract: Score Matching is a recently-proposed criterion for training high-dimensional density models for which maximum likelihood training is intractable. It has been applied to learning natural image statistics but has so-far been limited to simple models due to the difficulty of differentiating the loss with respect to the model parameters. We show how this differentiation can be automated with an extended version of the double-backpropagation algorithm. In addition, we introduce a regularization term for the Score Matching loss that enables its use for a broader range of problem by suppressing instabilities that occur with finite training sample sizes and quantized input values. Results are reported for image denoising and super-resolution.
5 0.62936872 117 nips-2010-Identifying graph-structured activation patterns in networks
Author: James Sharpnack, Aarti Singh
Abstract: We consider the problem of identifying an activation pattern in a complex, largescale network that is embedded in very noisy measurements. This problem is relevant to several applications, such as identifying traces of a biochemical spread by a sensor network, expression levels of genes, and anomalous activity or congestion in the Internet. Extracting such patterns is a challenging task specially if the network is large (pattern is very high-dimensional) and the noise is so excessive that it masks the activity at any single node. However, typically there are statistical dependencies in the network activation process that can be leveraged to fuse the measurements of multiple nodes and enable reliable extraction of highdimensional noisy patterns. In this paper, we analyze an estimator based on the graph Laplacian eigenbasis, and establish the limits of mean square error recovery of noisy patterns arising from a probabilistic (Gaussian or Ising) model based on an arbitrary graph structure. We consider both deterministic and probabilistic network evolution models, and our results indicate that by leveraging the network interaction structure, it is possible to consistently recover high-dimensional patterns even when the noise variance increases with network size. 1
6 0.60225064 58 nips-2010-Decomposing Isotonic Regression for Efficiently Solving Large Problems
7 0.60053188 3 nips-2010-A Bayesian Framework for Figure-Ground Interpretation
8 0.58535063 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
9 0.56972659 289 nips-2010-b-Bit Minwise Hashing for Estimating Three-Way Similarities
10 0.56907874 163 nips-2010-Lower Bounds on Rate of Convergence of Cutting Plane Methods
11 0.5617758 204 nips-2010-Penalized Principal Component Regression on Graphs for Analysis of Subnetworks
12 0.56022531 105 nips-2010-Getting lost in space: Large sample analysis of the resistance distance
13 0.54212254 30 nips-2010-An Inverse Power Method for Nonlinear Eigenproblems with Applications in 1-Spectral Clustering and Sparse PCA
14 0.53982675 214 nips-2010-Probabilistic Belief Revision with Structural Constraints
15 0.53655505 202 nips-2010-Parallelized Stochastic Gradient Descent
16 0.53467172 32 nips-2010-Approximate Inference by Compilation to Arithmetic Circuits
17 0.5250861 259 nips-2010-Subgraph Detection Using Eigenvector L1 Norms
18 0.52339357 84 nips-2010-Exact inference and learning for cumulative distribution functions on loopy graphs
19 0.51931232 148 nips-2010-Learning Networks of Stochastic Differential Equations
20 0.50386989 150 nips-2010-Learning concept graphs from text with stick-breaking priors
topicId topicWeight
[(13, 0.046), (17, 0.013), (27, 0.073), (30, 0.076), (35, 0.028), (45, 0.225), (50, 0.075), (52, 0.037), (59, 0.142), (60, 0.067), (77, 0.071), (78, 0.036), (90, 0.043)]
simIndex simValue paperId paperTitle
1 0.94618517 4 nips-2010-A Computational Decision Theory for Interactive Assistants
Author: Alan Fern, Prasad Tadepalli
Abstract: We study several classes of interactive assistants from the points of view of decision theory and computational complexity. We first introduce a class of POMDPs called hidden-goal MDPs (HGMDPs), which formalize the problem of interactively assisting an agent whose goal is hidden and whose actions are observable. In spite of its restricted nature, we show that optimal action selection in finite horizon HGMDPs is PSPACE-complete even in domains with deterministic dynamics. We then introduce a more restricted model called helper action MDPs (HAMDPs), where the assistant’s action is accepted by the agent when it is helpful, and can be easily ignored by the agent otherwise. We show classes of HAMDPs that are complete for PSPACE and NP along with a polynomial time class. Furthermore, we show that for general HAMDPs a simple myopic policy achieves a regret, compared to an omniscient assistant, that is bounded by the entropy of the initial goal distribution. A variation of this policy is shown to achieve worst-case regret that is logarithmic in the number of goals for any goal distribution. 1
2 0.91413206 206 nips-2010-Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine
Author: George Dahl, Marc'aurelio Ranzato, Abdel-rahman Mohamed, Geoffrey E. Hinton
Abstract: Straightforward application of Deep Belief Nets (DBNs) to acoustic modeling produces a rich distributed representation of speech data that is useful for recognition and yields impressive results on the speaker-independent TIMIT phone recognition task. However, the first-layer Gaussian-Bernoulli Restricted Boltzmann Machine (GRBM) has an important limitation, shared with mixtures of diagonalcovariance Gaussians: GRBMs treat different components of the acoustic input vector as conditionally independent given the hidden state. The mean-covariance restricted Boltzmann machine (mcRBM), first introduced for modeling natural images, is a much more representationally efficient and powerful way of modeling the covariance structure of speech data. Every configuration of the precision units of the mcRBM specifies a different precision matrix for the conditional distribution over the acoustic space. In this work, we use the mcRBM to learn features of speech data that serve as input into a standard DBN. The mcRBM features combined with DBNs allow us to achieve a phone error rate of 20.5%, which is superior to all published results on speaker-independent TIMIT to date. 1
same-paper 3 0.90733659 63 nips-2010-Distributed Dual Averaging In Networks
Author: Alekh Agarwal, Martin J. Wainwright, John C. Duchi
Abstract: The goal of decentralized optimization over a network is to optimize a global objective formed by a sum of local (possibly nonsmooth) convex functions using only local computation and communication. We develop and analyze distributed algorithms based on dual averaging of subgradients, and provide sharp bounds on their convergence rates as a function of the network size and topology. Our analysis clearly separates the convergence of the optimization algorithm itself from the effects of communication constraints arising from the network structure. We show that the number of iterations required by our algorithm scales inversely in the spectral gap of the network. The sharpness of this prediction is confirmed both by theoretical lower bounds and simulations for various networks. 1
4 0.90360767 254 nips-2010-Stability Approach to Regularization Selection (StARS) for High Dimensional Graphical Models
Author: Han Liu, Kathryn Roeder, Larry Wasserman
Abstract: A challenging problem in estimating high-dimensional graphical models is to choose the regularization parameter in a data-dependent way. The standard techniques include K-fold cross-validation (K-CV), Akaike information criterion (AIC), and Bayesian information criterion (BIC). Though these methods work well for low-dimensional problems, they are not suitable in high dimensional settings. In this paper, we present StARS: a new stability-based method for choosing the regularization parameter in high dimensional inference for undirected graphs. The method has a clear interpretation: we use the least amount of regularization that simultaneously makes a graph sparse and replicable under random sampling. This interpretation requires essentially no conditions. Under mild conditions, we show that StARS is partially sparsistent in terms of graph estimation: i.e. with high probability, all the true edges will be included in the selected model even when the graph size diverges with the sample size. Empirically, the performance of StARS is compared with the state-of-the-art model selection procedures, including K-CV, AIC, and BIC, on both synthetic data and a real microarray dataset. StARS outperforms all these competing procedures.
5 0.87217432 207 nips-2010-Phoneme Recognition with Large Hierarchical Reservoirs
Author: Fabian Triefenbach, Azarakhsh Jalalvand, Benjamin Schrauwen, Jean-pierre Martens
Abstract: Automatic speech recognition has gradually improved over the years, but the reliable recognition of unconstrained speech is still not within reach. In order to achieve a breakthrough, many research groups are now investigating new methodologies that have potential to outperform the Hidden Markov Model technology that is at the core of all present commercial systems. In this paper, it is shown that the recently introduced concept of Reservoir Computing might form the basis of such a methodology. In a limited amount of time, a reservoir system that can recognize the elementary sounds of continuous speech has been built. The system already achieves a state-of-the-art performance, and there is evidence that the margin for further improvements is still significant. 1
6 0.87023753 83 nips-2010-Evidence-Specific Structures for Rich Tractable CRFs
7 0.86982465 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing
8 0.8671183 117 nips-2010-Identifying graph-structured activation patterns in networks
9 0.86609888 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
10 0.86386639 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning
11 0.86361915 222 nips-2010-Random Walk Approach to Regret Minimization
12 0.86338753 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes
13 0.86230415 148 nips-2010-Learning Networks of Stochastic Differential Equations
14 0.86178839 155 nips-2010-Learning the context of a category
15 0.86085159 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
16 0.86041099 265 nips-2010-The LASSO risk: asymptotic results and real world examples
17 0.85941869 158 nips-2010-Learning via Gaussian Herding
18 0.85884809 282 nips-2010-Variable margin losses for classifier design
19 0.85753006 122 nips-2010-Improving the Asymptotic Performance of Markov Chain Monte-Carlo by Inserting Vortices
20 0.85748625 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework