jmlr jmlr2010 jmlr2010-76 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Pradeep Ravikumar, Alekh Agarwal, Martin J. Wainwright
Abstract: The problem of computing a maximum a posteriori (MAP) configuration is a central computational challenge associated with Markov random fields. There has been some focus on “tree-based” linear programming (LP) relaxations for the MAP problem. This paper develops a family of super-linearly convergent algorithms for solving these LPs, based on proximal minimization schemes using Bregman divergences. As with standard message-passing on graphs, the algorithms are distributed and exploit the underlying graphical structure, and so scale well to large problems. Our algorithms have a double-loop character, with the outer loop corresponding to the proximal sequence, and an inner loop of cyclic Bregman projections used to compute each proximal update. We establish convergence guarantees for our algorithms, and illustrate their performance via some simulations. We also develop two classes of rounding schemes, deterministic and randomized, for obtaining integral configurations from the LP solutions. Our deterministic rounding schemes use a “re-parameterization” property of our algorithms so that when the LP solution is integral, the MAP solution can be obtained even before the LP-solver converges to the optimum. We also propose graph-structured randomized rounding schemes applicable to iterative LP-solving algorithms in general. We analyze the performance of and report simulations comparing these rounding schemes. Keywords: graphical models, MAP Estimation, LP relaxation, proximal minimization, rounding schemes
Reference: text
sentIndex sentText sentNum sentScore
1 Our algorithms have a double-loop character, with the outer loop corresponding to the proximal sequence, and an inner loop of cyclic Bregman projections used to compute each proximal update. [sent-12, score-0.702]
2 We also develop two classes of rounding schemes, deterministic and randomized, for obtaining integral configurations from the LP solutions. [sent-14, score-0.409]
3 Our deterministic rounding schemes use a “re-parameterization” property of our algorithms so that when the LP solution is integral, the MAP solution can be obtained even before the LP-solver converges to the optimum. [sent-15, score-0.444]
4 We also propose graph-structured randomized rounding schemes applicable to iterative LP-solving algorithms in general. [sent-16, score-0.466]
5 Keywords: graphical models, MAP Estimation, LP relaxation, proximal minimization, rounding schemes. [sent-18, score-0.696]
6 Our second contribution is to develop various types of rounding schemes that allow for early termination of LP-solving algorithms. [sent-81, score-0.424]
7 Our use of rounding is rather different: instead, we consider rounding schemes applied to problems for which the LP solution is integral, so that rounding would be unnecessary if the LP were solved to optimality. [sent-85, score-1.15]
8 Our deterministic rounding schemes apply to a class of algorithms which, like the proximal minimization algorithms that we propose, maintain a certain invariant of the original problem. [sent-87, score-0.716]
9 We also propose and analyze a class of graph-structured randomized rounding procedures that apply to any algorithm that approaches the optimal LP solution from the interior of the relaxed polytope. [sent-88, score-0.405]
10 We analyze these rounding schemes, and give finite bounds on the number of iterations required for the rounding schemes to obtain an integral MAP solution. [sent-89, score-0.837]
11 Section 4 is devoted to the development and analysis of rounding schemes, both for our proximal schemes as well as other classes of LP-solving algorithms. [sent-93, score-0.696]
12 These pseudomarginals are constrained to be non-negative, as well as to normalize and be locally consistent in the following sense: $\sum_{x_s \in \mathcal{X}} \mu_s(x_s) = 1$ for all $s \in V$, and $\sum_{x_t \in \mathcal{X}} \mu_{st}(x_s, x_t) = \mu_s(x_s)$ for all $(s,t) \in E$, $x_s \in \mathcal{X}$. [sent-113, score-1.844]
13 The LP relaxation is based on maximizing the linear function $\langle \theta, \mu \rangle := \sum_{s \in V} \sum_{x_s} \theta_s(x_s)\,\mu_s(x_s) + \sum_{(s,t) \in E} \sum_{x_s, x_t} \theta_{st}(x_s, x_t)\,\mu_{st}(x_s, x_t)$ (4), subject to the constraint $\mu \in \mathbb{L}(G)$. [sent-115, score-1.841]
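As a quick illustration of the objects above, here is a minimal Python/NumPy sketch (not from the paper) of evaluating the linear objective in (4) and checking membership in the local polytope $\mathbb{L}(G)$; the dictionary layout `mu_node[s]` (length-$m$ vector) and `mu_edge[(s,t)]` ($m \times m$ array with rows indexed by $x_s$) is an assumption made for illustration.

```python
import numpy as np

def lp_objective(theta_node, theta_edge, mu_node, mu_edge):
    """Evaluate <theta, mu> over node and edge pseudomarginals."""
    val = sum(float(np.sum(theta_node[s] * mu_node[s])) for s in theta_node)
    val += sum(float(np.sum(theta_edge[e] * mu_edge[e])) for e in theta_edge)
    return val

def in_local_polytope(mu_node, mu_edge, tol=1e-8):
    """Check non-negativity, normalization, and edge-to-node marginalization."""
    for s, mu_s in mu_node.items():
        if (mu_s < -tol).any() or abs(mu_s.sum() - 1.0) > tol:
            return False
    for (s, t), mu_st in mu_edge.items():
        if (mu_st < -tol).any():
            return False
        # summing out x_t must recover mu_s; summing out x_s must recover mu_t
        if not np.allclose(mu_st.sum(axis=1), mu_node[s], atol=tol):
            return False
        if not np.allclose(mu_st.sum(axis=0), mu_node[t], atol=tol):
            return False
    return True
```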
14 Proximal Minimization Schemes We begin by defining the notion of a proximal minimization scheme, and various types of divergences (among them Bregman) that we use to define our proximal sequences. [sent-124, score-0.563]
15 For appropriate choice of weights and proximal functions, these proximal minimization schemes converge to the LP optimum with at least geometric and possibly superlinear rates (Bertsekas and Tsitsiklis, 1997; Iusem and Teboulle, 1995). [sent-143, score-0.656]
16 Quadratic divergence. This choice is the simplest, and corresponds to setting the inducing Bregman function $f$ in (6) to be the quadratic function $q(\mu) := \frac{1}{2}\Big[\sum_{s\in V}\sum_{x_s\in\mathcal{X}} \mu_s^2(x_s) + \sum_{(s,t)\in E}\sum_{(x_s,x_t)\in\mathcal{X}\times\mathcal{X}} \mu_{st}^2(x_s,x_t)\Big]$, defined over nodes and edges of the graph. [sent-156, score-1.197]
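A corresponding sketch of this quadratic Bregman function under the same illustrative layout (an assumption, not the paper's code):

```python
import numpy as np

def quadratic_bregman(mu_node, mu_edge):
    """q(mu) = (1/2) * [sum of squared node entries + sum of squared edge entries]."""
    val = sum(float(np.sum(m ** 2)) for m in mu_node.values())
    val += sum(float(np.sum(m ** 2)) for m in mu_edge.values())
    return 0.5 * val
```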
17 2 Proximal Sequences via Bregman Projection The key in designing an efficient proximal minimization scheme is ensuring that the proximal sequence {µn } can be computed efficiently. [sent-187, score-0.58]
18 In this section, we first describe how sequences of proximal minimizations (when the proximal function is a Bregman divergence) can be reformulated as a particular Bregman projection. [sent-188, score-0.544]
19 Algorithm 2 (Quadratic messages for $\mu^{n+1}$) Initialization: $\mu_{st}^{(n,0)}(x_s, x_t) = \mu_{st}^{(n)}(x_s, x_t) + w^n \theta_{st}(x_s, x_t)$ and $\mu_s^{(n,0)}(x_s) = \mu_s^{(n)}(x_s) + w^n \theta_s(x_s)$ (19). [sent-251, score-0.543]
20 End for. For each node $s \in V$: $\mu_s^{(n,\tau+1)}(x_s) = \mu_s^{(n,\tau)}(x_s) + \frac{1}{m}\big(1 - \sum_{x_s} \mu_s^{(n,\tau)}(x_s)\big)$; then $C_s(x_s) = \min\{Z_s(x_s),\, \mu_s^{(n,\tau+1)}(x_s)\}$, $Z_s(x_s) = Z_s(x_s) - C_s(x_s)$, and $\mu_s^{(n,\tau+1)}(x_s) = \mu_s^{(n,\tau+1)}(x_s) - C_s(x_s)$. [sent-254, score-0.746]
21 Repeat for each edge $(s,t) \in E$: $\mu_{st}^{(n,\tau+1)}(x_s, x_t) = \mu_{st}^{(n,\tau)}(x_s, x_t)\Big[\frac{\mu_s^{(n,\tau)}(x_s)}{\sum_{x_t} \mu_{st}^{(n,\tau)}(x_s, x_t)}\Big]^{\alpha_s/(\alpha_s+\alpha_{st})}$ and $\mu_s^{(n,\tau+1)}(x_s) = \big[\mu_s^{(n,\tau)}(x_s)\big]^{\alpha_s/(\alpha_s+\alpha_{st})}\big[\sum_{x_t} \mu_{st}^{(n,\tau)}(x_s, x_t)\big]^{\alpha_{st}/(\alpha_s+\alpha_{st})}$ (22). [sent-257, score-0.756]
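A sketch of the entropic projection step (22) onto a single edge-marginalization constraint, following the reconstruction above (the array layout and function name are illustrative assumptions; `mu_st` has rows indexed by $x_s$, and `alpha_s`, `alpha_st` are the node and edge entropy weights):

```python
import numpy as np

def entropic_edge_projection(mu_s, mu_st, alpha_s, alpha_st):
    """Project (mu_s, mu_st) onto {sum_{x_t} mu_st(x_s, x_t) = mu_s(x_s)}
    under the weighted entropic (KL) Bregman divergence, per update (22)."""
    row_marg = mu_st.sum(axis=1)            # current sum over x_t, one entry per x_s
    w = alpha_s / (alpha_s + alpha_st)
    ratio = (mu_s / row_marg) ** w
    mu_st_new = mu_st * ratio[:, None]      # edge update: scale each row of mu_st
    # node update: weighted geometric mean of mu_s and the row marginal
    mu_s_new = (mu_s ** w) * (row_marg ** (1.0 - w))
    return mu_s_new, mu_st_new
```

After this step, summing `mu_st_new` over $x_t$ reproduces `mu_s_new`, so the marginalization constraint holds exactly.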
22 At a high-level, for any Bregman proximal function, convergence follows from two sets of known results: (a) convergence of proximal algorithms; and (b) convergence of cyclic Bregman projections. [sent-265, score-0.626]
23 (a) Update the parameters: $\theta_s^n(x_s) = \omega^n \theta_s(x_s) + \log \mu_s^n(x_s) + 1$, and $\theta_{st}^n(x_s, x_t) = \omega^n \theta_{st}(x_s, x_t) + \rho_{st}\log\frac{\mu_{st}^n(x_s, x_t)}{\sum_{x_s'} \mu_{st}^n(x_s', x_t)\,\sum_{x_t'} \mu_{st}^n(x_s, x_t')} - 1$. [sent-271, score-1.134]
24 (b) Run a convergent TRW-solver on a graphical model with parameters $\theta^n$, so as to compute $\mu^{n+1} = \arg\min_{\nu \in \mathbb{L}(G)} \big\{ -\langle \theta^n, \nu\rangle + f_{\mathrm{trw}}(\nu)\big\}$. [sent-272, score-0.83]
25 (a) Using the quadratic proximal function and positive weight sequence ωn → +∞ satisfying infinite travel, the proximal sequence {µn } converges superlinearly. [sent-277, score-0.563]
26 (b) Using the entropic proximal function and positive weight sequence ωn satisfying infinite travel, the proximal sequence {µn } converges: (i) superlinearly if ωn → 0, and (ii) at least linearly if 1/ωn ≥ c for some constant c > 0. [sent-278, score-0.635]
27 We note that the rate-of-convergence results for the outer proximal loops assume that the proximal update (computed within each inner loop) has been performed exactly. [sent-285, score-0.6]
28 That is, if the proximal optimization function at outer iteration $n$ is $h_n(\mu)$ with minimum $\mu^{n+1}$, then the computed proximal update $\hat\mu^{n+1}$ is sub-optimal, with $h_n(\hat\mu^{n+1}) - h_n(\mu^{n+1}) \le \varepsilon$. [sent-287, score-0.597]
29 Such rounding schemes may either be randomized or deterministic. [sent-322, score-0.466]
30 In this section, we show that rounding schemes can be useful even when the LP optimum is integral, since they may permit an LP-solving algorithm to be finitely terminated—that is, before it has actually solved the LP—while retaining the same optimality guarantees about the final output. [sent-328, score-0.447]
31 An attractive feature of our proximal Bregman procedures is the existence of precisely such rounding schemes—namely, that under certain conditions, rounding pseudomarginals at intermediate iterations yields the integral LP optimum. [sent-329, score-1.11]
32 We describe these rounding schemes in the following sections, and provide two kinds of results. [sent-330, score-0.424]
33 We provide certificates under which the rounded solution is guaranteed to be MAP optimal; moreover, we provide upper bounds on the number of outer iterations required for the rounding scheme to obtain the LP optimum. [sent-331, score-0.483]
34 In Section 4.1, we describe and analyze deterministic rounding schemes that are specifically tailored to the proximal Bregman procedures that we have described. [sent-333, score-0.716]
35 In Section 4.2, we propose and analyze a graph-structured randomized rounding scheme, which applies not only to our proximal Bregman procedures, but more broadly to any algorithm that generates a sequence of iterates contained within the local polytope $\mathbb{L}(G)$. [sent-335, score-0.714]
36 1 Deterministic Rounding Schemes We begin by describing three deterministic rounding schemes that exploit the particular structure of the Bregman proximal updates. [sent-337, score-0.716]
37 Node-based rounding. This method is the simplest of the deterministic rounding procedures, and applies to the quadratic and entropic updates. [sent-340, score-0.493]
38 It operates as follows: given the vector $\mu^n$ of pseudomarginals at iteration $n$, obtain an integral configuration $x^n(\mu^n) \in \mathcal{X}^N$ by choosing $x_s^n \in \arg\max_{x_s' \in \mathcal{X}} \mu_s^n(x_s')$ for each $s \in V$. [sent-341, score-1.549]
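A one-line sketch of node-based rounding under the same illustrative layout as before:

```python
import numpy as np

def node_rounding(mu_node):
    """Round each node to the label with the largest pseudomarginal."""
    return {s: int(np.argmax(mu_s)) for s, mu_s in mu_node.items()}
```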
39 Neighborhood-based rounding. This rounding scheme applies to all three proximal schemes. [sent-345, score-0.671]
40 (a) Define the neighborhood-based energy function $F_s(x; \mu^n) := \begin{cases} 2\mu_s^n(x_s) + \sum_{t \in N(s)} \mu_{st}^n(x_s, x_t) & \text{for QUA} \\ 2\alpha_s \log \mu_s^n(x_s) + \sum_{t \in N(s)} \alpha_{st} \log \mu_{st}^n(x_s, x_t) & \text{for ENT} \\ 2\log \mu_s^n(x_s) + \sum_{t \in N(s)} \rho_{st} \log \frac{\mu_{st}^n(x_s, x_t)}{\mu_s^n(x_s)\,\mu_t^n(x_t)} & \text{for TRW.} \end{cases}$ [sent-348, score-0.736]
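For the quadratic (QUA) case, the star-shaped energy $F_s$ above can be maximized by scanning $x_s$ and choosing the best $x_t$ independently for each neighbor. A hedged sketch (illustrative names; checking that overlapping stars agree, as required for admissibility, is left to the caller):

```python
import numpy as np

def neighborhood_round_qua(s, neighbors, mu_node, mu_edge):
    """Maximize F_s(x; mu) = 2*mu_s(x_s) + sum_{t in N(s)} mu_st(x_s, x_t)
    over the star rooted at s; returns a label for s and each neighbor."""
    best_val, best_assign = -np.inf, None
    for x_s in range(len(mu_node[s])):
        val, assign = 2.0 * mu_node[s][x_s], {s: x_s}
        for t in neighbors:
            # orient the edge table so that rows are indexed by x_s
            table = mu_edge[(s, t)] if (s, t) in mu_edge else mu_edge[(t, s)].T
            x_t = int(np.argmax(table[x_s]))
            val += table[x_s, x_t]
            assign[t] = x_t
        if val > best_val:
            best_val, best_assign = val, assign
    return best_assign
```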
41 Tree-based rounding. This method applies to all three proximal schemes, but most naturally to the TRW proximal method. [sent-353, score-0.544]
42 For $i = 1, \ldots, K$: (a) Define the tree-structured energy function $F_i(x; \mu^n) := \begin{cases} \sum_{s\in V} \log \mu_s^n(x_s) + \sum_{(s,t)\in E(T_i)} \frac{1}{\rho_{st}} \log \mu_{st}^n(x_s, x_t) & \text{for QUA} \\ \sum_{s\in V} \alpha_s \log \mu_s^n(x_s) + \sum_{(s,t)\in E(T_i)} \frac{\alpha_{st}}{\rho_{st}} \log \mu_{st}^n(x_s, x_t) & \text{for ENT} \\ \sum_{s\in V} \log \mu_s^n(x_s) + \sum_{(s,t)\in E(T_i)} \log \frac{\mu_{st}^n(x_s, x_t)}{\mu_s^n(x_s)\,\mu_t^n(x_t)} & \text{for TRW.} \end{cases}$ [sent-365, score-0.77]
43 Theorem 2 (Deterministic rounding with MAP certificate) Consider a sequence of iterates {µn } generated by the quadratic or entropic proximal schemes. [sent-376, score-0.763]
44 Second, given a rounding scheme that maximizes tractable sub-parts of the reparameterized cost function, the rounding is said to be admissible if these partial solutions agree with one another. [sent-388, score-0.762]
45 Our deterministic rounding schemes and optimality guarantees follow this approach, as we detail in the proof of Theorem 2. [sent-389, score-0.444]
46 An attractive feature of all the rounding schemes that we consider is their relatively low computational cost. [sent-399, score-0.424]
47 Bounds on iterations for deterministic rounding. Of course, the natural question is how many iterations are sufficient for a given rounding scheme to succeed. [sent-406, score-0.423]
48 Then we have the following guarantees: (a) for quadratic and entropic schemes, all three types of rounding recover the MAP solution once $\|\mu^n - \mu^*\|_\infty \le 1/2$. [sent-408, score-0.473]
49 (b) for the TRW-based proximal method, tree-based rounding recovers the MAP solution once $\|\mu^n - \mu^*\|_\infty \le \frac{1}{4N}$. [sent-409, score-0.635]
50 Thus, node-based rounding returns $x_s^*$ at each node $s$, and hence overall, it returns the MAP configuration $x^*$. [sent-413, score-1.109]
51 The same argument also shows that the inequality $\mu_{st}^n(x_s^*, x_t^*) > \frac{1}{2}$ holds, which implies that $(x_s^*, x_t^*) = \arg\max_{x_s, x_t} \mu_{st}^n(x_s, x_t)$ for all $(s,t) \in E$. [sent-414, score-0.488]
52 Indeed, consider the set $S = \{x \in \mathcal{X}^N \mid x \ne x^*\}$; by the union bound, we have $P(S) = P[\exists s \in V \mid x_s \ne x_s^*] \le \sum_{s=1}^N P(x_s \ne x_s^*) = \sum_{s=1}^N \big(1 - \mu_s(x_s^*)\big) \le \frac{1}{4}$, showing that we have $P(x^*) \ge 3/4$, so that $x^*$ must be the MAP configuration. [sent-426, score-1.42]
53 Consequently, after $n^* := \frac{\log\big(C\,|f(\mu^*)|\big)}{\log(1/\gamma)}$ iterations, the rounding scheme would be guaranteed to recover the MAP configuration for the entropic proximal method. [sent-436, score-0.762]
54 Similar finite iteration bounds can also be obtained for the other proximal methods, showing finite convergence through use of our rounding schemes. [sent-437, score-0.673]
55 Note that we proved correctness of the neighborhood and tree-based rounding schemes by leveraging the correctness of the node-based rounding scheme. [sent-438, score-0.806]
56 In practice, it is possible for neighborhood- or tree-based rounding to succeed even if node-based rounding fails; however, we currently do not have any sharper sufficient conditions for these rounding schemes. [sent-439, score-1.089]
57 In particular, we state the following simple lemma: Lemma 5 The rounding procedures have the following guarantees: (a) Any edge-consistent configuration from node rounding maximizes F(x; µn ) for the quadratic and entropic schemes. [sent-453, score-0.872]
58 (b) Any neighborhood-consistent configuration from neighborhood rounding maximizes F(x; µn ) for the quadratic and entropic schemes. [sent-454, score-0.492]
59 By definition, it maximizes $\mu_s^n(x_s)$ for all $s \in V$, and $\mu_{st}^n(x_s, x_t)$ for all $(s,t) \in E$, and so by inspection, also maximizes $F(x; \mu^n)$ for the quadratic and entropic cases. [sent-458, score-0.759]
60 Randomized rounding schemes. The schemes considered in the previous section were all deterministic, since (disregarding any possible ties) the output of the rounding procedure was a deterministic function of the given pseudomarginals $\{\mu_s^n, \mu_{st}^n\}$. [sent-470, score-0.506]
61 In this section, we consider randomized rounding procedures, in which the output is a random variable. [sent-471, score-0.692]
62 Perhaps the most naive randomized rounding scheme is the following: for each node $r \in V$, assign it value $x_r \in \mathcal{X}$ with probability $\mu_r^n(x_r)$. [sent-472, score-0.477]
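A sketch of this naive scheme (illustrative names; `rng` is a NumPy random generator):

```python
import numpy as np

def naive_randomized_rounding(mu_node, rng=None):
    """Independently sample a label for each node from its node pseudomarginal."""
    rng = rng if rng is not None else np.random.default_rng()
    return {r: int(rng.choice(len(mu_r), p=mu_r / mu_r.sum()))
            for r, mu_r in mu_node.items()}
```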
63 We propose a graph-structured generalization of this naive randomized rounding scheme, in which we perform the rounding in a dependent way across sub-groups of nodes, and establish guarantees for its success. [sent-473, score-0.768]
64 In particular, we show that when the LP relaxation has a unique integral optimum that is well-separated from the second best configuration, then the rounding scheme succeeds with high probability after a pre-specified number of iterations. [sent-474, score-0.484]
65 The randomized rounding scheme. Our randomized rounding scheme is based on any given subset $E'$ of the edge set $E$. [sent-477, score-0.473]
66 For any subset $E' \subseteq E$, Algorithm 5 defines our randomized rounding scheme. [sent-493, score-0.405]
67 The randomized rounding scheme can be “derandomized” so that we obtain a deterministic solution $x^d(\mu^n; E')$ that does at least as well as the randomized scheme does in expectation. [sent-505, score-0.562]
68 Algorithm 6 (Derandomized rounding) Initialize: $\bar\mu = \mu^n$. [sent-513, score-1.42]
69 For $i = 1, \ldots, K$ do: solve $x_{V_i}^d = \arg\max_{x_{V_i}} \sum_{s \in V_i}\Big[\theta_s(x_s) + \sum_{t:\,(s,t)\in E'} \sum_{x_t} \bar\mu_t(x_t)\,\theta_{st}(x_s, x_t)\Big] + \sum_{(s,t)\in E_i} \theta_{st}(x_s, x_t)$. [sent-517, score-0.563]
70 Update $\bar\mu$: $\bar\mu_s(x_s) = \begin{cases} \bar\mu_s(x_s) & \text{if } s \notin V_i \\ 0 & \text{if } s \in V_i,\ x_s \ne x_s^d \\ 1 & \text{if } s \in V_i,\ x_s = x_s^d. \end{cases}$ [sent-518, score-2.84]
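A hedged sketch of one pass of this derandomized scheme, following the reconstructed objective and update above (the data layout and the brute-force maximization over $x_{V_i}$ are illustrative assumptions; the paper's groups $V_i$ and edge sets $E_i$, $E'$ would allow this maximization to be done by dynamic programming on trees):

```python
import itertools
import numpy as np

def derandomized_step(V_i, E_i, E_prime, theta_node, theta_edge, mu_bar):
    """Maximize the group objective over x_{V_i}, then freeze mu_bar on V_i
    to the indicator of the maximizer (one step of the derandomized rounding)."""
    m = len(next(iter(mu_bar.values())))              # number of labels per node
    best_val, best_x = -np.inf, None
    for labels in itertools.product(range(m), repeat=len(V_i)):
        assign = dict(zip(V_i, labels))
        val = 0.0
        for s in V_i:
            val += theta_node[s][assign[s]]
            for (u, v) in E_prime:                    # soft contribution from outside V_i
                if u == s and v not in assign:
                    val += float(theta_edge[(u, v)][assign[s], :] @ mu_bar[v])
                elif v == s and u not in assign:
                    val += float(mu_bar[u] @ theta_edge[(u, v)][:, assign[s]])
        for (s, t) in E_i:                            # edges inside the group
            val += theta_edge[(s, t)][assign[s], assign[t]]
        if val > best_val:
            best_val, best_x = val, assign
    for s in V_i:                                     # freeze to an indicator
        mu_bar[s] = np.zeros(m)
        mu_bar[s][best_x[s]] = 1.0
    return best_x
```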
71 Optimality guarantees for randomized rounding. We show, in this section, that when the pseudomarginals $\mu^n$ are within a specified $\ell_1$-norm ball around the unique MAP optimum $\mu^*$, the randomized rounding scheme outputs the MAP configuration with high probability. [sent-538, score-0.526]
72 For the naive rounding based on $E' = \emptyset$, the sequence $\{\mu^n\}$ need not belong to $\mathbb{L}(G)$, but instead need only satisfy the milder conditions $\mu_s^n(x_s) \ge 0$ for all $s \in V$ and $x_s \in \mathcal{X}$, and $\sum_{x_s} \mu_s^n(x_s) = 1$ for all $s \in V$. [sent-544, score-1.073]
73 The derandomized rounding scheme enjoys a similar guarantee, as shown in the following theorem, proved in Appendix F. [sent-545, score-0.425]
74 If at some iteration $n$, we have $\mu^n \in \mathbb{L}(G)$ and $\|\mu^n - \mu^*\|_1 \le \frac{\Delta(\theta; G)}{1 + d(E')}$, then the $(E', \mu^n)$-derandomized rounding scheme in Algorithm 6 outputs the MAP solution, $x^d(\mu^n; E') = x^*$. [sent-547, score-0.443]
75 Then: (a) The randomized rounding in Algorithm 5 succeeds with probability at least $1-\varepsilon$ for all iterations greater than $n^* := \frac{\frac{1}{2}\log\big(Nm + N^2 m^2\big) + \log\|\mu^0 - \mu^*\|_2 + \log\frac{1+d(E')}{\Delta(\theta;G)} + \log(1/\varepsilon)}{\log(1/\gamma)}$. [sent-556, score-0.48]
76 (b) The derandomized rounding in Algorithm 6 yields the MAP solution for all iterations greater than $n^* := \frac{\frac{1}{2}\log\big(Nm + N^2 m^2\big) + \log\|\mu^0 - \mu^*\|_2 + \log\frac{1+d(E')}{\Delta(\theta;G)}}{\log(1/\gamma)}$. [sent-557, score-0.464]
77 Moreover, Theorems 7 and 8 provide an $\ell_1$-ball radius such that the rounding schemes succeed (either with probability greater than $1-\varepsilon$, or deterministically) for all pseudomarginal vectors within these balls. [sent-559, score-0.424]
78 The edge potentials were set to Potts functions, of the form $\theta_{st}(x_s, x_t) = \begin{cases} \beta_{st} & \text{if } x_s = x_t \\ 0 & \text{otherwise.} \end{cases}$ [sent-563, score-1.104]
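A tiny sketch of building this Potts potential (`m` is the label count; the values are assumptions standing in for the experiments' settings):

```python
import numpy as np

def potts_potential(beta_st, m):
    """theta_st(x_s, x_t) = beta_st if x_s == x_t, else 0."""
    return beta_st * np.eye(m)
```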
79 In applying all of the proximal procedures, we set the proximal weights as ωn = n. [sent-568, score-0.544]
80 (b) Plot of distance $\log_{10}\|\mu^n - \mu^*\|_2$ between the current entropic proximal iterate $\mu^n$ and the LP optimum $\mu^*$ versus iteration number, for Potts models on grids with $m = 5$ labels, $N = 900$ vertices, and a range of signal-to-noise ratios SNR ∈ {0. [sent-596, score-0.407]
81 In Figure 3, we compare two of our proximal schemes—the entropic and the quadratic schemes— with a subgradient descent method, as previously proposed (Feldman et al. [sent-603, score-0.414]
82 Plotted in Figure 3(a) are the log probabilities of the solutions from the TRW-proximal and entropic proximal methods, compared to the dual upper bound that is provided by the sub-gradient method. [sent-608, score-0.41]
83 The TRW-based proximal scheme converges the fastest, essentially within four outer iterations, whereas the entropic scheme requires a few more iterations. [sent-612, score-0.467]
84 Comparison of rounding schemes. In Figure 4, we compare five of our rounding schemes on a Potts model on grid graphs with $N = 400$, $m = 3$ labels, and SNR = 2. [sent-617, score-0.424]
85 For the deterministic rounding schemes, we used the node-based, neighborhood-based and the tree-based rounding schemes. [sent-631, score-0.746]
86 Panel (a) of Figure 4 shows rounding schemes as applied to the entropic proximal algorithm, whereas panel (b) shows rounding schemes applied to the TRW proximal scheme. [sent-632, score-1.483]
87 We have also developed a series of graph-structured rounding schemes that can be used to generate integral solutions along with a certificate of optimality. [sent-638, score-0.45]
88 Figure 4: Plots of the log probability of rounded solutions versus the number of iterations for the entropic proximal scheme (panel (a)), and the TRW proximal scheme (panel (b)). [sent-661, score-0.776]
89 In both cases, five different rounding schemes are compared: node-based randomized rounding (Node Rand. [sent-662, score-0.829]
90 It would be interesting to modify our algorithms so that maintaining these explicitly could be avoided; note that our derandomized rounding scheme (Algorithm 4. [sent-673, score-0.425]
91 We use Zs (xs ) as the Lagrange variables for the node positivity constraints µs (xs ) ≥ 0, and Zst (xs , xt ) for the edge-positivity constraints µst (xs , xt ) ≥ 0. [sent-702, score-0.457]
92 Consider the subset $L_i(G)$ defined by the marginalization constraint along edge $(s,t)$, namely $\sum_{x_t' \in \mathcal{X}} \mu_{st}(x_s, x_t') = \mu_s(x_s)$ for each $x_s \in \mathcal{X}$. [sent-712, score-0.786]
93 Denoting the dual (Lagrange) parameters corresponding to these constraints by $\lambda_{st}(x_s)$, the KKT conditions are given by $\nabla h\big(\mu_{st}^{n,\tau+1}(x_s, x_t)\big) = \nabla h\big(\mu_{st}^{n,\tau}(x_s, x_t)\big) + \lambda_{st}(x_s)$ and $\nabla h\big(\mu_s^{n,\tau+1}(x_s)\big) = \nabla h\big(\mu_s^{n,\tau}(x_s)\big) - \lambda_{st}(x_s)$. [sent-713, score-0.989]
94 Solving for the optimum $\mu = \mu^{n+1}$ yields $\frac{2\alpha_s}{\omega^n}\log\mu_s^{n+1}(x_s) = \theta_s(x_s) + \frac{2\alpha_s}{\omega^n}\log\mu_s^n(x_s) - \sum_{t\in N(s)}\lambda_{ts}(x_s) + C$, and $\frac{2\alpha_{st}}{\omega^n}\log\mu_{st}^{n+1}(x_s, x_t) = \theta_{st}(x_s, x_t) + \frac{2\alpha_{st}}{\omega^n}\log\mu_{st}^n(x_s, x_t) - \gamma_{st}(x_s, x_t) + \lambda_{ts}(x_s) + \lambda_{st}(x_t) + C'$. [sent-724, score-1.283]
95 We now show by induction that the de-randomized rounding scheme achieves cost at least as large as the expected value $H(\mu^n; E')$. [sent-736, score-1.109]
96 It will be convenient to use the decomposition $G = G_i + G_{\setminus i}$, where $G_i(\bar\mu) := \sum_{s\in V_i}\sum_{x_s}\bar\mu_s(x_s)\Big[\theta_s(x_s) + \sum_{\{t \,\mid\, (s,t)\in E'\}}\sum_{x_t}\bar\mu_t(x_t)\,\theta_{st}(x_s, x_t)\Big] + \sum_{(s,t)\in E_i}\sum_{x_s,x_t}\bar\mu_{st}(x_s, x_t)\,\theta_{st}(x_s, x_t)$, and $G_{\setminus i} = G - G_i$. [sent-740, score-2.144]
97 The updated pseudomarginals $\bar\mu^{(i)}$ at the end of the $i$-th step of the algorithm are given by $\bar\mu_s^{(i)}(x_s) = \begin{cases} \bar\mu_s^{(i-1)}(x_s) & \text{if } s \notin V_i \\ 0 & \text{if } s \in V_i,\ X_{d,s} \ne x_s \\ 1 & \text{if } s \in V_i,\ X_{d,s} = x_s. \end{cases}$ [sent-743, score-1.482]
98 In asserting this inequality, we have used the fact that the matrix with entries given by $\mu_s^*(x_s)\,\mu_t^*(x_t) - \mu_{st}^n(x_s, x_t)$ is a difference of probability distributions, meaning that all its entries are between −1 and 1, and their sum is zero. [sent-761, score-1.465]
99 Combining the pieces, we obtain $\langle\theta, \mu^*\rangle - E[\langle\theta, R(\mu^n)\rangle] \le \delta_G(\theta)\Big[\|\mu^n - \mu^*\|_1 + \sum_{s\in V} d(s; E')\sum_{x_s}|\mu_s^n(x_s) - \mu_s^*(x_s)|\Big] \le \big(1 + d(E')\big)\,\delta_G(\theta)\,\|\mu^n - \mu^*\|_1$. [sent-763, score-0.764]
100 Consequently, conditioning on whether the rounding succeeds or fails, we have $\langle\theta,\mu^*\rangle - E[\langle\theta, R(\mu^n)\rangle] \ge p_{\mathrm{succ}}\big(\langle\theta,\mu^*\rangle - \langle\theta,\mu^*\rangle\big) + (1 - p_{\mathrm{succ}})\max_{x \ne x^*}\big[F(x^*;\theta) - F(x;\theta)\big] = (1 - p_{\mathrm{succ}})\max_{x \ne x^*}\big[F(x^*;\theta) - F(x;\theta)\big]$. [sent-765, score-0.429]
wordName wordTfidf (topN-words)
[('xs', 0.71), ('rounding', 0.363), ('st', 0.287), ('proximal', 0.272), ('xt', 0.181), ('lp', 0.162), ('bregman', 0.131), ('entropic', 0.091), ('avikumar', 0.084), ('wainwright', 0.083), ('essage', 0.079), ('rograms', 0.079), ('pseudomarginals', 0.062), ('schemes', 0.061), ('agarwal', 0.058), ('guration', 0.057), ('xvi', 0.057), ('inear', 0.05), ('raph', 0.05), ('passing', 0.044), ('randomized', 0.042), ('map', 0.04), ('snr', 0.038), ('node', 0.036), ('scheme', 0.036), ('relaxation', 0.036), ('ftrw', 0.035), ('zenios', 0.035), ('trw', 0.034), ('censor', 0.034), ('outer', 0.032), ('edge', 0.032), ('subgradient', 0.032), ('cyclic', 0.031), ('projections', 0.031), ('cst', 0.031), ('ti', 0.03), ('potts', 0.03), ('dual', 0.03), ('xv', 0.029), ('superlinear', 0.028), ('rounded', 0.028), ('ei', 0.028), ('updates', 0.027), ('xu', 0.027), ('derandomized', 0.026), ('zst', 0.026), ('relaxations', 0.026), ('integral', 0.026), ('inner', 0.024), ('iterations', 0.024), ('corrections', 0.024), ('zs', 0.024), ('constraint', 0.023), ('optimum', 0.023), ('xd', 0.023), ('chekuri', 0.022), ('psucc', 0.022), ('legendre', 0.022), ('vd', 0.022), ('vi', 0.022), ('iteration', 0.021), ('bertsekas', 0.021), ('projection', 0.021), ('hs', 0.021), ('marginalization', 0.021), ('constraints', 0.021), ('feldman', 0.02), ('loop', 0.02), ('deterministic', 0.02), ('convergent', 0.02), ('ascent', 0.02), ('arg', 0.02), ('neighborhood', 0.019), ('structured', 0.019), ('divergences', 0.019), ('quadratic', 0.019), ('tree', 0.019), ('komodakis', 0.019), ('reparameterization', 0.019), ('polytope', 0.019), ('energy', 0.019), ('iterates', 0.018), ('vertex', 0.018), ('convex', 0.018), ('andomized', 0.018), ('bayati', 0.018), ('ent', 0.018), ('hst', 0.018), ('iusem', 0.018), ('koval', 0.018), ('qua', 0.018), ('convergence', 0.017), ('log', 0.017), ('positivity', 0.017), ('int', 0.017), ('globerson', 0.017), ('con', 0.016), ('maximizers', 0.016), ('trees', 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 76 jmlr-2010-Message-passing for Graph-structured Linear Programs: Proximal Methods and Rounding Schemes
Author: Pradeep Ravikumar, Alekh Agarwal, Martin J. Wainwright
Abstract: The problem of computing a maximum a posteriori (MAP) configuration is a central computational challenge associated with Markov random fields. There has been some focus on “tree-based” linear programming (LP) relaxations for the MAP problem. This paper develops a family of super-linearly convergent algorithms for solving these LPs, based on proximal minimization schemes using Bregman divergences. As with standard message-passing on graphs, the algorithms are distributed and exploit the underlying graphical structure, and so scale well to large problems. Our algorithms have a double-loop character, with the outer loop corresponding to the proximal sequence, and an inner loop of cyclic Bregman projections used to compute each proximal update. We establish convergence guarantees for our algorithms, and illustrate their performance via some simulations. We also develop two classes of rounding schemes, deterministic and randomized, for obtaining integral configurations from the LP solutions. Our deterministic rounding schemes use a “re-parameterization” property of our algorithms so that when the LP solution is integral, the MAP solution can be obtained even before the LP-solver converges to the optimum. We also propose graph-structured randomized rounding schemes applicable to iterative LP-solving algorithms in general. We analyze the performance of and report simulations comparing these rounding schemes. Keywords: graphical models, MAP Estimation, LP relaxation, proximal minimization, rounding schemes
Author: Dapo Omidiran, Martin J. Wainwright
Abstract: We consider the problem of high-dimensional variable selection: given n noisy observations of a k-sparse vector β∗ ∈ R p , estimate the subset of non-zero entries of β∗ . A significant body of work has studied behavior of ℓ1 -relaxations when applied to random measurement matrices that are dense (e.g., Gaussian, Bernoulli). In this paper, we analyze sparsified measurement ensembles, and consider the trade-off between measurement sparsity, as measured by the fraction γ of nonzero entries, and the statistical efficiency, as measured by the minimal number of observations n required for correct variable selection with probability converging to one. Our main result is to prove that it is possible to let the fraction on non-zero entries γ → 0 at some rate, yielding measurement matrices with a vanishing fraction of non-zeros per row, while retaining the same statistical efficiency as dense ensembles. A variety of simulation results confirm the sharpness of our theoretical predictions. Keywords: variable selection, sparse random projections, high-dimensional statistics, Lasso, consistency, ℓ1 -regularization
3 0.12009002 80 jmlr-2010-On-Line Sequential Bin Packing
Author: András György, Gábor Lugosi, György Ottucsàk
Abstract: We consider a sequential version of the classical bin packing problem in which items are received one by one. Before the size of the next item is revealed, the decision maker needs to decide whether the next item is packed in the currently open bin or the bin is closed and a new bin is opened. If the new item does not fit, it is lost. If a bin is closed, the remaining free space in the bin accounts for a loss. The goal of the decision maker is to minimize the loss accumulated over n periods. We present an algorithm that has a cumulative loss not much larger than any strategy in a finite class of reference strategies for any sequence of items. Special attention is payed to reference strategies that use a fixed threshold at each step to decide whether a new bin is opened. Some positive and negative results are presented for this case. Keywords: bin packing, on-line learning, prediction with expert advice
4 0.061931189 79 jmlr-2010-Near-optimal Regret Bounds for Reinforcement Learning
Author: Thomas Jaksch, Ronald Ortner, Peter Auer
Abstract: For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s, s′ there is a policy which moves from s to s′ in at most D steps (on average). √ ˜ We present a reinforcement learning algorithm with total regret O(DS AT ) after T steps for any unknown MDP with S states, A actions per state, and diameter D. A corresponding lower bound of √ Ω( DSAT ) on the total regret of any learning algorithm is given as well. These results are complemented by a sample complexity bound on the number of suboptimal steps taken by our algorithm. This bound can be used to achieve a (gap-dependent) regret bound that is logarithmic in T . Finally, we also consider a setting where the MDP is allowed to change a fixed number of ℓ times. We present a modification of our algorithm that is able to deal with this setting and show a √ ˜ regret bound of O(ℓ1/3 T 2/3 DS A). Keywords: undiscounted reinforcement learning, Markov decision process, regret, online learning, sample complexity
5 0.056782257 31 jmlr-2010-Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization
Author: Lin Xiao
Abstract: We consider regularized stochastic learning and online optimization problems, where the objective function is the sum of two convex terms: one is the loss function of the learning task, and the other is a simple regularization term such as ℓ1 -norm for promoting sparsity. We develop extensions of Nesterov’s dual averaging method, that can exploit the regularization structure in an online setting. At each iteration of these methods, the learning variables are adjusted by solving a simple minimization problem that involves the running average of all past subgradients of the loss function and the whole regularization term, not just its subgradient. In the case of ℓ1 -regularization, our method is particularly effective in obtaining sparse solutions. We show that these methods achieve the optimal convergence rates or regret bounds that are standard in the literature on stochastic and online convex optimization. For stochastic learning problems in which the loss functions have Lipschitz continuous gradients, we also present an accelerated version of the dual averaging method. Keywords: stochastic learning, online optimization, ℓ1 -regularization, structural convex optimization, dual averaging methods, accelerated gradient methods
6 0.055648439 98 jmlr-2010-Regularized Discriminant Analysis, Ridge Regression and Beyond
7 0.050799098 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM
8 0.04794405 32 jmlr-2010-Efficient Algorithms for Conditional Independence Inference
9 0.047275528 41 jmlr-2010-Gaussian Processes for Machine Learning (GPML) Toolbox
10 0.038588058 52 jmlr-2010-Incremental Sigmoid Belief Networks for Grammar Learning
11 0.037990101 4 jmlr-2010-A Generalized Path Integral Control Approach to Reinforcement Learning
12 0.037451588 51 jmlr-2010-Importance Sampling for Continuous Time Bayesian Networks
13 0.035812806 66 jmlr-2010-Linear Algorithms for Online Multitask Classification
14 0.035721082 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization
15 0.035581809 24 jmlr-2010-Collective Inference for Extraction MRFs Coupled with Symmetric Clique Potentials
16 0.034669016 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs
17 0.034335051 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning
18 0.034312263 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models
19 0.033861164 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization
20 0.030282371 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
topicId topicWeight
[(0, -0.153), (1, -0.099), (2, 0.077), (3, -0.121), (4, 0.011), (5, -0.11), (6, -0.019), (7, -0.087), (8, -0.01), (9, 0.201), (10, 0.238), (11, 0.01), (12, 0.098), (13, 0.188), (14, -0.259), (15, 0.011), (16, 0.12), (17, 0.19), (18, 0.188), (19, 0.136), (20, 0.031), (21, 0.255), (22, 0.067), (23, 0.108), (24, -0.007), (25, -0.015), (26, 0.027), (27, -0.074), (28, -0.078), (29, 0.09), (30, 0.067), (31, 0.142), (32, 0.204), (33, 0.009), (34, 0.189), (35, -0.036), (36, -0.067), (37, 0.069), (38, -0.145), (39, -0.045), (40, 0.015), (41, -0.057), (42, -0.031), (43, 0.142), (44, -0.04), (45, -0.007), (46, 0.093), (47, -0.007), (48, -0.014), (49, 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.98774838 76 jmlr-2010-Message-passing for Graph-structured Linear Programs: Proximal Methods and Rounding Schemes
Author: Pradeep Ravikumar, Alekh Agarwal, Martin J. Wainwright
Abstract: The problem of computing a maximum a posteriori (MAP) configuration is a central computational challenge associated with Markov random fields. There has been some focus on “tree-based” linear programming (LP) relaxations for the MAP problem. This paper develops a family of super-linearly convergent algorithms for solving these LPs, based on proximal minimization schemes using Bregman divergences. As with standard message-passing on graphs, the algorithms are distributed and exploit the underlying graphical structure, and so scale well to large problems. Our algorithms have a double-loop character, with the outer loop corresponding to the proximal sequence, and an inner loop of cyclic Bregman projections used to compute each proximal update. We establish convergence guarantees for our algorithms, and illustrate their performance via some simulations. We also develop two classes of rounding schemes, deterministic and randomized, for obtaining integral configurations from the LP solutions. Our deterministic rounding schemes use a “re-parameterization” property of our algorithms so that when the LP solution is integral, the MAP solution can be obtained even before the LP-solver converges to the optimum. We also propose graph-structured randomized rounding schemes applicable to iterative LP-solving algorithms in general. We analyze the performance of and report simulations comparing these rounding schemes. Keywords: graphical models, MAP Estimation, LP relaxation, proximal minimization, rounding schemes
Author: Dapo Omidiran, Martin J. Wainwright
Abstract: We consider the problem of high-dimensional variable selection: given n noisy observations of a k-sparse vector β∗ ∈ R p , estimate the subset of non-zero entries of β∗ . A significant body of work has studied behavior of ℓ1 -relaxations when applied to random measurement matrices that are dense (e.g., Gaussian, Bernoulli). In this paper, we analyze sparsified measurement ensembles, and consider the trade-off between measurement sparsity, as measured by the fraction γ of nonzero entries, and the statistical efficiency, as measured by the minimal number of observations n required for correct variable selection with probability converging to one. Our main result is to prove that it is possible to let the fraction on non-zero entries γ → 0 at some rate, yielding measurement matrices with a vanishing fraction of non-zeros per row, while retaining the same statistical efficiency as dense ensembles. A variety of simulation results confirm the sharpness of our theoretical predictions. Keywords: variable selection, sparse random projections, high-dimensional statistics, Lasso, consistency, ℓ1 -regularization
3 0.47403896 80 jmlr-2010-On-Line Sequential Bin Packing
Author: András György, Gábor Lugosi, György Ottucsàk
Abstract: We consider a sequential version of the classical bin packing problem in which items are received one by one. Before the size of the next item is revealed, the decision maker needs to decide whether the next item is packed in the currently open bin or the bin is closed and a new bin is opened. If the new item does not fit, it is lost. If a bin is closed, the remaining free space in the bin accounts for a loss. The goal of the decision maker is to minimize the loss accumulated over n periods. We present an algorithm that has a cumulative loss not much larger than any strategy in a finite class of reference strategies for any sequence of items. Special attention is payed to reference strategies that use a fixed threshold at each step to decide whether a new bin is opened. Some positive and negative results are presented for this case. Keywords: bin packing, on-line learning, prediction with expert advice
4 0.24349923 98 jmlr-2010-Regularized Discriminant Analysis, Ridge Regression and Beyond
Author: Zhihua Zhang, Guang Dai, Congfu Xu, Michael I. Jordan
Abstract: Fisher linear discriminant analysis (FDA) and its kernel extension—kernel discriminant analysis (KDA)—are well known methods that consider dimensionality reduction and classification jointly. While widely deployed in practical problems, there are still unresolved issues surrounding their efficient implementation and their relationship with least mean squares procedures. In this paper we address these issues within the framework of regularized estimation. Our approach leads to a flexible and efficient implementation of FDA as well as KDA. We also uncover a general relationship between regularized discriminant analysis and ridge regression. This relationship yields variations on conventional FDA based on the pseudoinverse and a direct equivalence to an ordinary least squares estimator. Keywords: Fisher discriminant analysis, reproducing kernel, generalized eigenproblems, ridge regression, singular value decomposition, eigenvalue decomposition
5 0.21814644 32 jmlr-2010-Efficient Algorithms for Conditional Independence Inference
Author: Remco Bouckaert, Raymond Hemmecke, Silvia Lindner, Milan Studený
Abstract: The topic of the paper is computer testing of (probabilistic) conditional independence (CI) implications by an algebraic method of structural imsets. The basic idea is to transform (sets of) CI statements into certain integral vectors and to verify by a computer the corresponding algebraic relation between the vectors, called the independence implication. We interpret the previous methods for computer testing of this implication from the point of view of polyhedral geometry. However, the main contribution of the paper is a new method, based on linear programming (LP). The new method overcomes the limitation of former methods to the number of involved variables. We recall/describe the theoretical basis for all four methods involved in our computational experiments, whose aim was to compare the efficiency of the algorithms. The experiments show that the LP method is clearly the fastest one. As an example of possible application of such algorithms we show that testing inclusion of Bayesian network structures or whether a CI statement is encoded in an acyclic directed graph can be done by the algebraic method. Keywords: conditional independence inference, linear programming approach
6 0.21166231 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs
7 0.20015177 79 jmlr-2010-Near-optimal Regret Bounds for Reinforcement Learning
8 0.15947005 52 jmlr-2010-Incremental Sigmoid Belief Networks for Grammar Learning
9 0.15629528 4 jmlr-2010-A Generalized Path Integral Control Approach to Reinforcement Learning
10 0.15233825 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM
11 0.14992577 109 jmlr-2010-Stochastic Composite Likelihood
12 0.14931554 31 jmlr-2010-Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization
13 0.14534183 72 jmlr-2010-Matrix Completion from Noisy Entries
14 0.13918839 14 jmlr-2010-Approximate Riemannian Conjugate Gradient Learning for Fixed-Form Variational Bayes
15 0.13494322 3 jmlr-2010-A Fast Hybrid Algorithm for Large-Scalel1-Regularized Logistic Regression
16 0.13272683 63 jmlr-2010-Learning Instance-Specific Predictive Models
17 0.13060737 13 jmlr-2010-Approximate Inference on Planar Graphs using Loop Calculus and Belief Propagation
18 0.12791963 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization
19 0.12509334 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning
20 0.12463568 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring
topicId topicWeight
[(3, 0.014), (4, 0.019), (8, 0.043), (15, 0.016), (21, 0.023), (32, 0.075), (36, 0.042), (37, 0.051), (63, 0.011), (73, 0.335), (75, 0.143), (81, 0.012), (85, 0.049), (96, 0.039)]
simIndex simValue paperId paperTitle
1 0.75295258 106 jmlr-2010-Stability Bounds for Stationary φ-mixing and β-mixing Processes
Author: Mehryar Mohri, Afshin Rostamizadeh
Abstract: Most generalization bounds in learning theory are based on some measure of the complexity of the hypothesis class used, independently of any algorithm. In contrast, the notion of algorithmic stability can be used to derive tight generalization bounds that are tailored to specific learning algorithms by exploiting their particular properties. However, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed. In many machine learning applications, however, this assumption does not hold. The observations received by the learning algorithm often have some inherent temporal dependence. This paper studies the scenario where the observations are drawn from a stationary ϕ-mixing or β-mixing sequence, a widely adopted assumption in the study of non-i.i.d. processes that implies a dependence between observations weakening over time. We prove novel and distinct stability-based generalization bounds for stationary ϕ-mixing and β-mixing sequences. These bounds strictly generalize the bounds given in the i.i.d. case and apply to all stable learning algorithms, thereby extending the use of stability-bounds to non-i.i.d. scenarios. We also illustrate the application of our ϕ-mixing generalization bounds to general classes of learning algorithms, including Support Vector Regression, Kernel Ridge Regression, and Support Vector Machines, and many other kernel regularization-based and relative entropy-based regularization algorithms. These novel bounds can thus be viewed as the first theoretical basis for the use of these algorithms in non-i.i.d. scenarios. Keywords: learning in non-i.i.d. scenarios, weakly dependent observations, mixing distributions, algorithmic stability, generalization bounds, learning theory
same-paper 2 0.73952049 76 jmlr-2010-Message-passing for Graph-structured Linear Programs: Proximal Methods and Rounding Schemes
Author: Pradeep Ravikumar, Alekh Agarwal, Martin J. Wainwright
Abstract: The problem of computing a maximum a posteriori (MAP) configuration is a central computational challenge associated with Markov random fields. There has been some focus on “tree-based” linear programming (LP) relaxations for the MAP problem. This paper develops a family of super-linearly convergent algorithms for solving these LPs, based on proximal minimization schemes using Bregman divergences. As with standard message-passing on graphs, the algorithms are distributed and exploit the underlying graphical structure, and so scale well to large problems. Our algorithms have a double-loop character, with the outer loop corresponding to the proximal sequence, and an inner loop of cyclic Bregman projections used to compute each proximal update. We establish convergence guarantees for our algorithms, and illustrate their performance via some simulations. We also develop two classes of rounding schemes, deterministic and randomized, for obtaining integral configurations from the LP solutions. Our deterministic rounding schemes use a “re-parameterization” property of our algorithms so that when the LP solution is integral, the MAP solution can be obtained even before the LP-solver converges to the optimum. We also propose graph-structured randomized rounding schemes applicable to iterative LP-solving algorithms in general. We analyze the performance of and report simulations comparing these rounding schemes. Keywords: graphical models, MAP Estimation, LP relaxation, proximal minimization, rounding schemes
3 0.46877369 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
Author: Ming Yuan
Abstract: This paper considers the problem of estimating a high dimensional inverse covariance matrix that can be well approximated by “sparse” matrices. Taking advantage of the connection between multivariate linear regression and entries of the inverse covariance matrix, we propose an estimating procedure that can effectively exploit such “sparsity”. The proposed method can be computed using linear programming and therefore has the potential to be used in very high dimensional problems. Oracle inequalities are established for the estimation error in terms of several operator norms, showing that the method is adaptive to different types of sparsity of the problem. Keywords: covariance selection, Dantzig selector, Gaussian graphical model, inverse covariance matrix, Lasso, linear programming, oracle inequality, sparsity
4 0.46804529 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
Author: Yevgeny Seldin, Naftali Tishby
Abstract: We derive PAC-Bayesian generalization bounds for supervised and unsupervised learning models based on clustering, such as co-clustering, matrix tri-factorization, graphical models, graph clustering, and pairwise clustering.1 We begin with the analysis of co-clustering, which is a widely used approach to the analysis of data matrices. We distinguish among two tasks in matrix data analysis: discriminative prediction of the missing entries in data matrices and estimation of the joint probability distribution of row and column variables in co-occurrence matrices. We derive PAC-Bayesian generalization bounds for the expected out-of-sample performance of co-clustering-based solutions for these two tasks. The analysis yields regularization terms that were absent in the previous formulations of co-clustering. The bounds suggest that the expected performance of co-clustering is governed by a trade-off between its empirical performance and the mutual information preserved by the cluster variables on row and column IDs. We derive an iterative projection algorithm for finding a local optimum of this trade-off for discriminative prediction tasks. This algorithm achieved stateof-the-art performance in the MovieLens collaborative filtering task. Our co-clustering model can also be seen as matrix tri-factorization and the results provide generalization bounds, regularization terms, and new algorithms for this form of matrix factorization. The analysis of co-clustering is extended to tree-shaped graphical models, which can be used to analyze high dimensional tensors. According to the bounds, the generalization abilities of treeshaped graphical models depend on a trade-off between their empirical data fit and the mutual information that is propagated up the tree levels. We also formulate weighted graph clustering as a prediction problem: given a subset of edge weights we analyze the ability of graph clustering to predict the remaining edge weights. The analysis of co-clustering easily
5 0.46153784 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
Author: Pinar Donmez, Guy Lebanon, Krishnakumar Balasubramanian
Abstract: Estimating the error rates of classifiers or regression models is a fundamental task in machine learning which has thus far been studied exclusively using supervised learning techniques. We propose a novel unsupervised framework for estimating these error rates using only unlabeled data and mild assumptions. We prove consistency results for the framework and demonstrate its practical applicability on both synthetic and real world data. Keywords: classification and regression, maximum likelihood, latent variable models
6 0.4608196 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
7 0.45970675 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
8 0.45953694 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
9 0.45953187 33 jmlr-2010-Efficient Heuristics for Discriminative Structure Learning of Bayesian Network Classifiers
10 0.45922261 109 jmlr-2010-Stochastic Composite Likelihood
11 0.45775011 63 jmlr-2010-Learning Instance-Specific Predictive Models
12 0.45559385 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models
13 0.45400661 17 jmlr-2010-Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing
14 0.45342952 43 jmlr-2010-Generalized Power Method for Sparse Principal Component Analysis
15 0.45336026 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values
16 0.45324537 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding
17 0.45171627 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices
18 0.4508574 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking
19 0.44998813 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
20 0.44841054 24 jmlr-2010-Collective Inference for Extraction MRFs Coupled with Symmetric Clique Potentials