nips nips2012 nips2012-44 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Joachim Giesen, Jens Mueller, Soeren Laue, Sascha Swiercy
Abstract: We consider an abstract class of optimization problems that are parameterized concavely in a single parameter, and show that the solution path along the parameter can always be approximated with accuracy ε > 0 by a set of size O(1/√ε). A lower bound of size Ω(1/√ε) shows that the upper bound is tight up to a constant factor. We also devise an algorithm that calls a step-size oracle and computes an approximate path of size O(1/√ε). Finally, we provide an implementation of the oracle for soft-margin support vector machines, and a parameterized semi-definite program for matrix completion. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Approximating Concavely Parameterized Optimization Problems Sören Laue Friedrich-Schiller-Universität Jena Germany soeren. [sent-1, score-0.08]
2 com Abstract We consider an abstract class of optimization problems that are parameterized concavely in a single parameter, and show that the solution path along the parameter can always be approximated with accuracy ε > 0 by a set of size O(1/√ε). [sent-10, score-0.536]
3 A lower bound of size Ω(1/√ε) shows that the upper bound is tight up to a constant factor. [sent-11, score-0.15]
4 We also devise an algorithm that calls a step-size oracle and computes an approximate path of size O(1/√ε). [sent-12, score-0.522]
5 Finally, we provide an implementation of the oracle for soft-margin support vector machines, and a parameterized semi-definite program for matrix completion. [sent-13, score-0.189]
6 Let D be a set, I ⊆ R an interval, and f : I × D → R such that (1) f (t, ·) is bounded from below for every t ∈ I, and (2) f (·, x) is concave for every x ∈ D. [sent-15, score-0.079]
7 We study the parameterized optimization problem h(t) = min_{x∈D} f(t, x). [sent-16, score-0.074]
8 A solution x∗_t ∈ D is called optimal at parameter value t if f(t, x∗_t) = h(t), and x ∈ D is called an ε-approximation at t if ε(t, x) := f(t, x) − h(t) ≤ ε. [sent-17, score-0.028]
9 The size of a smallest ε-approximation path is called the ε-path complexity of the parameterized optimization problem. [sent-20, score-0.52]
10 The aim of this paper is to derive upper and lower bounds on the path complexity, and to provide efficient algorithms to compute ε-paths. [sent-21, score-0.414]
11 The rather abstract problem from above is motivated by regularized optimization problems that are abundant in machine learning, i. [sent-23, score-0.029]
12 , by problems of the form min_{x∈D} f(t, x) := r(x) + t · l(x), where r(x) is a regularization term and l(x) a loss term. [sent-25, score-0.034]
13 The parameter t controls the trade-off between regularization and loss. [sent-26, score-0.051]
14 Note that here f (·, x) is always linear and hence concave in the parameter t. [sent-27, score-0.128]
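To make these definitions concrete, here is a minimal sketch of evaluating the approximation error ε(t, x) = f(t, x) − h(t) for a regularized objective of the form f(t, x) = r(x) + t · l(x); the particular choices r(x) = x² and l(x) = |x − 1| are illustrative assumptions, not taken from the paper.

```python
from scipy.optimize import minimize_scalar

# Illustrative one-dimensional regularized problem: r(x) = x^2, l(x) = |x - 1|.
# f(t, .) is convex in x and linear (hence concave) in t.
def f(t, x):
    return x**2 + t * abs(x - 1.0)

def h(t):
    # h(t) = min_x f(t, x); a bounded bracket suffices for this toy example.
    return minimize_scalar(lambda x: f(t, x), bounds=(-10.0, 10.0), method="bounded").fun

def eps(t, x):
    # Approximation error of a candidate solution x at parameter value t.
    return f(t, x) - h(t)

t, x_candidate = 0.5, 0.4
print(eps(t, x_candidate) <= 0.05)  # True iff x_candidate is a 0.05-approximation at t
```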
15 Due to the widespread use of regularized optimization methods in machine learning, regularization path following algorithms have become an active area of research. [sent-29, score-0.479]
16 Initially, exact path tracking methods have been developed for many machine learning problems [16, 18, 3, 9] starting with the algorithm for SVMs by Hastie et al. [sent-30, score-0.375]
17 Also, the exact regularization path can be exponentially large in the input size [5, 14]. [sent-33, score-0.426]
18 Approximation path algorithms with approximation guarantees have been developed for SVMs with square loss [6], the LASSO [14], and matrix completion and factorization problems [8, 7]. [sent-35, score-0.483]
19 We provide a structural upper bound in O(1/√ε) for the ε-path complexity for the abstract problem class described above. [sent-37, score-0.145]
20 We show that this bound is tight up to a multiplicative constant by constructing a lower bound in Ω(1/√ε). [sent-38, score-0.111]
21 Finally, we devise a generic algorithm to compute ε-paths that calls a problem-specific oracle providing a step-size certificate. [sent-39, score-0.208]
22 If such a certificate exists, then the algorithm computes a path of complexity in O(1/√ε). [sent-40, score-0.446]
23 Finally, we demonstrate the implementation of the oracle for standard SVMs and a matrix completion problem. [sent-41, score-0.167]
24 This results in the first algorithms for both problems that compute ε-paths of complexity in O(1/√ε). [sent-42, score-0.071]
25 Previously, no approximation path algorithms had been known for standard SVMs, only a heuristic [12] and an approximation algorithm for square-loss SVMs [6] with complexity in O(1/ε). [sent-43, score-0.498]
26 The best approximation path algorithm for matrix completion also has complexity in O(1/ε). [sent-44, score-0.554]
27 To our knowledge, the only known approximation path algorithm with complexity in O(1/√ε) is [14] for the LASSO. [sent-45, score-0.472]
28 2 Upper Bound Here we show that any problem that fits the problem definition from the introduction for a compact interval I = [a, b] has an ε-path with complexity in O(1/√ε). [sent-46, score-0.155]
29 Let (a, b) be the interior of [a, b] and let g : (a, b) → R be concave, then g is continuous and has a left- and right derivative g− (t) and g+ (t), respectively, at every point t ∈ I (see for example [15]). [sent-47, score-0.067]
30 Note that f (·, x) is concave by assumption and h is concave as the minimum over a family of concave functions. [sent-48, score-0.237]
31 For all t′ < t it holds that h(t′) ≤ f(t′, x∗_t) and hence h(t) − h(t′) ≥ f(t, x∗_t) − f(t′, x∗_t), which implies h−(t) := lim_{t′↑t} (h(t) − h(t′))/(t − t′) ≥ lim_{t′↑t} (f(t, x∗_t) − f(t′, x∗_t))/(t − t′) =: f−(t, x∗_t). [sent-52, score-0.151]
32 The inequality f+(t, x∗_t) ≥ h+(t) follows analogously, and f−(t, x∗_t) ≥ f+(t, x∗_t) follows after some algebra from the concavity of f(·, x∗_t) and the definition of the derivatives (see [15]). [sent-53, score-0.182]
33 Let T_k = { t | t ∈ (t_{k−1}, b] such that ε(t, x∗_{t_{k−1}}) := f(t, x∗_{t_{k−1}}) − h(t) = ε }, and t_k = min T_k for all integral k > 0 such that T_k ≠ ∅. [sent-56, score-0.544]
34 Assume the claim holds for n − 1 and let a = s1 + . [sent-71, score-0.101]
35 The rectangle with side lengths a·s_n^{−1} and b·s_n has circumference 2(a·s_n^{−1} + b·s_n) and area a·s_n^{−1} · b·s_n = ab. [sent-78, score-0.527]
36 Since the square minimizes the circumference for a given area we have 2(a·s_n^{−1} + b·s_n) ≥ 4√(ab). [sent-79, score-0.231]
37 The claim for n now follows from (a + s_n)(b + s_n^{−1}) = ab + a·s_n^{−1} + b·s_n + 1 ≥ ab + 2√(ab) + 1 = (√(ab) + 1)² ≥ ((n − 1) + 1)² = n². [sent-80, score-0.612]
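The counting argument above rests on the elementary inequality (s_1 + · · · + s_n)(s_1^{−1} + · · · + s_n^{−1}) ≥ n²; the following quick numerical sanity check is my own illustration, not code from the paper.

```python
import random

# Check (s_1 + ... + s_n)(1/s_1 + ... + 1/s_n) >= n^2 on random positive inputs.
random.seed(0)
for _ in range(1000):
    n = random.randint(1, 20)
    s = [random.uniform(0.01, 10.0) for _ in range(n)]
    assert sum(s) * sum(1.0 / si for si in s) >= n**2 - 1e-9
print("inequality held on all random trials")
```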
38 Define δk = tk+1 − tk and ∆k = h− (tk ) − h− (tk+1 ). [sent-88, score-0.51]
39 Thus, there exists s_k > 0 such that δ_k ≥ ε·s_k and ∆_k ≥ s_k^{−1}. [sent-90, score-0.031]
40 + ∆_n) ≤ (b − a)(h−(a) − h−(b)), where the last inequality follows from h−(b) ≤ h−(t) for t ≤ b (which can be proved from concavity, see again [15]). [sent-106, score-0.084]
41 The claim of the theorem follows since the proof of Lemma 4 shows that the sequence (t_k) is finite and hence the intervals [t_k, t_{k+1}] cover the whole interval [a, b]. [sent-113, score-0.157]
42 3 Lower Bound Here we show that there exists a problem that fits the problem description from the introduction whose ε-path complexity is in Ω(1/√ε). [sent-114, score-0.138]
43 This shows that the upper bound from the previous section is tight up to a constant. [sent-115, score-0.115]
44 Let I = [a, b], D = R, f(t, x) = (1/2)x² − tx and thus h(t) = min_{x∈R} { (1/2)x² − tx } = (1/2)(x∗_t)² − t·x∗_t = −(1/2)t², where the last equality follows from the convexity and differentiability of f(t, x) in x, which together imply ∂f(t, x∗_t)/∂x = x∗_t − t = 0. [sent-116, score-0.219]
45 For ε > 0 and x ∈ R let I_x = { t ∈ [a, b] | ε(t, x) := (1/2)x² − tx + (1/2)t² ≤ ε }, which is an interval since (1/2)x² − tx + (1/2)t² is a quadratic function in t. [sent-117, score-0.289]
46 The length of this interval is 2√(2ε), independent of x. [sent-118, score-0.084]
47 Hence, the ε-path complexity for the problem is at least (b − a)/(2√(2ε)). [sent-119, score-0.071]
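For this one-dimensional example an ε-path can also be written down explicitly: x is ε-optimal exactly on an interval of length 2√(2ε) centred at t = x, so a grid of solutions with that spacing covers [a, b]. The sketch below is my own illustration of this construction, not the paper's code.

```python
import math

def toy_eps_path(a, b, eps):
    # For f(t, x) = x^2/2 - t*x we have eps(t, x) = (x - t)^2 / 2, so x is
    # eps-optimal exactly on [x - sqrt(2*eps), x + sqrt(2*eps)].
    step = 2.0 * math.sqrt(2.0 * eps)
    path, t = [], a + step / 2.0
    while t - step / 2.0 < b:
        path.append(t)   # the solution stored for this grid point is x = t
        t += step
    return path

for eps in (1e-1, 1e-2, 1e-3, 1e-4):
    print(eps, len(toy_eps_path(0.0, 10.0, eps)))  # size grows like 1/sqrt(eps)
```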
48 Let us compare this lower bound with the upper bound from the previous section, which for this problem gives √((b − a)(h−(a) − h−(b))/ε) = (b − a)/√ε; hence the upper bound is tight up to a constant of at most 2√2. [sent-120, score-0.150]
49 4 Generic Algorithm So far we have only discussed structural complexity bounds for ε-paths. [sent-122, score-0.110]
50 Now we give a generic algorithm to compute an ε-path of complexity in O(1/√ε). [sent-123, score-0.132]
51 When applying the generic algorithm to a specific problem, a plugin-subroutine PATHPOLYNOMIAL needs to be implemented for that problem. [sent-124, score-0.061]
52 The generic algorithm builds on the simple idea that has been introduced in [6] to compute an (ε/γ)-approximation (for γ > 1) and only update this approximation along the parameter interval I = [a, b] when it fails to be an ε-approximation. [sent-125, score-0.171]
53 The plugin-subroutine PATHPOLYNOMIAL provides a bound on the step-size for the algorithm, i. [sent-126, score-0.035]
54 , a certificate for how long the approximation is valid along the interval I. [sent-128, score-0.11]
55 4.1 Step-size certificate and algorithm We always consider a problem that fits the problem description from the introduction. [sent-131, score-0.036]
56 Let P be the set of all concave polynomials p : I → R of degree at most 2. [sent-133, score-0.13]
57 Note that P contains constant and linear polynomials with second derivative p″ = 0 and quadratic polynomials with constant second derivative p″ < 0. [sent-135, score-0.215]
58 If P_t(x, ε) ≠ ∅, then x is an ε-approximation at parameter value t, because there exists p ∈ P such that ε(t, x) ≤ f(t, x) − p(t) ≤ ε. [sent-136, score-0.031]
59 If there exists p ∈ P_t(x, ε/γ), then x is an ε-approximation for all t′ ∈ [t, b] with t′ ≤ t + ∆_t(p). [sent-144, score-0.031]
60 (ii) If p″ < 0, then g(t′) − p(t′) is a quadratic polynomial in t′ with second derivative −p″ > 0, and the equation g(t′) − p(t′) ≤ ε solves to t′ − t ≤ ∆_t^{(2)}(p). [sent-151, score-0.070]
61 Let p̂ be the restriction of p onto the interval [t̂, b] and δ_t̂ = t̂ − a; then p̂″ = p″, and thus ρ_t̂(p̂) = ε/(γ δ_t̂² |p̂″|) = 1. [sent-155, score-0.084]
62 Assume now that we have an oracle PATHPOLYNOMIAL available that on input t ∈ (a, b) and ε/γ > 0 returns x ∈ D and p ∈ P_t(x, ε/γ); then the following algorithm GENERICPATH returns an ε-path if it terminates. [sent-158, score-0.139]
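A minimal sketch of this scheme in Python (my own rendering, not the paper's pseudocode): the oracle interface is simplified so that it returns a solution together with a step size for which that solution is certified to stay ε-optimal, which stands in for the polynomial certificate p and the step bound ∆_t(p); the names `oracle` and `toy_oracle` are placeholders.

```python
def generic_path(oracle, a, b, eps, gamma=2.0):
    """Compute an eps-path on [a, b].

    oracle(t, eps / gamma) must return (x, step) such that x is an
    (eps/gamma)-approximation at t and remains an eps-approximation on
    [t, t + step]; this is a simplified stand-in for PATHPOLYNOMIAL.
    """
    assert gamma > 1.0
    path, t = [], a
    while t < b:
        x, step = oracle(t, eps / gamma)
        path.append((t, x))
        t += max(step, 1e-12)   # guard against a degenerate zero step size
    return path

# Usage with the toy problem f(t, x) = x^2/2 - t*x: x = t is optimal at t and
# stays (eps/gamma)-optimal while |t' - t| <= sqrt(2*eps/gamma).
def toy_oracle(t, eps_over_gamma):
    return t, (2.0 * eps_over_gamma) ** 0.5

print(len(generic_path(toy_oracle, 0.0, 10.0, eps=1e-4)))
```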
63 4.2 Analysis of the generic algorithm The running time of the algorithm GENERICPATH is essentially determined by the complexity of the computed path times the cost of the oracle PATHPOLYNOMIAL. [sent-160, score-0.592]
64 In the following we show that the complexity of the computed path is at most O(1/√ε). [sent-161, score-0.446]
65 Thus, φ_c′(x) > 0 for c < 0 and φ_c is monotonously increasing. [sent-167, score-0.236]
66 The continuity for |p″| > 0 follows from the definitions of ∆_t^{(2)}(p) and ∆_t^{(3)}(p), and from Observation 8. [sent-172, score-0.074]
67 Since ρ_t(p) > 1/2 for small |p″|, the continuity at |p″| = 0 follows from Observation 10, because lim_{|p″|→0} ∆_t^{(2)}(p) = lim_{|p″|→0} [ φ_{δ_t² γ(1−γ)}(δ_t · (ρ_t(p) + γ − 1/2)) + δ_t(γ − 1) ] = δ_t(γ − 1) = ∆_t^{(1)}(p), where we have used ρ_t(p) → ∞ as |p″| → 0. [sent-173, score-0.176]
68 The claim is that ∆_t(p) is monotonously decreasing in |p″|. [sent-177, score-0.354]
69 Since ∆_t is continuous in |p″| by Lemma 11, it is enough to check the monotonicity of ∆_t^{(1)}(p), ∆_t^{(2)}(p) and ∆_t^{(3)}(p). [sent-178, score-0.083]
70 The monotonicity of ∆_t^{(1)}(p) and ∆_t^{(3)}(p) follows directly from the definitions of the latter. [sent-179, score-0.031]
71 The monotonicity of ∆_t^{(2)}(p) follows from Observation 10 since we have ∆_t^{(2)}(p) = φ_{δ_t² γ(1−γ)}( δ_t (ρ_t(p) + γ − 1/2) ) + δ_t(γ − 1), and thus ∆_t^{(2)}(p) is monotonously decreasing in |p″| because δ_t² γ(1 − γ) < 0 and ρ_t(p) is monotonously decreasing in |p″|. [sent-180, score-0.668]
72 Given t ∈ I and p ∈ P, then ∆t (p) is monotonously increasing in δt and hence in t. [sent-182, score-0.285]
73 Since ∆_t(p) is continuous in δ_t by Observation 8, it is enough to check the monotonicity of ∆_t^{(1)}(p), ∆_t^{(2)}(p) and ∆_t^{(3)}(p). [sent-184, score-0.083]
74 The monotonicity of ∆_t^{(1)}(p) and ∆_t^{(3)}(p) follows directly from the definitions of the latter. [sent-185, score-0.114]
75 It remains to show the monotonicity of ∆_t^{(2)}(p) for ρ_t(p) ≥ 1/2. [sent-186, score-0.083]
76 The notation is justified because φ_c^{−1} is monotonously decreasing, and for φ_c^{−1}(y) > 0 we have ∆_t^{(2)}(p) = φ_{c1}(φ_{c2}^{−1}(δ_t)) − δ_t = φ_{c1}(φ_{c2}^{−1}(δ_t)) − φ_{c2}(φ_{c2}^{−1}(δ_t)), with c1 = 2ε/|p″| and c2 = γ^{−1}·c1. [sent-189, score-0.236]
77 Because φ_{c1} − φ_{c2} < 0 for c1 > c2, both φ_{c2} and φ_{c1} − φ_{c2} are monotonously decreasing in their respective arguments. [sent-191, score-0.277]
78 If there exists p ∈ P and ε̂ > 0 such that |q″| ≤ |p″| for all q that are returned by the oracle PATHPOLYNOMIAL on input t ∈ [a, b] and ε ≤ ε̂. [sent-194, score-0.116]
79 Then Algorithm 1 terminates after at most O(1/√ε) steps, and thus returns an ε-path of complexity in O(1/√ε). [sent-195, score-0.098]
80 Here the first inequality is due to Lemma 12 and the second inequality is due to Lemma 13. [sent-198, score-0.106]
81 Hence, the number of steps in the first call of COMPUTEPATH is upper bounded by (b − t̂)/(min{∆_t̂(p), b − t̂}) + 1. [sent-199, score-0.039]
82 Similarly, the number of steps in the second call of COMPUTEPATH is upper bounded by (t̂ − a)/(min{∆_{a+b−t̂}(p), t̂ − a}) + 1. [sent-200, score-0.039]
83 Hence, there exists ε̂ > 0 such that ρ_t̂(p, ε) < 1/2 and ∆_t̂^{(3)}(p, ε) ≤ b − t̂ for all ε < ε̂, and thus (b − t̂)/∆_t̂^{(3)}(p, ε) + 1 = √(γ|p″|/ε) · (b − t̂) + 1 ∈ O(1/√ε). [sent-203, score-0.031]
84 5 Applications Here we demonstrate on two examples that Lagrange duality can be a tool for implementing the oracle PATHPOLYNOMIAL in the generic path algorithm. [sent-205, score-0.521]
85 A support vector machine (SVM) is the following parameterized optimization problem: min_{w∈R^d, b∈R} (1/2)‖w‖² + t · Σ_{i=1}^{n} max{0, 1 − y_i(wᵀx_i + b)} =: f(t, w), parameterized in the regularization parameter t ∈ [0, ∞). [sent-212, score-0.263]
86 Algorithm 3 PATHPOLYNOMIAL SVM. Input: t ∈ (0, ∞) and ε > 0. Output: w ∈ R^d and p ∈ P_t(w, ε). Compute a primal solution w ∈ R^d and a dual solution α ∈ R^n such that f(t, w) − d(α) < ε; define p : I → R, t′ ↦ d(α·t′/t); return (w, p). Lemma 15. [sent-222, score-0.200]
87 Let α be the dual solution computed by PATHPOLYNOMIAL SVM and p be the polynomial defined in PATHPOLYNOMIAL SVM. [sent-226, score-0.094]
88 For t′ > 0 let α′ = α·t′/t; then α′ is feasible for the dual SVM at parameter value t′ since α is feasible for the dual SVM at t. [sent-231, score-0.156]
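The oracle above only needs a primal solution and a dual-feasible solution whose duality gap is below ε. A rough sketch of the gap computation for the linear soft-margin SVM under the parameterization used here (the standard C-SVM dual with C = t, so the box constraint is 0 ≤ α_i ≤ t): this is my own illustration and assumes an external solver has produced w, b and a feasible α.

```python
import numpy as np

def primal_objective(t, w, b, X, y):
    # f(t, w) = 0.5 * ||w||^2 + t * sum_i max(0, 1 - y_i * (w^T x_i + b))
    margins = y * (X @ w + b)
    return 0.5 * np.dot(w, w) + t * np.sum(np.maximum(0.0, 1.0 - margins))

def dual_objective(alpha, X, y):
    # d(alpha) = sum_i alpha_i - 0.5 * ||sum_i alpha_i y_i x_i||^2,
    # assuming alpha is dual feasible: 0 <= alpha_i <= t and sum_i alpha_i y_i = 0.
    v = X.T @ (alpha * y)
    return np.sum(alpha) - 0.5 * np.dot(v, v)

def duality_gap(t, w, b, alpha, X, y):
    # By weak duality the gap is nonnegative; a gap below eps certifies
    # that w is an eps-approximation at parameter value t.
    return primal_objective(t, w, b, X, y) - dual_objective(alpha, X, y)
```

Rescaling α to α·t′/t keeps the box constraint 0 ≤ α_i ≤ t′ and the equality constraint intact, which is the feasibility argument used for the polynomial p above.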
89 In the kernel case one has the following primal SVM (see [2]): min_{β∈R^m, b} (1/2)βᵀKβ + t · Σ_{i=1}^{n} max{0, 1 − y_i(Σ_{j=1}^{n} β_j y_j K_ij + b)} =: f(t, β). [sent-235, score-0.083]
90 Figure 1 shows the size of paths as a function of ε using double logarithmic plots. [sent-244, score-0.035]
91 A straight line plot with slope −1/2 corresponds to an empirical path complexity that follows the function 1/√ε. [sent-245, score-0.477]
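The slope of such a log-log plot can be estimated with an ordinary least-squares fit; the sketch below is generic illustration code with made-up numbers, not the experiments' actual data.

```python
import numpy as np

# Illustrative (eps, path size) pairs; replace with measured values.
eps = np.array([1e-1, 1e-2, 1e-3, 1e-4])
nodes = np.array([12, 35, 110, 345])

slope, intercept = np.polyfit(np.log10(eps), np.log10(nodes), deg=1)
print(slope)  # close to -0.5 when the path size scales like 1/sqrt(eps)
```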
92 [Figure 1: path size (# nodes) versus epsilon on double logarithmic axes, with a 1/sqrt(epsilon) reference line, for the data sets a1a, duke, fourclass scale, mushrooms and w1a.] (a) Path complexity for a linear SVM. [sent-246, score-0.539]
93 (b) Path complexity for an SVM with Gaussian kernel exp(−γ‖u − v‖²) for γ = 0.5. [sent-247, score-0.071]
94 5.2 Matrix completion Matrix completion asks for a completion X of an (n × m)-matrix Y that has been observed only at the indices in Ω ⊂ {1, . [sent-248, score-0.246]
95 [Primal variables: X ∈ R^{n×m}, A ∈ R^{n×n}, B ∈ R^{m×m}, with a constraint on the block matrix [A, X; Xᵀ, B] and a fit term summed over (i, j) ∈ Ω.] The Lagrangian dual of this convex semidefinite program is given as a maximization over Λ ∈ R^{n×m} subject to the semidefinite constraint [tI, Λ; Λᵀ, tI] ⪰ 0 and Λ_ij = 0 if (i, j) ∉ Ω. [sent-258, score-0.091]
96 Let f̂(t, X̂) for X̂ = (X, A, B) be the primal objective function at parameter value t, and d(Λ) be the dual objective function. [sent-261, score-0.146]
97 [6] Joachim Giesen, Martin Jaggi, and Sören Laue. [sent-288, score-0.08]
98 [7] Joachim Giesen, Martin Jaggi, and Sören Laue. [sent-291, score-0.08]
99 [8] Joachim Giesen, Martin Jaggi, and Sören Laue. [sent-294, score-0.08]
100 The entire regularization path for the support vector machine. [sent-301, score-0.456]
wordName wordTfidf (topN-words)
[('tk', 0.51), ('path', 0.375), ('olynomial', 0.354), ('monotonously', 0.236), ('bsn', 0.148), ('ompute', 0.148), ('certi', 0.136), ('eneric', 0.13), ('pt', 0.121), ('jaggi', 0.12), ('cate', 0.12), ('giesen', 0.104), ('jena', 0.104), ('svm', 0.101), ('lemma', 0.097), ('epsilon', 0.09), ('joachim', 0.09), ('oracle', 0.085), ('interval', 0.084), ('monotonicity', 0.083), ('completion', 0.082), ('ren', 0.08), ('concave', 0.079), ('claim', 0.077), ('tx', 0.077), ('svms', 0.077), ('parameterized', 0.074), ('complexity', 0.071), ('sn', 0.07), ('concavity', 0.067), ('dual', 0.066), ('generic', 0.061), ('martin', 0.059), ('circumference', 0.059), ('concavely', 0.059), ('fourclass', 0.059), ('semide', 0.057), ('germany', 0.057), ('ab', 0.054), ('inequality', 0.053), ('esa', 0.052), ('mushrooms', 0.052), ('lim', 0.051), ('polynomials', 0.051), ('regularization', 0.051), ('analogously', 0.05), ('hence', 0.049), ('primal', 0.049), ('ti', 0.048), ('saharon', 0.048), ('libsvm', 0.047), ('ts', 0.045), ('minw', 0.045), ('rd', 0.044), ('rosset', 0.043), ('continuity', 0.043), ('derivative', 0.043), ('tight', 0.041), ('decreasing', 0.041), ('rn', 0.04), ('upper', 0.039), ('yij', 0.037), ('description', 0.036), ('hastie', 0.035), ('paths', 0.035), ('bound', 0.035), ('min', 0.034), ('duke', 0.033), ('observation', 0.032), ('calls', 0.032), ('follows', 0.031), ('ij', 0.031), ('exists', 0.031), ('bin', 0.031), ('nitions', 0.03), ('support', 0.03), ('trevor', 0.03), ('devise', 0.03), ('regularized', 0.029), ('return', 0.029), ('solution', 0.028), ('quadratic', 0.027), ('returns', 0.027), ('nition', 0.027), ('ji', 0.026), ('proceedings', 0.026), ('lagrangian', 0.026), ('bernd', 0.026), ('deutsche', 0.026), ('ichiro', 0.026), ('plexity', 0.026), ('sulovsk', 0.026), ('wayne', 0.026), ('yue', 0.026), ('approximation', 0.026), ('convex', 0.025), ('julien', 0.024), ('holger', 0.024), ('area', 0.024), ('let', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999952 44 nips-2012-Approximating Concavely Parameterized Optimization Problems
Author: Joachim Giesen, Jens Mueller, Soeren Laue, Sascha Swiercy
Abstract: We consider an abstract class of optimization problems that are parameterized concavely in a single parameter, and show that the solution path along the parameter can always be approximated with accuracy ε > 0 by a set of size O(1/√ε). A lower bound of size Ω(1/√ε) shows that the upper bound is tight up to a constant factor. We also devise an algorithm that calls a step-size oracle and computes an approximate path of size O(1/√ε). Finally, we provide an implementation of the oracle for soft-margin support vector machines, and a parameterized semi-definite program for matrix completion. 1
2 0.14482729 64 nips-2012-Calibrated Elastic Regularization in Matrix Completion
Author: Tingni Sun, Cun-hui Zhang
Abstract: This paper concerns the problem of matrix completion, which is to estimate a matrix from observations in a small subset of indices. We propose a calibrated spectrum elastic net method with a sum of the nuclear and Frobenius penalties and develop an iterative algorithm to solve the convex minimization problem. The iterative algorithm alternates between imputing the missing entries in the incomplete matrix by the current guess and estimating the matrix by a scaled soft-thresholding singular value decomposition of the imputed matrix until the resulting matrix converges. A calibration step follows to correct the bias caused by the Frobenius penalty. Under proper coherence conditions and for suitable penalties levels, we prove that the proposed estimator achieves an error bound of nearly optimal order and in proportion to the noise level. This provides a unified analysis of the noisy and noiseless matrix completion problems. Simulation results are presented to compare our proposal with previous ones. 1
3 0.090199038 197 nips-2012-Learning with Recursive Perceptual Representations
Author: Oriol Vinyals, Yangqing Jia, Li Deng, Trevor Darrell
Abstract: Linear Support Vector Machines (SVMs) have become very popular in vision as part of state-of-the-art object recognition and other classification tasks but require high dimensional feature spaces for good performance. Deep learning methods can find more compact representations but current methods employ multilayer perceptrons that require solving a difficult, non-convex optimization problem. We propose a deep non-linear classifier whose layers are SVMs and which incorporates random projection as its core stacking element. Our method learns layers of linear SVMs recursively transforming the original data manifold through a random projection of the weak prediction computed from each layer. Our method scales as linear SVMs, does not rely on any kernel computations or nonconvex optimization, and exhibits better generalization ability than kernel-based SVMs. This is especially true when the number of training samples is smaller than the dimensionality of data, a common scenario in many real-world applications. The use of random projections is key to our method, as we show in the experiments section, in which we observe a consistent improvement over previous –often more complicated– methods on several vision and speech benchmarks. 1
4 0.084662564 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions
Author: Paul Vernaza, Drew Bagnell
Abstract: Maximum entropy (MaxEnt) modeling is a popular choice for sequence analysis in applications such as natural language processing, where the sequences are embedded in discrete, tractably-sized spaces. We consider the problem of applying MaxEnt to distributions over paths in continuous spaces of high dimensionality— a problem for which inference is generally intractable. Our main contribution is to show that this intractability can be avoided as long as the constrained features possess a certain kind of low dimensional structure. In this case, we show that the associated partition function is symmetric and that this symmetry can be exploited to compute the partition function efficiently in a compressed form. Empirical results are given showing an application of our method to learning models of high-dimensional human motion capture data. 1
5 0.07979726 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes
Author: Bruno Scherrer, Boris Lesner
Abstract: We consider infinite-horizon stationary γ-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. Using Value and Policy Iteration with some error at each iteration, it is well-known that one can compute stationary policies that are 2γ/(1−γ)²-optimal. After arguing that this guarantee is tight, we develop variations of Value and Policy Iteration for computing non-stationary policies that can be up to 2γ/(1−γ)-optimal, which constitutes a significant improvement in the usual situation when γ is close to 1. Surprisingly, this shows that the problem of “computing near-optimal non-stationary policies” is much simpler than that of “computing near-optimal stationary policies”. 1
6 0.078242749 143 nips-2012-Globally Convergent Dual MAP LP Relaxation Solvers using Fenchel-Young Margins
7 0.078218438 124 nips-2012-Factorial LDA: Sparse Multi-Dimensional Text Models
8 0.076756202 282 nips-2012-Proximal Newton-type methods for convex optimization
9 0.076460615 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning
10 0.072659865 69 nips-2012-Clustering Sparse Graphs
11 0.068935573 260 nips-2012-Online Sum-Product Computation Over Trees
12 0.067246184 138 nips-2012-Fully Bayesian inference for neural models with negative-binomial spiking
13 0.060583115 324 nips-2012-Stochastic Gradient Descent with Only One Projection
14 0.060235638 84 nips-2012-Convergence Rate Analysis of MAP Coordinate Minimization Algorithms
15 0.059996154 27 nips-2012-A quasi-Newton proximal splitting method
16 0.059550568 264 nips-2012-Optimal kernel choice for large-scale two-sample tests
17 0.055640403 164 nips-2012-Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation
18 0.055253975 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions
19 0.054063264 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs
20 0.053882524 285 nips-2012-Query Complexity of Derivative-Free Optimization
topicId topicWeight
[(0, 0.15), (1, -0.002), (2, 0.083), (3, -0.051), (4, 0.058), (5, 0.05), (6, -0.005), (7, 0.013), (8, -0.017), (9, 0.002), (10, 0.053), (11, 0.06), (12, -0.002), (13, 0.003), (14, -0.004), (15, 0.035), (16, -0.025), (17, -0.003), (18, 0.009), (19, -0.033), (20, -0.0), (21, 0.033), (22, -0.019), (23, -0.046), (24, -0.11), (25, 0.056), (26, -0.066), (27, -0.008), (28, -0.05), (29, -0.028), (30, 0.004), (31, 0.009), (32, 0.038), (33, 0.034), (34, 0.01), (35, -0.059), (36, -0.016), (37, 0.028), (38, 0.026), (39, 0.116), (40, 0.025), (41, 0.069), (42, 0.005), (43, 0.148), (44, 0.035), (45, -0.001), (46, -0.057), (47, -0.075), (48, -0.085), (49, -0.075)]
simIndex simValue paperId paperTitle
same-paper 1 0.93129426 44 nips-2012-Approximating Concavely Parameterized Optimization Problems
Author: Joachim Giesen, Jens Mueller, Soeren Laue, Sascha Swiercy
Abstract: We consider an abstract class of optimization problems that are parameterized concavely in a single parameter, and show that the solution path along the parameter can always be approximated with accuracy ε > 0 by a set of size O(1/√ε). A lower bound of size Ω(1/√ε) shows that the upper bound is tight up to a constant factor. We also devise an algorithm that calls a step-size oracle and computes an approximate path of size O(1/√ε). Finally, we provide an implementation of the oracle for soft-margin support vector machines, and a parameterized semi-definite program for matrix completion. 1
2 0.68338221 64 nips-2012-Calibrated Elastic Regularization in Matrix Completion
Author: Tingni Sun, Cun-hui Zhang
Abstract: This paper concerns the problem of matrix completion, which is to estimate a matrix from observations in a small subset of indices. We propose a calibrated spectrum elastic net method with a sum of the nuclear and Frobenius penalties and develop an iterative algorithm to solve the convex minimization problem. The iterative algorithm alternates between imputing the missing entries in the incomplete matrix by the current guess and estimating the matrix by a scaled soft-thresholding singular value decomposition of the imputed matrix until the resulting matrix converges. A calibration step follows to correct the bias caused by the Frobenius penalty. Under proper coherence conditions and for suitable penalties levels, we prove that the proposed estimator achieves an error bound of nearly optimal order and in proportion to the noise level. This provides a unified analysis of the noisy and noiseless matrix completion problems. Simulation results are presented to compare our proposal with previous ones. 1
3 0.56487894 102 nips-2012-Distributed Non-Stochastic Experts
Author: Varun Kanade, Zhenming Liu, Bozidar Radunovic
Abstract: We consider the online distributed non-stochastic experts problem, where the distributed system consists of one coordinator node that is connected to k sites, and the sites are required to communicate with each other via the coordinator. At each time-step t, one of the k site nodes has to pick an expert from the set {1, . . . , n}, and the same site receives information about payoffs of all experts for that round. The goal of the distributed system is to minimize regret at time horizon T , while simultaneously keeping communication to a minimum. The two extreme solutions to this problem are: (i) Full communication: This essentially simulates the nondistributed setting to obtain the optimal O(√(log(n)T)) regret bound at the cost of T communication. (ii) No communication: Each site runs an independent copy – the regret is O(√(log(n)kT)) and the communication is 0. This paper shows the difficulty of simultaneously achieving regret asymptotically better than √(kT) and communication better than T. We give a novel algorithm that for an oblivious adversary achieves a non-trivial trade-off: regret O(√(k^{5(1+ε)/6} T)) and communication O(T/k^ε), for any value of ε ∈ (0, 1/5). We also consider a variant of the model, where the coordinator picks the expert. In this model, we show that the label-efficient forecaster of Cesa-Bianchi et al. (2005) already gives us a strategy that is near optimal in the regret vs communication trade-off. 1
4 0.53039479 206 nips-2012-Majorization for CRFs and Latent Likelihoods
Author: Tony Jebara, Anna Choromanska
Abstract: The partition function plays a key role in probabilistic modeling including conditional random fields, graphical models, and maximum likelihood estimation. To optimize partition functions, this article introduces a quadratic variational upper bound. This inequality facilitates majorization methods: optimization of complicated functions through the iterative solution of simpler sub-problems. Such bounds remain efficient to compute even when the partition function involves a graphical model (with small tree-width) or in latent likelihood settings. For large-scale problems, low-rank versions of the bound are provided and outperform LBFGS as well as first-order methods. Several learning applications are shown and reduce to fast and convergent update rules. Experimental results show advantages over state-of-the-art optimization methods. This supplement presents additional details in support of the full article. These include the application of the majorization method to maximum entropy problems. It also contains proofs of the various theorems, in particular, a guarantee that the bound majorizes the partition function. In addition, a proof is provided guaranteeing convergence on (non-latent) maximum conditional likelihood problems. The supplement also contains supporting lemmas that show the bound remains applicable in constrained optimization problems. The supplement then proves the soundness of the junction tree implementation of the bound for graphical models with large n. It also proves the soundness of the low-rank implementation of the bound for problems with large d. Finally, the supplement contains additional experiments and figures to provide further empirical support for the majorization methodology. Supplement for Section 2 Proof of Theorem 1 Rewrite the partition function as a sum over the integer index j = 1, . . . , n under the random ordering π : Ω → {1, . . . , n}. This defines j∑ π(y) and associates h and f with = n hj = h(π −1 (j)) and fj = f (π −1 (j)). Next, write Z(θ) = j=1 αj exp(λ⊤ fj ) by introducing ˜ ˜ λ = θ − θ and αj = hj exp(θ ⊤ fj ). Define the partition function over only the first i components ∑i as Zi (θ) = j=1 αj exp(λ⊤ fj ). When i = 0, a trivial quadratic upper bound holds ( ) Z0 (θ) ≤ z0 exp 1 λ⊤ Σ0 λ + λ⊤ µ0 2 with the parameters z0 → 0+ , µ0 = 0, and Σ0 = z0 I. Next, add one term to the current partition function Z1 (θ) = Z0 (θ) + α1 exp(λ⊤ f1 ). Apply the current bound Z0 (θ) to obtain 1 Z1 (θ) ≤ z0 exp( 2 λ⊤ Σ0 λ + λ⊤ µ0 ) + α1 exp(λ⊤ f1 ). Consider the following change of variables u γ 1/2 −1/2 = Σ0 λ − Σ0 = α1 z0 exp( 1 (f1 2 (f1 − µ0 )) − µ0 )⊤ Σ−1 (f1 − µ0 )) 0 and rewrite the logarithm of the bound as log Z1 (θ) ( ) 1 ≤ log z0 − 1 (f1 − µ0 )⊤ Σ−1 (f1 − µ0 ) + λ⊤ f1 + log exp( 2 ∥u∥2 ) + γ . 0 2 Apply Lemma 1 (cf. [32] p. 100) to the last term to get ( (1 ) ) log Z1 (θ) ≤ log z0 − 1 (f1 − µ0 )⊤ Σ−1 (f1 − µ0 ) + λ⊤ f1 + log exp 2 ∥v∥2 +γ 0 2 ( ) v⊤ (u − v) 1 + + (u − v)⊤ I + Γvv⊤ (u − v) 1+γ exp(− 1 ∥v∥2 ) 2 2 10 where Γ = 1 1 tanh( 2 log(γ exp(− 2 ∥v∥2 ))) . 1 2 log(γ exp(− 2 ∥v∥2 )) The bound in [32] is tight when u = v. To achieve tightness −1/2 ˜ when θ = θ or, equivalently, λ = 0, we choose v = Σ0 (µ0 − f1 ) yielding ( ) Z1 (θ) ≤ z1 exp 1 λ⊤ Σ1 λ + λ⊤ µ1 2 where we have z1 = z0 + α1 z0 α1 = µ0 + f1 z0 + α1 z0 + α1 1 tanh( 2 log(α1 /z0 )) = Σ0 + (µ0 − f1 )(µ0 − f1 )⊤ . 2 log(α1 /z0 ) µ1 Σ1 This rule updates the bound parameters z0 , µ0 , Σ0 to incorporate an extra term in the sum over i in Z(θ). 
The process is iterated n times (replacing 1 with i and 0 with i − 1) to produce an overall bound on all terms. Lemma 1 (See [32] p. 100) ( ( ) ) For all u ∈ Rd , any v ∈ Rd and any γ ≥ 0, the bound log exp 1 ∥u∥2 + γ ≤ 2 ( ( ) ) log exp 1 ∥v∥2 + γ + 2 ( ) v⊤ (u − v) 1 + (u − v)⊤ I + Γvv⊤ (u − v) 1 1 + γ exp(− 2 ∥v∥2 ) 2 holds when the scalar term Γ = 1 tanh( 2 log(γ exp(−∥v∥2 /2))) . 2 log(γ exp(−∥v∥2 /2)) Equality is achieved when u = v. Proof of Lemma 1 The proof is provided in [32]. Supplement for Section 3 Maximum entropy problem We show here that partition functions arise naturally in maximum ∑ p(y) entropy estimation or minimum relative entropy RE(p∥h) = y p(y) log h(y) estimation. Consider the following problem: ∑ ∑ min RE(p∥h) s.t. p(y)f (y) = 0, p(y)g(y) ≥ 0. p y y d′ Here, assume that f : Ω → R and g : Ω → R are arbitrary (non-constant) vector-valued functions ( ) over the sample space. The solution distribution p(y) = h(y) exp θ ⊤ f (y) + ϑ⊤ g(y) /Z(θ, ϑ) is recovered by the dual optimization ∑ ( ) arg θ, ϑ = max − log h(y) exp θ ⊤ f (y) + ϑ⊤ g(y) d ϑ≥0,θ y ′ where θ ∈ Rd and ϑ ∈ Rd . These are obtained by minimizing Z(θ, ϑ) or equivalently by maximizing its negative logarithm. Algorithm 1 permits variational maximization of the dual via the quadratic program min 1 (β ϑ≥0,θ 2 ˜ ˜ − β)⊤ Σ(β − β) + β ⊤ µ ′ where β ⊤ = [θ ⊤ ϑ⊤ ]. Note that any general convex hull of constraints β ∈ Λ ⊆ Rd+d could be imposed without loss of generality. Proof of Theorem 2 We begin by proving a lemma that will be useful later. Lemma 2 If κΨ ≽ Φ ≻ 0 for Φ, Ψ ∈ Rd×d , then 1 ˜ ˜ ˜ L(θ) = − 2 (θ − θ)⊤ Φ(θ − θ) − (θ − θ)⊤ µ ˜ ˜ ˜ U (θ) = − 1 (θ − θ)⊤ Ψ(θ − θ) − (θ − θ)⊤ µ 2 satisfy supθ∈Λ L(θ) ≥ 1 κ ˜ supθ∈Λ U (θ) for any convex Λ ⊆ Rd , θ ∈ Λ, µ ∈ Rd and κ ∈ R+ . 11 Proof of Lemma 2 Define the primal problems of interest as PL = supθ∈Λ L(θ) and PU = supθ∈Λ U (θ). The constraints θ ∈ Λ can be summarized by a set of linear inequalities Aθ ≤ b where A ∈ Rk×d and b ∈ Rk for some (possibly infinite) k ∈ Z. Apply the change of variables ˜ ˜ ˜ ˜ ˜ ˜ z = θ− θ. The constraint A(z+ θ) ≤ b simplifies into Az ≤ b where b = b−Aθ. Since θ ∈ Λ, it 1 ⊤ ˜ ≥ 0. We obtain the equivalent primal problems PL = sup is easy to show that b ˜ − z Φz − Az≤b z⊤ µ and PU = supAz≤b − 1 z⊤ Ψz − z⊤ µ. The corresponding dual problems are ˜ 2 2 ⊤ −1 y⊤AΦ−1A⊤y ˜ µ Φ µ +y⊤AΦ−1µ+y⊤b+ y≥0 2 2 y⊤AΨ−1 A⊤y µ⊤Ψ−1µ ˜ DU = inf +y⊤AΨ−1µ+y⊤b+ . y≥0 2 2 DL = inf ˜ Due to strong duality, PL = DL and PU = DU . Apply the inequalities Φ ≼ κΨ and y⊤ b > 0 as ⊤ −1 y⊤AΨ−1 A⊤y y⊤AΨ−1 µ κ ˜ µΨ µ + + y⊤b + PL ≥ sup − z⊤ Ψz − z⊤ µ = inf y≥0 2 2κ κ 2κ ˜ Az≤b 1 1 ≥ DU = PU . κ κ 1 This proves that PL ≥ κ PU . We will use the above to prove Theorem 2. First, we will upper-bound (in the Loewner ordering sense) the matrices Σj in Algorithm 2. Since ∥fxj (y)∥2 ≤ r for all y ∈ Ωj and since µj in Algorithm 1 is a convex combination of fxj (y), the outer-product terms in the update for Σj satisfy (fxj (y) − µ)(fxj (y) − µ)⊤ ≼ 4r2 I. Thus, Σj ≼ F(α1 , . . . , αn )4r2 I holds where α 1 F(α1 , . . . , αn ) = i n ∑ tanh( 2 log( ∑i−1 αk )) k=1 αi 2 log( ∑i−1 α ) i=2 k k=1 using the definition of α1 , . . . , αn in the proof of Theorem 1. The formula for F starts at i = 2 since z0 → 0+ . Assume permutation π is sampled uniformly at random. The expected value of F is then α π(i) 1 n 1 ∑ ∑ tanh( 2 log( ∑i−1 απ(k) )) k=1 . Eπ [F(α1 , . . . , αn )] = απ(i) n! π i=2 ) 2 log( ∑i−1 k=1 απ(k) We claim that the expectation is maximized when all αi = 1 or any positive constant. 
Also, F is invariant under uniform scaling of its arguments. Write the expected value of F as E for short. ∂E ∂E ∂E Next, consider ∂αl at the setting αi = 1, ∀i. Due to the expectation over π, we have ∂αl = ∂αo for any l, o. Therefore, the gradient vector is constant when all αi = 1. Since F(α1 , . . . , αn ) is invariant to scaling, the gradient vector must therefore be the all zeros vector. Thus, the point ∂ ∂E when all αi = 1 is an extremum or a saddle. Next, consider ∂αo ∂αl for any l, o. At the setting 2 ∂ ∂E αi = 1, ∂ E = −c(n) and, ∂αo ∂αl = c(n)/(n − 1) for some non-negative constant function ∂α2 l c(n). Thus, the αi = 1 extremum is locally concave and is a maximum. This establishes that Eπ [F(α1 , . . . , αn )] ≤ Eπ [F(1, . . . , 1)] and yields the Loewner bound ) ( n−1 ∑ tanh(log(i)/2) 2 I = ωI. Σj ≼ 2r log(i) i=1 Apply this bound to each Σj in the lower bound on J(θ) and also note a corresponding upper bound ∑ ˜ ˜ ˜ J(θ) ≥ J(θ)−tω+tλ ∥θ − θ∥2− (θ − θ)⊤(µj −fxj (yj )) 2 j ∑ ˜ ˜ ˜ J(θ) ≤ J(θ)−tλ ∥θ − θ∥2− (θ − θ)⊤(µj −fxj (yj )) 2 j 12 ˜ which follows from Jensen’s inequality. Define the current θ at time τ as θτ and denote by Lτ (θ) the above lower bound and by Uτ (θ) the above upper bound at time τ . Clearly, Lτ (θ) ≤ J(θ) ≤ Uτ (θ) with equality when θ = θτ . Algorithm 2 maximizes J(θ) after initializing at θ0 and performing an update by maximizing a lower bound based on Σj . Since Lτ (θ) replaces the definition of Σj with ωI ≽ Σj , Lτ (θ) is a looser bound than the one used by Algorithm 2. Thus, performing θτ +1 = arg maxθ∈Λ Lτ (θ) makes less progress than a step of Algorithm 1. Consider computing the slower update at each iteration τ and returning θτ +1 = arg maxθ∈Λ Lτ (θ). Setting Φ = (tω +tλ)I, Ψ = tλI and κ = ω+λ allows us to apply Lemma 2 as follows λ sup Lτ (θ) − Lτ (θτ ) = θ∈Λ 1 sup Uτ (θ) − Uτ (θτ ). κ θ∈Λ Since Lτ (θτ ) = J(θτ ) = Uτ (θτ ), J(θτ +1 ) ≥ supθ∈Λ Lτ (θ) and supθ∈Λ Uτ (θ) ≥ J(θ ∗ ), we obtain ( ) 1 J(θτ +1 ) − J(θ ∗ ) ≥ 1− (J(θτ ) − J(θ ∗ )) . κ Iterate the above inequality starting at t = 0 to obtain ( )τ 1 ∗ J(θτ ) − J(θ ) ≥ 1− (J(θ0 ) − J(θ ∗ )) . κ ( ) 1 τ κ A solution within a multiplicative factor of ϵ implies that ϵ = 1 − κ or log(1/ϵ) = τ log κ−1 . ⌉ ⌈ Inserting the definition for κ shows that the number of iterations τ is at most log(1/ϵ) or κ log κ−1 ⌉ ⌈ log(1/ϵ) log(1+λ/ω) . Inserting the definition for ω gives the bound. Y12,0 Y11,1 Y21,1 Y31,1 ··· 1,1 Ym1,1 Figure 3: Junction tree of depth 2. Algorithm 5 SmallJunctionTree ˜ Input Parameters θ and h(u), f (u) ∀u ∈ Y12,0 and zi , Σi , µi ∀i = 1, . . . , m1,1 + Initialize z → 0 , µ = 0, Σ = zI For each configuration u ∈ Y12,0 { ∏m1,1 ∑m1,1 ∏m1,1 ˜ ˜ ˜ α = h(u)( ∑ zi exp(−θ ⊤ µi )) exp(θ ⊤ (f (u) + i=1 µi )) = h(u) exp(θ ⊤ f (u)) i=1 zi i=1 m1,1 l = f (u) + i=1 µi − µ 1 ∑m1,1 tanh( 2 log(α/z)) ⊤ Σ + = i=1 Σi + ll 2 log(α/z) α µ + = z+α l z += α } Output z, µ, Σ Supplement for Section 5 Proof of correctness for Algorithm 3 Consider a simple junction tree of depth 2 shown on Figure 3. The notation Yca,b refers to the cth tree node located at tree level a (first level is considered as the one with∑ leaves) whose parent is the bth from the higher tree level (the root has no parent so b = 0). tree ∑ Let Y a1 ,b1 refer to the sum over all configurations of variables in Yca1 ,b1 and Y a1 ,b1 \Y a2 ,b2 1 c1 c1 c2 refers to the sum over all configurations of variables that are in Yca1 ,b1 but not in Yca2 ,b2 . Let ma,b 1 2 denote the number of children of the bth node located at tree level a + 1. 
For short-hand, use ψ(Y ) = h(Y ) exp(θ ⊤ f (Y )). The partition function can be expressed as: 13 Y13,0 Y12,1 ··· Y11,1 Y21,1 ··· Y22,1 1,1 Ym1,1 ··· Y11,2 Y21,2 1,2 Ym1,2 2,1 Ym2,1 1,m2,1 Y1 ··· 1,m2,1 Y2 1,m 2,1 Ym1,m2,1 Figure 4: Junction tree of depth 3. Z(θ) = ∑ 2,0 u∈Y1 ≤ ∑ 2,0 u∈Y1 = ∑ ψ(u) ∏ m1,1 i=1 [ ∏ m1,1 ψ(u) i=1 [ ∑ ψ(v) 2,0 v∈Yi1,1 \Y1 ) 1( ˜ ˜ ˜ zi exp( θ − θ)⊤ Σi (θ − θ) + (θ − θ)⊤ µi 2 ∏ ( m1,1 ⊤ h(u) exp(θ f (u)) 2,0 u∈Y1 zi exp i=1 ] 1 ˜ ˜ ˜ (θ − θ)⊤ Σi (θ − θ) + (θ − θ)⊤ µi 2 )] ∑ where the upper-bound is obtained by applying Theorem 1 to each of the terms v∈Y 1,1 \Y 2,0 ψ(v). 1 i By simply rearranging terms we get: ) ( ( [ (m1,1 )) m1,1 ∏ ∑ ∑ ˜ zi exp(−θ ⊤ µi ) exp θ ⊤ f (u) + µi h(u) Z(θ) ≤ 2,0 u∈Y1 i=1 ( exp 1 ˜ (θ − θ)⊤ 2 (m1,1 ∑ ) Σi i=1 ˜ (θ − θ) )] . i=1 One ( can prove that this ) expression can be upper-bounded by 1 ˆ ⊤ Σ(θ − θ) + (θ − θ)⊤ µ where z, Σ and µ can be computed using Algoˆ ˆ z exp 2 (θ − θ) rithm 5 (a simplification of Algorithm 3). We will call this result Lemma A. The proof is similar to the proof of Theorem 1 so is not repeated here. Consider enlarging the tree to a depth 3 as shown on Figure 4. The partition function is now m2,1 m1,i ∑ ∏ ∑ ∏ ∑ Z(θ) = ψ(w) . ψ(u) ψ(v) 3,0 u∈Y1 i=1 3,0 v∈Yi2,1 \Y1 j=1 w∈Yj1,i \Yi2,1 ( )) ∏m1,i (∑ ∑ term By Lemma A we can upper bound each v∈Y 2,1 \Y 3,0 ψ(v) j=1 w∈Yj1,i \Yi2,1 ψ(w) 1 i ( ) ˆ ˆ ˆ by the expression zi exp 1 (θ − θ)⊤ Σi (θ − θ) + (θ − θ)⊤ µi . This yields 2 [ )] ( m2,1 ∑ ∏ 1 ˜ ˜ ˜ Z(θ) ≤ ψ(u) (θ − θ)⊤ Σi (θ − θ) + (θ − θ)⊤ µi . zi exp 2 3,0 i=1 u∈Y1 2,1 2,1 2,1 This process can be viewed as collapsing the sub-trees S1 , S2 , . . ., Sm2,1 to super-nodes that are represented by bound parameters, zi , Σi and µi , i = {1, 2, · · · , m2,1 }, where the sub-trees are 14 defined as: 2,1 S1 = 1,1 {Y12,1 , Y11,1 , Y21,1 , Y31,1 , . . . , Ym1,1 } 2,1 S2 = 1,2 {Y22,1 , Y11,2 , Y21,2 , Y31,2 , . . . , Ym1,2 } = 2,1 {Ym2,1 , Y1 . . . 2,1 Sm2,1 1,m2,1 1,m2,1 , Y2 1,m2,1 , Y3 1,m2,1 , . . . , Ym1,m2,1 }. Notice that the obtained expression can be further upper bounded using again Lemma A (induction) ( ) ˆ ˆ ˆ yielding a bound of the form: z exp 1 (θ − θ)⊤ Σ(θ − θ) + (θ − θ)⊤ µ . 2 Finally, for a general tree, follow the same steps described above, starting from leaves and collapsing nodes to super-nodes, each represented by bound parameters. This procedure effectively yields Algorithm 3 for the junction tree under consideration. Supplement for Section 6 Proof of correctness for Algorithm 4 We begin by proving a lemma that will be useful later. Lemma 3 For all x ∈ Rd and for all l ∈ Rd , 2 d d 2 ∑ ∑ l(i) . x(i)2 l(i)2 ≥ x(i) √∑ d l(j)2 i=1 i=1 j=1 Proof of Lemma 3 By Jensen’s inequality, 2 ( d )2 d d ∑ x(i)l(i)2 ∑ ∑ x(i)l(i)2 . √∑ x(i)2 ∑d ⇐⇒ x(i)2 l(i)2 ≥ ≥ ∑d d l(j)2 l(j)2 l(j)2 j=1 j=1 i=1 i=1 i=1 i=1 d ∑ l(i)2 j=1 Now we prove the correctness of Algorithm 4. At the ith iteration, the algorithm stores Σi using ⊤ a low-rank representation Vi Si Vi + Di where Vi ∈ Rk×d is orthonormal, Si ∈ Rk×k positive d×d semi-definite and Di ∈ R is non-negative diagonal. The diagonal terms D are initialized to tλI where λ is the regularization term. To mimic Algorithm 1 we must increment the Σ matrix by a rank one update of the form Σi = Σi−1 + ri r⊤ . By projecting ri onto each eigenvector in V, we i ∑k ⊤ can decompose it as ri = j=1 Vi−1 (j, ·)ri Vi−1 (j, ·)⊤ + g = Vi−1 Vi−1 ri + g where g is the remaining residue. 
Thus the update rule can be rewritten as: Σi ⊤ ⊤ ⊤ = Σi−1 + ri r⊤ = Vi−1 Si−1 Vi−1 + Di−1 + (Vi−1 Vi−1 ri + g)(Vi−1 Vi−1 ri + g)⊤ i ′ ′ ′ ⊤ ⊤ ⊤ = Vi−1 (Si−1 + Vi−1 ri r⊤ Vi−1 )Vi−1 + Di−1 + gg⊤ = Vi−1 Si−1 Vi−1 + gg⊤ + Di−1 i ′ where we define Vi−1 = Qi−1 Vi−1 and defined Qi−1 in terms of the singular value decomposition, ′ ′ ⊤ Q⊤ Si−1 Qi−1 = svd(Si−1 + Vi−1 ri r⊤ Vi−1 ). Note that Si−1 is diagonal and nonnegative by i−1 i construction. The current formula for Σi shows that we have a rank (k + 1) system (plus diagonal term) which needs to be converted back to a rank k system (plus diagonal term) which we denote by ′ Σi . We have two options as follows. Case 1) Remove g from Σi to obtain ′ ′ ′ ′ ⊤ Σi = Vi−1 Si−1 Vi−1 + Di−1 = Σi − gg⊤ = Σi − cvv⊤ 1 where c = ∥g∥2 and v = ∥g∥ g. th Case 2) Remove the m (smallest) eigenvalue in S′ and its corresponding eigenvector: i−1 ′ Σi ′ ′ ′ ′ ′ ′ ⊤ = Vi−1 Si−1 Vi−1 + Di−1 + gg⊤ − S (m, m)V (m, ·)⊤ V (m, ·) = Σi − cvv⊤ ′ ′ where c = S (m, m) and v = V(m, ·) . 15 ′ Clearly, both cases can be written as an update of the form Σi = Σi + cvv⊤ where c ≥ 0 and v⊤ v = 1. We choose the case with smaller c value to minimize the change as we drop from a system of order (k + 1) to order k. Discarding the smallest singular value and its corresponding eigenvector would violate the bound. Instead, consider absorbing this term into the diagonal component to ′′ ′ preserve the bound. Formally, we look for a diagonal matrix F such that Σi = Σi + F which also ′′ maintains x⊤ Σi x ≥ x⊤ Σi x for all x ∈ Rd . Thus, we want to satisfy: ( d )2 d ∑ ∑ ⊤ ′′ ⊤ ⊤ ⊤ ⊤ x Σi x ≥ x Σi x ⇐⇒ x cvv x ≤ x Fx ⇐⇒ c x(i)v(i) ≤ x(i)2 F(i) i=1 i=1 where, for ease of notation, we take F(i) = F(i, i). ′ where w = v⊤ 1. Consider the case where v ≥ 0 though we will soon get rid of (∑ )2 ∑d d this assumption. We need an F such that i=1 x(i)2 F(i) ≥ c i=1 x(i)v(i) . Equivalently, we (∑ ) ∑d ′ 2 ′ d need i=1 x(i)2 F(i) ≥ . Define F(i) = F(i) for all i = 1, . . . , d. So, we need 2 i=1 x(i)v(i) cw cw2 (∑ ) ∑d ′ ′ ′ 2 d an F such that i=1 x(i)2 F(i) ≥ . Using Lemma 3 it is easy to show that we i=1 x(i)v(i) Define v = 1 wv ′ ′ ′ may choose F (i) = v(i) . Thus, we obtain F(i) = cw2 F(i) = cwv(i). Therefore, for all x ∈ Rd , ∑d all v ≥ 0, and for F(i) = cv(i) j=1 v(j) we have d ∑ ( x(i) F(i) ≥ c 2 i=1 d ∑ )2 x(i)v(i) . (3) i=1 To generalize the inequality to hold for all vectors v ∈ Rd with potentially negative entries, it is ∑d sufficient to set F(i) = c|v(i)| j=1 |v(j)|. To verify this, consider flipping the sign of any v(i). The left side of the Inequality 3 does not change. For the right side of this inequality, flipping the sign of v(i) is equivalent to flipping the sign of x(i) and not changing the sign of v(i). However, in this case the inequality holds as shown before (it holds for any x ∈ Rd ). Thus for all x, v ∈ Rd and ∑d for F(i) = c|v(i)| j=1 |v(j)|, Inequality 3 holds. Supplement for Section 7 Small scale experiments In additional small-scale experiments, we compared Algorithm 2 with steepest descent (SD), conjugate gradient (CG), BFGS and Newton-Raphson. Small-scale problems may be interesting in real-time learning settings, for example, when a website has to learn from a user’s uploaded labeled data in a split second to perform real-time retrieval. We considered logistic regression on five UCI data sets where missing values were handled via mean-imputation. 
A range of regularization settings λ ∈ {100 , 102 , 104 } was explored and all algorithms were initialized from the same ten random start-points. Table 3 shows the average number of seconds each algorithm needed to achieve the same solution that BFGS converged to (all algorithms achieve the same solution due to concavity). The bound is the fastest algorithm as indicated in bold. data|λ BFGS SD CG Newton Bound a|100 1.90 1.74 0.78 0.31 0.01 a|102 0.89 0.92 0.83 0.25 0.01 a|104 2.45 1.60 0.85 0.22 0.01 b|100 3.14 2.18 0.70 0.43 0.07 b|102 2.00 6.17 0.67 0.37 0.04 b|104 1.60 5.83 0.83 0.35 0.04 c|100 4.09 1.92 0.65 0.39 0.07 c|102 1.03 0.64 0.64 0.34 0.02 c|104 1.90 0.56 0.72 0.32 0.02 d|100 5.62 12.04 1.36 0.92 0.16 d|102 2.88 1.27 1.21 0.63 0.09 d|104 3.28 1.94 1.23 0.60 0.07 e|100 2.63 2.68 0.48 0.35 0.03 e|102 2.01 2.49 0.55 0.26 0.03 e|104 1.49 1.54 0.43 0.20 0.03 Table 3: Convergence time in seconds under various regularization levels for a) Bupa (t = 345, dim = 7), b) Wine (t = 178, dim = 14), c) Heart (t = 187, dim = 23), d) Ion (t = 351, dim = 34), and e) Hepatitis (t = 155, dim = 20) data sets. Influence of rank k on bound performance in large scale experiments We also examined the influence of k on bound performance and compared it with LBFGS, SD and CG. Several choices 16 of k were explored. Table 4 shows results for the SRBCT data-set. In general, the bound performs best but slows down for superfluously large values of k. Steepest descent and conjugate gradient are slow yet obviously do not vary with k. Note that each iteration takes less time with smaller k for the bound. However, we are reporting overall runtime which is also a function of the number of iterations. Therefore, total runtime (a function of both) may not always decrease/increase with k. k LBFGS SD CG Bound 1 1.37 8.80 4.39 0.56 2 1.32 8.80 4.39 0.56 4 1.39 8.80 4.39 0.67 8 1.35 8.80 4.39 0.96 16 1.46 8.80 4.39 1.34 32 1.40 8.80 4.39 2.11 64 1.54 8.80 4.39 4.57 Table 4: Convergence time in seconds as a function of k. Additional latent-likelihood results For completeness, Figure 5 depicts two additional data-sets to complement Figure 2. Similarly, Table 5 shows all experimental settings explored in order to provide the summary Table 2 in the main article. bupa wine −19 0 −5 −log(J(θ)) −log(J(θ)) −20 −21 −22 Bound Newton BFGS Conjugate gradient Steepest descent −15 −23 −24 −5 −10 0 5 log(Time) [sec] 10 −20 −4 −2 0 2 4 log(Time) [sec] 6 8 Figure 5: Convergence of test latent log-likelihood for bupa and wine data-sets. Data-set Algorithm BFGS SD CG Newton Bound ion m=1 m=2m=3m=4 -4.96 -5.55 -5.88 -5.79 -11.80 -9.92 -5.56 -8.59 -5.47 -5.81 -5.57 -5.22 -5.95 -5.95 -5.95 -5.95 -6.08 -4.84 -4.18 -5.17 Data-set Algorithm BFGS SD CG Newton Bound bupa m=1 m=2 m=3 m=4 -22.07 -21.78 -21.92 -21.87 -21.76 -21.74 -21.73 -21.83 -21.81 -21.81 -21.81 -21.81 -21.85 -21.85 -21.85 -21.85 -21.85 -19.95 -20.01 -19.97 wine m=1m=2m=3m=4 -0.90 -0.91 -1.79 -1.35 -1.61 -1.60 -1.37 -1.63 -0.51 -0.78 -0.95 -0.51 -0.71 -0.71 -0.71 -0.71 -0.51 -0.51 -0.48 -0.51 hepatitis m=1m=2m=3m=4 -4.42 -5.28 -4.95 -4.93 -4.93 -5.14 -5.01 -5.20 -4.84 -4.84 -4.84 -4.84 -5.50 -5.50 -5.50 -4.50 -5.47 -4.40 -4.75 -4.92 SRBCT m=1m=2m=3m=4 -5.99 -6.17 -6.09 -6.06 -5.61 -5.62 -5.62 -5.61 -5.62 -5.49 -5.36 -5.76 -5.54 -5.54 -5.54 -5.54 -5.31 -5.31 -4.90 -0.11 Table 5: Test latent log-likelihood at convergence for different values of m ∈ {1, 2, 3, 4} on ion, bupa, hepatitis, wine and SRBCT data-sets. 17
5 0.52284247 85 nips-2012-Convergence and Energy Landscape for Cheeger Cut Clustering
Author: Xavier Bresson, Thomas Laurent, David Uminsky, James V. Brecht
Abstract: This paper provides both theoretical and algorithmic results for the ℓ1-relaxation of the Cheeger cut problem. The ℓ2-relaxation, known as spectral clustering, only loosely relates to the Cheeger cut; however, it is convex and leads to a simple optimization problem. The ℓ1-relaxation, in contrast, is non-convex but is provably equivalent to the original problem. The ℓ1-relaxation therefore trades convexity for exactness, yielding improved clustering results at the cost of a more challenging optimization. The first challenge is understanding convergence of algorithms. This paper provides the first complete proof of convergence for algorithms that minimize the ℓ1-relaxation. The second challenge entails comprehending the ℓ1 energy landscape, i.e. the set of possible points to which an algorithm might converge. We show that ℓ1-algorithms can get trapped in local minima that are not globally optimal and we provide a classification theorem to interpret these local minima. This classification gives meaning to these suboptimal solutions and helps to explain, in terms of graph structure, when the ℓ1-relaxation provides the solution of the original Cheeger cut problem. 1
6 0.52224177 258 nips-2012-Online L1-Dictionary Learning with Application to Novel Document Detection
7 0.5152213 320 nips-2012-Spectral Learning of General Weighted Automata via Constrained Matrix Completion
8 0.5043835 301 nips-2012-Scaled Gradients on Grassmann Manifolds for Matrix Completion
9 0.49194461 17 nips-2012-A Scalable CUR Matrix Decomposition Algorithm: Lower Time Complexity and Tighter Bound
10 0.48587203 361 nips-2012-Volume Regularization for Binary Classification
11 0.47544456 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders
12 0.46255457 128 nips-2012-Fast Resampling Weighted v-Statistics
13 0.46207875 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification
14 0.45618746 34 nips-2012-Active Learning of Multi-Index Function Models
15 0.45168874 27 nips-2012-A quasi-Newton proximal splitting method
16 0.44682947 48 nips-2012-Augmented-SVM: Automatic space partitioning for combining multiple non-linear dynamics
17 0.44493169 29 nips-2012-Accelerated Training for Matrix-norm Regularization: A Boosting Approach
18 0.4436152 337 nips-2012-The Lovász ϑ function, SVMs and finding large dense subgraphs
19 0.43916371 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)
20 0.43654051 86 nips-2012-Convex Multi-view Subspace Learning
topicId topicWeight
[(0, 0.023), (38, 0.126), (42, 0.012), (54, 0.022), (74, 0.022), (76, 0.118), (80, 0.069), (92, 0.509)]
simIndex simValue paperId paperTitle
same-paper 1 0.87836123 44 nips-2012-Approximating Concavely Parameterized Optimization Problems
Author: Joachim Giesen, Jens Mueller, Soeren Laue, Sascha Swiercy
Abstract: We consider an abstract class of optimization problems that are parameterized concavely in a single parameter, and show that the solution path along the parameter can always be approximated with accuracy ε > 0 by a set of size O(1/√ε). A lower bound of size Ω(1/√ε) shows that the upper bound is tight up to a constant factor. We also devise an algorithm that calls a step-size oracle and computes an approximate path of size O(1/√ε). Finally, we provide an implementation of the oracle for soft-margin support vector machines, and a parameterized semi-definite program for matrix completion. 1
2 0.87319952 87 nips-2012-Convolutional-Recursive Deep Learning for 3D Object Classification
Author: Richard Socher, Brody Huval, Bharath Bath, Christopher D. Manning, Andrew Y. Ng
Abstract: Recent advances in 3D sensing technologies make it possible to easily record color and depth images which together can improve object recognition. Most current methods rely on very well-designed features for this new 3D modality. We introduce a model based on a combination of convolutional and recursive neural networks (CNN and RNN) for learning features and classifying RGB-D images. The CNN layer learns low-level translationally invariant features which are then given as inputs to multiple, fixed-tree RNNs in order to compose higher order features. RNNs can be seen as combining convolution and pooling into one efficient, hierarchical operation. Our main result is that even RNNs with random weights compose powerful features. Our model obtains state of the art performance on a standard RGB-D object dataset while being more accurate and faster during training and testing than comparable architectures such as two-layer CNNs. 1
3 0.85960913 42 nips-2012-Angular Quantization-based Binary Codes for Fast Similarity Search
Author: Yunchao Gong, Sanjiv Kumar, Vishal Verma, Svetlana Lazebnik
Abstract: This paper focuses on the problem of learning binary codes for efficient retrieval of high-dimensional non-negative data that arises in vision and text applications where counts or frequencies are used as features. The similarity of such feature vectors is commonly measured using the cosine of the angle between them. In this work, we introduce a novel angular quantization-based binary coding (AQBC) technique for such data and analyze its properties. In its most basic form, AQBC works by mapping each non-negative feature vector onto the vertex of the binary hypercube with which it has the smallest angle. Even though the number of vertices (quantization landmarks) in this scheme grows exponentially with data dimensionality d, we propose a method for mapping feature vectors to their smallest-angle binary vertices that scales as O(d log d). Further, we propose a method for learning a linear transformation of the data to minimize the quantization error, and show that it results in improved binary codes. Experiments on image and text datasets show that the proposed AQBC method outperforms the state of the art. 1
4 0.85744083 82 nips-2012-Continuous Relaxations for Discrete Hamiltonian Monte Carlo
Author: Yichuan Zhang, Zoubin Ghahramani, Amos J. Storkey, Charles A. Sutton
Abstract: Continuous relaxations play an important role in discrete optimization, but have not seen much use in approximate probabilistic inference. Here we show that a general form of the Gaussian Integral Trick makes it possible to transform a wide class of discrete variable undirected models into fully continuous systems. The continuous representation allows the use of gradient-based Hamiltonian Monte Carlo for inference, results in new ways of estimating normalization constants (partition functions), and in general opens up a number of new avenues for inference in difficult discrete systems. We demonstrate some of these continuous relaxation inference algorithms on a number of illustrative problems. 1
5 0.82378769 349 nips-2012-Training sparse natural image models with a fast Gibbs sampler of an extended state space
Author: Lucas Theis, Jascha Sohl-dickstein, Matthias Bethge
Abstract: We present a new learning strategy based on an efficient blocked Gibbs sampler for sparse overcomplete linear models. Particular emphasis is placed on statistical image modeling, where overcomplete models have played an important role in discovering sparse representations. Our Gibbs sampler is faster than general purpose sampling schemes while also requiring no tuning as it is free of parameters. Using the Gibbs sampler and a persistent variant of expectation maximization, we are able to extract highly sparse distributions over latent sources from data. When applied to natural images, our algorithm learns source distributions which resemble spike-and-slab distributions. We evaluate the likelihood and quantitatively compare the performance of the overcomplete linear model to its complete counterpart as well as a product of experts model, which represents another overcomplete generalization of the complete linear model. In contrast to previous claims, we find that overcomplete representations lead to significant improvements, but that the overcomplete linear model still underperforms other models. 1
6 0.81811047 110 nips-2012-Efficient Reinforcement Learning for High Dimensional Linear Quadratic Systems
7 0.60376209 101 nips-2012-Discriminatively Trained Sparse Code Gradients for Contour Detection
8 0.59990001 329 nips-2012-Super-Bit Locality-Sensitive Hashing
9 0.58005631 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines
10 0.57273614 235 nips-2012-Natural Images, Gaussian Mixtures and Dead Leaves
11 0.56183362 94 nips-2012-Delay Compensation with Dynamical Synapses
12 0.56057006 65 nips-2012-Cardinality Restricted Boltzmann Machines
13 0.55973792 260 nips-2012-Online Sum-Product Computation Over Trees
14 0.55420983 24 nips-2012-A mechanistic model of early sensory processing based on subtracting sparse representations
15 0.55351865 102 nips-2012-Distributed Non-Stochastic Experts
16 0.55189466 163 nips-2012-Isotropic Hashing
17 0.55015951 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video
18 0.5437969 4 nips-2012-A Better Way to Pretrain Deep Boltzmann Machines
19 0.54304528 197 nips-2012-Learning with Recursive Perceptual Representations
20 0.54005259 148 nips-2012-Hamming Distance Metric Learning