nips nips2012 nips2012-44 knowledge-graph by maker-knowledge-mining

44 nips-2012-Approximating Concavely Parameterized Optimization Problems


Source: pdf

Author: Joachim Giesen, Jens Mueller, Soeren Laue, Sascha Swiercy

Abstract: We consider an abstract class of optimization problems that are parameterized concavely in a single parameter, and show that the solution path along the parameter can always be approximated with accuracy ε > 0 by a set of size O(1/√ε). A lower bound of size Ω(1/√ε) shows that the upper bound is tight up to a constant factor. We also devise an algorithm that calls a step-size oracle and computes an approximate path of size O(1/√ε). Finally, we provide an implementation of the oracle for soft-margin support vector machines, and a parameterized semi-definite program for matrix completion.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Approximating Concavely Parameterized Optimization Problems. Sören Laue, Friedrich-Schiller-Universität Jena, Germany, soeren. [sent-1, score-0.08]

2 Abstract: We consider an abstract class of optimization problems that are parameterized concavely in a single parameter, and show that the solution path along the parameter can always be approximated with accuracy ε > 0 by a set of size O(1/√ε). [sent-10, score-0.536]

3 A lower bound of size Ω(1/√ε) shows that the upper bound is tight up to a constant factor. [sent-11, score-0.15]

4 We also devise an algorithm that calls a step-size oracle and computes an approximate path of size O(1/√ε). [sent-12, score-0.522]

5 Finally, we provide an implementation of the oracle for soft-margin support vector machines, and a parameterized semi-definite program for matrix completion. [sent-13, score-0.189]

6 Let D be a set, I ⊆ R an interval, and f : I × D → R such that (1) f (t, ·) is bounded from below for every t ∈ I, and (2) f (·, x) is concave for every x ∈ D. [sent-15, score-0.079]

7 We study the parameterized optimization problem h(t) = min_{x∈D} f(t, x). [sent-16, score-0.074]

8 A solution x*_t ∈ D is called optimal at parameter value t if f(t, x*_t) = h(t), and x ∈ D is called an ε-approximation at t if ε(t, x) := f(t, x) − h(t) ≤ ε. [sent-17, score-0.028]

9 The size of a smallest ε-approximation path is called the ε-path complexity of the parameterized optimization problem. [sent-20, score-0.52]
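
As a concrete illustration of these definitions, here is a minimal Python sketch (our own; the function names, the grid discretization, and the toy problem are assumptions, not taken from the paper) that evaluates the approximation error ε(t, x) = f(t, x) − h(t) and greedily builds an ε-approximation path over [a, b] from black-box functions.

```python
import numpy as np

def eps_error(f, h, t, x):
    """Approximation error eps(t, x) := f(t, x) - h(t) of the solution x at parameter t."""
    return f(t, x) - h(t)

def greedy_eps_path(f, h, argmin_x, a, b, eps, grid=10000):
    """Greedily build an eps-path on a fine grid of [a, b]: keep the current solution
    while it remains an eps-approximation, and re-solve (via argmin_x) as soon as it fails."""
    ts = np.linspace(a, b, grid)
    x = argmin_x(ts[0])
    path = [(ts[0], x)]
    for t in ts[1:]:
        if eps_error(f, h, t, x) > eps:
            x = argmin_x(t)
            path.append((t, x))
    return path

# Toy regularized problem of the form r(x) + t*l(x) with r(x) = x^2 and l(x) = |x - 1|;
# its exact minimizer is x*_t = min(t/2, 1), so h(t) can be evaluated directly.
f = lambda t, x: x**2 + t * abs(x - 1.0)
xstar = lambda t: min(t / 2.0, 1.0)
h = lambda t: f(t, xstar(t))
print(len(greedy_eps_path(f, h, xstar, a=0.0, b=5.0, eps=1e-3)))
```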

10 The aim of this paper is to derive upper and lower bounds on the path complexity, and to provide efficient algorithms to compute ε-paths. [sent-21, score-0.414]

11 The rather abstract problem from above is motivated by regularized optimization problems that are abundant in machine learning, i.e., [sent-23, score-0.029]

12 by problems of the form min_{x∈D} f(t, x) := r(x) + t · l(x), where r(x) is a regularization term and l(x) a loss term. [sent-25, score-0.034]

13 The parameter t controls the trade-off between regularization and loss. [sent-26, score-0.051]

14 Note that here f (·, x) is always linear and hence concave in the parameter t. [sent-27, score-0.128]

15 Due to the widespread use of regularized optimization methods in machine learning, regularization path following algorithms have become an active area of research. [sent-29, score-0.479]

16 Initially, exact path tracking methods have been developed for many machine learning problems [16, 18, 3, 9] starting with the algorithm for SVMs by Hastie et al. [sent-30, score-0.375]

17 Also, the exact regularization path can be exponentially large in the input size [5, 14]. [sent-33, score-0.426]

18 Approximation path algorithms with approximation guarantees have been developed for SVMs with square loss [6], the LASSO [14], and matrix completion and factorization problems [8, 7]. [sent-35, score-0.483]

19 We provide a structural upper bound in O(1/√ε) for the ε-path complexity for the abstract problem class described above. [sent-37, score-0.145]

20 We show that this bound is tight up to a multiplicative constant by constructing a lower bound in Ω(1/√ε). [sent-38, score-0.111]

21 Finally, we devise a generic algorithm to compute ε-paths that calls a problem specific oracle providing a step-size certificate. [sent-39, score-0.208]

22 If such a certificate exists, then the algorithm computes a path of complexity in O(1/√ε). [sent-40, score-0.446]

23 Finally, we demonstrate the implementation of the oracle for standard SVMs and a matrix completion problem. [sent-41, score-0.167]

24 This results in the first algorithms for both problems that compute ε-paths of complexity in O(1/√ε). [sent-42, score-0.071]

25 Previously, no approximation path algorithm was known for standard SVMs, only a heuristic [12] and an approximation algorithm for square loss SVMs [6] with complexity in O(1/ε). [sent-43, score-0.498]

26 The best approximation path algorithm for matrix completion also has complexity in O(1/ε). [sent-44, score-0.554]

27 To our knowledge, the only known approximation path algorithm with complexity in O(1/√ε) is [14] for the LASSO. [sent-45, score-0.472]

28 Upper Bound: Here we show that any problem that fits the problem definition from the introduction for a compact interval I = [a, b] has an ε-path with complexity in O(1/√ε). [sent-46, score-0.155]

29 Let (a, b) be the interior of [a, b] and let g : (a, b) → R be concave, then g is continuous and has a left- and right derivative g− (t) and g+ (t), respectively, at every point t ∈ I (see for example [15]). [sent-47, score-0.067]

30 Note that f (·, x) is concave by assumption and h is concave as the minimum over a family of concave functions. [sent-48, score-0.237]

31 For all t′ < t it holds that h(t′) ≤ f(t′, x*_t) and hence h(t) − h(t′) ≥ f(t, x*_t) − f(t′, x*_t), which implies h_-(t) := lim_{t′↑t} (h(t) − h(t′))/(t − t′) ≥ lim_{t′↑t} (f(t, x*_t) − f(t′, x*_t))/(t − t′) =: f_-(t, x*_t). [sent-52, score-0.151]

32 The inequality f_+(t, x*_t) ≥ h_+(t) follows analogously, and f_-(t, x*_t) ≥ f_+(t, x*_t) follows after some algebra from the concavity of f(·, x*_t) and the definition of the derivatives (see [15]). [sent-53, score-0.182]

33 Let T_k = { t ∈ (t_{k−1}, b] : ε(t, x*_{t_{k−1}}) := f(t, x*_{t_{k−1}}) − h(t) = ε }, and let t_k = min T_k for all integral k > 0 such that T_k ≠ ∅. [sent-56, score-0.544]

34 Assume the claim holds for n − 1 and let a = s_1 + · · · + s_{n−1} and b = s_1^{−1} + · · · + s_{n−1}^{−1}. [sent-71, score-0.101]

35 The rectangle with side lengths a·s_n^{−1} and b·s_n has circumference 2(a·s_n^{−1} + b·s_n) and area a·s_n^{−1} · b·s_n = ab. [sent-78, score-0.527]

36 Since the square minimizes the circumference for a given area, we have 2(a·s_n^{−1} + b·s_n) ≥ 4√(ab). [sent-79, score-0.231]

37 The claim for n now follows from (a + s_n)(b + s_n^{−1}) = ab + a·s_n^{−1} + b·s_n + 1 ≥ ab + 2√(ab) + 1 = (√(ab) + 1)² ≥ ((n − 1) + 1)² = n². [sent-80, score-0.612]
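
The induction step above appears to establish the classical bound (s_1 + · · · + s_n)(s_1^{−1} + · · · + s_n^{−1}) ≥ n² for positive s_1, . . . , s_n; a quick numerical sanity check of this reading (our own, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    n = int(rng.integers(1, 20))
    s = rng.uniform(0.01, 100.0, size=n)             # arbitrary positive values s_1, ..., s_n
    assert s.sum() * (1.0 / s).sum() >= n**2 - 1e-9  # (sum s_i) * (sum 1/s_i) >= n^2
print("inequality holds on all random instances")
```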

38 Define δ_k = t_{k+1} − t_k and ∆_k = h_-(t_k) − h_-(t_{k+1}). [sent-88, score-0.51]

39 Thus, there exists s_k > 0 such that δ_k ≥ ε·s_k and ∆_k ≥ s_k^{−1}. [sent-90, score-0.031]

40 … (∆_1 + · · · + ∆_n) ≤ (b − a)(h_-(a) − h_-(b)), where the last inequality follows from h_-(b) ≤ h_-(t) for t ≤ b (which can be proved from concavity, see again [15]). [sent-106, score-0.084]

41 The claim of the theorem follows since the proof of Lemma 4 shows that the sequence (t_k) is finite and hence the intervals [t_k, t_{k+1}] cover the whole interval [a, b]. [sent-113, score-0.157]

42 Lower Bound: Here we show that there exists a problem that fits the problem description from the introduction whose ε-path complexity is in Ω(1/√ε). [sent-114, score-0.138]

43 This shows that the upper bound from the previous section is tight up to a constant. [sent-115, score-0.115]

44 Let I = [a, b], D = R, and f(t, x) = (1/2)x² − tx, and thus h(t) = min_{x∈R} ((1/2)x² − tx) = (1/2)(x*_t)² − t·x*_t = −(1/2)t², where the last equality follows from the convexity and differentiability of f(t, x) in x, which together imply ∂f(t, x*_t)/∂x = x*_t − t = 0. [sent-116, score-0.219]

45 For ε > 0 and x ∈ R let I_x = { t ∈ [a, b] : ε(t, x) := (1/2)x² − tx + (1/2)t² ≤ ε }, which is an interval since (1/2)x² − tx + (1/2)t² is a quadratic function in t. [sent-117, score-0.289]

46 The length of this interval is 2√(2ε), independent of x. [sent-118, score-0.084]

47 Hence, the ε-path complexity for the problem is at least (b − a)/(2√(2ε)). [sent-119, score-0.071]
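
The interval length 2√(2ε) and the resulting bound can be checked directly; a minimal sketch (our own, with hypothetical helper names) for f(t, x) = x²/2 − tx:

```python
import math

def I_x(x, eps, a, b):
    """Parameter interval on which x is an eps-approximation of f(t, x) = x^2/2 - t*x:
    eps(t, x) = (x - t)^2 / 2 <= eps  iff  |t - x| <= sqrt(2 * eps)."""
    r = math.sqrt(2.0 * eps)
    return max(a, x - r), min(b, x + r)

a, b, eps = 0.0, 10.0, 1e-3
lo, hi = I_x(5.0, eps, a, b)
print(hi - lo)                                            # at most 2*sqrt(2*eps), independent of x
print(math.ceil((b - a) / (2.0 * math.sqrt(2.0 * eps))))  # hence at least this many path elements
```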

48 Let us compare this lower bound with the upper bound from the previous section, which gives √((b − a)(h_-(a) − h_-(b))/ε) = (b − a)/√ε; hence the upper bound is tight up to a constant of at most 2√2. [sent-120, score-0.15]

49 Generic Algorithm: So far we have only discussed structural complexity bounds for ε-paths. [sent-122, score-0.11]

50 Now we give a generic algorithm to compute an ε-path of complexity in O(1/√ε). [sent-123, score-0.132]

51 When applying the generic algorithm to a specific problem, a plugin-subroutine PATHPOLYNOMIAL needs to be implemented for that problem. [sent-124, score-0.061]

52 The generic algorithm builds on the simple idea, introduced in [6], to compute an (ε/γ)-approximation (for γ > 1) and to update this approximation along the parameter interval I = [a, b] only when it fails to be an ε-approximation. [sent-125, score-0.171]

53 The plugin-subroutine PATHPOLYNOMIAL provides a bound on the step-size for the algorithm, i.e., [sent-126, score-0.035]

54 a certificate for how long the approximation is valid along the interval I. [sent-128, score-0.026]

55 Step-size certificate and algorithm: We always consider a problem that fits the problem description from the introduction. [sent-131, score-0.036]

56 Let P be the set of all concave polynomials p : I → R of degree at most 2. [sent-133, score-0.13]

57 Note that P contains constant and linear polynomials with second derivative p'' = 0 and quadratic polynomials with constant second derivative p'' < 0. [sent-135, score-0.215]

58 If P_t(x, ε) ≠ ∅, then x is an ε-approximation at parameter value t, because there exists p ∈ P such that ε(t, x) ≤ f(t, x) − p(t) ≤ ε. [sent-136, score-0.031]

59 If there exists p ∈ P_t(x, ε/γ), then x is an ε-approximation for all t′ ∈ [t, b] with t′ ≤ t + ∆_t(p). [sent-144, score-0.031]

60 (ii) If p'' < 0, then g(t′) − p(t′) is a quadratic polynomial in t′ with second derivative −p'' > 0, and the equation g(t′) − p(t′) ≤ ε solves to t′ − t ≤ ∆_t^(2)(p). [sent-151, score-0.07]

61 Let p̂ be the restriction of p onto the interval [t̂, b] and δ_t̂ = t̂ − a; then p̂'' = p'', and thus ρ_t̂(p̂) = ε/(γ·δ_t̂²·|p̂''|) = 1. [sent-155, score-0.084]

62 Assume now that we have an oracle PATHPOLYNOMIAL available that, on input t ∈ (a, b) and ε/γ > 0, returns x ∈ D and p ∈ P_t(x, ε/γ); then the following algorithm GENERICPATH returns an ε-path if it terminates. [sent-158, score-0.139]
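
A hedged sketch of this interaction (our own simplification: the oracle below returns a solution together with a certified step size directly, instead of the certificate polynomial p from which the step size would be computed):

```python
def generic_path(path_polynomial, a, b, eps, gamma=2.0):
    """Sketch of GENERICPATH: starting at t = a, ask the oracle for an (eps/gamma)-approximate
    solution x at the current t together with a step size delta up to which x is guaranteed
    to remain an eps-approximation, record (t, x), and advance t by delta."""
    assert gamma > 1.0
    path, t = [], a
    while t < b:
        x, delta = path_polynomial(t, eps / gamma)  # problem-specific oracle
        path.append((t, x))
        t += max(delta, 1e-12)                      # guard against a zero step size
    return path

# Oracle for the toy problem f(t, x) = x^2/2 - t*x: x = t is exactly optimal, and it stays
# an eps-approximation while (t' - t)^2 / 2 <= eps, i.e. for a step size of sqrt(2 * eps).
eps = 1e-3
toy_oracle = lambda t, eps_over_gamma: (t, (2.0 * eps) ** 0.5)
print(len(generic_path(toy_oracle, a=0.0, b=10.0, eps=eps)))
```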

63 Analysis of the generic algorithm: The running time of the algorithm GENERICPATH is essentially determined by the complexity of the computed path times the cost of the oracle PATHPOLYNOMIAL. [sent-160, score-0.592]

64 In the following we show that the complexity of the computed path is at most O(1/√ε). [sent-161, score-0.446]

65 Thus, φ_c′(x) > 0 for c < 0 and φ_c is monotonously increasing. [sent-167, score-0.236]

66 The continuity for |p''| > 0 follows from the definitions of ∆_t^(2)(p) and ∆_t^(3)(p), and from Observation 8. [sent-172, score-0.074]

67 Since ρ_t(p) > 1/2 for small |p''|, the continuity at |p''| = 0 follows from Observation 10, because lim_{|p''|→0} ∆_t^(2)(p) = lim_{|p''|→0} [φ_{δ_t²γ(1−γ)}(δ_t·(ρ_t(p) + γ − 1/2)) + δ_t·(γ − 1)] = δ_t·(γ − 1) = ∆_t^(1)(p), where we have used ρ_t(p) → ∞ as |p''| → 0. [sent-173, score-0.176]

68 The claim is that ∆_t(p) is monotonously decreasing in |p''|. [sent-177, score-0.354]

69 Since ∆_t is continuous in |p''| by Lemma 11, it is enough to check the monotonicity of ∆_t^(1)(p), ∆_t^(2)(p) and ∆_t^(3)(p). [sent-178, score-0.083]

70 The monotonicity of ∆_t^(1)(p) and ∆_t^(3)(p) follows directly from their definitions. [sent-179, score-0.031]

71 The monotonicity of ∆_t^(2)(p) follows from Observation 10 since we have ∆_t^(2)(p) = φ_{δ_t²γ(1−γ)}(δ_t·(ρ_t(p) + γ − 1/2)) + δ_t·(γ − 1), and thus ∆_t^(2)(p) is monotonously decreasing in |p''| because δ_t²·γ(1 − γ) < 0 and ρ_t(p) is monotonously decreasing in |p''|. [sent-180, score-0.668]

72 Given t ∈ I and p ∈ P, then ∆t (p) is monotonously increasing in δt and hence in t. [sent-182, score-0.285]

73 Since ∆_t(p) is continuous in δ_t by Observation 8, it is enough to check the monotonicity of ∆_t^(1)(p), ∆_t^(2)(p) and ∆_t^(3)(p). [sent-184, score-0.083]

74 The monotonicity of ∆_t^(1)(p) and ∆_t^(3)(p) follows directly from their definitions. [sent-185, score-0.114]

75 It remains to show the monotonicity of ∆_t^(2)(p) for ρ_t(p) ≥ 1/2. [sent-186, score-0.083]

76 The notation is justified because φ_c^{−1} is monotonously decreasing, and for φ_c^{−1}(y) > 0 we have ∆_t^(2)(p) = φ_{c_1}(φ_{c_2}^{−1}(δ_t)) − δ_t = φ_{c_1}(φ_{c_2}^{−1}(δ_t)) − φ_{c_2}(φ_{c_2}^{−1}(δ_t)), with c_1 = 2ε/|p''| and c_2 = c_1·γ^{−1}. [sent-189, score-0.236]

77 Because φ_{c_1} − φ_{c_2} < 0 for c_1 > c_2, both φ_{c_2} and φ_{c_1} − φ_{c_2} are monotonously decreasing in their respective arguments. [sent-191, score-0.277]

78 If there exists p ∈ P and ε̂ > 0 such that |q''| ≤ |p''| for all q that are returned by the oracle PATHPOLYNOMIAL on input t ∈ [a, b] and ε ≤ ε̂. [sent-194, score-0.116]

79 Then Algorithm 1 terminates after at most O(1/√ε) steps, and thus returns an ε-path of complexity in O(1/√ε). [sent-195, score-0.098]

80 Here the first inequality is due to Lemma 12 and the second inequality is due to Lemma 13. [sent-198, score-0.106]

81 Hence, the number of steps in the first call of COMPUTEPATH is upper bounded by (b − t̂)/(min{∆_t̂(p), b − t̂}) + 1. [sent-199, score-0.039]

82 Similarly, the number of steps in the second call of COMPUTEPATH is upper bounded by (t̂ − a)/(min{∆_{a+b−t̂}(p), t̂ − a}) + 1. [sent-200, score-0.039]

83 Hence, there exists ε̂ > 0 such that ρ_t̂(p, ε) < 1/2 and ∆_t̂^(3)(p, ε) ≤ b − t̂ for all ε < ε̂, and thus (b − t̂)/∆_t̂^(3)(p, ε) + 1 = √(γ·|p''|/ε)·(b − t̂) + 1 ∈ O(1/√ε). [sent-203, score-0.031]

84 Applications: Here we demonstrate on two examples that Lagrange duality can be a tool for implementing the oracle PATHPOLYNOMIAL in the generic path algorithm. [sent-205, score-0.521]

85 A support vector machine (SVM) is the following parameterized optimization problem: min_{w∈R^d, b∈R} (1/2)∥w∥² + t·∑_{i=1}^n max{0, 1 − y_i(w^T x_i + b)} =: f(t, w), parameterized in the regularization parameter t ∈ [0, ∞). [sent-212, score-0.263]

86 Algorithm 3 (PATHPOLYNOMIALSVM). Input: t ∈ (0, ∞) and ε > 0. Output: w ∈ R^d and p ∈ P_t(w, ε). Compute a primal solution w ∈ R^d and a dual solution α ∈ R^n such that f(t, w) − d(α) < ε; define p : I → R, t′ ↦ d(α·t′/t); return (w, p). Lemma 15. [sent-222, score-0.2]

87 Let α be the dual solution computed by PATHPOLYNOMIALSVM and p be the polynomial defined in PATHPOLYNOMIALSVM. [sent-226, score-0.094]

88 For t′ > 0 let α′ = α·t′/t; then α′ is feasible for the dual SVM at parameter value t′ since α is feasible for the dual SVM at t. [sent-231, score-0.156]
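
A hedged sketch of this certificate construction for the linear soft-margin SVM above (our own; solve_svm_primal_dual is an assumed helper standing in for any solver that returns a primal (w, b) and a dual α with duality gap below the target, and the standard hinge-loss dual d(α) = Σ_i α_i − ½∥Σ_i α_i y_i x_i∥² is used):

```python
import numpy as np

def path_polynomial_svm(X, y, t, eps_over_gamma, solve_svm_primal_dual):
    """Sketch of PATHPOLYNOMIALSVM: obtain (w, b, alpha) at parameter t with duality gap
    f(t, w) - d(alpha) < eps/gamma, and return w together with the concave quadratic
    certificate p(t') = d(alpha * t' / t)."""
    w, b, alpha = solve_svm_primal_dual(X, y, t, gap=eps_over_gamma)  # assumed helper
    v = (alpha * y) @ X                       # sum_i alpha_i * y_i * x_i
    lin, quad = float(np.sum(alpha)), 0.5 * float(v @ v)
    # d(alpha) = sum_i alpha_i - 0.5 * ||sum_i alpha_i y_i x_i||^2, so scaling alpha by
    # s = t'/t (which keeps it dual feasible at t') gives the concave quadratic below.
    p = lambda t_prime: lin * (t_prime / t) - quad * (t_prime / t) ** 2
    return w, p
```

Since the scaled dual variables remain feasible, weak duality gives p(t′) ≤ h(t′) for all t′ > 0, and the duality gap at t bounds f(t, w) − p(t).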

89 In the kernel case one has the following primal SVM (see [2]): min_{β∈R^m, b} (1/2)β^T K β + t·∑_{i=1}^n max{0, 1 − y_i(∑_j β_j y_j K_{ij} + b)} =: f(t, β). [sent-235, score-0.083]

90 Figure 1 shows the size of paths as a function of ε using double logarithmic plots. [sent-244, score-0.035]

91 A straight line plot with slope −1/2 corresponds to an empirical path complexity that follows the function 1/√ε. [sent-245, score-0.477]
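
A short sketch (our own; the measurements below are placeholders, not numbers from the paper) of how such an empirical slope can be read off: fit a line to (log ε, log path size) and compare the slope to −1/2.

```python
import numpy as np

eps_values = np.array([1e-1, 1e-2, 1e-3, 1e-4])  # accuracies at which paths were computed
path_sizes = np.array([12, 35, 110, 348])        # placeholder path sizes, not from the paper

slope, intercept = np.polyfit(np.log(eps_values), np.log(path_sizes), deg=1)
print(slope)  # a slope close to -0.5 indicates path complexity growing like 1/sqrt(eps)
```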

92 Figure 1(a): path complexity (number of nodes versus epsilon, with a 1/sqrt(epsilon) reference line) for a linear SVM on the data sets a1a, duke, fourclass scale, mushrooms, and w1a. [sent-246, score-0.539]

93 Figure 1(b): path complexity for an SVM with Gaussian kernel exp(−γ∥u − v∥²) for γ = 0.5. [sent-247, score-0.071]

94 Matrix completion: Matrix completion asks for a completion X of an (n × m)-matrix Y that has been observed only at the indices in Ω ⊂ {1, . . . , n} × {1, . . . , m}. [sent-248, score-0.246]

95 The Lagrangian dual of this convex semidefinite program is given as a maximization over Λ ∈ R^{n×m}, subject to the block matrix [tI, Λ; Λ^T, tI] being positive semidefinite and Λ_ij = 0 if (i, j) ∉ Ω. [sent-258, score-0.091]

96 Let f(t, X̂) for X̂ = (X, A, B) be the primal objective function at parameter value t, and d(Λ) be the dual objective function. [sent-261, score-0.146]

97 [6] Joachim Giesen, Martin Jaggi, and Sören Laue. [sent-288, score-0.08]

98 [7] Joachim Giesen, Martin Jaggi, and Sören Laue. [sent-291, score-0.08]

99 [8] Joachim Giesen, Martin Jaggi, and Sören Laue. [sent-294, score-0.08]

100 The entire regularization path for the support vector machine. [sent-301, score-0.456]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('tk', 0.51), ('path', 0.375), ('olynomial', 0.354), ('monotonously', 0.236), ('bsn', 0.148), ('ompute', 0.148), ('certi', 0.136), ('eneric', 0.13), ('pt', 0.121), ('jaggi', 0.12), ('cate', 0.12), ('giesen', 0.104), ('jena', 0.104), ('svm', 0.101), ('lemma', 0.097), ('epsilon', 0.09), ('joachim', 0.09), ('oracle', 0.085), ('interval', 0.084), ('monotonicity', 0.083), ('completion', 0.082), ('ren', 0.08), ('concave', 0.079), ('claim', 0.077), ('tx', 0.077), ('svms', 0.077), ('parameterized', 0.074), ('complexity', 0.071), ('sn', 0.07), ('concavity', 0.067), ('dual', 0.066), ('generic', 0.061), ('martin', 0.059), ('circumference', 0.059), ('concavely', 0.059), ('fourclass', 0.059), ('semide', 0.057), ('germany', 0.057), ('ab', 0.054), ('inequality', 0.053), ('esa', 0.052), ('mushrooms', 0.052), ('lim', 0.051), ('polynomials', 0.051), ('regularization', 0.051), ('analogously', 0.05), ('hence', 0.049), ('primal', 0.049), ('ti', 0.048), ('saharon', 0.048), ('libsvm', 0.047), ('ts', 0.045), ('minw', 0.045), ('rd', 0.044), ('rosset', 0.043), ('continuity', 0.043), ('derivative', 0.043), ('tight', 0.041), ('decreasing', 0.041), ('rn', 0.04), ('upper', 0.039), ('yij', 0.037), ('description', 0.036), ('hastie', 0.035), ('paths', 0.035), ('bound', 0.035), ('min', 0.034), ('duke', 0.033), ('observation', 0.032), ('calls', 0.032), ('follows', 0.031), ('ij', 0.031), ('exists', 0.031), ('bin', 0.031), ('nitions', 0.03), ('support', 0.03), ('trevor', 0.03), ('devise', 0.03), ('regularized', 0.029), ('return', 0.029), ('solution', 0.028), ('quadratic', 0.027), ('returns', 0.027), ('nition', 0.027), ('ji', 0.026), ('proceedings', 0.026), ('lagrangian', 0.026), ('bernd', 0.026), ('deutsche', 0.026), ('ichiro', 0.026), ('plexity', 0.026), ('sulovsk', 0.026), ('wayne', 0.026), ('yue', 0.026), ('approximation', 0.026), ('convex', 0.025), ('julien', 0.024), ('holger', 0.024), ('area', 0.024), ('let', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999952 44 nips-2012-Approximating Concavely Parameterized Optimization Problems

Author: Joachim Giesen, Jens Mueller, Soeren Laue, Sascha Swiercy

Abstract: We consider an abstract class of optimization problems that are parameterized concavely in a single parameter, and show that the solution path along the parameter can always be approximated with accuracy ε > 0 by a set of size O(1/√ε). A lower bound of size Ω(1/√ε) shows that the upper bound is tight up to a constant factor. We also devise an algorithm that calls a step-size oracle and computes an approximate path of size O(1/√ε). Finally, we provide an implementation of the oracle for soft-margin support vector machines, and a parameterized semi-definite program for matrix completion.

2 0.14482729 64 nips-2012-Calibrated Elastic Regularization in Matrix Completion

Author: Tingni Sun, Cun-hui Zhang

Abstract: This paper concerns the problem of matrix completion, which is to estimate a matrix from observations in a small subset of indices. We propose a calibrated spectrum elastic net method with a sum of the nuclear and Frobenius penalties and develop an iterative algorithm to solve the convex minimization problem. The iterative algorithm alternates between imputing the missing entries in the incomplete matrix by the current guess and estimating the matrix by a scaled soft-thresholding singular value decomposition of the imputed matrix until the resulting matrix converges. A calibration step follows to correct the bias caused by the Frobenius penalty. Under proper coherence conditions and for suitable penalties levels, we prove that the proposed estimator achieves an error bound of nearly optimal order and in proportion to the noise level. This provides a unified analysis of the noisy and noiseless matrix completion problems. Simulation results are presented to compare our proposal with previous ones. 1

3 0.090199038 197 nips-2012-Learning with Recursive Perceptual Representations

Author: Oriol Vinyals, Yangqing Jia, Li Deng, Trevor Darrell

Abstract: Linear Support Vector Machines (SVMs) have become very popular in vision as part of state-of-the-art object recognition and other classification tasks but require high dimensional feature spaces for good performance. Deep learning methods can find more compact representations but current methods employ multilayer perceptrons that require solving a difficult, non-convex optimization problem. We propose a deep non-linear classifier whose layers are SVMs and which incorporates random projection as its core stacking element. Our method learns layers of linear SVMs recursively transforming the original data manifold through a random projection of the weak prediction computed from each layer. Our method scales as linear SVMs, does not rely on any kernel computations or nonconvex optimization, and exhibits better generalization ability than kernel-based SVMs. This is especially true when the number of training samples is smaller than the dimensionality of data, a common scenario in many real-world applications. The use of random projections is key to our method, as we show in the experiments section, in which we observe a consistent improvement over previous –often more complicated– methods on several vision and speech benchmarks. 1

4 0.084662564 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions

Author: Paul Vernaza, Drew Bagnell

Abstract: Maximum entropy (MaxEnt) modeling is a popular choice for sequence analysis in applications such as natural language processing, where the sequences are embedded in discrete, tractably-sized spaces. We consider the problem of applying MaxEnt to distributions over paths in continuous spaces of high dimensionality— a problem for which inference is generally intractable. Our main contribution is to show that this intractability can be avoided as long as the constrained features possess a certain kind of low dimensional structure. In this case, we show that the associated partition function is symmetric and that this symmetry can be exploited to compute the partition function efficiently in a compressed form. Empirical results are given showing an application of our method to learning models of high-dimensional human motion capture data. 1

5 0.07979726 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

Author: Bruno Scherrer, Boris Lesner

Abstract: We consider infinite-horizon stationary γ-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. Using Value and Policy Iteration with some error at each iteration, it is well-known that one 2γ can compute stationary policies that are (1−γ)2 -optimal. After arguing that this guarantee is tight, we develop variations of Value and Policy Iteration for com2γ puting non-stationary policies that can be up to 1−γ -optimal, which constitutes a significant improvement in the usual situation when γ is close to 1. Surprisingly, this shows that the problem of “computing near-optimal non-stationary policies” is much simpler than that of “computing near-optimal stationary policies”. 1

6 0.078242749 143 nips-2012-Globally Convergent Dual MAP LP Relaxation Solvers using Fenchel-Young Margins

7 0.078218438 124 nips-2012-Factorial LDA: Sparse Multi-Dimensional Text Models

8 0.076756202 282 nips-2012-Proximal Newton-type methods for convex optimization

9 0.076460615 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

10 0.072659865 69 nips-2012-Clustering Sparse Graphs

11 0.068935573 260 nips-2012-Online Sum-Product Computation Over Trees

12 0.067246184 138 nips-2012-Fully Bayesian inference for neural models with negative-binomial spiking

13 0.060583115 324 nips-2012-Stochastic Gradient Descent with Only One Projection

14 0.060235638 84 nips-2012-Convergence Rate Analysis of MAP Coordinate Minimization Algorithms

15 0.059996154 27 nips-2012-A quasi-Newton proximal splitting method

16 0.059550568 264 nips-2012-Optimal kernel choice for large-scale two-sample tests

17 0.055640403 164 nips-2012-Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation

18 0.055253975 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions

19 0.054063264 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs

20 0.053882524 285 nips-2012-Query Complexity of Derivative-Free Optimization


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.15), (1, -0.002), (2, 0.083), (3, -0.051), (4, 0.058), (5, 0.05), (6, -0.005), (7, 0.013), (8, -0.017), (9, 0.002), (10, 0.053), (11, 0.06), (12, -0.002), (13, 0.003), (14, -0.004), (15, 0.035), (16, -0.025), (17, -0.003), (18, 0.009), (19, -0.033), (20, -0.0), (21, 0.033), (22, -0.019), (23, -0.046), (24, -0.11), (25, 0.056), (26, -0.066), (27, -0.008), (28, -0.05), (29, -0.028), (30, 0.004), (31, 0.009), (32, 0.038), (33, 0.034), (34, 0.01), (35, -0.059), (36, -0.016), (37, 0.028), (38, 0.026), (39, 0.116), (40, 0.025), (41, 0.069), (42, 0.005), (43, 0.148), (44, 0.035), (45, -0.001), (46, -0.057), (47, -0.075), (48, -0.085), (49, -0.075)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93129426 44 nips-2012-Approximating Concavely Parameterized Optimization Problems

Author: Joachim Giesen, Jens Mueller, Soeren Laue, Sascha Swiercy

Abstract: We consider an abstract class of optimization problems that are parameterized concavely in a single parameter, and show that the solution path along the √ parameter can always be approximated with accuracy ε > 0 by a set of size O(1/ ε). A √ lower bound of size Ω(1/ ε) shows that the upper bound is tight up to a constant factor. We also devise an algorithm that calls a step-size oracle and computes an √ approximate path of size O(1/ ε). Finally, we provide an implementation of the oracle for soft-margin support vector machines, and a parameterized semi-definite program for matrix completion. 1

2 0.68338221 64 nips-2012-Calibrated Elastic Regularization in Matrix Completion

Author: Tingni Sun, Cun-hui Zhang

Abstract: This paper concerns the problem of matrix completion, which is to estimate a matrix from observations in a small subset of indices. We propose a calibrated spectrum elastic net method with a sum of the nuclear and Frobenius penalties and develop an iterative algorithm to solve the convex minimization problem. The iterative algorithm alternates between imputing the missing entries in the incomplete matrix by the current guess and estimating the matrix by a scaled soft-thresholding singular value decomposition of the imputed matrix until the resulting matrix converges. A calibration step follows to correct the bias caused by the Frobenius penalty. Under proper coherence conditions and for suitable penalties levels, we prove that the proposed estimator achieves an error bound of nearly optimal order and in proportion to the noise level. This provides a unified analysis of the noisy and noiseless matrix completion problems. Simulation results are presented to compare our proposal with previous ones. 1

3 0.56487894 102 nips-2012-Distributed Non-Stochastic Experts

Author: Varun Kanade, Zhenming Liu, Bozidar Radunovic

Abstract: We consider the online distributed non-stochastic experts problem, where the distributed system consists of one coordinator node that is connected to k sites, and the sites are required to communicate with each other via the coordinator. At each time-step t, one of the k site nodes has to pick an expert from the set {1, . . . , n}, and the same site receives information about payoffs of all experts for that round. The goal of the distributed system is to minimize regret at time horizon T , while simultaneously keeping communication to a minimum. The two extreme solutions to this problem are: (i) Full communication: This essentially simulates the nondistributed setting to obtain the optimal O( log(n)T ) regret bound at the cost of T communication. (ii) No communication: Each site runs an independent copy – the regret is O( log(n)kT ) and the communication is 0. This paper shows the √ difficulty of simultaneously achieving regret asymptotically better than kT and communication better than T . We give a novel algorithm that for an oblivious √ adversary achieves a non-trivial trade-off: regret O( k 5(1+ )/6 T ) and communication O(T /k ), for any value of ∈ (0, 1/5). We also consider a variant of the model, where the coordinator picks the expert. In this model, we show that the label-efficient forecaster of Cesa-Bianchi et al. (2005) already gives us strategy that is near optimal in regret vs communication trade-off. 1

4 0.53039479 206 nips-2012-Majorization for CRFs and Latent Likelihoods

Author: Tony Jebara, Anna Choromanska

Abstract: The partition function plays a key role in probabilistic modeling including conditional random fields, graphical models, and maximum likelihood estimation. To optimize partition functions, this article introduces a quadratic variational upper bound. This inequality facilitates majorization methods: optimization of complicated functions through the iterative solution of simpler sub-problems. Such bounds remain efficient to compute even when the partition function involves a graphical model (with small tree-width) or in latent likelihood settings. For large-scale problems, low-rank versions of the bound are provided and outperform LBFGS as well as first-order methods. Several learning applications are shown and reduce to fast and convergent update rules. Experimental results show advantages over state-of-the-art optimization methods. This supplement presents additional details in support of the full article. These include the application of the majorization method to maximum entropy problems. It also contains proofs of the various theorems, in particular, a guarantee that the bound majorizes the partition function. In addition, a proof is provided guaranteeing convergence on (non-latent) maximum conditional likelihood problems. The supplement also contains supporting lemmas that show the bound remains applicable in constrained optimization problems. The supplement then proves the soundness of the junction tree implementation of the bound for graphical models with large n. It also proves the soundness of the low-rank implementation of the bound for problems with large d. Finally, the supplement contains additional experiments and figures to provide further empirical support for the majorization methodology. Supplement for Section 2 Proof of Theorem 1 Rewrite the partition function as a sum over the integer index j = 1, . . . , n under the random ordering π : Ω → {1, . . . , n}. This defines j∑ π(y) and associates h and f with = n hj = h(π −1 (j)) and fj = f (π −1 (j)). Next, write Z(θ) = j=1 αj exp(λ⊤ fj ) by introducing ˜ ˜ λ = θ − θ and αj = hj exp(θ ⊤ fj ). Define the partition function over only the first i components ∑i as Zi (θ) = j=1 αj exp(λ⊤ fj ). When i = 0, a trivial quadratic upper bound holds ( ) Z0 (θ) ≤ z0 exp 1 λ⊤ Σ0 λ + λ⊤ µ0 2 with the parameters z0 → 0+ , µ0 = 0, and Σ0 = z0 I. Next, add one term to the current partition function Z1 (θ) = Z0 (θ) + α1 exp(λ⊤ f1 ). Apply the current bound Z0 (θ) to obtain 1 Z1 (θ) ≤ z0 exp( 2 λ⊤ Σ0 λ + λ⊤ µ0 ) + α1 exp(λ⊤ f1 ). Consider the following change of variables u γ 1/2 −1/2 = Σ0 λ − Σ0 = α1 z0 exp( 1 (f1 2 (f1 − µ0 )) − µ0 )⊤ Σ−1 (f1 − µ0 )) 0 and rewrite the logarithm of the bound as log Z1 (θ) ( ) 1 ≤ log z0 − 1 (f1 − µ0 )⊤ Σ−1 (f1 − µ0 ) + λ⊤ f1 + log exp( 2 ∥u∥2 ) + γ . 0 2 Apply Lemma 1 (cf. [32] p. 100) to the last term to get ( (1 ) ) log Z1 (θ) ≤ log z0 − 1 (f1 − µ0 )⊤ Σ−1 (f1 − µ0 ) + λ⊤ f1 + log exp 2 ∥v∥2 +γ 0 2 ( ) v⊤ (u − v) 1 + + (u − v)⊤ I + Γvv⊤ (u − v) 1+γ exp(− 1 ∥v∥2 ) 2 2 10 where Γ = 1 1 tanh( 2 log(γ exp(− 2 ∥v∥2 ))) . 1 2 log(γ exp(− 2 ∥v∥2 )) The bound in [32] is tight when u = v. To achieve tightness −1/2 ˜ when θ = θ or, equivalently, λ = 0, we choose v = Σ0 (µ0 − f1 ) yielding ( ) Z1 (θ) ≤ z1 exp 1 λ⊤ Σ1 λ + λ⊤ µ1 2 where we have z1 = z0 + α1 z0 α1 = µ0 + f1 z0 + α1 z0 + α1 1 tanh( 2 log(α1 /z0 )) = Σ0 + (µ0 − f1 )(µ0 − f1 )⊤ . 2 log(α1 /z0 ) µ1 Σ1 This rule updates the bound parameters z0 , µ0 , Σ0 to incorporate an extra term in the sum over i in Z(θ). 
The process is iterated n times (replacing 1 with i and 0 with i − 1) to produce an overall bound on all terms. Lemma 1 (See [32] p. 100) ( ( ) ) For all u ∈ Rd , any v ∈ Rd and any γ ≥ 0, the bound log exp 1 ∥u∥2 + γ ≤ 2 ( ( ) ) log exp 1 ∥v∥2 + γ + 2 ( ) v⊤ (u − v) 1 + (u − v)⊤ I + Γvv⊤ (u − v) 1 1 + γ exp(− 2 ∥v∥2 ) 2 holds when the scalar term Γ = 1 tanh( 2 log(γ exp(−∥v∥2 /2))) . 2 log(γ exp(−∥v∥2 /2)) Equality is achieved when u = v. Proof of Lemma 1 The proof is provided in [32]. Supplement for Section 3 Maximum entropy problem We show here that partition functions arise naturally in maximum ∑ p(y) entropy estimation or minimum relative entropy RE(p∥h) = y p(y) log h(y) estimation. Consider the following problem: ∑ ∑ min RE(p∥h) s.t. p(y)f (y) = 0, p(y)g(y) ≥ 0. p y y d′ Here, assume that f : Ω → R and g : Ω → R are arbitrary (non-constant) vector-valued functions ( ) over the sample space. The solution distribution p(y) = h(y) exp θ ⊤ f (y) + ϑ⊤ g(y) /Z(θ, ϑ) is recovered by the dual optimization ∑ ( ) arg θ, ϑ = max − log h(y) exp θ ⊤ f (y) + ϑ⊤ g(y) d ϑ≥0,θ y ′ where θ ∈ Rd and ϑ ∈ Rd . These are obtained by minimizing Z(θ, ϑ) or equivalently by maximizing its negative logarithm. Algorithm 1 permits variational maximization of the dual via the quadratic program min 1 (β ϑ≥0,θ 2 ˜ ˜ − β)⊤ Σ(β − β) + β ⊤ µ ′ where β ⊤ = [θ ⊤ ϑ⊤ ]. Note that any general convex hull of constraints β ∈ Λ ⊆ Rd+d could be imposed without loss of generality. Proof of Theorem 2 We begin by proving a lemma that will be useful later. Lemma 2 If κΨ ≽ Φ ≻ 0 for Φ, Ψ ∈ Rd×d , then 1 ˜ ˜ ˜ L(θ) = − 2 (θ − θ)⊤ Φ(θ − θ) − (θ − θ)⊤ µ ˜ ˜ ˜ U (θ) = − 1 (θ − θ)⊤ Ψ(θ − θ) − (θ − θ)⊤ µ 2 satisfy supθ∈Λ L(θ) ≥ 1 κ ˜ supθ∈Λ U (θ) for any convex Λ ⊆ Rd , θ ∈ Λ, µ ∈ Rd and κ ∈ R+ . 11 Proof of Lemma 2 Define the primal problems of interest as PL = supθ∈Λ L(θ) and PU = supθ∈Λ U (θ). The constraints θ ∈ Λ can be summarized by a set of linear inequalities Aθ ≤ b where A ∈ Rk×d and b ∈ Rk for some (possibly infinite) k ∈ Z. Apply the change of variables ˜ ˜ ˜ ˜ ˜ ˜ z = θ− θ. The constraint A(z+ θ) ≤ b simplifies into Az ≤ b where b = b−Aθ. Since θ ∈ Λ, it 1 ⊤ ˜ ≥ 0. We obtain the equivalent primal problems PL = sup is easy to show that b ˜ − z Φz − Az≤b z⊤ µ and PU = supAz≤b − 1 z⊤ Ψz − z⊤ µ. The corresponding dual problems are ˜ 2 2 ⊤ −1 y⊤AΦ−1A⊤y ˜ µ Φ µ +y⊤AΦ−1µ+y⊤b+ y≥0 2 2 y⊤AΨ−1 A⊤y µ⊤Ψ−1µ ˜ DU = inf +y⊤AΨ−1µ+y⊤b+ . y≥0 2 2 DL = inf ˜ Due to strong duality, PL = DL and PU = DU . Apply the inequalities Φ ≼ κΨ and y⊤ b > 0 as ⊤ −1 y⊤AΨ−1 A⊤y y⊤AΨ−1 µ κ ˜ µΨ µ + + y⊤b + PL ≥ sup − z⊤ Ψz − z⊤ µ = inf y≥0 2 2κ κ 2κ ˜ Az≤b 1 1 ≥ DU = PU . κ κ 1 This proves that PL ≥ κ PU . We will use the above to prove Theorem 2. First, we will upper-bound (in the Loewner ordering sense) the matrices Σj in Algorithm 2. Since ∥fxj (y)∥2 ≤ r for all y ∈ Ωj and since µj in Algorithm 1 is a convex combination of fxj (y), the outer-product terms in the update for Σj satisfy (fxj (y) − µ)(fxj (y) − µ)⊤ ≼ 4r2 I. Thus, Σj ≼ F(α1 , . . . , αn )4r2 I holds where α 1 F(α1 , . . . , αn ) = i n ∑ tanh( 2 log( ∑i−1 αk )) k=1 αi 2 log( ∑i−1 α ) i=2 k k=1 using the definition of α1 , . . . , αn in the proof of Theorem 1. The formula for F starts at i = 2 since z0 → 0+ . Assume permutation π is sampled uniformly at random. The expected value of F is then α π(i) 1 n 1 ∑ ∑ tanh( 2 log( ∑i−1 απ(k) )) k=1 . Eπ [F(α1 , . . . , αn )] = απ(i) n! π i=2 ) 2 log( ∑i−1 k=1 απ(k) We claim that the expectation is maximized when all αi = 1 or any positive constant. 
Also, F is invariant under uniform scaling of its arguments. Write the expected value of F as E for short. ∂E ∂E ∂E Next, consider ∂αl at the setting αi = 1, ∀i. Due to the expectation over π, we have ∂αl = ∂αo for any l, o. Therefore, the gradient vector is constant when all αi = 1. Since F(α1 , . . . , αn ) is invariant to scaling, the gradient vector must therefore be the all zeros vector. Thus, the point ∂ ∂E when all αi = 1 is an extremum or a saddle. Next, consider ∂αo ∂αl for any l, o. At the setting 2 ∂ ∂E αi = 1, ∂ E = −c(n) and, ∂αo ∂αl = c(n)/(n − 1) for some non-negative constant function ∂α2 l c(n). Thus, the αi = 1 extremum is locally concave and is a maximum. This establishes that Eπ [F(α1 , . . . , αn )] ≤ Eπ [F(1, . . . , 1)] and yields the Loewner bound ) ( n−1 ∑ tanh(log(i)/2) 2 I = ωI. Σj ≼ 2r log(i) i=1 Apply this bound to each Σj in the lower bound on J(θ) and also note a corresponding upper bound ∑ ˜ ˜ ˜ J(θ) ≥ J(θ)−tω+tλ ∥θ − θ∥2− (θ − θ)⊤(µj −fxj (yj )) 2 j ∑ ˜ ˜ ˜ J(θ) ≤ J(θ)−tλ ∥θ − θ∥2− (θ − θ)⊤(µj −fxj (yj )) 2 j 12 ˜ which follows from Jensen’s inequality. Define the current θ at time τ as θτ and denote by Lτ (θ) the above lower bound and by Uτ (θ) the above upper bound at time τ . Clearly, Lτ (θ) ≤ J(θ) ≤ Uτ (θ) with equality when θ = θτ . Algorithm 2 maximizes J(θ) after initializing at θ0 and performing an update by maximizing a lower bound based on Σj . Since Lτ (θ) replaces the definition of Σj with ωI ≽ Σj , Lτ (θ) is a looser bound than the one used by Algorithm 2. Thus, performing θτ +1 = arg maxθ∈Λ Lτ (θ) makes less progress than a step of Algorithm 1. Consider computing the slower update at each iteration τ and returning θτ +1 = arg maxθ∈Λ Lτ (θ). Setting Φ = (tω +tλ)I, Ψ = tλI and κ = ω+λ allows us to apply Lemma 2 as follows λ sup Lτ (θ) − Lτ (θτ ) = θ∈Λ 1 sup Uτ (θ) − Uτ (θτ ). κ θ∈Λ Since Lτ (θτ ) = J(θτ ) = Uτ (θτ ), J(θτ +1 ) ≥ supθ∈Λ Lτ (θ) and supθ∈Λ Uτ (θ) ≥ J(θ ∗ ), we obtain ( ) 1 J(θτ +1 ) − J(θ ∗ ) ≥ 1− (J(θτ ) − J(θ ∗ )) . κ Iterate the above inequality starting at t = 0 to obtain ( )τ 1 ∗ J(θτ ) − J(θ ) ≥ 1− (J(θ0 ) − J(θ ∗ )) . κ ( ) 1 τ κ A solution within a multiplicative factor of ϵ implies that ϵ = 1 − κ or log(1/ϵ) = τ log κ−1 . ⌉ ⌈ Inserting the definition for κ shows that the number of iterations τ is at most log(1/ϵ) or κ log κ−1 ⌉ ⌈ log(1/ϵ) log(1+λ/ω) . Inserting the definition for ω gives the bound. Y12,0 Y11,1 Y21,1 Y31,1 ··· 1,1 Ym1,1 Figure 3: Junction tree of depth 2. Algorithm 5 SmallJunctionTree ˜ Input Parameters θ and h(u), f (u) ∀u ∈ Y12,0 and zi , Σi , µi ∀i = 1, . . . , m1,1 + Initialize z → 0 , µ = 0, Σ = zI For each configuration u ∈ Y12,0 { ∏m1,1 ∑m1,1 ∏m1,1 ˜ ˜ ˜ α = h(u)( ∑ zi exp(−θ ⊤ µi )) exp(θ ⊤ (f (u) + i=1 µi )) = h(u) exp(θ ⊤ f (u)) i=1 zi i=1 m1,1 l = f (u) + i=1 µi − µ 1 ∑m1,1 tanh( 2 log(α/z)) ⊤ Σ + = i=1 Σi + ll 2 log(α/z) α µ + = z+α l z += α } Output z, µ, Σ Supplement for Section 5 Proof of correctness for Algorithm 3 Consider a simple junction tree of depth 2 shown on Figure 3. The notation Yca,b refers to the cth tree node located at tree level a (first level is considered as the one with∑ leaves) whose parent is the bth from the higher tree level (the root has no parent so b = 0). tree ∑ Let Y a1 ,b1 refer to the sum over all configurations of variables in Yca1 ,b1 and Y a1 ,b1 \Y a2 ,b2 1 c1 c1 c2 refers to the sum over all configurations of variables that are in Yca1 ,b1 but not in Yca2 ,b2 . Let ma,b 1 2 denote the number of children of the bth node located at tree level a + 1. 
For short-hand, use ψ(Y ) = h(Y ) exp(θ ⊤ f (Y )). The partition function can be expressed as: 13 Y13,0 Y12,1 ··· Y11,1 Y21,1 ··· Y22,1 1,1 Ym1,1 ··· Y11,2 Y21,2 1,2 Ym1,2 2,1 Ym2,1 1,m2,1 Y1 ··· 1,m2,1 Y2 1,m 2,1 Ym1,m2,1 Figure 4: Junction tree of depth 3. Z(θ) = ∑ 2,0 u∈Y1 ≤ ∑ 2,0 u∈Y1 = ∑  ψ(u) ∏ m1,1 i=1 [ ∏ m1,1 ψ(u) i=1 [    ∑ ψ(v) 2,0 v∈Yi1,1 \Y1 ) 1( ˜ ˜ ˜ zi exp( θ − θ)⊤ Σi (θ − θ) + (θ − θ)⊤ µi 2 ∏ ( m1,1 ⊤ h(u) exp(θ f (u)) 2,0 u∈Y1 zi exp i=1 ] 1 ˜ ˜ ˜ (θ − θ)⊤ Σi (θ − θ) + (θ − θ)⊤ µi 2 )] ∑ where the upper-bound is obtained by applying Theorem 1 to each of the terms v∈Y 1,1 \Y 2,0 ψ(v). 1 i By simply rearranging terms we get: ) ( ( [ (m1,1 )) m1,1 ∏ ∑ ∑ ˜ zi exp(−θ ⊤ µi ) exp θ ⊤ f (u) + µi h(u) Z(θ) ≤ 2,0 u∈Y1 i=1 ( exp 1 ˜ (θ − θ)⊤ 2 (m1,1 ∑ ) Σi i=1 ˜ (θ − θ) )] . i=1 One ( can prove that this ) expression can be upper-bounded by 1 ˆ ⊤ Σ(θ − θ) + (θ − θ)⊤ µ where z, Σ and µ can be computed using Algoˆ ˆ z exp 2 (θ − θ) rithm 5 (a simplification of Algorithm 3). We will call this result Lemma A. The proof is similar to the proof of Theorem 1 so is not repeated here. Consider enlarging the tree to a depth 3 as shown on Figure 4. The partition function is now      m2,1 m1,i ∑  ∏ ∑ ∏ ∑   Z(θ) = ψ(w) . ψ(u)  ψ(v)  3,0 u∈Y1 i=1 3,0 v∈Yi2,1 \Y1 j=1 w∈Yj1,i \Yi2,1 ( )) ∏m1,i (∑ ∑ term By Lemma A we can upper bound each v∈Y 2,1 \Y 3,0 ψ(v) j=1 w∈Yj1,i \Yi2,1 ψ(w) 1 i ( ) ˆ ˆ ˆ by the expression zi exp 1 (θ − θ)⊤ Σi (θ − θ) + (θ − θ)⊤ µi . This yields 2 [ )] ( m2,1 ∑ ∏ 1 ˜ ˜ ˜ Z(θ) ≤ ψ(u) (θ − θ)⊤ Σi (θ − θ) + (θ − θ)⊤ µi . zi exp 2 3,0 i=1 u∈Y1 2,1 2,1 2,1 This process can be viewed as collapsing the sub-trees S1 , S2 , . . ., Sm2,1 to super-nodes that are represented by bound parameters, zi , Σi and µi , i = {1, 2, · · · , m2,1 }, where the sub-trees are 14 defined as: 2,1 S1 = 1,1 {Y12,1 , Y11,1 , Y21,1 , Y31,1 , . . . , Ym1,1 } 2,1 S2 = 1,2 {Y22,1 , Y11,2 , Y21,2 , Y31,2 , . . . , Ym1,2 } = 2,1 {Ym2,1 , Y1 . . . 2,1 Sm2,1 1,m2,1 1,m2,1 , Y2 1,m2,1 , Y3 1,m2,1 , . . . , Ym1,m2,1 }. Notice that the obtained expression can be further upper bounded using again Lemma A (induction) ( ) ˆ ˆ ˆ yielding a bound of the form: z exp 1 (θ − θ)⊤ Σ(θ − θ) + (θ − θ)⊤ µ . 2 Finally, for a general tree, follow the same steps described above, starting from leaves and collapsing nodes to super-nodes, each represented by bound parameters. This procedure effectively yields Algorithm 3 for the junction tree under consideration. Supplement for Section 6 Proof of correctness for Algorithm 4 We begin by proving a lemma that will be useful later. Lemma 3 For all x ∈ Rd and for all l ∈ Rd , 2  d d 2 ∑ ∑ l(i)  . x(i)2 l(i)2 ≥  x(i) √∑ d l(j)2 i=1 i=1 j=1 Proof of Lemma 3 By Jensen’s inequality, 2  ( d )2 d d ∑ x(i)l(i)2 ∑ ∑ x(i)l(i)2  . √∑ x(i)2 ∑d ⇐⇒ x(i)2 l(i)2 ≥  ≥ ∑d d l(j)2 l(j)2 l(j)2 j=1 j=1 i=1 i=1 i=1 i=1 d ∑ l(i)2 j=1 Now we prove the correctness of Algorithm 4. At the ith iteration, the algorithm stores Σi using ⊤ a low-rank representation Vi Si Vi + Di where Vi ∈ Rk×d is orthonormal, Si ∈ Rk×k positive d×d semi-definite and Di ∈ R is non-negative diagonal. The diagonal terms D are initialized to tλI where λ is the regularization term. To mimic Algorithm 1 we must increment the Σ matrix by a rank one update of the form Σi = Σi−1 + ri r⊤ . By projecting ri onto each eigenvector in V, we i ∑k ⊤ can decompose it as ri = j=1 Vi−1 (j, ·)ri Vi−1 (j, ·)⊤ + g = Vi−1 Vi−1 ri + g where g is the remaining residue. 
Thus the update rule can be rewritten as: Σi ⊤ ⊤ ⊤ = Σi−1 + ri r⊤ = Vi−1 Si−1 Vi−1 + Di−1 + (Vi−1 Vi−1 ri + g)(Vi−1 Vi−1 ri + g)⊤ i ′ ′ ′ ⊤ ⊤ ⊤ = Vi−1 (Si−1 + Vi−1 ri r⊤ Vi−1 )Vi−1 + Di−1 + gg⊤ = Vi−1 Si−1 Vi−1 + gg⊤ + Di−1 i ′ where we define Vi−1 = Qi−1 Vi−1 and defined Qi−1 in terms of the singular value decomposition, ′ ′ ⊤ Q⊤ Si−1 Qi−1 = svd(Si−1 + Vi−1 ri r⊤ Vi−1 ). Note that Si−1 is diagonal and nonnegative by i−1 i construction. The current formula for Σi shows that we have a rank (k + 1) system (plus diagonal term) which needs to be converted back to a rank k system (plus diagonal term) which we denote by ′ Σi . We have two options as follows. Case 1) Remove g from Σi to obtain ′ ′ ′ ′ ⊤ Σi = Vi−1 Si−1 Vi−1 + Di−1 = Σi − gg⊤ = Σi − cvv⊤ 1 where c = ∥g∥2 and v = ∥g∥ g. th Case 2) Remove the m (smallest) eigenvalue in S′ and its corresponding eigenvector: i−1 ′ Σi ′ ′ ′ ′ ′ ′ ⊤ = Vi−1 Si−1 Vi−1 + Di−1 + gg⊤ − S (m, m)V (m, ·)⊤ V (m, ·) = Σi − cvv⊤ ′ ′ where c = S (m, m) and v = V(m, ·) . 15 ′ Clearly, both cases can be written as an update of the form Σi = Σi + cvv⊤ where c ≥ 0 and v⊤ v = 1. We choose the case with smaller c value to minimize the change as we drop from a system of order (k + 1) to order k. Discarding the smallest singular value and its corresponding eigenvector would violate the bound. Instead, consider absorbing this term into the diagonal component to ′′ ′ preserve the bound. Formally, we look for a diagonal matrix F such that Σi = Σi + F which also ′′ maintains x⊤ Σi x ≥ x⊤ Σi x for all x ∈ Rd . Thus, we want to satisfy: ( d )2 d ∑ ∑ ⊤ ′′ ⊤ ⊤ ⊤ ⊤ x Σi x ≥ x Σi x ⇐⇒ x cvv x ≤ x Fx ⇐⇒ c x(i)v(i) ≤ x(i)2 F(i) i=1 i=1 where, for ease of notation, we take F(i) = F(i, i). ′ where w = v⊤ 1. Consider the case where v ≥ 0 though we will soon get rid of (∑ )2 ∑d d this assumption. We need an F such that i=1 x(i)2 F(i) ≥ c i=1 x(i)v(i) . Equivalently, we (∑ ) ∑d ′ 2 ′ d need i=1 x(i)2 F(i) ≥ . Define F(i) = F(i) for all i = 1, . . . , d. So, we need 2 i=1 x(i)v(i) cw cw2 (∑ ) ∑d ′ ′ ′ 2 d an F such that i=1 x(i)2 F(i) ≥ . Using Lemma 3 it is easy to show that we i=1 x(i)v(i) Define v = 1 wv ′ ′ ′ may choose F (i) = v(i) . Thus, we obtain F(i) = cw2 F(i) = cwv(i). Therefore, for all x ∈ Rd , ∑d all v ≥ 0, and for F(i) = cv(i) j=1 v(j) we have d ∑ ( x(i) F(i) ≥ c 2 i=1 d ∑ )2 x(i)v(i) . (3) i=1 To generalize the inequality to hold for all vectors v ∈ Rd with potentially negative entries, it is ∑d sufficient to set F(i) = c|v(i)| j=1 |v(j)|. To verify this, consider flipping the sign of any v(i). The left side of the Inequality 3 does not change. For the right side of this inequality, flipping the sign of v(i) is equivalent to flipping the sign of x(i) and not changing the sign of v(i). However, in this case the inequality holds as shown before (it holds for any x ∈ Rd ). Thus for all x, v ∈ Rd and ∑d for F(i) = c|v(i)| j=1 |v(j)|, Inequality 3 holds. Supplement for Section 7 Small scale experiments In additional small-scale experiments, we compared Algorithm 2 with steepest descent (SD), conjugate gradient (CG), BFGS and Newton-Raphson. Small-scale problems may be interesting in real-time learning settings, for example, when a website has to learn from a user’s uploaded labeled data in a split second to perform real-time retrieval. We considered logistic regression on five UCI data sets where missing values were handled via mean-imputation. 
A range of regularization settings λ ∈ {100 , 102 , 104 } was explored and all algorithms were initialized from the same ten random start-points. Table 3 shows the average number of seconds each algorithm needed to achieve the same solution that BFGS converged to (all algorithms achieve the same solution due to concavity). The bound is the fastest algorithm as indicated in bold. data|λ BFGS SD CG Newton Bound a|100 1.90 1.74 0.78 0.31 0.01 a|102 0.89 0.92 0.83 0.25 0.01 a|104 2.45 1.60 0.85 0.22 0.01 b|100 3.14 2.18 0.70 0.43 0.07 b|102 2.00 6.17 0.67 0.37 0.04 b|104 1.60 5.83 0.83 0.35 0.04 c|100 4.09 1.92 0.65 0.39 0.07 c|102 1.03 0.64 0.64 0.34 0.02 c|104 1.90 0.56 0.72 0.32 0.02 d|100 5.62 12.04 1.36 0.92 0.16 d|102 2.88 1.27 1.21 0.63 0.09 d|104 3.28 1.94 1.23 0.60 0.07 e|100 2.63 2.68 0.48 0.35 0.03 e|102 2.01 2.49 0.55 0.26 0.03 e|104 1.49 1.54 0.43 0.20 0.03 Table 3: Convergence time in seconds under various regularization levels for a) Bupa (t = 345, dim = 7), b) Wine (t = 178, dim = 14), c) Heart (t = 187, dim = 23), d) Ion (t = 351, dim = 34), and e) Hepatitis (t = 155, dim = 20) data sets. Influence of rank k on bound performance in large scale experiments We also examined the influence of k on bound performance and compared it with LBFGS, SD and CG. Several choices 16 of k were explored. Table 4 shows results for the SRBCT data-set. In general, the bound performs best but slows down for superfluously large values of k. Steepest descent and conjugate gradient are slow yet obviously do not vary with k. Note that each iteration takes less time with smaller k for the bound. However, we are reporting overall runtime which is also a function of the number of iterations. Therefore, total runtime (a function of both) may not always decrease/increase with k. k LBFGS SD CG Bound 1 1.37 8.80 4.39 0.56 2 1.32 8.80 4.39 0.56 4 1.39 8.80 4.39 0.67 8 1.35 8.80 4.39 0.96 16 1.46 8.80 4.39 1.34 32 1.40 8.80 4.39 2.11 64 1.54 8.80 4.39 4.57 Table 4: Convergence time in seconds as a function of k. Additional latent-likelihood results For completeness, Figure 5 depicts two additional data-sets to complement Figure 2. Similarly, Table 5 shows all experimental settings explored in order to provide the summary Table 2 in the main article. bupa wine −19 0 −5 −log(J(θ)) −log(J(θ)) −20 −21 −22 Bound Newton BFGS Conjugate gradient Steepest descent −15 −23 −24 −5 −10 0 5 log(Time) [sec] 10 −20 −4 −2 0 2 4 log(Time) [sec] 6 8 Figure 5: Convergence of test latent log-likelihood for bupa and wine data-sets. Data-set Algorithm BFGS SD CG Newton Bound ion m=1 m=2m=3m=4 -4.96 -5.55 -5.88 -5.79 -11.80 -9.92 -5.56 -8.59 -5.47 -5.81 -5.57 -5.22 -5.95 -5.95 -5.95 -5.95 -6.08 -4.84 -4.18 -5.17 Data-set Algorithm BFGS SD CG Newton Bound bupa m=1 m=2 m=3 m=4 -22.07 -21.78 -21.92 -21.87 -21.76 -21.74 -21.73 -21.83 -21.81 -21.81 -21.81 -21.81 -21.85 -21.85 -21.85 -21.85 -21.85 -19.95 -20.01 -19.97 wine m=1m=2m=3m=4 -0.90 -0.91 -1.79 -1.35 -1.61 -1.60 -1.37 -1.63 -0.51 -0.78 -0.95 -0.51 -0.71 -0.71 -0.71 -0.71 -0.51 -0.51 -0.48 -0.51 hepatitis m=1m=2m=3m=4 -4.42 -5.28 -4.95 -4.93 -4.93 -5.14 -5.01 -5.20 -4.84 -4.84 -4.84 -4.84 -5.50 -5.50 -5.50 -4.50 -5.47 -4.40 -4.75 -4.92 SRBCT m=1m=2m=3m=4 -5.99 -6.17 -6.09 -6.06 -5.61 -5.62 -5.62 -5.61 -5.62 -5.49 -5.36 -5.76 -5.54 -5.54 -5.54 -5.54 -5.31 -5.31 -4.90 -0.11 Table 5: Test latent log-likelihood at convergence for different values of m ∈ {1, 2, 3, 4} on ion, bupa, hepatitis, wine and SRBCT data-sets. 17

5 0.52284247 85 nips-2012-Convergence and Energy Landscape for Cheeger Cut Clustering

Author: Xavier Bresson, Thomas Laurent, David Uminsky, James V. Brecht

Abstract: This paper provides both theoretical and algorithmic results for the 1 -relaxation of the Cheeger cut problem. The 2 -relaxation, known as spectral clustering, only loosely relates to the Cheeger cut; however, it is convex and leads to a simple optimization problem. The 1 -relaxation, in contrast, is non-convex but is provably equivalent to the original problem. The 1 -relaxation therefore trades convexity for exactness, yielding improved clustering results at the cost of a more challenging optimization. The first challenge is understanding convergence of algorithms. This paper provides the first complete proof of convergence for algorithms that minimize the 1 -relaxation. The second challenge entails comprehending the 1 energy landscape, i.e. the set of possible points to which an algorithm might converge. We show that 1 -algorithms can get trapped in local minima that are not globally optimal and we provide a classification theorem to interpret these local minima. This classification gives meaning to these suboptimal solutions and helps to explain, in terms of graph structure, when the 1 -relaxation provides the solution of the original Cheeger cut problem. 1

6 0.52224177 258 nips-2012-Online L1-Dictionary Learning with Application to Novel Document Detection

7 0.5152213 320 nips-2012-Spectral Learning of General Weighted Automata via Constrained Matrix Completion

8 0.5043835 301 nips-2012-Scaled Gradients on Grassmann Manifolds for Matrix Completion

9 0.49194461 17 nips-2012-A Scalable CUR Matrix Decomposition Algorithm: Lower Time Complexity and Tighter Bound

10 0.48587203 361 nips-2012-Volume Regularization for Binary Classification

11 0.47544456 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders

12 0.46255457 128 nips-2012-Fast Resampling Weighted v-Statistics

13 0.46207875 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification

14 0.45618746 34 nips-2012-Active Learning of Multi-Index Function Models

15 0.45168874 27 nips-2012-A quasi-Newton proximal splitting method

16 0.44682947 48 nips-2012-Augmented-SVM: Automatic space partitioning for combining multiple non-linear dynamics

17 0.44493169 29 nips-2012-Accelerated Training for Matrix-norm Regularization: A Boosting Approach

18 0.4436152 337 nips-2012-The Lovász ϑ function, SVMs and finding large dense subgraphs

19 0.43916371 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)

20 0.43654051 86 nips-2012-Convex Multi-view Subspace Learning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.023), (38, 0.126), (42, 0.012), (54, 0.022), (74, 0.022), (76, 0.118), (80, 0.069), (92, 0.509)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.87836123 44 nips-2012-Approximating Concavely Parameterized Optimization Problems

Author: Joachim Giesen, Jens Mueller, Soeren Laue, Sascha Swiercy

Abstract: We consider an abstract class of optimization problems that are parameterized concavely in a single parameter, and show that the solution path along the parameter can always be approximated with accuracy ε > 0 by a set of size O(1/√ε). A lower bound of size Ω(1/√ε) shows that the upper bound is tight up to a constant factor. We also devise an algorithm that calls a step-size oracle and computes an approximate path of size O(1/√ε). Finally, we provide an implementation of the oracle for soft-margin support vector machines, and a parameterized semi-definite program for matrix completion.

2 0.87319952 87 nips-2012-Convolutional-Recursive Deep Learning for 3D Object Classification

Author: Richard Socher, Brody Huval, Bharath Bath, Christopher D. Manning, Andrew Y. Ng

Abstract: Recent advances in 3D sensing technologies make it possible to easily record color and depth images which together can improve object recognition. Most current methods rely on very well-designed features for this new 3D modality. We introduce a model based on a combination of convolutional and recursive neural networks (CNN and RNN) for learning features and classifying RGB-D images. The CNN layer learns low-level translationally invariant features which are then given as inputs to multiple, fixed-tree RNNs in order to compose higher order features. RNNs can be seen as combining convolution and pooling into one efficient, hierarchical operation. Our main result is that even RNNs with random weights compose powerful features. Our model obtains state of the art performance on a standard RGB-D object dataset while being more accurate and faster during training and testing than comparable architectures such as two-layer CNNs. 1

3 0.85960913 42 nips-2012-Angular Quantization-based Binary Codes for Fast Similarity Search

Author: Yunchao Gong, Sanjiv Kumar, Vishal Verma, Svetlana Lazebnik

Abstract: This paper focuses on the problem of learning binary codes for efficient retrieval of high-dimensional non-negative data that arises in vision and text applications where counts or frequencies are used as features. The similarity of such feature vectors is commonly measured using the cosine of the angle between them. In this work, we introduce a novel angular quantization-based binary coding (AQBC) technique for such data and analyze its properties. In its most basic form, AQBC works by mapping each non-negative feature vector onto the vertex of the binary hypercube with which it has the smallest angle. Even though the number of vertices (quantization landmarks) in this scheme grows exponentially with data dimensionality d, we propose a method for mapping feature vectors to their smallest-angle binary vertices that scales as O(d log d). Further, we propose a method for learning a linear transformation of the data to minimize the quantization error, and show that it results in improved binary codes. Experiments on image and text datasets show that the proposed AQBC method outperforms the state of the art. 1

4 0.85744083 82 nips-2012-Continuous Relaxations for Discrete Hamiltonian Monte Carlo

Author: Yichuan Zhang, Zoubin Ghahramani, Amos J. Storkey, Charles A. Sutton

Abstract: Continuous relaxations play an important role in discrete optimization, but have not seen much use in approximate probabilistic inference. Here we show that a general form of the Gaussian Integral Trick makes it possible to transform a wide class of discrete variable undirected models into fully continuous systems. The continuous representation allows the use of gradient-based Hamiltonian Monte Carlo for inference, results in new ways of estimating normalization constants (partition functions), and in general opens up a number of new avenues for inference in difficult discrete systems. We demonstrate some of these continuous relaxation inference algorithms on a number of illustrative problems. 1

5 0.82378769 349 nips-2012-Training sparse natural image models with a fast Gibbs sampler of an extended state space

Author: Lucas Theis, Jascha Sohl-dickstein, Matthias Bethge

Abstract: We present a new learning strategy based on an efficient blocked Gibbs sampler for sparse overcomplete linear models. Particular emphasis is placed on statistical image modeling, where overcomplete models have played an important role in discovering sparse representations. Our Gibbs sampler is faster than general purpose sampling schemes while also requiring no tuning as it is free of parameters. Using the Gibbs sampler and a persistent variant of expectation maximization, we are able to extract highly sparse distributions over latent sources from data. When applied to natural images, our algorithm learns source distributions which resemble spike-and-slab distributions. We evaluate the likelihood and quantitatively compare the performance of the overcomplete linear model to its complete counterpart as well as a product of experts model, which represents another overcomplete generalization of the complete linear model. In contrast to previous claims, we find that overcomplete representations lead to significant improvements, but that the overcomplete linear model still underperforms other models. 1

6 0.81811047 110 nips-2012-Efficient Reinforcement Learning for High Dimensional Linear Quadratic Systems

7 0.60376209 101 nips-2012-Discriminatively Trained Sparse Code Gradients for Contour Detection

8 0.59990001 329 nips-2012-Super-Bit Locality-Sensitive Hashing

9 0.58005631 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines

10 0.57273614 235 nips-2012-Natural Images, Gaussian Mixtures and Dead Leaves

11 0.56183362 94 nips-2012-Delay Compensation with Dynamical Synapses

12 0.56057006 65 nips-2012-Cardinality Restricted Boltzmann Machines

13 0.55973792 260 nips-2012-Online Sum-Product Computation Over Trees

14 0.55420983 24 nips-2012-A mechanistic model of early sensory processing based on subtracting sparse representations

15 0.55351865 102 nips-2012-Distributed Non-Stochastic Experts

16 0.55189466 163 nips-2012-Isotropic Hashing

17 0.55015951 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video

18 0.5437969 4 nips-2012-A Better Way to Pretrain Deep Boltzmann Machines

19 0.54304528 197 nips-2012-Learning with Recursive Perceptual Representations

20 0.54005259 148 nips-2012-Hamming Distance Metric Learning