nips nips2012 nips2012-85 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Xavier Bresson, Thomas Laurent, David Uminsky, James V. Brecht
Abstract: This paper provides both theoretical and algorithmic results for the ℓ1-relaxation of the Cheeger cut problem. The ℓ2-relaxation, known as spectral clustering, only loosely relates to the Cheeger cut; however, it is convex and leads to a simple optimization problem. The ℓ1-relaxation, in contrast, is non-convex but is provably equivalent to the original problem. The ℓ1-relaxation therefore trades convexity for exactness, yielding improved clustering results at the cost of a more challenging optimization. The first challenge is understanding convergence of algorithms. This paper provides the first complete proof of convergence for algorithms that minimize the ℓ1-relaxation. The second challenge entails comprehending the ℓ1 energy landscape, i.e. the set of possible points to which an algorithm might converge. We show that ℓ1-algorithms can get trapped in local minima that are not globally optimal and we provide a classification theorem to interpret these local minima. This classification gives meaning to these suboptimal solutions and helps to explain, in terms of graph structure, when the ℓ1-relaxation provides the solution of the original Cheeger cut problem.
Reference: text
sentIndex sentText sentNum sentScore
1 David Uminsky, University of San Francisco, San Francisco, CA 94117, duminsky@usfca. [sent-5, score-0.178]
2 The second challenge entails comprehending the ℓ1 energy landscape, i.e. the set of possible points to which an algorithm might converge. [sent-15, score-0.221]
3 We show that ℓ1-algorithms can get trapped in local minima that are not globally optimal and we provide a classification theorem to interpret these local minima. [sent-18, score-0.598]
4 Typically these ℓ1-algorithms provide excellent unsupervised clustering results and improve upon the standard ℓ2 (spectral clustering) method [10, 13] in terms of both Cheeger energy and classification error. [sent-26, score-0.267]
5 We provide convergence results for both algorithms and also analyze the energy landscape. [sent-31, score-0.303]
6 This understanding of the energy landscape provides intuition for when and how the algorithms get trapped in local minima. [sent-33, score-0.508]
7 In contrast, we cannot guarantee this holds for the IPM without further assumptions on the energy landscape. [sent-38, score-0.253]
8 The simpler mathematical structure of the SD algorithm also provides better control of the energy descent. [sent-39, score-0.244]
9 For set-valued maps, closedness provides the correct notion of continuity [8]. [sent-42, score-0.299]
10 This property alone is not enough to obtain convergence, and the closedness property proves the most challenging ingredient to establish for the algorithms we consider. [sent-44, score-0.362]
11 In Section 3 we show that if the iterates of either algorithm approach a neighborhood of a strict local minimum then both algorithms will converge to this minimum. [sent-46, score-0.57]
12 When the energy is non-degenerate, section 4 extends this local convergence to global convergence toward critical points for the SD algorithm by using the additional structure afforded by the gradient flow. [sent-48, score-0.78]
13 In Section 5 we develop an understanding of the energy landscape of the continuous relaxation problem. [sent-49, score-0.4]
14 For non-convex problems an understanding of local minima is crucial. [sent-50, score-0.421]
15 We therefore provide a complete classification of the local minima of (2) in terms of the combinatorial local minima of (1) by means of an explicit formula. [sent-51, score-0.937]
16 As a consequence of this formula, the problem of finding local minima of the combinatorial problem is equivalent to finding local minima of the continuous relaxation. [sent-52, score-0.999]
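To make the two objects in this correspondence concrete, here is a minimal sketch. It assumes the standard forms used in this line of work — C(S) = Cut(S, S^c)/min(|S|, |S^c|) for the combinatorial problem (1), and E(f) = T(f)/B(f) with the graph total variation T(f) = ½ Σ_ij w_ij |f_i − f_j| and the balance term B(f) = ||f − med(f)||_1 for the relaxation (2) — and the toy graph and helper names are purely illustrative, not taken from the paper.

import numpy as np

def cheeger_cut(S, W):
    """Combinatorial energy C(S) = Cut(S, S^c) / min(|S|, |S^c|)."""
    n = W.shape[0]
    mask = np.zeros(n, dtype=bool)
    mask[list(S)] = True
    cut = W[mask][:, ~mask].sum()
    return cut / min(mask.sum(), (~mask).sum())

def continuous_energy(f, W):
    """Relaxed energy E(f) = T(f) / B(f)."""
    T = 0.5 * np.sum(W * np.abs(f[:, None] - f[None, :]))   # graph total variation
    B = np.sum(np.abs(f - np.median(f)))                     # median-centered l1 balance
    return T / B

# Two triangles joined by a single weak edge.
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[i, j] = W[j, i] = 1.0
W[2, 3] = W[3, 2] = 0.1

S = {0, 1, 2}
f_S = np.zeros(6)
f_S[list(S)] = 1.0                      # characteristic function of S
print(cheeger_cut(S, W))                # C(S) = 0.1 / 3
print(continuous_energy(f_S, W))        # agrees with C(S) on this example

The two printed values agree here because, on a characteristic function, the total variation reduces to the cut value and the balance term reduces to the size of the smaller side of the cut; this is the exactness phenomenon referred to above.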
17 If T and B were differentiable, a mixed explicit-implicit gradient flow of the energy would take the form (f^{k+1} − f^k)/τ^k = −(∇T(f^{k+1}) − E(f^k) ∇B(f^k))/B(f^k), where {τ^k} denotes a sequence of time steps. [sent-56, score-0.262]
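As a consistency check (our own reading, not a statement lifted from the paper), the nonsmooth proximal step that produces ĥ^k in the SD algorithm below plays the role of exactly this implicit step: with v^k ∈ ∂0 B(f^k) and some w ∈ ∂T(ĥ^k),

\hat{h}^k = \arg\min_{u \in \mathbb{R}^n}\; T(u) + \frac{E(f^k)}{2c}\,\bigl\|u - (f^k + c\,v^k)\bigr\|_2^2
\;\Longrightarrow\;
0 = w + \frac{E(f^k)}{c}\bigl(\hat{h}^k - f^k\bigr) - E(f^k)\,v^k
\;\Longleftrightarrow\;
\frac{\hat{h}^k - f^k}{\tau^k} = -\,\frac{w - E(f^k)\,v^k}{B(f^k)},
\qquad \tau^k = \frac{c\,B(f^k)}{E(f^k)}.

So the constant c in the algorithms acts as a rescaled time step for the flow above, with the subgradients w and v^k replacing the gradients when T and B are not differentiable.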
18 This property allows us to conclude global convergence of the SD algorithm in cases where we cannot conclude convergence for the IPM algorithm. [sent-66, score-0.265]
19 while E(f^k) − E(f^{k+1}) ≥ TOL do
  v^k ∈ ∂0 B(f^k)
  g^k = f^k + c v^k
  ĥ^k = arg min_{u ∈ R^n} T(u) + (E(f^k)/(2c)) ||u − g^k||_2^2
  h^k = ĥ^k − med(ĥ^k) 1
  f^{k+1} = h^k / ||h^k||_2
end while
AIPM : Modified IPM Algorithm [6]. [sent-74, score-0.712]
20 while E(f^k) − E(f^{k+1}) ≥ TOL do
  v^k ∈ ∂0 B(f^k)
  D^k = min_{||u||_2 ≤ 1} T(u) − E(f^k) ⟨u, v^k⟩
  g^k = arg min_{||u||_2 ≤ 1} T(u) − E(f^k) ⟨u, v^k⟩   if D^k < 0
  g^k = f^k   if D^k = 0
  h^k = g^k − med(g^k) 1
  f^{k+1} = h^k / ||h^k||_2
end while
As the successive iterates have zero median, ∂0 B(f^k) is never empty. [sent-76, score-0.488]
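A runnable sketch of the SD outer iteration follows. It is our own illustration: T and B are the graph total variation and median-centered ℓ1 balance assumed earlier, the inner arg-min is approximated by a crude subgradient loop rather than the exact proximal solver analyzed in the paper, and names such as sd_step are hypothetical.

import numpy as np

def T(f, W):
    """Graph total variation: 0.5 * sum_ij w_ij |f_i - f_j|."""
    return 0.5 * np.sum(W * np.abs(f[:, None] - f[None, :]))

def B(f):
    """Balance term ||f - med(f)||_1."""
    return np.sum(np.abs(f - np.median(f)))

def sd_step(f, W, c=1.0, inner_iters=300, lr=5e-3):
    """One outer SD iteration: subgradient of B, proximal-type inner problem,
    then re-center to zero median and re-normalize."""
    E = T(f, W) / B(f)
    v = np.sign(f - np.median(f))            # an element of the subdifferential of B
    g = f + c * v
    u = g.copy()
    for _ in range(inner_iters):             # crude solver for min_u T(u) + (E/2c)||u - g||^2
        grad_T = np.sum(W * np.sign(u[:, None] - u[None, :]), axis=1)
        u -= lr * (grad_T + (E / c) * (u - g))
    h = u - np.median(u)
    return h / np.linalg.norm(h)

# Two triangles joined by a weak edge; the best cut separates the triangles.
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[i, j] = W[j, i] = 1.0
W[2, 3] = W[3, 2] = 0.1

rng = np.random.default_rng(0)
f = rng.standard_normal(6)
f -= np.median(f)
f /= np.linalg.norm(f)
for _ in range(30):
    f = sd_step(f, W)
print("energy E(f) =", T(f, W) / B(f), "sign pattern:", np.sign(f))

On this toy graph the sign pattern of f typically ends up separating the two triangles, which is the partition with the smallest Cheeger ratio.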
21 The notion of a closed map proves useful when analyzing the step ĥ^k ∈ H(f^k) in the SD algorithm. [sent-89, score-0.314]
22 H(f) := arg min_u T(u) + (E(f)/(2c)) ||u − (f + c ∂0 B(f))||_2^2. Currently, we can only show that lemma 1 holds at strict local minima for the analogous step, g^k, of the IPM algorithm. [sent-92, score-0.634]
23 That lemma 1 holds without this further restriction on f ∈ S_0^{n−1} will allow us to demonstrate stronger global convergence results for the SD algorithm. [sent-93, score-0.234]
24 In Section 2.1 we first elucidate the monotonicity and compactness of ASD and AIPM. [sent-97, score-0.232]
25 Section 2.2 demonstrates that a local notion of closedness holds for each algorithm. [sent-99, score-0.419]
26 This form of closedness suffices to show local convergence toward isolated local minima (cf. Section 3). [sent-100, score-0.961]
27 2.1 Monotonicity and Compactness. We provide the monotonicity and compactness results for each algorithm in turn. [sent-105, score-0.232]
28 Lemmas 2 and 3 establish monotonicity and compactness for ASD while Lemmas 4 and 5 establish monotonicity and compactness for AIPM . [sent-106, score-0.464]
29 Let f^0 ∈ S_0 and define a sequence of iterates (g^k, ĥ^k, h^k, f^{k+1}) according to the SD algorithm. [sent-113, score-0.493]
30 Then for any such sequence (7): ||ĥ^k||_2 ≤ ||g^k||_2, 1 ≤ ||g^k||_2 ≤ 1 + c√n, and 0 < ||h^k||_2 ≤ (1 + √n)||ĥ^k||_2. [sent-114, score-0.219]
31 2.2 Closedness Properties. The final ingredient to prove local convergence is some form of closedness. [sent-132, score-0.247]
32 We require closedness of the set-valued maps A at strict local minima of the energy. [sent-133, score-0.801]
33 As the energy (2) is invariant under constant shifts and scalings, the usual notion of a strict local minimum on R^n does not apply. [sent-134, score-0.703]
34 We must therefore remove the effects of these invariances when referring to a local minimum as strict. [sent-135, score-0.308]
35 With these in hand we introduce the proper definition of a strict local minimum. [sent-137, score-0.281]
36 We say f^∞ is a strict local minimum of the energy if there exists ε > 0 so that f ∈ B_ε(f^∞) and f ≠ f^∞ imply E(f) > E(f^∞). [sent-140, score-0.694]
37 This definition then allows us to formally define closedness at a strict local minimum in Definition 3. [sent-141, score-0.66]
38 For the IPM algorithm this is the only form of closedness we are able to establish. [sent-142, score-0.212]
39 We say A(f) is closed at local minima (CLM) if z^k ∈ A(f^k) and f^k → f^∞ imply z^k → f^∞ whenever f^∞ is a local minimum of the energy. [sent-149, score-0.766]
40 If z^k → f^∞ holds only when f^∞ is a strict local minimum then we say A(f) is closed at strict local minima (CSLM). [sent-150, score-1.078]
41 The CLM property for the SD algorithm, provided by lemma 6, follows as a straightforward consequence of lemma 1. [sent-151, score-0.206]
42 The CSLM property for the IPM algorithm provided by lemma 7 requires the additional hypothesis that the local minimum is strict. [sent-152, score-0.417]
43 3 Local Convergence of ASD and AIPM at Strict Local Minima. Due to the lack of convexity of the energy (2), at best we can only hope to obtain convergence to a local minimum of the energy. [sent-159, score-0.633]
44 An analogue of Lyapunov’s method from differential equations allows us to show that such convergence does occur provided the iterates reach a neighborhood of an isolated local minimum. [sent-160, score-0.473]
45 To apply the lemmas from section 2 we must assume that f^∞ ∈ S_0^{n−1} is a local minimum of the energy. [sent-161, score-0.343]
46 We will assume further that f ∞ is an isolated critical point of the energy according to the following definition. [sent-162, score-0.515]
47 We say that f is a critical point of the energy E(f ) if there exist w ∈ ∂T (f ) and v ∈ ∂0 B(f ) so that 0 = w − E(f )v. [sent-165, score-0.412]
48 If there exists ε > 0 so that f is the only critical point in B_ε(f^∞) we say f is an isolated critical point of the energy. [sent-167, score-0.485]
49 Note that as any local minimum is a critical point of the energy, if f ∞ is an isolated critical point and a local minimum then it is necessarily a strict local minimum. [sent-168, score-1.357]
50 We say that A(f ) satisfies the critical point property (CP property) if, given any sequence satisfying f k+1 ∈ A(f k ), all limit points of {f k } are critical points of the energy. [sent-175, score-0.475]
51 For the IPM algorithm it follows from closedness of the minimization step. [sent-177, score-0.24]
52 The proof of local convergence utilizes a version of Lyapunov’s direct method for set-valued maps, and we adapt this technique from the strategy outlined in [8]. [sent-178, score-0.223]
53 We first demonstrate that if any iterate f^k lies in a sufficiently small neighborhood B_γ(f^∞) of the strict local minimum then all subsequent iterates remain in the neighborhood B_ε(f^∞) in which f^∞ is an isolated critical point. [sent-179, score-0.868]
54 By compactness and the CP property, any subsequence of {f^k} must have a further subsequence that converges to the only critical point in B_ε(f^∞), i.e. to f^∞. [sent-180, score-0.338]
55 If f^∞ is a strict local minimum of the energy, then for any ε > 0 there exists a γ > 0 so that if f^0 ∈ B_γ(f^∞) then {f^k} ⊂ B_ε(f^∞). [sent-188, score-0.448]
56 Let f ∞ denote a local minimum that is an isolated critical point of the energy. [sent-192, score-0.602]
57 Note that both algorithms satisfy the hypothesis of theorem 1, and therefore possess identical local convergence properties. [sent-194, score-0.257]
58 If any accumulation point f ∗ of the sequence {f k } is both an isolated critical point of the energy and a local minimum, then the whole sequence f k → f ∗ . [sent-198, score-0.802]
59 In particular, from lemma 3 we know that ||f k+1 − f k ||2 → 0 without any further assumptions regarding the initialization of the algorithm or the energy landscape. [sent-201, score-0.285]
60 Any accumulation point f ∗ of the sequence is a critical point of the energy. [sent-208, score-0.271]
61 Either the sequence converges, or the set of accumulation points forms a continuum in S0. [sent-210, score-0.184]
62 We give (cf. the supplementary material) simple examples to show that a continuum of local or global minima can in fact occur. [sent-214, score-0.528]
63 This degeneracy of a continuum of critical points arises from a lack of uniqueness in the underlying combinatorial problem. [sent-215, score-0.379]
64 By assuming additional structure in the energy landscape we can generalize the local convergence result, theorem 1, to yield global convergence of both algorithms. [sent-217, score-0.714]
65 If the energy has only countably many critical points in S_0^{n−1} then {f^k} converges. [sent-223, score-0.392]
66 Suppose all critical points of the energy are isolated in S_0^{n−1} and are either local maxima or local minima. [sent-226, score-0.823]
67 While at first glance corollary 3 provides hope that global convergence holds for the IPM algorithm, our simple examples in the supplementary material demonstrate that even benign graphs with well-defined cuts have critical points of the energy that are neither local maxima nor local minima. [sent-228, score-1.025]
68 Specifically, we provide an explicit formula that gives an exact correspondence between the global minimizers of the continuous problem and the global minimizers of the combinatorial problem. [sent-230, score-0.401]
69 This extends previous work [12, 11, 9] on the relationship between the global minima of (1) and (2). [sent-231, score-0.313]
70 We also completely classify the local minima of the continuous problem by introducing a notion of local minimum for the combinatorial problem. [sent-232, score-0.91]
71 Any local minimum of the combinatorial problem then determines a local minimum of the continuous problem by means of an explicit formula, and vice-versa. [sent-233, score-0.869]
72 Theorem 4 provides this formula, which also gives a sharp condition for when a global minimum of the continuous problem is two-valued (binary), three-valued (trinary), or k-valued in the general case. [sent-234, score-0.281]
73 This provides an understanding of the energy landscape, which is essential due to the lack of convexity present in the continuous problem. [sent-235, score-0.324]
74 Most importantly, we can classify the types of local minima encountered and when they form a continuum. [sent-236, score-0.398]
75 Loosely speaking, a set S is compatible with S1 ⊊ S2 ⊊ ··· ⊊ Sk whenever the cut defined by the pair (S, S^c) neither intersects nor crosses any of the cuts (Si, Si^c). [sent-242, score-0.289]
76 A vertex set S is compatible with an increasing sequence S1 ⊊ S2 ⊊ ··· ⊊ Sk if S ⊆ S1, Sk ⊆ S, or Si ⊆ S ⊆ Si+1 for some 1 ≤ i ≤ k − 1. The concept of compatible cuts then allows us to introduce our notion of a local minimum of the combinatorial problem, i.e. a k-local minimum. [sent-245, score-0.742]
77 An increasing collection of nontrivial sets S1 ⊊ S2 ⊊ ··· ⊊ Sk is called a k-local minimum of the combinatorial problem if C(S1) = C(S2) = ··· = C(Sk) ≤ C(S) for all S compatible with S1 ⊊ S2 ⊊ ··· ⊊ Sk. [sent-249, score-0.339]
78 Pursuing the previous analogy, a collection of cuts (S1, S1^c), ··· , (Sk, Sk^c) forms a k-local minimum of the combinatorial problem precisely when they do not intersect, have the same energy and all other non-intersecting cuts (S, S^c) have higher energy. [sent-250, score-0.68]
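The two definitions above can be checked by brute force on small graphs; the sketch below is our own illustration (cheeger_cut is the ratio Cut(S, S^c)/min(|S|, |S^c|) assumed earlier, and the exhaustive search over subsets is only feasible on toy examples).

import itertools
import numpy as np

def cheeger_cut(S, W):
    n = W.shape[0]
    mask = np.zeros(n, dtype=bool)
    mask[list(S)] = True
    return W[mask][:, ~mask].sum() / min(mask.sum(), (~mask).sum())

def compatible(S, chain):
    """S is compatible with S1 ⊊ ... ⊊ Sk if S ⊆ S1, Sk ⊆ S,
    or S_i ⊆ S ⊆ S_{i+1} for some i."""
    if S <= chain[0] or chain[-1] <= S:
        return True
    return any(chain[i] <= S <= chain[i + 1] for i in range(len(chain) - 1))

def is_k_local_minimum(chain, W):
    """All sets in the chain share the same energy, and no nontrivial
    compatible set S has strictly lower energy."""
    vals = [cheeger_cut(S, W) for S in chain]
    if max(vals) - min(vals) > 1e-12:
        return False
    n = W.shape[0]
    for r in range(1, n):
        for S in map(set, itertools.combinations(range(n), r)):
            if compatible(S, chain) and cheeger_cut(S, W) < vals[0] - 1e-12:
                return False
    return True

# Two triangles joined by a weak edge: the triangle cut is a 1-local minimum.
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[i, j] = W[j, i] = 1.0
W[2, 3] = W[3, 2] = 0.1
print(is_k_local_minimum([{0, 1, 2}], W))   # True on this example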
79 A cut (S1, S1^c) defines a 1-local minimum if and only if it has lower energy than all cuts that do not intersect it. [sent-252, score-0.626]
80 As a consequence, if a 1-local minimum is not a global minimum then the cut (S1, S1^c) necessarily intersects all of the cuts defined by the global minimizers. [sent-253, score-0.675]
81 This is a fundamental characteristic of local minima: they are never “parallel” to global minima. [sent-254, score-0.229]
82 For the continuous problem, combinatorial k-local minima naturally correspond to vertex functions f ∈ Rn that take (k + 1) distinct values. [sent-255, score-0.464]
83 We therefore define the concept of a (k + 1)-valued local minimum of the continuous problem. [sent-256, score-0.366]
84 We call a vertex function f ∈ Rn a (k + 1)-valued local minimum of the continuous problem if f is a local minimum of E and if its range contains exactly k + 1 distinct values. [sent-258, score-0.711]
85 The continuous problem has a (k + 1)-valued local minimum if and only if the combinatorial problem has a k-local minimum. [sent-261, score-0.478]
86 For example, if the continuous problem has a trinary local minimum in the usual sense then the combinatorial problem must have a 2-local minimum in the sense of definition 7. [sent-262, score-0.591]
87 As the cuts (S1, S1^c) and (S2, S2^c) defining a 2-local minimum do not intersect, a 2-local minimum separates the vertices of the graph into three disjoint domains. [sent-263, score-0.456]
88 Given k functions f1 , · · · , fk , their strict convex hull is the set sch{f1 , · · · , fk } = {θ1 f1 + · · · + θk fk : θi > 0 for 1 ≤ i ≤ k and θ1 + · · · + θk = 1} (10) Theorem 4 (Explicit Correspondence of Local Minima). [sent-271, score-0.253]
89 Suppose S1 ⊊ S2 ⊊ ··· ⊊ Sk is a k-local minimum of the combinatorial problem and let f ∈ sch{f_{S1}, ··· , f_{Sk}}. [sent-273, score-0.279]
90 Then any function of the form g = αf + β1 defines a (k + 1)-valued local minimum of the continuous problem with E(g) = C(S1). [sent-274, score-0.366]
91 Suppose that f is a (k + 1)-valued local minimum and let c1 > c2 > · · · > ck+1 denote its range. [sent-276, score-0.308]
92 Then the increasing collection of sets S1 ⊊ ··· ⊊ Sk given by S1 = Ω1, S2 = Ω1 ∪ Ω2, ··· , Sk = Ω1 ∪ ··· ∪ Ωk is a k-local minimum of the combinatorial problem with C(Si) = E(f). [sent-278, score-0.279]
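Both directions of this correspondence are easy to write down explicitly; the sketch below is our own illustration (f_S denotes the characteristic function of S, the level sets Ω_i = {f = c_i} are taken in decreasing order of value as in the statement, and the mixing weights θ_i are any positive numbers summing to one).

import numpy as np

def characteristic(S, n):
    f = np.zeros(n)
    f[list(S)] = 1.0
    return f

def local_min_from_chain(chain, n, theta=None):
    """Combinatorial -> continuous: a point of sch{f_S1, ..., f_Sk}.
    Any g = alpha * f + beta * 1 with alpha > 0 is then (k+1)-valued."""
    k = len(chain)
    if theta is None:
        theta = np.full(k, 1.0 / k)          # any theta_i > 0 summing to 1 works
    return sum(t * characteristic(S, n) for t, S in zip(theta, chain))

def chain_from_local_min(f):
    """Continuous -> combinatorial: S_i = Ω_1 ∪ ... ∪ Ω_i, where the Ω_i = {f = c_i}
    are the level sets for the distinct values c_1 > ... > c_{k+1}."""
    values = sorted(set(f.tolist()), reverse=True)
    chain, acc = [], set()
    for c in values[:-1]:                    # the last level set is the complement
        acc = acc | set(map(int, np.flatnonzero(f == c)))  # exact comparison: hand-built toy data
        chain.append(set(acc))
    return chain

# Round trip on a chain S1 ⊊ S2 over 6 vertices.
chain = [{0, 1}, {0, 1, 2}]
f = local_min_from_chain(chain, n=6)
print(f)                        # a three-valued vertex function
print(chain_from_local_min(f))  # recovers [{0, 1}, {0, 1, 2}]

Because any choice of θ (and of the positive rescaling and shift g = αf + β1) works, a k-local minimum with k ≥ 2 yields a whole continuum of continuous local minima, which is exactly the degeneracy discussed below.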
93 If a set S1 is a 1-local min then the strict convex hull (10) of its characteristic function reduces to the single binary function f_{S1}. [sent-280, score-0.201]
94 Thus every 1-local minimum generates exactly one local minimum of the continuous problem in S_0^{n−1}, and this local minimum is binary. [sent-281, score-0.841]
95 On the other hand, if k ≥ 2 then every k-local minimum of the combinatorial problem generates a continuum (in S_0^{n−1}) of non-binary local minima of the continuous problem. [sent-282, score-0.697]
96 Thus, the hypotheses of theorem 1, corollary 2 or corollary 3 can hold only if no such higher order k-local minima exist. [sent-283, score-0.431]
97 7 As a final consequence, we summarize the fact that theorem 4 implies that the continuous relaxation of the Cheeger cut problem is exact. [sent-285, score-0.201]
98 It shows the mean Cheeger energy value (2), the mean classification error (% of misclassified data) and the mean computational time of both algorithms over 10 experiments, with the same random initialization used for both algorithms in each experiment. [sent-297, score-0.221]
99 Analogously for the SD Algorithm, we only need to lower the energy sufficiently before proceeding to the next iteration of the algorithm. [sent-350, score-0.221]
100 It proves convenient to stop the minimization when a weaker form of the energy inequality (6) holds, such as E(f) ≥ E(ĥ) + θ E(f) ||ĥ − f||_2^2 / (2c B(ĥ)) for some constant 0 < θ < 1. [sent-351, score-0.285]
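For instance, the relaxed test can be implemented directly; the snippet below is a sketch under the same assumptions on T, B and the inequality as written above (the function name and the default θ are ours, and the exact form of (6) should be taken from the paper rather than from this snippet).

import numpy as np

def sufficient_descent(f, h_hat, T, B, c, theta=0.5):
    """True once E(f) >= E(h_hat) + theta * E(f) * ||h_hat - f||_2^2 / (2 c B(h_hat))."""
    E_f = T(f) / B(f)
    E_h = T(h_hat) / B(h_hat)
    return E_f >= E_h + theta * E_f * np.sum((h_hat - f) ** 2) / (2.0 * c * B(h_hat))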
wordName wordTfidf (topN-words)
[('ipm', 0.314), ('aipm', 0.309), ('asd', 0.309), ('sd', 0.294), ('minima', 0.257), ('energy', 0.221), ('closedness', 0.212), ('cheeger', 0.207), ('hk', 0.178), ('minimum', 0.167), ('critical', 0.144), ('local', 0.141), ('strict', 0.14), ('cslm', 0.135), ('isolated', 0.128), ('monotonicity', 0.12), ('compactness', 0.112), ('combinatorial', 0.112), ('cut', 0.109), ('landscape', 0.098), ('med', 0.097), ('clm', 0.096), ('iterates', 0.096), ('sk', 0.092), ('cuts', 0.09), ('convergence', 0.082), ('rn', 0.078), ('continuum', 0.074), ('corollary', 0.07), ('lemma', 0.064), ('compatible', 0.06), ('trinary', 0.058), ('continuous', 0.058), ('hein', 0.056), ('global', 0.056), ('hler', 0.051), ('fs', 0.05), ('clustering', 0.046), ('property', 0.045), ('median', 0.045), ('coil', 0.044), ('nition', 0.043), ('accumulation', 0.042), ('sequence', 0.041), ('lyapunov', 0.04), ('formula', 0.04), ('intersect', 0.039), ('bresson', 0.039), ('tol', 0.039), ('mnist', 0.038), ('vertex', 0.037), ('cp', 0.036), ('successive', 0.036), ('proves', 0.036), ('closed', 0.035), ('usps', 0.035), ('kong', 0.035), ('lemmas', 0.035), ('theorem', 0.034), ('notion', 0.034), ('hong', 0.034), ('consequence', 0.033), ('spectral', 0.032), ('characteristic', 0.032), ('graph', 0.032), ('holds', 0.032), ('map', 0.031), ('monotonic', 0.03), ('intersects', 0.03), ('subsequence', 0.03), ('hull', 0.029), ('explicit', 0.029), ('continuity', 0.028), ('xi', 0.028), ('minimization', 0.028), ('fk', 0.028), ('si', 0.027), ('afforded', 0.027), ('angeles', 0.027), ('los', 0.027), ('points', 0.027), ('neighborhood', 0.026), ('maps', 0.026), ('dk', 0.026), ('say', 0.025), ('valued', 0.025), ('trapped', 0.025), ('minimizers', 0.025), ('ingredient', 0.024), ('laurent', 0.024), ('loosely', 0.024), ('fi', 0.023), ('understanding', 0.023), ('mathematical', 0.023), ('francisco', 0.022), ('analogously', 0.022), ('steepest', 0.022), ('lack', 0.022), ('point', 0.022), ('maxima', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 85 nips-2012-Convergence and Energy Landscape for Cheeger Cut Clustering
Author: Xavier Bresson, Thomas Laurent, David Uminsky, James V. Brecht
Abstract: This paper provides both theoretical and algorithmic results for the ℓ1-relaxation of the Cheeger cut problem. The ℓ2-relaxation, known as spectral clustering, only loosely relates to the Cheeger cut; however, it is convex and leads to a simple optimization problem. The ℓ1-relaxation, in contrast, is non-convex but is provably equivalent to the original problem. The ℓ1-relaxation therefore trades convexity for exactness, yielding improved clustering results at the cost of a more challenging optimization. The first challenge is understanding convergence of algorithms. This paper provides the first complete proof of convergence for algorithms that minimize the ℓ1-relaxation. The second challenge entails comprehending the ℓ1 energy landscape, i.e. the set of possible points to which an algorithm might converge. We show that ℓ1-algorithms can get trapped in local minima that are not globally optimal and we provide a classification theorem to interpret these local minima. This classification gives meaning to these suboptimal solutions and helps to explain, in terms of graph structure, when the ℓ1-relaxation provides the solution of the original Cheeger cut problem.
2 0.094015136 291 nips-2012-Reducing statistical time-series problems to binary classification
Author: Daniil Ryabko, Jeremie Mary
Abstract: We show how binary classification methods developed to work on i.i.d. data can be used for solving statistical problems that are seemingly unrelated to classification and concern highly-dependent time series. Specifically, the problems of time-series clustering, homogeneity testing and the three-sample problem are addressed. The algorithms that we construct for solving these problems are based on a new metric between time-series distributions, which can be evaluated using binary classification methods. Universal consistency of the proposed algorithms is proven under most general assumptions. The theoretical results are illustrated with experiments on synthetic and real-world data. 1
3 0.091987975 27 nips-2012-A quasi-Newton proximal splitting method
Author: Stephen Becker, Jalal Fadili
Abstract: A new result in convex analysis on the calculation of proximity operators in certain scaled norms is derived. We describe efficient implementations of the proximity calculation for a useful class of functions; the implementations exploit the piece-wise linear nature of the dual problem. The second part of the paper applies the previous result to acceleration of convex minimization problems, and leads to an elegant quasi-Newton method. The optimization method compares favorably against state-of-the-art alternatives. The algorithm has extensive applications including signal processing, sparse recovery and machine learning and classification. 1
4 0.073183961 96 nips-2012-Density Propagation and Improved Bounds on the Partition Function
Author: Stefano Ermon, Ashish Sabharwal, Bart Selman, Carla P. Gomes
Abstract: Given a probabilistic graphical model, its density of states is a distribution that, for any likelihood value, gives the number of configurations with that probability. We introduce a novel message-passing algorithm called Density Propagation (DP) for estimating this distribution. We show that DP is exact for tree-structured graphical models and is, in general, a strict generalization of both sum-product and max-product algorithms. Further, we use density of states and tree decomposition to introduce a new family of upper and lower bounds on the partition function. For any tree decomposition, the new upper bound based on finer-grained density of state information is provably at least as tight as previously known bounds based on convexity of the log-partition function, and strictly stronger if a general condition holds. We conclude with empirical evidence of improvement over convex relaxations and mean-field based bounds. 1
5 0.073155887 286 nips-2012-Random Utility Theory for Social Choice
Author: Hossein Azari, David Parks, Lirong Xia
Abstract: Random utility theory models an agent’s preferences on alternatives by drawing a real-valued score on each alternative (typically independently) from a parameterized distribution, and then ranking the alternatives according to scores. A special case that has received significant attention is the Plackett-Luce model, for which fast inference methods for maximum likelihood estimators are available. This paper develops conditions on general random utility models that enable fast inference within a Bayesian framework through MC-EM, providing concave loglikelihood functions and bounded sets of global maxima solutions. Results on both real-world and simulated data provide support for the scalability of the approach and capability for model selection among general random utility models including Plackett-Luce. 1
6 0.070853122 34 nips-2012-Active Learning of Multi-Index Function Models
7 0.067396812 282 nips-2012-Proximal Newton-type methods for convex optimization
8 0.065856591 184 nips-2012-Learning Probability Measures with respect to Optimal Transport Metrics
9 0.063810043 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders
10 0.063102707 213 nips-2012-Minimization of Continuous Bethe Approximations: A Positive Variation
11 0.061536655 206 nips-2012-Majorization for CRFs and Latent Likelihoods
12 0.059900966 335 nips-2012-The Bethe Partition Function of Log-supermodular Graphical Models
13 0.059610702 181 nips-2012-Learning Multiple Tasks using Shared Hypotheses
14 0.058379561 111 nips-2012-Efficient Sampling for Bipartite Matching Problems
15 0.057579149 16 nips-2012-A Polynomial-time Form of Robust Regression
16 0.056694552 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models
17 0.0564982 179 nips-2012-Learning Manifolds with K-Means and K-Flats
18 0.056336939 69 nips-2012-Clustering Sparse Graphs
19 0.055312052 135 nips-2012-Forging The Graphs: A Low Rank and Positive Semidefinite Graph Learning Approach
20 0.055237785 70 nips-2012-Clustering by Nonnegative Matrix Factorization Using Graph Random Walk
topicId topicWeight
[(0, 0.146), (1, 0.034), (2, 0.069), (3, -0.076), (4, 0.04), (5, 0.042), (6, -0.012), (7, -0.031), (8, 0.04), (9, 0.037), (10, -0.007), (11, -0.018), (12, -0.01), (13, -0.027), (14, -0.003), (15, -0.039), (16, -0.03), (17, 0.004), (18, -0.029), (19, 0.034), (20, -0.053), (21, -0.029), (22, -0.041), (23, 0.042), (24, 0.009), (25, 0.047), (26, -0.031), (27, -0.001), (28, 0.024), (29, 0.003), (30, 0.069), (31, 0.023), (32, -0.073), (33, 0.002), (34, -0.031), (35, -0.048), (36, -0.09), (37, -0.014), (38, 0.051), (39, 0.079), (40, 0.013), (41, -0.026), (42, -0.007), (43, 0.028), (44, 0.055), (45, 0.02), (46, -0.004), (47, -0.08), (48, -0.068), (49, -0.037)]
simIndex simValue paperId paperTitle
same-paper 1 0.92653221 85 nips-2012-Convergence and Energy Landscape for Cheeger Cut Clustering
Author: Xavier Bresson, Thomas Laurent, David Uminsky, James V. Brecht
Abstract: This paper provides both theoretical and algorithmic results for the ℓ1-relaxation of the Cheeger cut problem. The ℓ2-relaxation, known as spectral clustering, only loosely relates to the Cheeger cut; however, it is convex and leads to a simple optimization problem. The ℓ1-relaxation, in contrast, is non-convex but is provably equivalent to the original problem. The ℓ1-relaxation therefore trades convexity for exactness, yielding improved clustering results at the cost of a more challenging optimization. The first challenge is understanding convergence of algorithms. This paper provides the first complete proof of convergence for algorithms that minimize the ℓ1-relaxation. The second challenge entails comprehending the ℓ1 energy landscape, i.e. the set of possible points to which an algorithm might converge. We show that ℓ1-algorithms can get trapped in local minima that are not globally optimal and we provide a classification theorem to interpret these local minima. This classification gives meaning to these suboptimal solutions and helps to explain, in terms of graph structure, when the ℓ1-relaxation provides the solution of the original Cheeger cut problem.
2 0.62295252 27 nips-2012-A quasi-Newton proximal splitting method
Author: Stephen Becker, Jalal Fadili
Abstract: A new result in convex analysis on the calculation of proximity operators in certain scaled norms is derived. We describe efficient implementations of the proximity calculation for a useful class of functions; the implementations exploit the piece-wise linear nature of the dual problem. The second part of the paper applies the previous result to acceleration of convex minimization problems, and leads to an elegant quasi-Newton method. The optimization method compares favorably against state-of-the-art alternatives. The algorithm has extensive applications including signal processing, sparse recovery and machine learning and classification. 1
3 0.61677873 335 nips-2012-The Bethe Partition Function of Log-supermodular Graphical Models
Author: Nicholas Ruozzi
Abstract: Sudderth, Wainwright, and Willsky conjectured that the Bethe approximation corresponding to any fixed point of the belief propagation algorithm over an attractive, pairwise binary graphical model provides a lower bound on the true partition function. In this work, we resolve this conjecture in the affirmative by demonstrating that, for any graphical model with binary variables whose potential functions (not necessarily pairwise) are all log-supermodular, the Bethe partition function always lower bounds the true partition function. The proof of this result follows from a new variant of the “four functions” theorem that may be of independent interest. 1
4 0.61175591 291 nips-2012-Reducing statistical time-series problems to binary classification
Author: Daniil Ryabko, Jeremie Mary
Abstract: We show how binary classification methods developed to work on i.i.d. data can be used for solving statistical problems that are seemingly unrelated to classification and concern highly-dependent time series. Specifically, the problems of time-series clustering, homogeneity testing and the three-sample problem are addressed. The algorithms that we construct for solving these problems are based on a new metric between time-series distributions, which can be evaluated using binary classification methods. Universal consistency of the proposed algorithms is proven under most general assumptions. The theoretical results are illustrated with experiments on synthetic and real-world data. 1
5 0.60217226 44 nips-2012-Approximating Concavely Parameterized Optimization Problems
Author: Joachim Giesen, Jens Mueller, Soeren Laue, Sascha Swiercy
Abstract: We consider an abstract class of optimization problems that are parameterized concavely in a single parameter, and show that the solution path along the parameter can always be approximated with accuracy ε > 0 by a set of size O(1/√ε). A lower bound of size Ω(1/√ε) shows that the upper bound is tight up to a constant factor. We also devise an algorithm that calls a step-size oracle and computes an approximate path of size O(1/√ε). Finally, we provide an implementation of the oracle for soft-margin support vector machines, and a parameterized semi-definite program for matrix completion. 1
6 0.56949925 135 nips-2012-Forging The Graphs: A Low Rank and Positive Semidefinite Graph Learning Approach
7 0.55906934 320 nips-2012-Spectral Learning of General Weighted Automata via Constrained Matrix Completion
8 0.55115634 206 nips-2012-Majorization for CRFs and Latent Likelihoods
9 0.5444659 34 nips-2012-Active Learning of Multi-Index Function Models
10 0.54246175 133 nips-2012-Finding Exemplars from Pairwise Dissimilarities via Simultaneous Sparse Recovery
11 0.53128254 196 nips-2012-Learning with Partially Absorbing Random Walks
12 0.52969551 36 nips-2012-Adaptive Stratified Sampling for Monte-Carlo integration of Differentiable functions
13 0.52139825 184 nips-2012-Learning Probability Measures with respect to Optimal Transport Metrics
14 0.52104998 286 nips-2012-Random Utility Theory for Social Choice
15 0.52017897 17 nips-2012-A Scalable CUR Matrix Decomposition Algorithm: Lower Time Complexity and Tighter Bound
16 0.51899374 96 nips-2012-Density Propagation and Improved Bounds on the Partition Function
17 0.5152896 277 nips-2012-Probabilistic Low-Rank Subspace Clustering
18 0.5123961 230 nips-2012-Multiple Choice Learning: Learning to Produce Multiple Structured Outputs
19 0.51208031 340 nips-2012-The representer theorem for Hilbert spaces: a necessary and sufficient condition
20 0.50755227 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders
topicId topicWeight
[(0, 0.045), (17, 0.011), (21, 0.015), (38, 0.17), (42, 0.038), (54, 0.033), (55, 0.024), (65, 0.223), (74, 0.061), (76, 0.159), (80, 0.067), (92, 0.055)]
simIndex simValue paperId paperTitle
same-paper 1 0.84415001 85 nips-2012-Convergence and Energy Landscape for Cheeger Cut Clustering
Author: Xavier Bresson, Thomas Laurent, David Uminsky, James V. Brecht
Abstract: This paper provides both theoretical and algorithmic results for the ℓ1-relaxation of the Cheeger cut problem. The ℓ2-relaxation, known as spectral clustering, only loosely relates to the Cheeger cut; however, it is convex and leads to a simple optimization problem. The ℓ1-relaxation, in contrast, is non-convex but is provably equivalent to the original problem. The ℓ1-relaxation therefore trades convexity for exactness, yielding improved clustering results at the cost of a more challenging optimization. The first challenge is understanding convergence of algorithms. This paper provides the first complete proof of convergence for algorithms that minimize the ℓ1-relaxation. The second challenge entails comprehending the ℓ1 energy landscape, i.e. the set of possible points to which an algorithm might converge. We show that ℓ1-algorithms can get trapped in local minima that are not globally optimal and we provide a classification theorem to interpret these local minima. This classification gives meaning to these suboptimal solutions and helps to explain, in terms of graph structure, when the ℓ1-relaxation provides the solution of the original Cheeger cut problem.
2 0.83657509 336 nips-2012-The Coloured Noise Expansion and Parameter Estimation of Diffusion Processes
Author: Simon Lyons, Amos J. Storkey, Simo Särkkä
Abstract: Stochastic differential equations (SDE) are a natural tool for modelling systems that are inherently noisy or contain uncertainties that can be modelled as stochastic processes. Crucial to the process of using SDE to build mathematical models is the ability to estimate parameters of those models from observed data. Over the past few decades, significant progress has been made on this problem, but we are still far from having a definitive solution. We describe a novel method of approximating a diffusion process that we show to be useful in Markov chain Monte-Carlo (MCMC) inference algorithms. We take the ‘white’ noise that drives a diffusion process and decompose it into two terms. The first is a ‘coloured noise’ term that can be deterministically controlled by a set of auxilliary variables. The second term is small and enables us to form a linear Gaussian ‘small noise’ approximation. The decomposition allows us to take a diffusion process of interest and cast it in a form that is amenable to sampling by MCMC methods. We explain why many state-of-the-art inference methods fail on highly nonlinear inference problems, and we demonstrate experimentally that our method performs well in such situations. Our results show that this method is a promising new tool for use in inference and parameter estimation problems. 1
3 0.80629373 127 nips-2012-Fast Bayesian Inference for Non-Conjugate Gaussian Process Regression
Author: Emtiyaz Khan, Shakir Mohamed, Kevin P. Murphy
Abstract: We present a new variational inference algorithm for Gaussian process regression with non-conjugate likelihood functions, with application to a wide array of problems including binary and multi-class classification, and ordinal regression. Our method constructs a concave lower bound that is optimized using an efficient fixed-point updating algorithm. We show that the new algorithm has highly competitive computational complexity, matching that of alternative approximate inference methods. We also prove that the use of concave variational bounds provides stable and guaranteed convergence – a property not available to other approaches. We show empirically for both binary and multi-class classification that our new algorithm converges much faster than existing variational methods, and without any degradation in performance. 1
4 0.76556259 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model
Author: Yi Wu, David P. Wipf
Abstract: Sparse linear (or generalized linear) models combine a standard likelihood function with a sparse prior on the unknown coefficients. These priors can conveniently be expressed as a maximization over zero-mean Gaussians with different variance hyperparameters. Standard MAP estimation (Type I) involves maximizing over both the hyperparameters and coefficients, while an empirical Bayesian alternative (Type II) first marginalizes the coefficients and then maximizes over the hyperparameters, leading to a tractable posterior approximation. The underlying cost functions can be related via a dual-space framework from [22], which allows both the Type I or Type II objectives to be expressed in either coefficient or hyperparmeter space. This perspective is useful because some analyses or extensions are more conducive to development in one space or the other. Herein we consider the estimation of a trade-off parameter balancing sparsity and data fit. As this parameter is effectively a variance, natural estimators exist by assessing the problem in hyperparameter (variance) space, transitioning natural ideas from Type II to solve what is much less intuitive for Type I. In contrast, for analyses of update rules and sparsity properties of local and global solutions, as well as extensions to more general likelihood models, we can leverage coefficient-space techniques developed for Type I and apply them to Type II. For example, this allows us to prove that Type II-inspired techniques can be successful recovering sparse coefficients when unfavorable restricted isometry properties (RIP) lead to failure of popular ℓ1 reconstructions. It also facilitates the analysis of Type II when non-Gaussian likelihood models lead to intractable integrations. 1
5 0.75028163 199 nips-2012-Link Prediction in Graphs with Autoregressive Features
Author: Emile Richard, Stephane Gaiffas, Nicolas Vayatis
Abstract: In the paper, we consider the problem of link prediction in time-evolving graphs. We assume that certain graph features, such as the node degree, follow a vector autoregressive (VAR) model and we propose to use this information to improve the accuracy of prediction. Our strategy involves a joint optimization procedure over the space of adjacency matrices and VAR matrices which takes into account both sparsity and low rank properties of the matrices. Oracle inequalities are derived and illustrate the trade-offs in the choice of smoothing parameters when modeling the joint effect of sparsity and low rank property. The estimate is computed efficiently using proximal methods through a generalized forward-backward agorithm. 1
6 0.74960905 178 nips-2012-Learning Label Trees for Probabilistic Modelling of Implicit Feedback
7 0.7493403 335 nips-2012-The Bethe Partition Function of Log-supermodular Graphical Models
8 0.74932665 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance
9 0.74861944 6 nips-2012-A Convex Formulation for Learning Scale-Free Networks via Submodular Relaxation
10 0.74812907 149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity
11 0.74809504 38 nips-2012-Algorithms for Learning Markov Field Policies
12 0.74797845 69 nips-2012-Clustering Sparse Graphs
13 0.74767548 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration
14 0.74553239 148 nips-2012-Hamming Distance Metric Learning
15 0.74549282 179 nips-2012-Learning Manifolds with K-Means and K-Flats
16 0.74542135 304 nips-2012-Selecting Diverse Features via Spectral Regularization
17 0.74526006 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models
18 0.74525493 236 nips-2012-Near-Optimal MAP Inference for Determinantal Point Processes
19 0.74455702 125 nips-2012-Factoring nonnegative matrices with linear programs
20 0.74399197 162 nips-2012-Inverse Reinforcement Learning through Structured Classification