nips nips2013 nips2013-202 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Xavier Bresson, Thomas Laurent, David Uminsky, James von Brecht
Abstract: Ideas from the image processing literature have recently motivated a new set of clustering algorithms that rely on the concept of total variation. While these algorithms perform well for bi-partitioning tasks, their recursive extensions yield unimpressive results for multiclass clustering tasks. This paper presents a general framework for multiclass total variation clustering that does not rely on recursion. The results greatly outperform previous total variation algorithms and compare well with state-of-the-art NMF approaches. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Ideas from the image processing literature have recently motivated a new set of clustering algorithms that rely on the concept of total variation. [sent-8, score-0.205]
2 While these algorithms perform well for bi-partitioning tasks, their recursive extensions yield unimpressive results for multiclass clustering tasks. [sent-9, score-0.249]
3 This paper presents a general framework for multiclass total variation clustering that does not rely on recursion. [sent-10, score-0.53]
4 The results greatly outperform previous total variation algorithms and compare well with state-of-the-art NMF approaches. [sent-11, score-0.272]
5 1 Introduction Many clustering models rely on the minimization of an energy over possible partitions of the data set. [sent-12, score-0.259]
6 A natural resolution of this issue (the NP-hardness of the discrete minimization) involves relaxing the discrete minimization space into a continuous one to obtain an easier minimization procedure. [sent-14, score-0.111]
7 Many current algorithms, such as spectral clustering methods or non-negative matrix factorization (NMF) methods, follow this relaxation approach. [sent-15, score-0.237]
8 A tight relaxation, on the other hand, has a solution that closely matches the solution of the original discrete NP-hard problem. [sent-18, score-0.101]
9 Ideas from the image processing literature have recently motivated a new set of algorithms [17, 18, 11, 12, 4, 15, 3, 2, 13, 10] that can obtain tighter relaxations than those used by NMF and spectral clustering. [sent-19, score-0.103]
10 Total variation techniques promote the formation of sharp indicator functions in the continuous relaxation. [sent-21, score-0.466]
11 In contrast to the relaxations employed by spectral clustering and NMF, total variation techniques therefore lead to quasi-discrete solutions that closely resemble the discrete solution of the original NP-hard problem. [sent-23, score-0.529]
12 They provide a promising set of clustering tools for precisely this reason. [sent-24, score-0.107]
13 Previous total variation algorithms obtain excellent results for two-class partitioning problems [18, 11, 12, 3]. [sent-25, score-0.272]
14 Until now, total variation techniques have relied upon a recursive bi-partitioning procedure to handle more than two classes. [sent-26, score-0.312]
15 This paper presents a general framework for multiclass total variation clustering that does not rely on a recursive procedure. [sent-28, score-0.57]
16 Specifically, we introduce a new discrete multiclass clustering model, its corresponding continuous relaxation and a new algorithm for optimizing the relaxation. [sent-29, score-0.327]
17 Our approach also easily adapts to handle either unsupervised or transductive clustering tasks. [sent-30, score-0.24]
18 The results significantly outperform previous total variation algorithms and compare well against state-of-the-art approaches [19, 20, 1]. [sent-31, score-0.272]
19 We name our approach Multiclass Total Variation clustering (MTV clustering). [sent-32, score-0.107]
20 Let V := {x1, . . . , xN} denote the vertex set and W := {wij}1≤i,j≤N denote the non-negative, symmetric similarity matrix. [sent-36, score-0.112]
21 The classical balanced-cut (or, Cheeger cut) [7, 8] asks for a partition of V = A ∪ Ac into two disjoint sets that minimizes the set energy Bal(A) := Cut(A, Ac) / min{|A|, |Ac|}, where Cut(A, Ac) := Σ_{xi∈A, xj∈Ac} wij. (1) [sent-38, score-0.245]
22 A simple rationale motivates this model: clusters should exhibit similarity between data points, which is reflected by small values of Cut(A, Ac), and also form an approximately equal sized partition of the vertex set. [sent-39, score-0.14]
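To make the two-class energy (1) concrete, here is a minimal NumPy sketch; it assumes W is given as a dense symmetric array and the set A as a boolean mask, and the function names are ours rather than the authors'.

```python
import numpy as np

def cut_value(W, A):
    """Cut(A, A^c): total edge weight crossing from A to its complement."""
    A = np.asarray(A, dtype=bool)
    return W[np.ix_(A, ~A)].sum()

def balanced_cut(W, A):
    """Two-class balanced (Cheeger) cut energy of eq. (1)."""
    A = np.asarray(A, dtype=bool)
    return cut_value(W, A) / min(A.sum(), (~A).sum())

# toy example: a line graph on 6 vertices cut in the middle
W = np.diag(np.ones(5), 1)
W = W + W.T                        # symmetric, non-negative similarity matrix
A = np.array([1, 1, 1, 0, 0, 0], dtype=bool)
print(balanced_cut(W, A))          # 1 / 3, one crossing edge over a 3-vertex side
```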
23 We generalize this model to the multiclass setting by pursuing the same rationale. [sent-41, score-0.102]
24 For a given number of classes R (that we assume to be known) we formulate our generalized balanced-cut problem as: Minimize Σ_{r=1}^{R} Cut(Ar, Ar^c) / min{λ|Ar|, |Ar^c|} (P) over all disjoint partitions Ar ∩ As = ∅, A1 ∪ · · · ∪ AR = V of the vertex set. [sent-42, score-0.112]
25 Previous work [4] has used λ = 1 to obtain a multiclass energy by a straightforward sum of the two-class balanced-cut terms (1). [sent-44, score-0.206]
26 While this follows the usual practice, it erroneously attempts to enforce that each set in the partition occupy half of the total number of vertices in the graph. [sent-45, score-0.152]
27 We instead select the parameter λ to ensure that each of the classes approximately occupy the appropriate fraction 1/R of the total number of vertices. [sent-46, score-0.103]
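The same computation extends directly to the multiclass energy (P). The sketch below is illustrative only; the choice lam = R − 1 is our reading of the 1/R occupancy rationale above (it equalizes the two arguments of the min exactly when each class holds N/R vertices) rather than a value quoted from this text.

```python
import numpy as np

def multiclass_balanced_cut(W, labels, lam):
    """Generalized balanced-cut energy (P):
    sum over classes r of Cut(A_r, A_r^c) / min(lam * |A_r|, |A_r^c|)."""
    labels = np.asarray(labels)
    energy = 0.0
    for r in np.unique(labels):
        A = labels == r
        cut = W[np.ix_(A, ~A)].sum()              # Cut(A_r, A_r^c)
        energy += cut / min(lam * A.sum(), (~A).sum())
    return energy

# lam = R - 1 makes both arguments of the min agree at the 1/R occupancy point
R = 3
W = np.ones((6, 6)) - np.eye(6)                   # toy complete graph
labels = np.array([0, 0, 1, 1, 2, 2])
print(multiclass_balanced_cut(W, labels, lam=R - 1))   # 6.0
```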
28 This general framework also easily incorporates a priori known information, such as a set of labels for transductive learning. [sent-48, score-0.141]
29 If Lr ⊂ V denotes a set of data points that are a priori known to belong to class r then we simply enforce Lr ⊂ Ar in the definition of an allowable partition of the vertex set. [sent-49, score-0.214]
30 3 Total Variation and a Tight Continuous Relaxation We derive our continuous optimization by relaxing the set energy (P) to the continuous energy E(F) = Σ_{r=1}^{R} ||fr||_TV / ||fr − medλ(fr)||_{1,λ}. [sent-51, score-1.666]
31 The total variation ||f||_TV of a vertex function f : V → R is defined by ||f||_TV = Σ_{i=1}^{n} Σ_{j=1}^{n} wij |f(xi) − f(xj)|. (3) [sent-57, score-0.442]
32 Alternatively, if we view a vertex function f as a vector (f(x1), . . . , f(xN)) ∈ RN, then the total variation can be expressed in terms of a matrix K acting on this vector. [sent-58, score-0.112]
33 For any edge (i, j) in the graph the corresponding row in the matrix K has an entry wij in the column corresponding to the ith vertex, an entry −wij in the column corresponding to the j th vertex and zeros otherwise. [sent-64, score-0.216]
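As an illustration of definition (3) and of the matrix K just described, the following sketch builds K with one row per ordered edge and checks that ||Kf||_1 reproduces the double sum; the helper names are our own, and the dense construction is for illustration only.

```python
import numpy as np

def tv_direct(W, f):
    """||f||_TV from eq. (3): sum_{i,j} w_ij |f(x_i) - f(x_j)|."""
    return np.sum(W * np.abs(f[:, None] - f[None, :]))

def gradient_matrix(W):
    """One row per ordered edge (i, j) with w_ij > 0: entry +w_ij in column i,
    entry -w_ij in column j, zeros elsewhere."""
    n = W.shape[0]
    rows = []
    for i in range(n):
        for j in range(n):
            if W[i, j] > 0:
                row = np.zeros(n)
                row[i], row[j] = W[i, j], -W[i, j]
                rows.append(row)
    return np.array(rows)

# ||K f||_1 over ordered edges reproduces the double sum of eq. (3)
W = np.array([[0., 1., 0.],
              [1., 0., 2.],
              [0., 2., 0.]])
f = np.array([0., 1., 1.])
K = gradient_matrix(W)
assert np.isclose(tv_direct(W, f), np.abs(K @ f).sum())   # both equal 2.0
```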
34 For an indicator function f = 1A the denominator ||f − medλ(f)||_{1,λ} equals min{λ|A|, |Ac|}; the preceding theorem therefore allows us to restate the original set optimization problem (P) in the equivalent discrete form: Minimize Σ_{r=1}^{R} ||fr||_TV / ||fr − medλ(fr)||_{1,λ} (P’) over non-zero functions f1, . . . , fR taking only the values zero or one. [sent-72, score-1.451]
35 Indeed, since the non-zero functions fr can take only two values, zero or one, they must define indicator functions of some nonempty set. [sent-79, score-0.844]
36 The constraint f1 + · · · + fR = 1V then guarantees that the sets Ar := {xi ∈ V : fr(xi) = 1} form a partition of the vertex set. [sent-83, score-0.834]
37 We obtain the relaxed version (P-rlx) of (P’) in the usual manner by allowing each fr to take values in the continuous range [0, 1]. [sent-84, score-0.745]
38 This yields: Minimize Σ_{r=1}^{R} ||fr||_TV / ||fr − medλ(fr)||_{1,λ} (P-rlx) over functions f1, . . . , fR taking values in [0, 1] and satisfying f1 + · · · + fR = 1V. [sent-85, score-1.427]
39 The following two points form the foundation on which total variation clustering relies: 1 — As the next subsection details, the total variation terms give rise to quasi-indicator functions. [sent-92, score-0.671]
40 Therefore solving (P-rlx) amounts to minimizing a sum of ratios of convex functions with convex constraints. [sent-101, score-0.155]
41 Here ||f||²_Lap = Σ_{i,j} wij |f(xi) − f(xj)|² denotes the spectral relaxation of Cut(A, Ac); it equals ⟨f, Lf⟩ if L denotes the unnormalized graph Laplacian matrix. [sent-110, score-0.308]
42 Thus problem (P-rlx2) relates to spectral clustering (and therefore NMF [9]) with a positivity constraint. [sent-111, score-0.168]
43 Note that the only difference between (P-rlx2) and (P-rlx) is that the exponent 2 appears in || · ||_Lap while the exponent 1 appears in the total variation. [sent-112, score-0.169]
44 Consider, for example, a line graph with 20 vertices and edge weights wi,i+1 = 1; then the optimal cut lies between vertex 10 and vertex 11 since this gives a perfectly balanced cut. [sent-117, score-0.32]
45 Figure 1(a) shows the vertex function f1 generated by (P-rlx) while figure 1(b) shows the one generated by (P-rlx2). [sent-118, score-0.112]
46 Observe that the solution of the total variation model coincides with the indicator function of the desired cut whereas the spectral model prefers its smoothed version. [sent-119, score-0.572]
47 Note that both functions in figures 1(a) and 1(b) have exactly the same total variation ||f||_TV = |f(x1) − f(x2)| + · · · + |f(x19) − f(x20)| = |f(x1) − f(x20)| = 1 since both functions are monotonic. [sent-120, score-0.35]
48 The total variation model will therefore prefer the sharp indicator function since it differs more from its λ-median than the smooth indicator function. [sent-121, score-0.529]
49 Indeed, the denominator ||fr − medλ(fr)||_{1,λ} is larger for the sharp indicator function than for the smooth one. [sent-122, score-0.899]
50 As ||f||²_Lap = |f(x1) − f(x2)|² + · · · + |f(x19) − f(x20)|² and t² < t when 0 < t < 1, it follows that ||f||²_Lap is much smaller for the smooth function than for the sharp one. [sent-124, score-0.113]
51 Thus the spectral model will prefer the smooth indicator function despite the fact that it differs less from its λ-median. [sent-125, score-0.167]
52 We therefore recognize the total variation as the driving force behind the formation of sharp indicator functions. [sent-126, score-0.473]
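The 20-vertex line-graph example above can be reproduced numerically. The sketch below compares the total variation (counting each undirected edge once, as in the computation quoted above) and the Laplacian quadratic form ⟨f, Lf⟩ for a sharp indicator and a smoothed monotone version; it is an illustrative check, not the authors' code.

```python
import numpy as np

n = 20
W = np.diag(np.ones(n - 1), 1)
W = W + W.T                                     # line graph, w_{i,i+1} = 1

def tv(W, f):
    """Total variation, counting each undirected edge once."""
    return 0.5 * np.sum(W * np.abs(f[:, None] - f[None, :]))

def lap_quad(W, f):
    """Laplacian quadratic form <f, Lf> with L = D - W."""
    L = np.diag(W.sum(axis=1)) - W
    return f @ L @ f

sharp = np.r_[np.zeros(10), np.ones(10)]        # indicator of the balanced cut
smooth = np.linspace(0.0, 1.0, n)               # smoothed, monotone version

print(tv(W, sharp), tv(W, smooth))              # 1.0 and 1.0
print(lap_quad(W, sharp), lap_quad(W, smooth))  # 1.0 versus roughly 0.05
```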
53 Bottom left: the solution given by the total variation relaxation. [sent-128, score-0.295]
54 Position along the x-axis = vertex number, height along the y-axis = value of the vertex function. [sent-130, score-0.224]
55 This heuristic explanation on a simple, two-class example generalizes to the multiclass case and to real data sets (see figure 2). [sent-131, score-0.102]
56 In simple terms, quasi-indicator functions arise due to the fact that the total variation of a sharp indicator function equals the total variation of a smoothed version of the same indicator function. [sent-132, score-0.831]
57 The denominator ||fr − medλ(fr)||_{1,λ} then measures the deviation of these functions from their λ-median. [sent-133, score-0.753]
58 A sharp indicator function deviates more from its median than does its smoothed version since most of its values concentrate around zero and one. [sent-134, score-0.176]
59 The energy is therefore much smaller for a sharp indicator function than for a smooth indicator function, and consequently the total variation clustering energy always prefers sharp indicator functions to smooth ones. [sent-135, score-1.091]
60 Several previous works have proven that the relaxation is exact in the two-class case; that is, the total variation solution coincides with the solution of the original NP-hard problem [8, 18, 3, 5]. [sent-137, score-0.387]
61 Figure 2 illustrates the difference between the total variation and NMF relaxations on the data set OPTDIGITS, which contains 5620 images of handwritten numerical digits. [sent-138, score-0.314]
62 The MTV relaxation leads to a sharp transition between the fours and the nines while the NMF relaxation leads to a smooth transition. [sent-141, score-0.305]
63 2 Transductive Framework From a modeling point-of-view, the presence of transductive labels poses no additional difficulty. [sent-145, score-0.141]
64 In addition to the simplex constraint F ∈ Σ := {F ∈ MN×R([0, 1]) : fr(xi) ≥ 0, Σ_{r=1}^{R} fr(xi) = 1} (7) required for unsupervised clustering, we also impose the set of labels as a hard constraint. [sent-146, score-1.614]
65 If L1, . . . , LR denote the R vertex subsets representing the labeled points, so that xi ∈ Lr means xi belongs to class r, then we may enforce these labels by restricting F to lie in the subset Λ := {F ∈ MN×R([0, 1]) : ∀r, (f1(xi), . . . , fR(xi)) equals the rth standard basis vector whenever xi ∈ Lr}. [sent-150, score-0.294]
66 Our model for transductive classification then aims to solve the problem: Minimize Σ_{r=1}^{R} ||fr||_TV / ||fr − medλ(fr)||_{1,λ} (P-trans) over matrices F ∈ Σ ∩ Λ. [sent-155, score-1.498]
67 Note that Σ ∩ Λ also defines a convex set, so this minimization remains a sum of ratios of convex functions subject to a convex constraint. [sent-156, score-0.221]
68 In particular, we may use the proximal splitting algorithm detailed in the next section for both unsupervised and transductive classification tasks. [sent-158, score-0.311]
69 4 Proximal Splitting Algorithm This section details our proximal splitting algorithm for finding local minimizers of a sum of ratios of convex functions subject to a convex constraint. [sent-159, score-0.333]
70 We also give an explicit formula for a subdifferential of B since our proximal splitting algorithm requires this in explicit form. [sent-161, score-0.225]
71 We then summarize a few properties of proximal operators before presenting the algorithm. [sent-162, score-0.116]
72 Given a convex function A : RN → R, the proximal operator of A is defined by proxA(g) := argmin_{f∈RN} A(f) + (1/2)||f − g||_2^2. (10) [sent-172, score-0.161]
73 If we let δC denote the barrier function of the convex set C, that is δC(f) = 0 if f ∈ C and δC(f) = +∞ if f ∉ C, (11) then we easily see that proxδC is simply the least-squares projection on C, in other words, proxδC(f) = projC(f) := argmin_{g∈C} (1/2)||f − g||_2^2. [sent-173, score-0.103]
74 In this manner the proximal operator defines a mapping from RN to RN that generalizes the least-squares projection onto a convex set. [sent-174, score-0.199]
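For the simplex constraint (7), projC reduces to a row-wise Euclidean projection onto the probability simplex. Below is a sketch of proxδC for this case using the standard sorting-based simplex projection; it is offered as an illustration under that assumption, and the function names are our own.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1}
    (standard sorting-based algorithm)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * k > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def project_onto_sigma(F):
    """prox of the barrier of Sigma: project each row of F onto the simplex."""
    return np.vstack([project_simplex(row) for row in F])

F = np.array([[0.7, 0.7, 0.1],
              [-0.2, 0.4, 0.4]])
print(project_onto_sigma(F))    # rows become [0.5, 0.5, 0.0] and [0.0, 0.5, 0.5]
```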
75 This leads to the problem: Minimize δC(F) + Σ_{r=1}^{R} E(fr) over matrices F = [f1, . . . , fR] ∈ MN×R, (12) where E(fr) = T(fr)/B(fr) denotes the energy of the quasi-indicator function of the rth cluster. [sent-179, score-0.853]
76 The set C = Σ or C = Σ ∩ Λ is the convex subset of MN ×R that encodes the simplex constraint (7) or the simplex constraint with labels. [sent-180, score-0.175]
77 Beginning from an initial iterate F^0 ∈ C we propose the following proximal splitting algorithm: F^{k+1} := prox_{T^k + δC}(F^k + ∂B^k(F^k)). [sent-182, score-0.2]
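Structurally, this outer iteration can be organized as below. This is only a scaffold under stated assumptions: subgrad_B, prox_T_plus_C, and energy are hypothetical callbacks standing in for the subdifferential of the denominators B^k, the weighted-TV proximal step inside the constraint set, and the clustering energy, none of which are spelled out in full in this extract; the relative-change stopping rule follows the experimental section.

```python
import numpy as np

def mtv_outer_loop(F0, subgrad_B, prox_T_plus_C, energy, tol=1e-4, max_iter=200):
    """Structural sketch of the outer iteration
        F^{k+1} = prox_{T^k + delta_C}( F^k + dB^k(F^k) ).
    The three callbacks are placeholders, not the authors' implementation."""
    F = np.array(F0, dtype=float)
    E_prev = energy(F)
    for _ in range(max_iter):
        G = F + subgrad_B(F)            # forward step using a subgradient of B^k
        F = prox_T_plus_C(G, F)         # (inexact) prox of T^k plus barrier of C
        E_curr = energy(F)
        if abs(E_prev - E_curr) < tol * abs(E_prev):   # relative-change test
            break
        E_prev = E_curr
    return F
```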
78 This choice of the constants (c_r^k, d_r^k) yields B^k(F^k) = T^k(F^k), and this fundamental property allows us to derive (see supplementary material) the energy descent estimate of Theorem 3 (Estimate of the energy descent). [sent-184, score-0.241]
79 Recent years have therefore produced several fast and accurate algorithms for computing the proximal operator of the total variation. [sent-193, score-0.187]
80 As T^k + δC consists of a weighted sum of total variation terms subject to a convex constraint, we can readily adapt these algorithms to compute the second step of our algorithm efficiently. [sent-194, score-0.317]
81 This relies on a proper uniformly convex formulation of the proximal minimization, which we detail completely in the supplementary material. [sent-196, score-0.161]
82 Ideally, we would like to terminate F^{k+1} ≈ prox_{T^k + δC}(G^k) in such a manner that the energy descent property (14) still holds and F^{k+1} always satisfies the required constraints. [sent-199, score-0.14]
83 In theory we cannot guarantee that the energy estimate holds for an inexact solution. [sent-200, score-0.104]
84 We may note, however, that a slightly weaker version of the energy estimate (14), namely Σ_{r=1}^{R} (B_r^{k+1}/B_r^k)(E_r^k − E_r^{k+1}) ≥ (1 − ε) ||F^k − F^{k+1}||² / Δ_k, (15) holds after a finite number of iterations of the inner minimization. [sent-201, score-0.104]
85 Our implementation of the proximal splitting algorithm also guarantees that F^{k+1} always satisfies the required constraints. [sent-205, score-0.178]
86 When C = Σ ∩ Λ the computational effort actually decreases, since in this case the projection consists of a simplex projection on the unlabeled points and straightforward assignment on the labeled points. [sent-210, score-0.141]
87 We may now summarize the full algorithm, including the proximal operator computation. [sent-212, score-0.116]
88 We compare against two previous total variation algorithms [11, 3], which rely on recursive bipartitioning, and two top NMF algorithms [1, 19]. [sent-234, score-0.339]
89 Procedure (b) first selects one data point uniformly at random from each computed NCut cluster, then sets fr equal to one at the data point drawn from the rth cluster and zero otherwise. [sent-238, score-0.722]
90 We then propagate this initial stage by replacing each fr with (I + L)^{-1} fr, where L denotes the unnormalized graph Laplacian. [sent-239, score-1.481]
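A sketch of this seeding-and-propagation step, assuming a dense solve for clarity (a sparse factorization would be used in practice); the function name is ours.

```python
import numpy as np

def propagate_seed(W, seed_index):
    """Initialization (b): set f to 1 at the seed vertex drawn from an NCut
    cluster and 0 elsewhere, then replace f by (I + L)^{-1} f,
    where L = D - W is the unnormalized graph Laplacian."""
    n = W.shape[0]
    f = np.zeros(n)
    f[seed_index] = 1.0
    L = np.diag(W.sum(axis=1)) - W
    return np.linalg.solve(np.eye(n) + L, f)

W = np.diag(np.ones(5), 1)
W = W + W.T
print(propagate_seed(W, seed_index=2))   # a smoothed bump centered at vertex 2
```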
91 We report the purity (see [19] for a definition of purity) of the solution with the lowest discrete energy (P). [sent-244, score-0.151]
92 Our next set of experiments demonstrates our algorithm in a transductive setting. [sent-251, score-0.11]
93 We then run ten trials of initial condition (b) (propagating all labels instead of one) and report the purity of the lowest energy solution as before along with the average computational time (for simple MATLAB code running on a standard desktop) of the ten runs. [sent-283, score-0.211]
94 We terminate the algorithm once the relative change in energy falls below 10^{-4} between outer steps of algorithm 1. [sent-284, score-0.14]
95 Our non-recursive MTV algorithm vastly outperforms the two previous recursive total variation approaches and also compares well with state-of-the-art NMF approaches. [sent-328, score-0.312]
96 We plan to incorporate such improvements into the total variation framework in future work. [sent-331, score-0.272]
97 Lastly, we found that procedure (b) can help overcome the lack of convexity inherent in many clustering approaches. [sent-332, score-0.107]
98 Overall, our total variation framework presents a promising alternative to NMF methods due to its strong mathematical foundation and tight relaxation. [sent-334, score-0.325]
99 Fast multiclass segmentation using diffuse interface methods on graphs. [sent-392, score-0.128]
100 A total variation-based graph clustering algorithm for Cheeger ratio cuts. [sent-427, score-0.313]
wordName wordTfidf (topN-words)
[('fr', 0.694), ('med', 0.237), ('variation', 0.201), ('nmf', 0.199), ('mtv', 0.16), ('ac', 0.157), ('ar', 0.127), ('proxt', 0.127), ('proximal', 0.116), ('vertex', 0.112), ('transductive', 0.11), ('lsd', 0.109), ('clustering', 0.107), ('energy', 0.104), ('multiclass', 0.102), ('cut', 0.096), ('nmfr', 0.091), ('cheeger', 0.089), ('br', 0.084), ('lap', 0.08), ('sharp', 0.079), ('gk', 0.075), ('indicator', 0.072), ('total', 0.071), ('relaxation', 0.069), ('simplex', 0.065), ('optdigits', 0.064), ('splitting', 0.062), ('spectral', 0.061), ('wij', 0.058), ('xi', 0.055), ('bresson', 0.054), ('fours', 0.054), ('pendigits', 0.054), ('projc', 0.054), ('lr', 0.054), ('purity', 0.053), ('formation', 0.05), ('exponent', 0.049), ('subdifferential', 0.047), ('graph', 0.046), ('convex', 0.045), ('ucla', 0.044), ('rn', 0.042), ('relaxations', 0.042), ('recursive', 0.04), ('cam', 0.04), ('energies', 0.039), ('functions', 0.039), ('projection', 0.038), ('mn', 0.038), ('mnist', 0.038), ('terminate', 0.036), ('bertozzi', 0.036), ('merkurjev', 0.036), ('uminsky', 0.036), ('zhirong', 0.036), ('smooth', 0.034), ('dk', 0.033), ('er', 0.033), ('erkki', 0.032), ('occupy', 0.032), ('angeles', 0.031), ('tight', 0.031), ('ck', 0.031), ('labels', 0.031), ('los', 0.03), ('db', 0.03), ('rth', 0.028), ('lausanne', 0.028), ('partition', 0.028), ('rely', 0.027), ('denotes', 0.027), ('relaxed', 0.026), ('allowable', 0.026), ('hein', 0.026), ('diffuse', 0.026), ('ratios', 0.026), ('smoothed', 0.025), ('szlam', 0.025), ('francisco', 0.025), ('continuous', 0.025), ('prox', 0.024), ('discrete', 0.024), ('prefers', 0.023), ('unsupervised', 0.023), ('solution', 0.023), ('iterate', 0.022), ('presents', 0.022), ('laurent', 0.021), ('minimization', 0.021), ('enforce', 0.021), ('subsection', 0.02), ('belongs', 0.02), ('barrier', 0.02), ('tv', 0.02), ('denominator', 0.02), ('fold', 0.02), ('unnormalized', 0.02), ('relaxing', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999997 202 nips-2013-Multiclass Total Variation Clustering
Author: Xavier Bresson, Thomas Laurent, David Uminsky, James von Brecht
Abstract: Ideas from the image processing literature have recently motivated a new set of clustering algorithms that rely on the concept of total variation. While these algorithms perform well for bi-partitioning tasks, their recursive extensions yield unimpressive results for multiclass clustering tasks. This paper presents a general framework for multiclass total variation clustering that does not rely on recursion. The results greatly outperform previous total variation algorithms and compare well with state-of-the-art NMF approaches. 1
2 0.099706061 56 nips-2013-Better Approximation and Faster Algorithm Using the Proximal Average
Author: Yao-Liang Yu
Abstract: It is a common practice to approximate “complicated” functions with more friendly ones. In large-scale machine learning applications, nonsmooth losses/regularizers that entail great computational challenges are usually approximated by smooth functions. We re-examine this powerful methodology and point out a nonsmooth approximation which simply pretends the linearity of the proximal map. The new approximation is justified using a recent convex analysis tool— proximal average, and yields a novel proximal gradient algorithm that is strictly better than the one based on smoothing, without incurring any extra overhead. Numerical experiments conducted on two important applications, overlapping group lasso and graph-guided fused lasso, corroborate the theoretical claims. 1
3 0.09636759 328 nips-2013-The Total Variation on Hypergraphs - Learning on Hypergraphs Revisited
Author: Matthias Hein, Simon Setzer, Leonardo Jost, Syama Sundar Rangapuram
Abstract: Hypergraphs allow one to encode higher-order relationships in data and are thus a very flexible modeling tool. Current learning methods are either based on approximations of the hypergraphs via graphs or on tensor methods which are only applicable under special conditions. In this paper, we present a new learning framework on hypergraphs which fully uses the hypergraph structure. The key element is a family of regularization functionals based on the total variation on hypergraphs. 1
4 0.079666696 75 nips-2013-Convex Two-Layer Modeling
Author: Özlem Aslan, Hao Cheng, Xinhua Zhang, Dale Schuurmans
Abstract: Latent variable prediction models, such as multi-layer networks, impose auxiliary latent variables between inputs and outputs to allow automatic inference of implicit features useful for prediction. Unfortunately, such models are difficult to train because inference over latent variables must be performed concurrently with parameter optimization—creating a highly non-convex problem. Instead of proposing another local training method, we develop a convex relaxation of hidden-layer conditional models that admits global training. Our approach extends current convex modeling approaches to handle two nested nonlinearities separated by a non-trivial adaptive latent layer. The resulting methods are able to acquire two-layer models that cannot be represented by any single-layer model over the same features, while improving training quality over local heuristics. 1
5 0.074370131 35 nips-2013-Analyzing the Harmonic Structure in Graph-Based Learning
Author: Xiao-Ming Wu, Zhenguo Li, Shih-Fu Chang
Abstract: We find that various well-known graph-based models exhibit a common important harmonic structure in its target function – the value of a vertex is approximately the weighted average of the values of its adjacent neighbors. Understanding of such structure and analysis of the loss defined over such structure help reveal important properties of the target function over a graph. In this paper, we show that the variation of the target function across a cut can be upper and lower bounded by the ratio of its harmonic loss and the cut cost. We use this to develop an analytical tool and analyze five popular graph-based models: absorbing random walks, partially absorbing random walks, hitting times, pseudo-inverse of the graph Laplacian, and eigenvectors of the Laplacian matrices. Our analysis sheds new insights into several open questions related to these models, and provides theoretical justifications and guidelines for their practical use. Simulations on synthetic and real datasets confirm the potential of the proposed theory and tool.
6 0.065469846 214 nips-2013-On Algorithms for Sparse Multi-factor NMF
7 0.06533663 148 nips-2013-Latent Maximum Margin Clustering
8 0.064854026 268 nips-2013-Reflection methods for user-friendly submodular optimization
9 0.062953465 222 nips-2013-On the Linear Convergence of the Proximal Gradient Method for Trace Norm Regularization
10 0.059742548 257 nips-2013-Projected Natural Actor-Critic
11 0.058669284 304 nips-2013-Sparse nonnegative deconvolution for compressive calcium imaging: algorithms and phase transitions
12 0.058453787 158 nips-2013-Learning Multiple Models via Regularized Weighting
13 0.056704931 73 nips-2013-Convex Relaxations for Permutation Problems
14 0.053581111 207 nips-2013-Near-optimal Anomaly Detection in Graphs using Lovasz Extended Scan Statistic
15 0.053134184 332 nips-2013-Tracking Time-varying Graphical Structure
16 0.053057827 186 nips-2013-Matrix factorization with binary components
17 0.050314918 272 nips-2013-Regularized Spectral Clustering under the Degree-Corrected Stochastic Blockmodel
18 0.050286904 355 nips-2013-Which Space Partitioning Tree to Use for Search?
19 0.047716968 307 nips-2013-Speedup Matrix Completion with Side Information: Application to Multi-Label Learning
20 0.047439415 97 nips-2013-Distributed Submodular Maximization: Identifying Representative Elements in Massive Data
topicId topicWeight
[(0, 0.131), (1, 0.041), (2, 0.015), (3, 0.023), (4, 0.037), (5, 0.03), (6, -0.051), (7, -0.044), (8, -0.012), (9, 0.002), (10, 0.013), (11, 0.011), (12, 0.056), (13, -0.01), (14, 0.005), (15, 0.01), (16, -0.033), (17, -0.01), (18, 0.033), (19, 0.031), (20, -0.025), (21, 0.039), (22, 0.096), (23, -0.017), (24, -0.047), (25, -0.091), (26, -0.005), (27, -0.041), (28, 0.047), (29, 0.079), (30, -0.139), (31, 0.077), (32, 0.084), (33, 0.068), (34, -0.01), (35, -0.02), (36, 0.018), (37, -0.009), (38, -0.034), (39, 0.06), (40, 0.008), (41, 0.067), (42, 0.015), (43, 0.025), (44, -0.104), (45, -0.003), (46, -0.015), (47, -0.069), (48, -0.025), (49, 0.05)]
simIndex simValue paperId paperTitle
same-paper 1 0.92399979 202 nips-2013-Multiclass Total Variation Clustering
Author: Xavier Bresson, Thomas Laurent, David Uminsky, James von Brecht
Abstract: Ideas from the image processing literature have recently motivated a new set of clustering algorithms that rely on the concept of total variation. While these algorithms perform well for bi-partitioning tasks, their recursive extensions yield unimpressive results for multiclass clustering tasks. This paper presents a general framework for multiclass total variation clustering that does not rely on recursion. The results greatly outperform previous total variation algorithms and compare well with state-of-the-art NMF approaches. 1
2 0.83009577 328 nips-2013-The Total Variation on Hypergraphs - Learning on Hypergraphs Revisited
Author: Matthias Hein, Simon Setzer, Leonardo Jost, Syama Sundar Rangapuram
Abstract: Hypergraphs allow one to encode higher-order relationships in data and are thus a very flexible modeling tool. Current learning methods are either based on approximations of the hypergraphs via graphs or on tensor methods which are only applicable under special conditions. In this paper, we present a new learning framework on hypergraphs which fully uses the hypergraph structure. The key element is a family of regularization functionals based on the total variation on hypergraphs. 1
3 0.6319111 249 nips-2013-Polar Operators for Structured Sparse Estimation
Author: Xinhua Zhang, Yao-Liang Yu, Dale Schuurmans
Abstract: Structured sparse estimation has become an important technique in many areas of data analysis. Unfortunately, these estimators normally create computational difficulties that entail sophisticated algorithms. Our first contribution is to uncover a rich class of structured sparse regularizers whose polar operator can be evaluated efficiently. With such an operator, a simple conditional gradient method can then be developed that, when combined with smoothing and local optimization, significantly reduces training time vs. the state of the art. We also demonstrate a new reduction of polar to proximal maps that enables more efficient latent fused lasso. 1
4 0.62623221 56 nips-2013-Better Approximation and Faster Algorithm Using the Proximal Average
Author: Yao-Liang Yu
Abstract: It is a common practice to approximate “complicated” functions with more friendly ones. In large-scale machine learning applications, nonsmooth losses/regularizers that entail great computational challenges are usually approximated by smooth functions. We re-examine this powerful methodology and point out a nonsmooth approximation which simply pretends the linearity of the proximal map. The new approximation is justified using a recent convex analysis tool— proximal average, and yields a novel proximal gradient algorithm that is strictly better than the one based on smoothing, without incurring any extra overhead. Numerical experiments conducted on two important applications, overlapping group lasso and graph-guided fused lasso, corroborate the theoretical claims. 1
5 0.57408124 215 nips-2013-On Decomposing the Proximal Map
Author: Yao-Liang Yu
Abstract: The proximal map is the key step in gradient-type algorithms, which have become prevalent in large-scale high-dimensional problems. For simple functions this proximal map is available in closed-form while for more complicated functions it can become highly nontrivial. Motivated by the need of combining regularizers to simultaneously induce different types of structures, this paper initiates a systematic investigation of when the proximal map of a sum of functions decomposes into the composition of the proximal maps of the individual summands. We not only unify a few known results scattered in the literature but also discover several new decompositions obtained almost effortlessly from our theory. 1
6 0.56299692 35 nips-2013-Analyzing the Harmonic Structure in Graph-Based Learning
7 0.55008155 350 nips-2013-Wavelets on Graphs via Deep Learning
8 0.53770947 214 nips-2013-On Algorithms for Sparse Multi-factor NMF
9 0.52816421 321 nips-2013-Supervised Sparse Analysis and Synthesis Operators
10 0.52329916 73 nips-2013-Convex Relaxations for Permutation Problems
11 0.5127278 186 nips-2013-Matrix factorization with binary components
12 0.49040079 207 nips-2013-Near-optimal Anomaly Detection in Graphs using Lovasz Extended Scan Statistic
13 0.48632526 357 nips-2013-k-Prototype Learning for 3D Rigid Structures
14 0.48556498 313 nips-2013-Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization
15 0.48554936 157 nips-2013-Learning Multi-level Sparse Representations
16 0.48058698 244 nips-2013-Parametric Task Learning
17 0.4788844 158 nips-2013-Learning Multiple Models via Regularized Weighting
18 0.47860679 222 nips-2013-On the Linear Convergence of the Proximal Gradient Method for Trace Norm Regularization
19 0.46507159 268 nips-2013-Reflection methods for user-friendly submodular optimization
20 0.46052441 358 nips-2013-q-OCSVM: A q-Quantile Estimator for High-Dimensional Distributions
topicId topicWeight
[(16, 0.021), (33, 0.106), (34, 0.548), (41, 0.022), (49, 0.018), (56, 0.072), (70, 0.031), (85, 0.045), (89, 0.015), (93, 0.027), (95, 0.016)]
simIndex simValue paperId paperTitle
1 0.97448176 256 nips-2013-Probabilistic Principal Geodesic Analysis
Author: Miaomiao Zhang, P.T. Fletcher
Abstract: Principal geodesic analysis (PGA) is a generalization of principal component analysis (PCA) for dimensionality reduction of data on a Riemannian manifold. Currently PGA is defined as a geometric fit to the data, rather than as a probabilistic model. Inspired by probabilistic PCA, we present a latent variable model for PGA that provides a probabilistic framework for factor analysis on manifolds. To compute maximum likelihood estimates of the parameters in our model, we develop a Monte Carlo Expectation Maximization algorithm, where the expectation is approximated by Hamiltonian Monte Carlo sampling of the latent variables. We demonstrate the ability of our method to recover the ground truth parameters in simulated sphere data, as well as its effectiveness in analyzing shape variability of a corpus callosum data set from human brain images. 1
2 0.96821308 351 nips-2013-What Are the Invariant Occlusive Components of Image Patches? A Probabilistic Generative Approach
Author: Zhenwen Dai, Georgios Exarchakis, Jörg Lücke
Abstract: We study optimal image encoding based on a generative approach with non-linear feature combinations and explicit position encoding. By far most approaches to unsupervised learning of visual features, such as sparse coding or ICA, account for translations by representing the same features at different positions. Some earlier models used a separate encoding of features and their positions to facilitate invariant data encoding and recognition. All probabilistic generative models with explicit position encoding have so far assumed a linear superposition of components to encode image patches. Here, we for the first time apply a model with non-linear feature superposition and explicit position encoding for patches. By avoiding linear superpositions, the studied model represents a closer match to component occlusions which are ubiquitous in natural images. In order to account for occlusions, the non-linear model encodes patches qualitatively very different from linear models by using component representations separated into mask and feature parameters. We first investigated encodings learned by the model using artificial data with mutually occluding components. We find that the model extracts the components, and that it can correctly identify the occlusive components with the hidden variables of the model. On natural image patches, the model learns component masks and features for typical image components. By using reverse correlation, we estimate the receptive fields associated with the model’s hidden units. We find many Gabor-like or globular receptive fields as well as fields sensitive to more complex structures. Our results show that probabilistic models that capture occlusions and invariances can be trained efficiently on image patches, and that the resulting encoding represents an alternative model for the neural encoding of images in the primary visual cortex. 1
3 0.96356708 210 nips-2013-Noise-Enhanced Associative Memories
Author: Amin Karbasi, Amir Hesam Salavati, Amin Shokrollahi, Lav R. Varshney
Abstract: Recent advances in associative memory design through structured pattern sets and graph-based inference algorithms allow reliable learning and recall of exponential numbers of patterns. Though these designs correct external errors in recall, they assume neurons compute noiselessly, in contrast to highly variable neurons in hippocampus and olfactory cortex. Here we consider associative memories with noisy internal computations and analytically characterize performance. As long as internal noise is less than a specified threshold, error probability in the recall phase can be made exceedingly small. More surprisingly, we show internal noise actually improves performance of the recall phase. Computational experiments lend additional support to our theoretical analysis. This work suggests a functional benefit to noisy neurons in biological neuronal networks. 1
4 0.95671552 129 nips-2013-Generalized Random Utility Models with Multiple Types
Author: Hossein Azari Soufiani, Hansheng Diao, Zhenyu Lai, David C. Parkes
Abstract: We propose a model for demand estimation in multi-agent, differentiated product settings and present an estimation algorithm that uses reversible jump MCMC techniques to classify agents’ types. Our model extends the popular setup in Berry, Levinsohn and Pakes (1995) to allow for the data-driven classification of agents’ types using agent-level data. We focus on applications involving data on agents’ ranking over alternatives, and present theoretical conditions that establish the identifiability of the model and uni-modality of the likelihood/posterior. Results on both real and simulated data provide support for the scalability of our approach. 1
same-paper 5 0.95503622 202 nips-2013-Multiclass Total Variation Clustering
Author: Xavier Bresson, Thomas Laurent, David Uminsky, James von Brecht
Abstract: Ideas from the image processing literature have recently motivated a new set of clustering algorithms that rely on the concept of total variation. While these algorithms perform well for bi-partitioning tasks, their recursive extensions yield unimpressive results for multiclass clustering tasks. This paper presents a general framework for multiclass total variation clustering that does not rely on recursion. The results greatly outperform previous total variation algorithms and compare well with state-of-the-art NMF approaches. 1
6 0.94449663 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC
7 0.93777359 122 nips-2013-First-order Decomposition Trees
8 0.93577296 38 nips-2013-Approximate Dynamic Programming Finally Performs Well in the Game of Tetris
9 0.91578209 219 nips-2013-On model selection consistency of penalized M-estimators: a geometric theory
10 0.91156882 143 nips-2013-Integrated Non-Factorized Variational Inference
11 0.80627835 346 nips-2013-Variational Inference for Mahalanobis Distance Metrics in Gaussian Process Regression
12 0.8016094 347 nips-2013-Variational Planning for Graph-based MDPs
13 0.79593962 39 nips-2013-Approximate Gaussian process inference for the drift function in stochastic differential equations
14 0.76831049 86 nips-2013-Demixing odors - fast inference in olfaction
15 0.76289409 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents
16 0.76250786 312 nips-2013-Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex
17 0.75826228 148 nips-2013-Latent Maximum Margin Clustering
18 0.75321335 322 nips-2013-Symbolic Opportunistic Policy Iteration for Factored-Action MDPs
19 0.74843657 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture
20 0.74748838 317 nips-2013-Streaming Variational Bayes