nips nips2004 nips2004-207 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: David P. Wipf, Bhaskar D. Rao
Abstract: Finding the sparsest, or minimum ℓ0 -norm, representation of a signal given an overcomplete dictionary of basis vectors is an important problem in many application domains. Unfortunately, the required optimization problem is often intractable because there is a combinatorial increase in the number of local minima as the number of candidate basis vectors increases. This deficiency has prompted most researchers to instead minimize surrogate measures, such as the ℓ1 -norm, that lead to more tractable computational methods. The downside of this procedure is that we have now introduced a mismatch between our ultimate goal and our objective function. In this paper, we demonstrate a sparse Bayesian learning-based method of minimizing the ℓ0 -norm while reducing the number of troublesome local minima. Moreover, we derive necessary conditions for local minima to occur via this approach and empirically demonstrate that there are typically many fewer for general problems of interest. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: Finding the sparsest, or minimum ℓ0 -norm, representation of a signal given an overcomplete dictionary of basis vectors is an important problem in many application domains. [sent-4, score-0.489]
2 Unfortunately, the required optimization problem is often intractable because there is a combinatorial increase in the number of local minima as the number of candidate basis vectors increases. [sent-5, score-0.594]
3 In this paper, we demonstrate a sparse Bayesian learning-based method of minimizing the ℓ0 -norm while reducing the number of troublesome local minima. [sent-8, score-0.383]
4 Moreover, we derive necessary conditions for local minima to occur via this approach and empirically demonstrate that there are typically many fewer for general problems of interest. [sent-9, score-0.629]
5 1 Introduction Sparse signal representations from overcomplete dictionaries find increasing relevance in many application domains [1, 2]. [sent-10, score-0.209]
6 min_w ‖w‖_0 , s.t. t = Φw, (1) where Φ ∈ ℜ^{N×M} is a matrix whose columns represent an overcomplete basis (i.e., M > N). [sent-13, score-0.248]
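To make problem (1) and the ℓ1 surrogate mentioned in the abstract concrete, here is a minimal Python/NumPy sketch (all sizes and data are hypothetical); it builds a small overcomplete system t = Φw and solves the ℓ1 relaxation as a linear program with scipy.optimize.linprog. It illustrates the surrogate approach, not the algorithm proposed in this paper.

```python
# Sketch: an overcomplete system t = Phi w and the l1 (basis-pursuit) surrogate,
# solved as a linear program.  Sizes and data are hypothetical.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
N, M, k = 10, 30, 3                       # overcomplete: M > N; true sparsity k
Phi = rng.standard_normal((N, M))
Phi /= np.linalg.norm(Phi, axis=0)        # unit-norm dictionary columns

w_true = np.zeros(M)
support = rng.choice(M, size=k, replace=False)
w_true[support] = rng.standard_normal(k)
t = Phi @ w_true                          # observed signal

# min ||w||_1  s.t.  Phi w = t, written as an LP in w = u - v with u, v >= 0.
c = np.ones(2 * M)
A_eq = np.hstack([Phi, -Phi])
res = linprog(c, A_eq=A_eq, b_eq=t, bounds=(0, None))
w_l1 = res.x[:M] - res.x[M:]

print("true support:", np.sort(support))
print("l1 support  :", np.flatnonzero(np.abs(w_l1) > 1e-6))
```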
7 In this vein, we seek to find weight vectors whose entries are predominantly zero that nonetheless allow us to accurately represent t. [sent-19, score-0.153]
8 For convenience, we will refer to these approaches as local sparsity maximization (LSM) algorithms. [sent-21, score-0.219]
9 While all of these methods are potentially very useful candidates for solving (1), they suffer from one significant drawback: as we have discussed in [6], every local minimum of (1) is also a local minimum to the LSM algorithms. [sent-27, score-0.989]
10 In fact, every basic feasible solution w∗ to t = Φw is such a local minimum. [sent-29, score-0.601]
11 Any other feasible solution can be written as w∗ + αw′ , where w′ ∈ Null(Φ). [sent-31, score-0.295]
12 For simplicity, if we assume that every subset of N columns of Φ is linearly independent, the unique representation property (URP), then w′ must necessarily have nonzero elements in locations that differ from w∗ . [sent-32, score-0.343]
13 This ensures that all such w∗ represent local minima to (1). [sent-34, score-0.513]
14 The number of basic feasible solutions is bounded between (M−1 choose N) + 1 and (M choose N); the exact number depends on t and Φ [4]. [sent-35, score-0.407]
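As a sanity check on these counts, the brute-force sketch below (hypothetical small sizes) enumerates every N-column subset of Φ, solves for the corresponding solution, and counts the distinct basic feasible solutions; the tally lands between the two bounds quoted above.

```python
# Sketch: count the distinct basic feasible solutions of t = Phi w by brute force.
import numpy as np
from itertools import combinations
from math import comb

rng = np.random.default_rng(1)
N, M, k = 4, 8, 2                         # M > N; t is built from k columns
Phi = rng.standard_normal((N, M))
w0 = np.zeros(M)
w0[rng.choice(M, size=k, replace=False)] = rng.standard_normal(k)
t = Phi @ w0

solutions = set()
for S in combinations(range(M), N):       # under the URP every N-subset is invertible
    wS = np.linalg.solve(Phi[:, S], t)
    w = np.zeros(M)
    w[list(S)] = wS
    solutions.add(tuple(np.round(w, 8)))  # dedupe numerically identical solutions

print("distinct basic feasible solutions:", len(solutions))
print("bounds quoted in the text:", comb(M - 1, N) + 1, "to", comb(M, N))
```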
15 Regardless, when M ≫ N , we have a large number of local minima and, not surprisingly, we often converge to one of them using currently available LSM algorithms. [sent-36, score-0.519]
16 The considerable price we must pay, however, is that the global minimum of this objective function need not coincide with the sparsest solutions to (1). [sent-39, score-0.406]
17 As such, we may fail to recover the maximally sparse solution regardless of the initialization we use (unlike an LSM procedure). [sent-40, score-0.321]
18 In this paper, we will demonstrate an alternative algorithm for solving (1) using a sparse Bayesian learning (SBL) framework. [sent-41, score-0.145]
19 First, we will prove that, unlike minimum ℓ1 -norm methods, the global minimum of the SBL cost function is only achieved at the minimum ℓ0 -norm solution to t = Φw. [sent-43, score-0.702]
20 Later, we will show that this method is only locally minimized at a subset of basic feasible solutions and therefore, has fewer local minima than current LSM algorithms. [sent-44, score-1.024]
21 The marginalized pdf is given by p(t; γ) = ∫ p(t|w) p(w; γ) dw = (2π)^{−N/2} |Σ_t|^{−1/2} exp(−(1/2) t^T Σ_t^{−1} t), (4). A basic feasible solution is a solution with at most N nonzero entries. [sent-53, score-0.705]
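The marginal density (4) is easy to evaluate numerically. The sketch below assumes the usual SBL covariance Σ_t = σ²I + ΦΓΦ^T (that definition is not reproduced in the extracted text, so treat it as an assumption) and uses a Cholesky factor for both the determinant and the quadratic form.

```python
# Sketch: evaluate the log of the marginal density p(t; gamma) in Eq. (4).
# Assumes the standard SBL covariance Sigma_t = sigma^2 I + Phi Gamma Phi^T.
import numpy as np

def log_marginal(t, Phi, gamma, sigma2):
    N = Phi.shape[0]
    Sigma_t = sigma2 * np.eye(N) + Phi @ np.diag(gamma) @ Phi.T
    L = np.linalg.cholesky(Sigma_t)
    alpha = np.linalg.solve(L, t)               # alpha @ alpha == t^T Sigma_t^{-1} t
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (N * np.log(2 * np.pi) + logdet + alpha @ alpha)

rng = np.random.default_rng(2)
Phi = rng.standard_normal((5, 12))
t = Phi @ np.where(rng.random(12) < 0.25, rng.standard_normal(12), 0.0)
print(log_marginal(t, Phi, gamma=np.ones(12), sigma2=1e-2))
```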
22 In very restrictive settings, it has been shown that the minimum ℓ1 -norm solution can equal the minimum ℓ0 -norm solution [7]. [sent-54, score-0.528]
23 3 ℓ0 -norm minimization via SBL. Although SBL was initially developed in a regression context, it can nonetheless be easily adapted to handle (1) by fixing σ 2 to some ε and allowing ε → 0. [sent-61, score-0.146]
24 Also, upon convergence we can easily show that if γ_ML is sparse, ŵ will also be sparse while maintaining feasibility. [sent-67, score-0.161]
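The update equations referenced here (e.g., (9)) are not reproduced in this extraction. For orientation only, the sketch below uses the standard EM-style SBL updates from the sparse Bayesian learning literature, the posterior mean and covariance of w followed by γ_i ← μ_i² + Σ_ii, with σ² fixed to a small ε as described; treat the exact update form as an assumption rather than a quote from this paper.

```python
# Sketch: a generic EM-style SBL iteration with sigma^2 fixed to a small eps.
# Stand-in for the updates referenced (but not shown) in the extracted text.
import numpy as np

def sbl_em(Phi, t, eps=1e-6, iters=200, prune=1e-8):
    N, M = Phi.shape
    gamma = np.ones(M)
    for _ in range(iters):
        G = np.diag(gamma)
        Sigma_t = eps * np.eye(N) + Phi @ G @ Phi.T
        mu = G @ Phi.T @ np.linalg.solve(Sigma_t, t)                # posterior mean of w
        Sigma_w = G - G @ Phi.T @ np.linalg.solve(Sigma_t, Phi @ G)
        gamma = mu ** 2 + np.diag(Sigma_w)                          # EM hyperparameter update
        gamma[gamma < prune] = 0.0                                  # pruned gammas force w_i -> 0
    return mu, gamma

rng = np.random.default_rng(10)
N, M = 15, 30
Phi = rng.standard_normal((N, M)); Phi /= np.linalg.norm(Phi, axis=0)
w0 = np.zeros(M); w0[rng.choice(M, 4, replace=False)] = rng.standard_normal(4)
w_hat, gamma = sbl_em(Phi, Phi @ w0)
print("largest gammas:", np.sort(np.argsort(-gamma)[:4]), "true support:", np.flatnonzero(w0))
```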
25 A firm connection with ℓ0 -norm minimization is realized when we consider the global minimum of L(γ; σ 2 = ε) in the limit as ε approaches zero. [sent-70, score-0.301]
26 This assumes that t is in the span of the columns of Φ associated with nonzero elements in γ, which will always be the case if t is in the span of Φ and all γ are initialized to nonzero values. [sent-77, score-0.567]
27 First, we know from [6] that every local minimum of L(γ; σ 2 = ε) is achieved at a basic feasible solution γ∗ (i.e., a solution with N or fewer nonzero entries), regardless of ε. [sent-79, score-0.786] [sent-81, score-0.353]
29 Therefore, in our search for the global minimum, we only need examine the space of basic feasible solutions. [sent-82, score-0.366]
30 A maximally sparse basic feasible solution, which we denote γ∗∗ , can only occur with nonzero elements aligned with the nonzero elements of some w ∈ W0 . [sent-85, score-1.029]
31 In the limit as ε → 0, w∗∗ becomes feasible while maintaining the same sparsity profile as γ∗∗ , leading to the stated result. [sent-86, score-0.349]
32 More importantly, we will now show that the limiting SBL cost function, which we will henceforth denote L(γ) ≜ lim_{ε→0} L(γ; σ 2 = ε) = log |ΦΓΦ^T| + t^T (ΦΓΦ^T)^{−1} t, (13) need not have the same problematic local minima profile as other methods. [sent-88, score-0.575]
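The limiting cost (13) can be written down directly; in the sketch below a tiny ε stands in for the σ² → 0 limit so that the matrix stays invertible even when fewer than N hyperparameters are active. Comparing a dense γ with one matched to a sparse support illustrates why the sparse configuration is favored (data are hypothetical).

```python
# Sketch: the limiting SBL cost of Eq. (13),
#   L(gamma) = log|Phi Gamma Phi^T| + t^T (Phi Gamma Phi^T)^{-1} t,
# with a small eps playing the role of sigma^2 -> 0 for numerical stability.
import numpy as np

def sbl_cost(t, Phi, gamma, eps=1e-12):
    C = Phi @ np.diag(gamma) @ Phi.T + eps * np.eye(Phi.shape[0])
    _, logdet = np.linalg.slogdet(C)
    return logdet + t @ np.linalg.solve(C, t)

rng = np.random.default_rng(3)
N, M = 6, 15
Phi = rng.standard_normal((N, M))
w0 = np.zeros(M); w0[:3] = rng.standard_normal(3)
t = Phi @ w0

gamma_dense = np.ones(M)              # every hyperparameter active
gamma_sparse = w0 ** 2                # hyperparameters matched to the sparse support
print("dense :", sbl_cost(t, Phi, gamma_dense))
print("sparse:", sbl_cost(t, Phi, gamma_sparse))   # much lower, as the sparsity result suggests
```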
33 We have not, however, provided any concrete reason why SBL should be preferred over current LSM methods of finding sparse solutions. [sent-90, score-0.144]
34 In fact, this preference is not established until we carefully consider the problem of convergence to local minima. [sent-91, score-0.181]
35 As already mentioned, the problem with current methods of minimizing ‖w‖_0 is that every basic feasible solution unavoidably becomes a local minimum. [sent-92, score-0.658]
36 For example, consider the alternate objective function f(w) ≜ min(‖w‖_0 , N), leading to the optimization problem min_w f(w), s.t. t = Φw. (14) [sent-94, score-0.143]
37 While the global minimum remains unchanged, we observe that all local minima occurring at non-degenerate basic feasible solutions have been effectively removed. [sent-97, score-1.099]
38 In other words, at any solution w∗ with N nonzero entries, we can always add a small component αw′ ∈ Null(Φ) without increasing f (w), since f (w) can never be greater than N . [sent-98, score-0.285]
39 Therefore, we are free to move from basic feasible solution to basic feasible solution without increasing f (w). [sent-99, score-0.84]
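This effect is easy to verify numerically: the sketch below (hypothetical data) forms a non-degenerate basic feasible solution, perturbs it along a null-space direction of Φ, and confirms that ‖w‖_0 grows while the alternate objective f(w) = min(‖w‖_0, N) of (14) does not.

```python
# Sketch: moving along Null(Phi) from a basic feasible solution never increases
# f(w) = min(||w||_0, N), even though ||w||_0 itself jumps.
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(4)
N, M = 5, 12
Phi = rng.standard_normal((N, M))
t = rng.standard_normal(N)

S = list(range(N))                            # any N columns give a basic feasible solution (URP)
w_star = np.zeros(M)
w_star[S] = np.linalg.solve(Phi[:, S], t)

Z = null_space(Phi)                           # basis of Null(Phi), shape (M, M - N)
w_new = w_star + 0.3 * Z[:, 0]                # still feasible: Phi @ w_new == t

l0 = lambda w, tol=1e-9: int(np.sum(np.abs(w) > tol))
f = lambda w: min(l0(w), N)
print("feasible:", np.allclose(Phi @ w_new, t))
print("||w||_0 :", l0(w_star), "->", l0(w_new))   # increases
print("f(w)    :", f(w_star), "->", f(w_new))     # stays at N
```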
40 Also, the rare degenerate basic solutions that do remain, even if suboptimal, are sparser by definition. [sent-100, score-0.227]
41 Therefore, locally minimizing our new problem (14) is clearly superior to locally minimizing (1). [sent-101, score-0.238]
42 Although we cannot remove all non-degenerate local minima and still retain computational feasibility, it is possible to remove many of them, providing some measure of approximation to (14). [sent-103, score-0.482]
43 Specifically, we will derive necessary conditions required for a non-degenerate basic feasible solution to represent a local minimum to L(γ). [sent-105, score-0.846]
44 We will then show that these conditions are frequently not satisfied, implying that there are potentially many fewer local minima. [sent-106, score-0.267]
45 Thus, locally minimizing L(γ) comes closer to (locally) minimizing (14) than current LSM methods, which, in turn, is closer to globally minimizing ‖w‖_0 . [sent-107, score-0.263]
46 A degenerate basic feasible solution has strictly fewer than N nonzero entries; however, the vast majority of local minima are non-degenerate, containing exactly N nonzero entries. [sent-108, score-1.35]
47 Necessary Conditions for Local Minima. As previously stated, all local minima to L(γ) must occur at basic feasible solutions γ∗ . [sent-110, score-0.922]
48 Now suppose that we have found a (non-degenerate) γ∗ with associated w∗ computed via (9) and we would like to assess whether or not it is a local minimum to our SBL cost function. [sent-111, score-0.464]
49 For convenience, let w denote the N nonzero elements of w∗ and Φ the associated columns of Φ (therefore, t = Φw and w = Φ−1 t). [sent-112, score-0.334]
50 Intuitively, it would seem likely that if we are not at a true local minimum, then there must exist at least one additional column of Φ not in Φ, e.g., some x, that is somehow aligned with t. [sent-113, score-0.241]
51 But how do we quantify this relationship for the purposes of analyzing local minima? [sent-117, score-0.224]
52 As will be shown below, the similarity required between x and t (needed for establishing the existence of a local minimum) may then be realized by comparing the respective weights v and w, where v denotes the coefficients representing x in the reduced basis Φ (analogous to w = Φ−1 t). [sent-120, score-0.248]
53 Loosely, we may expect that if v is ‘close enough’ to w, then x is sufficiently close to t (relative to all other columns in Φ) such that we are not at a local minimum. [sent-122, score-0.25]
54 Let Φ satisfy the URP and let γ∗ represent a vector of hyperparameters with N and only N nonzero entries and associated basic feasible solution w = Φ−1 t. [sent-124, score-0.758]
55 Then γ∗ is a local minimum of L(γ) only if Σ_{i≠j} v_i v_j / (w_i w_j) < 0 ∀v ∈ V. (15) [sent-126, score-0.415]
56 Proof: If γ∗ truly represents a local minimum of our cost function, then the following condition must hold for all x ∈ X : ∂L(γ∗)/∂γ_x ≥ 0, (16) where γ_x denotes the hyperparameter corresponding to the basis vector x. [sent-127, score-0.513]
57 Since we have assumed that we are at a local minimum, it is straightforward to show that Γ = diag(w)^2 , leading to the expression B = Φ^{−T} diag(w)^{−2} Φ^{−1} . [sent-130, score-0.206]
58 This theorem provides a useful picture of what is required for local minima to exist and, more importantly, why many basic feasible solutions are not local minima. [sent-136, score-1.108]
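Condition (15) is straightforward to screen in code. The sketch below (hypothetical data; the helper names are my own) takes the N active columns of a candidate basic feasible solution, computes the weights v representing each remaining column, and reports whether any of them already rules the solution out as an SBL local minimum.

```python
# Sketch: test the necessary condition (15) for a candidate basic feasible solution.
# If some unused column gives sum_{i != j} v_i v_j / (w_i w_j) >= 0, the candidate
# cannot be a local minimum of L(gamma).
import numpy as np

def passes_condition_15(Phi, t, S):
    """S: indices of the N active columns of the candidate basic feasible solution."""
    Phi_S = Phi[:, S]
    w = np.linalg.solve(Phi_S, t)                      # nonzero weights of w*
    for j in (j for j in range(Phi.shape[1]) if j not in S):
        v = np.linalg.solve(Phi_S, Phi[:, j])          # weights representing column phi_j
        r = v / w
        pairwise = r.sum() ** 2 - (r ** 2).sum()       # = sum_{i != j} v_i v_j / (w_i w_j)
        if pairwise >= 0:
            return False                               # violated: not an SBL local minimum
    return True                                        # necessary (not sufficient) condition holds

rng = np.random.default_rng(5)
Phi = rng.standard_normal((4, 10))
t = rng.standard_normal(4)
print(passes_condition_15(Phi, t, S=[0, 1, 2, 3]))
```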
59 A Simple Geometric Interpretation. In general terms, if the signs of each of the elements in a given v match up with those of w, then the specified condition will be violated and we cannot be at a local minimum. [sent-139, score-0.216]
60 To begin, we note that our cost function L(γ) is invariant with respect to reflections of any basis vectors about the origin, i.e., any column φ of Φ can be replaced by −φ without changing L(γ). [sent-141, score-0.155]
61 Returning to a candidate local minimum with associated Φ, we may therefore assume, without loss of generality, that Φ ≡ Φdiag (sgn(w)), giving us the decomposition t = Φw, w > 0. [sent-144, score-0.39]
62 Under this assumption, we see that t is located in the convex cone formed by the columns of Φ. [sent-145, score-0.321]
63 If some candidate column x (i.e., any column of Φ not in Φ) lies in this convex cone, then the associated coefficients v must all be positive by definition (likewise, by a similar argument, any x in the convex cone of −Φ leads to the same result). [sent-148, score-0.371]
64 Consequently, Theorem 2 ensures that we are not at a local minimum. [sent-149, score-0.181]
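The sign-pattern form of this argument is also a one-line test: express a candidate column in the (sign-corrected) active basis and check whether all coefficients share one sign. A small self-contained sketch with hypothetical data:

```python
# Sketch: the convex-cone test.  After reflecting the active columns so that w > 0,
# an unused column with all-positive (or all-negative) coefficients lies in the cone
# of +/- the active columns, which rules out an SBL local minimum.
import numpy as np

def in_cone_of_active_columns(Phi, t, S, j):
    Phi_S = Phi[:, S]
    w = np.linalg.solve(Phi_S, t)
    Phi_S = Phi_S * np.sign(w)                  # column reflections make the weights positive
    v = np.linalg.solve(Phi_S, Phi[:, j])
    return np.all(v > 0) or np.all(v < 0)       # inside the cone of +/- the active columns

rng = np.random.default_rng(6)
Phi = rng.standard_normal((3, 8))
t = rng.standard_normal(3)
hits = [j for j in range(3, 8) if in_cone_of_active_columns(Phi, t, S=[0, 1, 2], j=j)]
print("columns inside the cone (each rules out a local minimum):", hits)
```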
65 Figure 1 (caption): an example with N = 2 and M = 3 and a basic feasible solution using the columns Φ = [φ1 φ2 ]. [sent-185, score-0.206]
66 Left: In this case, x = φ3 does not penetrate the convex cone containing t, and we do not satisfy the conditions of Theorem 2. [sent-186, score-0.252]
67 This configuration does represent a minimizing basic feasible solution. [sent-187, score-0.429]
68 Right: Now x is in the cone and therefore, we know that we are not at a local minimum; but this configuration does represent a local minimum to current LSM methods. [sent-188, score-0.737]
69 Alternatively, we can cast this geometric perspective in terms of relative cone sizes. [sent-189, score-0.184]
70 For example, let CΦ represent the convex cone (and its reflection) formed by Φ. [sent-190, score-0.283]
71 Then we are not at a local minimum to L(γ) if there exists a second convex cone C formed from a subset of columns of Φ such that t ∈ C ⊂ CΦ , i.e., a strictly tighter cone that still contains t. [sent-191, score-0.687]
72 In Figure 1(right), we obtain a tighter cone by swapping x for φ2 . [sent-194, score-0.184]
73 The converse, however, need not hold: if all x are not in the convex cone of Φ, we still may not be at a local minimum. [sent-197, score-0.404]
74 In fact, to guarantee a local minimum, all x must be reasonably far from this cone as quantified by (15). [sent-198, score-0.373]
75 Of course, the ultimate reduction in local minima from the (M−1 choose N) + 1 to (M choose N) bounds is dependent on the distribution of the basis vectors. [sent-199, score-0.543]
76 Table 1: Given 1000 trials where FOCUSS has converged to a suboptimal local minimum, we tabulate the percentage of times the local minimum is also a local minimum to SBL. [sent-209, score-0.976]
77 7 However, we will now proceed to empirically demonstrate that the overall reduction in local minima is substantial when the basis vectors are randomly distributed. [sent-214, score-0.676]
78 5 Empirical Comparisons. To show that the potential reduction in local minima derived above translates into concrete results, we conducted a simulation study using randomized basis vectors distributed isometrically in t-space. [sent-215, score-0.673]
79 Randomized dictionaries are of interest in signal processing and other disciplines [2, 7] and represent a viable benchmark for testing basis selection methods. [sent-216, score-0.234]
80 Our goal was to demonstrate that current LSM algorithms often converge to local minima that do not exist in the SBL cost function. [sent-218, score-0.589]
81 To accomplish this, we repeated the following procedure for dictionaries of various sizes. [sent-219, score-0.143]
82 Sparse weight vectors w0 are randomly generated with ‖w0‖_0 = 7 (and uniformly distributed amplitudes on the nonzero components). [sent-221, score-0.247]
83 The LSM algorithm is then presented with t = Φw0 and Φ and attempts to learn the minimum ℓ0 -norm solution. [sent-223, score-0.185]
84 The experiment is repeated a sufficient number of times such that we collect 1000 examples where the LSM algorithm converges to a local minimum. [sent-224, score-0.181]
85 In all these cases, we check if the condition stipulated by Theorem 2 applies, allowing us to determine if the given solution is a local minimum to the SBL algorithm or not. [sent-225, score-0.445]
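The protocol described in this passage can be reproduced in outline. The sketch below is not the authors' code: it uses a basic FOCUSS-style reweighted minimum-norm iteration as a stand-in LSM algorithm and then applies the condition-(15) screen to the support it converges to; sizes, tolerances, and iteration counts are all hypothetical.

```python
# Sketch of the empirical protocol: random dictionary, random 7-sparse source,
# a FOCUSS-style iteration as the LSM stand-in, then the Theorem-2 screen on
# each suboptimal solution it converges to.  All settings are hypothetical.
import numpy as np

def focuss(Phi, t, iters=100, floor=1e-8):
    # Basic reweighted minimum-norm iteration: w <- W (Phi W)^+ t with W = diag(|w|).
    w = np.ones(Phi.shape[1])
    for _ in range(iters):
        W = np.diag(np.maximum(np.abs(w), floor))
        w = W @ np.linalg.pinv(Phi @ W) @ t
    return w

def violates_condition_15(Phi, t, S):
    Phi_S = Phi[:, S]
    w = np.linalg.solve(Phi_S, t)
    for j in set(range(Phi.shape[1])) - set(S):
        r = np.linalg.solve(Phi_S, Phi[:, j]) / w
        if r.sum() ** 2 - (r ** 2).sum() >= 0:
            return True                        # cannot be an SBL local minimum
    return False

rng = np.random.default_rng(7)
N, M, k0 = 20, 40, 7
ruled_out = trials = 0
while trials < 50:                             # the paper collects 1000 suboptimal trials
    Phi = rng.standard_normal((N, M))
    Phi /= np.linalg.norm(Phi, axis=0)
    w0 = np.zeros(M)
    w0[rng.choice(M, size=k0, replace=False)] = rng.uniform(0.5, 1.5, k0) * rng.choice([-1, 1], k0)
    t = Phi @ w0
    w_hat = focuss(Phi, t)
    if np.sum(np.abs(w_hat) > 1e-4) > k0:      # converged to a suboptimal (non-sparsest) solution
        trials += 1
        S = list(np.argsort(-np.abs(w_hat))[:N])
        ruled_out += violates_condition_15(Phi, t, S)
print(f"{ruled_out}/{trials} suboptimal LSM minima fail the SBL local-minimum condition")
```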
86 We note that the larger the overcompleteness ratio M/N , the larger the total number of LSM local minima (via the bounds presented earlier). [sent-227, score-0.524]
87 In roughly 50% of these cases, SBL (initialized at the LSM local minimum) escaped to find the maximally sparse solution. [sent-230, score-0.206]
88 The remaining times, it still escaped in accordance with theory; however, it converged to another local minimum. [sent-231, score-0.22]
89 In contrast, when we initialize other LSM algorithms at an SBL local minimum, we always remain trapped as expected. [sent-232, score-0.217]
90 6 Discussion. In practice, we have consistently observed that SBL outperforms current LSM algorithms in finding maximally sparse solutions. [sent-233, score-0.272]
91 The results of this paper provide a very plausible explanation for this improved performance: conventional LSM procedures are very likely to converge to local minima that do not exist in the SBL landscape. [sent-236, score-0.543]
92 For example (footnote 7), in the special case where t is proportional to a single column of Φ, we can show that the number of local minima reduces from (M−1 choose N) + 1 to 1, i.e., the global minimum is the only local minimum. [sent-237, score-0.509]
93 In contrast, we can show that, up to a scale factor, any minimum of L(γ) must also be a minimum of min_γ Σ_{i=1}^{N} log λ_i(γ), s.t. [sent-246, score-0.432]
94 As weights go to zero (i.e., as we approach a basic feasible solution), we enter the basin of a local minimum because the associated log |wi| terms become enormously negative; hence the one-to-one correspondence between basic feasible solutions and local minima of the LSM algorithms. [sent-253, score-1.62]
95 In other words, since Φ is overcomplete, up to M − N of the γi 's can be zero while still maintaining a full set of nonzero eigenvalues of ΦΓΦ^T , so no term in the summation is driven towards minus infinity as occurred above. [sent-255, score-0.249]
96 Thus, we can switch from one basic feasible solution to another in many instances while still reducing our objective function. [sent-256, score-0.469]
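The contrast between the two landscapes can be seen numerically: the Σ log|wi|-style terms used by LSM diversity measures diverge to −∞ as individual weights vanish, whereas Σ log λi(ΦΓΦ^T) stays finite as long as at least N hyperparameters remain active. A small sketch with hypothetical data:

```python
# Sketch: LSM-style log|w_i| terms blow up as weights vanish, while
# sum_i log lambda_i(Phi Gamma Phi^T) stays finite with >= N active gammas.
import numpy as np

rng = np.random.default_rng(8)
N, M = 5, 15
Phi = rng.standard_normal((N, M))

for n_active in (M, 10, N):                    # zero out progressively more gamma_i
    lam = np.linalg.eigvalsh(Phi[:, :n_active] @ Phi[:, :n_active].T)
    print(n_active, "active gammas: sum log lambda_i =", round(np.sum(np.log(lam)), 2))

w = np.ones(M)
for eps in (1e-1, 1e-4, 1e-8):                 # one LSM weight heading to zero
    w[0] = eps
    print("sum log|w_i| with w_0 =", eps, ":", round(np.sum(np.log(np.abs(w))), 2))
```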
97 It is in this respect that SBL approximates the minimization of the alternative objective posed by (14). [sent-257, score-0.131]
98 Jeffs, “On the design of maximally sparse beamforming arrays,” IEEE Transactions on Antennas and Propagation, vol. [sent-279, score-0.206]
99 Rao, “Sparse signal reconstruction from limited data using FOCUSS: A re-weighted minimum norm algorithm,” IEEE Transactions on Signal Processing, vol. [sent-288, score-0.233]
100 Elad, “Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization,” Proc. [sent-309, score-0.233]
wordName wordTfidf (topN-words)
[('sbl', 0.507), ('lsm', 0.402), ('minima', 0.301), ('feasible', 0.216), ('nonzero', 0.206), ('minimum', 0.185), ('local', 0.181), ('cone', 0.159), ('focuss', 0.145), ('basic', 0.125), ('sparse', 0.118), ('bx', 0.097), ('maximally', 0.088), ('diag', 0.085), ('dictionaries', 0.084), ('tt', 0.08), ('solution', 0.079), ('overcomplete', 0.077), ('rao', 0.077), ('urp', 0.072), ('wipf', 0.072), ('basis', 0.071), ('columns', 0.069), ('dictionary', 0.067), ('solutions', 0.066), ('convex', 0.064), ('figueiredo', 0.063), ('locally', 0.062), ('minimizing', 0.057), ('minimization', 0.05), ('objective', 0.049), ('wi', 0.049), ('signal', 0.048), ('sparsest', 0.048), ('quantify', 0.043), ('cost', 0.043), ('maintaining', 0.043), ('overcompleteness', 0.042), ('mn', 0.041), ('realized', 0.041), ('entries', 0.041), ('vectors', 0.041), ('minimized', 0.041), ('alternate', 0.04), ('nonetheless', 0.04), ('bayesian', 0.039), ('converged', 0.039), ('surrogate', 0.038), ('theorem', 0.038), ('sparsity', 0.038), ('converge', 0.037), ('trapped', 0.036), ('somehow', 0.036), ('degenerate', 0.036), ('hyperparameters', 0.036), ('regardless', 0.036), ('elements', 0.035), ('donoho', 0.034), ('ultimate', 0.034), ('must', 0.033), ('fewer', 0.032), ('accomplish', 0.032), ('posed', 0.032), ('via', 0.031), ('importantly', 0.031), ('represent', 0.031), ('globally', 0.03), ('diversity', 0.029), ('formed', 0.029), ('min', 0.029), ('conditions', 0.029), ('empirically', 0.028), ('analogous', 0.027), ('initialized', 0.027), ('procedure', 0.027), ('reduction', 0.027), ('column', 0.027), ('stated', 0.027), ('demonstrate', 0.027), ('weights', 0.026), ('transactions', 0.026), ('concrete', 0.026), ('diego', 0.026), ('randomized', 0.026), ('march', 0.025), ('tighter', 0.025), ('lim', 0.025), ('likewise', 0.025), ('limiting', 0.025), ('null', 0.025), ('leading', 0.025), ('potentially', 0.025), ('developed', 0.025), ('global', 0.025), ('geometric', 0.025), ('xt', 0.024), ('associated', 0.024), ('procedures', 0.024), ('arrive', 0.024), ('suboptimal', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 207 nips-2004-ℓ₀-norm Minimization for Basis Selection
Author: David P. Wipf, Bhaskar D. Rao
Abstract: Finding the sparsest, or minimum ℓ0 -norm, representation of a signal given an overcomplete dictionary of basis vectors is an important problem in many application domains. Unfortunately, the required optimization problem is often intractable because there is a combinatorial increase in the number of local minima as the number of candidate basis vectors increases. This deficiency has prompted most researchers to instead minimize surrogate measures, such as the ℓ1 -norm, that lead to more tractable computational methods. The downside of this procedure is that we have now introduced a mismatch between our ultimate goal and our objective function. In this paper, we demonstrate a sparse Bayesian learning-based method of minimizing the ℓ0 -norm while reducing the number of troublesome local minima. Moreover, we derive necessary conditions for local minima to occur via this approach and empirically demonstrate that there are typically many fewer for general problems of interest. 1
2 0.066950642 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound
Author: Dori Peleg, Ron Meir
Abstract: A novel linear feature selection algorithm is presented based on the global minimization of a data-dependent generalization error bound. Feature selection and scaling algorithms often lead to non-convex optimization problems, which in many previous approaches were addressed through gradient descent procedures that can only guarantee convergence to a local minimum. We propose an alternative approach, whereby the global solution of the non-convex optimization problem is derived via an equivalent optimization problem. Moreover, the convex optimization task is reduced to a conic quadratic programming problem for which efficient solvers are available. Highly competitive numerical results on both artificial and real-world data sets are reported. 1
3 0.064849176 196 nips-2004-Triangle Fixing Algorithms for the Metric Nearness Problem
Author: Suvrit Sra, Joel Tropp, Inderjit S. Dhillon
Abstract: Various problems in machine learning, databases, and statistics involve pairwise distances among a set of objects. It is often desirable for these distances to satisfy the properties of a metric, especially the triangle inequality. Applications where metric data is useful include clustering, classification, metric-based indexing, and approximation algorithms for various graph problems. This paper presents the Metric Nearness Problem: Given a dissimilarity matrix, find the “nearest” matrix of distances that satisfy the triangle inequalities. For p nearness measures, this paper develops efficient triangle fixing algorithms that compute globally optimal solutions by exploiting the inherent structure of the problem. Empirically, the algorithms have time and storage costs that are linear in the number of triangle constraints. The methods can also be easily parallelized for additional speed. 1
4 0.063874967 161 nips-2004-Self-Tuning Spectral Clustering
Author: Lihi Zelnik-manor, Pietro Perona
Abstract: We study a number of open issues in spectral clustering: (i) Selecting the appropriate scale of analysis, (ii) Handling multi-scale data, (iii) Clustering with irregular background clutter, and, (iv) Finding automatically the number of groups. We first propose that a ‘local’ scale should be used to compute the affinity between each pair of points. This local scaling leads to better clustering especially when the data includes multiple scales and when the clusters are placed within a cluttered background. We further suggest exploiting the structure of the eigenvectors to infer automatically the number of groups. This leads to a new algorithm in which the final randomly initialized k-means stage is eliminated. 1
5 0.063551307 79 nips-2004-Hierarchical Eigensolver for Transition Matrices in Spectral Methods
Author: Chakra Chennubhotla, Allan D. Jepson
Abstract: We show how to build hierarchical, reduced-rank representation for large stochastic matrices and use this representation to design an efficient algorithm for computing the largest eigenvalues, and the corresponding eigenvectors. In particular, the eigen problem is first solved at the coarsest level of the representation. The approximate eigen solution is then interpolated over successive levels of the hierarchy. A small number of power iterations are employed at each stage to correct the eigen solution. The typical speedups obtained by a Matlab implementation of our fast eigensolver over a standard sparse matrix eigensolver [13] are at least a factor of ten for large image sizes. The hierarchical representation has proven to be effective in a min-cut based segmentation algorithm that we proposed recently [8]. 1
6 0.061182749 172 nips-2004-Sparse Coding of Natural Images Using an Overcomplete Set of Limited Capacity Units
7 0.060545605 11 nips-2004-A Second Order Cone programming Formulation for Classifying Missing Data
8 0.060096659 164 nips-2004-Semi-supervised Learning by Entropy Minimization
9 0.05879315 42 nips-2004-Computing regularization paths for learning multiple kernels
10 0.058179993 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss
11 0.058135197 18 nips-2004-Algebraic Set Kernels with Application to Inference Over Local Image Representations
12 0.056795616 70 nips-2004-Following Curved Regularized Optimization Solution Paths
13 0.055534557 113 nips-2004-Maximum-Margin Matrix Factorization
14 0.053109188 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning
15 0.052917168 185 nips-2004-The Convergence of Contrastive Divergences
16 0.052565921 162 nips-2004-Semi-Markov Conditional Random Fields for Information Extraction
17 0.05045728 27 nips-2004-Bayesian Regularization and Nonnegative Deconvolution for Time Delay Estimation
18 0.050281268 160 nips-2004-Seeing through water
19 0.050251141 2 nips-2004-A Direct Formulation for Sparse PCA Using Semidefinite Programming
20 0.050110359 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning
topicId topicWeight
[(0, -0.17), (1, 0.03), (2, 0.003), (3, 0.004), (4, -0.046), (5, -0.031), (6, -0.011), (7, 0.024), (8, -0.006), (9, -0.028), (10, 0.022), (11, -0.077), (12, 0.041), (13, -0.023), (14, -0.019), (15, 0.008), (16, 0.022), (17, -0.003), (18, -0.065), (19, -0.041), (20, 0.109), (21, 0.028), (22, -0.04), (23, -0.041), (24, 0.0), (25, 0.054), (26, 0.01), (27, -0.012), (28, 0.009), (29, 0.038), (30, 0.001), (31, 0.057), (32, -0.021), (33, 0.099), (34, -0.087), (35, 0.085), (36, 0.01), (37, -0.052), (38, -0.08), (39, 0.089), (40, 0.111), (41, -0.089), (42, -0.118), (43, 0.014), (44, 0.103), (45, -0.089), (46, 0.019), (47, -0.006), (48, 0.177), (49, -0.199)]
simIndex simValue paperId paperTitle
same-paper 1 0.95996368 207 nips-2004-ℓ₀-norm Minimization for Basis Selection
Author: David P. Wipf, Bhaskar D. Rao
Abstract: Finding the sparsest, or minimum ℓ0 -norm, representation of a signal given an overcomplete dictionary of basis vectors is an important problem in many application domains. Unfortunately, the required optimization problem is often intractable because there is a combinatorial increase in the number of local minima as the number of candidate basis vectors increases. This deficiency has prompted most researchers to instead minimize surrogate measures, such as the ℓ1 -norm, that lead to more tractable computational methods. The downside of this procedure is that we have now introduced a mismatch between our ultimate goal and our objective function. In this paper, we demonstrate a sparse Bayesian learning-based method of minimizing the ℓ0 -norm while reducing the number of troublesome local minima. Moreover, we derive necessary conditions for local minima to occur via this approach and empirically demonstrate that there are typically many fewer for general problems of interest. 1
2 0.54659837 2 nips-2004-A Direct Formulation for Sparse PCA Using Semidefinite Programming
Author: Alexandre D'aspremont, Laurent E. Ghaoui, Michael I. Jordan, Gert R. Lanckriet
Abstract: We examine the problem of approximating, in the Frobenius-norm sense, a positive, semidefinite symmetric matrix by a rank-one matrix, with an upper bound on the cardinality of its eigenvector. The problem arises in the decomposition of a covariance matrix into sparse factors, and has wide applications ranging from biology to finance. We use a modification of the classical variational representation of the largest eigenvalue of a symmetric matrix, where cardinality is constrained, and derive a semidefinite programming based relaxation for our problem. 1
3 0.48668641 96 nips-2004-Learning, Regularization and Ill-Posed Inverse Problems
Author: Lorenzo Rosasco, Andrea Caponnetto, Ernesto D. Vito, Francesca Odone, Umberto D. Giovannini
Abstract: Many works have shown that strong connections relate learning from examples to regularization techniques for ill-posed inverse problems. Nevertheless by now there was no formal evidence neither that learning from examples could be seen as an inverse problem nor that theoretical results in learning theory could be independently derived using tools from regularization theory. In this paper we provide a positive answer to both questions. Indeed, considering the square loss, we translate the learning problem in the language of regularization theory and show that consistency results and optimal regularization parameter choice can be derived by the discretization of the corresponding inverse problem. 1
4 0.48518711 196 nips-2004-Triangle Fixing Algorithms for the Metric Nearness Problem
Author: Suvrit Sra, Joel Tropp, Inderjit S. Dhillon
Abstract: Various problems in machine learning, databases, and statistics involve pairwise distances among a set of objects. It is often desirable for these distances to satisfy the properties of a metric, especially the triangle inequality. Applications where metric data is useful include clustering, classification, metric-based indexing, and approximation algorithms for various graph problems. This paper presents the Metric Nearness Problem: Given a dissimilarity matrix, find the “nearest” matrix of distances that satisfy the triangle inequalities. For ℓp nearness measures, this paper develops efficient triangle fixing algorithms that compute globally optimal solutions by exploiting the inherent structure of the problem. Empirically, the algorithms have time and storage costs that are linear in the number of triangle constraints. The methods can also be easily parallelized for additional speed. 1
5 0.48002294 172 nips-2004-Sparse Coding of Natural Images Using an Overcomplete Set of Limited Capacity Units
Author: Eizaburo Doi, Michael S. Lewicki
Abstract: It has been suggested that the primary goal of the sensory system is to represent input in such a way as to reduce the high degree of redundancy. Given a noisy neural representation, however, solely reducing redundancy is not desirable, since redundancy is the only clue to reduce the effects of noise. Here we propose a model that best balances redundancy reduction and redundant representation. Like previous models, our model accounts for the localized and oriented structure of simple cells, but it also predicts a different organization for the population. With noisy, limited-capacity units, the optimal representation becomes an overcomplete, multi-scale representation, which, compared to previous models, is in closer agreement with physiological data. These results offer a new perspective on the expansion of the number of neurons from retina to V1 and provide a theoretical model of incorporating useful redundancy into efficient neural representations. 1
6 0.41668656 57 nips-2004-Economic Properties of Social Networks
7 0.41542709 185 nips-2004-The Convergence of Contrastive Divergences
8 0.41498423 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound
9 0.40429553 27 nips-2004-Bayesian Regularization and Nonnegative Deconvolution for Time Delay Estimation
10 0.40158492 180 nips-2004-Synchronization of neural networks by mutual learning and its application to cryptography
11 0.37440315 161 nips-2004-Self-Tuning Spectral Clustering
12 0.3711825 79 nips-2004-Hierarchical Eigensolver for Transition Matrices in Spectral Methods
13 0.35938072 127 nips-2004-Neighbourhood Components Analysis
14 0.35304996 1 nips-2004-A Cost-Shaping LP for Bellman Error Minimization with Performance Guarantees
15 0.35092771 105 nips-2004-Log-concavity Results on Gaussian Process Methods for Supervised and Unsupervised Learning
16 0.34947327 171 nips-2004-Solitaire: Man Versus Machine
17 0.34916639 70 nips-2004-Following Curved Regularized Optimization Solution Paths
18 0.34415284 190 nips-2004-The Rescorla-Wagner Algorithm and Maximum Likelihood Estimation of Causal Parameters
19 0.34218884 6 nips-2004-A Hidden Markov Model for de Novo Peptide Sequencing
20 0.33891892 52 nips-2004-Discrete profile alignment via constrained information bottleneck
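The same-paper entry above describes a sparse Bayesian learning (SBL) approach to finding sparse representations from an overcomplete dictionary. The following is a minimal sketch of the standard SBL/ARD EM hyperparameter updates on a synthetic problem, intended only to illustrate that family of methods; the dictionary size, noise variance sigma2, iteration count, and pruning threshold are assumptions chosen for the example, not values taken from the paper.

```python
import numpy as np

def sbl_em(Phi, t, sigma2=1e-6, n_iter=200, prune_tol=1e-8):
    """Minimal sparse Bayesian learning (ARD) EM sketch for t ~= Phi @ w.

    Hyperparameters gamma[i] are prior variances on w[i]; components whose
    gamma shrinks toward zero are effectively pruned, yielding a sparse w.
    This is an illustrative sketch, not a reproduction of any specific paper.
    """
    N, M = Phi.shape
    gamma = np.ones(M)
    mu = np.zeros(M)
    for _ in range(n_iter):
        Gamma = np.diag(gamma)
        # Posterior moments of w under the Gaussian likelihood/prior pair.
        Sigma_t = sigma2 * np.eye(N) + Phi @ Gamma @ Phi.T
        K = Gamma @ Phi.T @ np.linalg.solve(Sigma_t, np.eye(N))
        mu = K @ t                           # posterior mean of w
        Sigma_w = Gamma - K @ Phi @ Gamma    # posterior covariance of w
        # EM update of the hyperparameters.
        gamma = mu**2 + np.diag(Sigma_w)
        gamma[gamma < prune_tol] = prune_tol
    return mu, gamma

# Toy example: overcomplete dictionary, a 3-sparse generating weight vector.
rng = np.random.default_rng(0)
N, M = 20, 50
Phi = rng.standard_normal((N, M))
w_true = np.zeros(M)
w_true[[3, 17, 40]] = [1.0, -2.0, 1.5]
t = Phi @ w_true
w_est, _ = sbl_em(Phi, t)
print(np.flatnonzero(np.abs(w_est) > 1e-3))  # indices of the recovered support
```

In this setup, the hyperparameters attached to unused dictionary columns shrink toward zero during the EM iterations, which is what drives the sparsity of the recovered weight vector.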
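Entry 2 above (sparse PCA via semidefinite programming) describes relaxing a cardinality-constrained variational characterization of the leading eigenvector into a semidefinite program. The snippet below is a rough sketch of that style of relaxation using the cvxpy modeling library; the toy covariance matrix, the cardinality target k, and the specific constraint set are illustrative assumptions and are not claimed to match the paper's exact formulation.

```python
import cvxpy as cp
import numpy as np

# Toy covariance matrix with one dominant, sparse direction.
rng = np.random.default_rng(1)
n = 8
u = np.zeros(n)
u[:3] = [1.0, -1.0, 0.5]
Sigma = np.outer(u, u) + 0.1 * np.eye(n)

k = 3  # target cardinality of the leading sparse component
X = cp.Variable((n, n), symmetric=True)
constraints = [X >> 0,                   # positive semidefinite
               cp.trace(X) == 1,         # unit-norm "eigenvector" normalization
               cp.sum(cp.abs(X)) <= k]   # convex surrogate for the cardinality bound
prob = cp.Problem(cp.Maximize(cp.trace(Sigma @ X)), constraints)
prob.solve()

# Recover an approximate sparse principal direction from the relaxed solution.
eigvals, eigvecs = np.linalg.eigh(X.value)
x_hat = eigvecs[:, -1]
print(np.round(x_hat, 3))
```

When the relaxation is tight, X.value is close to rank one and its leading eigenvector gives the sparse principal direction; otherwise the eigenvector is only an approximation to it.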
topicId topicWeight
[(13, 0.109), (15, 0.129), (25, 0.011), (26, 0.065), (31, 0.027), (33, 0.212), (35, 0.021), (39, 0.033), (50, 0.059), (64, 0.218)]
simIndex simValue paperId paperTitle
same-paper 1 0.87205511 207 nips-2004-ℓ₀-norm Minimization for Basis Selection
Author: David P. Wipf, Bhaskar D. Rao
Abstract: Finding the sparsest, or minimum ℓ0 -norm, representation of a signal given an overcomplete dictionary of basis vectors is an important problem in many application domains. Unfortunately, the required optimization problem is often intractable because there is a combinatorial increase in the number of local minima as the number of candidate basis vectors increases. This deficiency has prompted most researchers to instead minimize surrogate measures, such as the ℓ1 -norm, that lead to more tractable computational methods. The downside of this procedure is that we have now introduced a mismatch between our ultimate goal and our objective function. In this paper, we demonstrate a sparse Bayesian learning-based method of minimizing the ℓ0 -norm while reducing the number of troublesome local minima. Moreover, we derive necessary conditions for local minima to occur via this approach and empirically demonstrate that there are typically many fewer for general problems of interest. 1
2 0.78779185 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
Author: Ofer Dekel, Shai Shalev-shwartz, Yoram Singer
Abstract: Prediction suffix trees (PST) provide a popular and effective tool for tasks such as compression, classification, and language modeling. In this paper we take a decision theoretic view of PSTs for the task of sequence prediction. Generalizing the notion of margin to PSTs, we present an online PST learning algorithm and derive a loss bound for it. The depth of the PST generated by this algorithm scales linearly with the length of the input. We then describe a self-bounded enhancement of our learning algorithm which automatically grows a bounded-depth PST. We also prove an analogous mistake-bound for the self-bounded algorithm. The result is an efficient algorithm that neither relies on a-priori assumptions on the shape or maximal depth of the target PST nor does it require any parameters. To our knowledge, this is the first provably-correct PST learning algorithm which generates a bounded-depth PST while being competitive with any fixed PST determined in hindsight. 1
3 0.78554124 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms
Author: Nicolò Cesa-bianchi, Claudio Gentile, Luca Zaniboni
Abstract: We provide a worst-case analysis of selective sampling algorithms for learning linear threshold functions. The algorithms considered in this paper are Perceptron-like algorithms, i.e., algorithms which can be efficiently run in any reproducing kernel Hilbert space. Our algorithms exploit a simple margin-based randomized rule to decide whether to query the current label. We obtain selective sampling algorithms achieving on average the same bounds as those proven for their deterministic counterparts, but using far fewer labels. We complement our theoretical findings with an empirical comparison on two text categorization tasks. The outcome of these experiments is largely predicted by our theoretical results: Our selective sampling algorithms tend to perform as well as the algorithms receiving the true label after each classification, while observing in practice substantially fewer labels. 1
4 0.78462106 69 nips-2004-Fast Rates to Bayes for Kernel Machines
Author: Ingo Steinwart, Clint Scovel
Abstract: We establish learning rates to the Bayes risk for support vector machines (SVMs) with hinge loss. In particular, for SVMs with Gaussian RBF kernels we propose a geometric condition for distributions which can be used to determine approximation properties of these kernels. Finally, we compare our methods with those of a recent paper by G. Blanchard et al. 1
5 0.78234154 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
Author: Aharon Bar-hillel, Adam Spiro, Eran Stark
Abstract: Spike sorting involves clustering spike trains recorded by a microelectrode according to the source neuron. It is a complicated problem, which requires a lot of human labor, partly due to the non-stationary nature of the data. We propose an automated technique for the clustering of non-stationary Gaussian sources in a Bayesian framework. At a first search stage, data is divided into short time frames and candidate descriptions of the data as a mixture of Gaussians are computed for each frame. At a second stage transition probabilities between candidate mixtures are computed, and a globally optimal clustering is found as the MAP solution of the resulting probabilistic model. Transition probabilities are computed using local stationarity assumptions and are based on a Gaussian version of the Jensen-Shannon divergence. The method was applied to several recordings. The performance appeared almost indistinguishable from humans in a wide range of scenarios, including movement, merges, and splits of clusters. 1
6 0.78192073 131 nips-2004-Non-Local Manifold Tangent Learning
7 0.78068793 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss
8 0.78026354 102 nips-2004-Learning first-order Markov models for control
9 0.77966559 163 nips-2004-Semi-parametric Exponential Family PCA
10 0.77634478 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound
11 0.77586108 70 nips-2004-Following Curved Regularized Optimization Solution Paths
12 0.7756781 124 nips-2004-Multiple Alignment of Continuous Time Series
13 0.77475703 116 nips-2004-Message Errors in Belief Propagation
14 0.77440327 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach
15 0.77438474 127 nips-2004-Neighbourhood Components Analysis
16 0.77431816 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning
17 0.77350205 103 nips-2004-Limits of Spectral Clustering
18 0.77313381 45 nips-2004-Confidence Intervals for the Area Under the ROC Curve
19 0.77300149 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
20 0.77298921 10 nips-2004-A Probabilistic Model for Online Document Clustering with Application to Novelty Detection
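Entry 3 in the second listing above (worst-case analysis of selective sampling) describes Perceptron-like learners that use a simple margin-based randomized rule to decide whether to query each label. Below is a small sketch of one common form of that rule, querying with probability b / (b + |margin|); the parameter b, the synthetic data stream, and the mistake-driven update are assumptions for illustration rather than a reproduction of the paper's exact algorithm.

```python
import numpy as np

def selective_sampling_perceptron(X, y, b=1.0, seed=0):
    """Perceptron that queries labels selectively via a margin-based coin flip.

    For each example the label is requested with probability b / (b + |margin|),
    so confident predictions (large |margin|) rarely trigger a query.
    Returns the learned weights and the number of labels actually queried.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    queried = 0
    for x, label in zip(X, y):
        margin = w @ x
        if rng.random() < b / (b + abs(margin)):
            queried += 1                    # label is revealed only on a query
            if label * margin <= 0:         # standard Perceptron update on mistakes
                w = w + label * x
    return w, queried

# Toy linearly separable stream.
rng = np.random.default_rng(2)
X = rng.standard_normal((2000, 10))
w_star = rng.standard_normal(10)
y = np.sign(X @ w_star)
w_hat, n_queries = selective_sampling_perceptron(X, y)
acc = np.mean(np.sign(X @ w_hat) == y)
print(f"queried {n_queries} of {len(y)} labels, training accuracy {acc:.3f}")
```

Larger values of b make the learner query more aggressively; the point of the sketch is simply that confident predictions rarely trigger a label request while the update itself remains the usual mistake-driven Perceptron step.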