nips nips2013 nips2013-259 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yu-Xiang Wang, Huan Xu, Chenlei Leng
Abstract: Sparse Subspace Clustering (SSC) and Low-Rank Representation (LRR) are both considered as the state-of-the-art methods for subspace clustering. The two methods are fundamentally similar in that both are convex optimizations exploiting the intuition of "Self-Expressiveness". The main difference is that SSC minimizes the vector ℓ1-norm of the representation matrix to induce sparsity, while LRR minimizes the nuclear norm (aka trace norm) to promote a low-rank structure. Because the representation matrix is often simultaneously sparse and low-rank, we propose a new algorithm, termed Low-Rank Sparse Subspace Clustering (LRSSC), by combining SSC and LRR, and develop theoretical guarantees of when the algorithm succeeds. The results reveal interesting insights into the strengths and weaknesses of SSC and LRR, and demonstrate how LRSSC can combine the advantages of both methods in preserving the "Self-Expressiveness Property" and "Graph Connectivity" at the same time.
The underlying assumption is that high-dimensional data often lie in a low-dimensional subspace [4]. Subspace clustering deals with exactly this structure by grouping data points according to their underlying subspaces. Applications include motion segmentation and face clustering in computer vision [16, 8], hybrid system identification in control [26, 2], and community clustering in social networks [12], to name a few.
Among these algorithms, LRR and SSC, based on minimizing the nuclear norm and the ℓ1-norm of the representation matrix respectively, remain the top performers on the Hopkins155 motion segmentation benchmark dataset [23]. Moreover, they are among the few subspace clustering algorithms supported by theoretical guarantees: both algorithms are known to succeed when the subspaces are independent [27, 16]. Later, [8] showed that the subspaces being disjoint is sufficient for SSC to succeed (disjoint subspaces only intersect at the origin), and [22] further relaxed this condition to include some cases of overlapping subspaces.
SSC's solution is sometimes so sparse that the affinity graph of data from a single subspace may not form a connected body [19]. Hence, a natural question is whether combining the two algorithms leads to a better method, particularly since the underlying representation matrix we want to recover is simultaneously low-rank and sparse.
In this paper, we propose Low-Rank Sparse Subspace Clustering (LRSSC), which minimizes a weighted sum of the nuclear norm and the vector ℓ1-norm of the representation matrix.
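As a concrete reading of the objective, here is a minimal numpy sketch (the function name is ours, not from the paper) that evaluates the weighted sum for a candidate representation matrix C:

```python
import numpy as np

def lrssc_objective(C, lam):
    # LRSSC objective value: nuclear norm plus lam times the elementwise l1 norm
    nuclear = np.linalg.svd(C, compute_uv=False).sum()
    return nuclear + lam * np.abs(C).sum()
```

Taking lam → 0 recovers the LRR objective, while dividing by lam and letting lam → ∞ recovers the SSC objective.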
Indeed, our experiments show that LRSSC works well in cases where the data distribution is skewed (so that graph connectivity becomes an issue for SSC) and the subspaces are not independent (so that LRR gives poor separation). These insights should be useful when developing subspace clustering algorithms and applications. We remark that in the general regression setup, simultaneous nuclear norm and ℓ1-norm regularization has been studied before [21]. However, our focus is on the subspace clustering problem, and hence the results and analysis are completely different.
2 Problem Setup. Notations: We denote the data matrix by X ∈ R^{n×N}, where each column of X (normalized to a unit vector) belongs to a union of L subspaces S1 ∪ S2 ∪ ... ∪ SL. Each subspace Sℓ contains Nℓ data samples, with N1 + N2 + ... + NL = N. In addition, we use ‖·‖ to denote the Euclidean norm (for vectors) or the spectral norm (for matrices) throughout the paper.
Spectral clustering techniques (e.g., [20]) are then applied to the affinity matrix W = |C| + |C|^T, where C is the solution to (1), to obtain the final clustering; |·| denotes the elementwise absolute value.
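This post-processing step can be sketched as follows (function names are ours; a full pipeline would additionally run k-means on the eigenvectors associated with the L smallest eigenvalues of the Laplacian):

```python
import numpy as np

def affinity_from_representation(C):
    # symmetric affinity W = |C| + |C|^T fed to spectral clustering
    return np.abs(C) + np.abs(C).T

def normalized_laplacian(W):
    # I - D^{-1/2} W D^{-1/2}; its near-zero eigenvalues reveal the cluster structure
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(W.sum(axis=1), 1e-12))
    return np.eye(W.shape[0]) - (W * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]
```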
Criterion of success: In the subspace clustering task, as opposed to compressive sensing or matrix completion, there is no "ground-truth" C to compare the solution against. Instead, the algorithm succeeds if the output matrix C is block diagonal (up to an appropriate permutation), with each subspace cluster represented by a disjoint block.
Given subspaces {Sℓ}, ℓ = 1, ..., L, and data points X sampled from these subspaces, we say a matrix C obeys the Self-Expressiveness Property (SEP) if the nonzero entries of each ci (the i-th column of C) correspond only to columns of X sampled from the same subspace as xi. Note that a solution obeying SEP alone does not imply that the clustering is correct, since each block may not be fully connected.
On the other hand, failure to achieve SEP does not necessarily imply a clustering error either, as the spectral clustering step may give a (sometimes perfect) solution even when there are connections between blocks. Notice that if C obeys SEP and each block is connected, we immediately obtain the correct clustering.
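The SEP check itself is mechanical; a minimal sketch (function name ours), given ground-truth subspace labels for a synthetic experiment:

```python
import numpy as np

def obeys_sep(C, labels, tol=1e-8):
    # True iff every entry of C above tol connects points from the same subspace
    labels = np.asarray(labels)
    for i in range(C.shape[1]):
        support = np.abs(C[:, i]) > tol
        if np.any(labels[support] != labels[i]):
            return False
    return True
```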
Given a dual point Λ = [ν1, ..., νN] ∈ {Λ1(X)}, we define the normalized dual matrix V for X as V(X) ≜ [ν1/‖ν1‖, ..., νN/‖νN‖], and the normalized dual matrix set {V(X)} as the collection of V(X) over all Λ ∈ {Λ1(X)}.
The incoherence µ in the above definition measures how separable the sample points in Sℓ are from sample points in other subspaces (small µ means more separable data). Our definition differs from Soltanolkotabi and Candès's definition of subspace incoherence [22] in that it is defined as a minimax over all possible dual directions.
Incoherence in the sense of [22, Definition 2.4] implies µ-minimax-incoherence, as their dual directions are contained in {V(X)}. If the subspaces are independent, i.e., dim(span(X)) equals the sum of dim(Sℓ) over ℓ = 1, ..., L, then for each X^{(ℓ)} one can always find a dual matrix V^{(ℓ)} ∈ {V(X^{(ℓ)})} whose column space is orthogonal to the span of all other subspaces.
To contrast, the incoherence parameter according to Definition 2.4 in [22] will be a positive value, potentially large if the angles between subspaces are small.
Suppose we have L disjoint 1-dimensional subspaces in R^n (L > n). Then the incoherence parameter µ(X^{(L)}) defined in [22] is at least cos(π/6).
The result also depends on the smallest singular value of a rank-d matrix (denoted by σd) and on the inradius of a convex body, defined below. The inradius of a convex body P, denoted r(P), is defined as the radius of the largest Euclidean ball inscribed in P.
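For the symmetric polytope conv(±x1, ..., ±xN) that appears in this line of analysis, the inradius equals the minimum, over unit directions u in the span of the points, of max_i |⟨xi, u⟩|. Sampling directions at random therefore yields an upper bound (a minimum over a subset of directions can only overestimate the true minimum). A minimal Monte Carlo sketch, with names of our own choosing:

```python
import numpy as np

def inradius_upper_bound(X, trials=2000, rng=None):
    # Monte Carlo upper bound on r(conv(+-x_1, ..., +-x_N)):
    # r = min over unit u in span(X) of max_i |<x_i, u>|
    rng = np.random.default_rng(rng)
    Q, _ = np.linalg.qr(X)              # orthonormal basis, so u stays in span(X)
    r = np.linalg.matrix_rank(X)
    Q = Q[:, :r]
    best = np.inf
    for _ in range(trials):
        u = Q @ rng.standard_normal(r)
        u /= np.linalg.norm(u)
        best = min(best, np.max(np.abs(X.T @ u)))
    return best
```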
The smallest singular value and the inradius measure how well each subspace is represented by its data samples. A small inradius or singular value implies either insufficient data or a skewed data distribution; in other words, the subspace is "poorly represented".
It is also easy to generalize this example to d-dimensional subspaces and to "random except K subspaces".
where X^{(ℓ)}_{−k} denotes X^{(ℓ)} with its k-th column removed, and σ_{dℓ}(X^{(ℓ)}_{−k}) represents the dℓ-th (smallest nonzero) singular value of the matrix X^{(ℓ)}_{−k}.
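This leave-one-out quantity is directly computable; a small sketch (the function name is ours):

```python
import numpy as np

def sigma_d_minus_k(X, k, d):
    # d-th largest singular value of X with its k-th column removed
    s = np.linalg.svd(np.delete(X, k, axis=1), compute_uv=False)
    return s[d - 1]
```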
First we write out the dual problem of (1):
Dual LRSSC: max_{Λ1, Λ2, Λ3} ⟨X, Λ1⟩ s.t. …
This leads to a set of optimality conditions, and leaves us to show the existence of a dual certificate satisfying these conditions.
We then construct two levels of fictitious optimizations (the main novelty of the proof) and build a dual certificate from the dual solutions of the fictitious optimization problems. Under conditions (2) and (3), we establish that this dual certificate meets all optimality conditions, hence certifying that SEP holds.
When the subspaces are highly correlated, i.e., the canonical angles between subspaces are small, the subspace incoherence µ defined in [22] can be quite large (close to 1). Thus, the success condition µ < r presented in [22] is violated. Using our new definition of incoherence µ, as long as the subspaces are "sufficiently independent" (regardless of their correlation), µ assumes very small values (e.g., Example 2), making SEP possible even if r is small, namely when subspaces are poorly represented.
The guarantee is strongest when λ → ∞ and becomes vacuous when λ → 0, unless the subspaces are independent (see Example 1).
Condition (2) is based on singular values and is hence computationally tractable. When λ → ∞, Theorem 1 reduces to the first computationally tractable guarantee for SSC that works for disjoint and potentially overlapping subspaces.
"Random sampling" assumes that for each ℓ, the data points in X^{(ℓ)} are i.i.d. uniformly distributed on the unit sphere of Sℓ.
Under the random sampling assumption, when λ is smaller than a threshold, the singular value condition (2) is better than the inradius condition (3).
Specifically, σ_{dℓ}(X^{(ℓ)}) > (1/4)·sqrt(Nℓ/dℓ) with high probability, so for some constant C > 1 the singular value condition is strictly better if
λ < C·(sqrt(Nℓ) − log(Nℓ/dℓ)) / (sqrt(Nℓ)·(1 + log(Nℓ/dℓ))), or, when Nℓ is large, λ < C / (1 + log(Nℓ/dℓ)).
57 1 + log (N /d ) N 1 + log (N /d ) By further assuming random subspace, we provide an upper bound of the incoherence µ. [sent-207, score-0.149]
58 n 96d log N (4) −1 The above condition is obtained from the singular value condition. [sent-214, score-0.149]
59 Using the inradius guarantee, n combined with Lemma 2 and 3, we have a different succeed condition requiring d < 96log(κ) for all log N 1 . [sent-215, score-0.254]
60 Ignoring constant terms, the condition on d is slightly better than (4) by a log λ> n log κ 96d log N −1 factor but the range of valid λ is significantly reduced. [sent-216, score-0.107]
4 Graph Connectivity Problem. The graph connectivity problem concerns whether, when SEP is satisfied, each disjoint block of the solution C to LRSSC represents a connected graph.
This is equivalent to the connectivity of the solution of the following fictitious optimization problem, in which each sample is constrained to be represented by samples from the same subspace:
min_{C^{(ℓ)}} ‖C^{(ℓ)}‖_* + λ‖C^{(ℓ)}‖_1 s.t. X^{(ℓ)} = X^{(ℓ)} C^{(ℓ)}. (5)
The graph connectivity of SSC is studied by [19] under deterministic conditions (to make the problem well-posed).
They show by a negative example that even if the well-posedness condition is satisfied, the solution of SSC may not satisfy graph connectivity when the dimension of the subspace is greater than 3. On the other hand, graph connectivity is not an issue for LRR: as the following proposition suggests, the intra-class connections of LRR's solution are inherently dense (fully connected).
When the subspaces are independent, X is not full rank, and the data points are randomly sampled from the unit sphere within each subspace, then the solution to LRR is dense (fully connected) within each block.
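This density claim can be illustrated numerically via the commonly cited closed form of noiseless LRR (min ‖C‖_* s.t. X = XC): the shape interaction matrix C = V_r V_r^T built from the skinny SVD of X. A sketch (function name ours):

```python
import numpy as np

def lrr_closed_form(X, tol=1e-10):
    # Shape-interaction-matrix solution C = V_r V_r^T of noiseless LRR,
    # where r is the numerical rank of X and X = U S V^T is the skinny SVD
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    r = int((s > tol * s[0]).sum())
    V = Vt[:r].T
    return V @ V.T
```

For independent subspaces this C is block diagonal, and the within-block entries are generically all nonzero, i.e., each block is fully connected.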
We believe it is possible (if tedious) to extend our guarantee to this noisy version, following the strategy of [29], which analyzed the noisy version of SSC. According to the noisy analysis of SSC, a rule of thumb for choosing the scales of β1 and β2 is
β1 = σ·(1/(1+λ)) / sqrt(2 log N), β2 = σ·(λ/(1+λ)) / sqrt(2 log N),
where λ is the tradeoff parameter used in the noiseless case (1), σ is the estimated noise level, and N is the total number of entries.
In the case of sparse corruption, one may use an ℓ1-norm penalty instead of the Frobenius norm.
Fast Numerical Algorithm: As the subspace clustering problem is usually large-scale, off-the-shelf SDP solvers are often too slow to use.
We reformulate (1) with auxiliary variables, X = XJ, J = C2 − diag(C2), J = C1, and update J, C1, C2 and the three dual variables alternately. Detailed derivations, update rules, a convergence guarantee, and the corresponding ADMM algorithm for the noisy version of LRSSC are available in the supplementary materials.
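One plausible instantiation of these alternating updates is sketched below; this is our own minimal sketch of an ADMM for the stated splitting (the paper's exact update rules are in its supplementary materials, and the function names, step size rho, and iteration count here are our assumptions). Each step is a standard prox: a least-squares solve for J, singular value thresholding for the nuclear-norm term, and soft thresholding for the ℓ1 term.

```python
import numpy as np

def svt(A, tau):
    # singular value thresholding: prox of tau * nuclear norm
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - tau, 0)) @ Vt

def soft(A, tau):
    # elementwise soft thresholding: prox of tau * l1 norm
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0)

def lrssc_admm(X, lam=0.5, rho=2.0, iters=300):
    # min ||C1||_* + lam*||C2||_1  s.t.  X = XJ, J = C2 - diag(C2), J = C1
    n, N = X.shape
    J = np.zeros((N, N)); C1 = np.zeros((N, N)); C2 = np.zeros((N, N))
    L1 = np.zeros((n, N)); L2 = np.zeros((N, N)); L3 = np.zeros((N, N))
    Ainv = np.linalg.inv(X.T @ X + 2.0 * np.eye(N))  # for the quadratic J-update
    for _ in range(iters):
        C2off = C2 - np.diag(np.diag(C2))
        J = Ainv @ (X.T @ (X + L1 / rho) + (C1 - L2 / rho) + (C2off - L3 / rho))
        C1 = svt(J + L2 / rho, 1.0 / rho)       # nuclear-norm prox
        C2 = soft(J + L3 / rho, lam / rho)      # l1 prox ...
        np.fill_diagonal(C2, 0.0)               # ... with the zero diagonal enforced
        C2off = C2 - np.diag(np.diag(C2))
        L1 += rho * (X - X @ J)                 # dual ascent steps
        L2 += rho * (J - C1)
        L3 += rho * (J - C2off)
    return J, C1, C2
```

Each iteration costs one N×N matrix multiply chain plus one SVD of an N×N matrix, which is far cheaper than a generic SDP solve.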
The results are given against an exponential grid of λ values, so comparisons to the pure ℓ1-norm (SSC) and pure nuclear norm (LRR) methods are clear from the two ends of the plots.
Separation-Sparsity Tradeoff: We first illustrate the tradeoff of the solution between obeying SEP and being connected (measured using the intra-class sparsity of the solution). We randomly generate L subspaces of dimension 10 in R^50. Then 50 unit-length random samples are drawn from each subspace and concatenated into a 50 × 50L data matrix.
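A synthetic data generator of this kind can be sketched as follows (names and defaults are ours):

```python
import numpy as np

def sample_subspace_data(L=3, d=10, n=50, per=50, rng=None):
    # draw `per` unit-norm samples from each of L random d-dim subspaces of R^n
    rng = np.random.default_rng(rng)
    blocks, labels = [], []
    for ell in range(L):
        basis, _ = np.linalg.qr(rng.standard_normal((n, d)))  # random orthonormal basis
        pts = basis @ rng.standard_normal((d, per))
        pts /= np.linalg.norm(pts, axis=0)                    # unit-length samples
        blocks.append(pts)
        labels += [ell] * per
    return np.hstack(blocks), np.array(labels)
```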
This justifies our claim that the solution of LRSSC, for λ within this range, can achieve SEP while keeping the intra-class connections relatively dense. Indeed, for the subspace clustering task, a good tradeoff between separation and intra-class connection is important.
Skewed data distribution and model selection: In this experiment, we use the data for L = 6 and combine the first two subspaces into one 20-dimensional subspace, then randomly sample 10 more points from the new subspace to "connect" the 100 points from the original two subspaces. In this way, the data samples within one subspace have two dominating directions. The skewed distribution creates trouble for model selection (judging the number of subspaces), and intuitively, the graph connectivity problem might occur.
We find that model selection heuristics such as the spectral gap [28] and the spectral gap ratio [14] of the normalized Laplacian are good metrics for evaluating the quality of the solution of LRSSC. Here the correct number of subspaces is 5, so the spectral gap is the difference between the 6th and 5th smallest singular values, and the spectral gap ratio is the ratio of adjacent spectral gaps.
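These two heuristics are straightforward to compute from the affinity matrix; a sketch (function names ours), where the largest gap ratio marks the boundary between the near-zero eigenvalues and the bulk:

```python
import numpy as np

def spectral_gaps(W):
    # eigenvalues of the normalized Laplacian, plus adjacent gaps and gap ratios
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(W.sum(axis=1), 1e-12))
    Lap = np.eye(W.shape[0]) - (W * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]
    vals = np.sort(np.linalg.eigvalsh(Lap))
    gaps = np.diff(vals)
    ratios = gaps[1:] / np.maximum(gaps[:-1], 1e-12)
    return vals, gaps, ratios

def estimate_num_clusters(W):
    # the index of the largest gap ratio gives the estimated number of clusters
    _, _, ratios = spectral_gaps(W)
    return int(np.argmax(ratios)) + 2
```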
Figure 2 demonstrates how the singular values change as λ increases. When λ = 0 (corresponding to LRR), there is no significant drop from the 6th to the 5th singular value, hence it is impossible for either heuristic to identify the correct model. As λ increases, the last 5 singular values get smaller and become almost zero when λ is large. The 5-subspace model can then be correctly identified using the spectral gap ratio. On the other hand, we note that the 6th singular value also shrinks as λ increases, which makes the spectral gap very small on the SSC side and leaves little robust margin for correct model selection against violations of SEP.
As shown in Figure 3, the largest spectral gap and spectral gap ratio appear at around λ = 0.1, where the solution is able to benefit from both the better separation induced by the ℓ1-norm factor and the relatively denser connections promoted by the nuclear norm factor.
Figure 2: Last 20 singular values of the normalized Laplacian in the skewed-data experiment. Figure 3: Spectral gap and spectral gap ratio.
7 Conclusion and future work. In this paper, we proposed LRSSC for the subspace clustering problem and provided theoretical analysis of the method.
We demonstrated that LRSSC achieves perfect SEP for a wider range of problems than previously known for SSC, while maintaining denser intra-class connections than SSC (hence it is less likely to encounter the "graph connectivity" issue). Furthermore, the results offer new understanding of SSC and LRR themselves, as well as of problems such as skewed data distributions and model selection.
An important future research question is to mathematically define the concept of graph connectivity, and to establish conditions under which perfect SEP and connectivity indeed occur together for some non-empty range of λ for LRSSC.
A closed form solution to robust subspace estimation and clustering.
Exact subspace segmentation and outlier detection by low-rank representation.
A benchmark for the comparison of 3-D motion segmentation algorithms.
Multiframe motion segmentation with missing data using PowerFactorization and GPCA.
simIndex simValue paperId paperTitle
2 0.18504241 91 nips-2013-Dirty Statistical Models
Author: Eunho Yang, Pradeep Ravikumar
Abstract: We provide a unified framework for the high-dimensional analysis of “superposition-structured” or “dirty” statistical models: where the model parameters are a superposition of structurally constrained parameters. We allow for any number and types of structures, and any statistical model. We consider the general class of M -estimators that minimize the sum of any loss function, and an instance of what we call a “hybrid” regularization, that is the infimal convolution of weighted regularization functions, one for each structural component. We provide corollaries showcasing our unified framework for varied statistical models such as linear regression, multiple regression and principal component analysis, over varied superposition structures. 1
3 0.16324966 137 nips-2013-High-Dimensional Gaussian Process Bandits
Author: Josip Djolonga, Andreas Krause, Volkan Cevher
Abstract: Many applications in machine learning require optimizing unknown functions defined over a high-dimensional space from noisy samples that are expensive to obtain. We address this notoriously hard challenge, under the assumptions that the function varies only along some low-dimensional subspace and is smooth (i.e., it has a low norm in a Reproducible Kernel Hilbert Space). In particular, we present the SI-BO algorithm, which leverages recent low-rank matrix recovery techniques to learn the underlying subspace of the unknown function and applies Gaussian Process Upper Confidence sampling for optimization of the function. We carefully calibrate the exploration–exploitation tradeoff by allocating the sampling budget to subspace estimation and function optimization, and obtain the first subexponential cumulative regret bounds and convergence rates for Bayesian optimization in high-dimensions under noisy observations. Numerical results demonstrate the effectiveness of our approach in difficult scenarios. 1
4 0.1609176 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA
Author: Vincent Q. Vu, Juhee Cho, Jing Lei, Karl Rohe
Abstract: We propose a novel convex relaxation of sparse principal subspace estimation based on the convex hull of rank-d projection matrices (the Fantope). The convex problem can be solved efficiently using alternating direction method of multipliers (ADMM). We establish a near-optimal convergence rate, in terms of the sparsity, ambient dimension, and sample size, for estimation of the principal subspace of a general covariance matrix without assuming the spiked covariance model. In the special case of d = 1, our result implies the near-optimality of DSPCA (d’Aspremont et al. [1]) even when the solution is not rank 1. We also provide a general theoretical framework for analyzing the statistical properties of the method for arbitrary input matrices that extends the applicability and provable guarantees to a wide array of settings. We demonstrate this with an application to Kendall’s tau correlation matrices and transelliptical component analysis. 1
5 0.15283081 224 nips-2013-On the Sample Complexity of Subspace Learning
Author: Alessandro Rudi, Guillermo D. Canas, Lorenzo Rosasco
Abstract: A large number of algorithms in machine learning, from principal component analysis (PCA), and its non-linear (kernel) extensions, to more recent spectral embedding and support estimation methods, rely on estimating a linear subspace from samples. In this paper we introduce a general formulation of this problem and derive novel learning error estimates. Our results rely on natural assumptions on the spectral properties of the covariance operator associated to the data distribution, and hold for a wide class of metrics between subspaces. As special cases, we discuss sharp error estimates for the reconstruction properties of PCA and spectral support estimation. Key to our analysis is an operator theoretic approach that has broad applicability to spectral learning methods. 1
6 0.14865881 286 nips-2013-Robust learning of low-dimensional dynamics from large neural ensembles
7 0.14601754 179 nips-2013-Low-Rank Matrix and Tensor Completion via Adaptive Sampling
8 0.13346362 158 nips-2013-Learning Multiple Models via Regularized Weighting
9 0.12434421 233 nips-2013-Online Robust PCA via Stochastic Optimization
10 0.087906234 133 nips-2013-Global Solver and Its Efficient Approximation for Variational Bayesian Low-rank Subspace Clustering
11 0.084920384 188 nips-2013-Memory Limited, Streaming PCA
12 0.079475433 222 nips-2013-On the Linear Convergence of the Proximal Gradient Method for Trace Norm Regularization
13 0.079130262 173 nips-2013-Least Informative Dimensions
14 0.0785098 185 nips-2013-Matrix Completion From any Given Set of Observations
15 0.077194445 319 nips-2013-Submodular Optimization with Submodular Cover and Submodular Knapsack Constraints
16 0.07495366 192 nips-2013-Minimax Theory for High-dimensional Gaussian Mixtures with Sparse Mean Separation
17 0.073604621 272 nips-2013-Regularized Spectral Clustering under the Degree-Corrected Stochastic Blockmodel
18 0.065855749 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding
19 0.058256574 204 nips-2013-Multiscale Dictionary Learning for Estimating Conditional Distributions
20 0.058134846 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints
simIndex simValue paperId paperTitle
same-paper 1 0.94089693 259 nips-2013-Provable Subspace Clustering: When LRR meets SSC
2 0.71714658 224 nips-2013-On the Sample Complexity of Subspace Learning
3 0.6868431 91 nips-2013-Dirty Statistical Models
4 0.66883415 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA
5 0.63830352 137 nips-2013-High-Dimensional Gaussian Process Bandits
6 0.55625707 286 nips-2013-Robust learning of low-dimensional dynamics from large neural ensembles
7 0.5406267 158 nips-2013-Learning Multiple Models via Regularized Weighting
8 0.53068686 192 nips-2013-Minimax Theory for High-dimensional Gaussian Mixtures with Sparse Mean Separation
9 0.50684124 133 nips-2013-Global Solver and Its Efficient Approximation for Variational Bayesian Low-rank Subspace Clustering
10 0.50624567 342 nips-2013-Unsupervised Spectral Learning of Finite State Transducers
11 0.49687719 272 nips-2013-Regularized Spectral Clustering under the Degree-Corrected Stochastic Blockmodel
12 0.49246278 233 nips-2013-Online Robust PCA via Stochastic Optimization
13 0.48136449 188 nips-2013-Memory Limited, Streaming PCA
14 0.47954234 179 nips-2013-Low-Rank Matrix and Tensor Completion via Adaptive Sampling
15 0.47753748 314 nips-2013-Stochastic Optimization of PCA with Capped MSG
16 0.46642807 297 nips-2013-Sketching Structured Matrices for Faster Nonlinear Regression
17 0.46126685 284 nips-2013-Robust Spatial Filtering with Beta Divergence
18 0.44593871 222 nips-2013-On the Linear Convergence of the Proximal Gradient Method for Trace Norm Regularization
19 0.44332957 185 nips-2013-Matrix Completion From any Given Set of Observations
20 0.4428485 186 nips-2013-Matrix factorization with binary components
topicId topicWeight
[(9, 0.21), (10, 0.01), (16, 0.025), (33, 0.155), (34, 0.099), (41, 0.025), (49, 0.027), (56, 0.111), (70, 0.028), (85, 0.042), (89, 0.093), (93, 0.059), (95, 0.025)]
simIndex simValue paperId paperTitle
1 0.87607056 13 nips-2013-A Scalable Approach to Probabilistic Latent Space Inference of Large-Scale Networks
Author: Junming Yin, Qirong Ho, Eric Xing
Abstract: We propose a scalable approach for making inference about latent spaces of large networks. With a succinct representation of networks as a bag of triangular motifs, a parsimonious statistical model, and an efficient stochastic variational inference algorithm, we are able to analyze real networks with over a million vertices and hundreds of latent roles on a single machine in a matter of hours, a setting that is out of reach for many existing methods. When compared to the state-of-the-art probabilistic approaches, our method is several orders of magnitude faster, with competitive or improved accuracy for latent space recovery and link prediction. 1
same-paper 2 0.82452935 259 nips-2013-Provable Subspace Clustering: When LRR meets SSC
Author: Yu-Xiang Wang, Huan Xu, Chenlei Leng
Abstract: Sparse Subspace Clustering (SSC) and Low-Rank Representation (LRR) are both considered state-of-the-art methods for subspace clustering. The two methods are fundamentally similar in that both are convex optimizations exploiting the intuition of “Self-Expressiveness”. The main difference is that SSC minimizes the vector ℓ1 norm of the representation matrix to induce sparsity, while LRR minimizes the nuclear norm (aka trace norm) to promote a low-rank structure. Because the representation matrix is often simultaneously sparse and low-rank, we propose a new algorithm, termed Low-Rank Sparse Subspace Clustering (LRSSC), by combining SSC and LRR, and develop theoretical guarantees of when the algorithm succeeds. The results reveal interesting insights into the strengths and weaknesses of SSC and LRR and demonstrate how LRSSC can take advantage of both methods in preserving the “Self-Expressiveness Property” and “Graph Connectivity” at the same time. 1
3 0.80055153 94 nips-2013-Distributed $k$-means and $k$-median Clustering on General Topologies
Author: Maria-Florina Balcan, Steven Ehrlich, Yingyu Liang
Abstract: This paper provides new algorithms for distributed clustering for two popular center-based objectives, k-median and k-means. These algorithms have provable guarantees and improve communication complexity over existing approaches. Following a classic approach in clustering by [13], we reduce the problem of finding a clustering with low cost to the problem of finding a coreset of small size. We provide a distributed method for constructing a global coreset which improves over the previous methods by reducing the communication complexity, and which works over general communication topologies. Experimental results on large scale data sets show that this approach outperforms other coreset-based distributed clustering algorithms. 1
4 0.7790525 121 nips-2013-Firing rate predictions in optimal balanced networks
Author: David G. Barrett, Sophie Denève, Christian K. Machens
Abstract: How are firing rates in a spiking network related to neural input, connectivity and network function? This is an important problem because firing rates are a key measure of network activity, in both the study of neural computation and neural network dynamics. However, it is a difficult problem, because the spiking mechanism of individual neurons is highly non-linear, and these individual neurons interact strongly through connectivity. We develop a new technique for calculating firing rates in optimal balanced networks. These are particularly interesting networks because they provide an optimal spike-based signal representation while producing cortex-like spiking activity through a dynamic balance of excitation and inhibition. We can calculate firing rates by treating balanced network dynamics as an algorithm for optimising signal representation. We identify this algorithm and then calculate firing rates by finding the solution to the algorithm. Our firing rate calculation relates network firing rates directly to network input, connectivity and function. This allows us to explain the function and underlying mechanism of tuning curves in a variety of systems. 1
Author: Ryan D. Turner, Steven Bottone, Clay J. Stanek
Abstract: The Bayesian online change point detection (BOCPD) algorithm provides an efficient way to do exact inference when the parameters of an underlying model may suddenly change over time. BOCPD requires computation of the underlying model’s posterior predictives, which can only be computed online in O(1) time and memory for exponential family models. We develop variational approximations to the posterior on change point times (formulated as run lengths) for efficient inference when the underlying model is not in the exponential family, and does not have tractable posterior predictive distributions. In doing so, we develop improvements to online variational inference. We apply our methodology to a tracking problem using radar data with a signal-to-noise feature that is Rice distributed. We also develop a variational method for inferring the parameters of the (non-exponential family) Rice distribution. Change point detection has been applied to many applications [5; 7]. In recent years there have been great improvements to the Bayesian approaches via the Bayesian online change point detection algorithm (BOCPD) [1; 23; 27]. Likewise, the radar tracking community has been improving in its use of feature-aided tracking [10]: methods that use auxiliary information from radar returns such as signal-to-noise ratio (SNR), which depend on radar cross sections (RCS) [21]. Older systems would often filter only noisy position (and perhaps Doppler) measurements while newer systems use more information to improve performance. We use BOCPD for modeling the RCS feature. Whereas BOCPD inference could be done exactly when finding change points in conjugate exponential family models the physics of RCS measurements often causes them to be distributed in non-exponential family ways, often following a Rice distribution. To do inference efficiently we call upon variational Bayes (VB) to find approximate posterior (predictive) distributions. 
Furthermore, the nature of both BOCPD and tracking require the use of online updating. We improve upon the existing and limited approaches to online VB [24; 13]. This paper produces contributions to, and builds upon background from, three independent areas: change point detection, variational Bayes, and radar tracking. Although the emphasis in machine learning is on filtering, a substantial part of tracking with radar data involves data association, illustrated in Figure 1. Observations of radar returns contain measurements from multiple objects (targets) in the sky. If we knew which radar return corresponded to which target we would be presented with NT ∈ N0 independent filtering problems; Kalman filters [14] (or their nonlinear extensions) are applied to “average out” the kinematic errors in the measurements (typically positions) using the measurements associated with each target. The data association problem is to determine which measurement goes to which track. In the classical setup, once a particular measurement is associated with a certain target, that measurement is plugged into the filter for that target as if we knew with certainty it was the correct assignment. The association algorithms, in effect, find the maximum a posteriori (MAP) estimate on the measurement-to-track association. However, approaches such as the joint probabilistic data association (JPDA) filter [2] and the probability hypothesis density (PHD) filter [16] have deviated from this. 1 To find the MAP estimate a log likelihood of the data under each possible assignment vector a must be computed. These are then used to construct cost matrices that reduce the assignment problem to a particular kind of optimization problem (the details of which are beyond the scope of this paper). The motivation behind feature-aided tracking is that additional features increase the probability that the MAP measurement-to-track assignment is correct. 
Based on physical arguments the RCS feature (SNR) is often Rice distributed [21, Ch. 3]; although, in certain situations RCS is exponential or gamma distributed [26]. The parameters of the RCS distribution are determined by factors such as the shape of the aircraft facing the radar sensor. Given that different aircraft have different RCS characteristics, if one attempts to create a continuous track estimating the path of an aircraft, RCS features may help distinguish one aircraft from another if they cross paths or come near one another, for example. RCS also helps distinguish genuine aircraft returns from clutter: a flock of birds or random electrical noise, for example. However, the parameters of the RCS distributions may also change for the same aircraft due to a change in angle or ground conditions. These must be taken into account for accurate association. Providing good predictions in light of a possible sudden change in the parameters of a time series is “right up the alley” of BOCPD and change point methods. The original BOCPD papers [1; 11] studied sudden changes in the parameters of exponential family models for time series. In this paper, we expand the set of applications of BOCPD to radar SNR data which often has the same change point structure found in other applications, and requires online predictions. The BOCPD model is highly modular in that it looks for changes in the parameters of any underlying process model (UPM). The UPM merely needs to provide posterior predictive probabilities, the UPM can otherwise be a “black box.” The BOCPD queries the UPM for a prediction of the next data point under each possible run length, the number of points since the last change point. If (and only if by Hipp [12]) the UPM is exponential family (with a conjugate prior) the posterior is computed by accumulating the sufficient statistics since the last potential change point. This allows for O(1) UPM updates in both computation and memory as the run length increases. 
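For exponential-family UPMs with conjugate priors, the O(1)-per-run-length accumulation of sufficient statistics just described can be sketched as follows (a hypothetical Gaussian-mean UPM is used for illustration — not the Rice model, whose lack of this property is the point of this paper):

```python
def grow_suff_stats(stats, y):
    """One O(1)-per-run-length update of sufficient statistics for a conjugate
    exponential-family UPM.  Here S = [sum y, sum y^2, count] (Gaussian example);
    stats[r] summarizes the last r observations, and a new r = 0 entry is
    prepended for the 'fresh change point' hypothesis."""
    new = [[0.0, 0.0, 0]]  # r = 0: no data since the (hypothesized) change point
    for s1, s2, n in stats:
        new.append([s1 + y, s2 + y * y, n + 1])
    return new
```

Each incoming observation touches every run-length hypothesis once, so memory and time grow linearly in n unless the run-length posterior is pruned.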
We motivate the use of VB for implementing UPMs when the data within a regime is believed to follow a distribution that is not exponential family. The methods presented in this paper can be used to find variational run length posteriors for general non-exponential family UPMs in addition to the Rice distribution. Additionally, the methods for improving online updating in VB (Section 2.2) are applicable in areas outside of change point detection. Likelihood clutter (birds) track 1 (747) track 2 (EMB 110) 0 5 10 15 20 SNR Figure 1: Illustrative example of a tracking scenario: The black lines (−) show the true tracks while the red stars (∗) show the state estimates over time for track 2 and the blue stars for track 1. The 95% credible regions on the states are shown as blue ellipses. The current (+) and previous (×) measurements are connected to their associated tracks via red lines. The clutter measurements (birds in this case) are shown with black dots (·). The distributions on the SNR (RCS) for each track (blue and red) and the clutter (black) are shown on the right. To our knowledge this paper is the first to demonstrate how to compute Bayesian posterior distributions on the parameters of a Rice distribution; the closest work would be Lauwers et al. [15], which computes a MAP estimate. Other novel factors of this paper include: demonstrating the usefulness (and advantages over existing techniques) of change point detection for RCS estimation and tracking; and applying variational inference for UPMs where analytic posterior predictives are not possible. This paper provides four main technical contributions: 1) VB inference for inferring the parameters of a Rice distribution. 2) General improvements to online VB (which is then applied to updating the UPM in BOCPD). 3) Derive a VB approximation to the run length posterior when the UPM posterior predictive is intractable. 4) Handle censored measurements (particularly for a Rice distribution) in VB. 
This is key for processing missed detections in data association.

1 Background

In this section we briefly review the three areas of background: BOCPD, VB, and tracking.

1.1 Bayesian Online Change Point Detection

We briefly summarize the model setup and notation for the BOCPD algorithm; see [27, Ch. 5] for a detailed description. We assume we have a time series with n observations so far, y_1, ..., y_n ∈ Y. In effect, BOCPD performs message passing to do online inference on the run length r_n ∈ 0:n−1, the number of observations since the last change point. Given an underlying predictive model (UPM) and a hazard function h, we can compute an exact posterior over the run length r_n. Conditional on a run length, the UPM produces a sequential prediction on the next data point using all the data since the last change point: p(y_n | y^{(r)}, Θ_m), where (r) := (n−r):(n−1). The UPM is a simpler model where the parameters θ change at every change point and are modeled as being sampled from a prior with hyper-parameters Θ_m. The canonical example of a UPM would be a Gaussian whose mean and variance change at every change point. The online updates are summarized as:

msg_n := p(r_n, y_{1:n}) = Σ_{r_{n−1}} P(r_n | r_{n−1}) · p(y_n | r_{n−1}, y^{(r)}) · p(r_{n−1}, y_{1:n−1}) ,  (1)

where the three factors are, respectively, the hazard, the UPM predictive, and msg_{n−1}. Unless r_n = 0, the sum in (1) only contains one term since the only possibility is that r_{n−1} = r_n − 1. The indexing convention is such that if r_n = 0 then y_{n+1} is the first observation sampled from the new parameters θ. The marginal posterior predictive on the next data point is easily calculated as:

p(y_{n+1} | y_{1:n}) = Σ_{r_n} p(y_{n+1} | y^{(r)}) P(r_n | y_{1:n}) .  (2)

Thus, the predictions from BOCPD fully integrate out any uncertainty in θ. The message updates (1) perform exact inference under a model where the number of change points is not known a priori.

BOCPD RCS Model  We show the Rice UPM as an example as it is required for our application.
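A minimal sketch of the run-length recursion (1), assuming a constant hazard and a black-box UPM predictive (both stand-ins, not the paper's implementation):

```python
import math

def bocpd_step(msgs, y, upm_logpred, hazard=0.01):
    """One BOCPD message update, Eq. (1): msgs[r] = p(r_{n-1} = r, y_{1:n-1}).

    upm_logpred(r, y) plays the role of log p(y_n | r_{n-1} = r, y^{(r)}).
    With a constant hazard h, P(r_n = 0 | r_{n-1}) = h and
    P(r_n = r_{n-1} + 1 | r_{n-1}) = 1 - h.
    """
    like = [m * math.exp(upm_logpred(r, y)) for r, m in enumerate(msgs)]
    changepoint = hazard * sum(like)             # mass collapsing to r_n = 0
    growth = [(1.0 - hazard) * l for l in like]  # r_n = r_{n-1} + 1
    return [changepoint] + growth
```

Normalizing the returned messages gives the run-length posterior P(r_n | y_{1:n}); the marginal predictive (2) then averages the UPM predictives under that posterior.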
The data within a regime are assumed to be iid Rice observations, with a normal-gamma prior:

y_n ∼ Rice(ν, σ) ,  ν ∼ N(μ_0, σ²/λ_0) ,  σ^{−2} =: τ ∼ Gamma(α_0, β_0)  (3)
⟹ p(y_n | ν, σ) = y_n τ exp(−τ(y_n² + ν²)/2) I_0(y_n ν τ) I{y_n ≥ 0}  (4)

where I_0(·) is a modified Bessel function of order zero, which is what excludes the Rice distribution from the exponential family. Although the normal-gamma is not conjugate to a Rice, it will enable us to use the VB-EM algorithm. The UPM parameters are the Rice shape¹ ν ∈ R and scale σ ∈ R+, θ := {ν, σ}, and the hyper-parameters are the normal-gamma parameters Θ_m := {μ_0, λ_0, α_0, β_0}. Every change point results in a new value for ν and σ being sampled. A posterior on θ is maintained for each run length, i.e. every possible starting point for the current regime, and is updated at each new data point. Therefore, BOCPD maintains n distinct posteriors on θ, and although this can be reduced with pruning, it necessitates posterior updates on θ that are computationally efficient. Note that the run length updates in (1) require the UPM to provide predictive log likelihoods at all sample sizes r_n (including zero). Therefore, UPM implementations using such approximations as plug-in MLE predictions will not work very well. The MLE may not even be defined for run lengths smaller than the number of UPM parameters |θ|. For a Rice UPM, the efficient O(1) updating in exponential family models by using a conjugate prior and accumulating sufficient statistics is not possible. This motivates the use of VB methods for approximating the UPM predictions.
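The Rice density (4) is easy to evaluate directly; a self-contained sketch (the power-series truncation for I_0 is our choice, not from the paper):

```python
import math

def bessel_i0(x, terms=100):
    # Modified Bessel function of order zero: I0(x) = sum_k (x^2/4)^k / (k!)^2.
    s, t = 1.0, 1.0
    for k in range(1, terms):
        t *= (x * x / 4.0) / (k * k)
        s += t
    return s

def rice_pdf(y, nu, sigma):
    """Rice density of Eq. (4) with tau = 1/sigma^2; zero for y < 0."""
    if y < 0:
        return 0.0
    tau = 1.0 / sigma ** 2
    return y * tau * math.exp(-tau * (y * y + nu * nu) / 2.0) * bessel_i0(y * nu * tau)
```

The I_0 factor is exactly what breaks the exponential-family form, as noted above.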
In general this is done by optimizing a lower bound L(q) on the evidence log p(y1:n ), using either gradient based methods or standard fixed point equations. 1 The shape ν is usually assumed to be positive (∈ R+ ); however, there is nothing wrong with using a negative ν as Rice(x|ν, σ) = Rice(x|−ν, σ). It also allows for use of a normal-gamma prior. 3 The VB-EM Algorithm In many cases, such as the Rice UPM, the derivation of the VB fixed point equations can be simplified by applying the VB-EM algorithm [3]. VB-EM is applicable to models that are conjugate-exponential (CE) after being augmented with latent variables x1:n . A model is CE if: 1) The complete data likelihood p(x1:n , y1:n |θ) is an exponential family distribution; and 2) the prior p(θ) is a conjugate prior for the complete data likelihood p(x1:n , y1:n |θ). We only have to constrain the posterior q(θ, x1:n ) = q(θ)q(x1:n ) to factorize between the latent variables and the parameters; we do not constrain the posterior to be of any particular parametric form. Requiring the complete likelihood to be CE is a much weaker condition than requiring the marginal on the observed data p(y1:n |θ) to be CE. Consider a mixture of Gaussians: the model becomes CE when augmented with latent variables (class labels). This is also the case for the Rice distribution (Section 2.1). Like the ordinary EM algorithm [9] the VB-EM algorithm alternates between two steps: 1) Find the posterior of the latent variables treating the expected natural parameters η := Eq(θ) [η] as correct: ¯ q(xi ) ← p(xi |yi , η = η ). 2) Find the posterior of the parameters using the expected sufficient statis¯ ¯ tics S := Eq(x1:n ) [S(x1:n , y1:n )] as if they were the sufficient statistics for the complete data set: ¯ q(θ) ← p(θ|S(x1:n , y1:n ) = S). The posterior will be of the same exponential family as the prior. 1.3 Tracking In this section we review data association, which along with filtering constitutes tracking. 
In data association we estimate the association vectors a which map measurements to tracks. At each time NZ (n) step, n ∈ N1 , we observe NZ (n) ∈ N0 measurements, Zn = {zi,n }i=1 , which includes returns from both real targets and clutter (spurious measurements). Here, zi,n ∈ Z is a vector of kinematic measurements (positions in R3 , or R4 with a Doppler), augmented with an RCS component R ∈ R+ for the measured SNR, at time tn ∈ R. The assignment vector at time tn is such that an (i) = j if measurement i is associated with track j > 0; an (i) = 0 if measurement i is clutter. The inverse mapping a−1 maps tracks to measurements: meaning a−1 (an (i)) = i if an (i) = 0; and n n a−1 (i) = 0 ⇔ an (j) = i for all j. For example, if NT = 4 and a = [2 0 0 1 4] then NZ = 5, n Nc = 2, and a−1 = [4 1 0 5]. Each track is associated with at most one measurement, and vice-versa. In N D data association we jointly find the MAP estimate of the association vectors over a sliding window of the last N − 1 time steps. We assume we have NT (n) ∈ N0 total tracks as a known parameter: NT (n) is adjusted over time using various algorithms (see [2, Ch. 3]). In the generative process each track places a probability distribution on the next N − 1 measurements, with both kinematic and RCS components. However, if the random RCS R for a measurement is below R0 then it will not be observed. There are Nc (n) ∈ N0 clutter measurements from a Poisson process with λ := E[Nc (n)] (often with uniform intensity). The ordering of measurements in Zn is assumed to be uniformly random. For 3D data association the model joint p(Zn−1:n , an−1 , an |Z1:n−2 ) is: NT |Zi | n pi (za−1 (i),n , za−1 n n−1 i=1 (i),n−1 ) × λNc (i) exp(−λ)/|Zi |! i=n−1 p0 (zj,i )I{ai (j)=0} , (5) j=1 where pi is the probability of the measurement sequence under track i; p0 is the clutter distribution. 
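The assignment-vector bookkeeping can be made concrete; this hypothetical helper reproduces the paper's example (N_T = 4, a = [2 0 0 1 4]), using 1-based measurement indices as in the text:

```python
def inverse_assignment(a, n_tracks):
    """Invert an assignment vector: a[i-1] = j means measurement i is associated
    with track j (j = 0 means clutter).  Returns a_inv with a_inv[j-1] = i, or 0
    when track j has no associated measurement (a missed detection)."""
    a_inv = [0] * n_tracks
    for i, j in enumerate(a, start=1):
        if j > 0:
            a_inv[j - 1] = i
    return a_inv
```

For the example above this yields a⁻¹ = [4 1 0 5], with N_c = 2 clutter measurements.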
The probability pi is the product of the RCS component predictions (BOCPD) and the kinematic components (filter); informally, pi (z) = pi (positions) × pi (RCS). If there is a missed detection, i.e. a−1 (i) = 0, we then use pi (za−1 (i),n ) = P (R < R0 ) under the RCS model for track i with no conn n tribution from positional (kinematic) component. Just as BOCPD allows any black box probabilistic predictor to be used as a UPM, any black box model of measurement sequences can used in (5). The estimation of association vectors for the 3D case becomes an optimization problem of the form: ˆ (ˆn−1 , an ) = argmax log P (an−1 , an |Z1:n ) = argmax log p(Zn−1:n , an−1 , an |Z1:n−2 ) , (6) a (an−1 ,an ) (an−1 ,an ) which is effectively optimizing (5) with respect to the assignment vectors. The optimization given in (6) can be cast as a multidimensional assignment (MDA) problem [2], which can be solved efficiently in the 2D case. Higher dimensional assignment problems, however, are NP-hard; approximate, yet typically very accurate, solvers must be used for real-time operation, which is usually required for tracking systems [20]. If a radar scan occurs at each time step and a target is not detected, we assume the SNR has not exceeded the threshold, implying 0 ≤ R < R0 . This is a (left) censored measurement and is treated differently than a missing data point. Censoring is accounted for in Section 2.3. 4 2 Online Variational UPMs We cover the four technical challenges for implementing non-exponential family UPMs in an efficient and online manner. We drop the index of the data point i when it is clear from context. 2.1 Variational Posterior for a Rice Distribution The Rice distribution has the property that x ∼ N (ν, σ 2 ) , y ∼ N (0, σ 2 ) =⇒ R = x2 + y 2 ∼ Rice(ν, σ) . 
(7)

For simplicity we perform inference using R², as opposed to R, and transform accordingly:

x ∼ N(ν, σ²) ,  R² − x² ∼ Gamma(1/2, τ/2) ,  τ := 1/σ² ∈ R+
⟹ p(R², x) = p(R² | x) p(x) = Gamma(R² − x² | 1/2, τ/2) N(x | ν, σ²) .  (8)

The complete likelihood (8) is the product of two exponential family models and is exponential family itself, parameterized with base measure h and partition factor g:

η = [ντ, −τ/2] ,  S = [x, R²] ,  h(R², x) = (2π √(R² − x²))^{−1} ,  g(ν, τ) = τ exp(−ν²τ/2) .

By inspection we see that the natural parameters η and sufficient statistics S are the same as a Gaussian with unknown mean and variance. Therefore, we apply the normal-gamma prior on (ν, τ) as it is the conjugate prior for the complete data likelihood. This allows us to apply the VB-EM algorithm. We use y_i := R_i² as the VB observation, not R_i as in (3). In (5), z_{·,·}(end) is the RCS R.

VB M-Step  We derive the posterior updates to the parameters given expected sufficient statistics:

x̄ := Σ_{i=1}^n E[x_i]/n ,  μ_n = (λ_0 μ_0 + Σ_{i=1}^n E[x_i])/(λ_0 + n) ,  λ_n = λ_0 + n ,  α_n = α_0 + n ,  (9)
β_n = β_0 + (1/2) Σ_{i=1}^n (E[x_i] − x̄)² + (n λ_0 / (2(λ_0 + n))) (x̄ − μ_0)² + (1/2) Σ_{i=1}^n (R_i² − E[x_i]²) .  (10)

This is the same as an observation from a Gaussian and a gamma that share an (inverse) scale τ.

VB E-Step  We then must find both expected sufficient statistics S̄. The expectation E[R_i² | R_i²] = R_i² trivially, leaving E[x_i | R_i²]. Recall that the joint on (x, y) is a bivariate normal; if we constrain the radius to R, the angle ω will be distributed by a von Mises (VM) distribution. Therefore,

ω := arccos(x/R) ∼ VM(0, κ) ,  κ = R E[ντ] ⟹ E[x] = R E[cos ω] = R I_1(κ)/I_0(κ) ,  (11)

where computing κ constitutes the VB E-step and we have used the trigonometric moment on ω [18]. This completes the computations required to do the VB updates on the Rice posterior.

Variational Lower Bound  For completeness, and to assess convergence, we derive the VB lower bound L(q).
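The E-step moment (11) reduces to evaluating the Bessel ratio I_1(κ)/I_0(κ); a pure-Python sketch (series truncation is our choice):

```python
import math

def bessel_ratio(kappa, terms=200):
    """I1(kappa)/I0(kappa) from the power series
    I_m(x) = sum_k (x/2)^(2k+m) / (k! (k+m)!), as used in the VB E-step (11)."""
    i0, i1 = 1.0, 0.5 * kappa
    t0, t1 = 1.0, 0.5 * kappa
    q = kappa * kappa / 4.0
    for k in range(1, terms):
        t0 *= q / (k * k)
        t1 *= q / (k * (k + 1))
        i0 += t0
        i1 += t1
    return i1 / i0

def e_step_mean(R, e_nu_tau):
    # Expected sufficient statistic E[x] = R * I1(kappa)/I0(kappa), kappa = R E[nu tau].
    kappa = R * e_nu_tau
    return R * bessel_ratio(kappa)
```

As κ → ∞ the ratio approaches 1, i.e. E[x] → R, matching the intuition that a strong mean component pins the angle ω near zero.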
Using the standard formula [4] for L(q) = E_q[log p(y_{1:n}, x_{1:n}, θ)] + H[q] we get:

L(q) = Σ_{i=1}^n [ E[log(τ/2)] − (1/2) E[τ] R_i² + (E[ντ] − κ_i/R_i) E[x_i] − (1/2) E[ν²τ] + log I_0(κ_i) ] − KL(q‖p) ,  (12)

where p in the KL is the prior on (ν, τ), which is easy to compute as q and p are both normal-gamma. Equivalently, (12) can be optimized directly instead of using the VB-EM updates.

2.2 Online Variational Inference

In Section 2.1 we derived an efficient way to compute the variational posterior for a Rice distribution for a fixed data set. However, as is apparent from (1), we need online predictions from the UPM; we must be able to update the posterior one data point at a time. When the UPM is exponential family and we can compute the posterior exactly, we merely use the posterior from the previous step as the prior. However, since we are only computing a variational approximation to the posterior, using the previous posterior as the prior does not give the exact same answer as re-computing the posterior from batch. This gives two obvious options: 1) recompute the posterior from batch at every update at O(n) cost, or 2) use the previous posterior as the prior at O(1) cost and reduced accuracy.

The difference between the options is encapsulated by looking at the expected sufficient statistics: S̄ = Σ_{i=1}^n E_{q(x_i|y_{1:n})}[S(x_i, y_i)]. Naive online updating uses old expected sufficient statistics, so its posterior effectively uses S̄ = Σ_{i=1}^n E_{q(x_i|y_{1:i})}[S(x_i, y_i)]. We get the best of both worlds if we adjust those estimates over time. We can in fact do this if we project the expected sufficient statistics into a “feature space” in terms of the expected natural parameters. For some function f,

q(x_i) = p(x_i | y_i, η = η̄) ⟹ E_{q(x_i|y_{1:n})}[S(x_i, y_i)] = f(y_i, η̄) .  (13)

If f is piecewise continuous then we can represent it with an inner product [8, Sec. 2.1.6]:

f(y_i, η̄) = φ(η̄)ᵀψ(y_i) ⟹ S̄ = Σ_{i=1}^n φ(η̄)ᵀψ(y_i) = φ(η̄)ᵀ Σ_{i=1}^n ψ(y_i) ,  (14)

where an infinite-dimensional φ and ψ may be required for exact representation, but can be approximated by a finite inner product. In the Rice distribution case we use (11):

f(y_i, η̄) = E[x_i] = R_i Ĩ(R_i E[ντ]) = R_i Ĩ((R_i/μ_0) · μ_0 E[ντ]) ,  Ĩ(·) := I_1(·)/I_0(·) ,  (15)

where recall that y_i = R_i² and η̄_1 = E[ντ]. We can easily represent f with an inner product if we can represent Ĩ as an inner product: Ĩ(uv) = φ(u)ᵀψ(v). We use unitless φ_i(u) = Ĩ(c_i u) with c_{1:G} a log-linear grid from 10⁻² to 10³ and G = 50. We use a lookup table for ψ(v) that was trained to match Ĩ using non-negative least squares, which left us with a sparse lookup table. Online updating for VB posteriors was also developed in [24; 13]. These methods introduce forgetting factors to discount the contributions from old data points that might be detrimental to accuracy. Since the VB predictions are “embedded” in a change point method, they are automatically phased out if the posterior predictions become inaccurate, making the forgetting factors unnecessary.

2.3 Censored Data

As mentioned in Section 1.3, we must handle censored RCS observations during a missed detection. In the VB-EM framework we merely have to compute the expected sufficient statistics given the censored measurement: E[S | R < R_0]. The expected sufficient statistic from (11) is now:

E[x | R < R_0] = (∫_0^{R_0} E[x|R] p(R) dR) / RiceCDF(R_0 | ν, τ) = ν (1 − Q_2(ν/σ, R_0/σ)) / (1 − Q_1(ν/σ, R_0/σ)) ,

where Q_M is the Marcum Q function [17] of order M. Similar updates for E[S | R < R_0] are possible for exponential or gamma UPMs, but are not shown as they are relatively easy to derive.
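As a sanity check on the conditional moment above, one can integrate E[x|R] = R·I_1(κ)/I_0(κ) against the Rice density numerically: by the tower property, letting R_0 → ∞ recovers the unconditional E[x] = ν. The grid step and series truncation here are our choices:

```python
import math

def _i(m, x, terms=120):
    # Modified Bessel I_m(x) for m in {0, 1} via its power series.
    t = (0.5 * x) ** m / math.factorial(m)
    s, q = t, x * x / 4.0
    for k in range(1, terms):
        t *= q / (k * (k + m))
        s += t
    return s

def censored_mean_x(nu, sigma, R0, dr=0.005):
    """E[x | R < R0] for a Rice(nu, sigma), by midpoint integration of
    E[x|R] p(R) over [0, R0] divided by P(R < R0), with
    E[x|R] = R I1(kappa)/I0(kappa) and kappa = R nu / sigma^2."""
    tau = 1.0 / sigma ** 2
    num = den = 0.0
    r = dr / 2.0
    while r < R0:
        p = r * tau * math.exp(-tau * (r * r + nu * nu) / 2.0) * _i(0, r * nu * tau)
        kappa = r * nu * tau
        num += r * (_i(1, kappa) / _i(0, kappa)) * p * dr
        den += p * dr
        r += dr
    return num / den
```

Censoring at a low threshold pulls the conditional mean below ν, which is exactly the correction the VB-EM E-step needs for missed detections.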
2.4 Variational Run Length Posteriors: Predictive Log Likelihoods Both updating the BOCPD run length posterior (1) and finding the marginal predictive log likelihood of the next point (2) require calculating the UPM’s posterior predictive log likelihood log p(yn+1 |rn , y(r) ). The marginal posterior predictive from (2) is used in data association (6) and benchmarking BOCPD against other methods. However, the exact posterior predictive distribution obtained by integrating the Rice likelihood against the VB posterior is difficult to compute. We can break the BOCPD update (1) into a time and measurement update. The measurement update corresponds to a Bayesian model comparison (BMC) calculation with prior p(rn |y1:n ): p(rn |y1:n+1 ) ∝ p(yn+1 |rn , y(r) )p(rn |y1:n ) . (16) Using the BMC results in Bishop [4, Sec. 10.1.4] we find a variational posterior on the run length by using the variational lower bound for each run length Li (q) ≤ log p(yn+1 |rn = i, y(r) ), calculated using (12), as a proxy for the exact UPM posterior predictive in (16). This gives the exact VB posterior if the approximating family Q is of the form: q(rn , θ, x) = qUPM (θ, x|rn )q(rn ) =⇒ q(rn = i) = exp(Li (q))p(rn = i|y1:n )/ exp(L(q)) , (17) where qUPM contains whatever constraints we used to compute Li (q). The normalizer on q(rn ) serves as a joint VB lower bound: L(q) = log i exp(Li (q))p(rn = i|y1:n ) ≤ log p(yn+1 |y1:n ). Note that the conditional factorization is different than the typical independence constraint on q. Furthermore, we derive the estimation of the assignment vectors a in (6) as a VB routine. We use a similar conditional constraint on the latent BOCPD variables given the assignment and constrain the assignment posterior to be a point mass. 
In the 2D assignment case, for example,

q(a_n, X_{1:N_T}) = q(X_{1:N_T} | a_n) q(a_n) = q(X_{1:N_T} | a_n) I{a_n = â_n} ,  (18)

where each track's X_i represents all the latent variables used to compute the variational lower bound on log p(z_{j,n} | a_n(j) = i). In the BOCPD case, X_i := {r_n, x, θ}. The resulting VB fixed point equations find the posterior on the latent variables X_i by taking â_n as the true assignment and solving the VB problem of (17); the assignment â_n is found by using (6) and taking the joint BOCPD lower bound L(q) as a proxy for the BOCPD predictive log likelihood component of log p_i in (5).

[Figure 2: (a) Online Updating, (b) Exponential RCS, (c) Rice RCS. Left: KL from naive updating, Sato's method [24], and improved online VB (◦) to the batch VB posterior vs. sample size n, using a standard normal-gamma prior. Each curve represents a true ν in the generating Rice distribution: ν = 3.16 (red), ν = 10.0 (green), ν = 31.6 (blue), with τ = 1. Middle: The RMSE (dB scale) of the estimate of the mean RCS E[R_n] for an exponential RCS model; the curves are BOCPD (blue), IMM (black), identity (magenta), α-filter (green), and median filter (red). Right: Same as the middle but for the Rice RCS case. The dashed lines are 95% confidence intervals.]
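The variational run-length posterior (17) is just a softmax of the per-run-length lower bounds plus the log prior; a minimal, numerically stable sketch:

```python
import math

def run_length_posterior(lower_bounds, log_prior):
    """q(r_n = i) ∝ exp(L_i(q)) * p(r_n = i | y_{1:n}), Eq. (17), computed in
    log space for stability.  lower_bounds[i] is the VB bound L_i(q) standing
    in for the intractable UPM posterior predictive log likelihood."""
    logs = [li + lp for li, lp in zip(lower_bounds, log_prior)]
    m = max(logs)
    w = [math.exp(v - m) for v in logs]
    z = sum(w)
    return [v / z for v in w]
```

The normalizer log(z) + m recovers the joint lower bound L(q) ≤ log p(y_{n+1} | y_{1:n}) used as the predictive proxy in the association cost (5).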
The performance curves are the KL divergence between these online approximations and the batch VB solution (i.e., restarting VB from scratch at every new data point), plotted against sample size. The error for our method stays around a modest 10^{-2} nats, while naive updating incurs large errors of 1 to 50 nats [19, Ch. 4]. Sato's method tends to settle at around 1 nat of approximation error. The annealing schedule (i.e., forgetting factors) recommended in [24] performed worse than naive updating; we did a grid search over annealing exponents and show the results for the best performing schedule, n^{-0.52}. By contrast, our method does not require tuning an annealing schedule.

3.2 RCS Estimation Benchmarking

We now compare BOCPD with other methods for RCS estimation. We use the same experimental example as Slocumb and Klusman III [25], which uses an augmented interacting multiple model (IMM) based method for estimating the RCS; we also compare against the same α-filter and median filter used in [25]. As a reference point, we also consider the "identity filter," an unbiased filter that uses only y_n to estimate the mean RCS E[R_n] at time step n. We extend this example to the Rice RCS case in addition to the exponential RCS case. The bias correction constants in the IMM were adjusted for the Rice distribution case as per [25, Sec. 3.4]. The results for the exponential distributions used in [25] and for the Rice distribution case are shown in Figures 2(b) and 2(c). The IMM used in [25] was hard-coded to expect jumps in the SNR in multiples of ±10 dB, which is exactly what is presented in the example (a sequence of 20, 10, 30, and 10 dB). In [25] the authors mention that the IMM reaches an RMSE "floor" at 2 dB, yet BOCPD continues to drop as low as 0.56 dB. The RMSE from BOCPD also does not spike nearly as high as the other methods upon a change in E[R_n]. The α-filter and median filter perform worse than both the IMM and BOCPD.
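The annealing schedules compared above amount to different step sizes for the online statistics. A toy illustration (our own notation, not the paper's update equations): a step size ρ_n = 1/n gives exact running averaging, while a Sato-style schedule ρ_n = n^{-κ} with κ < 1 (e.g. the grid-searched κ = 0.52) forgets early, poorly-converged estimates faster:

```python
import numpy as np

def online_average(ys, kappa=None):
    """Running estimate s_n = (1 - rho_n) * s_{n-1} + rho_n * y_n.

    kappa=None uses rho_n = 1/n (exact running mean);
    otherwise rho_n = n**(-kappa), a Sato-style annealing schedule.
    """
    s = 0.0
    for n, y in enumerate(ys, start=1):
        rho = 1.0 / n if kappa is None else float(n) ** (-kappa)
        s = (1.0 - rho) * s + rho * y
    return s
```

Both schedules agree on stationary data; they differ in how quickly stale statistics are discounted, which is exactly what the grid search over exponents trades off.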
The RMSE and confidence intervals are calculated from 5000 runs of the experiment.

Figure 3: Left: average relative improvements (%) for SIAP metrics: position accuracy (red ), velocity accuracy (green ), and spurious tracks (blue ◦) across difficulty levels. Right: LHR: true trajectories shown as black lines (−), estimates using a BOCPD RCS model for association shown as blue stars (∗), and the standard tracker as red circles (◦). The standard tracker has spurious tracks over east London and near Ipswich. Background map data: Google Earth (TerraMetrics, Data SIO, NOAA, U.S. Navy, NGA, GEBCO, Europa Technologies).

3.3 Flightradar24 Tracking Problem

Finally, we used real flight trajectories from flightradar24 and plugged them into our 3D tracking algorithm. We compare tracking performance between our BOCPD model and the relatively standard constant-probability-of-detection (no RCS) setup [2, Sec. 3.5]. We use the single integrated air picture (SIAP) metrics [6] to demonstrate the improved tracking performance; the SIAP metrics are a standard set of metrics used to compare tracking systems. We broke the data into 30 regions, each within a 200 km by 200 km area centered on one of the world's 30 busiest airports [22], during a one hour period (in Sept. 2012) sampled every 5 s. Commercial airport traffic is typically very orderly and does not allow aircraft to fly close to one another or cross paths; feature-aided tracking is most necessary in scenarios with a more chaotic air situation. Therefore, we took random subsets of 10 flight paths and randomly shifted their start times to create scenarios of greater interest.
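The evaluation bins these scenarios by how often aircraft come close to one another. One plausible reading of that close-approach count (a hypothetical sketch; the paper gives no code, and the exact counting convention is an assumption):

```python
import numpy as np

def scenario_difficulty(positions, threshold_m=400.0):
    """Count the time steps at which any two aircraft come within
    threshold_m of each other.

    positions : array of shape (T, K, d), K aircraft tracked over T time
                steps, coordinates in metres.
    """
    count = 0
    for frame in positions:                        # frame: (K, d)
        diffs = frame[:, None, :] - frame[None, :, :]
        dists = np.sqrt((diffs ** 2).sum(axis=-1))
        iu = np.triu_indices(len(frame), k=1)      # distinct pairs only
        if np.any(dists[iu] < threshold_m):
            count += 1
    return count
```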
The resulting SIAP metric improvements are shown in Figure 3(a), where we break performance down by a difficulty metric: the number of times in a scenario that any two aircraft come within ∼400 m of each other. The biggest improvements are seen for difficulties above three, where positional accuracy increases by 30%. Significant improvements are also seen for velocity accuracy (11%) and the frequency of spurious tracks (6%), and performance gains are seen at all difficulty levels considered. The larger improvements at level three over level five are possibly due to some level five scenarios that are not resolvable simply through more sophisticated models. We demonstrate how our RCS methods prevent the creation of spurious tracks around London Heathrow in Figure 3(b).

4 Conclusions

We have demonstrated that it is possible to use sophisticated and recent developments in machine learning, such as BOCPD, together with the modern inference method of VB, to produce demonstrable improvements in the much more mature field of radar tracking. We first closed a "hole" in the literature in Section 2.1 by deriving variational inference for the parameters of a Rice distribution, with its inherent applicability to radar tracking. In Sections 2.2 and 2.4 we showed that these variational UPMs can be used for non-exponential family models in BOCPD without sacrificing its modular or online nature. The improvements in online VB extend to UPMs beyond the Rice distribution and, more generally, beyond change point detection. We can use the variational lower bound from the UPM to obtain a principled variational approximation to the run length posterior. Furthermore, we cast the estimation of the assignment vectors themselves as a VB problem, in large contrast to the tracking literature. More algorithms from the tracking literature can likely be cast in machine learning frameworks such as VB and improved upon from there.

References

[1] Adams, R. P.
and MacKay, D. J. (2007). Bayesian online changepoint detection. Technical report, University of Cambridge, Cambridge, UK.
[2] Bar-Shalom, Y., Willett, P., and Tian, X. (2011). Tracking and Data Fusion: A Handbook of Algorithms. YBS Publishing.
[3] Beal, M. and Ghahramani, Z. (2003). The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures. In Bayesian Statistics, volume 7, pages 453–464.
[4] Bishop, C. M. (2007). Pattern Recognition and Machine Learning. Springer.
[5] Braun, J. V., Braun, R., and Müller, H.-G. (2000). Multiple changepoint fitting via quasilikelihood, with application to DNA sequence segmentation. Biometrika, 87(2):301–314.
[6] Byrd, E. (2003). Single integrated air picture (SIAP) attributes version 2.0. Technical Report 2003-029, DTIC.
[7] Chen, J. and Gupta, A. (1997). Testing and locating variance changepoints with application to stock prices. Journal of the American Statistical Association, 92(438):739–747.
[8] Courant, R. and Hilbert, D. (1953). Methods of Mathematical Physics. Interscience.
[9] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38.
[10] Ehrman, L. M. and Blair, W. D. (2006). Comparison of methods for using target amplitude to improve measurement-to-track association in multi-target tracking. In 9th International Conference on Information Fusion, pages 1–8. IEEE.
[11] Fearnhead, P. and Liu, Z. (2007). Online inference for multiple changepoint problems. Journal of the Royal Statistical Society, Series B, 69(4):589–605.
[12] Hipp, C. (1974). Sufficient statistics and exponential families. The Annals of Statistics, 2(6):1283–1292.
[13] Honkela, A. and Valpola, H. (2003). On-line variational Bayesian learning. In 4th International Symposium on Independent Component Analysis and Blind Signal Separation, pages 803–808.
[14] Kalman, R. E.
(1960). A new approach to linear filtering and prediction problems. Transactions of the ASME — Journal of Basic Engineering, 82(Series D):35–45.
[15] Lauwers, L., Barbé, K., Van Moer, W., and Pintelon, R. (2009). Estimating the parameters of a Rice distribution: a Bayesian approach. In Instrumentation and Measurement Technology Conference (I2MTC '09), pages 114–117. IEEE.
[16] Mahler, R. (2003). Multi-target Bayes filtering via first-order multi-target moments. IEEE Trans. AES, 39(4):1152–1178.
[17] Marcum, J. (1950). Table of Q functions. U.S. Air Force RAND Research Memorandum M-339, Rand Corporation, Santa Monica, CA.
[18] Mardia, K. V. and Jupp, P. E. (2000). Directional Statistics. John Wiley & Sons, New York.
[19] Murray, I. (2007). Advances in Markov chain Monte Carlo methods. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, London, UK.
[20] Poore, A. P., Rijavec, N., Barker, T. N., and Munger, M. L. (1993). Data association problems posed as multidimensional assignment problems: algorithm development. In Optical Engineering and Photonics in Aerospace Sensing, pages 172–182. International Society for Optics and Photonics.
[21] Richards, M. A., Scheer, J., and Holm, W. A., editors (2010). Principles of Modern Radar: Basic Principles. SciTech Publishing.
[22] Rogers, S. (2012). The world's top 100 airports: listed, ranked and mapped. The Guardian.
[23] Saatçi, Y., Turner, R., and Rasmussen, C. E. (2010). Gaussian process change point models. In 27th International Conference on Machine Learning, pages 927–934, Haifa, Israel. Omnipress.
[24] Sato, M.-A. (2001). Online model selection based on the variational Bayes. Neural Computation, 13(7):1649–1681.
[25] Slocumb, B. J. and Klusman III, M. E. (2005). A multiple model SNR/RCS likelihood ratio score for radar-based feature-aided tracking. In Optics & Photonics 2005, pages 59131N–59131N. International Society for Optics and Photonics.
[26] Swerling, P. (1954).
Probability of detection for fluctuating targets. Technical Report RM-1217, Rand Corporation.
[27] Turner, R. (2011). Gaussian Processes for State Space Models and Change Point Detection. PhD thesis, University of Cambridge, Cambridge, UK.