nips nips2013 nips2013-233 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jiashi Feng, Huan Xu, Shuicheng Yan
Abstract: Robust PCA methods are typically based on batch optimization and have to load all the samples into memory during optimization. This prevents them from efficiently processing big data. In this paper, we develop an Online Robust PCA (OR-PCA) that processes one sample per time instance and hence its memory cost is independent of the number of samples, significantly enhancing the computation and storage efficiency. The proposed OR-PCA is based on stochastic optimization of an equivalent reformulation of the batch RPCA. Indeed, we show that OR-PCA provides a sequence of subspace estimations converging to the optimum of its batch counterpart and hence is provably robust to sparse corruption. Moreover, OR-PCA can naturally be applied for tracking dynamic subspace. Comprehensive simulations on subspace recovering and tracking demonstrate the robustness and efficiency advantages of the OR-PCA over online PCA and batch RPCA methods. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Robust PCA methods are typically based on batch optimization and have to load all the samples into memory during optimization. [sent-7, score-0.332]
2 The proposed OR-PCA is based on stochastic optimization of an equivalent reformulation of the batch RPCA. [sent-10, score-0.259]
3 Indeed, we show that OR-PCA provides a sequence of subspace estimations converging to the optimum of its batch counterpart and hence is provably robust to sparse corruption. [sent-11, score-0.608]
4 Comprehensive simulations on subspace recovering and tracking demonstrate the robustness and efficiency advantages of the OR-PCA over online PCA and batch RPCA methods. [sent-13, score-0.664]
5 One prominent example is the Principal Component Pursuit (PCP) method proposed in [4] that robustly finds the low-dimensional subspace through decomposing the sample matrix into a low-rank component and an overall sparse component. [sent-17, score-0.458]
6 It is proved that both components can be recovered exactly by minimizing a weighted combination of the nuclear norm of the first term and the $\ell_1$ norm of the second term. [sent-18, score-0.236]
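For reference, the PCP program of [4] alluded to here takes the following standard form (stated from the literature rather than quoted from this excerpt; the weight $\lambda$ is typically set to $1/\sqrt{\max(p,n)}$):

$$\min_{X,\,E}\ \|X\|_* + \lambda \|E\|_1 \quad \text{s.t.}\quad Z = X + E,$$

where Z is the observed sample matrix, X the low-rank component, and E the overall sparse component.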
7 Thus the subspace estimation is robust to sparse corruptions. [sent-19, score-0.414]
8 However, PCP and other robust PCA methods are all implemented in a batch manner. [sent-20, score-0.323]
9 Thus, robust PCA methods require memorizing all samples, in sharp contrast to standard PCA where only the covariance matrix is needed. [sent-22, score-0.173]
10 Moreover, for an incrementally growing sample set, the optimization procedure has to be re-run on all available samples whenever a new sample is added. [sent-24, score-0.205]
11 Another pitfall of batch robust PCA methods is that they cannot handle the case where the underlying subspaces are changing gradually. [sent-26, score-0.354]
12 Unfortunately, traditional batch RPCA methods may fail in this case. [sent-30, score-0.194]
13 In order to efficiently and robustly estimate the subspace of a large-scale or dynamic sample set, we propose an Online Robust PCA (OR-PCA) method. [sent-31, score-0.312]
14 OR-PCA processes only one sample per time instance and thus is able to efficiently handle big data and dynamic sample sets, saving the memory cost and dynamically estimating the subspace of evolving samples. [sent-32, score-0.408]
15 The major difficulty of implementing the previous RPCA methods, such as PCP, in an online fashion is that the adopted nuclear norm tightly couples the samples and thus the samples have to be processed simultaneously. [sent-34, score-0.48]
16 To tackle this, OR-PCA pursues the low-rank component in a different manner: using an equivalent form of the nuclear norm, OR-PCA explicitly decomposes the sample matrix into the product of a subspace basis and coefficients, plus a sparse noise component. [sent-35, score-0.628]
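Concretely, the reformulated batch objective implied by this decomposition can be written as follows (a sketch consistent with the empirical cost (3)-(4) quoted below; I take this to be what the text later calls problem (2), although that equation is not reproduced in this excerpt):

$$\min_{L \in \mathbb{R}^{p\times r},\, R \in \mathbb{R}^{n\times r},\, E}\ \frac{1}{2}\|Z - LR^T - E\|_F^2 + \frac{\lambda_1}{2}\left(\|L\|_F^2 + \|R\|_F^2\right) + \lambda_2\|E\|_1,$$

where Z is the sample matrix, L the subspace basis, R the coefficients, and E the sparse noise.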
17 The first one is to project the sample onto the current basis and isolate the sparse noise (explaining the outlier contamination), and the second one is to update the basis given the new sample. [sent-38, score-0.275]
18 Our main technical contribution is to show that the above-mentioned iterative optimization scheme converges to the globally optimal solution of the original PCP formulation, thus establishing the validity of our online method. [sent-39, score-0.273]
19 Our proof is inspired by recent results from [16], which proposed an online dictionary learning method and provided a convergence guarantee for it. [sent-40, score-0.35]
20 However, [16] can only guarantee that the solution converges to a stationary point of the optimization problem. [sent-41, score-0.175]
21 Besides its nice behavior on single-subspace recovery, OR-PCA can also naturally be applied to tracking a time-varying subspace, since it updates the subspace estimate promptly after each new sample is revealed. [sent-42, score-0.748]
22 We conduct comprehensive simulations to demonstrate the advantages of OR-PCA for both subspace recovering and tracking in this work. [sent-43, score-0.355]
23 2 Related Work Robust PCA algorithms based on nuclear norm minimization to recover low-rank matrices have become standard since the seminal works [21, 6]. [sent-44, score-0.305]
24 Recent works [4, 5] have taken the nuclear norm minimization approach to the decomposition of a low-rank matrix and an overall sparse matrix. [sent-45, score-0.28]
25 Different from the setting of samples being corrupted by sparse noise, [25, 24] and [7] solve robust PCA in the case that a few samples are completely corrupted. [sent-46, score-0.371]
26 However, all of these RPCA methods are implemented in a batch manner and cannot be directly adapted to the online setup. [sent-47, score-0.314]
27 There are only a few pieces of work on online robust PCA [13, 20, 10], which we discuss below. [sent-48, score-0.249]
28 In [13], an incremental and robust subspace learning method is proposed. [sent-49, score-0.418]
29 In [20], a compressive sensing based recursive robust PCA algorithm is proposed. [sent-53, score-0.152]
30 The proposed method essentially solves a compressive sensing optimization over a small batch of data to update the principal component estimate instead of using a single sample, and it is not clear how to extend the method to the latter case. [sent-54, score-0.281]
31 propose an incremental gradient descent method on the Grassmannian manifold for solving the robust PCA problem, named GRASTA [10]. [sent-56, score-0.243]
32 Moreover, in the experiments in this work, we show that our proposed method is more robust than GRASTA to sparse corruption and achieves a higher breakdown point. [sent-59, score-0.294]
33 The most closely related work to ours in technique is [16], which proposes an online learning method for dictionary learning and sparse coding. [sent-60, score-0.25]
34 Based on that work, [9] proposes an online nonnegative matrix factorization method. [sent-61, score-0.29]
35 Both works can be seen as solving online matrix factorization problems with specific constraints (sparse or non-negative). [sent-62, score-0.257]
36 In OR-PCA, an additive sparse noise matrix is considered along with the matrix factorization. [sent-64, score-0.177]
37 In addition, benefiting from explicitly considering the noise, OR-PCA is robust to sparse contamination, a property absent from both the dictionary learning and the nonnegative matrix factorization works. [sent-66, score-0.355]
38 After this paper was accepted, we found that similar works, which apply the same main idea of combining the online learning framework in [16] with the factorization formulation of the nuclear norm, had been published earlier in [17, 18, 23]. [sent-68, score-0.362]
39 Let r denote the intrinsic dimension of the subspace underlying $\{x_i\}_{i=1}^{n}$. [sent-75, score-0.302]
40 For an arbitrary real matrix E, let $\|E\|_F$ denote its Frobenius norm, $\|E\|_1 = \sum_{i,j} |E_{ij}|$ denote the $\ell_1$-norm of E seen as a long vector in $\mathbb{R}^{p \times n}$, and $\|E\|_* = \sum_i \sigma_i(E)$ denote its nuclear norm, i.e., the sum of its singular values. [sent-81, score-0.16]
41 2 Objective Function Formulation Robust PCA (RPCA) aims to accurately estimate the subspace underlying the observed samples, even though the samples are corrupted by gross but sparse noise. [sent-85, score-0.44]
42 However, these optimization methods are implemented in a batch manner. [sent-89, score-0.234]
43 The main difficulty is that the nuclear norm couples all the samples tightly and thus the samples cannot be considered separately as in typical online optimization problems. [sent-95, score-0.52]
44 To overcome this difficulty, we use an equivalent form of the nuclear norm for the matrix X whose rank is upper bounded by r, as follows [21]: $\|X\|_* = \inf_{L \in \mathbb{R}^{p \times r},\, R \in \mathbb{R}^{n \times r}} \left\{ \frac{1}{2}\|L\|_F^2 + \frac{1}{2}\|R\|_F^2 : X = LR^T \right\}$. [sent-96, score-0.254]
45 Namely, the nuclear norm is re-formulated as an explicit low-rank factorization of X. [sent-97, score-0.242]
46 Such nuclear norm factorization is developed in [3] and well established in recent works [22, 21]. [sent-98, score-0.242]
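As a quick numerical sanity check of this variational form (an illustrative NumPy sketch, not code from the paper): the infimum is attained at L = UΣ^{1/2}, R = VΣ^{1/2} for the SVD X = UΣVᵀ, and the value achieved there equals the nuclear norm.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, r = 40, 60, 5
X = rng.standard_normal((p, r)) @ rng.standard_normal((r, n))   # a rank-r matrix

nuclear = np.linalg.svd(X, compute_uv=False).sum()               # ||X||_* via singular values

# Factorization attaining the infimum: L = U sqrt(S), R = V sqrt(S).
U, S, Vt = np.linalg.svd(X, full_matrices=False)
L = U[:, :r] * np.sqrt(S[:r])
R = Vt[:r].T * np.sqrt(S[:r])

assert np.allclose(L @ R.T, X)                                   # feasibility: X = L R^T
value = 0.5 * (np.linalg.norm(L, "fro") ** 2 + np.linalg.norm(R, "fro") ** 2)
print(nuclear, value)                                            # the two numbers agree
```

Any other feasible factorization gives a value at least as large, which is exactly why replacing the nuclear norm with this quadratic penalty on (L, R) lets the samples decouple.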
47 In this decomposition, L ∈ R^{p×r} can be seen as the basis of the low-dimensional subspace and R ∈ R^{n×r} denotes the coefficients of the samples w.r.t. the basis. [sent-99, score-0.348]
48 Given Z = [z_1, . . . , z_n] ∈ R^{p×n}, solving problem (2) indeed minimizes the following empirical cost function, $f_n(L) \triangleq \frac{1}{n}\sum_{i=1}^{n} \ell(z_i, L) + \frac{\lambda_1}{2n}\|L\|_F^2$, (3) where the loss function for each sample is defined as $\ell(z_i, L) \triangleq \min_{r,e}\ \frac{1}{2}\|z_i - Lr - e\|_2^2 + \frac{\lambda_1}{2}\|r\|_2^2 + \lambda_2\|e\|_1$. (4) [sent-115, score-0.138]
49 The loss function measures the representation error for the sample z on a fixed basis L, where the coefficients on the basis r and the sparse noise e associated with each sample are optimized to minimize the loss. [sent-116, score-0.277]
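A minimal sketch of this per-sample step, assuming a simple block-coordinate scheme (this is my own illustrative implementation of the loss in (4), not the paper's Algorithm 1; the helper names are made up): with L fixed, the update for r is a ridge regression and the update for e is entrywise soft-thresholding, each available in closed form.

```python
import numpy as np

def soft_threshold(x, tau):
    """Entrywise soft-thresholding: the proximal operator of tau * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def solve_sample(z, L, lam1, lam2, n_iter=50):
    """Approximately minimize 0.5*||z - L r - e||_2^2 + (lam1/2)*||r||_2^2 + lam2*||e||_1
    over (r, e) for a fixed basis L, by alternating the two closed-form updates."""
    p, k = L.shape
    r, e = np.zeros(k), np.zeros(p)
    G = L.T @ L + lam1 * np.eye(k)                 # ridge system matrix, fixed across iterations
    for _ in range(n_iter):
        r = np.linalg.solve(G, L.T @ (z - e))      # ridge step for the coefficients
        e = soft_threshold(z - L @ r, lam2)        # shrinkage step for the sparse noise
    return r, e

# Toy usage with made-up sizes and weights (the lam1, lam2 values are arbitrary here).
rng = np.random.default_rng(1)
p, k = 100, 5
L = rng.standard_normal((p, k))
z = L @ rng.standard_normal(k)
z[rng.choice(p, size=10, replace=False)] += 10.0   # gross sparse corruption
r, e = solve_sample(z, L, lam1=0.1, lam2=1.0)
print(np.count_nonzero(e))                          # e should pick up the corrupted entries
```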
50 In this work, we first establish a surrogate function for this expected cost and then optimize the surrogate function to obtain the subspace estimate in an online fashion. [sent-121, score-0.513]
51 The main idea is to develop a stochastic optimization algorithm to minimize the empirical cost function (3), which processes one sample per time instance in an online manner. [sent-123, score-0.263]
52 The objective function for updating the basis Lt is defined as $g_t(L) \triangleq \frac{1}{t}\sum_{i=1}^{t}\left( \frac{1}{2}\|z_i - Lr_i - e_i\|_2^2 + \frac{\lambda_1}{2}\|r_i\|_2^2 + \lambda_2\|e_i\|_1 \right) + \frac{\lambda_1}{2t}\|L\|_F^2$. (6) [sent-129, score-0.337]
53 This is a surrogate function of the empirical cost function ft (L) defined in (3), i.e., [sent-130, score-0.14]
54 it provides an upper bound for ft (L): gt (L) ≥ ft (L). [sent-132, score-0.238]
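For illustration, minimizing gt (L) only requires two accumulated statistics, At = Σi ri riᵀ and Bt = Σi (zi − ei) riᵀ, because setting the gradient of (6) to zero gives L(At + λ1 I) = Bt. The sketch below (my own, with made-up names) uses this closed-form solve; as I read it, the paper's Algorithm 2 minimizes the same surrogate by block-coordinate descent over the columns of L with warm restarts, so treat this as a simplified stand-in.

```python
import numpy as np

def update_basis(A, B, lam1):
    """Minimizer of the surrogate g_t over L, given the accumulated statistics
    A = sum_i r_i r_i^T (k x k) and B = sum_i (z_i - e_i) r_i^T (p x k);
    stationarity of (6) gives L (A + lam1 * I) = B."""
    k = A.shape[0]
    return np.linalg.solve(A + lam1 * np.eye(k), B.T).T

# Toy usage: accumulate statistics from a stream of clean samples z = L_true r.
rng = np.random.default_rng(2)
p, k, lam1 = 50, 4, 0.1
A, B = np.zeros((k, k)), np.zeros((p, k))
L_true = rng.standard_normal((p, k))
for _ in range(200):
    r = rng.standard_normal(k)
    z = L_true @ r                      # clean sample, no sparse noise for simplicity
    e = np.zeros(p)
    A += np.outer(r, r)                 # A_t update
    B += np.outer(z - e, r)             # B_t update
L = update_basis(A, B, lam1)
print(np.linalg.norm(L - L_true) / np.linalg.norm(L_true))   # small: L approaches L_true
```

Alternating this update with the per-sample step sketched above for the loss (4) is the whole online procedure; only A (r×r), B (p×r), and L (p×r) need to be kept, which is where the sample-size-independent memory footprint comes from.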
55 The following theorem is the main theoretical result of the paper, which states that the solution from Algorithm 1 will converge to the optimal solution of the batch optimization. [sent-137, score-0.317]
56 Thus, the proposed OR-PCA converges to the correct low-dimensional subspace even in the presence of sparse noise, as long as the batch version – PCP – works. [sent-138, score-0.542]
57 Assuming the rank of the optimal solution to (5) is given as r and the solution Lt ∈ R^{p×r} provided by Algorithm 1 is full rank, Lt converges to the optimal solution of (5) asymptotically. [sent-141, score-0.291]
58 Theorem 2 (Convergence of the surrogate function gt ). [sent-171, score-0.231]
59 Let gt denote the surrogate function defined in (6). [sent-172, score-0.231]
60 Then, gt (Lt ) converges almost surely when the solution Lt is given by Algorithm 1. [sent-173, score-0.329]
61 We prove the convergence of the stochastic positive process gt (Lt ) > 0 by showing that it is a quasi-martingale. [sent-176, score-0.193]
62 We first show that the sum of the positive differences of gt (Lt ) is bounded, utilizing the fact that gt (Lt ) upper bounds the empirical cost ft (Lt ) and that the loss function ℓ(zt , Lt ) is Lipschitz. [sent-177, score-0.413]
63 Applying the lemma from [8] about the convergence of quasi-martingales, we conclude that gt (Lt ) converges. [sent-179, score-0.168]
64 To prove the above result, we first show that the function gt (L) is strictly convex. [sent-184, score-0.198]
65 Then we further show that the variation of the function gt (L), gt (Lt ) − gt+1 (Lt ), is Lipschitz if using the updating rule shown in Algorithm 2. [sent-187, score-0.366]
66 In the third step, we show that the expected cost function f (Lt ) is smooth, and the difference f (Lt ) − gt (Lt ) goes to zero as t → ∞. [sent-189, score-0.21]
67 Let gt denote the surrogate function defined in (6). [sent-205, score-0.231]
68 Then, 1) f (Lt ) − gt (Lt ) converges almost surely to 0; and 2) f (Lt ) converges almost surely, when the solution Lt is given by Algorithm 1. [sent-206, score-0.392]
69 When the solution L satisfies the first-order condition for minimizing the objective function in (5), the obtained solution L is the optimal solution of problem (5) if L is full rank. [sent-213, score-0.15]
70 Combining Theorem 5 and Theorem 6 directly yields Theorem 1 – the solution from Algorithm 1 converges to the optimal solution of Problem (5) asymptotically. [sent-214, score-0.163]
71 Due to space constraints, more results, including those on subspace tracking, are deferred to the supplementary material. [sent-216, score-0.225]
72 1 Medium-scale Robust PCA We here evaluate the ability of the proposed OR-PCA to correctly recover the subspace of corrupted observations, under various settings of the intrinsic subspace dimension and error density. [sent-218, score-0.606]
73 In particular, we adopt the batch robust PCA method, Principal Component Pursuit [4], as the batch counterpart. [sent-219, score-0.517]
74 Figure 1: (a) and (b): subspace recovery performance under different corruption fractions ρs (vertical axis) and rank/n (horizontal axis). [sent-253, score-0.358]
75 Brighter color means better performance; (c) and (d): the performance comparison of the OR-PCA, GRASTA, and online PCA methods against the number of revealed samples under two different corruption levels ρs, with PCP as reference. [sent-254, score-0.333]
76 PCP estimates the subspace in a batch manner through solving the problem in (1) and outputs the low-rank data matrix. [sent-256, score-0.446]
77 Here U is the basis of the subspace and the intrinsic dimension of the subspace spanned by U is r. [sent-262, score-0.585]
78 Namely, the matrix E contains gross but sparse errors. [sent-265, score-0.142]
79 We run the OR-PCA and the PCP algorithms 10 times under the following settings: the ambient dimension and number of samples are set as p = 400 and n = 1, 000; the intrinsic rank r of the subspace varies from 4 to 200; the value of error fraction, ρs , varies from very sparse 0. [sent-266, score-0.528]
80 The performance is evaluated by the similarity between the subspace obtained from the algorithms and the groundtruth. [sent-270, score-0.225]
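A sketch of this simulation setup (illustrative only: the exact sampling distributions, corruption magnitude, and similarity score are not fully specified in the excerpt above, so the Gaussian basis and coefficients, the uniformly located corruption, and the principal-angle similarity used below are my assumptions):

```python
import numpy as np

def make_data(p=400, n=1000, r=10, rho_s=0.1, noise_mag=10.0, seed=0):
    """Generate samples lying on a rank-r subspace, corrupted by gross but sparse errors."""
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((p, r))                  # subspace basis
    V = rng.standard_normal((n, r))                  # coefficients
    E = np.zeros((p, n))
    mask = rng.random((p, n)) < rho_s                # corruption fraction rho_s
    E[mask] = noise_mag * rng.standard_normal(mask.sum())
    Z = U @ V.T + E
    return Z, U

def subspace_similarity(L, U):
    """Similarity between span(L) and span(U) via principal angles; 1 means identical."""
    QL, _ = np.linalg.qr(L)
    QU, _ = np.linalg.qr(U)
    cosines = np.linalg.svd(QL.T @ QU, compute_uv=False)
    return float(np.mean(cosines ** 2))

Z, U_true = make_data()
print(Z.shape, subspace_similarity(U_true, U_true))   # similarity of a subspace with itself is 1
```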
81 The results demonstrate that under relatively low intrinsic dimension (small rank/n) and sparse corruption (small ρs ), OR-PCA is able to recover the subspace nearly perfectly (E. [sent-280, score-0.467]
82 This demonstrates that the proposed OR-PCA method achieves comparable performance with the batch method and verifies our convergence guarantee on the OR-PCA. [sent-284, score-0.216]
83 In the relatively difficult setting (high intrinsic dimension and dense error, shown in the top-right of the matrix), OR-PCA performs slightly worse than the PCP, possibly because the number of streaming samples is not enough to achieve convergence. [sent-285, score-0.142]
84 To better demonstrate the robustness of OR-PCA to corruptions and illustrate how the performance of OR-PCA is improved when more samples are revealed, we plot the performance curve of OR-PCA against the number of samples in Figure 1(c), under the setting of p = 400, n = 1, 000, ρs = 0. [sent-286, score-0.211]
85 We observe that when more samples are revealed, both OR-PCA and GRASTA steadily improve the subspace recovery. [sent-290, score-0.29]
86 However, our proposed OR-PCA converges much faster than GRASTA, possibly because in each iteration OR-PCA obtains the optimal closed-form solution to the basis updating subproblem while GRASTA only takes one gradient descent step. [sent-291, score-0.246]
87 To show the robustness of the proposed OR-PCA, we also plot the performance of the standard online (or incremental) PCA [1] for comparison. [sent-297, score-0.171]
88 We observe that as expected, the online PCA cannot recover the subspace correctly (E. [sent-302, score-0.345]
89 In fact, from other simulation results under different settings of intrinsic rank and corruption level (see supplementary material), we observe that GRASTA breaks down at 25% corruption (the value of E. [sent-318, score-0.296]
90 Note that batch RPCA cannot process these data because it runs out of memory. [sent-331, score-0.194]
91 In each iteration, batch PCP needs to perform an SVD plus a thresholding operation, whose complexity is O(np^2). [sent-384, score-0.216]
92 In contrast, for OR-PCA, the computational cost of each iteration is O(pr^2), which is independent of the sample size and linear in the ambient dimension. [sent-385, score-0.145]
93 This is much smaller than the memory cost of the batch PCP algorithm (O(pn)), where n ≫ p for large-scale datasets. [sent-396, score-0.269]
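To make the comparison concrete, a back-of-the-envelope calculation with the simulation sizes quoted earlier (p = 400, n = 1,000) and an illustrative rank r = 10; the constants hidden in the O(·) notation are ignored, and the online memory figure assumes only the basis and the two accumulated r×r and p×r statistics are stored.

```python
p, n, r = 400, 1000, 10

batch_flops_per_iter    = n * p ** 2    # O(np^2): SVD-dominated cost of one PCP iteration (~1.6e8)
online_flops_per_sample = p * r ** 2    # O(pr^2): cost of processing one sample in OR-PCA (~4e4)

batch_memory  = p * n                   # O(pn): the whole data matrix must be kept (~4e5 entries)
online_memory = 2 * p * r + r ** 2      # O(pr): basis L, statistic B (p x r), statistic A (r x r) (~8e3)

print(batch_flops_per_iter, online_flops_per_sample)
print(batch_memory, online_memory)
```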
94 7 Conclusions In this work, we develop an online robust PCA (OR-PCA) method. [sent-399, score-0.249]
95 Different from previous batch-based methods, OR-PCA need not “remember” all the past samples and achieves much higher storage efficiency. [sent-400, score-0.284]
96 The main idea of OR-PCA is to reformulate the objective function of PCP (a widely applied batch RPCA algorithm) by decomposing the nuclear norm into an explicit product of two low-rank matrices; the resulting problem can then be solved by a stochastic optimization algorithm. [sent-401, score-0.468]
97 We provide the convergence analysis of the OR-PCA method and show that OR-PCA converges to the solution of batch RPCA asymptotically. [sent-402, score-0.307]
98 Robpca: a new approach to robust principal component analysis. [sent-484, score-0.214]
99 Recursive robust pca or recursive sparse recovery in large but structured noise. [sent-527, score-0.507]
100 Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. [sent-535, score-0.22]
wordName wordTfidf (topN-words)
[('lt', 0.497), ('pcp', 0.346), ('grasta', 0.298), ('rpca', 0.298), ('pca', 0.267), ('subspace', 0.225), ('batch', 0.194), ('gt', 0.168), ('robust', 0.129), ('online', 0.12), ('nuclear', 0.116), ('rp', 0.113), ('corruption', 0.105), ('lrt', 0.081), ('tracking', 0.073), ('singapore', 0.069), ('ambient', 0.067), ('factorization', 0.066), ('samples', 0.065), ('incremental', 0.064), ('converges', 0.063), ('surrogate', 0.063), ('sparse', 0.06), ('norm', 0.06), ('basis', 0.058), ('intrinsic', 0.052), ('corrupted', 0.052), ('solution', 0.05), ('lr', 0.048), ('surely', 0.048), ('principal', 0.047), ('matrix', 0.044), ('revealed', 0.043), ('zt', 0.042), ('cost', 0.042), ('gonzalo', 0.041), ('mardani', 0.041), ('mateos', 0.041), ('optimization', 0.04), ('feng', 0.038), ('gross', 0.038), ('component', 0.038), ('proposes', 0.037), ('big', 0.036), ('pursuit', 0.036), ('jiashi', 0.036), ('contamination', 0.036), ('sample', 0.036), ('ft', 0.035), ('rt', 0.035), ('outlier', 0.034), ('rank', 0.034), ('morteza', 0.033), ('zi', 0.033), ('xu', 0.033), ('memory', 0.033), ('dictionary', 0.033), ('decomposing', 0.033), ('rr', 0.033), ('pr', 0.032), ('pitfall', 0.031), ('prove', 0.03), ('comprehensive', 0.03), ('ece', 0.03), ('corruptions', 0.03), ('updating', 0.03), ('arxiv', 0.03), ('cients', 0.029), ('noise', 0.029), ('couples', 0.028), ('warm', 0.028), ('recovery', 0.028), ('ez', 0.027), ('yan', 0.027), ('caramanis', 0.027), ('solving', 0.027), ('recovering', 0.027), ('plot', 0.026), ('lj', 0.026), ('tightly', 0.026), ('stochastic', 0.025), ('storage', 0.025), ('dimension', 0.025), ('robustness', 0.025), ('ei', 0.024), ('bt', 0.024), ('coef', 0.024), ('severely', 0.024), ('theorem', 0.023), ('gradient', 0.023), ('nonnegative', 0.023), ('recursive', 0.023), ('enhance', 0.022), ('robustly', 0.022), ('augmented', 0.022), ('guarantee', 0.022), ('subproblem', 0.022), ('multiplier', 0.022), ('plus', 0.022), ('provided', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 233 nips-2013-Online Robust PCA via Stochastic Optimization
Author: Jiashi Feng, Huan Xu, Shuicheng Yan
Abstract: Robust PCA methods are typically based on batch optimization and have to load all the samples into memory during optimization. This prevents them from efficiently processing big data. In this paper, we develop an Online Robust PCA (OR-PCA) that processes one sample per time instance and hence its memory cost is independent of the number of samples, significantly enhancing the computation and storage efficiency. The proposed OR-PCA is based on stochastic optimization of an equivalent reformulation of the batch RPCA. Indeed, we show that OR-PCA provides a sequence of subspace estimations converging to the optimum of its batch counterpart and hence is provably robust to sparse corruption. Moreover, OR-PCA can naturally be applied for tracking dynamic subspace. Comprehensive simulations on subspace recovering and tracking demonstrate the robustness and efficiency advantages of the OR-PCA over online PCA and batch RPCA methods. 1
2 0.41302031 232 nips-2013-Online PCA for Contaminated Data
Author: Jiashi Feng, Huan Xu, Shie Mannor, Shuicheng Yan
Abstract: We consider the online Principal Component Analysis (PCA) where contaminated samples (containing outliers) are revealed sequentially to the Principal Components (PCs) estimator. Due to their sensitiveness to outliers, previous online PCA algorithms fail in this case and their results can be arbitrarily skewed by the outliers. Here we propose the online robust PCA algorithm, which is able to improve the PCs estimation upon an initial one steadily, even when faced with a constant fraction of outliers. We show that the final result of the proposed online RPCA has an acceptable degradation from the optimum. Actually, under mild conditions, online RPCA achieves the maximal robustness with a 50% breakdown point. Moreover, online RPCA is shown to be efficient for both storage and computation, since it need not re-explore the previous samples as in traditional robust PCA algorithms. This endows online RPCA with scalability for large scale data. 1
3 0.23166892 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints
Author: Yuhong Guo
Abstract: Principal component analysis (PCA), a well-established technique for data analysis and processing, provides a convenient form of dimensionality reduction that is effective for cleaning small Gaussian noises presented in the data. However, the applicability of standard principal component analysis in real scenarios is limited by its sensitivity to large errors. In this paper, we tackle the challenge problem of recovering data corrupted with errors of high magnitude by developing a novel robust transfer principal component analysis method. Our method is based on the assumption that useful information for the recovery of a corrupted data matrix can be gained from an uncorrupted related data matrix. Specifically, we formulate the data recovery problem as a joint robust principal component analysis problem on the two data matrices, with common principal components shared across matrices and individual principal components specific to each data matrix. The formulated optimization problem is a minimization problem over a convex objective function but with non-convex rank constraints. We develop an efficient proximal projected gradient descent algorithm to solve the proposed optimization problem with convergence guarantees. Our empirical results over image denoising tasks show the proposed method can effectively recover images with random large errors, and significantly outperform both standard PCA and robust PCA with rank constraints. 1
4 0.20780893 188 nips-2013-Memory Limited, Streaming PCA
Author: Ioannis Mitliagkas, Constantine Caramanis, Prateek Jain
Abstract: We consider streaming, one-pass principal component analysis (PCA), in the highdimensional regime, with limited memory. Here, p-dimensional samples are presented sequentially, and the goal is to produce the k-dimensional subspace that best approximates these points. Standard algorithms require O(p2 ) memory; meanwhile no algorithm can do better than O(kp) memory, since this is what the output itself requires. Memory (or storage) complexity is most meaningful when understood in the context of computational and sample complexity. Sample complexity for high-dimensional PCA is typically studied in the setting of the spiked covariance model, where p-dimensional points are generated from a population covariance equal to the identity (white noise) plus a low-dimensional perturbation (the spike) which is the signal to be recovered. It is now well-understood that the spike can be recovered when the number of samples, n, scales proportionally with the dimension, p. Yet, all algorithms that provably achieve this, have memory complexity O(p2 ). Meanwhile, algorithms with memory-complexity O(kp) do not have provable bounds on sample complexity comparable to p. We present an algorithm that achieves both: it uses O(kp) memory (meaning storage of any kind) and is able to compute the k-dimensional spike with O(p log p) samplecomplexity – the first algorithm of its kind. While our theoretical analysis focuses on the spiked covariance model, our simulations show that our algorithm is successful on much more general models for the data. 1
5 0.17898272 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA
Author: Vincent Q. Vu, Juhee Cho, Jing Lei, Karl Rohe
Abstract: We propose a novel convex relaxation of sparse principal subspace estimation based on the convex hull of rank-d projection matrices (the Fantope). The convex problem can be solved efficiently using alternating direction method of multipliers (ADMM). We establish a near-optimal convergence rate, in terms of the sparsity, ambient dimension, and sample size, for estimation of the principal subspace of a general covariance matrix without assuming the spiked covariance model. In the special case of d = 1, our result implies the near-optimality of DSPCA (d’Aspremont et al. [1]) even when the solution is not rank 1. We also provide a general theoretical framework for analyzing the statistical properties of the method for arbitrary input matrices that extends the applicability and provable guarantees to a wide array of settings. We demonstrate this with an application to Kendall’s tau correlation matrices and transelliptical component analysis. 1
6 0.15966918 91 nips-2013-Dirty Statistical Models
7 0.15763012 311 nips-2013-Stochastic Convex Optimization with Multiple Objectives
8 0.15007554 286 nips-2013-Robust learning of low-dimensional dynamics from large neural ensembles
9 0.14495356 224 nips-2013-On the Sample Complexity of Subspace Learning
10 0.12998481 325 nips-2013-The Pareto Regret Frontier
11 0.12584588 137 nips-2013-High-Dimensional Gaussian Process Bandits
12 0.12434421 259 nips-2013-Provable Subspace Clustering: When LRR meets SSC
13 0.12142594 191 nips-2013-Minimax Optimal Algorithms for Unconstrained Linear Optimization
14 0.11080999 313 nips-2013-Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization
15 0.10726795 179 nips-2013-Low-Rank Matrix and Tensor Completion via Adaptive Sampling
16 0.10617891 158 nips-2013-Learning Multiple Models via Regularized Weighting
17 0.10028332 314 nips-2013-Stochastic Optimization of PCA with Capped MSG
18 0.099043541 240 nips-2013-Optimization, Learning, and Games with Predictable Sequences
19 0.094730705 222 nips-2013-On the Linear Convergence of the Proximal Gradient Method for Trace Norm Regularization
20 0.091974758 120 nips-2013-Faster Ridge Regression via the Subsampled Randomized Hadamard Transform
topicId topicWeight
[(0, 0.222), (1, 0.04), (2, 0.214), (3, 0.033), (4, -0.054), (5, -0.008), (6, -0.161), (7, 0.17), (8, -0.114), (9, -0.096), (10, -0.116), (11, 0.199), (12, 0.164), (13, 0.226), (14, 0.032), (15, -0.054), (16, 0.024), (17, -0.041), (18, 0.072), (19, -0.09), (20, -0.026), (21, -0.013), (22, -0.05), (23, -0.057), (24, 0.126), (25, 0.126), (26, 0.063), (27, 0.152), (28, 0.039), (29, 0.067), (30, 0.062), (31, 0.032), (32, -0.031), (33, -0.065), (34, 0.056), (35, -0.075), (36, 0.005), (37, 0.037), (38, -0.161), (39, -0.043), (40, -0.096), (41, 0.004), (42, 0.053), (43, -0.043), (44, 0.034), (45, -0.098), (46, 0.039), (47, -0.01), (48, 0.013), (49, 0.075)]
simIndex simValue paperId paperTitle
same-paper 1 0.95174712 233 nips-2013-Online Robust PCA via Stochastic Optimization
Author: Jiashi Feng, Huan Xu, Shuicheng Yan
Abstract: Robust PCA methods are typically based on batch optimization and have to load all the samples into memory during optimization. This prevents them from efficiently processing big data. In this paper, we develop an Online Robust PCA (OR-PCA) that processes one sample per time instance and hence its memory cost is independent of the number of samples, significantly enhancing the computation and storage efficiency. The proposed OR-PCA is based on stochastic optimization of an equivalent reformulation of the batch RPCA. Indeed, we show that OR-PCA provides a sequence of subspace estimations converging to the optimum of its batch counterpart and hence is provably robust to sparse corruption. Moreover, OR-PCA can naturally be applied for tracking dynamic subspace. Comprehensive simulations on subspace recovering and tracking demonstrate the robustness and efficiency advantages of the OR-PCA over online PCA and batch RPCA methods. 1
2 0.93403757 232 nips-2013-Online PCA for Contaminated Data
Author: Jiashi Feng, Huan Xu, Shie Mannor, Shuicheng Yan
Abstract: We consider the online Principal Component Analysis (PCA) where contaminated samples (containing outliers) are revealed sequentially to the Principal Components (PCs) estimator. Due to their sensitiveness to outliers, previous online PCA algorithms fail in this case and their results can be arbitrarily skewed by the outliers. Here we propose the online robust PCA algorithm, which is able to improve the PCs estimation upon an initial one steadily, even when faced with a constant fraction of outliers. We show that the final result of the proposed online RPCA has an acceptable degradation from the optimum. Actually, under mild conditions, online RPCA achieves the maximal robustness with a 50% breakdown point. Moreover, online RPCA is shown to be efficient for both storage and computation, since it need not re-explore the previous samples as in traditional robust PCA algorithms. This endows online RPCA with scalability for large scale data. 1
3 0.74217409 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints
Author: Yuhong Guo
Abstract: Principal component analysis (PCA), a well-established technique for data analysis and processing, provides a convenient form of dimensionality reduction that is effective for cleaning small Gaussian noises presented in the data. However, the applicability of standard principal component analysis in real scenarios is limited by its sensitivity to large errors. In this paper, we tackle the challenge problem of recovering data corrupted with errors of high magnitude by developing a novel robust transfer principal component analysis method. Our method is based on the assumption that useful information for the recovery of a corrupted data matrix can be gained from an uncorrupted related data matrix. Specifically, we formulate the data recovery problem as a joint robust principal component analysis problem on the two data matrices, with common principal components shared across matrices and individual principal components specific to each data matrix. The formulated optimization problem is a minimization problem over a convex objective function but with non-convex rank constraints. We develop an efficient proximal projected gradient descent algorithm to solve the proposed optimization problem with convergence guarantees. Our empirical results over image denoising tasks show the proposed method can effectively recover images with random large errors, and significantly outperform both standard PCA and robust PCA with rank constraints. 1
4 0.67635393 188 nips-2013-Memory Limited, Streaming PCA
Author: Ioannis Mitliagkas, Constantine Caramanis, Prateek Jain
Abstract: We consider streaming, one-pass principal component analysis (PCA), in the highdimensional regime, with limited memory. Here, p-dimensional samples are presented sequentially, and the goal is to produce the k-dimensional subspace that best approximates these points. Standard algorithms require O(p2 ) memory; meanwhile no algorithm can do better than O(kp) memory, since this is what the output itself requires. Memory (or storage) complexity is most meaningful when understood in the context of computational and sample complexity. Sample complexity for high-dimensional PCA is typically studied in the setting of the spiked covariance model, where p-dimensional points are generated from a population covariance equal to the identity (white noise) plus a low-dimensional perturbation (the spike) which is the signal to be recovered. It is now well-understood that the spike can be recovered when the number of samples, n, scales proportionally with the dimension, p. Yet, all algorithms that provably achieve this, have memory complexity O(p2 ). Meanwhile, algorithms with memory-complexity O(kp) do not have provable bounds on sample complexity comparable to p. We present an algorithm that achieves both: it uses O(kp) memory (meaning storage of any kind) and is able to compute the k-dimensional spike with O(p log p) samplecomplexity – the first algorithm of its kind. While our theoretical analysis focuses on the spiked covariance model, our simulations show that our algorithm is successful on much more general models for the data. 1
5 0.67034268 314 nips-2013-Stochastic Optimization of PCA with Capped MSG
Author: Raman Arora, Andy Cotter, Nati Srebro
Abstract: We study PCA as a stochastic optimization problem and propose a novel stochastic approximation algorithm which we refer to as “Matrix Stochastic Gradient” (MSG), as well as a practical variant, Capped MSG. We study the method both theoretically and empirically. 1
6 0.62548041 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA
7 0.572101 120 nips-2013-Faster Ridge Regression via the Subsampled Randomized Hadamard Transform
8 0.48085836 224 nips-2013-On the Sample Complexity of Subspace Learning
9 0.45840731 259 nips-2013-Provable Subspace Clustering: When LRR meets SSC
10 0.44423434 91 nips-2013-Dirty Statistical Models
11 0.44185948 158 nips-2013-Learning Multiple Models via Regularized Weighting
12 0.42672268 117 nips-2013-Fast Algorithms for Gaussian Noise Invariant Independent Component Analysis
13 0.42380342 284 nips-2013-Robust Spatial Filtering with Beta Divergence
14 0.418401 286 nips-2013-Robust learning of low-dimensional dynamics from large neural ensembles
15 0.40969917 261 nips-2013-Rapid Distance-Based Outlier Detection via Sampling
16 0.37641361 137 nips-2013-High-Dimensional Gaussian Process Bandits
17 0.36921489 324 nips-2013-The Fast Convergence of Incremental PCA
18 0.36186332 275 nips-2013-Reservoir Boosting : Between Online and Offline Ensemble Learning
19 0.3616896 80 nips-2013-Data-driven Distributionally Robust Polynomial Optimization
20 0.35483643 256 nips-2013-Probabilistic Principal Geodesic Analysis
topicId topicWeight
[(2, 0.019), (10, 0.012), (16, 0.05), (33, 0.164), (34, 0.091), (36, 0.035), (41, 0.021), (47, 0.03), (49, 0.013), (56, 0.11), (70, 0.039), (85, 0.069), (87, 0.151), (89, 0.073), (93, 0.037), (95, 0.017)]
simIndex simValue paperId paperTitle
1 0.88794166 58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent
Author: Yuening Hu, Jordan Boyd-Graber, Hal Daume III, Z. Irene Ying
Abstract: Discovering hierarchical regularities in data is a key problem in interacting with large datasets, modeling cognition, and encoding knowledge. A previous Bayesian solution—Kingman’s coalescent—provides a probabilistic model for data represented as a binary tree. Unfortunately, this is inappropriate for data better described by bushier trees. We generalize an existing belief propagation framework of Kingman’s coalescent to the beta coalescent, which models a wider range of tree structures. Because of the complex combinatorial search over possible structures, we develop new sampling schemes using sequential Monte Carlo and Dirichlet process mixture models, which render inference efficient and tractable. We present results on synthetic and real data that show the beta coalescent outperforms Kingman’s coalescent and is qualitatively better at capturing data in bushy hierarchies. 1
Hierarchical models can be divided into complementary “fragmentation” and “coagulation” frameworks [7]. Both produce hierarchical partitions of a dataset. Fragmentation models start with a single partition and divide it into ever more specific partitions until only singleton partitions remain. Coagulation frameworks repeatedly merge singleton partitions until only one partition remains. Pitman-Yor diffusion trees [13], a generalization of Dirichlet diffusion trees [14], are an example of a bushy fragmentation model, and they model edge lengths and build non-binary trees. Instead, our focus is on bottom-up coalescent models [8], one of the coagulation models and complementary to diffusion trees, which can also discover hierarchies and edge lengths. In this model, n nodes are observed (we use both observed to emphasize that nodes are known and leaves to emphasize topology). These observed nodes are generated through some unknown tree with latent edges and unobserved internal nodes. Each node (both observed and latent) has a single parent. The convention in such models is to assume our observed nodes come at time t = 0, and at time −∞ all nodes share a common ur-parent through some sequence of intermediate parents. Consider a set of n individuals observed at the present (time t = 0). All individuals start in one of n singleton sets. After time ti , a set of these nodes coalesce into a new node. Once a set merges, their parent replaces the original nodes. This is called a coalescent event. This process repeats until there is only one node left, and a complete tree structure π (Figure 1) is obtained. Different coalescents are defined by different probabilities of merging a set of nodes. This is called the coalescent rate, defined by a general family of coalescents: the lambda coalescent [7, 15]. We represent the rate via the symbol λk , the rate at which k out of n nodes merge into a parent node. n From a collection of n nodes, k ≤ n can coalesce at some coalescent event (k can be different for different coalescent events). The rate of a fraction γ of the nodes coalescing is given by γ −2 Λ(dγ), where Λ(dγ) is a finite measure on [0, 1]. So k nodes merge at rate 1 λk = n γ k−2 (1 − γ)n−k Λ(dγ) (2 ≤ k ≤ n). (1) 0 Choosing different measures yields different coalescents. A degenerate Dirac delta measure at 0 results in Kingman’s coalescent [6], where λk is 1 when k = 2 and zero otherwise. Because this n gives zero probability to non-binary coalescent events, this only creates binary trees. Alternatively, using a beta distribution BETA(2 − α, α) as the measure Λ yields the beta coalescent. When α is closer to 1, the tree is bushier; as α approaches 2, it becomes Kingman’s coalescent. If we have ni−1 nodes at time ti−1 in a beta coalescent, the rate λkii−1 for a children set of ki nodes at time n ti and the total rate λni−1 of any children set merging—summing over all possible mergers—is λkii−1 = n Γ(ki − α)Γ(ni−1 − ki + α) and λni−1 = Γ(2 − α)Γ(α)Γ(ni−1 ) ni−1 ni−1 ki λkii−1 . n (2) ki =2 Each coalescent event also has an edge length—duration—δi . The duration of an event comes from an exponential distribution, δi ∼ exp(λni−1 ), and the parent node forms at time ti = ti−1 − δi . Shorter durations mean that the children more closely resemble their parent (the mathematical basis for similarity is specified by a transition kernel, Section 3). Analogous to Kingman’s coalescent, the prior probability of a complete tree π is the product of all of its constituent coalescent events i = 1, . . . 
m, merging ki children after duration δi , m m λkii−1 · exp(−λni−1 δi ). n p(ki |ni−1 ) · p(δi |ki , ni−1 ) = p(π) = i=1 Merge ki nodes After duration δi (3) i=1 1 This is appropriate where the entirety of a population is known—both ancestors and descendants. We focus on the case where only the descendants are known. For a concrete example, see Section 5.2. 2 Algorithm 1 MCMC inference for generating a tree 1: for Particle s = 1, 2, · · · , S do s 2: Initialize ns = n, i = 0, ts = 0, w0 = 1. 0 s 3: Initialize the node set V = {ρ0 , ρ1 , · · · , ρn }. 4: while ∃s ∈ {1 · · · S} where ns > 1 do 5: Update i = i + 1. 6: for Particle s = 1, 2, · · · , S do (a) Kingman’s coalescent 7: if ns == 1 then 8: Continue. s 9: Propose a duration δi by Equation 10. s 10: Set coalescent time ts = ts − δi . i i−1 11: Sample partitions ps from DPMM. i s 12: Propose a set ρci according to Equation 11. s 13: Update weight wi by Equation 13. s s s 14: Update n = n − |ρci | + 1. s (b) the beta coalescent 15: Remove ρci from V s , add ρs to V s . i 16: Compute effective sample size ESS [16]. Figure 1: The beta coalescent can merge four simi17: if ESS < S/2 then lar nodes at once, while Kingman’s coalescent only 18: Resample particles [17]. merges two each time. 3 Beta Coalescent Belief Propagation The beta coalescent prior only depends on the topology of the tree. In real clustering applications, we also care about a node’s children and features. In this section, we define the nodes and their features, and then review how we use message passing to compute the probabilities of trees. An internal node ρi is defined as the merger of other nodes. The children set of node ρi , ρci , coalesces into a new node ρi ≡ ∪b∈ci ρb . This encodes the identity of the nodes that participate in specific coalescent events; Equation 3, in contrast, only considers the number of nodes involved in an event. In addition, each node is associated with a multidimensional feature vector yi . Two terms specify the relationship between nodes’ features: an initial distribution p0 (yi ) and a transition kernel κti tb (yi , yb ). The initial distribution can be viewed as a prior or regularizer for feature representations. The transition kernel encourages a child’s feature yb (at time tb ) to resemble feature yi (formed at ti ); shorter durations tb − ti increase the resemblance. Intuitively, the transition kernel can be thought as a similarity score; the more similar the features are, the more likely nodes are. For Brownian diffusion (discussed in Section 4.3), the transition kernel follows a Gaussian distribution centered at a feature. The covariance matrix Σ is decided by the mutation rate µ [18, 9], the probability of a mutation in an individual. Different kernels (e.g., multinomial, tree kernels) can be applied depending on modeling assumptions of the feature representations. To compute the probability of the beta coalescent tree π and observed data x, we generalize the belief propagation framework used by Teh et al. [9] for Kingman’s coalescent; this is a more scalable alternative to other approaches for computing the probability of a Beta coalescent tree [19]. We define a subtree structure θi = {θi−1 , δi , ρci }, thus the tree θm after the final coalescent event m is a complete tree π. The message for node ρi marginalizes over the features of the nodes in its children set.2 The total message for a parent node ρi is −1 Mρi (yi ) = Zρi (x|θi ) κti tb (yi , yb )Mρb (yb )dyb . 
(4) b∈ci where Zρi (x|θi ) is the local normalizer, which can be computed as the combination of initial distribution and messages from a set of children, Zρi (x|θi ) = p0 (yi ) κti tb (yi , yb )Mρb (yb )dyb dyi . b∈ci 2 When ρb is a leaf, the message Mρb (yb ) is a delta function centered on the observation. 3 (5) Recursively performing this marginalization through message passing provides the joint probability of a complete tree π and the observations x. At the root, Z−∞ (x|θm ) = p0 (y−∞ )κ−∞,tm (y−∞ , ym )Mρm (ym )dym dy−∞ (6) where p0 (y−∞ ) is the initial feature distribution and m is the number of coalescent events. This gives the marginal probability of the whole tree, m p(x|π) = Z−∞ (x|θm ) Zρi (x|θi ), (7) i=1 The joint probability of a tree π combines the prior (Equation 3) and likelihood (Equation 7), m λkii−1 exp(−λni−1 δi ) · Zρi (x|θi ). n p(x, π) = Z−∞ (x|θm ) (8) i=1 3.1 Sequential Monte Carlo Inference Sequential Monte Carlo (SMC)—often called particle filters—estimates a structured sequence of hidden variables based on observations [20]. For coalescent models, this estimates the posterior distribution over tree structures given observations x. Initially (i = 0) each observation is in a singleton cluster;3 in subsequent particles (i > 0), points coalesce into more complicated tree s structures θi , where s is the particle index and we add superscript s to all the related notations to distinguish between particles. We use sequential importance resampling [21, SIR] to weight each s particle s at time ti , denoted as wi . The weights from SIR approximate the posterior. Computing the weights requires a conditional distris s s bution of data given a latent state p(x|θi ), a transition distribution between latent states p(θi |θi−1 ), s s and a proposal distribution f (θi |θi−1 , x). Together, these distributions define weights s s wi = wi−1 s s s p(x | θi )p(θi | θi−1 ) . s s f (θi | θi−1 , x) (9) Then we can approximate the posterior distribution of the hidden structure using the normalized weights, which become more accurate with more particles. To apply SIR inference to belief propagation with the beta coalescent prior, we first define the particle s space structure. The sth particle represents a subtree θi−1 at time ts , and a transition to a new i−1 s s s s subtree θi takes a set of nodes ρci from θi−1 , and merges them at ts , where ts = ts − δi and i i i−1 s s s s s θi = {θi−1 , δi , ρci }. Our proposal distribution must provide the duration δi and the children set ρsi c s to merge based on the previous subtree θi−1 . s We propose the duration δi from the prior exponential distribution and propose a children set from the posterior distribution based on the local normalizers. 4 This is the “priorpost” method in Teh et al. [9]. However, this approach is intractable. Given ni−1 nodes at time ti , we must consider all possible n children sets ni−1 + ni−1 + · · · + ni−1 . The computational complexity grows from O(n2 ) i−1 i−1 2 3 ni−1 (Kingman’s coalescent) to O(2 ) (beta coalescent). 4 Efficiently Finding Children Sets with DPMM We need a more efficient way to consider possible children sets. Even for Kingman’s coalescent, which only considers pairs of nodes, Gorur et al. [22] do not exhaustively consider all pairs. Instead, they use data structures from computational geometry to select the R closest pairs as their restriction set, reducing inference to O(n log n). 
While finding closest pairs is a traditional problem in computational geometry, discovering arbitrary-sized sets is less studied. 3 The relationship between time and particles is non-intuitive. Time t goes backward with subsequent particles. When we use time-specific adjectives for particles, this is with respect to inference. 4 This is a special case of Section 4.2’s algorithm, where the restriction set Ωi is all possible subsets. 4 In this section, we describe how we use a Dirichlet process mixture model [23, DPMM] to discover a restriction set Ω, integrating DPMMs into the SMC proposal. We first briefly review what DPMMs are, describe why they are attractive, and then describe how we incorporate DPMMs in SMC inference. The DPMM is defined by a concentration β and a base distribution G0 . A distribution over mixtures is drawn from a Dirichlet process (DP): G ∼ DP(β, G0 ). Each observation xi is assigned to a mixture component µi drawn from G. Because the Dirichlet process is a discrete distribution, observations i and j can have the same mixture component (µi = µj ). When this happens, points are said to be in the same partition. Posterior inference can discover a distribution over partitions. A full derivation of these sampling equations appears in the supplemental material. 4.1 Attractive Properties of DPMMs DPMM s and Coalescents Berestycki et al. [8] showed that the distribution over partitions in a Dirichlet process is equivalent to the distribution over coalescents’ allelic partitions—the set of members that have the same feature representation—when the mutation rate µ of the associated kernel is half of the Dirichlet concentration β (Section 3). For Brownian diffusion, we can connect DPMM with coalescents by setting the kernel covariance Σ = µI to Σ = β/2I. The base distribution G0 is also related with nodes’ feature. The base distribution G0 of a Dirichlet process generates the probability measure G for each block, which generates the nodes in a block. As a result, we can select a base distribution which fits the distribution of the samples in coalescent process. For example, if we use Gaussian distribution for the transition kernel and prior, a Gaussian is also appropriate as the DPMM base distribution. Effectiveness as a Proposal The necessary condition for a valid proposal [24] is that it should have support on a superset of the true posterior. In our case, the distribution over partitions provided by the DPMM considers all possible children sets that could be merged in the coalescent. Thus the new proposal with DPMM satisfies this requirement, and it is a valid proposal. In addition, Chen [25] gives a set of desirable criteria for a good proposal distribution: accounts for outliers, considers the likelihood, and lies close to the true posterior. The DPMM fulfills these criteria. First, the DPMM provides a distribution over all partitions. Varying the concentration parameter β can control the length of the tail of the distribution over partitions. Second, choosing the base distribution of the DPMM appropriately models the feature likelihood; i.e., ensuring the DPMM places similar nodes together in a partition with high probability. Third, the DPMM qualitatively provides reasonable children sets when compared with exhaustively considering all children sets (Figure 2(c)). 4.2 Incorporating DPMM in SMC Proposals To address the inference intractability in Section 3.1, we use the DPMM to obtain a distribution over partitions of nodes. 
Each partition contains clusters of nodes, and we take a union over all partitions to create a restriction set Ωi = {ωi1 , ωi2 , · · · }, where each ωij is a subset of the ni−1 nodes. A standard Gibbs sampler provides these partitions (see supplemental). s With this restriction set Ωi , we propose the duration time δi from the exponential distribution and s propose a children set ρci based on the local normalizers s s Zρi (x|θi−1 , δi , ρsi ) c s s s s s s · I ρci ∈ Ωs , (11) fi (ρci |δi , θi−1 ) = fi (δi ) = λs i−1 exp(−λs i−1 δi ) (10) i n n Z0 s where Ωs restricts the candidate children sets, I is the indicator, and we replace Zρi (x|θi ) with i s s s Zρi (x|θi−1 , δi , ρci ) since they are equivalent here. The normalizer is Z0 = ρc s s Zρi (x|θi−1 , δi , ρc ) · I [ρc ∈ Ωs ] = i ρc ∈Ωs i s s Zρi (x|θi−1 , δi , ρc ). (12) Applying the true distribution (the ith multiplicand from Equation 8) and the proposal distribution (Equation 10 and Equation 11) to the SIR weight update (Equation 9), |ρs | c s s wi = wi−1 i λni−1 · ρc ∈Ωs i s s Zρi (x|θi−1 , δi , ρc ) λs i−1 n 5 , (13) |ρs | c s i where |ρsi | is the size of children set ρci ; parameter λni−1 is the rate of the children set ρsi (Equac c s tion 2); and λni−1 is the rate of all possible sets given a total number of nodes ni−1 (Equation 2). We can view this new proposal as a coarse-to-fine process: DPMM proposes candidate children sets; SMC selects a children set from DPMM to coalesce. Since the coarse step is faster and filters “bad” children sets, the slower finer step considers fewer children sets, saving computation time (Algorithm 1). If Ωi has all children sets, it recovers exhaustive SMC. We estimate the effective sample size [16] and resample [17] when needed. For smaller sets, the DPMM is sometimes impractical (and only provides singleton clusters). In such cases it is simpler to enumerate all children sets. 4.3 Example Transition Kernel: Brownian Diffusion This section uses Brownian diffusion as an example for message passing framework. The initial distribution p0 (y) of each node is N (0, ∞); the transition kernel κti tb (y, ·) is a Gaussian centered at y with variance (ti − tb )Σ, where Σ = µI, µ = β/2, β is the concentration parameter of DPMM. Then the local normalizer Zρi (x|θi ) is Zρi (x|θi ) = N (yi ; 0, ∞) b∈ci N (yi ; yb , Σ(vρb + tb − ti ))dyi , ˆ (14) and the node message Mρi (yi ) is normally distributed Mρi (yi ) ∼ N (yi ; yρi , Σvρi ), where ˆ vρi = 5 b∈ci (vρb + tb − ti )−1 −1 , yρi = ˆ b∈ci yρb ˆ vρ b + t b − t i vρ i . Experiments: Finding Bushy Trees In this section, we compare trees built by the beta coalescent (beta) against those built by Kingman’s coalescent (kingman) and hierarchical agglomerative clustering [26, hac] on both synthetic and real data. We show beta performs best and can capture data in more interpretable, bushier trees. Setup The parameter α for the beta coalescent is between 1 and 2. The closer α is to 1, bushier the tree is, and we set α = 1.2.5 We set the mutation rate as 1, thus the DPMM parameter is initialized as β = 2, and updated using slice sampling [27]. All experiments use 100 initial iterations of DPMM inference with 30 more iterations after each coalescent event (forming a new particle). Metrics We use three metrics to evaluate the quality of the trees discovered by our algorithm: purity, subtree and path length. The dendrogram purity score [28, 10] measures how well the leaves in a subtree belong to the same class. 
5 Experiments: Finding Bushy Trees

In this section, we compare trees built by the beta coalescent (beta) against those built by Kingman's coalescent (kingman) and hierarchical agglomerative clustering [26, hac] on both synthetic and real data. We show that beta performs best and captures the data in more interpretable, bushier trees.

Setup. The parameter α for the beta coalescent lies between 1 and 2; the closer α is to 1, the bushier the tree, and we set α = 1.2 (footnote 5). We set the mutation rate to 1, so the DPMM concentration is initialized as β = 2 and updated using slice sampling [27]. All experiments use 100 initial iterations of DPMM inference, with 30 more iterations after each coalescent event (forming a new particle).

5 With DPMM proposals, α has a negligible effect, so we elide further analysis for different α values.

Metrics. We use three metrics to evaluate the quality of the trees discovered by our algorithm: purity, subtree score, and path length. The dendrogram purity score [28, 10] measures how well the leaves in a subtree belong to the same class: for any two leaf nodes with the same class label, we find their least common subsumer node s and, within the subtree rooted at s, measure the fraction of leaves that share that label; the score averages this fraction over all such pairs. The subtree score [9] is the ratio between the number of internal nodes with all children in the same class and the total number of internal nodes. The path length score is the average difference, over all pairs, between the lowest-common-subsumer distance in the true tree and in the generated tree, where the lowest-common-subsumer distance of two nodes is the distance between the root and their lowest common subsumer. For purity and subtree, higher is better; for path length, lower is better. Scores are reported in expectation over particles and averaged across chains. A sketch of the purity computation follows.
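The following is a minimal sketch of the dendrogram purity score as described above; the tree encoding (a parent map plus children lists over node ids) and the function names are illustrative choices of ours, not the evaluation code used in the paper.

```python
import itertools

def leaves_under(node, children):
    """All leaf ids in the subtree rooted at `node`;
    `children[node]` is a list of child ids (empty or absent for leaves)."""
    kids = children.get(node, [])
    if not kids:
        return [node]
    out = []
    for c in kids:
        out.extend(leaves_under(c, children))
    return out

def least_common_subsumer(a, b, parent):
    """Lowest common ancestor of nodes a and b given a parent map
    (the root is absent from `parent`)."""
    ancestors = set()
    node = a
    while node is not None:
        ancestors.add(node)
        node = parent.get(node)
    node = b
    while node not in ancestors:
        node = parent[node]
    return node

def dendrogram_purity(children, parent, labels):
    """Average, over leaf pairs sharing a class, of the fraction of leaves
    of that class in the subtree rooted at the pair's least common subsumer.
    `labels` maps each leaf id to its class."""
    total, count = 0.0, 0
    for a, b in itertools.combinations(list(labels), 2):
        if labels[a] != labels[b]:
            continue
        s = least_common_subsumer(a, b, parent)
        subtree_leaves = leaves_under(s, children)
        same = sum(labels[x] == labels[a] for x in subtree_leaves)
        total += same / len(subtree_leaves)
        count += 1
    return total / count if count else 0.0
```

For the scores reported below, this computation would be run per particle and then averaged, as described above.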
5.1 Synthetic Hierarchies

To test our inference method, we generated synthetic data with edge lengths (full details are in the supplemental material); we assume each child of the root has a unique label and that all of its descendants share that label (the root itself excepted). We compared beta against kingman and hac by varying the number of observations (Figure 2(a)) and the feature dimension (Figure 2(b)). In both cases, beta is comparable to kingman and hac (hac provides no edge lengths). While increasing the feature dimension improves both scores, more observations do not: for synthetic data, a small number of observations suffices to construct a good tree.

[Figure 2: panels of purity and path-length scores; (a) increasing observations, (b) increasing dimension, (c) beta vs. enum.] Figure 2: Figures 2(a) and 2(b) show the effect of changing the underlying data size and the feature dimension. Figure 2(c) shows that our DPMM proposal for children sets is comparable to an exhaustive enumeration of all possible children sets (enum).

To evaluate the effectiveness of the DPMM as a proposal distribution, we compare against exhaustively enumerating all candidate children sets (enum) while keeping the rest of the SMC unchanged; this experiment uses ten data points, since enum is completely intractable on larger data. Beta uses the DPMM and achieves similar accuracy (Figure 2(c)) while greatly improving efficiency.

5.2 Human Tissue Development

Our first real dataset is based on the developmental biology of human tissues. As a human develops, tissues specialize, starting from three embryonic germ layers: the endoderm, ectoderm, and mesoderm. These eventually form all human tissues. For example, one developmental pathway is ectoderm → neural crest → cranial neural crest → optic vesicle → cornea. Because each germ layer specializes into many different types of cells at specific times, it is inappropriate to model this development as a binary tree or with clustering models that lack path lengths. Historically, uncovering these specialization pathways has been a painstaking process, requiring inspection of embryos at many stages of development; however, massively parallel sequencing data make it possible to efficiently form developmental hypotheses based on similar patterns of gene expression.

To investigate this question, we use the transcriptomes of 27 tissues with known, unambiguous, time-specific lineages [29]. We reduce the original 182,727 dimensions via principal component analysis [30, PCA]. We use five chains with five particles per chain. Using reference developmental trees, beta performs better on all three scores (Table 1) because beta builds a bushy hierarchy more similar to the true tree. The tree recovered by beta (Figure 3) reflects human development. The first major differentiation is the division of embryonic cells into three layers of tissue: endoderm, mesoderm, and ectoderm; these go on to form almost all adult organs and cells. The placenta (magenta), however, forms from a fourth cell type, the trophoblast, and is placed in its own cluster at the root of the tree. The model also successfully captures the ectodermal tissue lineage. However, mesodermal and endodermal tissues, which are highly diverse, do not cluster as well. Tissues known to secrete endocrine hormones (dashed borders) cluster together.

5.3 Clustering 20-newsgroups Data

Following Heller et al. [10], we also compare the three models on 20-newsgroups (footnote 6), a multilevel hierarchy that first divides into general areas (rec, space, and religion) before specializing into areas such as baseball or hockey (footnote 7). This true hierarchy is inset in the bottom right of Figure 4, and we assume each edge has the same length. We apply latent Dirichlet allocation [31] with 50 topics to this corpus and use the topic distribution of each document as the document's features. We use five chains with eighty particles per chain.

6 http://qwone.com/~jason/20Newsgroups/
7 We use the "rec.autos", "rec.sport.baseball", "rec.sport.hockey", and "sci.space" newsgroups and, in contrast to Heller et al. [10], also add "soc.religion.christian".
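The feature construction for both datasets can be sketched with standard tools as follows. The paper does not state the number of principal components retained, the vectorizer settings, or which LDA implementation was used, so those choices below (20 components, a 5,000-word vocabulary, scikit-learn's LDA) are assumptions for illustration only.

```python
from sklearn.decomposition import PCA, LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

# Tissue features: PCA on the 182,727-dimensional transcriptome matrix.
# `expression` is an (n_tissues x n_probes) array loaded elsewhere;
# retaining 20 components is our choice, not the paper's.
def tissue_features(expression, n_components=20):
    return PCA(n_components=n_components).fit_transform(expression)

# 20-newsgroups features: per-document topic proportions from 50-topic LDA.
groups = ["rec.autos", "rec.sport.baseball", "rec.sport.hockey",
          "sci.space", "soc.religion.christian"]
docs = fetch_20newsgroups(subset="train", categories=groups,
                          remove=("headers", "footers", "quotes"))
counts = CountVectorizer(max_features=5000,
                         stop_words="english").fit_transform(docs.data)
lda = LatentDirichletAllocation(n_components=50, random_state=0)
doc_features = lda.fit_transform(counts)  # each row is a topic distribution
```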
[Figure 3: one sample hierarchy of the 27 human tissues from beta (brain regions, heart, lung, placenta, etc.), leaves colored by germ-layer origin.] Figure 3: One sample hierarchy of human tissues from beta. Color indicates the germ-layer origin of each tissue; a dashed border indicates secretory function. While neural tissues from the ectoderm are clustered correctly, some mesoderm and endoderm tissues are commingled. The model also tends to place secretory tissues together and higher in the tree.

[Figure 4: one sample hierarchy of the 20-newsgroups documents from beta, each document colored by class (rec.autos, rec.sport.baseball, rec.sport.hockey, sci.space, soc.religion.christian); the true hierarchy is inset.] Figure 4: One sample hierarchy of the 20-newsgroups data from beta. Each small square is a document colored by its class label; large rectangles represent subtrees with all enclosed documents as leaf nodes. Most documents from the same group are clustered together; the three "rec" groups merge first and are then merged with the religion and space groups.

Table 1: Comparing the three models: beta performs best on all three scores.

                        Biological Data                          20-newsgroups Data
            hac     kingman          beta             hac     kingman          beta
purity ↑    0.453   0.474 ± 0.029    0.492 ± 0.028    0.465   0.510 ± 0.047    0.565 ± 0.081
subtree ↑   0.240   0.302 ± 0.033    0.331 ± 0.050    0.571   0.651 ± 0.013    0.720 ± 0.013
length ↓    −       0.654 ± 0.041    0.586 ± 0.051    −       0.477 ± 0.027    0.333 ± 0.047

As with the biological data, beta performs best on all scores for 20-newsgroups. Figure 4 shows a bushy tree built by beta, which mostly recovers the true hierarchy: documents within a newsgroup merge first, then the three "rec" groups, followed by the "space" and "religion" groups. We only use topic distributions as features, so better results may be possible with more comprehensive features.

6 Conclusion

This paper generalizes Bayesian hierarchical clustering, moving from Kingman's coalescent to the beta coalescent. Our novel inference scheme, based on SMC and DPMM proposals, makes this generalization practical and efficient. The new model provides bushier trees, often a more realistic view of data. While we only consider real-valued vectors, which we model with the ubiquitous Gaussian, other likelihoods might be better suited to other applications; for discrete data such as in natural language processing, a multinomial likelihood may be more appropriate. This is a straightforward extension of our model via other transition kernels and DPMM base distributions.

Recent work uses the coalescent as a means of producing a clustering in tandem with a downstream task such as classification [32]. Hierarchies are often taken a priori in natural language processing. Particularly for linguistic tasks, a fully statistical model like the beta coalescent that jointly learns the hierarchy and a downstream task could improve performance in dependency parsing [33] (clustering parts of speech), multilingual sentiment [34] (finding sentiment-correlated words across languages), or topic modeling [35] (finding coherent words that should co-occur in a topic).

Acknowledgments

We would like to thank the anonymous reviewers for their helpful comments, and thank Héctor Corrada Bravo for pointing us to the human tissue data. This research was supported by NSF grant #1018625. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsor.

References

[1] Kaufman, L., P. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley, 1990.
[2] Jain, A. K. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8):651–666, 2010.
[3] Brown, P. F., V. J. D. Pietra, P. V. deSouza, et al. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479, 1992.
[4] Bergen, J., P. Anandan, K. Hanna, et al. Hierarchical model-based motion estimation. In ECCV, 1992.
[5] Girvan, M., M. E. J. Newman. Community structure in social and biological networks. PNAS, 99:7821–7826, 2002.
[6] Kingman, J. F. C. On the genealogy of large populations. Journal of Applied Probability, 19:27–43, 1982.
[7] Pitman, J. Coalescents with multiple collisions. The Annals of Probability, 27:1870–1902, 1999.
[8] Berestycki, N. Recent progress in coalescent theory. In Ensaios Matemáticos, vol. 16, 2009.
[9] Teh, Y. W., H. Daumé III, D. M. Roy. Bayesian agglomerative clustering with coalescents. In NIPS, 2008.
[10] Heller, K. A., Z. Ghahramani. Bayesian hierarchical clustering. In ICML, 2005.
[11] Blundell, C., Y. W. Teh, K. A. Heller. Bayesian rose trees. In UAI, 2010.
[12] Adams, R., Z. Ghahramani, M. Jordan. Tree-structured stick breaking for hierarchical data. In NIPS, 2010.
[13] Knowles, D., Z. Ghahramani. Pitman-Yor diffusion trees. In UAI, 2011.
[14] Neal, R. M. Density modeling and clustering using Dirichlet diffusion trees. Bayesian Statistics, 7:619–629, 2003.
[15] Sagitov, S. The general coalescent with asynchronous mergers of ancestral lines. Journal of Applied Probability, 36:1116–1125, 1999.
[16] Neal, R. M. Annealed importance sampling.
Technical report 9805, University of Toronto, 1998.
[17] Fearnhead, P. Sequential Monte Carlo methods in filter theory. PhD thesis, University of Oxford, 1998.
[18] Felsenstein, J. Maximum-likelihood estimation of evolutionary trees from continuous characters. American Journal of Human Genetics, 25(5):471–492, 1973.
[19] Birkner, M., J. Blath, M. Steinrücken. Importance sampling for Lambda-coalescents in the infinitely many sites model. Theoretical Population Biology, 79(4):155–173, 2011.
[20] Doucet, A., N. de Freitas, N. Gordon, eds. Sequential Monte Carlo Methods in Practice. Springer, 2001.
[21] Gordon, N., D. Salmond, A. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F, Radar and Signal Processing, 140(2):107–113, 1993.
[22] Görür, D., L. Boyles, M. Welling. Scalable inference on Kingman's coalescent using pair similarity. JMLR, 22:440–448, 2012.
[23] Antoniak, C. E. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2(6):1152–1174, 1974.
[24] Cappé, O., S. Godsill, E. Moulines. An overview of existing methods and recent advances in sequential Monte Carlo. Proceedings of the IEEE, 95(5):899–924, 2007.
[25] Chen, Z. Bayesian filtering: From Kalman filters to particle filters, and beyond. McMaster University, [Online], 2003.
[26] Eads, D. Hierarchical clustering (scipy.cluster.hierarchy). SciPy, 2007.
[27] Neal, R. M. Slice sampling. The Annals of Statistics, 31:705–767, 2003.
[28] Powers, D. M. W. Unsupervised learning of linguistic structure: an empirical evaluation. International Journal of Corpus Linguistics, 2:91–131, 1997.
[29] Jongeneel, C., M. Delorenzi, C. Iseli, et al. An atlas of human gene expression from massively parallel signature sequencing (MPSS). Genome Research, 15:1007–1014, 2005.
[30] Shlens, J. A tutorial on principal component analysis. Systems Neurobiology Laboratory, Salk Institute for Biological Studies, 2005.
[31] Blei, D. M., A. Ng, M. Jordan. Latent Dirichlet allocation. JMLR, 3:993–1022, 2003.
[32] Rai, P., H. Daumé III. The infinite hierarchical factor regression model. In NIPS, 2008.
[33] Koo, T., X. Carreras, M. Collins. Simple semi-supervised dependency parsing. In ACL, 2008.
[34] Boyd-Graber, J., P. Resnik. Holistic sentiment analysis across languages: Multilingual supervised latent Dirichlet allocation. In EMNLP, 2010.
[35] Andrzejewski, D., X. Zhu, M. Craven. Incorporating domain knowledge into topic modeling via Dirichlet forest priors. In ICML, 2009.
same-paper 2 0.8658936 233 nips-2013-Online Robust PCA via Stochastic Optimization
Author: Jiashi Feng, Huan Xu, Shuicheng Yan
Abstract: Robust PCA methods are typically based on batch optimization and have to load all the samples into memory during optimization. This prevents them from efficiently processing big data. In this paper, we develop an Online Robust PCA (OR-PCA) that processes one sample per time instance and hence its memory cost is independent of the number of samples, significantly enhancing the computation and storage efficiency. The proposed OR-PCA is based on stochastic optimization of an equivalent reformulation of the batch RPCA. Indeed, we show that OR-PCA provides a sequence of subspace estimations converging to the optimum of its batch counterpart and hence is provably robust to sparse corruption. Moreover, OR-PCA can naturally be applied for tracking dynamic subspace. Comprehensive simulations on subspace recovering and tracking demonstrate the robustness and efficiency advantages of the OR-PCA over online PCA and batch RPCA methods. 1
3 0.86569154 279 nips-2013-Robust Bloom Filters for Large MultiLabel Classification Tasks
Author: Moustapha M. Cisse, Nicolas Usunier, Thierry Artières, Patrick Gallinari
Abstract: This paper presents an approach to multilabel classification (MLC) with a large number of labels. Our approach is a reduction to binary classification in which label sets are represented by low dimensional binary vectors. This representation follows the principle of Bloom filters, a space-efficient data structure originally designed for approximate membership testing. We show that a naive application of Bloom filters in MLC is not robust to individual binary classifiers’ errors. We then present an approach that exploits a specific feature of real-world datasets when the number of labels is large: many labels (almost) never appear together. Our approach is provably robust, has sublinear training and inference complexity with respect to the number of labels, and compares favorably to state-of-the-art algorithms on two large scale multilabel datasets. 1
4 0.83288014 232 nips-2013-Online PCA for Contaminated Data
Author: Jiashi Feng, Huan Xu, Shie Mannor, Shuicheng Yan
Abstract: We consider the online Principal Component Analysis (PCA) where contaminated samples (containing outliers) are revealed sequentially to the Principal Components (PCs) estimator. Due to their sensitiveness to outliers, previous online PCA algorithms fail in this case and their results can be arbitrarily skewed by the outliers. Here we propose the online robust PCA algorithm, which is able to improve the PCs estimation upon an initial one steadily, even when faced with a constant fraction of outliers. We show that the final result of the proposed online RPCA has an acceptable degradation from the optimum. Actually, under mild conditions, online RPCA achieves the maximal robustness with a 50% breakdown point. Moreover, online RPCA is shown to be efficient for both storage and computation, since it need not re-explore the previous samples as in traditional robust PCA algorithms. This endows online RPCA with scalability for large scale data. 1
5 0.81912142 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints
Author: Yuhong Guo
Abstract: Principal component analysis (PCA), a well-established technique for data analysis and processing, provides a convenient form of dimensionality reduction that is effective for cleaning small Gaussian noises presented in the data. However, the applicability of standard principal component analysis in real scenarios is limited by its sensitivity to large errors. In this paper, we tackle the challenge problem of recovering data corrupted with errors of high magnitude by developing a novel robust transfer principal component analysis method. Our method is based on the assumption that useful information for the recovery of a corrupted data matrix can be gained from an uncorrupted related data matrix. Specifically, we formulate the data recovery problem as a joint robust principal component analysis problem on the two data matrices, with common principal components shared across matrices and individual principal components specific to each data matrix. The formulated optimization problem is a minimization problem over a convex objective function but with non-convex rank constraints. We develop an efficient proximal projected gradient descent algorithm to solve the proposed optimization problem with convergence guarantees. Our empirical results over image denoising tasks show the proposed method can effectively recover images with random large errors, and significantly outperform both standard PCA and robust PCA with rank constraints. 1
6 0.81656575 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA
7 0.81424373 350 nips-2013-Wavelets on Graphs via Deep Learning
8 0.81200552 310 nips-2013-Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators.
9 0.81090826 288 nips-2013-Scalable Influence Estimation in Continuous-Time Diffusion Networks
10 0.81037325 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables
11 0.81023121 308 nips-2013-Spike train entropy-rate estimation using hierarchical Dirichlet process priors
12 0.81001246 207 nips-2013-Near-optimal Anomaly Detection in Graphs using Lovasz Extended Scan Statistic
13 0.80846143 142 nips-2013-Information-theoretic lower bounds for distributed statistical estimation with communication constraints
14 0.80672109 304 nips-2013-Sparse nonnegative deconvolution for compressive calcium imaging: algorithms and phase transitions
15 0.80651528 259 nips-2013-Provable Subspace Clustering: When LRR meets SSC
16 0.80534422 186 nips-2013-Matrix factorization with binary components
17 0.80464286 120 nips-2013-Faster Ridge Regression via the Subsampled Randomized Hadamard Transform
18 0.80416769 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits
19 0.80327553 44 nips-2013-B-test: A Non-parametric, Low Variance Kernel Two-sample Test
20 0.80291915 355 nips-2013-Which Space Partitioning Tree to Use for Search?