nips nips2013 nips2013-118 knowledge-graph by maker-knowledge-mining

118 nips-2013-Fast Determinantal Point Process Sampling with Application to Clustering


Source: pdf

Author: Byungkon Kang

Abstract: Determinantal Point Process (DPP) has gained much popularity for modeling sets of diverse items. The gist of DPP is that the probability of choosing a particular set of items is proportional to the determinant of a positive definite matrix that defines the similarity of those items. However, computing the determinant requires time cubic in the number of items, and is hence impractical for large sets. In this paper, we address this problem by constructing a rapidly mixing Markov chain, from which we can acquire a sample from the given DPP in sub-cubic time. In addition, we show that this framework can be extended to sampling from cardinality-constrained DPPs. As an application, we show how our sampling algorithm can be used to provide a fast heuristic for determining the number of clusters, resulting in better clustering.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 The gist of DPP is that the probability of choosing a particular set of items is proportional to the determinant of a positive definite matrix that defines the similarity of those items. [sent-4, score-0.208]

2 However, computing the determinant requires time cubic in the number of items, and is hence impractical for large sets. [sent-5, score-0.155]

3 In this paper, we address this problem by constructing a rapidly mixing Markov chain, from which we can acquire a sample from the given DPP in sub-cubic time. [sent-6, score-0.166]

4 In addition, we show that this framework can be extended to sampling from cardinality-constrained DPPs. [sent-7, score-0.058]

5 As an application, we show how our sampling algorithm can be used to provide a fast heuristic for determining the number of clusters, resulting in better clustering. [sent-8, score-0.079]

6 In a nutshell, given an itemset S = [n] = {1, 2, · · · , n} and a symmetric positive definite (SPD) matrix L ∈ R^{n×n} that describes pairwise similarities, a (discrete) DPP is a probability distribution over 2^S proportional to the determinant of the corresponding submatrix of L. [sent-11, score-0.23]
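
As a quick illustration of this definition, the unnormalized probability of a subset Y can be computed directly from the corresponding principal submatrix of L. The following sketch is not from the paper; the 3-item kernel is a made-up example, and it shows that a more diverse pair receives a higher score:

import numpy as np

def dpp_unnormalized_prob(L, Y):
    # Unnormalized DPP probability of subset Y: det(L_Y).
    if len(Y) == 0:
        return 1.0  # determinant of the empty submatrix is 1 by convention
    idx = np.asarray(sorted(Y))
    return np.linalg.det(L[np.ix_(idx, idx)])

# Hypothetical 3-item similarity matrix (symmetric positive definite).
L = np.array([[1.0, 0.6, 0.1],
              [0.6, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
print(dpp_unnormalized_prob(L, {0, 1}))  # 1 - 0.36 = 0.64 (similar pair)
print(dpp_unnormalized_prob(L, {0, 2}))  # 1 - 0.01 = 0.99 (diverse pair scores higher)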

7 Although the size of the support is exponential, DPP offers tractable inference and sampling algorithms. [sent-13, score-0.058]

8 However, sampling from a DPP requires O(n^3) time, as an eigen-decomposition of L is necessary [4]. [sent-14, score-0.058]

9 A motivating problem we consider is that of kernelized clustering [5]. [sent-18, score-0.14]

10 In this problem, we are given a large number of points plus a kernel function that serves as a dot product between the points in a feature space. [sent-19, score-0.141]

11 Naively using the cubic-complexity sampling algorithm is inefficient, since it can take up to a whole day to eigen-decompose a 10000 × 10000 matrix. [sent-23, score-0.079]

12 In this paper, we present a rapidly mixing Markov chain whose stationary distribution is the DPP of interest. [sent-24, score-0.302]

13 Our Markov chain does not require the eigen-decomposition of L, and is hence time-efficient. [sent-25, score-0.156]

14 Moreover, our algorithm works seamlessly even when new items are added to S (and L), while the previous sampling algorithm requires expensive decompositions whenever S is updated. [sent-26, score-0.16]

15 Much of the literature introduces DPPs in terms of a marginal kernel that describes marginal probabilities of inclusion. [sent-33, score-0.089]

16 2 Algorithm. The overall idea of our algorithm is to design a rapidly-mixing Markov chain whose stationary distribution is P_L. [sent-44, score-0.214]

17 The state space of our chain consists of the characteristic vectors of each subset of S. [sent-45, score-0.182]

18 This Markov chain is generated by a standard Metropolis-Hastings algorithm, where the transition probability from state v_Y to v_Z is given as the ratio of P_L(Z) to P_L(Y). [sent-46, score-0.213]

19 Hence, the transition probability of removing an element u from Y is of the following form: Pr(Y ∪ {u} → Y) = min{1, det(L_Y) / det(L_{Y∪{u}})}. [sent-48, score-0.06]

20 The overall chain is an insertion/deletion chain, where a uniformly proposed element is either added to, or subtracted from the current state. [sent-50, score-0.226]
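
To make the chain concrete, here is a minimal sketch of the insertion/deletion Metropolis-Hastings chain described above (our own illustrative code, not the paper's implementation). For clarity it recomputes det(L_Y) explicitly; the determinant-free version is developed below.

import numpy as np

def dpp_mh_sample(L, num_steps, seed=0):
    # Insertion/deletion chain: propose a uniform element u and toggle its
    # membership in Y, accepting with probability min(1, det(L_Z) / det(L_Y)).
    rng = np.random.default_rng(seed)
    n = L.shape[0]
    Y, det_Y = set(), 1.0  # start from the empty set; det of the empty submatrix is 1
    for _ in range(num_steps):
        u = int(rng.integers(n))
        Z = Y - {u} if u in Y else Y | {u}
        idx = sorted(Z)
        det_Z = np.linalg.det(L[np.ix_(idx, idx)]) if idx else 1.0
        if rng.random() < min(1.0, det_Z / det_Y):
            Y, det_Y = Z, det_Z
    return Y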

21 Note that this algorithm has a potentially high computational complexity, as the determinant of L_Y for a given Y ⊆ S must be computed at every iteration. [sent-52, score-0.152]

22 First, we state that the chain induced by Algorithm 1 does indeed represent our desired distribution. [sent-56, score-0.176]

23 The Markov chain in Algorithm 1 has a stationary distribution PL . [sent-58, score-0.193]

24 The computational complexity of sampling from P_L using Algorithm 1 depends on the mixing time of the Markov chain. [sent-59, score-0.159]

25 The mixing time is the number of steps required in the Markov chain to ensure that the current distribution is “close enough” to the stationary distribution. [sent-61, score-0.217]

26 In other words, we must spend at least this many time steps in order to acquire a sample from a distribution close to PL . [sent-63, score-0.077]

27 Our next result states that the chain in Algorithm 1 mixes rapidly: Theorem 1. [sent-64, score-0.156]

28 The Markov chain in Algorithm 1 has a mixing time τ(ε) = O(n log(n/ε)). [sent-65, score-0.236]

29 One advantage of having a rapidly-mixing Markov chain as a means of sampling from a DPP is that it is robust to the addition/deletion of elements. [sent-70, score-0.214]

30 That is, when a new element is introduced to or removed from S, we may simply continue the current chain until it is mixed again to obtain a sample from the new distribution. [sent-71, score-0.283]

31 The previous sampling algorithm, on the other hand, requires an expensive eigendecomposition of the updated L. [sent-72, score-0.079]

32 The mixing time of the chain in Algorithm 1 seems to offer a promising direction for sampling from PL . [sent-73, score-0.294]

33 Unfortunately, computing the determinant already costs O(|Y|^3) operations, rendering Algorithm 1 impractical for large Y. [sent-75, score-0.13]

34 In the following sections, we present a linear-algebraic manipulation of the determinant ratio so that explicit computation of the determinants is unnecessary. [sent-76, score-0.18]

35 Since the determinant is permutation-invariant with respect to the index set, we can represent L_{Y∪{u}} in the following block matrix form, due to its symmetry: L_{Y∪{u}} = [L_Y, b_u; b_u^⊤, c_u], where b_u = (L(i, u))_{i∈Y} ∈ R^{|Y|} and c_u = L(u, u). [sent-80, score-0.708]

36 With this, the determinant of L_{Y∪{u}} is expressed as det(L_{Y∪{u}}) = det(L_Y)(c_u − b_u^⊤ L_Y^{−1} b_u). (1) [sent-81, score-0.397]

37 This allows us to reformulate the insertion transition probability as a determinant-free ratio. [sent-82, score-0.074]

38 p_u^+(Y) = min{1, det(L_{Y∪{u}}) / det(L_Y)} = min{1, c_u − b_u^⊤ L_Y^{−1} b_u}. (2) [sent-83, score-0.286]

39 The deletion transition probability p_u^−(Y ∪ {u}) is computed likewise: p_u^−(Y ∪ {u}) = min{1, det(L_Y) / det(L_{Y∪{u}})} = min{1, (c_u − b_u^⊤ L_Y^{−1} b_u)^{−1}}. [sent-84, score-0.22]
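
In code, both acceptance probabilities reduce to the scalar Schur complement d_u = c_u − b_u^⊤ L_Y^{−1} b_u, so no determinant is ever formed. A hedged sketch, assuming the inverse L_Y^{−1} for the current set Y (in sorted order) is maintained alongside Y; the function names are ours:

import numpy as np

def schur_complement(L, L_Y_inv, Y, u):
    # d_u = c_u - b_u^T L_Y^{-1} b_u for a candidate element u outside Y.
    idx = sorted(Y)
    b_u = L[idx, u]              # similarities between u and the current set Y
    c_u = L[u, u]
    return (c_u - b_u @ L_Y_inv @ b_u) if idx else c_u

def insertion_prob(d_u):
    return min(1.0, d_u)         # p_u^+ in Equation (2)

def deletion_prob(d_u):
    return min(1.0, 1.0 / d_u)   # p_u^-: reciprocal of the same quantity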

40 u Y However, this transformation alone does not seem to result in enhanced computation time, as computing the inverse of a matrix is just as time-consuming as computing the determinant. [sent-85, score-0.058]

41 To save time on computing L_{Y'}^{−1}, we incrementally update the inverse through blockwise matrix inversion. [sent-86, score-0.058]

42 The new inverse L_{Y∪{u}}^{−1} must be updated from the current L_Y^{−1}. [sent-89, score-0.098]

43 This is achieved by the following block-inversion formula [6]: L_{Y∪{u}}^{−1} = [L_Y, b_u; b_u^⊤, c_u]^{−1} = [L_Y^{−1} + L_Y^{−1} b_u b_u^⊤ L_Y^{−1}/d_u, −L_Y^{−1} b_u/d_u; −b_u^⊤ L_Y^{−1}/d_u, 1/d_u], (3) where d_u = c_u − b_u^⊤ L_Y^{−1} b_u is the Schur complement of L_Y. [sent-90, score-0.928]
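
A sketch of the corresponding O(|Y|^2) update (our code, appending the new element as the last row/column of the maintained inverse):

import numpy as np

def grow_inverse(L_Y_inv, b_u, c_u):
    # Apply Equation (3): build L_{Y ∪ {u}}^{-1} from L_Y^{-1}, b_u and c_u.
    v = L_Y_inv @ b_u                      # L_Y^{-1} b_u
    d_u = c_u - b_u @ v                    # Schur complement of L_Y
    top_left = L_Y_inv + np.outer(v, v) / d_u
    top_right = (-v / d_u)[:, None]
    bottom = np.append(-v / d_u, 1.0 / d_u)[None, :]
    return np.block([[top_left, top_right], [bottom]])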

44 Since L_Y^{−1} is already given, computing each block of the new inverse matrix costs O(|Y|^2), which is an order faster than the O(|Y|^3) complexity required by the determinant. [sent-91, score-0.079]

45 Next, consider the case when an element u is removed (‘else’ clause) from the current set Y . [sent-93, score-0.072]

46 We first represent L_Y in block form as L_Y = [L_{Y\{u}}, b_u; b_u^⊤, c_u] and its current inverse as L_Y^{−1} = [D, e; e^⊤, f], where D ∈ R^{(|Y|−1)×(|Y|−1)}, e ∈ R^{|Y|−1}, and f ∈ R. [sent-95, score-0.343]
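
Equation 4 is not reproduced cleanly in this summary, but with the partition above the standard downdate is L_{Y\{u}}^{−1} = D − e e^⊤ / f, which again costs O(|Y|^2). A sketch under that assumption, where `pos` is the position of u in the current ordering of Y:

import numpy as np

def shrink_inverse(L_Y_inv, pos):
    # Remove row/column `pos` from the maintained inverse: D - e e^T / f.
    keep = [i for i in range(L_Y_inv.shape[0]) if i != pos]
    D = L_Y_inv[np.ix_(keep, keep)]
    e = L_Y_inv[keep, pos]
    f = L_Y_inv[pos, pos]
    return D - np.outer(e, e) / f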

47 This complexity can be reduced by restricting the size of the initial Y . [sent-99, score-0.056]

48 As we proceed with the chain, we update L_Y^{−1} using Equations 3 and 4 when an insertion or a deletion proposal is accepted, respectively. [sent-101, score-0.071]

49 Therefore, the average complexity of acquiring a sample from a distribution that is ε-close to P_L is O(T^2 n log(n/ε)), where T is the average size of Y encountered during the progress of the chain. [sent-102, score-0.081]

50 2 Extension to k-DPPs. The idea of constructing a Markov chain to obtain a sample can be extended to k-DPPs. [sent-105, score-0.19]

51 The only known algorithm so far for sampling from a k-DPP also requires the eigen-decomposition of L. [sent-106, score-0.079]

52 Extending the previous idea, we provide a Markov chain sampling algorithm, similar to Algorithm 1, that samples from P_L^k. [sent-107, score-0.235]

53 The main idea behind the k-DPP chain is to propose a new configuration by choosing two elements: one to remove from the current set, and another to add. [sent-108, score-0.18]

54 We accept this move according to the probability defined by the ratio of the proposed determinant to the current determinant. [sent-109, score-0.158]

55 For X = Y ∪ {u}, we have L_{X=Y∪{u}} = [L_Y, b_u; b_u^⊤, c_u] ⇒ L_{X'=Y∪{v}} = [L_Y, b_v; b_v^⊤, c_v], where u and v are the elements being removed and added, respectively. [sent-113, score-0.391]

56 Following Equation 2, we set the transition probability as the ratio of the determinants of the two matrices. [sent-114, score-0.103]

57 det(L_{X'}) / det(L_X) = (c_v − b_v^⊤ L_Y^{−1} b_v) / (c_u − b_u^⊤ L_Y^{−1} b_u). The final procedure is given in Algorithm 2. [sent-116, score-0.286]

58 Similarly to Algorithm 1, we present the analysis on the stationary distribution and the mixing time of Algorithm 2. [sent-117, score-0.117]

59 The Markov chain in Algorithm 2 has a stationary distribution PL . [sent-119, score-0.193]

60 The Markov chain in Algorithm 2 has a mixing time τ(ε) = O(k log(k/ε)). [sent-121, score-0.236]

61 Algorithm 2 (Markov chain for sampling from P_L^k). Require: itemset S = [n], similarity matrix L ≻ 0. Randomly initialize state X ⊆ S s.t. |X| = k. [sent-122, score-0.271]

62 Letting Y = X \ {u}, set p ← min{1, (c_v − b_v^⊤ L_Y^{−1} b_v) / (c_u − b_u^⊤ L_Y^{−1} b_u)}. [sent-127, score-0.369]
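
In code, the swap acceptance probability only needs the two Schur complements with respect to the shared set Y = X \ {u}; a hedged sketch with our own names, reusing `schur_complement` from the earlier sketch:

def kdpp_swap_prob(L, L_Y_inv, Y, u, v):
    # Acceptance probability for swapping u (inside X) with v (outside X),
    # where Y = X \ {u}: min(1, d_v / d_u), with d_w = c_w - b_w^T L_Y^{-1} b_w.
    d_u = schur_complement(L, L_Y_inv, Y, u)
    d_v = schur_complement(L, L_Y_inv, Y, v)
    return min(1.0, d_v / d_u)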

63 This leads to a final sampling complexity of O(k^3 log(k/ε)) for acquiring a single sample from the chain, which dominates the O(k^3) cost of computing the initial inverse. [sent-132, score-0.139]

64 3 Application to Clustering. Finally, we show how our algorithms lead to an efficient heuristic for a k-means clustering problem when the number of clusters is not known. [sent-134, score-0.188]

65 Given a set of points P = {x_i ∈ R^d}_{i=1}^n, the goal of clustering is to construct a partition C = {C_1, · · · , C_k | C_i ⊆ P} of P such that the distortion D_C = Σ_{i=1}^k Σ_{x∈C_i} ||x − m_i||_2^2 (6) is minimized, where m_i is the centroid of cluster C_i. [sent-136, score-0.496]

66 It is known that the optimal centroid is the mean of the points of Ci . [sent-137, score-0.08]
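
For concreteness, the distortion in Equation 6 can be computed as follows (a straightforward sketch, not the paper's code; `points` and `labels` are assumed to be NumPy arrays):

import numpy as np

def distortion(points, labels, k):
    # D_C = sum over clusters of squared distances to the cluster mean (Eq. 6).
    total = 0.0
    for i in range(k):
        cluster = points[labels == i]
        if len(cluster) > 0:
            centroid = cluster.mean(axis=0)            # optimal centroid is the mean
            total += ((cluster - centroid) ** 2).sum()
    return total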

67 However, determining the number of clusters k is the factor that makes this problem NP-hard [7]. [sent-142, score-0.074]

68 In this work, we propose to sample the initial centroids of the clustering via our DPP sampling algorithms. [sent-145, score-0.293]

69 The similarity matrix L will be the Gram matrix determined by L(i, j) = κ(x_i, x_j), where κ(·, ·) is simply the inner product for regular k-means, and a specified kernel function for kernel k-means. [sent-146, score-0.207]

70 Since DPPs naturally capture the notion of diversity, the sampled points will tend to be more diverse, and thus serve better as initial representatives for each cluster. [sent-147, score-0.107]

71 We use the proposed algorithms to sample the representative points that have high probability under PL , and cluster the rest of the points around the sample. [sent-151, score-0.132]
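
A hedged sketch of this initialization (our own code, not the paper's): build the Gram matrix L from the kernel, draw a diverse subset with the chain from Algorithm 1 (e.g. the `dpp_mh_sample` sketch above), and seed k-means with those points; the size of the sample determines k.

import numpy as np

def gram_matrix(X, kernel=lambda a, b: a @ b):
    # L(i, j) = kappa(x_i, x_j); the default kernel is the plain inner product.
    n = X.shape[0]
    L = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            L[i, j] = kernel(X[i], X[j])
    return L

def dpp_init_centroids(X, num_steps=10000):
    L = gram_matrix(X)
    reps = sorted(dpp_mh_sample(L, num_steps))   # diverse representatives
    return X[reps], len(reps)                    # initial centroids and inferred k

The remaining points would then be assigned to their nearest representative and (kernel) k-means run from that initialization.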

72 To reduce this possible negative influence, we adopt a two-step sampling strategy: First, we gather C samples from Algorithm 1 and construct a histogram H over the sizes of the samples. [sent-155, score-0.084]

73 This last sample provides the representatives around which we cluster the remaining points. [sent-157, score-0.07]

74 This does not increase the order of the mixing time of the induced chain, since the transition probabilities are only multiplied by a constant factor of exp(±λ). [sent-162, score-0.134]

75 For example, we examine the BIC of the initial clustering with respect to the centroids sampled from O(√n) randomly chosen elements P' ⊂ P, with λ = 0. [sent-166, score-0.201]

76 Because we only use O(√n) points to sample from the DPP, each search step has at most linear complexity, allowing for ample time to choose better λ's. [sent-169, score-0.07]

77 This procedure may not appear to have an advantage over standard BIC-based model selection for choosing the number of clusters k. [sent-170, score-0.074]

78 4 Experiments. In this section, we empirically demonstrate how our proposed method of choosing an initial clustering, denoted DPP-MC, compares to other methods in terms of distortion and running time. [sent-172, score-0.365]

79 The methods we compare against include: • DPP-Full: Sample using the full DPP sampling procedure given in [4]. [sent-173, score-0.058]

80 • DPP-MAP: Sample the initial centroids according to the MAP configuration, using the algorithm of [11]. [sent-174, score-0.108]

81 • KKM: Plain kernel k-means clustering given by [5], run on the “true” number of clusters. [sent-175, score-0.165]

82 The real-world data sets we use are the letter recognition data set [12] (LET) and a subset of the power consumption data set [13] (PWC). The LET set is represented as 10,000 points in R^16, and the PWC set as 10,000 points in R^7. [sent-180, score-0.133]

83 In addition, we also tested our algorithm on an artificially-generated set consisting of 15,000 points in R^10 from five mixtures of Gaussians (MG). [sent-184, score-0.057]

84 For the MG set, we present the result of DBSCAN [9]: another clustering algorithm that does not require k beforehand. [sent-187, score-0.135]

85 We used mixing steps with ε = 0.01 for the first and second phases, respectively, and C = R = 10. [sent-192, score-0.08]

86 The running time of our algorithm includes the time taken to heuristically search for λ using the following BIC [14]: BIC_k = Σ_{x∈P} log Pr(x | {m_i}_{i=1}^k, σ) − (kd/2) log n, where σ is the average of each cluster's distortion, and d is the dimension of the data set. [sent-193, score-0.067]
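
A hedged sketch of this score, assuming an isotropic Gaussian likelihood around the nearest centroid with variance σ (one common reading of [14]; not necessarily the paper's exact implementation):

import numpy as np

def bic_score(points, centroids, sigma):
    # BIC_k = sum_x log Pr(x | {m_i}, sigma) - (k d / 2) log n.
    n, d = points.shape
    k = centroids.shape[0]
    # squared distance from each point to its nearest centroid
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(-1).min(axis=1)
    log_lik = (-0.5 * d2 / sigma - 0.5 * d * np.log(2 * np.pi * sigma)).sum()
    return log_lik - 0.5 * k * d * np.log(n)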

87 1 Real-World Data Sets. The plots of the distortion and time for the LET set over the clustering iterations are given in Figure 1. [sent-196, score-0.328]

88 Recall that KKM was run with the true number of clusters as its input, so one may expect it to perform relatively better, in terms of distortion and running time, than the other algorithms, which must compute k. [sent-197, score-0.31]

89 Figure 1: Distortion (left) and cumulated runtime (right) of the clustering induced by the competing algorithms on the LET set. [sent-205, score-0.134]

90 As in the previous case, DPP-MC exhibits the lowest distortion with the fastest running time. [sent-211, score-0.216]

91 Figure 2: Distortion (left) and time (right) of the clustering induced by the competing algorithms on the PWC set. [sent-219, score-0.134]

92 2 Artificial Data Set. Finally, we present results on clustering the artificial MG set. [sent-221, score-0.114]

93 On this set, we compare our algorithm with DBSCAN, another clustering algorithm that does not require k a priori. [sent-222, score-0.156]

94 On the other hand, DPP-MC managed to find many distinct clusters in a way that lowers the distortion. [sent-229, score-0.27]

95 5 Discussion and Future Works. We have proposed a fast method for sampling from an ε-close DPP distribution and its application to kernelized clustering. [sent-230, score-0.084]

96 Although the exact computational complexity of sampling, O(T^2 n log(n/ε)), is not explicitly superior to that of the previous approach, O(n^3), we empirically show that T is generally small enough to account for our algorithm's efficiency. [sent-231, score-0.079]

97 Furthermore, the extension to k-DPP sampling yields a substantial speed-up compared to the previous sampling algorithm. [sent-232, score-0.116]

98 However, one must keep in mind that the mixing time analysis is for a single sample only. [sent-233, score-0.134]

99 For a larger number of samples, we must further investigate the effect of sample correlation after mixing in order to prove long-term efficiency. [sent-237, score-0.134]

100 A density-based algorithm for discovering clusters in large spatial databases with noise. [sent-294, score-0.095]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ly', 0.57), ('dpp', 0.446), ('det', 0.272), ('pl', 0.233), ('distortion', 0.196), ('kkm', 0.16), ('chain', 0.156), ('bu', 0.155), ('pwc', 0.141), ('cu', 0.131), ('clustering', 0.114), ('determinantal', 0.112), ('determinant', 0.111), ('dpps', 0.082), ('dbscan', 0.081), ('mixing', 0.08), ('lx', 0.077), ('vy', 0.077), ('clusters', 0.074), ('bic', 0.065), ('mg', 0.062), ('sampling', 0.058), ('markov', 0.057), ('kulesza', 0.054), ('itemset', 0.053), ('mc', 0.052), ('centroids', 0.052), ('kernel', 0.051), ('determinants', 0.046), ('bv', 0.046), ('ci', 0.046), ('centroid', 0.044), ('insertion', 0.04), ('items', 0.04), ('consumption', 0.039), ('stationary', 0.037), ('cv', 0.037), ('points', 0.036), ('representatives', 0.036), ('initial', 0.035), ('transition', 0.034), ('sample', 0.034), ('inverse', 0.033), ('clause', 0.033), ('voronoi', 0.033), ('similarity', 0.032), ('deletion', 0.031), ('diversity', 0.031), ('mi', 0.03), ('household', 0.029), ('rapidly', 0.029), ('cluster', 0.026), ('gather', 0.026), ('acquiring', 0.026), ('heuristically', 0.026), ('kernelized', 0.026), ('element', 0.026), ('characteristic', 0.026), ('plain', 0.025), ('matrix', 0.025), ('cubic', 0.025), ('electric', 0.025), ('current', 0.024), ('ratio', 0.023), ('encounter', 0.023), ('acquire', 0.023), ('mj', 0.023), ('guration', 0.023), ('du', 0.023), ('outlier', 0.023), ('map', 0.023), ('regular', 0.023), ('initializations', 0.022), ('letter', 0.022), ('removed', 0.022), ('updated', 0.021), ('mixed', 0.021), ('algorithm', 0.021), ('spectral', 0.021), ('submatrix', 0.021), ('complexity', 0.021), ('running', 0.02), ('induced', 0.02), ('partition', 0.02), ('added', 0.02), ('must', 0.02), ('describes', 0.02), ('impractical', 0.019), ('iterations', 0.018), ('si', 0.018), ('dot', 0.018), ('gillenwater', 0.018), ('plj', 0.018), ('fraley', 0.018), ('friend', 0.018), ('goutte', 0.018), ('iny', 0.018), ('literatures', 0.018), ('smoothes', 0.018), ('spd', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 118 nips-2013-Fast Determinantal Point Process Sampling with Application to Clustering

Author: Byungkon Kang

Abstract: Determinantal Point Process (DPP) has gained much popularity for modeling sets of diverse items. The gist of DPP is that the probability of choosing a particular set of items is proportional to the determinant of a positive definite matrix that defines the similarity of those items. However, computing the determinant requires time cubic in the number of items, and is hence impractical for large sets. In this paper, we address this problem by constructing a rapidly mixing Markov chain, from which we can acquire a sample from the given DPP in sub-cubic time. In addition, we show that this framework can be extended to sampling from cardinality-constrained DPPs. As an application, we show how our sampling algorithm can be used to provide a fast heuristic for determining the number of clusters, resulting in better clustering.

2 0.39629638 40 nips-2013-Approximate Inference in Continuous Determinantal Processes

Author: Raja Hafiz Affandi, Emily Fox, Ben Taskar

Abstract: Determinantal point processes (DPPs) are random point processes well-suited for modeling repulsion. In machine learning, the focus of DPP-based models has been on diverse subset selection from a discrete and finite base set. This discrete setting admits an efficient sampling algorithm based on the eigendecomposition of the defining kernel matrix. Recently, there has been growing interest in using DPPs defined on continuous spaces. While the discrete-DPP sampler extends formally to the continuous case, computationally, the steps required are not tractable in general. In this paper, we present two efficient DPP sampling schemes that apply to a wide range of kernel functions: one based on low rank approximations via Nyström and random Fourier feature techniques and another based on Gibbs sampling. We demonstrate the utility of continuous DPPs in repulsive mixture modeling and synthesizing human poses spanning activity spaces. 1

3 0.32115293 147 nips-2013-Lasso Screening Rules via Dual Polytope Projection

Author: Jie Wang, Jiayu Zhou, Peter Wonka, Jieping Ye

Abstract: Lasso is a widely used regression technique to find sparse representations. When the dimension of the feature space and the number of samples are extremely large, solving the Lasso problem remains challenging. To improve the efficiency of solving large-scale Lasso problems, El Ghaoui and his colleagues have proposed the SAFE rules which are able to quickly identify the inactive predictors, i.e., predictors that have 0 components in the solution vector. Then, the inactive predictors or features can be removed from the optimization problem to reduce its scale. By transforming the standard Lasso to its dual form, it can be shown that the inactive predictors include the set of inactive constraints on the optimal dual solution. In this paper, we propose an efficient and effective screening rule via Dual Polytope Projections (DPP), which is mainly based on the uniqueness and nonexpansiveness of the optimal dual solution due to the fact that the feasible set in the dual space is a convex and closed polytope. Moreover, we show that our screening rule can be extended to identify inactive groups in group Lasso. To the best of our knowledge, there is currently no “exact” screening rule for group Lasso. We have evaluated our screening rule using many real data sets. Results show that our rule is more effective in identifying inactive predictors than existing state-of-the-art screening rules for Lasso. 1

4 0.14833125 6 nips-2013-A Determinantal Point Process Latent Variable Model for Inhibition in Neural Spiking Data

Author: Jasper Snoek, Richard Zemel, Ryan P. Adams

Abstract: Point processes are popular models of neural spiking behavior as they provide a statistical distribution over temporal sequences of spikes and help to reveal the complexities underlying a series of recorded action potentials. However, the most common neural point process models, the Poisson process and the gamma renewal process, do not capture interactions and correlations that are critical to modeling populations of neurons. We develop a novel model based on a determinantal point process over latent embeddings of neurons that effectively captures and helps visualize complex inhibitory and competitive interaction. We show that this model is a natural extension of the popular generalized linear model to sets of interacting neurons. The model is extended to incorporate gain control or divisive normalization, and the modulation of neural spiking based on periodic phenomena. Applied to neural spike recordings from the rat hippocampus, we see that the model captures inhibitory relationships, a dichotomy of classes of neurons, and a periodic modulation by the theta rhythm known to be present in the data. 1

5 0.08238057 66 nips-2013-Computing the Stationary Distribution Locally

Author: Christina E. Lee, Asuman Ozdaglar, Devavrat Shah

Abstract: Computing the stationary distribution of a large finite or countably infinite state space Markov Chain (MC) has become central in many problems such as statistical inference and network analysis. Standard methods involve large matrix multiplications as in power iteration, or simulations of long random walks, as in Markov Chain Monte Carlo (MCMC). Power iteration is costly, as it involves computation at every state. For MCMC, it is difficult to determine whether the random walks are long enough to guarantee convergence. In this paper, we provide a novel algorithm that answers whether a chosen state in a MC has stationary probability larger than some ∆ ∈ (0, 1), and outputs an estimate of the stationary probability. Our algorithm is constant time, using information from a local neighborhood of the state on the graph induced by the MC, which has constant size relative to the state space. The multiplicative error of the estimate is upper bounded by a function of the mixing properties of the MC. Simulation results show MCs for which this method gives tight estimates. 1

6 0.074998423 59 nips-2013-Blind Calibration in Compressed Sensing using Message Passing Algorithms

7 0.069300719 158 nips-2013-Learning Multiple Models via Regularized Weighting

8 0.069071904 299 nips-2013-Solving inverse problem of Markov chain with partial observations

9 0.064002566 291 nips-2013-Sensor Selection in High-Dimensional Gaussian Trees with Nuisances

10 0.063926198 131 nips-2013-Geometric optimisation on positive definite matrices for elliptically contoured distributions

11 0.05796035 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits

12 0.05603274 155 nips-2013-Learning Hidden Markov Models from Non-sequence Data via Tensor Decomposition

13 0.055808567 127 nips-2013-Generalized Denoising Auto-Encoders as Generative Models

14 0.055364665 258 nips-2013-Projecting Ising Model Parameters for Fast Mixing

15 0.053941 108 nips-2013-Error-Minimizing Estimates and Universal Entry-Wise Error Bounds for Low-Rank Matrix Completion

16 0.052372027 180 nips-2013-Low-rank matrix reconstruction and clustering via approximate message passing

17 0.049090862 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture

18 0.045644198 46 nips-2013-Bayesian Estimation of Latently-grouped Parameters in Undirected Graphical Models

19 0.045444127 218 nips-2013-On Sampling from the Gibbs Distribution with Random Maximum A-Posteriori Perturbations

20 0.044750303 63 nips-2013-Cluster Trees on Manifolds


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.13), (1, 0.051), (2, -0.004), (3, 0.016), (4, -0.03), (5, 0.1), (6, 0.076), (7, -0.031), (8, -0.034), (9, 0.003), (10, -0.02), (11, -0.045), (12, -0.023), (13, -0.163), (14, 0.188), (15, 0.087), (16, -0.053), (17, -0.27), (18, 0.453), (19, 0.097), (20, 0.079), (21, -0.194), (22, -0.039), (23, -0.092), (24, 0.03), (25, -0.011), (26, 0.023), (27, 0.016), (28, -0.033), (29, -0.044), (30, 0.001), (31, 0.041), (32, -0.062), (33, 0.059), (34, 0.043), (35, 0.009), (36, 0.015), (37, 0.022), (38, -0.017), (39, 0.066), (40, 0.032), (41, 0.006), (42, -0.047), (43, -0.027), (44, 0.01), (45, 0.003), (46, -0.029), (47, -0.018), (48, -0.047), (49, 0.05)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.90143371 118 nips-2013-Fast Determinantal Point Process Sampling with Application to Clustering

Author: Byungkon Kang

Abstract: Determinantal Point Process (DPP) has gained much popularity for modeling sets of diverse items. The gist of DPP is that the probability of choosing a particular set of items is proportional to the determinant of a positive definite matrix that defines the similarity of those items. However, computing the determinant requires time cubic in the number of items, and is hence impractical for large sets. In this paper, we address this problem by constructing a rapidly mixing Markov chain, from which we can acquire a sample from the given DPP in sub-cubic time. In addition, we show that this framework can be extended to sampling from cardinality-constrained DPPs. As an application, we show how our sampling algorithm can be used to provide a fast heuristic for determining the number of clusters, resulting in better clustering.

2 0.81357712 40 nips-2013-Approximate Inference in Continuous Determinantal Processes

Author: Raja Hafiz Affandi, Emily Fox, Ben Taskar

Abstract: Determinantal point processes (DPPs) are random point processes well-suited for modeling repulsion. In machine learning, the focus of DPP-based models has been on diverse subset selection from a discrete and finite base set. This discrete setting admits an efficient sampling algorithm based on the eigendecomposition of the defining kernel matrix. Recently, there has been growing interest in using DPPs defined on continuous spaces. While the discrete-DPP sampler extends formally to the continuous case, computationally, the steps required are not tractable in general. In this paper, we present two efficient DPP sampling schemes that apply to a wide range of kernel functions: one based on low rank approximations via Nystr¨ m o and random Fourier feature techniques and another based on Gibbs sampling. We demonstrate the utility of continuous DPPs in repulsive mixture modeling and synthesizing human poses spanning activity spaces. 1

3 0.78545272 147 nips-2013-Lasso Screening Rules via Dual Polytope Projection

Author: Jie Wang, Jiayu Zhou, Peter Wonka, Jieping Ye

Abstract: Lasso is a widely used regression technique to find sparse representations. When the dimension of the feature space and the number of samples are extremely large, solving the Lasso problem remains challenging. To improve the efficiency of solving large-scale Lasso problems, El Ghaoui and his colleagues have proposed the SAFE rules which are able to quickly identify the inactive predictors, i.e., predictors that have 0 components in the solution vector. Then, the inactive predictors or features can be removed from the optimization problem to reduce its scale. By transforming the standard Lasso to its dual form, it can be shown that the inactive predictors include the set of inactive constraints on the optimal dual solution. In this paper, we propose an efficient and effective screening rule via Dual Polytope Projections (DPP), which is mainly based on the uniqueness and nonexpansiveness of the optimal dual solution due to the fact that the feasible set in the dual space is a convex and closed polytope. Moreover, we show that our screening rule can be extended to identify inactive groups in group Lasso. To the best of our knowledge, there is currently no “exact” screening rule for group Lasso. We have evaluated our screening rule using many real data sets. Results show that our rule is more effective in identifying inactive predictors than existing state-of-the-art screening rules for Lasso. 1

4 0.3401354 6 nips-2013-A Determinantal Point Process Latent Variable Model for Inhibition in Neural Spiking Data

Author: Jasper Snoek, Richard Zemel, Ryan P. Adams

Abstract: Point processes are popular models of neural spiking behavior as they provide a statistical distribution over temporal sequences of spikes and help to reveal the complexities underlying a series of recorded action potentials. However, the most common neural point process models, the Poisson process and the gamma renewal process, do not capture interactions and correlations that are critical to modeling populations of neurons. We develop a novel model based on a determinantal point process over latent embeddings of neurons that effectively captures and helps visualize complex inhibitory and competitive interaction. We show that this model is a natural extension of the popular generalized linear model to sets of interacting neurons. The model is extended to incorporate gain control or divisive normalization, and the modulation of neural spiking based on periodic phenomena. Applied to neural spike recordings from the rat hippocampus, we see that the model captures inhibitory relationships, a dichotomy of classes of neurons, and a periodic modulation by the theta rhythm known to be present in the data. 1

5 0.27844286 299 nips-2013-Solving inverse problem of Markov chain with partial observations

Author: Tetsuro Morimura, Takayuki Osogami, Tsuyoshi Ide

Abstract: The Markov chain is a convenient tool to represent the dynamics of complex systems such as traffic and social systems, where probabilistic transition takes place between internal states. A Markov chain is characterized by initial-state probabilities and a state-transition probability matrix. In the traditional setting, a major goal is to study properties of a Markov chain when those probabilities are known. This paper tackles an inverse version of the problem: we find those probabilities from partial observations at a limited number of states. The observations include the frequency of visiting a state and the rate of reaching a state from another. Practical examples of this task include traffic monitoring systems in cities, where we need to infer the traffic volume on single link on a road network from a limited number of observation points. We formulate this task as a regularized optimization problem, which is efficiently solved using the notion of natural gradient. Using synthetic and real-world data sets including city traffic monitoring data, we demonstrate the effectiveness of our method.

6 0.27000543 344 nips-2013-Using multiple samples to learn mixture models

7 0.26806152 128 nips-2013-Generalized Method-of-Moments for Rank Aggregation

8 0.2674095 76 nips-2013-Correlated random features for fast semi-supervised learning

9 0.26026639 261 nips-2013-Rapid Distance-Based Outlier Detection via Sampling

10 0.25144854 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning

11 0.24890584 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits

12 0.23284978 357 nips-2013-k-Prototype Learning for 3D Rigid Structures

13 0.22614314 66 nips-2013-Computing the Stationary Distribution Locally

14 0.21411014 107 nips-2013-Embed and Project: Discrete Sampling with Universal Hashing

15 0.21326582 296 nips-2013-Sinkhorn Distances: Lightspeed Computation of Optimal Transport

16 0.21204504 252 nips-2013-Predictive PAC Learning and Process Decompositions

17 0.21131514 131 nips-2013-Geometric optimisation on positive definite matrices for elliptically contoured distributions

18 0.21034786 204 nips-2013-Multiscale Dictionary Learning for Estimating Conditional Distributions

19 0.20921525 180 nips-2013-Low-rank matrix reconstruction and clustering via approximate message passing

20 0.20915303 127 nips-2013-Generalized Denoising Auto-Encoders as Generative Models


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.019), (16, 0.04), (33, 0.141), (34, 0.091), (39, 0.239), (41, 0.05), (49, 0.086), (56, 0.088), (70, 0.026), (85, 0.054), (89, 0.013), (93, 0.035), (95, 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.8013916 12 nips-2013-A Novel Two-Step Method for Cross Language Representation Learning

Author: Min Xiao, Yuhong Guo

Abstract: Cross language text classification is an important learning task in natural language processing. A critical challenge of cross language learning arises from the fact that words of different languages are in disjoint feature spaces. In this paper, we propose a two-step representation learning method to bridge the feature spaces of different languages by exploiting a set of parallel bilingual documents. Specifically, we first formulate a matrix completion problem to produce a complete parallel document-term matrix for all documents in two languages, and then induce a low dimensional cross-lingual document representation by applying latent semantic indexing on the obtained matrix. We use a projected gradient descent algorithm to solve the formulated matrix completion problem with convergence guarantees. The proposed method is evaluated by conducting a set of experiments with cross language sentiment classification tasks on Amazon product reviews. The experimental results demonstrate that the proposed learning method outperforms a number of other cross language representation learning methods, especially when the number of parallel bilingual documents is small. 1

2 0.77961665 76 nips-2013-Correlated random features for fast semi-supervised learning

Author: Brian McWilliams, David Balduzzi, Joachim Buhmann

Abstract: This paper presents Correlated Nyström Views (XNV), a fast semi-supervised algorithm for regression and classification. The algorithm draws on two main ideas. First, it generates two views consisting of computationally inexpensive random features. Second, multiview regression, using Canonical Correlation Analysis (CCA) on unlabeled data, biases the regression towards useful features. It has been shown that CCA regression can substantially reduce variance with a minimal increase in bias if the views contains accurate estimators. Recent theoretical and empirical work shows that regression with random features closely approximates kernel regression, implying that the accuracy requirement holds for random views. We show that XNV consistently outperforms a state-of-the-art algorithm for semi-supervised learning: substantially improving predictive performance and reducing the variability of performance on a wide variety of real-world datasets, whilst also reducing runtime by orders of magnitude. 1

same-paper 3 0.77558255 118 nips-2013-Fast Determinantal Point Process Sampling with Application to Clustering

Author: Byungkon Kang

Abstract: Determinantal Point Process (DPP) has gained much popularity for modeling sets of diverse items. The gist of DPP is that the probability of choosing a particular set of items is proportional to the determinant of a positive definite matrix that defines the similarity of those items. However, computing the determinant requires time cubic in the number of items, and is hence impractical for large sets. In this paper, we address this problem by constructing a rapidly mixing Markov chain, from which we can acquire a sample from the given DPP in sub-cubic time. In addition, we show that this framework can be extended to sampling from cardinality-constrained DPPs. As an application, we show how our sampling algorithm can be used to provide a fast heuristic for determining the number of clusters, resulting in better clustering.

4 0.73338729 341 nips-2013-Universal models for binary spike patterns using centered Dirichlet processes

Author: Il M. Park, Evan W. Archer, Kenneth Latimer, Jonathan W. Pillow

Abstract: Probabilistic models for binary spike patterns provide a powerful tool for understanding the statistical dependencies in large-scale neural recordings. Maximum entropy (or “maxent”) models, which seek to explain dependencies in terms of low-order interactions between neurons, have enjoyed remarkable success in modeling such patterns, particularly for small groups of neurons. However, these models are computationally intractable for large populations, and low-order maxent models have been shown to be inadequate for some datasets. To overcome these limitations, we propose a family of “universal” models for binary spike patterns, where universality refers to the ability to model arbitrary distributions over all 2m binary patterns. We construct universal models using a Dirichlet process centered on a well-behaved parametric base measure, which naturally combines the flexibility of a histogram and the parsimony of a parametric model. We derive computationally efficient inference methods using Bernoulli and cascaded logistic base measures, which scale tractably to large populations. We also establish a condition for equivalence between the cascaded logistic and the 2nd-order maxent or “Ising” model, making cascaded logistic a reasonable choice for base measure in a universal model. We illustrate the performance of these models using neural data. 1

5 0.66437995 303 nips-2013-Sparse Overlapping Sets Lasso for Multitask Learning and its Application to fMRI Analysis

Author: Nikhil Rao, Christopher Cox, Rob Nowak, Timothy T. Rogers

Abstract: Multitask learning can be effective when features useful in one task are also useful for other tasks, and the group lasso is a standard method for selecting a common subset of features. In this paper, we are interested in a less restrictive form of multitask learning, wherein (1) the available features can be organized into subsets according to a notion of similarity and (2) features useful in one task are similar, but not necessarily identical, to the features best suited for other tasks. The main contribution of this paper is a new procedure called Sparse Overlapping Sets (SOS) lasso, a convex optimization that automatically selects similar features for related learning tasks. Error bounds are derived for SOSlasso and its consistency is established for squared error loss. In particular, SOSlasso is motivated by multisubject fMRI studies in which functional activity is classified using brain voxels as features. Experiments with real and synthetic data demonstrate the advantages of SOSlasso compared to the lasso and group lasso. 1

6 0.66348457 262 nips-2013-Real-Time Inference for a Gamma Process Model of Neural Spiking

7 0.66158921 345 nips-2013-Variance Reduction for Stochastic Gradient Optimization

8 0.66022348 221 nips-2013-On the Expressive Power of Restricted Boltzmann Machines

9 0.65631348 121 nips-2013-Firing rate predictions in optimal balanced networks

10 0.65514106 40 nips-2013-Approximate Inference in Continuous Determinantal Processes

11 0.65092754 131 nips-2013-Geometric optimisation on positive definite matrices for elliptically contoured distributions

12 0.64836913 266 nips-2013-Recurrent linear models of simultaneously-recorded neural populations

13 0.64822251 236 nips-2013-Optimal Neural Population Codes for High-dimensional Stimulus Variables

14 0.647636 70 nips-2013-Contrastive Learning Using Spectral Methods

15 0.64761931 301 nips-2013-Sparse Additive Text Models with Low Rank Background

16 0.64592171 294 nips-2013-Similarity Component Analysis

17 0.64436996 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models

18 0.64384043 304 nips-2013-Sparse nonnegative deconvolution for compressive calcium imaging: algorithms and phase transitions

19 0.64323586 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits

20 0.6432215 282 nips-2013-Robust Multimodal Graph Matching: Sparse Coding Meets Graph Matching