nips nips2012 nips2012-299 knowledge-graph by maker-knowledge-mining

299 nips-2012-Scalable imputation of genetic data with a discrete fragmentation-coagulation process


Source: pdf

Author: Lloyd Elliott, Yee W. Teh

Abstract: We present a Bayesian nonparametric model for genetic sequence data in which a set of genetic sequences is modelled using a Markov model of partitions. The partitions at consecutive locations in the genome are related by the splitting and merging of their clusters. Our model can be thought of as a discrete analogue of the continuous fragmentation-coagulation process [Teh et al 2011], preserving the important properties of projectivity, exchangeability and reversibility, while being more scalable. We apply this model to the problem of genotype imputation, showing improved computational efficiency while maintaining accuracies comparable to other state-of-the-art genotype imputation methods. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Scalable imputation of genetic data with a discrete fragmentation-coagulation process Lloyd T. Elliott [sent-1, score-0.338]

2 Abstract: We present a Bayesian nonparametric model for genetic sequence data in which a set of genetic sequences is modelled using a Markov model of partitions. [sent-14, score-0.415]

3 The partitions at consecutive locations in the genome are related by the splitting and merging of their clusters. [sent-15, score-0.165]

4 We apply this model to the problem of genotype imputation, showing improved computational efficiency while maintaining accuracies comparable to other state-of-the-art genotype imputation methods. [sent-17, score-0.241]

5 Due to gene conversion and chromosomal crossover, genetic sequences exhibit a local ‘mosaic’-like structure wherein sequences are composed of prototypical segments called haplotypes [5]. [sent-23, score-0.332]

6 Locally, these prototypical segments are shared by a cluster of sequences: each sequence in the cluster is described well by a haplotype that is specific to the location on the chromosome of the cluster. [sent-24, score-0.601]

7 A continuous fragmentation-coagulation process (CFCP) has recently been proposed for modelling local mosaic structure in genetic sequences [9]. [sent-29, score-0.32]

8 The CFCP is a nonparametric model defined directly on unlabelled partitions, thereby avoiding both costly model selection and the label switching problem [8]. [sent-30, score-0.15]

9 The vertical axis represents clusters from the last sample of an MCMC chain converging to the DFCP posterior. [sent-36, score-0.147]

10 The DFCP describes location-dependent unlabelled partitions such that at each location on the chromosome the clusters split into multiple clusters which then merge to form the clusters at the next location. [sent-38, score-0.682]

11 The splitting and merging of clusters across the chromosome forms a mosaic structure of haplotypes. [sent-40, score-0.343]

12 Sections 4 and 5 report some experimental results showing good performance on an imputation problem, and in section 6 we conclude. [sent-43, score-0.139]

13 2 The discrete fragmentation-coagulation process In humans, most of the bases on a chromosome are the same for all individuals in a population. [sent-44, score-0.229]

14 At each SNP location, a particular chromosome has one of (usually) two possible bases, referred to as the major and minor alleles. [sent-46, score-0.158]

15 Consequently, SNP data for a chromosome can be modelled as a binary sequence, with each entry indicating which of the two bases is present at that location. [sent-47, score-0.158]

16 In this paper we consider SNP data consisting of n binary sequences x = (x1, …, xn), where each sequence xi = (xi1, …, xiT) has length T and corresponds to the T SNPs on a segment of a chromosome in an individual. [sent-48, score-0.258]

17 The t-th entry xit of sequence i is equal to zero if individual i has the major allele at location t and equal to one otherwise. [sent-49, score-0.34]
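
As a concrete illustration of this encoding, the short sketch below builds the binary matrix x from toy base calls; the data and variable names are invented for illustration and are not from the paper.

```python
import numpy as np

# Hypothetical base calls for n = 3 chromosome segments at T = 4 SNP locations.
bases = np.array([["A", "C", "G", "T"],
                  ["A", "C", "G", "G"],
                  ["G", "C", "G", "T"]])

# Major allele at each SNP: the most frequent base in that column.
major = []
for t in range(bases.shape[1]):
    vals, counts = np.unique(bases[:, t], return_counts=True)
    major.append(vals[np.argmax(counts)])
major = np.array(major)

# x[i, t] = 0 if individual i carries the major allele at SNP t, and 1 otherwise.
x = (bases != major[None, :]).astype(int)
print(x)  # an (n, T) binary matrix
```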

18 We will model these sequences using a discrete fragmentation-coagulation process (DFCP) so that the sequence values at the SNP at location t are described by the latent partition πt of the sequences. [sent-50, score-0.292]

19 Each cluster in the partition corresponds to a haplotype. [sent-51, score-0.168]

20 The DFCP models the sequence of partitions using a discrete Markov chain as follows: starting with πt , we first fragment each cluster in πt into smaller clusters, forming a finer partition ρt . [sent-52, score-0.368]

21 Then we coagulate the clusters in ρt to form the clusters of πt+1 . [sent-53, score-0.27]

22 In the remainder of this section, we will first give some background theory on partitions, and random fragmentation and coagulation operations and then we will describe the DFCP as a Markov chain over partitions. [sent-54, score-0.479]

23 Finally, we will describe the likelihood model used to relate the sequence of partitions to the observed sequences. [sent-55, score-0.15]

24 The fragmentation and coagulation operators are random operations on partitions. [sent-66, score-0.477]

25 The fragmentation FRAG(π, α, σ) of a partition π is formed by further partitioning each cluster a of π according to CRP(a, α, σ) and then taking the union of the resulting partitions, yielding a partition of S that is finer than π. [sent-67, score-0.449]

26 Conversely, the coagulation COAG(π, α, σ) of π is formed by partitioning the set of clusters of π (i.e., the set π itself) according to CRP(π, α, σ) and then replacing each cluster with the union of its elements, yielding a partition that is coarser than π. [sent-68, score-0.357] [sent-70, score-0.185]
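
The two operators can be sketched directly from these definitions. The following is a minimal illustration assuming the usual sequential (customer-by-customer) construction of the two-parameter CRP; the function names and data structures are my own conventions, not the paper's.

```python
import numpy as np

def crp_partition(items, alpha, sigma, rng):
    """Sample a partition of `items` from a two-parameter CRP(alpha, sigma), customer by customer."""
    clusters = []
    for x in items:
        if not clusters:          # the first customer always opens a new cluster
            clusters.append([x])
            continue
        k = len(clusters)
        # Join existing cluster c w.p. prop. to (#c - sigma); open a new one w.p. prop. to (alpha + k*sigma).
        weights = np.array([len(c) - sigma for c in clusters] + [alpha + k * sigma], dtype=float)
        j = rng.choice(k + 1, p=weights / weights.sum())
        if j == k:
            clusters.append([x])
        else:
            clusters[j].append(x)
    return [frozenset(c) for c in clusters]

def frag(partition, alpha, sigma, rng):
    """FRAG(pi, alpha, sigma): further partition each cluster with an independent CRP and take the union,
    giving a partition finer than pi."""
    pieces = []
    for a in partition:
        pieces.extend(crp_partition(sorted(a), alpha, sigma, rng))
    return pieces

def coag(partition, alpha, sigma, rng):
    """COAG(pi, alpha, sigma): partition the set of clusters of pi with a CRP and merge each group,
    giving a partition coarser than pi."""
    groups = crp_partition(list(range(len(partition))), alpha, sigma, rng)
    return [frozenset().union(*(partition[i] for i in g)) for g in groups]
```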

28 The fragmentation and coagulation operators are linked through the following theorem by Pitman [12]. [sent-71, score-0.477]

29 Under the DFCP, the marginal distribution of the partition πt is CRP(S, µ, 0) and so µ controls the number of clusters that are found at each location. [sent-77, score-0.167]

30 Subsequently, we draw ρt | πt from FRAG(πt, 0, Rt), which fragments each of the clusters in πt into smaller clusters in ρt, and then πt+1 | ρt from COAG(ρt, µ/Rt, 0), which coagulates clusters in ρt into larger clusters in πt+1. [sent-85, score-0.514]
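
Putting the operators together gives a forward simulation of the DFCP prior over partitions, reusing crp_partition, frag and coag from the sketch above; the parameter values in the usage lines are illustrative only.

```python
import numpy as np

def simulate_dfcp_partitions(n, T, mu, R, rng):
    """Forward simulation of the DFCP prior: pi_1 ~ CRP(S, mu, 0), then for each location t
    rho_t | pi_t ~ FRAG(pi_t, 0, R_t) and pi_{t+1} | rho_t ~ COAG(rho_t, mu / R_t, 0)."""
    pi = crp_partition(list(range(n)), alpha=mu, sigma=0.0, rng=rng)
    partitions = [pi]
    for t in range(T - 1):
        rho = frag(pi, alpha=0.0, sigma=R[t], rng=rng)
        pi = coag(rho, alpha=mu / R[t], sigma=0.0, rng=rng)
        partitions.append(pi)
    return partitions

rng = np.random.default_rng(0)
parts = simulate_dfcp_partitions(n=8, T=5, mu=2.0, R=[0.3, 0.3, 0.3, 0.3], rng=rng)
print([len(p) for p in parts])   # number of clusters (local haplotypes) at each SNP location
```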

31 In population genetics the CRP appears as (and was predated by) the Ewens sampling formula [13], a counting formula for the number of alleles appearing in a population, observed at a given location. [sent-88, score-0.16]

32 Over a short segment of the chromosome where recombination rates are low, haplotypes behave like alleles and so a CRP prior on the number of haplotypes at a location is reasonable. [sent-89, score-0.412]

33 Further, since fragmentation and coagulation operators are defined in terms of CRPs which are projective and exchangeable, the Markov chain is projective and exchangeable in S as well. [sent-90, score-0.542]

34 Chromosome replication is directional and so statistics for genetic processes along the chromosome are not reversible. [sent-94, score-0.287]

35 But the strength of this effect on SNP data is not currently known and many genetic models such as the coalescent with recombination [14] assume reversibility for simplicity. [sent-95, score-0.279]

36 The non-reversibility displayed by models such as fastPHASE is an artifact of their construction rather than an attempt to capture non-reversible aspects of genetic sequences. [sent-96, score-0.148]

37 3 Likelihood model for sequence observations: Given the sequence of partitions (π1, …, πT), we model the observations in each cluster at each location t independently. [sent-98, score-0.386]

38 Figure 2 (left): Graphical model for the discrete fragmentation-coagulation process, with nodes for the partitions πt, the fragmentations ρt, the observations xit, the cluster emissions θta and the allele frequencies βt. [sent-100, score-0.479]

39 For each sequence i, let ait ∈ πt be the cluster in πt containing i. [sent-104, score-0.217]

40 Let θta be the emission of cluster a at location t. [sent-105, score-0.177]

41 Let the mean of θta be βt (this is the latent allele frequency at location t). [sent-107, score-0.194]

42 We assume that conditioned on the partitions and the parameters, the observations xit are independent, and determined by the cluster parameter θta . [sent-108, score-0.347]

43 Since this distribution is heavy tailed, the βt variables will have more mass near 0 and 1 than they would have if γt were fixed, adding sparsity to the latent allele frequencies. [sent-114, score-0.14]
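
The excerpt does not spell out the exact emission distributions, so the following sketch assumes a common Beta-Bernoulli parameterisation that matches the description (cluster emissions θta with mean βt, and a concentration γt whose heavy-tailed prior pushes βt towards 0 and 1); treat it as a stand-in rather than the paper's likelihood.

```python
import numpy as np

def sample_emissions(partitions, beta, gamma, rng):
    """Hypothetical Beta-Bernoulli emission sketch: theta_{ta} ~ Beta(gamma_t*beta_t, gamma_t*(1 - beta_t)),
    so that E[theta_{ta}] = beta_t, and x_{it} ~ Bernoulli(theta_{t, a_{it}}) given the partitions."""
    T = len(partitions)
    n = sum(len(a) for a in partitions[0])
    x = np.zeros((n, T), dtype=int)
    for t, pi in enumerate(partitions):
        for a in pi:
            theta_ta = rng.beta(gamma[t] * beta[t], gamma[t] * (1.0 - beta[t]))
            for i in a:
                x[i, t] = rng.binomial(1, theta_ta)
    return x
```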

44 Note that the prior gives more mass to values of Rt close to Rmin which we set close to zero, since we expect the partitions of consecutive locations to be relatively similar so that the mosaic haplotype structure can be formed. [sent-117, score-0.29]

45 4 Relationship with the continuous fragmentation-coagulation process The continuous version of the fragmentation-coagulation process [9], which we refer to as the CFCP, is a partition-valued Markov jump process (MJP). [sent-121, score-0.209]

46 (The ‘time’ variable for this MJP is the chromosome location, viewed as a continuous variable.) [sent-122, score-0.16]

47 There are two types of events in the CFCP: binary fragmentation events, in which a single cluster a is split into two clusters b and c at a rate of RΓ(#b)Γ(#c)/Γ(#a), and binary coagulation events in which two clusters b and c merge to form one cluster a at a rate of R/µ. [sent-124, score-1.082]

48 Then as ε → 0 the probability that the coagulation and fragmentation operations at a specific time step t induce no change in the partition structure πt approaches 1. [sent-127, score-0.499]

49 If we rescale the time steps by t → εt, then the expected number of binary events over a finite interval approaches ε times the rates given above and the expected number of all other events goes to zero, yielding the CFCP. [sent-129, score-0.155]

50 In the CFCP fragmentation and coagulation events are binary: they involve either one cluster fragmenting into two new clusters, or two clusters coagulating into one new cluster. [sent-130, score-0.794]

51 However, for the DFCP the fragmentation and coagulation operators can describe more complicated haplotype structures without introducing more latent events. [sent-131, score-0.585]

52 For example, one cluster splitting into three clusters (as happens to the second haplotype from the top of Figure 1 after the 18th SNP) can be described by the DFCP using just one fragmentation operator. [sent-132, score-0.549]

53 3 Inference with the discrete fragmentation coagulation process We derive a Gibbs sampler for posterior simulation in the DFCP by making use of the exchangeability of the process. [sent-134, score-0.561]

54 Each iteration of the sampler updates the trajectory of cluster assignments of one sequence i through the partition structure. [sent-135, score-0.273]

55 In this section, we derive the conditional distribution of trajectory i using the definition of fragmentation and coagulation and also the posterior distributions of the parameters Rt , µ which we will update using slice sampling [15]. [sent-138, score-0.545]

56 1 Conditional probabilities for the trajectory of sequence i: We will refer to the projections of the partitions πt and ρt onto S − {i} by πt^{−i} and ρt^{−i} respectively. [sent-140, score-0.23]

57 Let at (respectively bt) be the cluster assignment of sequence i at location t in πt (respectively ρt). [sent-141, score-0.371]

58 If the sequence i is placed in a new cluster by itself in πt (i.e., it forms a singleton cluster) we will denote this by at = ∅, and for ρt^{−i} we will denote the respective event by bt = ∅. [sent-142, score-0.209] [sent-144, score-0.187]

60 Otherwise, if the sequence i is placed in an existing cluster in πt^{−i} (respectively ρt^{−i}) we will denote this by at ∈ πt^{−i} (respectively bt ∈ ρt^{−i}). [sent-145, score-0.344]

61 Starting at t = 1, since the initial distribution is π1 ∼ CRP(S, µ, 0), the conditional cluster assignment of sequence i in π1 is given by the CRP probabilities from (1): Pr(a1 = a | π1^{−i}) = #a/(n − 1 + µ) if a ∈ π1^{−i}, and µ/(n − 1 + µ) if a = ∅. (4) [sent-147, score-0.216]

62 To find the conditional distribution of bt given at, we use the definition of the fragmentation operation as independent CRP partitions of each cluster in πt. [sent-148, score-0.568]

63 If at = ∅, then the sequence i is in a cluster by itself in πt and so it will remain in a cluster by itself after fragmenting. [sent-149, score-0.305]

64 If at = a ∈ πt then bt must be one of the clusters in ρt into which a fragments. [sent-151, score-0.257]

65 This can be a singleton cluster, in which case bt = ∅, or it can be one of the clusters in ρt^{−i}. [sent-152, score-0.292]

66 Since a is fragmented according to CRP(a, 0, Rt), when the i-th sequence is added to this CRP it is placed in a cluster b ∈ Ft(a) (the set of clusters of ρt^{−i} into which a fragments) with probability proportional to (#b − Rt), and it is placed in a singleton cluster with probability proportional to Rt·#Ft(a). [sent-154, score-0.438]

67 Similarly, to find the conditional distribution of at+1 given bt = b we use the definition of the coagulation operation. [sent-156, score-0.37]

68 If b ≠ ∅, then the sequence i was not in a singleton cluster in ρt^{−i} and so it must follow the rest of the sequences in b to the unique a ∈ πt+1^{−i} such that b ⊆ a (i.e., at+1 = a). [sent-157, score-0.277]

69 We will refer to the set of clusters in ρt^{−i} that coagulate to form a by Ct(a). [sent-160, score-0.148]

70 If b = ∅ then the sequence i is in a singleton cluster in ρt^{−i} and so we can imagine it being the last customer added to the coagulating CRP(ρt, µ/Rt, 0) of the clusters of ρt. [sent-161, score-0.388]

71 Hence the probability that sequence i is placed in a cluster a ∈ πt+1^{−i} is proportional to #Ct(a), while the probability that it forms a cluster by itself in πt+1^{−i} is proportional to µ/Rt. [sent-162, score-0.393]
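
The three conditionals above (initial CRP assignment, fragmentation step, coagulation step) can be collected into small helper functions. The sketch below assumes clusters are stored as frozensets excluding sequence i (the πt^{−i} and ρt^{−i} of the text), with None standing in for the singleton event ∅; Ft(a) and Ct(a) are passed in precomputed, and all names are my own conventions rather than the paper's.

```python
def initial_probs(pi1_minus_i, n, mu):
    """Pr(a_1 = a | pi_1^{-i}): proportional to #a for an existing cluster and to mu for a new one (entry 61)."""
    probs = {a: len(a) / (n - 1 + mu) for a in pi1_minus_i}
    probs[None] = mu / (n - 1 + mu)
    return probs

def frag_probs(a, F_t_a, R_t):
    """Pr(b_t | a_t = a): a singleton stays a singleton; otherwise i joins a fragment b of a
    w.p. prop. to (#b - R_t), or opens a new singleton w.p. prop. to R_t * #F_t(a) (entry 66)."""
    if a is None:
        return {None: 1.0}
    weights = {b: len(b) - R_t for b in F_t_a}
    weights[None] = R_t * len(F_t_a)
    z = sum(weights.values())
    return {b: w / z for b, w in weights.items()}

def coag_probs(b, pi_next_minus_i, C_t, mu, R_t):
    """Pr(a_{t+1} | b_t = b): deterministic when b is non-empty (i follows b into its containing cluster);
    otherwise prop. to #C_t(a) for existing clusters and to mu / R_t for a new singleton (entries 68-71)."""
    if b is not None:
        target = next(a for a in pi_next_minus_i if b <= a)   # the unique a with b a superset of b
        return {target: 1.0}
    weights = {a: len(C_t[a]) for a in pi_next_minus_i}
    weights[None] = mu / R_t
    z = sum(weights.values())
    return {a: w / z for a, w in weights.items()}
```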

72 2 Message passing and sampling for the sequences of the DFCP Once the conditional probabilities are defined, it is straightforward to derive messages that allow us to conduct backwards-filtering/forwards-sampling to resample the trajectory of sequence i in the DFCP. [sent-165, score-0.247]

73 This provides an exact Gibbs update for the trajectory of that sequence conditioned on the trajectories of all the other sequences and the data. [sent-166, score-0.188]

74 The messages we will define are the conditional distribution of all the data seen after a given location in the sequence conditioned on the cluster assignment of sequence i at that location. [sent-167, score-0.366]

75 As the fragmentation and coagulation conditional probabilities are only supported for clusters a, b such that b ⊆ a, these sums can be expanded so that only non-zero terms are summed over. [sent-176, score-0.61]

76 To sample from the posterior distribution of the trajectory for sequence i conditioned on the other trajectories and the data, we use the Markov property for the chain a1, b1, …, aT. [sent-180, score-0.179]

77 For subsequent bt and at+1 at locations t = 1, …, T − 1, we sample from the corresponding conditional distributions. [sent-187, score-0.152]
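
A schematic backwards-filtering/forwards-sampling pass over the alternating chain a1, b1, a2, …, aT is sketched below. It assumes the conditional tables follow the helper format above, that lik[t][a] = Pr(x_{it} | at = a), and that all tables share the same cluster labels at each location; it is an outline of the update, not the paper's implementation.

```python
import numpy as np

def resample_trajectory(init_p, frag_p, coag_p, lik, rng):
    """Backwards filtering then forwards sampling for one sequence i.

    init_p : dict, init_p[a]        = Pr(a_1 = a | pi_1^{-i})
    frag_p : list, frag_p[t][a][b]  = Pr(b_t = b | a_t = a)        for t = 0..T-2
    coag_p : list, coag_p[t][b][a]  = Pr(a_{t+1} = a | b_t = b)    for t = 0..T-2
    lik    : list, lik[t][a]        = Pr(x_{it} | a_t = a)         for t = 0..T-1
    """
    T = len(lik)

    # Backwards messages: msg[t][a] = Pr(x_{i,t+1:T} | a_t = a).
    msg = [dict() for _ in range(T)]
    msg[T - 1] = {a: 1.0 for a in lik[T - 1]}
    for t in range(T - 2, -1, -1):
        for a in lik[t]:
            msg[t][a] = sum(
                frag_p[t][a][b] * sum(coag_p[t][b][a2] * lik[t + 1][a2] * msg[t + 1][a2]
                                      for a2 in coag_p[t][b])
                for b in frag_p[t][a])

    def draw(weights):
        keys = list(weights)
        w = np.array([weights[k] for k in keys], dtype=float)
        return keys[rng.choice(len(keys), p=w / w.sum())]

    # Forwards sampling of a_1, b_1, a_2, ..., a_T.
    a = draw({c: init_p[c] * lik[0][c] * msg[0][c] for c in init_p})
    traj = [a]
    for t in range(T - 1):
        b = draw({c: frag_p[t][a][c] * sum(coag_p[t][c][a2] * lik[t + 1][a2] * msg[t + 1][a2]
                                           for a2 in coag_p[t][c])
                  for c in frag_p[t][a]})
        a = draw({c: coag_p[t][b][c] * lik[t + 1][c] * msg[t + 1][c] for c in coag_p[t][b]})
        traj.extend([b, a])
    return traj
```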

78 Figure 3: Allele imputation for X chromosomes from the Thousand Genomes project (plot axis: runtime in seconds). [sent-216, score-0.191]

79 Left: Accuracy for prediction of held out alleles for continuous (CFCP) and discrete (DFCP) versions of fragmentation-coagulation process and for popular methods BEAGLE and fastPHASE. [sent-217, score-0.173]

80 the posterior probabilities of µ and Rt given the partitions π1:T and ρ1:(T−1) are as follows: Pr(µ | π, ρ) ∝ Pr(µ) Pr(π1 | µ, R1) Pr(ρ1 | π1, µ, R1) ··· Pr(πT | ρT−1, µ, RT−1) ∝ Pr(µ) (Γ(µ)/Γ(µ + n)) µ^{Σ_{t=1}^{T} #πt} ∏_{t=1}^{T−1} Γ(µ/Rt)/Γ(µ/Rt + #ρt). [sent-221, score-0.151]
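
Entry 55 above notes that µ and Rt are updated with slice sampling [15]; a generic stepping-out slice sampler (Neal, 2003) that could serve this purpose is sketched below. The unnormalized log posterior is supplied by the caller, so the expression above (or the paper's exact version of it) plugs in there; everything else is an assumption about implementation detail.

```python
import numpy as np

def slice_sample(x0, log_post, rng, w=1.0, max_steps=50):
    """One stepping-out slice sampling update (Neal, 2003) for a univariate parameter."""
    log_y = log_post(x0) + np.log(rng.uniform())    # auxiliary slice level
    left = x0 - w * rng.uniform()
    right = left + w
    for _ in range(max_steps):                      # step the interval out to the left
        if log_post(left) < log_y:
            break
        left -= w
    for _ in range(max_steps):                      # and to the right
        if log_post(right) < log_y:
            break
        right += w
    while True:                                     # shrink until a point on the slice is found
        x1 = rng.uniform(left, right)
        if log_post(x1) >= log_y:
            return x1
        if x1 < x0:
            left = x1
        else:
            right = x1
```

For µ one would typically run this update on the log scale, e.g. mu = np.exp(slice_sample(np.log(mu), log_post_log_mu, rng)), where log_post_log_mu (a hypothetical caller-supplied function) adds the log-Jacobian term log(µ) to the unnormalized log posterior; the same univariate update can be applied to each Rt in turn.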

81 Experiments: To examine the accuracy and scalability of the DFCP we conducted an allele imputation experiment on SNP data from the Thousand Genomes project. [sent-223, score-0.273]

82 We also compared the runtime of the samplers for the DFCP and CFCP on data simulated from the coalescent with recombination model [14]. [sent-224, score-0.168]

83 For the allele imputation experiment, we considered SNPs from 524 male X chromosomes. [sent-226, score-0.256]

84 The accuracies for the DFCP and CFCP were computed by thresholding the empirical marginal probabilities of the held out alleles at 0.5. [sent-232, score-0.157]

85 In order to match the expected number of clusters in the posterior, we also conducted allele imputation in the 50% missing data condition with µ fixed at 10. [sent-236, score-0.412]

86 We then computed the accuracy of the samples by predicting held out alleles based on the cluster assignments of the sample. [sent-239, score-0.241]

87 In a second experiment we simulated datasets from the coalescent with recombination model consisting of between 10,000 and 50,000 sequences using the software ms [14]. [sent-241, score-0.176]

88 5 Results The accuracy of the DFCP in the allele imputation experiment was comparable to that of the CFCP and fastPHASE in all missing data conditions (Figure 3, left). [sent-243, score-0.307]

89 The CFCP has an arbitrary number of latent events between consecutive observations and it is likely that the runtime improvement shown by the DFCP is due to its reduced number of required message calculations. [sent-258, score-0.176]

90 The DFCP is a partition-valued Markov chain, where partitions change along the chromosome by a fragmentation operation followed by a coagulation operation. [sent-260, score-0.684]

91 The DFCP is designed to model the mosaic haplotype structure observed in genetic sequences. [sent-261, score-0.298]

92 We applied the DFCP to an allele prediction task on data from the Thousand Genomes Project yielding accuracies comparable to state-of-the-art methods and runtimes that were lower than the runtimes of the continuous fragmentation-coagulation process [9]. [sent-262, score-0.257]

93 Figure 4: Runtimes per iteration per sequence of the DFCP and CFCP on simulated datasets consisting of between 10,000 and 50,000 sequences. The DFCP and CFCP induce different joint distributions on the partitions at adjacent locations. [sent-272, score-0.17]

94 The CFCP has an arbitrary number of latent binary events wherein a single cluster is split into two clusters, or two clusters are merged into one. [sent-276, score-0.337]

95 The DFCP however can model any partition structure with one pair of fragmentation and coagulation operations. [sent-277, score-0.499]

96 Another avenue of future research is to understand how other genetic processes can be incorporated into the fragmentation-coagulation framework, including population admixture and gene conversion. [sent-280, score-0.178]

97 Although haplotype structure is a local property, the Markov assumption does not hold in real genetic data. [sent-281, score-0.233]

98 Properties of a neutral allele model with intragenic recombination. [sent-291, score-0.144]

99 Generating samples under a Wright-Fisher neutral model of genetic variation. [sent-375, score-0.175]

100 A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. [sent-393, score-0.205]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('dfcp', 0.563), ('cfcp', 0.406), ('coagulation', 0.235), ('fragmentation', 0.219), ('crp', 0.2), ('rt', 0.179), ('pr', 0.166), ('genetic', 0.148), ('chromosome', 0.139), ('imputation', 0.139), ('bt', 0.135), ('cluster', 0.123), ('clusters', 0.122), ('allele', 0.117), ('xit', 0.11), ('snp', 0.106), ('partitions', 0.091), ('ta', 0.088), ('haplotype', 0.085), ('beagle', 0.078), ('events', 0.069), ('genomes', 0.069), ('alleles', 0.069), ('fastphase', 0.065), ('mosaic', 0.065), ('oag', 0.065), ('rag', 0.065), ('genetics', 0.061), ('sequences', 0.06), ('sequence', 0.059), ('recombination', 0.058), ('mt', 0.056), ('location', 0.054), ('runtime', 0.052), ('messages', 0.048), ('haplotypes', 0.046), ('hapmap', 0.046), ('snps', 0.046), ('trajectory', 0.046), ('partition', 0.045), ('jump', 0.044), ('thousand', 0.041), ('genotype', 0.04), ('rmin', 0.039), ('coalescent', 0.038), ('singleton', 0.035), ('reversibility', 0.035), ('ait', 0.035), ('missing', 0.034), ('probabilities', 0.034), ('held', 0.032), ('unlabelled', 0.032), ('ft', 0.032), ('consecutive', 0.032), ('markov', 0.031), ('population', 0.03), ('exchangeability', 0.03), ('placed', 0.027), ('neutral', 0.027), ('runtimes', 0.027), ('switching', 0.027), ('mcmc', 0.026), ('coagulate', 0.026), ('coagulates', 0.026), ('coagulating', 0.026), ('projectivity', 0.026), ('trios', 0.026), ('process', 0.026), ('posterior', 0.026), ('genome', 0.025), ('chain', 0.025), ('discrete', 0.025), ('ct', 0.024), ('latent', 0.023), ('customer', 0.023), ('conditioned', 0.023), ('operators', 0.023), ('marchini', 0.023), ('elliott', 0.023), ('mjp', 0.023), ('polymorphisms', 0.023), ('accuracies', 0.022), ('proportional', 0.022), ('nucleotide', 0.021), ('uniformization', 0.021), ('hmm', 0.021), ('continuous', 0.021), ('gibbs', 0.02), ('simulated', 0.02), ('projective', 0.02), ('individuals', 0.02), ('rao', 0.019), ('slice', 0.019), ('bases', 0.019), ('backwards', 0.018), ('prototypical', 0.018), ('yielding', 0.017), ('forms', 0.017), ('locations', 0.017), ('accuracy', 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 299 nips-2012-Scalable imputation of genetic data with a discrete fragmentation-coagulation process

Author: Lloyd Elliott, Yee W. Teh

Abstract: We present a Bayesian nonparametric model for genetic sequence data in which a set of genetic sequences is modelled using a Markov model of partitions. The partitions at consecutive locations in the genome are related by the splitting and merging of their clusters. Our model can be thought of as a discrete analogue of the continuous fragmentation-coagulation process [Teh et al 2011], preserving the important properties of projectivity, exchangeability and reversibility, while being more scalable. We apply this model to the problem of genotype imputation, showing improved computational efficiency while maintaining accuracies comparable to other state-of-the-art genotype imputation methods. 1

2 0.11036023 153 nips-2012-How Prior Probability Influences Decision Making: A Unifying Probabilistic Model

Author: Yanping Huang, Timothy Hanks, Mike Shadlen, Abram L. Friesen, Rajesh P. Rao

Abstract: How does the brain combine prior knowledge with sensory evidence when making decisions under uncertainty? Two competing descriptive models have been proposed based on experimental data. The first posits an additive offset to a decision variable, implying a static effect of the prior. However, this model is inconsistent with recent data from a motion discrimination task involving temporal integration of uncertain sensory evidence. To explain this data, a second model has been proposed which assumes a time-varying influence of the prior. Here we present a normative model of decision making that incorporates prior knowledge in a principled way. We show that the additive offset model and the time-varying prior model emerge naturally when decision making is viewed within the framework of partially observable Markov decision processes (POMDPs). Decision making in the model reduces to (1) computing beliefs given observations and prior information in a Bayesian manner, and (2) selecting actions based on these beliefs to maximize the expected sum of future rewards. We show that the model can explain both data previously explained using the additive offset model as well as more recent data on the time-varying influence of prior knowledge on decision making. 1

3 0.10886425 53 nips-2012-Bayesian Pedigree Analysis using Measure Factorization

Author: Bonnie Kirkpatrick, Alexandre Bouchard-côté

Abstract: Pedigrees, or family trees, are directed graphs used to identify sites of the genome that are correlated with the presence or absence of a disease. With the advent of genotyping and sequencing technologies, there has been an explosion in the amount of data available, both in the number of individuals and in the number of sites. Some pedigrees number in the thousands of individuals. Meanwhile, analysis methods have remained limited to pedigrees of < 100 individuals which limits analyses to many small independent pedigrees. Disease models, such those used for the linkage analysis log-odds (LOD) estimator, have similarly been limited. This is because linkage analysis was originally designed with a different task in mind, that of ordering the sites in the genome, before there were technologies that could reveal the order. LODs are difficult to interpret and nontrivial to extend to consider interactions among sites. These developments and difficulties call for the creation of modern methods of pedigree analysis. Drawing from recent advances in graphical model inference and transducer theory, we introduce a simple yet powerful formalism for expressing genetic disease models. We show that these disease models can be turned into accurate and computationally efficient estimators. The technique we use for constructing the variational approximation has potential applications to inference in other large-scale graphical models. This method allows inference on larger pedigrees than previously analyzed in the literature, which improves disease site prediction. 1

4 0.10109681 203 nips-2012-Locating Changes in Highly Dependent Data with Unknown Number of Change Points

Author: Azadeh Khaleghi, Daniil Ryabko

Abstract: The problem of multiple change point estimation is considered for sequences with unknown number of change points. A consistency framework is suggested that is suitable for highly dependent time-series, and an asymptotically consistent algorithm is proposed. In order for the consistency to be established the only assumption required is that the data is generated by stationary ergodic time-series distributions. No modeling, independence or parametric assumptions are made; the data are allowed to be dependent and the dependence can be of arbitrary form. The theoretical results are complemented with experimental evaluations. 1

5 0.08576151 126 nips-2012-FastEx: Hash Clustering with Exponential Families

Author: Amr Ahmed, Sujith Ravi, Alex J. Smola, Shravan M. Narayanamurthy

Abstract: Clustering is a key component in any data analysis toolbox. Despite its importance, scalable algorithms often eschew rich statistical models in favor of simpler descriptions such as k-means clustering. In this paper we present a sampler, capable of estimating mixtures of exponential families. At its heart lies a novel proposal distribution using random projections to achieve high throughput in generating proposals, which is crucial for clustering models with large numbers of clusters. 1

6 0.078879684 69 nips-2012-Clustering Sparse Graphs

7 0.076343983 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

8 0.069556296 41 nips-2012-Ancestor Sampling for Particle Gibbs

9 0.065353841 99 nips-2012-Dip-means: an incremental clustering method for estimating the number of clusters

10 0.063199364 26 nips-2012-A nonparametric variable clustering model

11 0.061840676 339 nips-2012-The Time-Marginalized Coalescent Prior for Hierarchical Clustering

12 0.057387214 219 nips-2012-Modelling Reciprocating Relationships with Hawkes Processes

13 0.05554992 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions

14 0.055344746 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models

15 0.055218261 205 nips-2012-MCMC for continuous-time discrete-state systems

16 0.055095859 68 nips-2012-Clustering Aggregation as Maximum-Weight Independent Set

17 0.052841611 251 nips-2012-On Lifting the Gibbs Sampling Algorithm

18 0.052823119 315 nips-2012-Slice sampling normalized kernel-weighted completely random measure mixture models

19 0.052584004 107 nips-2012-Effective Split-Merge Monte Carlo Methods for Nonparametric Models of Sequential Data

20 0.051424686 151 nips-2012-High-Order Multi-Task Feature Learning to Identify Longitudinal Phenotypic Markers for Alzheimer's Disease Progression Prediction


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.112), (1, 0.006), (2, 0.003), (3, 0.014), (4, -0.099), (5, -0.063), (6, -0.002), (7, -0.008), (8, 0.018), (9, 0.046), (10, 0.031), (11, -0.095), (12, -0.029), (13, -0.031), (14, 0.014), (15, -0.077), (16, -0.08), (17, -0.013), (18, -0.023), (19, -0.05), (20, -0.026), (21, 0.023), (22, -0.103), (23, -0.033), (24, 0.022), (25, -0.029), (26, -0.04), (27, 0.017), (28, 0.032), (29, -0.059), (30, 0.06), (31, -0.002), (32, 0.052), (33, 0.021), (34, 0.04), (35, 0.038), (36, 0.017), (37, -0.027), (38, -0.049), (39, -0.05), (40, -0.047), (41, 0.077), (42, -0.004), (43, -0.073), (44, 0.005), (45, -0.046), (46, -0.093), (47, -0.086), (48, -0.05), (49, 0.038)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92187035 299 nips-2012-Scalable imputation of genetic data with a discrete fragmentation-coagulation process

Author: Lloyd Elliott, Yee W. Teh

Abstract: We present a Bayesian nonparametric model for genetic sequence data in which a set of genetic sequences is modelled using a Markov model of partitions. The partitions at consecutive locations in the genome are related by the splitting and merging of their clusters. Our model can be thought of as a discrete analogue of the continuous fragmentation-coagulation process [Teh et al 2011], preserving the important properties of projectivity, exchangeability and reversibility, while being more scalable. We apply this model to the problem of genotype imputation, showing improved computational efficiency while maintaining accuracies comparable to other state-of-the-art genotype imputation methods. 1

2 0.63828385 203 nips-2012-Locating Changes in Highly Dependent Data with Unknown Number of Change Points

Author: Azadeh Khaleghi, Daniil Ryabko

Abstract: The problem of multiple change point estimation is considered for sequences with unknown number of change points. A consistency framework is suggested that is suitable for highly dependent time-series, and an asymptotically consistent algorithm is proposed. In order for the consistency to be established the only assumption required is that the data is generated by stationary ergodic time-series distributions. No modeling, independence or parametric assumptions are made; the data are allowed to be dependent and the dependence can be of arbitrary form. The theoretical results are complemented with experimental evaluations. 1

3 0.54725343 153 nips-2012-How Prior Probability Influences Decision Making: A Unifying Probabilistic Model

Author: Yanping Huang, Timothy Hanks, Mike Shadlen, Abram L. Friesen, Rajesh P. Rao

Abstract: How does the brain combine prior knowledge with sensory evidence when making decisions under uncertainty? Two competing descriptive models have been proposed based on experimental data. The first posits an additive offset to a decision variable, implying a static effect of the prior. However, this model is inconsistent with recent data from a motion discrimination task involving temporal integration of uncertain sensory evidence. To explain this data, a second model has been proposed which assumes a time-varying influence of the prior. Here we present a normative model of decision making that incorporates prior knowledge in a principled way. We show that the additive offset model and the time-varying prior model emerge naturally when decision making is viewed within the framework of partially observable Markov decision processes (POMDPs). Decision making in the model reduces to (1) computing beliefs given observations and prior information in a Bayesian manner, and (2) selecting actions based on these beliefs to maximize the expected sum of future rewards. We show that the model can explain both data previously explained using the additive offset model as well as more recent data on the time-varying influence of prior knowledge on decision making. 1

4 0.5324139 219 nips-2012-Modelling Reciprocating Relationships with Hawkes Processes

Author: Charles Blundell, Jeff Beck, Katherine A. Heller

Abstract: We present a Bayesian nonparametric model that discovers implicit social structure from interaction time-series data. Social groups are often formed implicitly, through actions among members of groups. Yet many models of social networks use explicitly declared relationships to infer social structure. We consider a particular class of Hawkes processes, a doubly stochastic point process, that is able to model reciprocity between groups of individuals. We then extend the Infinite Relational Model by using these reciprocating Hawkes processes to parameterise its edges, making events associated with edges co-dependent through time. Our model outperforms general, unstructured Hawkes processes as well as structured Poisson process-based models at predicting verbal and email turn-taking, and military conflicts among nations. 1

5 0.52844429 151 nips-2012-High-Order Multi-Task Feature Learning to Identify Longitudinal Phenotypic Markers for Alzheimer's Disease Progression Prediction

Author: Hua Wang, Feiping Nie, Heng Huang, Jingwen Yan, Sungeun Kim, Shannon Risacher, Andrew Saykin, Li Shen

Abstract: Alzheimer’s disease (AD) is a neurodegenerative disorder characterized by progressive impairment of memory and other cognitive functions. Regression analysis has been studied to relate neuroimaging measures to cognitive status. However, whether these measures have further predictive power to infer a trajectory of cognitive performance over time is still an under-explored but important topic in AD research. We propose a novel high-order multi-task learning model to address this issue. The proposed model explores the temporal correlations existing in imaging and cognitive data by structured sparsity-inducing norms. The sparsity of the model enables the selection of a small number of imaging measures while maintaining high prediction accuracy. The empirical studies, using the longitudinal imaging and cognitive data of the ADNI cohort, have yielded promising results.

6 0.52568203 155 nips-2012-Human memory search as a random walk in a semantic network

7 0.49798253 68 nips-2012-Clustering Aggregation as Maximum-Weight Independent Set

8 0.48956713 276 nips-2012-Probabilistic Event Cascades for Alzheimer's disease

9 0.48873854 41 nips-2012-Ancestor Sampling for Particle Gibbs

10 0.486121 26 nips-2012-A nonparametric variable clustering model

11 0.48076361 53 nips-2012-Bayesian Pedigree Analysis using Measure Factorization

12 0.46349913 196 nips-2012-Learning with Partially Absorbing Random Walks

13 0.45843452 251 nips-2012-On Lifting the Gibbs Sampling Algorithm

14 0.45625085 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models

15 0.45478898 291 nips-2012-Reducing statistical time-series problems to binary classification

16 0.44737375 126 nips-2012-FastEx: Hash Clustering with Exponential Families

17 0.43589389 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

18 0.42586353 315 nips-2012-Slice sampling normalized kernel-weighted completely random measure mixture models

19 0.41929892 107 nips-2012-Effective Split-Merge Monte Carlo Methods for Nonparametric Models of Sequential Data

20 0.41849795 133 nips-2012-Finding Exemplars from Pairwise Dissimilarities via Simultaneous Sparse Recovery


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.057), (9, 0.36), (17, 0.011), (21, 0.045), (38, 0.055), (42, 0.015), (54, 0.014), (55, 0.014), (74, 0.032), (76, 0.125), (79, 0.033), (80, 0.08), (92, 0.06)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.73032337 299 nips-2012-Scalable imputation of genetic data with a discrete fragmentation-coagulation process

Author: Lloyd Elliott, Yee W. Teh

Abstract: We present a Bayesian nonparametric model for genetic sequence data in which a set of genetic sequences is modelled using a Markov model of partitions. The partitions at consecutive locations in the genome are related by the splitting and merging of their clusters. Our model can be thought of as a discrete analogue of the continuous fragmentation-coagulation process [Teh et al 2011], preserving the important properties of projectivity, exchangeability and reversibility, while being more scalable. We apply this model to the problem of genotype imputation, showing improved computational efficiency while maintaining accuracies comparable to other state-of-the-art genotype imputation methods. 1

2 0.62497151 156 nips-2012-Identifiability and Unmixing of Latent Parse Trees

Author: Percy Liang, Daniel J. Hsu, Sham M. Kakade

Abstract: This paper explores unsupervised learning of parsing models along two directions. First, which models are identifiable from infinite data? We use a general technique for numerically checking identifiability based on the rank of a Jacobian matrix, and apply it to several standard constituency and dependency parsing models. Second, for identifiable models, how do we estimate the parameters efficiently? EM suffers from local optima, while recent work using spectral methods [1] cannot be directly applied since the topology of the parse tree varies across sentences. We develop a strategy, unmixing, which deals with this additional complexity for restricted classes of parsing models. 1

3 0.61376864 99 nips-2012-Dip-means: an incremental clustering method for estimating the number of clusters

Author: Argyris Kalogeratos, Aristidis Likas

Abstract: Learning the number of clusters is a key problem in data clustering. We present dip-means, a novel robust incremental method to learn the number of data clusters that can be used as a wrapper around any iterative clustering algorithm of k-means family. In contrast to many popular methods which make assumptions about the underlying cluster distributions, dip-means only assumes a fundamental cluster property: each cluster to admit a unimodal distribution. The proposed algorithm considers each cluster member as an individual ‘viewer’ and applies a univariate statistic hypothesis test for unimodality (dip-test) on the distribution of distances between the viewer and the cluster members. Important advantages are: i) the unimodality test is applied on univariate distance vectors, ii) it can be directly applied with kernel-based methods, since only the pairwise distances are involved in the computations. Experimental results on artificial and real datasets indicate the effectiveness of our method and its superiority over analogous approaches.

4 0.54264259 246 nips-2012-Nonparametric Max-Margin Matrix Factorization for Collaborative Prediction

Author: Minjie Xu, Jun Zhu, Bo Zhang

Abstract: We present a probabilistic formulation of max-margin matrix factorization and build accordingly a nonparametric Bayesian model which automatically resolves the unknown number of latent factors. Our work demonstrates a successful example that integrates Bayesian nonparametrics and max-margin learning, which are conventionally two separate paradigms and enjoy complementary advantages. We develop an efficient variational algorithm for posterior inference, and our extensive empirical studies on large-scale MovieLens and EachMovie data sets appear to justify the aforementioned dual advantages. 1

5 0.52545059 230 nips-2012-Multiple Choice Learning: Learning to Produce Multiple Structured Outputs

Author: Abner Guzmán-rivera, Dhruv Batra, Pushmeet Kohli

Abstract: We address the problem of generating multiple hypotheses for structured prediction tasks that involve interaction with users or successive components in a cascaded architecture. Given a set of multiple hypotheses, such components/users typically have the ability to retrieve the best (or approximately the best) solution in this set. The standard approach for handling such a scenario is to first learn a single-output model and then produce M -Best Maximum a Posteriori (MAP) hypotheses from this model. In contrast, we learn to produce multiple outputs by formulating this task as a multiple-output structured-output prediction problem with a loss-function that effectively captures the setup of the problem. We present a max-margin formulation that minimizes an upper-bound on this lossfunction. Experimental results on image segmentation and protein side-chain prediction show that our method outperforms conventional approaches used for this type of scenario and leads to substantial improvements in prediction accuracy. 1

6 0.46147466 75 nips-2012-Collaborative Ranking With 17 Parameters

7 0.43925667 53 nips-2012-Bayesian Pedigree Analysis using Measure Factorization

8 0.43658239 128 nips-2012-Fast Resampling Weighted v-Statistics

9 0.4342621 197 nips-2012-Learning with Recursive Perceptual Representations

10 0.43018648 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes

11 0.4301694 321 nips-2012-Spectral learning of linear dynamics from generalised-linear observations with application to neural population data

12 0.43000531 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines

13 0.42995638 329 nips-2012-Super-Bit Locality-Sensitive Hashing

14 0.42893428 336 nips-2012-The Coloured Noise Expansion and Parameter Estimation of Diffusion Processes

15 0.4286145 49 nips-2012-Automatic Feature Induction for Stagewise Collaborative Filtering

16 0.42737219 41 nips-2012-Ancestor Sampling for Particle Gibbs

17 0.42678648 188 nips-2012-Learning from Distributions via Support Measure Machines

18 0.42637223 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

19 0.42634389 279 nips-2012-Projection Retrieval for Classification

20 0.42583349 356 nips-2012-Unsupervised Structure Discovery for Semantic Analysis of Audio