nips nips2004 nips2004-149 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ofer Shai, Brendan J. Frey, Quaid D. Morris, Qun Pan, Christine Misquitta, Benjamin J. Blencowe
Abstract: Alternative splicing (AS) is an important and frequent step in mammalian gene expression that allows a single gene to specify multiple products, and is crucial for the regulation of fundamental biological processes. The extent of AS regulation, and the mechanisms involved, are not well understood. We have developed a custom DNA microarray platform for surveying AS levels on a large scale. We present here a generative model for the AS Array Platform (GenASAP) and demonstrate its utility for quantifying AS levels in different mouse tissues. Learning is performed using a variational expectation maximization algorithm, and the parameters are shown to correctly capture expected AS trends. A comparison of the results obtained with a well-established but low through-put experimental method demonstrate that AS levels obtained from GenASAP are highly predictive of AS levels in mammalian tissues. 1 Biological diversity through alternative splicing Current estimates place the number of genes in the human genome at approximately 30,000, which is a surprisingly small number when one considers that the genome of yeast, a singlecelled organism, has 6,000 genes. The number of genes alone cannot account for the complexity and cell specialization exhibited by higher eukaryotes (i.e. mammals, plants, etc.). Some of that added complexity can be achieved through the use of alternative splicing, whereby a single gene can be used to code for a multitude of products. Genes are segments of the double stranded DNA that contain the information required by the cell for protein synthesis. That information is coded using an alphabet of 4 (A, C, G, and T), corresponding to the four nucleotides that make up the DNA. In what is known as the central dogma of molecular biology, DNA is transcribed to RNA, which in turn is translated into proteins. Messenger RNA (mRNA) is synthesized in the nucleus of the cell and carries the genomic information to the ribosome. In eukaryotes, genes are generally comprised of both exons, which contain the information needed by the cell to synthesize proteins, and introns, sometimes referred to as spacer DNA, which are spliced out of the pre-mRNA to create mature mRNA. An estimated 35%-75% of human genes [1] can be C1 (a) C1 A C1 C2 C1 A 3’ C2 (b) C2 C1 A 3’ C1 A 5’ C2 A 5’ C2 C1 C1 C2 C1 (c) A2 A1 C1 A1 C1 A2 C1 C2 C2 C1 (d) C2 C2 A C2 C2 C2 C1 C2 Figure 1: Four types of AS. Boxes represent exons and lines represent introns, with the possible splicing alternatives indicated by the connectors. (a) Single cassette exon inclusion/exclusion. C1 and C2 are constitutive exons (exons that are included in all isoforms) and flank a single alternative exon (A). The alternative exon is included in one isoform and excluded in the other. (b) Alternative 3’ (or donor) and alternative 5’ (acceptor) splicing sites. Both exons are constitutive, but may contain alternative donor and/or acceptor splicing sites. (c) Mutually exclusive exons. One of the two alternative exons (A1 and A2 ) may be included in the isoform, but not both. (d) Intron inclusion. An intron may be included in the mature mRNA strand. spliced to yield different combinations of exons (called isoforms), a phenomenon referred to as alternative splicing (AS). There are four major types of AS as shown in Figure 1. Many multi-exon genes may undergo more than one alternative splicing event, resulting in many possible isoforms from a single gene. [2] In addition to adding to the genetic repertoire of an organism by enabling a single gene to code for more than one protein, AS has been shown to be critical for gene regulation, contributing to tissue specificity, and facilitating evolutionary processes. Despite the evident importance of AS, its regulation and impact on specific genes remains poorly understood. The work presented here is concerned with the inference of single cassette exon AS levels (Figure 1a) based on data obtained from RNA expression arrays, also known as microarrays. 1.1 An exon microarray data set that probes alternative splicing events Although it is possible to directly analyze the proteins synthesized by a cell, it is easier, and often more informative, to instead measure the abundance of mRNA present. Traditionally, gene expression (abundance of mRNA) has been studied using low throughput techniques (such as RT-PCR or Northern blots), limited to studying a few sequences at a time and making large scale analysis nearly impossible. In the early 1990s, microarray technology emerged as a method capable of measuring the expression of thousands of DNA sequences simultaneously. Sequences of interest are deposited on a substrate the size of a small microscope slide, to form probes. The mRNA is extracted from the cell and reverse-transcribed back into DNA, which is labelled with red and green fluorescent dye molecules (cy3 and cy5 respectively). When the sample of tagged DNA is washed over the slide, complementary strands of DNA from the sample hybridize to the probes on the array forming A-T and C-G pairings. The slide is then scanned and the fluorescent intensity is measured at each probe. It is generally assumed that the intensity measure at the probe is linearly related to the abundance of mRNA in the cell over a wide dynamic range. Despite significant improvements in microarray technologies in recent years, microarray data still presents some difficulties in analysis. Low measurements tend to have extremely low signal to noise ratio (SNR) [7] and probes often bind to sequences that are very similar, but not identical, to the one for which they were designed (a process referred to as cross- C1 A C1 A C1:A C1 C2 C2 3 Body probes A:C2 A C2 2 Inclusion junction probes C1:C2 C1 C2 1 Exclusion junction probe Figure 2: Each alternative splicing event is studied using six probes. Probes were chosen to measure the expression levels of each of the three exons involved in the event. Additionally, 3 probes are used that target the junctions that are formed by each of the two isoforms. The inclusion isoform would express the junctions formed by C1 and A, and A and C2 , while the exclusion isoform would express the junction formed by C1 and C2 hybridization). Additionally, probes exhibit somewhat varying hybridization efficiency, and sequences exhibit varying labelling efficiency. To design our data sets, we mined public sequence databases and identified exons that were strong candidates for exhibiting AS (the details of that analysis are provided elsewhere [4, 3]). Of the candidates, 3,126 potential AS events in 2,647 unique mouse genes were selected for the design of Agilent Custom Oligonucleotide microarray. The arrays were hybridized with unamplified mRNA samples extracted from 10 wild-type mouse tissues (brain, heart, intestine, kidney, liver, lung, salivary gland, skeletal muscle, spleen, and testis). Each AS event has six target probes on the arrays, chosen from regions of the C1 exon, C2 exon, A exon, C1 :A splice junction, A:C2 splice junction, and C1 :C2 splice junction, as shown in Figure 2. 2 Unsupervised discovery of alternative splicing With the exception of the probe measuring the alternative exon, A (Figure 2), all probes measure sequences that occur in both isoforms. For example, while the sequence of the probe measuring the junction A:C1 is designed to measure the inclusion isoform, half of it corresponds to a sequence that is found in the exclusion isoform. We can therefore safely assume that the measured intensity at each probe is a result of a certain amount of both isoforms binding to the probe. Due to the generally assumed linear relationship between the abundance of mRNA hybridized at a probe and the fluorescent intensity measured, we model the measured intensity as a weighted sum of the overall abundance of the two isoforms. A stronger assumption is that of a single, consistent hybridization profile for both isoforms across all probes and all slides. Ideally, one would prefer to estimate an individual hybridization profile for each AS event studied across all slides. However, in our current setup, the number of tissues is small (10), resulting in two difficulties. First, the number of parameters is very large when compared to the number of data point using this model, and second, a portion of the events do not exhibit tissue specific alternative splicing within our small set of tissues. While the first hurdle could be accounted for using Baysian parameter estimation, the second cannot. 2.1 GenASAP - a generative model for alternative splicing array platform Using the setup described above, the expression vector x, containing the six microarray measurements as real numbers, can be decomposed as a linear combination of the abundance of the two splice isoforms, represented by the real vector s, with some added noise: x = Λs + noise, where Λ is a 6 × 2 weight matrix containing the hybridization profiles for s1 ^ xC ^ xC 1 2 s2 ^ xC :A ^ xA ^ xA:C 2 ^ xC1:C2 2 xC1:C2 1 r xC xC 2 xA xC :A xA:C oC1 oC2 oA oC :A oA:C 1 1 1 2 oC :C 1 2 Σn 2 Figure 3: Graphical model for alternative splicing. Each measurement in the observed expression profile, x, is generated by either using a scale factor, r, on a linear combination of the isoforms, s, or drawing randomly from an outlier model. For a detailed description of the model, see text. the two isoforms across the six probes. Note that we may not have a negative amount of a given isoform, nor can the presence of an isoform deduct from the measured expression, and so both s and Λ are constrained to be positive. Expression levels measured by microarrays have previously been modelled as having expression-dependent noise [7]. To address this, we rewrite the above formulation as x = r(Λs + ε), (1) where r is a scale factor and ε is a zero-mean normally distributed random variable with a diagonal covariance matrix, Ψ, denoted as p(ε) = N (ε; 0, Ψ). The prior distribution for the abundance of the splice isoforms is given by a truncated normal distribution, denoted as p(s) ∝ N (s, 0, I)[s ≥ 0], where [·] is an indicator function such that [s ≥ 0] = 1 if ∀i, si ≥ 0, and [s ≥ 0] = 0 otherwise. Lastly, there is a need to account for aberrant observations (e.g. due to faulty probes, flakes of dust, etc.) with an outlier model. The complete GenASAP model (shown in Figure 3) accounts for the observations as the outcome of either applying equation (1) or an outlier model. To avoid degenerate cases and ensure meaningful and interpretable results, the number of faulty probes considered for each AS event may not exceed two, as indicated by the filled-in square constraint node in Figure 3. The distribution of x conditional on the latent variables, s, r, and o, is: N (xi ; rΛi s, r2 Ψi )[oi =0] N (xi ; Ei , Vi )[oi =1] , p(x|s, r, o) = (2) i where oi ∈ {0, 1} is a bernoulli random variable indicating if the measurement at probe xi is the result of the AS model or the outlier model parameterized by p(oi = 1) = γi . The parameters of the outlier model, E and V, are not optimized and are set to the mean and variance of the data. 2.2 Variational learning in the GenASAP model To infer the posterior distribution over the splice isoform abundances while at the same time learning the model parameters we use a variational expectation-maximization algorithm (EM). EM maximizes the log likelihood of the data by iteratively estimating the posterior distribution of the model given the data in the expectation (E) step, and maximizing the log likelihood with respect to the parameters, while keeping the posterior fixed, in the maximization (M) step. Variational EM is used when, as in the case of GenASAP, the exact posterior is intractable. Variational EM minimizes the free energy of the model, defined as the KL-divergence between the joint distribution of the latent and observed variables and the approximation to the posterior under the model parameters [5, 6]. We approximate the true posterior using the Q distribution given by T Q({s(t) }, {o(t) }, {r(t) }) = (t) Q(r(t) )Q(o(t) |r(t) ) t=1 (t) Q(si |oi , r(t) ) i −1 T =Z (t) (3) d d ρ(t) ω (t) N (s(t) ; µ(t) , Σ(t) )[s(t) ≥ 0], ro ro t=1 where Z is a normalization constant, the superscript d indicates that Σ is constrained to be diagonal, and there are T iid AS events. For computational efficiency, r is selected from a finite set, r ∈ {r1 , r2 , . . . , rC } with uniform probability. The variational free energy is given by Q({s(t) }, {o(t) }, {r(t) }) . P ({s(t) }, {o(t) }, {r(t) }, {x(t) }) s r o (4) Variational EM minimizes the free energy by iteratively updating the Q distribution’s vari(t)d (t)d ational parameters (ρ(t) , ω (t) , µro , and Σro ) in the E-step, and the model parameters (Λ, Ψ, {r1 , r2 , . . . , rC }, and γ) in the M-step. The resulting updates are too long to be shown in the context of this paper and are discussed in detail elsewhere [3]. A few particular points regarding the E-step are worth covering in detail here. Q({s(t) }, {o(t) }, {r(t) }) log F(Q, P ) = If the prior on s was a full normal distribution, there would be no need for a variational approach, and exact EM is possible. For a truncated normal distribution, however, the mixing proportions, Q(r)Q(o|r) cannot be calculated analytically except for the case where s is scalar, necessitating the diagonality constraint. Note that if Σ was allowed to be a full covariance matrix, equation (3) would be the true posterior, and we could find the sufficient statistics of Q(s(t) |o(t) , r(t) ): −1 µ(t) = (I + ΛT (I − O(t) )T Ψ−1 (I − O(t) )Λ)−1 ΛT (I − O(t) )T Ψ−1 x(t) r(t) ro −1 Σ(t) ro = (I + ΛT (I − O(t) )T Ψ−1 (I − O(t) )Λ) (5) (6) where O is a diagonal matrix with elements Oi,i = oi . Furthermore, it can be easily shown that the optimal settings for µd and Σd approximating a normal distribution with full covariance Σ and mean µ is µd optimal = µ −1 Σd optimal = diag(Σ (7) −1 ) (8) In the truncated case, equation (8) is still true. Equation (7) does not hold, though, and µd optimal cannot be found analytically. In our experiments, we found that using equation (7) still decreases the free energy every E-step, and it is significantly more efficient than using, for example, a gradient decent method to compute the optimal µd . Intuitive Weigh Matrix Optimal Weight Matrix 50 50 40 40 30 30 20 20 10 0 10 Inclusion Isoform 0 Exclusion Isoform Inclusion Isoform (a) Exclusion Isoform (b) Figure 4: (a) An intuitive set of weights. Based on the biological background, one would expect to see the inclusion isoform hybridize to the probes measuring C1 , C2 , A, C1 :A, and A:C2 , while the exclusion isoform hybridizes to C1 , C2 , and C1 :C2 . (b) The learned set of weights closely agrees with the intuition, and captures cross hybridization between the probes Contribution of exclusion isoform Contribution of inclusion isoform AS model Original Data (a) RT−PCR AS model measurement prediction (% exclusion) (% exclusion) 14% 72% (b) 27% 70% 8% 22% outliers (c) Figure 5: Three examples of data cases and their predictions. (a) The data does not follow our notion of single cassette exon AS, but the AS level is predicted accurately by the model.(b) The probe C1 :A is marked as outlier, allowing the model to predict the other probes accurately. (c) Two probes are marked as outliers, and the model is still successful in predicting the AS levels. 3 Making biological predictions about alternative splicing The results presented in this paper were obtained using two stages of learning. In the first step, the weight matrix, Λ, is learned on a subset of the data that is selected for quality. Two selection criteria were used: (a) sequencing data was used to select those cases for which, with high confidence, no other AS event is present (Figure 1) and (b) probe sets were selected for high expression, as determined by a set of negative controls. The second selection criterion is motivated by the common assumption that low intensity measurements are of lesser quality (see Section 1.1). In the second step, Λ is kept fixed, and we introduce the additional constraint that the noise is isotropic (Ψ = ψI) and learn on the entire data set. The constraint on the noise is introduced to prevent the model from using only a subset of the six probes for making the final set of predictions. We show a typical learned set of weights in Figure 4. The weights fit well with our intuition of what they should be to capture the presence of the two isoforms. Moreover, the learned weights account for the specific trends in the data. Examples of model prediction based on the microarray data are shown in Figure 5. Due to the nature of the microarray data, we do not expect all the inferred abundances to be equally good, and we devised a scoring criterion that ranks each AS event based on its fit to the model. Intuitively, given two input vectors that are equivalent up to a scale factor, with inferred MAP estimations that are equal up to the same scale factor, we would like their scores to be identical. The scoring criterion used, therefore is k (xk − rΛk s)2 /(xk + Rank 500 1000 2000 5000 10000 15000 20000 30000 Pearson’s correlation coefficient 0.94 0.95 0.95 0.79 0.79 0.78 0.75 0.65 False positive rate 0.11 0.08 0.05 0.2 0.25 0.29 0.32 0.42 Table 1: Model performance evaluated at various ranks. Using 180 RT-PCR measurements, we are able to predict the model’s performance at various ranks. Two evaluation criteria are used: Pearson’s correlation coefficient between the model’s predictions and the RT-PCR measurements and false positive rate, where a prediction is considered to be false positive if it is more than 15% away from the RT-PCR measurement. rΛk s)2 , where the MAP estimations for r and s are used. This scoring criterion can be viewed as proportional to the sum of noise to signal ratios, as estimated using the two values given by the observation and the model’s best prediction of that observation. Since it is the relative amount of the isoforms that is of most interest, we need to use the inferred distribution of the isoform abundances to obtain an estimate for the relative levels of AS. It is not immediately clear how this should be done. We do, however, have RTPCR measurements for 180 AS events to guide us (see figure 6 for details). Using the top 50 ranked RT-PCR measurement, we fit three parameters, {a1 , a2 , a3 }, such that the s2 proportion of excluded isoform present, p, is given by p = a1 s1 +a2 s2 + a3 , where s1 is the MAP estimation of the abundance of the inclusion isoform, s2 is the MAP estimation of the abundance of the exclusion isoform, and the RT-PCR measurement are used for target p. The parameters are fitted using gradient descent on a least squared error (LSE) evaluation criterion. We used two criteria to evaluate the quality of the AS model predictions. Pearson’s correlation coefficient (PCC) is used to evaluate the overall ability of the model to correctly estimate trends in the data. PCC is invariant to affine transformation and so is independent of the transformation parameters a1 and a3 discussed above, while the parameter a2 was found to effect PCC very little. The PCC stays above 0.75 for the top two thirds ranked predictions. The second evaluation criterion used is the false positive rate, where a prediction is considered to be false positive if it is more than 15% away from the RT-PCR measurement. This allows us to say, for example, that if a prediction is within the top 10000, we are 75% confident that it is within 15% of the actual levels of AS. 4 Summary We designed a novel AS model for the inference of the relative abundance of two alternatively spliced isoforms from six measurements. Unsupervised learning in the model is performed using a structured variational EM algorithm, which correctly captures the underlying structure of the data, as suggested by its biological nature. The AS model, though presented here for a cassette exon AS events, can be used to learn any type of AS, and with a simple adjustment, multiple types. The predictions obtained from the AS model are currently being used to verify various claims about the role of AS in evolution and functional genomics, and to help identify sequences that affect the regulation of AS. % Exclusion isoform RT−PCR measurement Vs. AS model predictions RT−PCR measurements: 90 80 AS model prediction Int es Te tine sti Kid s n Sa ey liva Br ry ain Sp le Liv en er Mu sc Lu le ng 100 70 60 50 40 30 14 22 27 32 47 46 66 78 63 20 AS model prediction: 10 27 24 26 26 51 75 60 85 100 (a) 0 0 20 40 60 80 RT−PCR measurement 100 (b) Figure 6: (a) Sample RT-PCR. RNA extracted from the cell is reverse-transcribed to DNA, amplified and labelled with radioactive or fluorescent molecules. The sample is pulled through a viscous gel in an electric field (DNA, being an acid, is positively charged). Shorter strands travel further through the gel than longer ones, resulting in two distinct bands, corresponding to the two isoforms, when exposed to a photosensitive or x-ray film. (b) A scatter plot showing the RT-PCR measurements as compared to the AS model predictions. The plot shows all available RT-PCR measurements with a rank of 8000 or better. The AS model presented assumes a single weight matrix for all data cases. This is an oversimplified view of the data, and current work is being carried out in identifying probe specific expression profiles. However, due to the low dimensionality of the problem (10 tissues, six probes per event), care must be taken to avoid overfitting and to ensure meaningful interpretations. Acknowledgments We would like to thank Wen Zhang, Naveed Mohammad, and Timothy Hughes for their contributions in generating the data set. This work was funded in part by an operating and infrastructure grants from the CIHR and CFI, and a operating grants from NSERC and a Premier’s Research Excellence Award. References [1] J. M. Johnson et al. Genome-wide survey of human alternative pre-mrna splicing with exon junction microarrays. Science, 302:2141–44, 2003. [2] L. Cartegni et al. Listening to silence and understanding nonsense: exonic mutations that affect splicing. Nature Gen. Rev., 3:285–98, 2002. [3] Q. Pan et al. Revealing global regulatory features of mammalian alternative splicing using a quantitative microarray platform. Molecular Cell, 16(6):929–41, 2004. [4] Q. Pan et al. Alternative splicing of conserved exons is frequently species specific in human and mouse. Trends Gen., In Press, 2005. [5] M. I. Jordan, Z. Ghahramani, T. Jaakkola, and Lawrence K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183– 233, 1999. [6] R. M. Neal and G. E. Hinton. A view of the em algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models. Cambridge, MIT Press, 1998. [7] D. M. Rocke and B. Durbin. A model for measurement error for gene expression arrays. Journal of Computational Biology, 8(6):557–69, 2001.
Reference: text
sentIndex sentText sentNum sentScore
1 of Medical Research University of Toronto, Toronto, ON Abstract Alternative splicing (AS) is an important and frequent step in mammalian gene expression that allows a single gene to specify multiple products, and is crucial for the regulation of fundamental biological processes. [sent-6, score-0.673]
2 We have developed a custom DNA microarray platform for surveying AS levels on a large scale. [sent-8, score-0.272]
3 We present here a generative model for the AS Array Platform (GenASAP) and demonstrate its utility for quantifying AS levels in different mouse tissues. [sent-9, score-0.136]
4 Learning is performed using a variational expectation maximization algorithm, and the parameters are shown to correctly capture expected AS trends. [sent-10, score-0.085]
5 A comparison of the results obtained with a well-established but low through-put experimental method demonstrate that AS levels obtained from GenASAP are highly predictive of AS levels in mammalian tissues. [sent-11, score-0.182]
6 1 Biological diversity through alternative splicing Current estimates place the number of genes in the human genome at approximately 30,000, which is a surprisingly small number when one considers that the genome of yeast, a singlecelled organism, has 6,000 genes. [sent-12, score-0.507]
7 The number of genes alone cannot account for the complexity and cell specialization exhibited by higher eukaryotes (i. [sent-13, score-0.211]
8 Some of that added complexity can be achieved through the use of alternative splicing, whereby a single gene can be used to code for a multitude of products. [sent-17, score-0.165]
9 Genes are segments of the double stranded DNA that contain the information required by the cell for protein synthesis. [sent-18, score-0.07]
10 Messenger RNA (mRNA) is synthesized in the nucleus of the cell and carries the genomic information to the ribosome. [sent-21, score-0.099]
11 In eukaryotes, genes are generally comprised of both exons, which contain the information needed by the cell to synthesize proteins, and introns, sometimes referred to as spacer DNA, which are spliced out of the pre-mRNA to create mature mRNA. [sent-22, score-0.271]
12 An estimated 35%-75% of human genes [1] can be C1 (a) C1 A C1 C2 C1 A 3’ C2 (b) C2 C1 A 3’ C1 A 5’ C2 A 5’ C2 C1 C1 C2 C1 (c) A2 A1 C1 A1 C1 A2 C1 C2 C2 C1 (d) C2 C2 A C2 C2 C2 C1 C2 Figure 1: Four types of AS. [sent-23, score-0.101]
13 Boxes represent exons and lines represent introns, with the possible splicing alternatives indicated by the connectors. [sent-24, score-0.51]
14 C1 and C2 are constitutive exons (exons that are included in all isoforms) and flank a single alternative exon (A). [sent-26, score-0.602]
15 The alternative exon is included in one isoform and excluded in the other. [sent-27, score-0.809]
16 (b) Alternative 3’ (or donor) and alternative 5’ (acceptor) splicing sites. [sent-28, score-0.406]
17 Both exons are constitutive, but may contain alternative donor and/or acceptor splicing sites. [sent-29, score-0.685]
18 One of the two alternative exons (A1 and A2 ) may be included in the isoform, but not both. [sent-31, score-0.323]
19 An intron may be included in the mature mRNA strand. [sent-33, score-0.109]
20 spliced to yield different combinations of exons (called isoforms), a phenomenon referred to as alternative splicing (AS). [sent-34, score-0.665]
21 Many multi-exon genes may undergo more than one alternative splicing event, resulting in many possible isoforms from a single gene. [sent-36, score-0.746]
22 [2] In addition to adding to the genetic repertoire of an organism by enabling a single gene to code for more than one protein, AS has been shown to be critical for gene regulation, contributing to tissue specificity, and facilitating evolutionary processes. [sent-37, score-0.201]
23 Despite the evident importance of AS, its regulation and impact on specific genes remains poorly understood. [sent-38, score-0.167]
24 The work presented here is concerned with the inference of single cassette exon AS levels (Figure 1a) based on data obtained from RNA expression arrays, also known as microarrays. [sent-39, score-0.459]
25 1 An exon microarray data set that probes alternative splicing events Although it is possible to directly analyze the proteins synthesized by a cell, it is easier, and often more informative, to instead measure the abundance of mRNA present. [sent-41, score-1.344]
26 Traditionally, gene expression (abundance of mRNA) has been studied using low throughput techniques (such as RT-PCR or Northern blots), limited to studying a few sequences at a time and making large scale analysis nearly impossible. [sent-42, score-0.184]
27 In the early 1990s, microarray technology emerged as a method capable of measuring the expression of thousands of DNA sequences simultaneously. [sent-43, score-0.272]
28 The mRNA is extracted from the cell and reverse-transcribed back into DNA, which is labelled with red and green fluorescent dye molecules (cy3 and cy5 respectively). [sent-45, score-0.07]
29 When the sample of tagged DNA is washed over the slide, complementary strands of DNA from the sample hybridize to the probes on the array forming A-T and C-G pairings. [sent-46, score-0.406]
30 The slide is then scanned and the fluorescent intensity is measured at each probe. [sent-47, score-0.13]
31 It is generally assumed that the intensity measure at the probe is linearly related to the abundance of mRNA in the cell over a wide dynamic range. [sent-48, score-0.453]
32 Despite significant improvements in microarray technologies in recent years, microarray data still presents some difficulties in analysis. [sent-49, score-0.252]
33 Probes were chosen to measure the expression levels of each of the three exons involved in the event. [sent-51, score-0.339]
34 Additionally, 3 probes are used that target the junctions that are formed by each of the two isoforms. [sent-52, score-0.364]
35 The inclusion isoform would express the junctions formed by C1 and A, and A and C2 , while the exclusion isoform would express the junction formed by C1 and C2 hybridization). [sent-53, score-1.322]
36 Additionally, probes exhibit somewhat varying hybridization efficiency, and sequences exhibit varying labelling efficiency. [sent-54, score-0.491]
37 To design our data sets, we mined public sequence databases and identified exons that were strong candidates for exhibiting AS (the details of that analysis are provided elsewhere [4, 3]). [sent-55, score-0.199]
38 Of the candidates, 3,126 potential AS events in 2,647 unique mouse genes were selected for the design of Agilent Custom Oligonucleotide microarray. [sent-56, score-0.195]
39 The arrays were hybridized with unamplified mRNA samples extracted from 10 wild-type mouse tissues (brain, heart, intestine, kidney, liver, lung, salivary gland, skeletal muscle, spleen, and testis). [sent-57, score-0.172]
40 Each AS event has six target probes on the arrays, chosen from regions of the C1 exon, C2 exon, A exon, C1 :A splice junction, A:C2 splice junction, and C1 :C2 splice junction, as shown in Figure 2. [sent-58, score-0.681]
41 2 Unsupervised discovery of alternative splicing With the exception of the probe measuring the alternative exon, A (Figure 2), all probes measure sequences that occur in both isoforms. [sent-59, score-1.03]
42 For example, while the sequence of the probe measuring the junction A:C1 is designed to measure the inclusion isoform, half of it corresponds to a sequence that is found in the exclusion isoform. [sent-60, score-0.588]
43 We can therefore safely assume that the measured intensity at each probe is a result of a certain amount of both isoforms binding to the probe. [sent-61, score-0.475]
44 Due to the generally assumed linear relationship between the abundance of mRNA hybridized at a probe and the fluorescent intensity measured, we model the measured intensity as a weighted sum of the overall abundance of the two isoforms. [sent-62, score-0.698]
45 A stronger assumption is that of a single, consistent hybridization profile for both isoforms across all probes and all slides. [sent-63, score-0.643]
46 Ideally, one would prefer to estimate an individual hybridization profile for each AS event studied across all slides. [sent-64, score-0.168]
47 However, in our current setup, the number of tissues is small (10), resulting in two difficulties. [sent-65, score-0.047]
48 First, the number of parameters is very large when compared to the number of data point using this model, and second, a portion of the events do not exhibit tissue specific alternative splicing within our small set of tissues. [sent-66, score-0.509]
49 Each measurement in the observed expression profile, x, is generated by either using a scale factor, r, on a linear combination of the isoforms, s, or drawing randomly from an outlier model. [sent-70, score-0.21]
50 Note that we may not have a negative amount of a given isoform, nor can the presence of an isoform deduct from the measured expression, and so both s and Λ are constrained to be positive. [sent-73, score-0.444]
51 Expression levels measured by microarrays have previously been modelled as having expression-dependent noise [7]. [sent-74, score-0.115]
52 The prior distribution for the abundance of the splice isoforms is given by a truncated normal distribution, denoted as p(s) ∝ N (s, 0, I)[s ≥ 0], where [·] is an indicator function such that [s ≥ 0] = 1 if ∀i, si ≥ 0, and [s ≥ 0] = 0 otherwise. [sent-76, score-0.535]
53 The complete GenASAP model (shown in Figure 3) accounts for the observations as the outcome of either applying equation (1) or an outlier model. [sent-81, score-0.092]
54 To avoid degenerate cases and ensure meaningful and interpretable results, the number of faulty probes considered for each AS event may not exceed two, as indicated by the filled-in square constraint node in Figure 3. [sent-82, score-0.404]
55 The parameters of the outlier model, E and V, are not optimized and are set to the mean and variance of the data. [sent-84, score-0.068]
56 2 Variational learning in the GenASAP model To infer the posterior distribution over the splice isoform abundances while at the same time learning the model parameters we use a variational expectation-maximization algorithm (EM). [sent-86, score-0.734]
57 EM maximizes the log likelihood of the data by iteratively estimating the posterior distribution of the model given the data in the expectation (E) step, and maximizing the log likelihood with respect to the parameters, while keeping the posterior fixed, in the maximization (M) step. [sent-87, score-0.094]
58 Variational EM is used when, as in the case of GenASAP, the exact posterior is intractable. [sent-88, score-0.035]
59 Variational EM minimizes the free energy of the model, defined as the KL-divergence between the joint distribution of the latent and observed variables and the approximation to the posterior under the model parameters [5, 6]. [sent-89, score-0.113]
60 The variational free energy is given by Q({s(t) }, {o(t) }, {r(t) }) . [sent-95, score-0.139]
61 P ({s(t) }, {o(t) }, {r(t) }, {x(t) }) s r o (4) Variational EM minimizes the free energy by iteratively updating the Q distribution’s vari(t)d (t)d ational parameters (ρ(t) , ω (t) , µro , and Σro ) in the E-step, and the model parameters (Λ, Ψ, {r1 , r2 , . [sent-96, score-0.078]
62 Q({s(t) }, {o(t) }, {r(t) }) log F(Q, P ) = If the prior on s was a full normal distribution, there would be no need for a variational approach, and exact EM is possible. [sent-102, score-0.085]
63 In our experiments, we found that using equation (7) still decreases the free energy every E-step, and it is significantly more efficient than using, for example, a gradient decent method to compute the optimal µd . [sent-107, score-0.054]
64 Based on the biological background, one would expect to see the inclusion isoform hybridize to the probes measuring C1 , C2 , A, C1 :A, and A:C2 , while the exclusion isoform hybridizes to C1 , C2 , and C1 :C2 . [sent-109, score-1.517]
65 (a) The data does not follow our notion of single cassette exon AS, but the AS level is predicted accurately by the model. [sent-111, score-0.319]
66 (b) The probe C1 :A is marked as outlier, allowing the model to predict the other probes accurately. [sent-112, score-0.482]
67 (c) Two probes are marked as outliers, and the model is still successful in predicting the AS levels. [sent-113, score-0.324]
68 3 Making biological predictions about alternative splicing The results presented in this paper were obtained using two stages of learning. [sent-114, score-0.46]
69 Two selection criteria were used: (a) sequencing data was used to select those cases for which, with high confidence, no other AS event is present (Figure 1) and (b) probe sets were selected for high expression, as determined by a set of negative controls. [sent-116, score-0.246]
70 The second selection criterion is motivated by the common assumption that low intensity measurements are of lesser quality (see Section 1. [sent-117, score-0.153]
71 The constraint on the noise is introduced to prevent the model from using only a subset of the six probes for making the final set of predictions. [sent-120, score-0.401]
72 Moreover, the learned weights account for the specific trends in the data. [sent-123, score-0.04]
73 Examples of model prediction based on the microarray data are shown in Figure 5. [sent-124, score-0.184]
74 Due to the nature of the microarray data, we do not expect all the inferred abundances to be equally good, and we devised a scoring criterion that ranks each AS event based on its fit to the model. [sent-125, score-0.332]
75 Intuitively, given two input vectors that are equivalent up to a scale factor, with inferred MAP estimations that are equal up to the same scale factor, we would like their scores to be identical. [sent-126, score-0.057]
76 The scoring criterion used, therefore is k (xk − rΛk s)2 /(xk + Rank 500 1000 2000 5000 10000 15000 20000 30000 Pearson’s correlation coefficient 0. [sent-127, score-0.057]
77 Two evaluation criteria are used: Pearson’s correlation coefficient between the model’s predictions and the RT-PCR measurements and false positive rate, where a prediction is considered to be false positive if it is more than 15% away from the RT-PCR measurement. [sent-145, score-0.24]
78 This scoring criterion can be viewed as proportional to the sum of noise to signal ratios, as estimated using the two values given by the observation and the model’s best prediction of that observation. [sent-147, score-0.115]
79 Since it is the relative amount of the isoforms that is of most interest, we need to use the inferred distribution of the isoform abundances to obtain an estimate for the relative levels of AS. [sent-148, score-0.807]
80 We do, however, have RTPCR measurements for 180 AS events to guide us (see figure 6 for details). [sent-150, score-0.124]
81 We used two criteria to evaluate the quality of the AS model predictions. [sent-153, score-0.048]
82 Pearson’s correlation coefficient (PCC) is used to evaluate the overall ability of the model to correctly estimate trends in the data. [sent-154, score-0.064]
83 The second evaluation criterion used is the false positive rate, where a prediction is considered to be false positive if it is more than 15% away from the RT-PCR measurement. [sent-158, score-0.138]
84 This allows us to say, for example, that if a prediction is within the top 10000, we are 75% confident that it is within 15% of the actual levels of AS. [sent-159, score-0.099]
85 4 Summary We designed a novel AS model for the inference of the relative abundance of two alternatively spliced isoforms from six measurements. [sent-160, score-0.549]
86 Unsupervised learning in the model is performed using a structured variational EM algorithm, which correctly captures the underlying structure of the data, as suggested by its biological nature. [sent-161, score-0.138]
87 The AS model, though presented here for a cassette exon AS events, can be used to learn any type of AS, and with a simple adjustment, multiple types. [sent-162, score-0.319]
88 The predictions obtained from the AS model are currently being used to verify various claims about the role of AS in evolution and functional genomics, and to help identify sequences that affect the regulation of AS. [sent-163, score-0.154]
89 RNA extracted from the cell is reverse-transcribed to DNA, amplified and labelled with radioactive or fluorescent molecules. [sent-166, score-0.07]
90 The sample is pulled through a viscous gel in an electric field (DNA, being an acid, is positively charged). [sent-167, score-0.04]
91 Shorter strands travel further through the gel than longer ones, resulting in two distinct bands, corresponding to the two isoforms, when exposed to a photosensitive or x-ray film. [sent-168, score-0.075]
92 (b) A scatter plot showing the RT-PCR measurements as compared to the AS model predictions. [sent-169, score-0.101]
93 The plot shows all available RT-PCR measurements with a rank of 8000 or better. [sent-170, score-0.077]
94 This is an oversimplified view of the data, and current work is being carried out in identifying probe specific expression profiles. [sent-172, score-0.233]
95 However, due to the low dimensionality of the problem (10 tissues, six probes per event), care must be taken to avoid overfitting and to ensure meaningful interpretations. [sent-173, score-0.353]
96 Genome-wide survey of human alternative pre-mrna splicing with exon junction microarrays. [sent-179, score-0.763]
97 Revealing global regulatory features of mammalian alternative splicing using a quantitative microarray platform. [sent-189, score-0.584]
98 Alternative splicing of conserved exons is frequently species specific in human and mouse. [sent-193, score-0.51]
99 A view of the em algorithm that justifies incremental, sparse, and other variants. [sent-209, score-0.048]
100 A model for measurement error for gene expression arrays. [sent-216, score-0.236]
wordName wordTfidf (topN-words)
[('isoform', 0.418), ('splicing', 0.311), ('probes', 0.3), ('exon', 0.239), ('isoforms', 0.239), ('exons', 0.199), ('exclusion', 0.19), ('abundance', 0.173), ('mrna', 0.159), ('probe', 0.158), ('dna', 0.126), ('microarray', 0.126), ('genasap', 0.119), ('junction', 0.118), ('hybridization', 0.104), ('xc', 0.104), ('genes', 0.101), ('alternative', 0.095), ('inclusion', 0.09), ('oi', 0.088), ('splice', 0.088), ('variational', 0.085), ('cassette', 0.08), ('pcc', 0.08), ('pcr', 0.08), ('uorescent', 0.08), ('measurements', 0.077), ('expression', 0.075), ('ro', 0.073), ('gene', 0.07), ('cell', 0.07), ('outlier', 0.068), ('measurement', 0.067), ('regulation', 0.066), ('levels', 0.065), ('event', 0.064), ('rna', 0.063), ('abundances', 0.06), ('pan', 0.06), ('spliced', 0.06), ('six', 0.053), ('xa', 0.053), ('intensity', 0.052), ('platform', 0.052), ('slide', 0.052), ('mammalian', 0.052), ('em', 0.048), ('mouse', 0.047), ('tissues', 0.047), ('pearson', 0.047), ('events', 0.047), ('pro', 0.046), ('le', 0.044), ('false', 0.04), ('acceptor', 0.04), ('constitutive', 0.04), ('donor', 0.04), ('eukaryotes', 0.04), ('faulty', 0.04), ('gel', 0.04), ('hybridize', 0.04), ('hybridized', 0.04), ('intron', 0.04), ('introns', 0.04), ('junctions', 0.04), ('mature', 0.04), ('oa', 0.04), ('trends', 0.04), ('sequences', 0.039), ('arrays', 0.038), ('toronto', 0.036), ('truncated', 0.035), ('rt', 0.035), ('strands', 0.035), ('posterior', 0.035), ('prediction', 0.034), ('scoring', 0.033), ('measuring', 0.032), ('estimations', 0.032), ('tissue', 0.032), ('array', 0.031), ('energy', 0.03), ('synthesized', 0.029), ('oc', 0.029), ('organism', 0.029), ('custom', 0.029), ('biological', 0.029), ('included', 0.029), ('excluded', 0.028), ('measured', 0.026), ('predictions', 0.025), ('inferred', 0.025), ('criterion', 0.024), ('proteins', 0.024), ('noise', 0.024), ('free', 0.024), ('exhibit', 0.024), ('criteria', 0.024), ('formed', 0.024), ('model', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 149 nips-2004-Probabilistic Inference of Alternative Splicing Events in Microarray Data
Author: Ofer Shai, Brendan J. Frey, Quaid D. Morris, Qun Pan, Christine Misquitta, Benjamin J. Blencowe
Abstract: Alternative splicing (AS) is an important and frequent step in mammalian gene expression that allows a single gene to specify multiple products, and is crucial for the regulation of fundamental biological processes. The extent of AS regulation, and the mechanisms involved, are not well understood. We have developed a custom DNA microarray platform for surveying AS levels on a large scale. We present here a generative model for the AS Array Platform (GenASAP) and demonstrate its utility for quantifying AS levels in different mouse tissues. Learning is performed using a variational expectation maximization algorithm, and the parameters are shown to correctly capture expected AS trends. A comparison of the results obtained with a well-established but low through-put experimental method demonstrate that AS levels obtained from GenASAP are highly predictive of AS levels in mammalian tissues. 1 Biological diversity through alternative splicing Current estimates place the number of genes in the human genome at approximately 30,000, which is a surprisingly small number when one considers that the genome of yeast, a singlecelled organism, has 6,000 genes. The number of genes alone cannot account for the complexity and cell specialization exhibited by higher eukaryotes (i.e. mammals, plants, etc.). Some of that added complexity can be achieved through the use of alternative splicing, whereby a single gene can be used to code for a multitude of products. Genes are segments of the double stranded DNA that contain the information required by the cell for protein synthesis. That information is coded using an alphabet of 4 (A, C, G, and T), corresponding to the four nucleotides that make up the DNA. In what is known as the central dogma of molecular biology, DNA is transcribed to RNA, which in turn is translated into proteins. Messenger RNA (mRNA) is synthesized in the nucleus of the cell and carries the genomic information to the ribosome. In eukaryotes, genes are generally comprised of both exons, which contain the information needed by the cell to synthesize proteins, and introns, sometimes referred to as spacer DNA, which are spliced out of the pre-mRNA to create mature mRNA. An estimated 35%-75% of human genes [1] can be C1 (a) C1 A C1 C2 C1 A 3’ C2 (b) C2 C1 A 3’ C1 A 5’ C2 A 5’ C2 C1 C1 C2 C1 (c) A2 A1 C1 A1 C1 A2 C1 C2 C2 C1 (d) C2 C2 A C2 C2 C2 C1 C2 Figure 1: Four types of AS. Boxes represent exons and lines represent introns, with the possible splicing alternatives indicated by the connectors. (a) Single cassette exon inclusion/exclusion. C1 and C2 are constitutive exons (exons that are included in all isoforms) and flank a single alternative exon (A). The alternative exon is included in one isoform and excluded in the other. (b) Alternative 3’ (or donor) and alternative 5’ (acceptor) splicing sites. Both exons are constitutive, but may contain alternative donor and/or acceptor splicing sites. (c) Mutually exclusive exons. One of the two alternative exons (A1 and A2 ) may be included in the isoform, but not both. (d) Intron inclusion. An intron may be included in the mature mRNA strand. spliced to yield different combinations of exons (called isoforms), a phenomenon referred to as alternative splicing (AS). There are four major types of AS as shown in Figure 1. Many multi-exon genes may undergo more than one alternative splicing event, resulting in many possible isoforms from a single gene. [2] In addition to adding to the genetic repertoire of an organism by enabling a single gene to code for more than one protein, AS has been shown to be critical for gene regulation, contributing to tissue specificity, and facilitating evolutionary processes. Despite the evident importance of AS, its regulation and impact on specific genes remains poorly understood. The work presented here is concerned with the inference of single cassette exon AS levels (Figure 1a) based on data obtained from RNA expression arrays, also known as microarrays. 1.1 An exon microarray data set that probes alternative splicing events Although it is possible to directly analyze the proteins synthesized by a cell, it is easier, and often more informative, to instead measure the abundance of mRNA present. Traditionally, gene expression (abundance of mRNA) has been studied using low throughput techniques (such as RT-PCR or Northern blots), limited to studying a few sequences at a time and making large scale analysis nearly impossible. In the early 1990s, microarray technology emerged as a method capable of measuring the expression of thousands of DNA sequences simultaneously. Sequences of interest are deposited on a substrate the size of a small microscope slide, to form probes. The mRNA is extracted from the cell and reverse-transcribed back into DNA, which is labelled with red and green fluorescent dye molecules (cy3 and cy5 respectively). When the sample of tagged DNA is washed over the slide, complementary strands of DNA from the sample hybridize to the probes on the array forming A-T and C-G pairings. The slide is then scanned and the fluorescent intensity is measured at each probe. It is generally assumed that the intensity measure at the probe is linearly related to the abundance of mRNA in the cell over a wide dynamic range. Despite significant improvements in microarray technologies in recent years, microarray data still presents some difficulties in analysis. Low measurements tend to have extremely low signal to noise ratio (SNR) [7] and probes often bind to sequences that are very similar, but not identical, to the one for which they were designed (a process referred to as cross- C1 A C1 A C1:A C1 C2 C2 3 Body probes A:C2 A C2 2 Inclusion junction probes C1:C2 C1 C2 1 Exclusion junction probe Figure 2: Each alternative splicing event is studied using six probes. Probes were chosen to measure the expression levels of each of the three exons involved in the event. Additionally, 3 probes are used that target the junctions that are formed by each of the two isoforms. The inclusion isoform would express the junctions formed by C1 and A, and A and C2 , while the exclusion isoform would express the junction formed by C1 and C2 hybridization). Additionally, probes exhibit somewhat varying hybridization efficiency, and sequences exhibit varying labelling efficiency. To design our data sets, we mined public sequence databases and identified exons that were strong candidates for exhibiting AS (the details of that analysis are provided elsewhere [4, 3]). Of the candidates, 3,126 potential AS events in 2,647 unique mouse genes were selected for the design of Agilent Custom Oligonucleotide microarray. The arrays were hybridized with unamplified mRNA samples extracted from 10 wild-type mouse tissues (brain, heart, intestine, kidney, liver, lung, salivary gland, skeletal muscle, spleen, and testis). Each AS event has six target probes on the arrays, chosen from regions of the C1 exon, C2 exon, A exon, C1 :A splice junction, A:C2 splice junction, and C1 :C2 splice junction, as shown in Figure 2. 2 Unsupervised discovery of alternative splicing With the exception of the probe measuring the alternative exon, A (Figure 2), all probes measure sequences that occur in both isoforms. For example, while the sequence of the probe measuring the junction A:C1 is designed to measure the inclusion isoform, half of it corresponds to a sequence that is found in the exclusion isoform. We can therefore safely assume that the measured intensity at each probe is a result of a certain amount of both isoforms binding to the probe. Due to the generally assumed linear relationship between the abundance of mRNA hybridized at a probe and the fluorescent intensity measured, we model the measured intensity as a weighted sum of the overall abundance of the two isoforms. A stronger assumption is that of a single, consistent hybridization profile for both isoforms across all probes and all slides. Ideally, one would prefer to estimate an individual hybridization profile for each AS event studied across all slides. However, in our current setup, the number of tissues is small (10), resulting in two difficulties. First, the number of parameters is very large when compared to the number of data point using this model, and second, a portion of the events do not exhibit tissue specific alternative splicing within our small set of tissues. While the first hurdle could be accounted for using Baysian parameter estimation, the second cannot. 2.1 GenASAP - a generative model for alternative splicing array platform Using the setup described above, the expression vector x, containing the six microarray measurements as real numbers, can be decomposed as a linear combination of the abundance of the two splice isoforms, represented by the real vector s, with some added noise: x = Λs + noise, where Λ is a 6 × 2 weight matrix containing the hybridization profiles for s1 ^ xC ^ xC 1 2 s2 ^ xC :A ^ xA ^ xA:C 2 ^ xC1:C2 2 xC1:C2 1 r xC xC 2 xA xC :A xA:C oC1 oC2 oA oC :A oA:C 1 1 1 2 oC :C 1 2 Σn 2 Figure 3: Graphical model for alternative splicing. Each measurement in the observed expression profile, x, is generated by either using a scale factor, r, on a linear combination of the isoforms, s, or drawing randomly from an outlier model. For a detailed description of the model, see text. the two isoforms across the six probes. Note that we may not have a negative amount of a given isoform, nor can the presence of an isoform deduct from the measured expression, and so both s and Λ are constrained to be positive. Expression levels measured by microarrays have previously been modelled as having expression-dependent noise [7]. To address this, we rewrite the above formulation as x = r(Λs + ε), (1) where r is a scale factor and ε is a zero-mean normally distributed random variable with a diagonal covariance matrix, Ψ, denoted as p(ε) = N (ε; 0, Ψ). The prior distribution for the abundance of the splice isoforms is given by a truncated normal distribution, denoted as p(s) ∝ N (s, 0, I)[s ≥ 0], where [·] is an indicator function such that [s ≥ 0] = 1 if ∀i, si ≥ 0, and [s ≥ 0] = 0 otherwise. Lastly, there is a need to account for aberrant observations (e.g. due to faulty probes, flakes of dust, etc.) with an outlier model. The complete GenASAP model (shown in Figure 3) accounts for the observations as the outcome of either applying equation (1) or an outlier model. To avoid degenerate cases and ensure meaningful and interpretable results, the number of faulty probes considered for each AS event may not exceed two, as indicated by the filled-in square constraint node in Figure 3. The distribution of x conditional on the latent variables, s, r, and o, is: N (xi ; rΛi s, r2 Ψi )[oi =0] N (xi ; Ei , Vi )[oi =1] , p(x|s, r, o) = (2) i where oi ∈ {0, 1} is a bernoulli random variable indicating if the measurement at probe xi is the result of the AS model or the outlier model parameterized by p(oi = 1) = γi . The parameters of the outlier model, E and V, are not optimized and are set to the mean and variance of the data. 2.2 Variational learning in the GenASAP model To infer the posterior distribution over the splice isoform abundances while at the same time learning the model parameters we use a variational expectation-maximization algorithm (EM). EM maximizes the log likelihood of the data by iteratively estimating the posterior distribution of the model given the data in the expectation (E) step, and maximizing the log likelihood with respect to the parameters, while keeping the posterior fixed, in the maximization (M) step. Variational EM is used when, as in the case of GenASAP, the exact posterior is intractable. Variational EM minimizes the free energy of the model, defined as the KL-divergence between the joint distribution of the latent and observed variables and the approximation to the posterior under the model parameters [5, 6]. We approximate the true posterior using the Q distribution given by T Q({s(t) }, {o(t) }, {r(t) }) = (t) Q(r(t) )Q(o(t) |r(t) ) t=1 (t) Q(si |oi , r(t) ) i −1 T =Z (t) (3) d d ρ(t) ω (t) N (s(t) ; µ(t) , Σ(t) )[s(t) ≥ 0], ro ro t=1 where Z is a normalization constant, the superscript d indicates that Σ is constrained to be diagonal, and there are T iid AS events. For computational efficiency, r is selected from a finite set, r ∈ {r1 , r2 , . . . , rC } with uniform probability. The variational free energy is given by Q({s(t) }, {o(t) }, {r(t) }) . P ({s(t) }, {o(t) }, {r(t) }, {x(t) }) s r o (4) Variational EM minimizes the free energy by iteratively updating the Q distribution’s vari(t)d (t)d ational parameters (ρ(t) , ω (t) , µro , and Σro ) in the E-step, and the model parameters (Λ, Ψ, {r1 , r2 , . . . , rC }, and γ) in the M-step. The resulting updates are too long to be shown in the context of this paper and are discussed in detail elsewhere [3]. A few particular points regarding the E-step are worth covering in detail here. Q({s(t) }, {o(t) }, {r(t) }) log F(Q, P ) = If the prior on s was a full normal distribution, there would be no need for a variational approach, and exact EM is possible. For a truncated normal distribution, however, the mixing proportions, Q(r)Q(o|r) cannot be calculated analytically except for the case where s is scalar, necessitating the diagonality constraint. Note that if Σ was allowed to be a full covariance matrix, equation (3) would be the true posterior, and we could find the sufficient statistics of Q(s(t) |o(t) , r(t) ): −1 µ(t) = (I + ΛT (I − O(t) )T Ψ−1 (I − O(t) )Λ)−1 ΛT (I − O(t) )T Ψ−1 x(t) r(t) ro −1 Σ(t) ro = (I + ΛT (I − O(t) )T Ψ−1 (I − O(t) )Λ) (5) (6) where O is a diagonal matrix with elements Oi,i = oi . Furthermore, it can be easily shown that the optimal settings for µd and Σd approximating a normal distribution with full covariance Σ and mean µ is µd optimal = µ −1 Σd optimal = diag(Σ (7) −1 ) (8) In the truncated case, equation (8) is still true. Equation (7) does not hold, though, and µd optimal cannot be found analytically. In our experiments, we found that using equation (7) still decreases the free energy every E-step, and it is significantly more efficient than using, for example, a gradient decent method to compute the optimal µd . Intuitive Weigh Matrix Optimal Weight Matrix 50 50 40 40 30 30 20 20 10 0 10 Inclusion Isoform 0 Exclusion Isoform Inclusion Isoform (a) Exclusion Isoform (b) Figure 4: (a) An intuitive set of weights. Based on the biological background, one would expect to see the inclusion isoform hybridize to the probes measuring C1 , C2 , A, C1 :A, and A:C2 , while the exclusion isoform hybridizes to C1 , C2 , and C1 :C2 . (b) The learned set of weights closely agrees with the intuition, and captures cross hybridization between the probes Contribution of exclusion isoform Contribution of inclusion isoform AS model Original Data (a) RT−PCR AS model measurement prediction (% exclusion) (% exclusion) 14% 72% (b) 27% 70% 8% 22% outliers (c) Figure 5: Three examples of data cases and their predictions. (a) The data does not follow our notion of single cassette exon AS, but the AS level is predicted accurately by the model.(b) The probe C1 :A is marked as outlier, allowing the model to predict the other probes accurately. (c) Two probes are marked as outliers, and the model is still successful in predicting the AS levels. 3 Making biological predictions about alternative splicing The results presented in this paper were obtained using two stages of learning. In the first step, the weight matrix, Λ, is learned on a subset of the data that is selected for quality. Two selection criteria were used: (a) sequencing data was used to select those cases for which, with high confidence, no other AS event is present (Figure 1) and (b) probe sets were selected for high expression, as determined by a set of negative controls. The second selection criterion is motivated by the common assumption that low intensity measurements are of lesser quality (see Section 1.1). In the second step, Λ is kept fixed, and we introduce the additional constraint that the noise is isotropic (Ψ = ψI) and learn on the entire data set. The constraint on the noise is introduced to prevent the model from using only a subset of the six probes for making the final set of predictions. We show a typical learned set of weights in Figure 4. The weights fit well with our intuition of what they should be to capture the presence of the two isoforms. Moreover, the learned weights account for the specific trends in the data. Examples of model prediction based on the microarray data are shown in Figure 5. Due to the nature of the microarray data, we do not expect all the inferred abundances to be equally good, and we devised a scoring criterion that ranks each AS event based on its fit to the model. Intuitively, given two input vectors that are equivalent up to a scale factor, with inferred MAP estimations that are equal up to the same scale factor, we would like their scores to be identical. The scoring criterion used, therefore is k (xk − rΛk s)2 /(xk + Rank 500 1000 2000 5000 10000 15000 20000 30000 Pearson’s correlation coefficient 0.94 0.95 0.95 0.79 0.79 0.78 0.75 0.65 False positive rate 0.11 0.08 0.05 0.2 0.25 0.29 0.32 0.42 Table 1: Model performance evaluated at various ranks. Using 180 RT-PCR measurements, we are able to predict the model’s performance at various ranks. Two evaluation criteria are used: Pearson’s correlation coefficient between the model’s predictions and the RT-PCR measurements and false positive rate, where a prediction is considered to be false positive if it is more than 15% away from the RT-PCR measurement. rΛk s)2 , where the MAP estimations for r and s are used. This scoring criterion can be viewed as proportional to the sum of noise to signal ratios, as estimated using the two values given by the observation and the model’s best prediction of that observation. Since it is the relative amount of the isoforms that is of most interest, we need to use the inferred distribution of the isoform abundances to obtain an estimate for the relative levels of AS. It is not immediately clear how this should be done. We do, however, have RTPCR measurements for 180 AS events to guide us (see figure 6 for details). Using the top 50 ranked RT-PCR measurement, we fit three parameters, {a1 , a2 , a3 }, such that the s2 proportion of excluded isoform present, p, is given by p = a1 s1 +a2 s2 + a3 , where s1 is the MAP estimation of the abundance of the inclusion isoform, s2 is the MAP estimation of the abundance of the exclusion isoform, and the RT-PCR measurement are used for target p. The parameters are fitted using gradient descent on a least squared error (LSE) evaluation criterion. We used two criteria to evaluate the quality of the AS model predictions. Pearson’s correlation coefficient (PCC) is used to evaluate the overall ability of the model to correctly estimate trends in the data. PCC is invariant to affine transformation and so is independent of the transformation parameters a1 and a3 discussed above, while the parameter a2 was found to effect PCC very little. The PCC stays above 0.75 for the top two thirds ranked predictions. The second evaluation criterion used is the false positive rate, where a prediction is considered to be false positive if it is more than 15% away from the RT-PCR measurement. This allows us to say, for example, that if a prediction is within the top 10000, we are 75% confident that it is within 15% of the actual levels of AS. 4 Summary We designed a novel AS model for the inference of the relative abundance of two alternatively spliced isoforms from six measurements. Unsupervised learning in the model is performed using a structured variational EM algorithm, which correctly captures the underlying structure of the data, as suggested by its biological nature. The AS model, though presented here for a cassette exon AS events, can be used to learn any type of AS, and with a simple adjustment, multiple types. The predictions obtained from the AS model are currently being used to verify various claims about the role of AS in evolution and functional genomics, and to help identify sequences that affect the regulation of AS. % Exclusion isoform RT−PCR measurement Vs. AS model predictions RT−PCR measurements: 90 80 AS model prediction Int es Te tine sti Kid s n Sa ey liva Br ry ain Sp le Liv en er Mu sc Lu le ng 100 70 60 50 40 30 14 22 27 32 47 46 66 78 63 20 AS model prediction: 10 27 24 26 26 51 75 60 85 100 (a) 0 0 20 40 60 80 RT−PCR measurement 100 (b) Figure 6: (a) Sample RT-PCR. RNA extracted from the cell is reverse-transcribed to DNA, amplified and labelled with radioactive or fluorescent molecules. The sample is pulled through a viscous gel in an electric field (DNA, being an acid, is positively charged). Shorter strands travel further through the gel than longer ones, resulting in two distinct bands, corresponding to the two isoforms, when exposed to a photosensitive or x-ray film. (b) A scatter plot showing the RT-PCR measurements as compared to the AS model predictions. The plot shows all available RT-PCR measurements with a rank of 8000 or better. The AS model presented assumes a single weight matrix for all data cases. This is an oversimplified view of the data, and current work is being carried out in identifying probe specific expression profiles. However, due to the low dimensionality of the problem (10 tissues, six probes per event), care must be taken to avoid overfitting and to ensure meaningful interpretations. Acknowledgments We would like to thank Wen Zhang, Naveed Mohammad, and Timothy Hughes for their contributions in generating the data set. This work was funded in part by an operating and infrastructure grants from the CIHR and CFI, and a operating grants from NSERC and a Premier’s Research Excellence Award. References [1] J. M. Johnson et al. Genome-wide survey of human alternative pre-mrna splicing with exon junction microarrays. Science, 302:2141–44, 2003. [2] L. Cartegni et al. Listening to silence and understanding nonsense: exonic mutations that affect splicing. Nature Gen. Rev., 3:285–98, 2002. [3] Q. Pan et al. Revealing global regulatory features of mammalian alternative splicing using a quantitative microarray platform. Molecular Cell, 16(6):929–41, 2004. [4] Q. Pan et al. Alternative splicing of conserved exons is frequently species specific in human and mouse. Trends Gen., In Press, 2005. [5] M. I. Jordan, Z. Ghahramani, T. Jaakkola, and Lawrence K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183– 233, 1999. [6] R. M. Neal and G. E. Hinton. A view of the em algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models. Cambridge, MIT Press, 1998. [7] D. M. Rocke and B. Durbin. A model for measurement error for gene expression arrays. Journal of Computational Biology, 8(6):557–69, 2001.
2 0.070768818 177 nips-2004-Supervised Graph Inference
Author: Jean-philippe Vert, Yoshihiro Yamanishi
Abstract: We formulate the problem of graph inference where part of the graph is known as a supervised learning problem, and propose an algorithm to solve it. The method involves the learning of a mapping of the vertices to a Euclidean space where the graph is easy to infer, and can be formulated as an optimization problem in a reproducing kernel Hilbert space. We report encouraging results on the problem of metabolic network reconstruction from genomic data. 1
3 0.06904804 156 nips-2004-Result Analysis of the NIPS 2003 Feature Selection Challenge
Author: Isabelle Guyon, Steve Gunn, Asa Ben-Hur, Gideon Dror
Abstract: The NIPS 2003 workshops included a feature selection competition organized by the authors. We provided participants with five datasets from different application domains and called for classification results using a minimal number of features. The competition took place over a period of 13 weeks and attracted 78 research groups. Participants were asked to make on-line submissions on the validation and test sets, with performance on the validation set being presented immediately to the participant and performance on the test set presented to the participants at the workshop. In total 1863 entries were made on the validation sets during the development period and 135 entries on all test sets for the final competition. The winners used a combination of Bayesian neural networks with ARD priors and Dirichlet diffusion trees. Other top entries used a variety of methods for feature selection, which combined filters and/or wrapper or embedded methods using Random Forests, kernel methods, or neural networks as a classification engine. The results of the benchmark (including the predictions made by the participants and the features they selected) and the scoring software are publicly available. The benchmark is available at www.nipsfsc.ecs.soton.ac.uk for post-challenge submissions to stimulate further research. 1
4 0.062110297 128 nips-2004-Neural Network Computation by In Vitro Transcriptional Circuits
Author: Jongmin Kim, John Hopfield, Erik Winfree
Abstract: The structural similarity of neural networks and genetic regulatory networks to digital circuits, and hence to each other, was noted from the very beginning of their study [1, 2]. In this work, we propose a simple biochemical system whose architecture mimics that of genetic regulation and whose components allow for in vitro implementation of arbitrary circuits. We use only two enzymes in addition to DNA and RNA molecules: RNA polymerase (RNAP) and ribonuclease (RNase). We develop a rate equation for in vitro transcriptional networks, and derive a correspondence with general neural network rate equations [3]. As proof-of-principle demonstrations, an associative memory task and a feedforward network computation are shown by simulation. A difference between the neural network and biochemical models is also highlighted: global coupling of rate equations through enzyme saturation can lead to global feedback regulation, thus allowing a simple network without explicit mutual inhibition to perform the winner-take-all computation. Thus, the full complexity of the cell is not necessary for biochemical computation: a wide range of functional behaviors can be achieved with a small set of biochemical components. 1
5 0.060765669 90 nips-2004-Joint Probabilistic Curve Clustering and Alignment
Author: Scott J. Gaffney, Padhraic Smyth
Abstract: Clustering and prediction of sets of curves is an important problem in many areas of science and engineering. It is often the case that curves tend to be misaligned from each other in a continuous manner, either in space (across the measurements) or in time. We develop a probabilistic framework that allows for joint clustering and continuous alignment of sets of curves in curve space (as opposed to a fixed-dimensional featurevector space). The proposed methodology integrates new probabilistic alignment models with model-based curve clustering algorithms. The probabilistic approach allows for the derivation of consistent EM learning algorithms for the joint clustering-alignment problem. Experimental results are shown for alignment of human growth data, and joint clustering and alignment of gene expression time-course data.
6 0.058757532 52 nips-2004-Discrete profile alignment via constrained information bottleneck
7 0.050825112 143 nips-2004-PAC-Bayes Learning of Conjunctions and Classification of Gene-Expression Data
8 0.041751847 98 nips-2004-Learning Gaussian Process Kernels via Hierarchical Bayes
9 0.039282378 141 nips-2004-Optimal sub-graphical models
10 0.038683478 108 nips-2004-Markov Networks for Detecting Overalpping Elements in Sequence Data
11 0.037726499 124 nips-2004-Multiple Alignment of Continuous Time Series
12 0.037227202 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants
13 0.036716115 161 nips-2004-Self-Tuning Spectral Clustering
14 0.035632461 136 nips-2004-On Semi-Supervised Classification
15 0.035406478 198 nips-2004-Unsupervised Variational Bayesian Learning of Nonlinear Models
16 0.033894066 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
17 0.033841532 63 nips-2004-Expectation Consistent Free Energies for Approximate Inference
18 0.032634735 163 nips-2004-Semi-parametric Exponential Family PCA
19 0.032432456 11 nips-2004-A Second Order Cone programming Formulation for Classifying Missing Data
20 0.032292172 73 nips-2004-Generative Affine Localisation and Tracking
topicId topicWeight
[(0, -0.115), (1, -0.003), (2, -0.011), (3, -0.032), (4, -0.021), (5, 0.03), (6, -0.006), (7, 0.035), (8, 0.007), (9, -0.069), (10, 0.011), (11, 0.01), (12, 0.037), (13, 0.001), (14, 0.067), (15, 0.004), (16, -0.067), (17, -0.031), (18, 0.057), (19, -0.069), (20, -0.049), (21, 0.05), (22, -0.089), (23, -0.02), (24, 0.092), (25, 0.036), (26, -0.061), (27, -0.12), (28, -0.002), (29, -0.066), (30, -0.1), (31, 0.116), (32, 0.037), (33, 0.036), (34, -0.056), (35, 0.014), (36, -0.116), (37, 0.106), (38, 0.158), (39, 0.031), (40, 0.046), (41, -0.06), (42, 0.071), (43, 0.064), (44, 0.004), (45, 0.026), (46, 0.2), (47, -0.089), (48, -0.005), (49, 0.143)]
simIndex simValue paperId paperTitle
same-paper 1 0.92273092 149 nips-2004-Probabilistic Inference of Alternative Splicing Events in Microarray Data
Author: Ofer Shai, Brendan J. Frey, Quaid D. Morris, Qun Pan, Christine Misquitta, Benjamin J. Blencowe
Abstract: Alternative splicing (AS) is an important and frequent step in mammalian gene expression that allows a single gene to specify multiple products, and is crucial for the regulation of fundamental biological processes. The extent of AS regulation, and the mechanisms involved, are not well understood. We have developed a custom DNA microarray platform for surveying AS levels on a large scale. We present here a generative model for the AS Array Platform (GenASAP) and demonstrate its utility for quantifying AS levels in different mouse tissues. Learning is performed using a variational expectation maximization algorithm, and the parameters are shown to correctly capture expected AS trends. A comparison of the results obtained with a well-established but low through-put experimental method demonstrate that AS levels obtained from GenASAP are highly predictive of AS levels in mammalian tissues. 1 Biological diversity through alternative splicing Current estimates place the number of genes in the human genome at approximately 30,000, which is a surprisingly small number when one considers that the genome of yeast, a singlecelled organism, has 6,000 genes. The number of genes alone cannot account for the complexity and cell specialization exhibited by higher eukaryotes (i.e. mammals, plants, etc.). Some of that added complexity can be achieved through the use of alternative splicing, whereby a single gene can be used to code for a multitude of products. Genes are segments of the double stranded DNA that contain the information required by the cell for protein synthesis. That information is coded using an alphabet of 4 (A, C, G, and T), corresponding to the four nucleotides that make up the DNA. In what is known as the central dogma of molecular biology, DNA is transcribed to RNA, which in turn is translated into proteins. Messenger RNA (mRNA) is synthesized in the nucleus of the cell and carries the genomic information to the ribosome. In eukaryotes, genes are generally comprised of both exons, which contain the information needed by the cell to synthesize proteins, and introns, sometimes referred to as spacer DNA, which are spliced out of the pre-mRNA to create mature mRNA. An estimated 35%-75% of human genes [1] can be C1 (a) C1 A C1 C2 C1 A 3’ C2 (b) C2 C1 A 3’ C1 A 5’ C2 A 5’ C2 C1 C1 C2 C1 (c) A2 A1 C1 A1 C1 A2 C1 C2 C2 C1 (d) C2 C2 A C2 C2 C2 C1 C2 Figure 1: Four types of AS. Boxes represent exons and lines represent introns, with the possible splicing alternatives indicated by the connectors. (a) Single cassette exon inclusion/exclusion. C1 and C2 are constitutive exons (exons that are included in all isoforms) and flank a single alternative exon (A). The alternative exon is included in one isoform and excluded in the other. (b) Alternative 3’ (or donor) and alternative 5’ (acceptor) splicing sites. Both exons are constitutive, but may contain alternative donor and/or acceptor splicing sites. (c) Mutually exclusive exons. One of the two alternative exons (A1 and A2 ) may be included in the isoform, but not both. (d) Intron inclusion. An intron may be included in the mature mRNA strand. spliced to yield different combinations of exons (called isoforms), a phenomenon referred to as alternative splicing (AS). There are four major types of AS as shown in Figure 1. Many multi-exon genes may undergo more than one alternative splicing event, resulting in many possible isoforms from a single gene. [2] In addition to adding to the genetic repertoire of an organism by enabling a single gene to code for more than one protein, AS has been shown to be critical for gene regulation, contributing to tissue specificity, and facilitating evolutionary processes. Despite the evident importance of AS, its regulation and impact on specific genes remains poorly understood. The work presented here is concerned with the inference of single cassette exon AS levels (Figure 1a) based on data obtained from RNA expression arrays, also known as microarrays. 1.1 An exon microarray data set that probes alternative splicing events Although it is possible to directly analyze the proteins synthesized by a cell, it is easier, and often more informative, to instead measure the abundance of mRNA present. Traditionally, gene expression (abundance of mRNA) has been studied using low throughput techniques (such as RT-PCR or Northern blots), limited to studying a few sequences at a time and making large scale analysis nearly impossible. In the early 1990s, microarray technology emerged as a method capable of measuring the expression of thousands of DNA sequences simultaneously. Sequences of interest are deposited on a substrate the size of a small microscope slide, to form probes. The mRNA is extracted from the cell and reverse-transcribed back into DNA, which is labelled with red and green fluorescent dye molecules (cy3 and cy5 respectively). When the sample of tagged DNA is washed over the slide, complementary strands of DNA from the sample hybridize to the probes on the array forming A-T and C-G pairings. The slide is then scanned and the fluorescent intensity is measured at each probe. It is generally assumed that the intensity measure at the probe is linearly related to the abundance of mRNA in the cell over a wide dynamic range. Despite significant improvements in microarray technologies in recent years, microarray data still presents some difficulties in analysis. Low measurements tend to have extremely low signal to noise ratio (SNR) [7] and probes often bind to sequences that are very similar, but not identical, to the one for which they were designed (a process referred to as cross- C1 A C1 A C1:A C1 C2 C2 3 Body probes A:C2 A C2 2 Inclusion junction probes C1:C2 C1 C2 1 Exclusion junction probe Figure 2: Each alternative splicing event is studied using six probes. Probes were chosen to measure the expression levels of each of the three exons involved in the event. Additionally, 3 probes are used that target the junctions that are formed by each of the two isoforms. The inclusion isoform would express the junctions formed by C1 and A, and A and C2 , while the exclusion isoform would express the junction formed by C1 and C2 hybridization). Additionally, probes exhibit somewhat varying hybridization efficiency, and sequences exhibit varying labelling efficiency. To design our data sets, we mined public sequence databases and identified exons that were strong candidates for exhibiting AS (the details of that analysis are provided elsewhere [4, 3]). Of the candidates, 3,126 potential AS events in 2,647 unique mouse genes were selected for the design of Agilent Custom Oligonucleotide microarray. The arrays were hybridized with unamplified mRNA samples extracted from 10 wild-type mouse tissues (brain, heart, intestine, kidney, liver, lung, salivary gland, skeletal muscle, spleen, and testis). Each AS event has six target probes on the arrays, chosen from regions of the C1 exon, C2 exon, A exon, C1 :A splice junction, A:C2 splice junction, and C1 :C2 splice junction, as shown in Figure 2. 2 Unsupervised discovery of alternative splicing With the exception of the probe measuring the alternative exon, A (Figure 2), all probes measure sequences that occur in both isoforms. For example, while the sequence of the probe measuring the junction A:C1 is designed to measure the inclusion isoform, half of it corresponds to a sequence that is found in the exclusion isoform. We can therefore safely assume that the measured intensity at each probe is a result of a certain amount of both isoforms binding to the probe. Due to the generally assumed linear relationship between the abundance of mRNA hybridized at a probe and the fluorescent intensity measured, we model the measured intensity as a weighted sum of the overall abundance of the two isoforms. A stronger assumption is that of a single, consistent hybridization profile for both isoforms across all probes and all slides. Ideally, one would prefer to estimate an individual hybridization profile for each AS event studied across all slides. However, in our current setup, the number of tissues is small (10), resulting in two difficulties. First, the number of parameters is very large when compared to the number of data point using this model, and second, a portion of the events do not exhibit tissue specific alternative splicing within our small set of tissues. While the first hurdle could be accounted for using Baysian parameter estimation, the second cannot. 2.1 GenASAP - a generative model for alternative splicing array platform Using the setup described above, the expression vector x, containing the six microarray measurements as real numbers, can be decomposed as a linear combination of the abundance of the two splice isoforms, represented by the real vector s, with some added noise: x = Λs + noise, where Λ is a 6 × 2 weight matrix containing the hybridization profiles for s1 ^ xC ^ xC 1 2 s2 ^ xC :A ^ xA ^ xA:C 2 ^ xC1:C2 2 xC1:C2 1 r xC xC 2 xA xC :A xA:C oC1 oC2 oA oC :A oA:C 1 1 1 2 oC :C 1 2 Σn 2 Figure 3: Graphical model for alternative splicing. Each measurement in the observed expression profile, x, is generated by either using a scale factor, r, on a linear combination of the isoforms, s, or drawing randomly from an outlier model. For a detailed description of the model, see text. the two isoforms across the six probes. Note that we may not have a negative amount of a given isoform, nor can the presence of an isoform deduct from the measured expression, and so both s and Λ are constrained to be positive. Expression levels measured by microarrays have previously been modelled as having expression-dependent noise [7]. To address this, we rewrite the above formulation as x = r(Λs + ε), (1) where r is a scale factor and ε is a zero-mean normally distributed random variable with a diagonal covariance matrix, Ψ, denoted as p(ε) = N (ε; 0, Ψ). The prior distribution for the abundance of the splice isoforms is given by a truncated normal distribution, denoted as p(s) ∝ N (s, 0, I)[s ≥ 0], where [·] is an indicator function such that [s ≥ 0] = 1 if ∀i, si ≥ 0, and [s ≥ 0] = 0 otherwise. Lastly, there is a need to account for aberrant observations (e.g. due to faulty probes, flakes of dust, etc.) with an outlier model. The complete GenASAP model (shown in Figure 3) accounts for the observations as the outcome of either applying equation (1) or an outlier model. To avoid degenerate cases and ensure meaningful and interpretable results, the number of faulty probes considered for each AS event may not exceed two, as indicated by the filled-in square constraint node in Figure 3. The distribution of x conditional on the latent variables, s, r, and o, is: N (xi ; rΛi s, r2 Ψi )[oi =0] N (xi ; Ei , Vi )[oi =1] , p(x|s, r, o) = (2) i where oi ∈ {0, 1} is a bernoulli random variable indicating if the measurement at probe xi is the result of the AS model or the outlier model parameterized by p(oi = 1) = γi . The parameters of the outlier model, E and V, are not optimized and are set to the mean and variance of the data. 2.2 Variational learning in the GenASAP model To infer the posterior distribution over the splice isoform abundances while at the same time learning the model parameters we use a variational expectation-maximization algorithm (EM). EM maximizes the log likelihood of the data by iteratively estimating the posterior distribution of the model given the data in the expectation (E) step, and maximizing the log likelihood with respect to the parameters, while keeping the posterior fixed, in the maximization (M) step. Variational EM is used when, as in the case of GenASAP, the exact posterior is intractable. Variational EM minimizes the free energy of the model, defined as the KL-divergence between the joint distribution of the latent and observed variables and the approximation to the posterior under the model parameters [5, 6]. We approximate the true posterior using the Q distribution given by T Q({s(t) }, {o(t) }, {r(t) }) = (t) Q(r(t) )Q(o(t) |r(t) ) t=1 (t) Q(si |oi , r(t) ) i −1 T =Z (t) (3) d d ρ(t) ω (t) N (s(t) ; µ(t) , Σ(t) )[s(t) ≥ 0], ro ro t=1 where Z is a normalization constant, the superscript d indicates that Σ is constrained to be diagonal, and there are T iid AS events. For computational efficiency, r is selected from a finite set, r ∈ {r1 , r2 , . . . , rC } with uniform probability. The variational free energy is given by Q({s(t) }, {o(t) }, {r(t) }) . P ({s(t) }, {o(t) }, {r(t) }, {x(t) }) s r o (4) Variational EM minimizes the free energy by iteratively updating the Q distribution’s vari(t)d (t)d ational parameters (ρ(t) , ω (t) , µro , and Σro ) in the E-step, and the model parameters (Λ, Ψ, {r1 , r2 , . . . , rC }, and γ) in the M-step. The resulting updates are too long to be shown in the context of this paper and are discussed in detail elsewhere [3]. A few particular points regarding the E-step are worth covering in detail here. Q({s(t) }, {o(t) }, {r(t) }) log F(Q, P ) = If the prior on s was a full normal distribution, there would be no need for a variational approach, and exact EM is possible. For a truncated normal distribution, however, the mixing proportions, Q(r)Q(o|r) cannot be calculated analytically except for the case where s is scalar, necessitating the diagonality constraint. Note that if Σ was allowed to be a full covariance matrix, equation (3) would be the true posterior, and we could find the sufficient statistics of Q(s(t) |o(t) , r(t) ): −1 µ(t) = (I + ΛT (I − O(t) )T Ψ−1 (I − O(t) )Λ)−1 ΛT (I − O(t) )T Ψ−1 x(t) r(t) ro −1 Σ(t) ro = (I + ΛT (I − O(t) )T Ψ−1 (I − O(t) )Λ) (5) (6) where O is a diagonal matrix with elements Oi,i = oi . Furthermore, it can be easily shown that the optimal settings for µd and Σd approximating a normal distribution with full covariance Σ and mean µ is µd optimal = µ −1 Σd optimal = diag(Σ (7) −1 ) (8) In the truncated case, equation (8) is still true. Equation (7) does not hold, though, and µd optimal cannot be found analytically. In our experiments, we found that using equation (7) still decreases the free energy every E-step, and it is significantly more efficient than using, for example, a gradient decent method to compute the optimal µd . Intuitive Weigh Matrix Optimal Weight Matrix 50 50 40 40 30 30 20 20 10 0 10 Inclusion Isoform 0 Exclusion Isoform Inclusion Isoform (a) Exclusion Isoform (b) Figure 4: (a) An intuitive set of weights. Based on the biological background, one would expect to see the inclusion isoform hybridize to the probes measuring C1 , C2 , A, C1 :A, and A:C2 , while the exclusion isoform hybridizes to C1 , C2 , and C1 :C2 . (b) The learned set of weights closely agrees with the intuition, and captures cross hybridization between the probes Contribution of exclusion isoform Contribution of inclusion isoform AS model Original Data (a) RT−PCR AS model measurement prediction (% exclusion) (% exclusion) 14% 72% (b) 27% 70% 8% 22% outliers (c) Figure 5: Three examples of data cases and their predictions. (a) The data does not follow our notion of single cassette exon AS, but the AS level is predicted accurately by the model.(b) The probe C1 :A is marked as outlier, allowing the model to predict the other probes accurately. (c) Two probes are marked as outliers, and the model is still successful in predicting the AS levels. 3 Making biological predictions about alternative splicing The results presented in this paper were obtained using two stages of learning. In the first step, the weight matrix, Λ, is learned on a subset of the data that is selected for quality. Two selection criteria were used: (a) sequencing data was used to select those cases for which, with high confidence, no other AS event is present (Figure 1) and (b) probe sets were selected for high expression, as determined by a set of negative controls. The second selection criterion is motivated by the common assumption that low intensity measurements are of lesser quality (see Section 1.1). In the second step, Λ is kept fixed, and we introduce the additional constraint that the noise is isotropic (Ψ = ψI) and learn on the entire data set. The constraint on the noise is introduced to prevent the model from using only a subset of the six probes for making the final set of predictions. We show a typical learned set of weights in Figure 4. The weights fit well with our intuition of what they should be to capture the presence of the two isoforms. Moreover, the learned weights account for the specific trends in the data. Examples of model prediction based on the microarray data are shown in Figure 5. Due to the nature of the microarray data, we do not expect all the inferred abundances to be equally good, and we devised a scoring criterion that ranks each AS event based on its fit to the model. Intuitively, given two input vectors that are equivalent up to a scale factor, with inferred MAP estimations that are equal up to the same scale factor, we would like their scores to be identical. The scoring criterion used, therefore is k (xk − rΛk s)2 /(xk + Rank 500 1000 2000 5000 10000 15000 20000 30000 Pearson’s correlation coefficient 0.94 0.95 0.95 0.79 0.79 0.78 0.75 0.65 False positive rate 0.11 0.08 0.05 0.2 0.25 0.29 0.32 0.42 Table 1: Model performance evaluated at various ranks. Using 180 RT-PCR measurements, we are able to predict the model’s performance at various ranks. Two evaluation criteria are used: Pearson’s correlation coefficient between the model’s predictions and the RT-PCR measurements and false positive rate, where a prediction is considered to be false positive if it is more than 15% away from the RT-PCR measurement. rΛk s)2 , where the MAP estimations for r and s are used. This scoring criterion can be viewed as proportional to the sum of noise to signal ratios, as estimated using the two values given by the observation and the model’s best prediction of that observation. Since it is the relative amount of the isoforms that is of most interest, we need to use the inferred distribution of the isoform abundances to obtain an estimate for the relative levels of AS. It is not immediately clear how this should be done. We do, however, have RTPCR measurements for 180 AS events to guide us (see figure 6 for details). Using the top 50 ranked RT-PCR measurement, we fit three parameters, {a1 , a2 , a3 }, such that the s2 proportion of excluded isoform present, p, is given by p = a1 s1 +a2 s2 + a3 , where s1 is the MAP estimation of the abundance of the inclusion isoform, s2 is the MAP estimation of the abundance of the exclusion isoform, and the RT-PCR measurement are used for target p. The parameters are fitted using gradient descent on a least squared error (LSE) evaluation criterion. We used two criteria to evaluate the quality of the AS model predictions. Pearson’s correlation coefficient (PCC) is used to evaluate the overall ability of the model to correctly estimate trends in the data. PCC is invariant to affine transformation and so is independent of the transformation parameters a1 and a3 discussed above, while the parameter a2 was found to effect PCC very little. The PCC stays above 0.75 for the top two thirds ranked predictions. The second evaluation criterion used is the false positive rate, where a prediction is considered to be false positive if it is more than 15% away from the RT-PCR measurement. This allows us to say, for example, that if a prediction is within the top 10000, we are 75% confident that it is within 15% of the actual levels of AS. 4 Summary We designed a novel AS model for the inference of the relative abundance of two alternatively spliced isoforms from six measurements. Unsupervised learning in the model is performed using a structured variational EM algorithm, which correctly captures the underlying structure of the data, as suggested by its biological nature. The AS model, though presented here for a cassette exon AS events, can be used to learn any type of AS, and with a simple adjustment, multiple types. The predictions obtained from the AS model are currently being used to verify various claims about the role of AS in evolution and functional genomics, and to help identify sequences that affect the regulation of AS. % Exclusion isoform RT−PCR measurement Vs. AS model predictions RT−PCR measurements: 90 80 AS model prediction Int es Te tine sti Kid s n Sa ey liva Br ry ain Sp le Liv en er Mu sc Lu le ng 100 70 60 50 40 30 14 22 27 32 47 46 66 78 63 20 AS model prediction: 10 27 24 26 26 51 75 60 85 100 (a) 0 0 20 40 60 80 RT−PCR measurement 100 (b) Figure 6: (a) Sample RT-PCR. RNA extracted from the cell is reverse-transcribed to DNA, amplified and labelled with radioactive or fluorescent molecules. The sample is pulled through a viscous gel in an electric field (DNA, being an acid, is positively charged). Shorter strands travel further through the gel than longer ones, resulting in two distinct bands, corresponding to the two isoforms, when exposed to a photosensitive or x-ray film. (b) A scatter plot showing the RT-PCR measurements as compared to the AS model predictions. The plot shows all available RT-PCR measurements with a rank of 8000 or better. The AS model presented assumes a single weight matrix for all data cases. This is an oversimplified view of the data, and current work is being carried out in identifying probe specific expression profiles. However, due to the low dimensionality of the problem (10 tissues, six probes per event), care must be taken to avoid overfitting and to ensure meaningful interpretations. Acknowledgments We would like to thank Wen Zhang, Naveed Mohammad, and Timothy Hughes for their contributions in generating the data set. This work was funded in part by an operating and infrastructure grants from the CIHR and CFI, and a operating grants from NSERC and a Premier’s Research Excellence Award. References [1] J. M. Johnson et al. Genome-wide survey of human alternative pre-mrna splicing with exon junction microarrays. Science, 302:2141–44, 2003. [2] L. Cartegni et al. Listening to silence and understanding nonsense: exonic mutations that affect splicing. Nature Gen. Rev., 3:285–98, 2002. [3] Q. Pan et al. Revealing global regulatory features of mammalian alternative splicing using a quantitative microarray platform. Molecular Cell, 16(6):929–41, 2004. [4] Q. Pan et al. Alternative splicing of conserved exons is frequently species specific in human and mouse. Trends Gen., In Press, 2005. [5] M. I. Jordan, Z. Ghahramani, T. Jaakkola, and Lawrence K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183– 233, 1999. [6] R. M. Neal and G. E. Hinton. A view of the em algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models. Cambridge, MIT Press, 1998. [7] D. M. Rocke and B. Durbin. A model for measurement error for gene expression arrays. Journal of Computational Biology, 8(6):557–69, 2001.
2 0.54555011 128 nips-2004-Neural Network Computation by In Vitro Transcriptional Circuits
Author: Jongmin Kim, John Hopfield, Erik Winfree
Abstract: The structural similarity of neural networks and genetic regulatory networks to digital circuits, and hence to each other, was noted from the very beginning of their study [1, 2]. In this work, we propose a simple biochemical system whose architecture mimics that of genetic regulation and whose components allow for in vitro implementation of arbitrary circuits. We use only two enzymes in addition to DNA and RNA molecules: RNA polymerase (RNAP) and ribonuclease (RNase). We develop a rate equation for in vitro transcriptional networks, and derive a correspondence with general neural network rate equations [3]. As proof-of-principle demonstrations, an associative memory task and a feedforward network computation are shown by simulation. A difference between the neural network and biochemical models is also highlighted: global coupling of rate equations through enzyme saturation can lead to global feedback regulation, thus allowing a simple network without explicit mutual inhibition to perform the winner-take-all computation. Thus, the full complexity of the cell is not necessary for biochemical computation: a wide range of functional behaviors can be achieved with a small set of biochemical components. 1
3 0.51444936 177 nips-2004-Supervised Graph Inference
Author: Jean-philippe Vert, Yoshihiro Yamanishi
Abstract: We formulate the problem of graph inference where part of the graph is known as a supervised learning problem, and propose an algorithm to solve it. The method involves the learning of a mapping of the vertices to a Euclidean space where the graph is easy to infer, and can be formulated as an optimization problem in a reproducing kernel Hilbert space. We report encouraging results on the problem of metabolic network reconstruction from genomic data. 1
4 0.48742026 90 nips-2004-Joint Probabilistic Curve Clustering and Alignment
Author: Scott J. Gaffney, Padhraic Smyth
Abstract: Clustering and prediction of sets of curves is an important problem in many areas of science and engineering. It is often the case that curves tend to be misaligned from each other in a continuous manner, either in space (across the measurements) or in time. We develop a probabilistic framework that allows for joint clustering and continuous alignment of sets of curves in curve space (as opposed to a fixed-dimensional featurevector space). The proposed methodology integrates new probabilistic alignment models with model-based curve clustering algorithms. The probabilistic approach allows for the derivation of consistent EM learning algorithms for the joint clustering-alignment problem. Experimental results are shown for alignment of human growth data, and joint clustering and alignment of gene expression time-course data.
5 0.45235971 80 nips-2004-Identifying Protein-Protein Interaction Sites on a Genome-Wide Scale
Author: Haidong Wang, Eran Segal, Asa Ben-Hur, Daphne Koller, Douglas L. Brutlag
Abstract: Protein interactions typically arise from a physical interaction of one or more small sites on the surface of the two proteins. Identifying these sites is very important for drug and protein design. In this paper, we propose a computational method based on probabilistic relational model that attempts to address this task using high-throughput protein interaction data and a set of short sequence motifs. We learn the model using the EM algorithm, with a branch-and-bound algorithm as an approximate inference for the E-step. Our method searches for motifs whose presence in a pair of interacting proteins can explain their observed interaction. It also tries to determine which motif pairs have high affinity, and can therefore lead to an interaction. We show that our method is more accurate than others at predicting new protein-protein interactions. More importantly, by examining solved structures of protein complexes, we find that 2/3 of the predicted active motifs correspond to actual interaction sites. 1
6 0.44159091 52 nips-2004-Discrete profile alignment via constrained information bottleneck
7 0.41958344 143 nips-2004-PAC-Bayes Learning of Conjunctions and Classification of Gene-Expression Data
8 0.41490376 108 nips-2004-Markov Networks for Detecting Overalpping Elements in Sequence Data
9 0.35447124 170 nips-2004-Similarity and Discrimination in Classical Conditioning: A Latent Variable Account
10 0.34492469 136 nips-2004-On Semi-Supervised Classification
11 0.3257966 156 nips-2004-Result Analysis of the NIPS 2003 Feature Selection Challenge
12 0.29146469 95 nips-2004-Large-Scale Prediction of Disulphide Bond Connectivity
13 0.29104611 120 nips-2004-Modeling Conversational Dynamics as a Mixed-Memory Markov Process
14 0.28102747 46 nips-2004-Constraining a Bayesian Model of Human Visual Speed Perception
15 0.27997476 198 nips-2004-Unsupervised Variational Bayesian Learning of Nonlinear Models
16 0.27574641 122 nips-2004-Modelling Uncertainty in the Game of Go
17 0.27249503 141 nips-2004-Optimal sub-graphical models
18 0.26630223 193 nips-2004-Theories of Access Consciousness
19 0.26547366 50 nips-2004-Dependent Gaussian Processes
20 0.2651917 8 nips-2004-A Machine Learning Approach to Conjoint Analysis
topicId topicWeight
[(13, 0.046), (15, 0.064), (25, 0.015), (26, 0.034), (31, 0.017), (33, 0.666), (35, 0.012), (39, 0.014), (50, 0.031), (69, 0.011)]
simIndex simValue paperId paperTitle
1 0.993527 63 nips-2004-Expectation Consistent Free Energies for Approximate Inference
Author: Manfred Opper, Ole Winther
Abstract: We propose a novel a framework for deriving approximations for intractable probabilistic models. This framework is based on a free energy (negative log marginal likelihood) and can be seen as a generalization of adaptive TAP [1, 2, 3] and expectation propagation (EP) [4, 5]. The free energy is constructed from two approximating distributions which encode different aspects of the intractable model such a single node constraints and couplings and are by construction consistent on a chosen set of moments. We test the framework on a difficult benchmark problem with binary variables on fully connected graphs and 2D grid graphs. We find good performance using sets of moments which either specify factorized nodes or a spanning tree on the nodes (structured approximation). Surprisingly, the Bethe approximation gives very inferior results even on grids. 1
2 0.99303538 139 nips-2004-Optimal Aggregation of Classifiers and Boosting Maps in Functional Magnetic Resonance Imaging
Author: Vladimir Koltchinskii, Manel Martínez-ramón, Stefan Posse
Abstract: We study a method of optimal data-driven aggregation of classifiers in a convex combination and establish tight upper bounds on its excess risk with respect to a convex loss function under the assumption that the solution of optimal aggregation problem is sparse. We use a boosting type algorithm of optimal aggregation to develop aggregate classifiers of activation patterns in fMRI based on locally trained SVM classifiers. The aggregation coefficients are then used to design a ”boosting map” of the brain needed to identify the regions with most significant impact on classification. 1
3 0.99203032 39 nips-2004-Coarticulation in Markov Decision Processes
Author: Khashayar Rohanimanesh, Robert Platt, Sridhar Mahadevan, Roderic Grupen
Abstract: We investigate an approach for simultaneously committing to multiple activities, each modeled as a temporally extended action in a semi-Markov decision process (SMDP). For each activity we define a set of admissible solutions consisting of the redundant set of optimal policies, and those policies that ascend the optimal statevalue function associated with them. A plan is then generated by merging them in such a way that the solutions to the subordinate activities are realized in the set of admissible solutions satisfying the superior activities. We present our theoretical results and empirically evaluate our approach in a simulated domain. 1
4 0.99179089 120 nips-2004-Modeling Conversational Dynamics as a Mixed-Memory Markov Process
Author: Tanzeem Choudhury, Sumit Basu
Abstract: In this work, we quantitatively investigate the ways in which a given person influences the joint turn-taking behavior in a conversation. After collecting an auditory database of social interactions among a group of twenty-three people via wearable sensors (66 hours of data each over two weeks), we apply speech and conversation detection methods to the auditory streams. These methods automatically locate the conversations, determine their participants, and mark which participant was speaking when. We then model the joint turn-taking behavior as a Mixed-Memory Markov Model [1] that combines the statistics of the individual subjects' self-transitions and the partners ' cross-transitions. The mixture parameters in this model describe how much each person 's individual behavior contributes to the joint turn-taking behavior of the pair. By estimating these parameters, we thus estimate how much influence each participant has in determining the joint turntaking behavior. We show how this measure correlates significantly with betweenness centrality [2], an independent measure of an individual's importance in a social network. This result suggests that our estimate of conversational influence is predictive of social influence. 1
5 0.99037683 126 nips-2004-Nearly Tight Bounds for the Continuum-Armed Bandit Problem
Author: Robert D. Kleinberg
Abstract: In the multi-armed bandit problem, an online algorithm must choose from a set of strategies in a sequence of n trials so as to minimize the total cost of the chosen strategies. While nearly tight upper and lower bounds are known in the case when the strategy set is finite, much less is known when there is an infinite strategy set. Here we consider the case when the set of strategies is a subset of Rd , and the cost functions are continuous. In the d = 1 case, we improve on the best-known upper and lower bounds, closing the gap to a sublogarithmic factor. We also consider the case where d > 1 and the cost functions are convex, adapting a recent online convex optimization algorithm of Zinkevich to the sparser feedback model of the multi-armed bandit problem. 1
same-paper 6 0.98950291 149 nips-2004-Probabilistic Inference of Alternative Splicing Events in Microarray Data
7 0.98694295 115 nips-2004-Maximum Margin Clustering
8 0.98672926 186 nips-2004-The Correlated Correspondence Algorithm for Unsupervised Registration of Nonrigid Surfaces
9 0.90134448 80 nips-2004-Identifying Protein-Protein Interaction Sites on a Genome-Wide Scale
10 0.90087682 143 nips-2004-PAC-Bayes Learning of Conjunctions and Classification of Gene-Expression Data
11 0.89986813 56 nips-2004-Dynamic Bayesian Networks for Brain-Computer Interfaces
12 0.89227712 32 nips-2004-Boosting on Manifolds: Adaptive Regularization of Base Classifiers
13 0.89048696 72 nips-2004-Generalization Error and Algorithmic Convergence of Median Boosting
14 0.88863587 44 nips-2004-Conditional Random Fields for Object Recognition
15 0.88716692 147 nips-2004-Planning for Markov Decision Processes with Sparse Stochasticity
16 0.8844322 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound
17 0.87827742 38 nips-2004-Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms
18 0.87713766 166 nips-2004-Semi-supervised Learning via Gaussian Processes
19 0.87701255 74 nips-2004-Harmonising Chorales by Probabilistic Inference
20 0.87394696 48 nips-2004-Convergence and No-Regret in Multiagent Learning