nips nips2004 nips2004-52 knowledge-graph by maker-knowledge-mining

52 nips-2004-Discrete profile alignment via constrained information bottleneck


Source: pdf

Author: Sean O'Rourke, Gal Chechik, Robin Friedman, Eleazar Eskin

Abstract: Amino acid profiles, which capture position-specific mutation probabilities, are a richer encoding of biological sequences than the individual sequences themselves. However, profile comparisons are much more computationally expensive than discrete symbol comparisons, making profiles impractical for many large datasets. Furthermore, because they are such a rich representation, profiles can be difficult to visualize. To overcome these problems, we propose a discretization for profiles using an expanded alphabet representing not just individual amino acids, but common profiles. By using an extension of information bottleneck (IB) incorporating constraints and priors on the class distributions, we find an informationally optimal alphabet. This discretization yields a concise, informative textual representation for profile sequences. Also, alignments between these sequences, while nearly as accurate as the full profile-profile alignments, can be computed almost as quickly as those between individual or consensus sequences. A full pairwise alignment of SwissProt would take years using profiles, but less than 3 days using a discrete IB encoding, illustrating how discrete encoding can expand the range of sequence problems to which profile information can be applied. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Discrete profile alignment via constrained information bottleneck Sean O’Rourke∗ seano@cs. [sent-1, score-0.307]

2 Abstract: Amino acid profiles, which capture position-specific mutation probabilities, are a richer encoding of biological sequences than the individual sequences themselves. [sent-7, score-0.268]

3 However, profile comparisons are much more computationally expensive than discrete symbol comparisons, making profiles impractical for many large datasets. [sent-8, score-0.095]

4 To overcome these problems, we propose a discretization for profiles using an expanded alphabet representing not just individual amino acids, but common profiles. [sent-10, score-0.289]

5 By using an extension of information bottleneck (IB) incorporating constraints and priors on the class distributions, we find an informationally optimal alphabet. [sent-11, score-0.24]

6 This discretization yields a concise, informative textual representation for profile sequences. [sent-12, score-0.127]

7 Also, alignments between these sequences, while nearly as accurate as the full profile-profile alignments, can be computed almost as quickly as those between individual or consensus sequences. [sent-13, score-0.292]

8 A full pairwise alignment of SwissProt would take years using profiles, but less than 3 days using a discrete IB encoding, illustrating how discrete encoding can expand the range of sequence problems to which profile information can be applied. [sent-14, score-0.31]

9 1 Introduction One of the most powerful techniques in protein analysis is the comparison of a target amino acid sequence with phylogenetically related or homologous proteins. [sent-15, score-0.376]

10 Such comparisons give insight into which portions of the protein are important by revealing the parts that were conserved through natural selection. [sent-16, score-0.183]

11 While mutations in non-functional regions may be harmless, mutations in functional regions are often lethal. [sent-17, score-0.058]

12 For this reason, functional regions of a protein tend to be conserved between organisms while non-functional regions diverge. [sent-18, score-0.183]

13 [10] uses profiles to align distant homologues from the SCOP database[3]; the resulting alignments are similar to results from structural alignments, and tend to reflect both secondary and tertiary protein structure. [sent-21, score-0.366]

14 While extremely efficient string algorithms exist for aligning protein sequences (Smith-Waterman[8]) and performing database queries (BLAST[6]), these algorithms operate on strings and are not immediately applicable to profile alignment or profile database queries. [sent-25, score-0.486]

15 While profile-based methods can be substantially more accurate than sequence-based ones, they can require at least an order of magnitude more computation time, since substitution penalties must be calculated by computing distances between probability distributions. [sent-26, score-0.029]

16 Another drawback of profile as compared to string representations is that it is much more difficult to visually interpret a sequence of 20-dimensional vectors than a sequence of letters. [sent-28, score-0.114]

17 First, once a profile is represented using a discrete alphabet, alignment and database search can be performed using the efficient string algorithms developed for sequences. [sent-30, score-0.287]

18 For example, when aligning sequences of 1000 elements, runtime decreases from 20 seconds for profiles to 2 seconds for discrete sequences. [sent-31, score-0.174]
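The speedup has a simple source: once profiles are mapped to a discrete alphabet, the substitution penalty between any two classes can be precomputed into a |C| x |C| lookup table, so the inner loop of alignment performs no distribution comparisons at all. A minimal sketch of that idea, assuming integer-coded sequences and a plain linear gap penalty (names and scoring are illustrative, not the authors' implementation):

```python
import numpy as np

def global_align_score(a, b, score, gap=-4.0):
    """Needleman-Wunsch over discretized (integer-coded) sequences.

    `a`, `b` are sequences of class indices; `score` is a precomputed
    |C| x |C| substitution table, so each cell costs one table lookup
    instead of a distance between two 20-dimensional distributions."""
    n, m = len(a), len(b)
    D = np.zeros((n + 1, m + 1))
    D[1:, 0] = gap * np.arange(1, n + 1)
    D[0, 1:] = gap * np.arange(1, m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = max(D[i - 1, j - 1] + score[a[i - 1], b[j - 1]],
                          D[i - 1, j] + gap,
                          D[i, j - 1] + gap)
    return D[n, m]
```

The same lookup table is what lets the string algorithms cited earlier (Smith-Waterman, BLAST) run unchanged on the discretized sequences.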

19 Second, by representing each class as a letter, discretized profiles can be presented in plain text like the original or consensus sequences, while conveying more information about the underlying profiles. [sent-32, score-0.058]

20 This makes them more accurate than consensus sequences, and more dense than sequence logos (see figure 1). [sent-33, score-0.129]

21 To make this representation intuitive, we want the discretization not only to minimize information loss, but also to reflect biologically meaningful categories by forming a superset of the standard 20-character amino acid alphabet. [sent-34, score-0.39]

22 This formulation demands two types of constraints: similarities of the centroids to predefined values, and specific structural similarities between strongly- and weakly-conserved variants. [sent-36, score-0.135]

23 We show below how these constraints can be added to the original IB formalism. [sent-37, score-0.05]

24 In this paper, we present a new discrete representation of proteins that takes into account information from homologues. [sent-38, score-0.073]

25 The main idea behind our approach is to compress the space of probabilistic profiles in a data-dependent manner by clustering the actual profiles and representing them by a small alphabet of distributions. [sent-39, score-0.052]

26 Since this discretization removes some of the information carried by the full profiles, we cluster the distribution in a way that is directly targeted at minimizing the information loss. [sent-40, score-0.123]

27 This is achieved using a variant of Information Bottleneck (IB)[9], a distributional clustering approach for informationally optimal discretization. [sent-41, score-0.033]

28 We apply our algorithm to a subset of MEROPS[4], a database of peptidases organized structurally by family and clan, and analyze the results in terms of both information loss and alignment quality. [sent-42, score-0.25]

29 We show that multivariate IB in particular preserves much of the information in the original profiles using a small number of classes. [sent-43, score-0.088]

30 Furthermore, optimal alignments for profile sequences encoded with these classes are much closer to the original profile-profile alignments than are alignments between the seed proteins. [sent-44, score-0.954]

31 IB discretization is therefore an attractive way to gain some of the additional sensitivity of profiles with less computational cost. [sent-45, score-0.094]

32 Figure 1: (a) Profile, (b) sequence logo[2], and (c) textual ("NNDeaptGDF") representations for part of an alignment of Pepsin A precursor P00790, showing IB’s concision compared to profiles and logos, and its precision compared to single sequences. [sent-249, score-0.257]

33 This set of solutions is identical to the set of solutions of the tradeoff optimization problem obtained for the spectrum of β values. [sent-254, score-0.06]

34 When X is discrete, its natural compression is fuzzy clustering. [sent-255, score-0.028]

35 Fortunately, its solutions can be characterized analytically by a set of self-consistent equations. [sent-257, score-0.03]

36 While the optimal solutions of the IB functional are in general soft clusters, in practice, hard cluster solutions are sometimes more easily interpreted. [sent-259, score-0.117]

37 A series of algorithms was developed for hard IB, including an algorithm that can be viewed as a one-step look-ahead sequential version of K-Means [7]. [sent-260, score-0.028]

38 To apply IB to the problem of profile discretization discussed here, X is a given set of probabilistic profiles obtained from a set of aligned sequences and Y is the set of 20 amino acids. [sent-261, score-0.338]

39 First, some clusters’ meanings are naturally determined by limiting them to correspond to the common 20-letter alphabet used to describe amino acids. [sent-265, score-0.195]

40 From the point of view of distributions over amino acids, each of these symbols is used today as the delta-function distribution that is fully concentrated on a single amino acid. [sent-266, score-0.341]

41 For the goal of finding an efficient representation, we require the centroids to be close to these delta distributions. [sent-267, score-0.099]

42 More generally, we require the centroids to be close to some predefined values $c_i$, thus adding constraints to the IB target function of the form $D_{KL}[p(y \mid \hat{c}_i) \,\|\, p(y \mid c_i)] < K_i$ for each constrained centroid. [sent-268, score-0.252]

43 While solving the constrained optimization problem is difficult, the corresponding tradeoff optimization problem can be made very similar to standard IB. [sent-269, score-0.028]

44 [Residue of Eq. (2), a minimization over $c_i \in C$; its body was lost in extraction.] We now use the identity $\sum_{x,c} p(x,c)\, D_{KL}[p(y|x) \,\|\, p(y|c)] = \sum_x p(x) \sum_y p(y|x) \log p(y|x) - \sum_c p(c) \sum_y \log p(y|c) \sum_x p(y|x)\, p(x|c) = -H(Y|X) + H(Y|C) = I(X;Y) - I(Y;C)$ to rewrite the IB functional of Eq. […] [sent-271, score-0.098]
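Read left to right, the identity says the average divergence of each profile from its class centroid is exactly the information about Y lost by the compression. On that reading, a plausible tradeoff form of the constrained functional (the exact form is the paper's Eq. (2), truncated above) attaches one multiplier per constrained centroid; a sketch:

```latex
\mathcal{L} \;=\; I(C;X) \;-\; \beta\, I(C;Y)
          \;+\; \sum_i \gamma_i \, D_{KL}\!\left[\, p(y \mid \hat{c}_i) \,\middle\|\, p(y \mid c_i) \,\right]
```

Large $\gamma_i$ pulls centroid $\hat{c}_i$ toward its predefined value $c_i$, which is the "inclusion of priors in Bayesian estimation" analogy drawn in the next sentence.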

45 This is similar to the inclusion of priors in Bayesian estimation. [sent-274, score-0.081]

46 2 Constraints on relations between centroids: We want our discretization to capture correlations between strongly- and weakly-conserved variants of the same symbol. [sent-277, score-0.22]

47 This can be done with standard IB using separate classes for the alternatives. [sent-278, score-0.059]

48 However, since the distributions of other amino acids in these two variants are likely to be related, it is preferable to define a single shared prior for both variants, and to learn a model capturing their correlation. [sent-279, score-0.222]

49 Friedman et al. [1] describe multivariate information bottleneck (mIB), an extension of information bottleneck to joint distributions over several correlated input and cluster variables. [sent-281, score-0.292]

50 For profile discretization, we define two compression variables connected as in Friedman’s “parallel IB”: an amino acid class $C \in \{A, C, \ldots\}$ … [sent-282, score-0.237]

51 Since this model correlates strong and weak variants of each category, it requires fewer priors than simple IB. [sent-286, score-0.108]

52 It also has fewer parameters: a multivariate model with $n_s$ strengths and $n_c$ classes has as many categories as a univariate one with $n_s n_c$ classes, but has only $n_s + n_c - 2$ free parameters for each $x$, instead of $n_s n_c - 1$. [sent-287, score-0.674]
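As a worked check with a hypothetical split of the 52 categories used in the experiments below: $n_s = 2$ strengths and $n_c = 26$ classes give $n_s n_c = 52$ categories either way, but the univariate model spends $n_s n_c - 1 = 51$ free parameters per $x$ while the parallel model spends $n_s + n_c - 2 = 26$.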

53 Proteins within the same family typically contain high-confidence alignments; those from different families in the same clan, less so. [sent-289, score-0.095]

54 For each protein, we generate a profile from alignments obtained from PSI–BLAST with standard parameters, and compute IB classes from a large subset of these profiles using the priors described below. [sent-290, score-0.397]

55 For univariate IB, we have used four types of priors reflecting biases on stability, physical properties, and observed substitution frequencies: (1) Strongly conserved classes, in which a single symbol is seen with S% probability. [sent-292, score-0.307]

56 These are the only priors used for multivariate IB. [sent-293, score-0.146]

57 (2) Weakly conserved classes, in which a single symbol occurs with W % probability; (S −W )% of the remaining probability mass is distributed among symbols with non-negative log-odds of substitution. [sent-294, score-0.217]

58 (3) Physical trait classes, in which all symbols with the same hydrophobicity, charge, polarity, or aromaticity occur uniformly S% of the time. [sent-295, score-0.055]

59 (4) A uniform class, in which all symbols occur with their background probabilities. [sent-296, score-0.055]
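A sketch of how prior types (1) and (2) above might be constructed, assuming a uniform background in place of true background frequencies and a BLOSUM-style log-odds matrix to decide which substitutions receive the (S − W)% mass; names and normalization are illustrative, not the authors' code:

```python
import numpy as np

AA = list("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids

def strong_prior(i, S=0.9):
    """Prior type (1): symbol i is seen with probability S; the
    remaining mass is spread uniformly over the other 19 symbols."""
    p = np.full(20, (1.0 - S) / 19.0)
    p[i] = S
    return p

def weak_prior(i, log_odds, S=0.9, W=0.5):
    """Prior type (2): symbol i occurs with probability W, and
    (S - W) of the mass is split among symbols with non-negative
    substitution log-odds against i (e.g. a BLOSUM row); the rest
    is spread uniformly, then the vector is renormalized."""
    subs = np.flatnonzero((log_odds[i] >= 0) & (np.arange(20) != i))
    p = np.full(20, (1.0 - S) / 20.0)
    if subs.size:
        p[subs] += (S - W) / subs.size
    p[i] = W
    return p / p.sum()
```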

60 Unbiased IB on a large subset of MEROPS with several different numbers of unbiased categories yielded a mean frequency approaching 0.7 for the most common symbol in the 20 most sharply-distributed classes (0.…). [sent-298, score-0.087] [sent-299, score-0.111]

62 Similarly, the next 20 classes have a mean most-likely-symbol frequency around 0.5, reflecting a bias toward stronger definitions of conservation than those inferred from the data. [sent-306, score-0.059] [sent-311, score-0.029]

64 Therefore, instead of sIB, we use iterative IB (iIB) with hard clustering, which only recomputes the centroids after performing all updates. [sent-317, score-0.127]
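A minimal sketch of that loop, in the limit where assignment reduces to nearest-centroid-by-KL; the pull of constrained centroids toward their priors is omitted, and all names are illustrative:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def hard_iib(profiles, weights, centroids, iters=50):
    """Hard iterative IB: assign every profile x to the class c
    minimizing DKL[p(y|x) || p(y|c)], then recompute each centroid
    as the p(x)-weighted average of its members.  Centroids change
    only after a full pass, unlike sequential IB's one-step updates."""
    for _ in range(iters):
        labels = np.array([min(range(len(centroids)),
                               key=lambda k: kl(x, centroids[k]))
                           for x in profiles])
        for k in range(len(centroids)):
            member = labels == k
            if weights[member].sum() > 0:
                centroids[k] = np.average(profiles[member], axis=0,
                                          weights=weights[member])
    return labels, centroids
```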

65 Since Slonim argues that sIB outperforms soft iIB in part because sIB’s discrete steps allow it to escape local optima, we expect hard iIB to have similar behavior. [sent-320, score-0.071]

66 To test this, we applied three complete sIB iterations initialized with categories from multivariate iIB. [sent-321, score-0.152]

67 sIB decreased the loss L by only about 3 percent (from 0.…). [sent-322, score-0.032]

68 Also, the resulting categories were mostly conserved up to exchanging labels, suggesting that hard iIB finds categories similar to the sIB ones (see figure 2). [sent-325, score-0.335]

69 Figure (3b) shows the effect on information loss of varying the prior weight w with three sets of priors: 20 strongly conserved symbols and one background; these plus 20 weakly conserved symbols; and these plus 10 categories for physical characteristics. [sent-328, score-0.394]

70 As expected, both decreasing the number of categories and increasing the number or weight of priors increases information loss. [sent-329, score-0.168]

71 However, with a fixed number of free categories, information loss is nearly independent of prior strength, suggesting that our priors correspond to actual regularities in the data. [sent-330, score-0.136]

72 Finally, note that despite having fewer free parameters than the univariate models, mIB achieves comparable performance, suggesting that our decomposition into conserved class and degree of conservation is reasonable. [sent-331, score-0.197]

73 Since we are ultimately using these classes in alignments, the true cost of discretization is best measured by the amount of change between profile and IB alignments, and the significance of this change. [sent-332, score-0.153]

74 The latter is important because the best path can be very sensitive to small changes in the sequences or scoring matrix; if two radically different alignments have similar scores, neither is clearly “correct”. [sent-333, score-0.358]

75 We can represent an alignment as a pair of index-insertion sequences, one for each profile sequence to be aligned (e.g., …). [sent-334, score-0.224]

76 The edit distance between these sequences for two alignments then measures how much they differ. [sent-343, score-0.397]
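A sketch of that measurement, assuming each alignment has been flattened to a sequence of match/insert symbols (the index-insertion form above); this is plain Levenshtein distance, without the per-length normalization reported in Figure 4:

```python
def edit_distance(s, t):
    """Levenshtein distance between two alignment encodings,
    using a single rolling row of the dynamic-programming table."""
    m, n = len(s), len(t)
    d = list(range(n + 1))                 # row for the empty prefix of s
    for i in range(1, m + 1):
        prev, d[0] = d[0], i               # prev holds D[i-1][j-1]
        for j in range(1, n + 1):
            prev, d[j] = d[j], min(d[j] + 1,                      # delete
                                   d[j - 1] + 1,                  # insert
                                   prev + (s[i - 1] != t[j - 1])) # substitute
    return d[n]
```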

77 However, even when this distance is large, the difference between two alignments may not be significant if both choices’ scores are nearly the same. [sent-344, score-0.257]

78 That is, if the optimal profile alignment’s score is only slightly lower than the optimal IB class alignment’s score as computed with the original profiles, either might be correct. [sent-345, score-0.093]

79 Figure 4 shows at left both the edit distance and score change per length between profile alignments and those using IB classes, mIB classes, and the original sequences with the BLOSUM62 scoring matrix. [sent-346, score-0.455]

80 To compare the profile and sequence alignments, profiles corresponding to gaps in the original sequences are replaced by gaps, and resulting pairs of aligned gaps in the profile-profile alignment are removed. [sent-347, score-0.207] [sent-385, score-0.219]

81 [Figure 3 residue: alignment time (s) versus sequence length for profile-profile and IB-profile alignment, with quadratic fit 2e-5 * L^2 + 0.46; information-loss curves for multivariate IB and for 21/52, 41/52, and 51/52 priors; the information loss for 52 categories without priors is 0.062.]

83 Figure 4: Left: alignment differences for IB models and sequence alignment, within and between superfamilies. [sent-383, score-0.224]

84 Right: ROC curve for same/different superfamily classification by alignment score. [sent-384, score-0.23]

86 We consider both sequences from the same family and those from other families in the same clan, the former being more similar than the latter, and therefore having better alignments. [sent-386, score-0.13]

87 Assuming the profile-profile alignment is closest to the “true” alignment, iIB alignment significantly outperforms sequence alignment in both cases, with mIB showing a slight additional improvement. [sent-387, score-0.584]

88 At right is the ROC curve for detecting superfamily relationships between profiles from different families based on alignment scores, showing that while IB fares worse than profiles, simple sequences perform essentially at chance. [sent-388, score-0.36]

89 Finally, figure 3a compares the performance of profile and IB alignment for different sequence lengths. [sent-389, score-0.224]

90 To use a profile alphabet for novel alignments, we must map each input profile to the closest IB class. [sent-390, score-0.052]

91 Aligning two sequences of lengths n and m requires computing the |C|(n+m) JS-distances between each profile and each category, a significant improvement over the mn distance computations required for profile-profile alignment when $|C| \ll \min(m,n)/2$. [sent-393, score-0.281]

92 Our results show that JS-distance computations dominate running time, since IB alignment time scales linearly with the input size, while profile alignment scales quadratically, yielding an order of magnitude improvement for typical 500- to 1000-base-pair sequences. [sent-394, score-0.36]
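A sketch of the per-sequence mapping step: each length-n profile sequence is discretized once with |C|·n JS-distance computations, after which every alignment involving it is pure table lookup (the JS form below uses natural logs; the paper's base is not specified here):

```python
import numpy as np

def js(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    p, q = p + eps, q + eps
    m = 0.5 * (p + q)
    return float(0.5 * np.sum(p * np.log(p / m)) +
                 0.5 * np.sum(q * np.log(q / m)))

def discretize(profile_seq, classes):
    """Map each position's 20-dim profile to its JS-closest IB class:
    |C| * n distance computations, paid once per sequence rather than
    m * n per aligned pair of sequences."""
    return [min(range(len(classes)), key=lambda k: js(p, classes[k]))
            for p in profile_seq]
```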

93 4 Discussion We have described a discrete approximation to amino acid profiles, based on minimizing information loss, that allows profile information to be used for alignment and search without additional computational cost compared to simple sequence alignment. [sent-395, score-0.476]

94 Alignments of sequences encoded with a modest number of classes correspond to the original profile alignments significantly better than alignments of the original sequences. [sent-396, score-0.72]

95 In addition to minimizing information loss, the classes can be constrained to correspond to the standard amino acid representation, yielding an intuitive, compact textual form for profile information. [sent-397, score-0.329]

96 Our model is useful in three ways: (1) it makes it possible to apply existing fast discrete algorithms to arbitrary continuous sequences; (2) it models rich conditional distribution structures; and (3) its models can incorporate a variety of class constraints. [sent-398, score-0.043]

97 More generally, we can encode arbitrary conserved regions and still treat them symbolically for alignment and search. [sent-404, score-0.29]

98 Other extensions include incorporating structural information in the input representation; assigning structural significance to the resulting categories; and learning multivariate IB’s underlying model’s structure. [sent-405, score-0.137]

99 SCOP: a structural classification of proteins database for the investigation of sequences and structures. [sent-420, score-0.205]

100 Prediction of protein secondary structure at better than 70% accuracy. [sent-437, score-0.073]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('pro', 0.499), ('ib', 0.452), ('le', 0.261), ('alignments', 0.257), ('les', 0.219), ('alignment', 0.18), ('sib', 0.149), ('amino', 0.143), ('conserved', 0.11), ('sequences', 0.101), ('centroids', 0.099), ('iib', 0.099), ('mib', 0.099), ('bottleneck', 0.099), ('ci', 0.098), ('discretization', 0.094), ('categories', 0.087), ('dkl', 0.086), ('priors', 0.081), ('protein', 0.073), ('di', 0.069), ('clan', 0.066), ('merops', 0.066), ('acid', 0.066), ('multivariate', 0.065), ('nc', 0.061), ('erent', 0.06), ('classes', 0.059), ('symbols', 0.055), ('acids', 0.052), ('alphabet', 0.052), ('symbol', 0.052), ('blast', 0.05), ('homologous', 0.05), ('logos', 0.05), ('superfamily', 0.05), ('yona', 0.05), ('ns', 0.046), ('py', 0.046), ('sequence', 0.044), ('ey', 0.043), ('tradeo', 0.043), ('discrete', 0.043), ('yq', 0.039), ('edit', 0.039), ('gaps', 0.039), ('ks', 0.039), ('slonim', 0.039), ('database', 0.038), ('ra', 0.037), ('structural', 0.036), ('score', 0.035), ('consensus', 0.035), ('univariate', 0.035), ('friedman', 0.033), ('acdefgh', 0.033), ('blosum', 0.033), ('informationally', 0.033), ('klmnpqrstvwy', 0.033), ('ky', 0.033), ('lepro', 0.033), ('logo', 0.033), ('psi', 0.033), ('scop', 0.033), ('swissprot', 0.033), ('pq', 0.033), ('textual', 0.033), ('loss', 0.032), ('ta', 0.031), ('proteins', 0.03), ('aligning', 0.03), ('solutions', 0.03), ('families', 0.029), ('cluster', 0.029), ('kr', 0.029), ('noam', 0.029), ('brenner', 0.029), ('conservation', 0.029), ('gal', 0.029), ('js', 0.029), ('mutations', 0.029), ('naftali', 0.029), ('ry', 0.029), ('substitution', 0.029), ('vq', 0.029), ('constrained', 0.028), ('hard', 0.028), ('compression', 0.028), ('molecular', 0.028), ('variants', 0.027), ('constraints', 0.027), ('clusters', 0.026), ('ew', 0.026), ('fq', 0.026), ('qv', 0.026), ('sf', 0.026), ('wq', 0.026), ('string', 0.026), ('original', 0.023), ('suggesting', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000018 52 nips-2004-Discrete profile alignment via constrained information bottleneck

Author: Sean O'Rourke, Gal Chechik, Robin Friedman, Eleazar Eskin

Abstract: Amino acid profiles, which capture position-specific mutation probabilities, are a richer encoding of biological sequences than the individual sequences themselves. However, profile comparisons are much more computationally expensive than discrete symbol comparisons, making profiles impractical for many large datasets. Furthermore, because they are such a rich representation, profiles can be difficult to visualize. To overcome these problems, we propose a discretization for profiles using an expanded alphabet representing not just individual amino acids, but common profiles. By using an extension of information bottleneck (IB) incorporating constraints and priors on the class distributions, we find an informationally optimal alphabet. This discretization yields a concise, informative textual representation for profile sequences. Also, alignments between these sequences, while nearly as accurate as the full profile-profile alignments, can be computed almost as quickly as those between individual or consensus sequences. A full pairwise alignment of SwissProt would take years using profiles, but less than 3 days using a discrete IB encoding, illustrating how discrete encoding can expand the range of sequence problems to which profile information can be applied. 1

2 0.13205965 90 nips-2004-Joint Probabilistic Curve Clustering and Alignment

Author: Scott J. Gaffney, Padhraic Smyth

Abstract: Clustering and prediction of sets of curves is an important problem in many areas of science and engineering. It is often the case that curves tend to be misaligned from each other in a continuous manner, either in space (across the measurements) or in time. We develop a probabilistic framework that allows for joint clustering and continuous alignment of sets of curves in curve space (as opposed to a fixed-dimensional featurevector space). The proposed methodology integrates new probabilistic alignment models with model-based curve clustering algorithms. The probabilistic approach allows for the derivation of consistent EM learning algorithms for the joint clustering-alignment problem. Experimental results are shown for alignment of human growth data, and joint clustering and alignment of gene expression time-course data.

3 0.11426613 124 nips-2004-Multiple Alignment of Continuous Time Series

Author: Jennifer Listgarten, Radford M. Neal, Sam T. Roweis, Andrew Emili

Abstract: Multiple realizations of continuous-valued time series from a stochastic process often contain systematic variations in rate and amplitude. To leverage the information contained in such noisy replicate sets, we need to align them in an appropriate way (for example, to allow the data to be properly combined by adaptive averaging). We present the Continuous Profile Model (CPM), a generative model in which each observed time series is a non-uniformly subsampled version of a single latent trace, to which local rescaling and additive noise are applied. After unsupervised training, the learned trace represents a canonical, high resolution fusion of all the replicates. As well, an alignment in time and scale of each observation to this trace can be found by inference in the model. We apply CPM to successfully align speech signals from multiple speakers and sets of Liquid Chromatography-Mass Spectrometry proteomic data. 1

4 0.094317809 6 nips-2004-A Hidden Markov Model for de Novo Peptide Sequencing

Author: Bernd Fischer, Volker Roth, Jonas Grossmann, Sacha Baginsky, Wilhelm Gruissem, Franz Roos, Peter Widmayer, Joachim M. Buhmann

Abstract: De novo Sequencing of peptides is a challenging task in proteome research. While there exist reliable DNA-sequencing methods, the highthroughput de novo sequencing of proteins by mass spectrometry is still an open problem. Current approaches suffer from a lack in precision to detect mass peaks in the spectrograms. In this paper we present a novel method for de novo peptide sequencing based on a hidden Markov model. Experiments effectively demonstrate that this new method significantly outperforms standard approaches in matching quality. 1

5 0.081644617 77 nips-2004-Hierarchical Clustering of a Mixture Model

Author: Jacob Goldberger, Sam T. Roweis

Abstract: In this paper we propose an efficient algorithm for reducing a large mixture of Gaussians into a smaller mixture while still preserving the component structure of the original model; this is achieved by clustering (grouping) the components. The method minimizes a new, easily computed distance measure between two Gaussian mixtures that can be motivated from a suitable stochastic model and the iterations of the algorithm use only the model parameters, avoiding the need for explicit resampling of datapoints. We demonstrate the method by performing hierarchical clustering of scenery images and handwritten digits. 1

6 0.067358211 182 nips-2004-Synergistic Face Detection and Pose Estimation with Energy-Based Models

7 0.061538354 177 nips-2004-Supervised Graph Inference

8 0.058757532 149 nips-2004-Probabilistic Inference of Alternative Splicing Events in Microarray Data

9 0.05813593 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning

10 0.056920696 41 nips-2004-Comparing Beliefs, Surveys, and Random Walks

11 0.054273881 161 nips-2004-Self-Tuning Spectral Clustering

12 0.053118557 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

13 0.049932204 95 nips-2004-Large-Scale Prediction of Disulphide Bond Connectivity

14 0.049511898 114 nips-2004-Maximum Likelihood Estimation of Intrinsic Dimension

15 0.046919819 80 nips-2004-Identifying Protein-Protein Interaction Sites on a Genome-Wide Scale

16 0.043406833 87 nips-2004-Integrating Topics and Syntax

17 0.042314354 134 nips-2004-Object Classification from a Single Example Utilizing Class Relevance Metrics

18 0.036908664 168 nips-2004-Semigroup Kernels on Finite Sets

19 0.036027934 10 nips-2004-A Probabilistic Model for Online Document Clustering with Application to Novelty Detection

20 0.035066228 207 nips-2004-ℓ₀-norm Minimization for Basis Selection


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.131), (1, 0.021), (2, -0.015), (3, -0.055), (4, -0.02), (5, 0.015), (6, -0.06), (7, 0.08), (8, 0.003), (9, -0.015), (10, -0.002), (11, 0.027), (12, 0.015), (13, 0.018), (14, 0.053), (15, -0.046), (16, -0.05), (17, -0.078), (18, -0.046), (19, -0.061), (20, -0.045), (21, 0.041), (22, -0.2), (23, 0.016), (24, 0.152), (25, 0.011), (26, -0.104), (27, -0.195), (28, 0.031), (29, -0.073), (30, -0.071), (31, 0.116), (32, 0.048), (33, -0.122), (34, 0.027), (35, -0.065), (36, 0.05), (37, 0.176), (38, 0.001), (39, 0.035), (40, 0.162), (41, -0.001), (42, 0.048), (43, -0.071), (44, 0.225), (45, -0.171), (46, -0.032), (47, 0.014), (48, 0.127), (49, -0.006)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97896647 52 nips-2004-Discrete profile alignment via constrained information bottleneck

Author: Sean O'Rourke, Gal Chechik, Robin Friedman, Eleazar Eskin

Abstract: Amino acid profiles, which capture position-specific mutation probabilities, are a richer encoding of biological sequences than the individual sequences themselves. However, profile comparisons are much more computationally expensive than discrete symbol comparisons, making profiles impractical for many large datasets. Furthermore, because they are such a rich representation, profiles can be difficult to visualize. To overcome these problems, we propose a discretization for profiles using an expanded alphabet representing not just individual amino acids, but common profiles. By using an extension of information bottleneck (IB) incorporating constraints and priors on the class distributions, we find an informationally optimal alphabet. This discretization yields a concise, informative textual representation for profile sequences. Also, alignments between these sequences, while nearly as accurate as the full profile-profile alignments, can be computed almost as quickly as those between individual or consensus sequences. A full pairwise alignment of SwissProt would take years using profiles, but less than 3 days using a discrete IB encoding, illustrating how discrete encoding can expand the range of sequence problems to which profile information can be applied. 1

2 0.63443482 6 nips-2004-A Hidden Markov Model for de Novo Peptide Sequencing

Author: Bernd Fischer, Volker Roth, Jonas Grossmann, Sacha Baginsky, Wilhelm Gruissem, Franz Roos, Peter Widmayer, Joachim M. Buhmann

Abstract: De novo Sequencing of peptides is a challenging task in proteome research. While there exist reliable DNA-sequencing methods, the highthroughput de novo sequencing of proteins by mass spectrometry is still an open problem. Current approaches suffer from a lack in precision to detect mass peaks in the spectrograms. In this paper we present a novel method for de novo peptide sequencing based on a hidden Markov model. Experiments effectively demonstrate that this new method significantly outperforms standard approaches in matching quality. 1

3 0.49550617 90 nips-2004-Joint Probabilistic Curve Clustering and Alignment

Author: Scott J. Gaffney, Padhraic Smyth

Abstract: Clustering and prediction of sets of curves is an important problem in many areas of science and engineering. It is often the case that curves tend to be misaligned from each other in a continuous manner, either in space (across the measurements) or in time. We develop a probabilistic framework that allows for joint clustering and continuous alignment of sets of curves in curve space (as opposed to a fixed-dimensional featurevector space). The proposed methodology integrates new probabilistic alignment models with model-based curve clustering algorithms. The probabilistic approach allows for the derivation of consistent EM learning algorithms for the joint clustering-alignment problem. Experimental results are shown for alignment of human growth data, and joint clustering and alignment of gene expression time-course data.

4 0.45965812 95 nips-2004-Large-Scale Prediction of Disulphide Bond Connectivity

Author: Jianlin Cheng, Alessandro Vullo, Pierre F. Baldi

Abstract: The formation of disulphide bridges among cysteines is an important feature of protein structures. Here we develop new methods for the prediction of disulphide bond connectivity. We first build a large curated data set of proteins containing disulphide bridges and then use 2-Dimensional Recursive Neural Networks to predict bonding probabilities between cysteine pairs. These probabilities in turn lead to a weighted graph matching problem that can be addressed efficiently. We show how the method consistently achieves better results than previous approaches on the same validation data. In addition, the method can easily cope with chains with arbitrary numbers of bonded cysteines. Therefore, it overcomes one of the major limitations of previous approaches restricting predictions to chains containing no more than 10 oxidized cysteines. The method can be applied both to situations where the bonded state of each cysteine is known or unknown, in which case bonded state can be predicted with 85% precision and 90% recall. The method also yields an estimate for the total number of disulphide bridges in each chain. 1

5 0.42766702 149 nips-2004-Probabilistic Inference of Alternative Splicing Events in Microarray Data

Author: Ofer Shai, Brendan J. Frey, Quaid D. Morris, Qun Pan, Christine Misquitta, Benjamin J. Blencowe

Abstract: Alternative splicing (AS) is an important and frequent step in mammalian gene expression that allows a single gene to specify multiple products, and is crucial for the regulation of fundamental biological processes. The extent of AS regulation, and the mechanisms involved, are not well understood. We have developed a custom DNA microarray platform for surveying AS levels on a large scale. We present here a generative model for the AS Array Platform (GenASAP) and demonstrate its utility for quantifying AS levels in different mouse tissues. Learning is performed using a variational expectation maximization algorithm, and the parameters are shown to correctly capture expected AS trends. A comparison of the results obtained with a well-established but low through-put experimental method demonstrate that AS levels obtained from GenASAP are highly predictive of AS levels in mammalian tissues. 1 Biological diversity through alternative splicing Current estimates place the number of genes in the human genome at approximately 30,000, which is a surprisingly small number when one considers that the genome of yeast, a singlecelled organism, has 6,000 genes. The number of genes alone cannot account for the complexity and cell specialization exhibited by higher eukaryotes (i.e. mammals, plants, etc.). Some of that added complexity can be achieved through the use of alternative splicing, whereby a single gene can be used to code for a multitude of products. Genes are segments of the double stranded DNA that contain the information required by the cell for protein synthesis. That information is coded using an alphabet of 4 (A, C, G, and T), corresponding to the four nucleotides that make up the DNA. In what is known as the central dogma of molecular biology, DNA is transcribed to RNA, which in turn is translated into proteins. Messenger RNA (mRNA) is synthesized in the nucleus of the cell and carries the genomic information to the ribosome. In eukaryotes, genes are generally comprised of both exons, which contain the information needed by the cell to synthesize proteins, and introns, sometimes referred to as spacer DNA, which are spliced out of the pre-mRNA to create mature mRNA. An estimated 35%-75% of human genes [1] can be C1 (a) C1 A C1 C2 C1 A 3’ C2 (b) C2 C1 A 3’ C1 A 5’ C2 A 5’ C2 C1 C1 C2 C1 (c) A2 A1 C1 A1 C1 A2 C1 C2 C2 C1 (d) C2 C2 A C2 C2 C2 C1 C2 Figure 1: Four types of AS. Boxes represent exons and lines represent introns, with the possible splicing alternatives indicated by the connectors. (a) Single cassette exon inclusion/exclusion. C1 and C2 are constitutive exons (exons that are included in all isoforms) and flank a single alternative exon (A). The alternative exon is included in one isoform and excluded in the other. (b) Alternative 3’ (or donor) and alternative 5’ (acceptor) splicing sites. Both exons are constitutive, but may contain alternative donor and/or acceptor splicing sites. (c) Mutually exclusive exons. One of the two alternative exons (A1 and A2 ) may be included in the isoform, but not both. (d) Intron inclusion. An intron may be included in the mature mRNA strand. spliced to yield different combinations of exons (called isoforms), a phenomenon referred to as alternative splicing (AS). There are four major types of AS as shown in Figure 1. Many multi-exon genes may undergo more than one alternative splicing event, resulting in many possible isoforms from a single gene. 
[2] In addition to adding to the genetic repertoire of an organism by enabling a single gene to code for more than one protein, AS has been shown to be critical for gene regulation, contributing to tissue specificity, and facilitating evolutionary processes. Despite the evident importance of AS, its regulation and impact on specific genes remains poorly understood. The work presented here is concerned with the inference of single cassette exon AS levels (Figure 1a) based on data obtained from RNA expression arrays, also known as microarrays. 1.1 An exon microarray data set that probes alternative splicing events Although it is possible to directly analyze the proteins synthesized by a cell, it is easier, and often more informative, to instead measure the abundance of mRNA present. Traditionally, gene expression (abundance of mRNA) has been studied using low throughput techniques (such as RT-PCR or Northern blots), limited to studying a few sequences at a time and making large scale analysis nearly impossible. In the early 1990s, microarray technology emerged as a method capable of measuring the expression of thousands of DNA sequences simultaneously. Sequences of interest are deposited on a substrate the size of a small microscope slide, to form probes. The mRNA is extracted from the cell and reverse-transcribed back into DNA, which is labelled with red and green fluorescent dye molecules (cy3 and cy5 respectively). When the sample of tagged DNA is washed over the slide, complementary strands of DNA from the sample hybridize to the probes on the array forming A-T and C-G pairings. The slide is then scanned and the fluorescent intensity is measured at each probe. It is generally assumed that the intensity measure at the probe is linearly related to the abundance of mRNA in the cell over a wide dynamic range. Despite significant improvements in microarray technologies in recent years, microarray data still presents some difficulties in analysis. Low measurements tend to have extremely low signal to noise ratio (SNR) [7] and probes often bind to sequences that are very similar, but not identical, to the one for which they were designed (a process referred to as cross- C1 A C1 A C1:A C1 C2 C2 3 Body probes A:C2 A C2 2 Inclusion junction probes C1:C2 C1 C2 1 Exclusion junction probe Figure 2: Each alternative splicing event is studied using six probes. Probes were chosen to measure the expression levels of each of the three exons involved in the event. Additionally, 3 probes are used that target the junctions that are formed by each of the two isoforms. The inclusion isoform would express the junctions formed by C1 and A, and A and C2 , while the exclusion isoform would express the junction formed by C1 and C2 hybridization). Additionally, probes exhibit somewhat varying hybridization efficiency, and sequences exhibit varying labelling efficiency. To design our data sets, we mined public sequence databases and identified exons that were strong candidates for exhibiting AS (the details of that analysis are provided elsewhere [4, 3]). Of the candidates, 3,126 potential AS events in 2,647 unique mouse genes were selected for the design of Agilent Custom Oligonucleotide microarray. The arrays were hybridized with unamplified mRNA samples extracted from 10 wild-type mouse tissues (brain, heart, intestine, kidney, liver, lung, salivary gland, skeletal muscle, spleen, and testis). 
Each AS event has six target probes on the arrays, chosen from regions of the C1 exon, C2 exon, A exon, C1 :A splice junction, A:C2 splice junction, and C1 :C2 splice junction, as shown in Figure 2. 2 Unsupervised discovery of alternative splicing With the exception of the probe measuring the alternative exon, A (Figure 2), all probes measure sequences that occur in both isoforms. For example, while the sequence of the probe measuring the junction A:C1 is designed to measure the inclusion isoform, half of it corresponds to a sequence that is found in the exclusion isoform. We can therefore safely assume that the measured intensity at each probe is a result of a certain amount of both isoforms binding to the probe. Due to the generally assumed linear relationship between the abundance of mRNA hybridized at a probe and the fluorescent intensity measured, we model the measured intensity as a weighted sum of the overall abundance of the two isoforms. A stronger assumption is that of a single, consistent hybridization profile for both isoforms across all probes and all slides. Ideally, one would prefer to estimate an individual hybridization profile for each AS event studied across all slides. However, in our current setup, the number of tissues is small (10), resulting in two difficulties. First, the number of parameters is very large when compared to the number of data point using this model, and second, a portion of the events do not exhibit tissue specific alternative splicing within our small set of tissues. While the first hurdle could be accounted for using Baysian parameter estimation, the second cannot. 2.1 GenASAP - a generative model for alternative splicing array platform Using the setup described above, the expression vector x, containing the six microarray measurements as real numbers, can be decomposed as a linear combination of the abundance of the two splice isoforms, represented by the real vector s, with some added noise: x = Λs + noise, where Λ is a 6 × 2 weight matrix containing the hybridization profiles for s1 ^ xC ^ xC 1 2 s2 ^ xC :A ^ xA ^ xA:C 2 ^ xC1:C2 2 xC1:C2 1 r xC xC 2 xA xC :A xA:C oC1 oC2 oA oC :A oA:C 1 1 1 2 oC :C 1 2 Σn 2 Figure 3: Graphical model for alternative splicing. Each measurement in the observed expression profile, x, is generated by either using a scale factor, r, on a linear combination of the isoforms, s, or drawing randomly from an outlier model. For a detailed description of the model, see text. the two isoforms across the six probes. Note that we may not have a negative amount of a given isoform, nor can the presence of an isoform deduct from the measured expression, and so both s and Λ are constrained to be positive. Expression levels measured by microarrays have previously been modelled as having expression-dependent noise [7]. To address this, we rewrite the above formulation as x = r(Λs + ε), (1) where r is a scale factor and ε is a zero-mean normally distributed random variable with a diagonal covariance matrix, Ψ, denoted as p(ε) = N (ε; 0, Ψ). The prior distribution for the abundance of the splice isoforms is given by a truncated normal distribution, denoted as p(s) ∝ N (s, 0, I)[s ≥ 0], where [·] is an indicator function such that [s ≥ 0] = 1 if ∀i, si ≥ 0, and [s ≥ 0] = 0 otherwise. Lastly, there is a need to account for aberrant observations (e.g. due to faulty probes, flakes of dust, etc.) with an outlier model. 
The complete GenASAP model (shown in Figure 3) accounts for the observations as the outcome of either applying equation (1) or an outlier model. To avoid degenerate cases and ensure meaningful and interpretable results, the number of faulty probes considered for each AS event may not exceed two, as indicated by the filled-in square constraint node in Figure 3. The distribution of x conditional on the latent variables, s, r, and o, is: N (xi ; rΛi s, r2 Ψi )[oi =0] N (xi ; Ei , Vi )[oi =1] , p(x|s, r, o) = (2) i where oi ∈ {0, 1} is a bernoulli random variable indicating if the measurement at probe xi is the result of the AS model or the outlier model parameterized by p(oi = 1) = γi . The parameters of the outlier model, E and V, are not optimized and are set to the mean and variance of the data. 2.2 Variational learning in the GenASAP model To infer the posterior distribution over the splice isoform abundances while at the same time learning the model parameters we use a variational expectation-maximization algorithm (EM). EM maximizes the log likelihood of the data by iteratively estimating the posterior distribution of the model given the data in the expectation (E) step, and maximizing the log likelihood with respect to the parameters, while keeping the posterior fixed, in the maximization (M) step. Variational EM is used when, as in the case of GenASAP, the exact posterior is intractable. Variational EM minimizes the free energy of the model, defined as the KL-divergence between the joint distribution of the latent and observed variables and the approximation to the posterior under the model parameters [5, 6]. We approximate the true posterior using the Q distribution given by T Q({s(t) }, {o(t) }, {r(t) }) = (t) Q(r(t) )Q(o(t) |r(t) ) t=1 (t) Q(si |oi , r(t) ) i −1 T =Z (t) (3) d d ρ(t) ω (t) N (s(t) ; µ(t) , Σ(t) )[s(t) ≥ 0], ro ro t=1 where Z is a normalization constant, the superscript d indicates that Σ is constrained to be diagonal, and there are T iid AS events. For computational efficiency, r is selected from a finite set, r ∈ {r1 , r2 , . . . , rC } with uniform probability. The variational free energy is given by Q({s(t) }, {o(t) }, {r(t) }) . P ({s(t) }, {o(t) }, {r(t) }, {x(t) }) s r o (4) Variational EM minimizes the free energy by iteratively updating the Q distribution’s vari(t)d (t)d ational parameters (ρ(t) , ω (t) , µro , and Σro ) in the E-step, and the model parameters (Λ, Ψ, {r1 , r2 , . . . , rC }, and γ) in the M-step. The resulting updates are too long to be shown in the context of this paper and are discussed in detail elsewhere [3]. A few particular points regarding the E-step are worth covering in detail here. Q({s(t) }, {o(t) }, {r(t) }) log F(Q, P ) = If the prior on s was a full normal distribution, there would be no need for a variational approach, and exact EM is possible. For a truncated normal distribution, however, the mixing proportions, Q(r)Q(o|r) cannot be calculated analytically except for the case where s is scalar, necessitating the diagonality constraint. Note that if Σ was allowed to be a full covariance matrix, equation (3) would be the true posterior, and we could find the sufficient statistics of Q(s(t) |o(t) , r(t) ): −1 µ(t) = (I + ΛT (I − O(t) )T Ψ−1 (I − O(t) )Λ)−1 ΛT (I − O(t) )T Ψ−1 x(t) r(t) ro −1 Σ(t) ro = (I + ΛT (I − O(t) )T Ψ−1 (I − O(t) )Λ) (5) (6) where O is a diagonal matrix with elements Oi,i = oi . 
Furthermore, it can be easily shown that the optimal settings for µd and Σd approximating a normal distribution with full covariance Σ and mean µ is µd optimal = µ −1 Σd optimal = diag(Σ (7) −1 ) (8) In the truncated case, equation (8) is still true. Equation (7) does not hold, though, and µd optimal cannot be found analytically. In our experiments, we found that using equation (7) still decreases the free energy every E-step, and it is significantly more efficient than using, for example, a gradient decent method to compute the optimal µd . Intuitive Weigh Matrix Optimal Weight Matrix 50 50 40 40 30 30 20 20 10 0 10 Inclusion Isoform 0 Exclusion Isoform Inclusion Isoform (a) Exclusion Isoform (b) Figure 4: (a) An intuitive set of weights. Based on the biological background, one would expect to see the inclusion isoform hybridize to the probes measuring C1 , C2 , A, C1 :A, and A:C2 , while the exclusion isoform hybridizes to C1 , C2 , and C1 :C2 . (b) The learned set of weights closely agrees with the intuition, and captures cross hybridization between the probes Contribution of exclusion isoform Contribution of inclusion isoform AS model Original Data (a) RT−PCR AS model measurement prediction (% exclusion) (% exclusion) 14% 72% (b) 27% 70% 8% 22% outliers (c) Figure 5: Three examples of data cases and their predictions. (a) The data does not follow our notion of single cassette exon AS, but the AS level is predicted accurately by the model.(b) The probe C1 :A is marked as outlier, allowing the model to predict the other probes accurately. (c) Two probes are marked as outliers, and the model is still successful in predicting the AS levels. 3 Making biological predictions about alternative splicing The results presented in this paper were obtained using two stages of learning. In the first step, the weight matrix, Λ, is learned on a subset of the data that is selected for quality. Two selection criteria were used: (a) sequencing data was used to select those cases for which, with high confidence, no other AS event is present (Figure 1) and (b) probe sets were selected for high expression, as determined by a set of negative controls. The second selection criterion is motivated by the common assumption that low intensity measurements are of lesser quality (see Section 1.1). In the second step, Λ is kept fixed, and we introduce the additional constraint that the noise is isotropic (Ψ = ψI) and learn on the entire data set. The constraint on the noise is introduced to prevent the model from using only a subset of the six probes for making the final set of predictions. We show a typical learned set of weights in Figure 4. The weights fit well with our intuition of what they should be to capture the presence of the two isoforms. Moreover, the learned weights account for the specific trends in the data. Examples of model prediction based on the microarray data are shown in Figure 5. Due to the nature of the microarray data, we do not expect all the inferred abundances to be equally good, and we devised a scoring criterion that ranks each AS event based on its fit to the model. Intuitively, given two input vectors that are equivalent up to a scale factor, with inferred MAP estimations that are equal up to the same scale factor, we would like their scores to be identical. 
The scoring criterion used is therefore \sum_k (x_k - r \Lambda_k s)^2 / (x_k + r \Lambda_k s)^2, where the MAP estimations for r and s are used. This scoring criterion can be viewed as proportional to the sum of noise-to-signal ratios, as estimated using the two values given by the observation and the model's best prediction of that observation.

Rank     Pearson's correlation coefficient    False positive rate
500      0.94                                 0.11
1000     0.95                                 0.08
2000     0.95                                 0.05
5000     0.79                                 0.20
10000    0.79                                 0.25
15000    0.78                                 0.29
20000    0.75                                 0.32
30000    0.65                                 0.42

Table 1: Model performance evaluated at various ranks. Using 180 RT-PCR measurements, we are able to estimate the model's performance at various ranks. Two evaluation criteria are used: Pearson's correlation coefficient between the model's predictions and the RT-PCR measurements, and the false positive rate, where a prediction is considered a false positive if it is more than 15% away from the RT-PCR measurement.

Since it is the relative amount of the isoforms that is of most interest, we need to use the inferred distribution of the isoform abundances to obtain an estimate of the relative levels of AS. It is not immediately clear how this should be done. We do, however, have RT-PCR measurements for 180 AS events to guide us (see Figure 6 for details). Using the top 50 ranked RT-PCR measurements, we fit three parameters, {a_1, a_2, a_3}, such that the proportion of excluded isoform present, p, is given by

p = a_1 \frac{s_2}{s_1 + a_2 s_2} + a_3,

where s_1 is the MAP estimation of the abundance of the inclusion isoform, s_2 is the MAP estimation of the abundance of the exclusion isoform, and the RT-PCR measurements are used as the target p. The parameters are fitted using gradient descent on a least squared error (LSE) evaluation criterion.

We used two criteria to evaluate the quality of the AS model predictions. Pearson's correlation coefficient (PCC) is used to evaluate the overall ability of the model to correctly estimate trends in the data. PCC is invariant to affine transformation and so is independent of the transformation parameters a_1 and a_3 discussed above, while the parameter a_2 was found to affect PCC very little. The PCC stays above 0.75 for the top two-thirds of ranked predictions. The second evaluation criterion used is the false positive rate, where a prediction is considered a false positive if it is more than 15% away from the RT-PCR measurement. This allows us to say, for example, that if a prediction is within the top 10000, we are 75% confident that it is within 15% of the actual level of AS.

4 Summary

We designed a novel AS model for the inference of the relative abundance of two alternatively spliced isoforms from six measurements. Unsupervised learning in the model is performed using a structured variational EM algorithm, which correctly captures the underlying structure of the data, as suggested by its biological nature. The AS model, though presented here for cassette exon AS events, can be used to learn any type of AS, and with a simple adjustment, multiple types. The predictions obtained from the AS model are currently being used to verify various claims about the role of AS in evolution and functional genomics, and to help identify sequences that affect the regulation of AS.
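As a concrete companion to the scoring and calibration procedure of Section 3, here is a minimal NumPy sketch. The names as_event_score, percent_exclusion, and evaluate are ours, and percent_exclusion uses the calibration formula as reconstructed from the garbled source above, so the whole block should be read as illustrative rather than as the authors' implementation.

```python
import numpy as np

def as_event_score(x, Lam, s_map, r_map):
    """Rank score sum_k (x_k - r Lambda_k s)^2 / (x_k + r Lambda_k s)^2:
    a sum of per-probe noise-to-signal ratios at the MAP estimates."""
    pred = r_map * (Lam @ s_map)
    return np.sum((x - pred) ** 2 / (x + pred) ** 2)

def percent_exclusion(s1, s2, a1, a2, a3):
    """Calibrated exclusion proportion p = a1 * s2 / (s1 + a2 * s2) + a3
    (reconstructed form; s1, s2 are the MAP isoform abundances)."""
    return a1 * s2 / (s1 + a2 * s2) + a3

def evaluate(pred, rt_pcr, tol=15.0):
    """PCC and false positive rate against RT-PCR (% exclusion) targets;
    a prediction more than `tol` percentage points off counts as a miss."""
    pcc = np.corrcoef(pred, rt_pcr)[0, 1]
    fpr = np.mean(np.abs(pred - rt_pcr) > tol)
    return pcc, fpr
```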
[Figure 6(a): a gel image over nine tissues (intestine, testis, kidney, salivary gland, brain, spleen, liver, muscle, lung), annotated with per-tissue RT-PCR measurements (14, 22, 27, 32, 47, 46, 66, 78, 63) and AS model predictions (27, 24, 26, 26, 51, 75, 60, 85, 100), in % exclusion isoform. Figure 6(b): a scatter plot of AS model prediction against RT-PCR measurement, both on a 0-100 scale.]

Figure 6: (a) Sample RT-PCR. RNA extracted from the cell is reverse-transcribed to DNA, amplified, and labelled with radioactive or fluorescent molecules. The sample is pulled through a viscous gel in an electric field (DNA, being an acid, is negatively charged). Shorter strands travel further through the gel than longer ones, resulting in two distinct bands, corresponding to the two isoforms, when exposed to a photosensitive or x-ray film. (b) A scatter plot showing the RT-PCR measurements as compared to the AS model predictions. The plot shows all available RT-PCR measurements with a rank of 8000 or better.

The AS model presented assumes a single weight matrix for all data cases. This is an oversimplified view of the data, and current work is being carried out on identifying probe-specific expression profiles. However, due to the low dimensionality of the problem (10 tissues, six probes per event), care must be taken to avoid overfitting and to ensure meaningful interpretations.

Acknowledgments

We would like to thank Wen Zhang, Naveed Mohammad, and Timothy Hughes for their contributions in generating the data set. This work was funded in part by operating and infrastructure grants from the CIHR and CFI, and by operating grants from NSERC and a Premier's Research Excellence Award.

References

[1] J. M. Johnson et al. Genome-wide survey of human alternative pre-mRNA splicing with exon junction microarrays. Science, 302:2141–44, 2003.
[2] L. Cartegni et al. Listening to silence and understanding nonsense: exonic mutations that affect splicing. Nature Gen. Rev., 3:285–98, 2002.
[3] Q. Pan et al. Revealing global regulatory features of mammalian alternative splicing using a quantitative microarray platform. Molecular Cell, 16(6):929–41, 2004.
[4] Q. Pan et al. Alternative splicing of conserved exons is frequently species specific in human and mouse. Trends Gen., in press, 2005.
[5] M. I. Jordan, Z. Ghahramani, T. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.
[6] R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models. MIT Press, Cambridge, 1998.
[7] D. M. Rocke and B. Durbin. A model for measurement error for gene expression arrays. Journal of Computational Biology, 8(6):557–69, 2001.

6 0.41329938 80 nips-2004-Identifying Protein-Protein Interaction Sites on a Genome-Wide Scale

7 0.39133823 124 nips-2004-Multiple Alignment of Continuous Time Series

8 0.38570341 146 nips-2004-Pictorial Structures for Molecular Modeling: Interpreting Density Maps

9 0.34601974 108 nips-2004-Markov Networks for Detecting Overlapping Elements in Sequence Data

10 0.27800769 207 nips-2004-ℓ₀-norm Minimization for Basis Selection

11 0.26080871 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

12 0.25635397 41 nips-2004-Comparing Beliefs, Surveys, and Random Walks

13 0.25549257 104 nips-2004-Linear Multilayer Independent Component Analysis for Large Natural Scenes

14 0.25439292 74 nips-2004-Harmonising Chorales by Probabilistic Inference

15 0.24651931 182 nips-2004-Synergistic Face Detection and Pose Estimation with Energy-Based Models

16 0.24649896 128 nips-2004-Neural Network Computation by In Vitro Transcriptional Circuits

17 0.24261631 177 nips-2004-Supervised Graph Inference

18 0.23965208 196 nips-2004-Triangle Fixing Algorithms for the Metric Nearness Problem

19 0.23133767 161 nips-2004-Self-Tuning Spectral Clustering

20 0.22431894 51 nips-2004-Detecting Significant Multidimensional Spatial Clusters


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.089), (15, 0.085), (17, 0.015), (25, 0.337), (26, 0.037), (31, 0.011), (33, 0.151), (35, 0.02), (36, 0.027), (50, 0.035), (58, 0.014), (82, 0.013)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.88210934 109 nips-2004-Mass Meta-analysis in Talairach Space

Author: Finn Å. Nielsen

Abstract: We provide a method for mass meta-analysis in a neuroinformatics database containing stereotaxic Talairach coordinates from neuroimaging experiments. Database labels are used to group the individual experiments, e.g., according to cognitive function, and the consistent pattern of the experiments within the groups are determined. The method voxelizes each group of experiments via a kernel density estimation, forming probability density volumes. The values in the probability density volumes are compared to null-hypothesis distributions generated by resamplings from the entire unlabeled set of experiments, and the distances to the nullhypotheses are used to sort the voxels across groups of experiments. This allows for mass meta-analysis, with the construction of a list with the most prominent associations between brain areas and group labels. Furthermore, the method can be used for functional labeling of voxels. 1

same-paper 2 0.85532421 52 nips-2004-Discrete profile alignment via constrained information bottleneck

Author: Sean O'rourke, Gal Chechik, Robin Friedman, Eleazar Eskin

Abstract: Amino acid profiles, which capture position-specific mutation probabilities, are a richer encoding of biological sequences than the individual sequences themselves. However, profile comparisons are much more computationally expensive than discrete symbol comparisons, making profiles impractical for many large datasets. Furthermore, because they are such a rich representation, profiles can be difficult to visualize. To overcome these problems, we propose a discretization for profiles using an expanded alphabet representing not just individual amino acids, but common profiles. By using an extension of information bottleneck (IB) incorporating constraints and priors on the class distributions, we find an informationally optimal alphabet. This discretization yields a concise, informative textual representation for profile sequences. Also alignments between these sequences, while nearly as accurate as the full profile-profile alignments, can be computed almost as quickly as those between individual or consensus sequences. A full pairwise alignment of SwissProt would take years using profiles, but less than 3 days using a discrete IB encoding, illustrating how discrete encoding can expand the range of sequence problems to which profile information can be applied. 1

3 0.84240699 123 nips-2004-Multi-agent Cooperation in Diverse Population Games

Author: K. Wong, S. W. Lim, Z. Gao

Abstract: We consider multi-agent systems whose agents compete for resources by striving to be in the minority group. The agents adapt to the environment by reinforcement learning of the preferences of the policies they hold. Diversity of preferences of policies is introduced by adding random biases to the initial cumulative payoffs of their policies. We explain and provide evidence that agent cooperation becomes increasingly important when diversity increases. Analyses of these mechanisms yield excellent agreement with simulations over nine decades of data. 1

4 0.66771859 10 nips-2004-A Probabilistic Model for Online Document Clustering with Application to Novelty Detection

Author: Jian Zhang, Zoubin Ghahramani, Yiming Yang

Abstract: In this paper we propose a probabilistic model for online document clustering. We use non-parametric Dirichlet process prior to model the growing number of clusters, and use a prior of general English language model as the base distribution to handle the generation of novel clusters. Furthermore, cluster uncertainty is modeled with a Bayesian Dirichlet-multinomial distribution. We use empirical Bayes method to estimate hyperparameters based on a historical dataset. Our probabilistic model is applied to the novelty detection task in Topic Detection and Tracking (TDT) and compared with existing approaches in the literature. 1

5 0.63996881 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms

Author: Nicolò Cesa-Bianchi, Claudio Gentile, Luca Zaniboni

Abstract: We provide a worst-case analysis of selective sampling algorithms for learning linear threshold functions. The algorithms considered in this paper are Perceptron-like algorithms, i.e., algorithms which can be efficiently run in any reproducing kernel Hilbert space. Our algorithms exploit a simple margin-based randomized rule to decide whether to query the current label. We obtain selective sampling algorithms achieving on average the same bounds as those proven for their deterministic counterparts, but using much fewer labels. We complement our theoretical findings with an empirical comparison on two text categorization tasks. The outcome of these experiments is largely predicted by our theoretical results: Our selective sampling algorithms tend to perform as good as the algorithms receiving the true label after each classification, while observing in practice substantially fewer labels. 1

6 0.58652455 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning

7 0.52698314 124 nips-2004-Multiple Alignment of Continuous Time Series

8 0.51034003 15 nips-2004-Active Learning for Anomaly and Rare-Category Detection

9 0.51025379 151 nips-2004-Rate- and Phase-coded Autoassociative Memory

10 0.50705647 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss

11 0.50678849 102 nips-2004-Learning first-order Markov models for control

12 0.50673556 53 nips-2004-Discriminant Saliency for Visual Recognition from Cluttered Scenes

13 0.50554287 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

14 0.5032863 131 nips-2004-Non-Local Manifold Tangent Learning

15 0.5018031 116 nips-2004-Message Errors in Belief Propagation

16 0.50139356 207 nips-2004-ℓ₀-norm Minimization for Basis Selection

17 0.50073349 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

18 0.5006485 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons

19 0.49981633 64 nips-2004-Experts in a Markov Decision Process

20 0.49941605 127 nips-2004-Neighbourhood Components Analysis