nips nips2008 nips2008-70 knowledge-graph by maker-knowledge-mining

70 nips-2008-Efficient Inference in Phylogenetic InDel Trees


Source: pdf

Author: Alexandre Bouchard-Côté, Dan Klein, Michael I. Jordan

Abstract: Accurate and efficient inference in evolutionary trees is a central problem in computational biology. While classical treatments have made unrealistic site independence assumptions, ignoring insertions and deletions, realistic approaches require tracking insertions and deletions along the phylogenetic tree—a challenging and unsolved computational problem. We propose a new ancestry resampling procedure for inference in evolutionary trees. We evaluate our method in two problem domains—multiple sequence alignment and reconstruction of ancestral sequences—and show substantial improvement over the current state of the art. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Accurate and efficient inference in evolutionary trees is a central problem in computational biology. [sent-4, score-0.277]

2 While classical treatments have made unrealistic site independence assumptions, ignoring insertions and deletions, realistic approaches require tracking insertions and deletions along the phylogenetic tree—a challenging and unsolved computational problem. [sent-5, score-0.543]

3 We propose a new ancestry resampling procedure for inference in evolutionary trees. [sent-6, score-0.608]

4 We evaluate our method in two problem domains—multiple sequence alignment and reconstruction of ancestral sequences—and show substantial improvement over the current state of the art. [sent-7, score-0.503]

5 1 Introduction Phylogenetic analysis plays a significant role in modern biological applications such as ancestral sequence reconstruction and multiple sequence alignment [1, 2, 3]. [sent-8, score-0.643]

6 While insertions and deletions (InDels) of nucleotides or amino acids are an important aspect of phylogenetic inference, they pose formidable computational challenges and they are usually handled with heuristics [4, 5, 6]. [sent-9, score-0.566]

7 Concretely, the models considered in the phylogenetic literature take the form of a tree-shaped graphical model where nodes are string-valued random variables representing a fragment of DNA, RNA or protein of a species. [sent-11, score-0.378]

8 Usually, only the terminal nodes are observed, while the internal nodes are hidden. [sent-14, score-0.303]

9 The interpretation is that the sequence at the root is the common ancestor of those at the terminal nodes, and it subsequently evolved in a branching process following the topology of the tree. [sent-15, score-0.352]

10 We will concentrate on the problem of computing the posterior of these hidden nodes rather than the problem of selecting the topology of the tree—hence we will assume the tree is known or estimated with some other algorithm (a guide tree assumption). [sent-16, score-0.401]

11 However, because alignments between the sequences are unknown in practice, it is difficult to exploit this structure in a principled way. [sent-21, score-0.266]

12 In many previous works [4, 5, 6], the following heuristic approach is taken to perform inference on the hidden nodes (refer to Fig. [sent-22, score-0.21]

13 1): First, a guide tree (d) and a multiple sequence alignment (a) (a transitive alignment between the characters in the sequences of the modern species) are computed using heuristics [7, 8]. [sent-23, score-1.083]

14 For each equivalence class in the multiple sequence alignment (called a site, corresponding to a column in Fig. [sent-25, score-0.4]

15 1(b)), a new graphical model is created with the same tree structure as the original problem, but where there is exactly one character in each node rather than a string. [sent-26, score-0.329]

16 Figure 1: Comparison of different approaches to phylogenetic modeling: (a,b,c,d) heuristics based on site independence; (e) Single Sequence Resampling; (f) Ancestry Resampling. [sent-112, score-0.335]

17 in the current equivalence class, the node in this new tree is observed, and the rest of the nodes are considered as unobserved data (Fig. [sent-114, score-0.277]

18 This heuristic has several problems, the most important being that it does not allow explicit modeling of insertions and deletions (InDel), which are frequent in real biological data and play an important role in evolution [9]. [sent-118, score-0.249]

19 The first factor is a random walk behavior that arises in tall chains found in large or unbalanced trees [2, 12]: initially, the InDel events resampled at the top of the tree are independent of all the observations. [sent-129, score-0.329]

20 , putting a bound on the relative positions of characters that mutate from one to the other) to speed things up [12]. [sent-134, score-0.162]

21 In this paper, we present a novel MCMC procedure for phylogenetic InDel models that we refer to as Ancestry Resampling (AR). [sent-137, score-0.24]

22 We assume that a phylogenetic directed tree topology τ = (V, E) is fixed, where nodes in this tree are string-valued random variables, from an alphabet of K characters—K is four in nucleotide sequences and about twenty in amino-acid sequences. [sent-144, score-0.779]

23 We start the description of the model in the simple case of a single branch of known length t, with a string x at the root and a string y at the leaf. [sent-146, score-0.204]

24 There is one rate µ for deletion (death in the original TKF terminology) and one rate λ for insertions, which can occur either to the right of one of the existing characters (birth), or to the left of the sequence (immigration). [sent-148, score-0.233]
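
To make the generative process concrete, here is a minimal Gillespie-style simulation of the TKF91 InDel process along a single branch. This is a sketch under stated assumptions, not the paper's implementation: substitutions are omitted, and the function name and parameter values are illustrative.

    import random

    def simulate_tkf91_branch(x, t, lam, mu, alphabet="ACGT"):
        # Gillespie simulation: each character dies at rate mu and gives
        # birth to a new character on its right at rate lam; an immortal
        # link at the left of the sequence immigrates characters at rate
        # lam. Substitutions are omitted for brevity.
        seq = list(x)
        time = 0.0
        while True:
            n = len(seq)
            total_rate = lam + n * lam + n * mu  # immigration + births + deaths
            time += random.expovariate(total_rate)
            if time >= t:
                return "".join(seq)
            u = random.uniform(0.0, total_rate)
            if u < lam:                             # immigration at the left
                seq.insert(0, random.choice(alphabet))
            elif u < lam + n * lam:                 # birth right of character i
                i = min(int((u - lam) / lam), n - 1)
                seq.insert(i + 1, random.choice(alphabet))
            else:                                   # death of character i
                i = min(int((u - lam - n * lam) / mu), n - 1)
                del seq[i]

    random.seed(0)
    print(simulate_tkf91_branch("ACGTACGT", t=0.5, lam=0.02, mu=0.03))

The immigration events at the immortal link are what let an empty string grow new characters, so the total event rate λ + n(λ + µ) is always positive.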

25 Fortunately, the TKF91 model has a closed form solution for the conditional distribution over strings y at the leaf given the string x at the root. [sent-150, score-0.188]

26 In defining descendants, we count the character itself, its children, grandchildren, etc. [sent-160, score-0.157]

27 Since we only work with these conditionals, note that the situation resembles that of a standard weighted edit process with a specific, branch-length dependent structure over insertions and deletions. [sent-163, score-0.155]

28 Between each pair of nodes a, b ∈ V connected by an edge and with respective strings x, y , we define an alignment random variable: its values are bipartite matchings between the characters of the strings x and y. [sent-170, score-0.766]

29 Links in this alignment denote survival of a character (allowing zero or more substitutions). [sent-171, score-0.417]

30 Note that this alignment is monotonic: if character i in x is linked to character j in y, then the characters i′ > i in x can only be unlinked or linked to a character with index j′ > j in y. [sent-172, score-0.961]
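
Monotonicity is easy to check operationally. A minimal sketch, assuming an alignment is represented as a set of (i, j) link pairs:

    def is_monotonic(links):
        # links: (i, j) pairs meaning character i of x is linked to
        # character j of y. Monotonicity requires that, sorted by i,
        # both coordinates strictly increase (links never cross or
        # share an endpoint).
        ordered = sorted(links)
        return all(i1 < i2 and j1 < j2
                   for (i1, j1), (i2, j2) in zip(ordered, ordered[1:]))

    print(is_monotonic({(0, 0), (2, 1), (3, 3)}))  # True
    print(is_monotonic({(0, 1), (1, 0)}))          # False: the links cross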

31 The random variable that consists of the alignments and the strings for all the edges and nodes in the phylogenetic tree τ will be called a derivation. [sent-173, score-0.725]

32 Note also that a derivation D defines another graph that we will call a derivation graph. [sent-174, score-0.13]

33 Its nodes are the characters of all the strings in the tree. [sent-175, score-0.37]

34 We put an edge between two characters x, y in this graph iff two properties hold. [sent-176, score-0.195]

35 Let a, b ∈ V be the nodes corresponding to the strings to which x and y respectively belong. [sent-177, score-0.208]

36 We put an edge between x, y iff (1) there is an edge between a and b in E and (2) there is a link between x, y in the alignment of the corresponding strings. [sent-178, score-0.326]

37 While the SSR kernel resamples the whole sequence corresponding to a single node, AR works around the difficulties of SSR by joint resampling of a “thin vertical slice” (Fig. [sent-182, score-0.313]

38 1(f)) in the tree that is composed of a short substring in every node. [sent-183, score-0.177]

39 3.1 Ancestry Resampling We will call one of these “thin slices” an ancestry A, and we now discuss what its definition should be. [sent-186, score-0.256]

40 [Figure residue removed: panel labels (a)-(e) and legend entries "observed characters", "selected characters", "anchor".] Figure 2: (a): the simple guide tree used in this example (left) and the corresponding sequences and alignments (right). [sent-188, score-0.877]

41 Let D be the current derivation and let x be a substring of one of the terminal nodes, say in node e. [sent-194, score-0.273]

42 The ancestry will depend on both a derivation and an anchor. [sent-196, score-0.321]

43 The overall MH sampler is a mixture of proposal distributions indexed by a set of anchors covering all the characters in the terminal strings. [sent-197, score-0.396]

44 Each proposal resamples a new value of A(D, x) given the terminal nodes, keeping A(D, x)ᶜ frozen. [sent-198, score-0.297]

45 We first let A0 (D, x) be the set of characters connected to some character in x in the derivation graph of D (see Fig. [sent-199, score-0.384]

46 First, it does not yield an irreducible chain, as illustrated in the same figure, where nine of the characters of this sample (those inside the dashed curve) will never be resampled, no matter which substring of the terminal node is selected as anchor. [sent-203, score-0.401]

47 For i > 0, we will say that a character token y is in Ai (D, x) if one of the following conditions is true: 1. [sent-208, score-0.157]

48 In words, a symbol is in A∞(D, x) = ∪i≥0 Ai(D, x) if it is linked to an anchored character through the alignments, or if it is “squeezed” between previously connected characters. [sent-214, score-0.191]
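
As one concrete reading of these definitions, the sketch below computes A0 by breadth-first search in the derivation graph and then closes under the "linked" and "squeezed" rules to reach A∞. The data structures (characters as (node, index) pairs, align_edges as an adjacency map, string_len as a node-to-length map) are assumptions made for illustration, not the paper's representation.

    from collections import deque

    def ancestry_sets(anchor_chars, align_edges, string_len):
        # A_0: all characters connected to the anchor in the derivation graph.
        A = set(anchor_chars)
        queue = deque(A)
        while queue:
            c = queue.popleft()
            for d in align_edges.get(c, ()):
                if d not in A:
                    A.add(d)
                    queue.append(d)
        A0 = set(A)
        # A_infinity: alternate the "squeezed" rule (fill in characters lying
        # between two members of A in the same string) and the "linked" rule
        # until a fixpoint is reached.
        changed = True
        while changed:
            changed = False
            for node, length in string_len.items():
                idxs = [i for i in range(length) if (node, i) in A]
                if idxs:
                    for i in range(min(idxs), max(idxs) + 1):
                        if (node, i) not in A:
                            A.add((node, i))
                            changed = True
            for c in list(A):
                for d in align_edges.get(c, ()):
                    if d not in A:
                        A.add(d)
                        changed = True
        return A0, A

    edges = {("a", 0): [("b", 0)], ("b", 0): [("a", 0)]}
    print(ancestry_sets({("a", 0)}, edges, {"a": 2, "b": 1}))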

49 One can see that with the definition of A∞(D, x) given above, from the state in Fig. 2(e), the state in Fig. 2(d) is now unreachable by the same resampling operator, the reason being that the substring labeled z in the figure belongs to the frozen part of the state if the transition is visited backwards. [sent-221, score-0.217]

50 While there exist MCMC methods that are not based on reversible chains [13], we prefer to take a simpler approach: a variation on our definition solves the issue, informally by taking the vertical slices A(D, x) to be the “complement of the ancestry taken on the complement of the anchor.” [sent-222, score-0.412]

51 More precisely, if x′ x x′′ is the string at the anchor node e, we let the resampled section be A(D, x) := (A∞(D, x′) ∪ A∞(D, x′′))ᶜ. [sent-223, score-0.289]

52 We will call A(D, x) the ancestry of the anchor x. [sent-226, score-0.341]

53 The problem of resampling a single slice decomposes along the tree structure τ, but an unbounded number of InDels could occur a priori inside the thin slice. [sent-228, score-0.358]

54 Indeed, the anchors x can be taken relatively small (we used anchors of length 3 to 5 in our experiments). [sent-231, score-0.17]

55 The summation over the possible alignments can be done using a standard quadratic dynamic program known in its max version as the Needleman-Wunsch algorithm [14]. [sent-233, score-0.16]
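
A minimal version of this quadratic dynamic program in its sum form is sketched below; replacing the sums by maximizations (in log space) recovers Needleman-Wunsch. The weight functions are placeholder parameterizations, not the actual TKF91 conditionals.

    def sum_over_alignments(x, y, p_link, p_gap_x, p_gap_y):
        # F[i][j] accumulates the total weight of all monotone alignments
        # of the prefixes x[:i] and y[:j].
        n, m = len(x), len(y)
        F = [[0.0] * (m + 1) for _ in range(n + 1)]
        F[0][0] = 1.0
        for i in range(n + 1):
            for j in range(m + 1):
                if i > 0:
                    F[i][j] += F[i - 1][j] * p_gap_x(x[i - 1])  # x[i-1] unlinked
                if j > 0:
                    F[i][j] += F[i][j - 1] * p_gap_y(y[j - 1])  # y[j-1] unlinked
                if i > 0 and j > 0:
                    F[i][j] += F[i - 1][j - 1] * p_link(x[i - 1], y[j - 1])
        return F[n][m]

    # Toy weights: survival strongly favours identical characters.
    print(sum_over_alignments("ACGT", "AGT",
                              p_link=lambda a, b: 0.9 if a == b else 0.05,
                              p_gap_x=lambda a: 0.1, p_gap_y=lambda b: 0.1))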

56 We formalize closeness as follows: Let a1 , a2 be two values for the ancestry A(D, x). [sent-236, score-0.256]

57 We define the cylindric distance as the maximum over all the nodes e of the Levenshtein edit distance between the substrings in a1 and a2 at node e. [sent-237, score-0.343]
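
The cylindric distance is straightforward to compute once the per-node substrings are in hand. A sketch, assuming each ancestry value is stored as a dict mapping node to substring:

    def levenshtein(s, t):
        # Standard edit distance, computed row by row.
        prev = list(range(len(t) + 1))
        for i, a in enumerate(s, 1):
            cur = [i]
            for j, b in enumerate(t, 1):
                cur.append(min(prev[j] + 1,              # delete a
                               cur[j - 1] + 1,           # insert b
                               prev[j - 1] + (a != b)))  # substitute
            prev = cur
        return prev[-1]

    def cylindric_distance(a1, a2):
        # Maximum over nodes of the per-node edit distance.
        return max(levenshtein(a1[e], a2[e]) for e in a1)

    print(cylindric_distance({"root": "ACG", "leaf": "AC"},
                             {"root": "AG", "leaf": "ACT"}))  # -> 1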

58 The proposal distribution considers the candidate ancestries that are within a ball of radius m centered at the current state in the cylindric metric. [sent-239, score-0.397]

59 Here the number of states in the tree-structured dynamic program at each node is polynomial in the lengths of the strings in the current ancestry. [sent-241, score-0.158]

60 The move is accepted with the standard Metropolis-Hastings probability min{1, [P(ap) × Q(ac | ap)] / [P(ac) × Q(ap | ac)]}, where ac, ap are the current and proposed ancestry values and Q(a2 | a1) is the transition probability of the MH kernel, proportional to P(·), but with support restricted to the cylindric ball centered at a1. [sent-245, score-0.454]
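
One accept/reject step of this kernel can be sketched as follows, working in log space for numerical stability; log_p and log_q are assumed callables standing in for the unnormalized target and the ball-restricted proposal density.

    import math
    import random

    def mh_accept(a_current, a_proposed, log_p, log_q):
        # log_q(a2, a1) is the log probability of proposing a2 from a1,
        # with support restricted to the cylindric ball around a1.
        log_ratio = (log_p(a_proposed) + log_q(a_current, a_proposed)
                     - log_p(a_current) - log_q(a_proposed, a_current))
        if math.log(random.random()) < min(0.0, log_ratio):
            return a_proposed  # accept the move
        return a_current       # reject and keep the current state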

61 4 Experiments We consider two tasks: reconstruction of ancestral sequences and prediction of alignments between multiple genetically-related proteins. [sent-246, score-0.497]

62 We are interested in comparing the ancestry sampling method (AR) presented in this paper with the Markov kernel used in previous literature (SSR). [sent-247, score-0.256]

63 4.1 Reconstruction of ancestral sequences Given a set of genetically-related sequences, the reconstruction task is to infer properties of the common ancestor of these modern species. [sent-249, score-0.326]

64 Just as in the task of topology reconstruction, there are no gold ancestral sequences available to evaluate ancestral sequence reconstruction. [sent-251, score-0.446]

65 For this reason, we take the same approach as in topology reconstruction and perform comparisons on synthetic data [15]. [sent-252, score-0.128]

66 We generated a root node from the DNA alphabet and evolved it down a binary tree of seven nodes. [sent-253, score-0.24]

67 [Plot residue removed: two panels plotting Error against Time; one panel titled "Max dev = MAX".] Figure 3: Left: Single Sequence Resampling versus Ancestry Resampling on the sequence reconstruction task. [sent-260, score-0.142]

68 The task is to predict the sequences at the root node with error measured using the Levenshtein edit distance l. [sent-265, score-0.259]

69 While the maximum deviation heuristic is necessary for SSR to be able to handle the long sequences found in biological datasets, it is not necessary for AR samplers. [sent-284, score-0.153]

70 4.2 Protein multiple sequence alignment We also performed experiments on the task of protein multiple sequence alignment, for which the BAliBASE [16] dataset provides a standard benchmark. [sent-286, score-0.573]

71 Note first that we can get a multiple sequence alignment from an InDel evolutionary model. [sent-288, score-0.537]

72 For a set S of sequences to align, construct a phylogenetic tree such that its terminal leaves coincide with S. [sent-289, score-0.556]

73 A multiple sequence alignment can be extracted from the inferred derivation D as follows: deem the amino acids x, y ∈ S aligned iff y ∈ A0 (D, x). [sent-290, score-0.566]
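
Operationally, this extraction amounts to grouping leaf characters into columns by connected components under the alignment links. A small union-find sketch, with characters represented as hashable (sequence, position) ids (an assumption made for illustration):

    def msa_columns(link_pairs):
        parent = {}

        def find(c):
            parent.setdefault(c, c)
            while parent[c] != c:
                parent[c] = parent[parent[c]]  # path halving
                c = parent[c]
            return c

        for a, b in link_pairs:
            parent[find(a)] = find(b)  # union the two components

        columns = {}
        for c in parent:
            columns.setdefault(find(c), []).append(c)
        return list(columns.values())

    links = [(("s1", 0), ("s2", 0)), (("s2", 0), ("s3", 1))]
    print(msa_columns(links))  # one column containing all three characters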

74 The state-of-the-art for multiple sequence alignment systems based on an evolutionary model is Handel [2]. [sent-291, score-0.537]

75 It is based on TKF91 and produces a multiple sequence alignment as described above. [sent-292, score-0.4]

76 While other heuristic approaches are known to perform better than Handel on this dataset [8, 17], they are not based on explicit evolutionary models. [sent-294, score-0.184]

77 They perform better because they leverage more sophisticated features such as affine gap penalties and hydrophobic core modeling. [sent-295, score-0.174]

78 We ran each system for the same time on the sequences in the ref1 directory of BAliBASE v. [sent-304, score-0.157]

79 Both are recall measures on the subset of the alignments that were labeled, called the core blocks; see, e.g., [16]. [sent-309, score-0.215]

80 In order to investigate where the advantage comes from, we did another multiple alignment experiment, plotting performance after a fixed time as a function of the depth of the trees. [sent-313, score-0.590]

81 The key difference with their approach is that their inference algorithm is based on the SSR move rather than the AR move that we advocate. [sent-318, score-0.397]

82 For short trees, the two algorithms perform equally, SSR beating AR slightly for trees with three nodes, which is not surprising since SSR actually performs exact inference in this tiny topology. [sent-319, score-0.507]

83 However, as the trees get taller, the task becomes more difficult, and only AR maintains good performance. [sent-321, score-0.174]

84 The evolutionary guide trees were built using weighbor [17]; this creates tall trees, but as we have seen, AR still performs very well in this setting. [sent-339, score-0.410]

85 If the random walk argument in the introduction held, we would expect the advantage of AR over SSR to increase as the trees get taller; this prediction is confirmed, as illustrated in Figure 4, middle and right. [sent-341, score-0.148]

86 5 Conclusion We have described a novel ancestry resampling procedure for inference in phylogenetic InDel trees, evaluated it against a state-of-the-art statistical alignment procedure, and shown its clear superiority. [sent-326, score-0.260]

87 In contrast to heuristics such as Clustalw [8], it can be used both for reconstruction of ancestral sequences and multiple alignment. [sent-328, score-0.330]

88 Incorporating affine gap penalties and hydrophobic core modeling is of particular interest, as these features can dramatically improve multiple alignment performance [2]. [sent-335, score-0.713]

89 While these features can be incorporated in our model, integrating affine gap penalties, hydrophobic core modelling, and CRF models is left as future work. [sent-360, score-0.140]

94 Bayesian phylogenetic inference using DNA sequences: A Markov chain Monte Carlo method. [sent-363, score-0.354]

95 CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. [sent-402, score-0.4]

96 An evolutionary model for maximum likelihood alignment of DNA sequences. [sent-416, score-0.397]

97 An efficient algorithm for statistical multiple alignment on arbitrary phylogenetic trees. [sent-425, score-0.564]

98 A general method applicable to the search for similarities in the amino acid sequence of two proteins. [sent-445, score-0.136]

99 Performance study of phylogenetic methods: (unweighted) quartet methods and neighbor-joining. [sent-455, score-0.24]

100 BAliBASE: A benchmark alignments database for the evaluation of multiple sequence alignment programs. [sent-461, score-0.56]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ssr', 0.373), ('ar', 0.32), ('alignment', 0.26), ('ancestry', 0.256), ('phylogenetic', 0.24), ('indel', 0.192), ('characters', 0.162), ('alignments', 0.16), ('character', 0.157), ('resampling', 0.157), ('evolutionary', 0.137), ('tree', 0.117), ('balibase', 0.107), ('handel', 0.107), ('sequences', 0.106), ('nodes', 0.105), ('strings', 0.103), ('ancestral', 0.101), ('terminal', 0.093), ('insertions', 0.091), ('anchor', 0.085), ('anchors', 0.085), ('cylindric', 0.085), ('hydrophobic', 0.085), ('string', 0.085), ('sp', 0.083), ('trees', 0.082), ('sequence', 0.076), ('cs', 0.07), ('molecular', 0.067), ('reconstruction', 0.066), ('derivation', 0.065), ('mh', 0.064), ('edit', 0.064), ('indels', 0.064), ('reversibility', 0.064), ('deletions', 0.064), ('resampled', 0.064), ('multiple', 0.064), ('topology', 0.062), ('substring', 0.06), ('amino', 0.06), ('inference', 0.058), ('ac', 0.058), ('site', 0.057), ('chain', 0.056), ('proposal', 0.056), ('core', 0.055), ('ap', 0.055), ('node', 0.055), ('ancestor', 0.053), ('depth', 0.052), ('directory', 0.051), ('descendants', 0.05), ('dna', 0.05), ('slices', 0.05), ('metrics', 0.048), ('evolution', 0.047), ('heuristic', 0.047), ('slice', 0.047), ('beating', 0.043), ('clustalw', 0.043), ('ctmc', 0.043), ('kishino', 0.043), ('levenshtein', 0.043), ('mes', 0.043), ('nonhyperthermophilic', 0.043), ('resamples', 0.043), ('taller', 0.043), ('thorne', 0.043), ('tourasse', 0.043), ('weighbor', 0.043), ('ai', 0.043), ('acids', 0.041), ('species', 0.041), ('life', 0.041), ('heuristics', 0.038), ('vertical', 0.037), ('irreducibility', 0.037), ('thin', 0.037), ('complement', 0.037), ('holmes', 0.036), ('biology', 0.036), ('linked', 0.034), ('rna', 0.034), ('evolved', 0.034), ('substrings', 0.034), ('tall', 0.034), ('advocate', 0.034), ('extant', 0.034), ('gap', 0.034), ('mcmc', 0.034), ('root', 0.034), ('protein', 0.033), ('edge', 0.033), ('nucleotide', 0.032), ('nucleotides', 0.032), ('detrimental', 0.032), ('chains', 0.032), ('illustrated', 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 70 nips-2008-Efficient Inference in Phylogenetic InDel Trees

Author: Alexandre Bouchard-Côté, Dan Klein, Michael I. Jordan

Abstract: Accurate and efficient inference in evolutionary trees is a central problem in computational biology. While classical treatments have made unrealistic site independence assumptions, ignoring insertions and deletions, realistic approaches require tracking insertions and deletions along the phylogenetic tree—a challenging and unsolved computational problem. We propose a new ancestry resampling procedure for inference in evolutionary trees. We evaluate our method in two problem domains—multiple sequence alignment and reconstruction of ancestral sequences—and show substantial improvement over the current state of the art. 1

2 0.13370547 239 nips-2008-Tighter Bounds for Structured Estimation

Author: Olivier Chapelle, Chuong B. Do, Choon H. Teo, Quoc V. Le, Alex J. Smola

Abstract: Large-margin structured estimation methods minimize a convex upper bound of loss functions. While they allow for efficient optimization algorithms, these convex formulations are not tight and sacrifice the ability to accurately model the true loss. We present tighter non-convex bounds based on generalizing the notion of a ramp loss from binary classification to structured estimation. We show that a small modification of existing optimization algorithms suffices to solve this modified problem. On structured prediction tasks such as protein sequence alignment and web page ranking, our algorithm leads to improved accuracy. 1

3 0.11015169 9 nips-2008-A mixture model for the evolution of gene expression in non-homogeneous datasets

Author: Gerald Quon, Yee W. Teh, Esther Chan, Timothy Hughes, Michael Brudno, Quaid D. Morris

Abstract: We address the challenge of assessing conservation of gene expression in complex, non-homogeneous datasets. Recent studies have demonstrated the success of probabilistic models in studying the evolution of gene expression in simple eukaryotic organisms such as yeast, for which measurements are typically scalar and independent. Models capable of studying expression evolution in much more complex organisms such as vertebrates are particularly important given the medical and scientific interest in species such as human and mouse. We present Brownian Factor Phylogenetic Analysis, a statistical model that makes a number of significant extensions to previous models to enable characterization of changes in expression among highly complex organisms. We demonstrate the efficacy of our method on a microarray dataset profiling diverse tissues from multiple vertebrate species. We anticipate that the model will be invaluable in the study of gene expression patterns in other diverse organisms as well, such as worms and insects. 1

4 0.090937436 175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning

Author: Chunhua Shen, Alan Welsh, Lei Wang

Abstract: In this work, we consider the problem of learning a positive semidefinite matrix. The critical issue is how to preserve positive semidefiniteness during the course of learning. Our algorithm is mainly inspired by LPBoost [1] and the general greedy convex optimization framework of Zhang [2]. We demonstrate the essence of the algorithm, termed PSDBoost (positive semidefinite Boosting), by focusing on a few different applications in machine learning. The proposed PSDBoost algorithm extends traditional Boosting algorithms in that its parameter is a positive semidefinite matrix with trace being one instead of a classifier. PSDBoost is based on the observation that any trace-one positive semidefinite matrix can be decomposed into linear convex combinations of trace-one rank-one matrices, which serve as base learners of PSDBoost. Numerical experiments are presented. 1

5 0.089696087 203 nips-2008-Scalable Algorithms for String Kernels with Inexact Matching

Author: Pavel P. Kuksa, Pai-hsi Huang, Vladimir Pavlovic

Abstract: We present a new family of linear time algorithms for string comparison with mismatches under the string kernels framework. Based on sufficient statistics, our algorithms improve theoretical complexity bounds of existing approaches while scaling well in sequence alphabet size, the number of allowed mismatches and the size of the dataset. In particular, on large alphabets and under loose mismatch constraints our algorithms are several orders of magnitude faster than the existing algorithms for string comparison under the mismatch similarity measure. We evaluate our algorithms on synthetic data and real applications in music genre classification, protein remote homology detection and protein fold prediction. The scalability of the algorithms allows us to consider complex sequence transformations, modeled using longer string features and larger numbers of mismatches, leading to a state-of-the-art performance with significantly reduced running times. 1

6 0.078324459 11 nips-2008-A spatially varying two-sample recombinant coalescent, with applications to HIV escape response

7 0.066634223 4 nips-2008-A Scalable Hierarchical Distributed Language Model

8 0.064103864 84 nips-2008-Fast Prediction on a Tree

9 0.058860503 235 nips-2008-The Infinite Hierarchical Factor Regression Model

10 0.058828406 170 nips-2008-Online Optimization in X-Armed Bandits

11 0.057845451 100 nips-2008-How memory biases affect information transmission: A rational analysis of serial reproduction

12 0.056728352 30 nips-2008-Bayesian Experimental Design of Magnetic Resonance Imaging Sequences

13 0.056164812 129 nips-2008-MAS: a multiplicative approximation scheme for probabilistic inference

14 0.05468294 118 nips-2008-Learning Transformational Invariants from Natural Movies

15 0.054438606 19 nips-2008-An Empirical Analysis of Domain Adaptation Algorithms for Genomic Sequence Analysis

16 0.053135864 115 nips-2008-Learning Bounded Treewidth Bayesian Networks

17 0.052375071 154 nips-2008-Nonparametric Bayesian Learning of Switching Linear Dynamical Systems

18 0.052165937 12 nips-2008-Accelerating Bayesian Inference over Nonlinear Differential Equations with Gaussian Processes

19 0.051995918 34 nips-2008-Bayesian Network Score Approximation using a Metagraph Kernel

20 0.050358001 50 nips-2008-Continuously-adaptive discretization for message-passing algorithms


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.161), (1, -0.003), (2, 0.02), (3, 0.008), (4, 0.049), (5, -0.098), (6, 0.054), (7, 0.062), (8, 0.049), (9, -0.074), (10, -0.045), (11, 0.044), (12, 0.046), (13, -0.025), (14, 0.081), (15, 0.08), (16, -0.001), (17, -0.087), (18, -0.036), (19, 0.005), (20, 0.029), (21, 0.118), (22, -0.156), (23, -0.072), (24, -0.127), (25, 0.097), (26, -0.152), (27, 0.032), (28, 0.032), (29, 0.045), (30, -0.005), (31, 0.027), (32, 0.019), (33, -0.053), (34, 0.02), (35, -0.04), (36, 0.086), (37, 0.09), (38, 0.078), (39, 0.087), (40, -0.055), (41, 0.018), (42, 0.042), (43, 0.005), (44, 0.048), (45, 0.086), (46, -0.072), (47, -0.125), (48, 0.007), (49, 0.005)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94412893 70 nips-2008-Efficient Inference in Phylogenetic InDel Trees

Author: Alexandre Bouchard-Côté, Dan Klein, Michael I. Jordan

Abstract: Accurate and efficient inference in evolutionary trees is a central problem in computational biology. While classical treatments have made unrealistic site independence assumptions, ignoring insertions and deletions, realistic approaches require tracking insertions and deletions along the phylogenetic tree—a challenging and unsolved computational problem. We propose a new ancestry resampling procedure for inference in evolutionary trees. We evaluate our method in two problem domains—multiple sequence alignment and reconstruction of ancestral sequences—and show substantial improvement over the current state of the art. 1

2 0.69322175 11 nips-2008-A spatially varying two-sample recombinant coalescent, with applications to HIV escape response

Author: Alexander Braunstein, Zhi Wei, Shane T. Jensen, Jon D. Mcauliffe

Abstract: Statistical evolutionary models provide an important mechanism for describing and understanding the escape response of a viral population under a particular therapy. We present a new hierarchical model that incorporates spatially varying mutation and recombination rates at the nucleotide level. It also maintains separate parameters for treatment and control groups, which allows us to estimate treatment effects explicitly. We use the model to investigate the sequence evolution of HIV populations exposed to a recently developed antisense gene therapy, as well as a more conventional drug therapy. The detection of biologically relevant and plausible signals in both therapy studies demonstrates the effectiveness of the method. 1

3 0.55725402 9 nips-2008-A mixture model for the evolution of gene expression in non-homogeneous datasets

Author: Gerald Quon, Yee W. Teh, Esther Chan, Timothy Hughes, Michael Brudno, Quaid D. Morris

Abstract: We address the challenge of assessing conservation of gene expression in complex, non-homogeneous datasets. Recent studies have demonstrated the success of probabilistic models in studying the evolution of gene expression in simple eukaryotic organisms such as yeast, for which measurements are typically scalar and independent. Models capable of studying expression evolution in much more complex organisms such as vertebrates are particularly important given the medical and scientific interest in species such as human and mouse. We present Brownian Factor Phylogenetic Analysis, a statistical model that makes a number of significant extensions to previous models to enable characterization of changes in expression among highly complex organisms. We demonstrate the efficacy of our method on a microarray dataset profiling diverse tissues from multiple vertebrate species. We anticipate that the model will be invaluable in the study of gene expression patterns in other diverse organisms as well, such as worms and insects. 1

4 0.53406018 183 nips-2008-Predicting the Geometry of Metal Binding Sites from Protein Sequence

Author: Paolo Frasconi, Andrea Passerini

Abstract: Metal binding is important for the structural and functional characterization of proteins. Previous prediction efforts have only focused on bonding state, i.e. deciding which protein residues act as metal ligands in some binding site. Identifying the geometry of metal-binding sites, i.e. deciding which residues are jointly involved in the coordination of a metal ion is a new prediction problem that has been never attempted before from protein sequence alone. In this paper, we formulate it in the framework of learning with structured outputs. Our solution relies on the fact that, from a graph theoretical perspective, metal binding has the algebraic properties of a matroid, enabling the application of greedy algorithms for learning structured outputs. On a data set of 199 non-redundant metalloproteins, we obtained precision/recall levels of 75%/46% correct ligand-ion assignments, which improves to 88%/88% in the setting where the metal binding state is known. 1

5 0.52023435 152 nips-2008-Non-stationary dynamic Bayesian networks

Author: Joshua W. Robinson, Alexander J. Hartemink

Abstract: A principled mechanism for identifying conditional dependencies in time-series data is provided through structure learning of dynamic Bayesian networks (DBNs). An important assumption of DBN structure learning is that the data are generated by a stationary process—an assumption that is not true in many important settings. In this paper, we introduce a new class of graphical models called non-stationary dynamic Bayesian networks, in which the conditional dependence structure of the underlying data-generation process is permitted to change over time. Non-stationary dynamic Bayesian networks represent a new framework for studying problems in which the structure of a network is evolving over time. We define the non-stationary DBN model, present an MCMC sampling algorithm for learning the structure of the model from time-series data under different assumptions, and demonstrate the effectiveness of the algorithm on both simulated and biological data. 1

6 0.50687104 236 nips-2008-The Mondrian Process

7 0.49500352 115 nips-2008-Learning Bounded Treewidth Bayesian Networks

8 0.48239729 203 nips-2008-Scalable Algorithms for String Kernels with Inexact Matching

9 0.46649411 18 nips-2008-An Efficient Sequential Monte Carlo Algorithm for Coalescent Clustering

10 0.45193109 30 nips-2008-Bayesian Experimental Design of Magnetic Resonance Imaging Sequences

11 0.44219723 98 nips-2008-Hierarchical Semi-Markov Conditional Random Fields for Recursive Sequential Data

12 0.43640035 188 nips-2008-QUIC-SVD: Fast SVD Using Cosine Trees

13 0.42857674 235 nips-2008-The Infinite Hierarchical Factor Regression Model

14 0.42297429 239 nips-2008-Tighter Bounds for Structured Estimation

15 0.4009667 50 nips-2008-Continuously-adaptive discretization for message-passing algorithms

16 0.39642724 186 nips-2008-Probabilistic detection of short events, with application to critical care monitoring

17 0.39381361 82 nips-2008-Fast Computation of Posterior Mode in Multi-Level Hierarchical Models

18 0.3907702 134 nips-2008-Mixed Membership Stochastic Blockmodels

19 0.38512883 4 nips-2008-A Scalable Hierarchical Distributed Language Model

20 0.37897384 19 nips-2008-An Empirical Analysis of Domain Adaptation Algorithms for Genomic Sequence Analysis


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(6, 0.042), (7, 0.069), (12, 0.043), (28, 0.175), (57, 0.069), (59, 0.021), (63, 0.019), (71, 0.032), (77, 0.043), (78, 0.018), (83, 0.075), (87, 0.313)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.76973224 70 nips-2008-Efficient Inference in Phylogenetic InDel Trees

Author: Alexandre Bouchard-Côté, Dan Klein, Michael I. Jordan

Abstract: Accurate and efficient inference in evolutionary trees is a central problem in computational biology. While classical treatments have made unrealistic site independence assumptions, ignoring insertions and deletions, realistic approaches require tracking insertions and deletions along the phylogenetic tree—a challenging and unsolved computational problem. We propose a new ancestry resampling procedure for inference in evolutionary trees. We evaluate our method in two problem domains—multiple sequence alignment and reconstruction of ancestral sequences—and show substantial improvement over the current state of the art. 1

2 0.63829082 138 nips-2008-Modeling human function learning with Gaussian processes

Author: Thomas L. Griffiths, Chris Lucas, Joseph Williams, Michael L. Kalish

Abstract: Accounts of how people learn functional relationships between continuous variables have tended to focus on two possibilities: that people are estimating explicit functions, or that they are performing associative learning supported by similarity. We provide a rational analysis of function learning, drawing on work on regression in machine learning and statistics. Using the equivalence of Bayesian linear regression and Gaussian processes, we show that learning explicit rules and using similarity can be seen as two views of one solution to this problem. We use this insight to define a Gaussian process model of human function learning that combines the strengths of both approaches. 1

3 0.57833123 176 nips-2008-Partially Observed Maximum Entropy Discrimination Markov Networks

Author: Jun Zhu, Eric P. Xing, Bo Zhang

Abstract: Learning graphical models with hidden variables can offer semantic insights to complex data and lead to salient structured predictors without relying on expensive, sometime unattainable fully annotated training data. While likelihood-based methods have been extensively explored, to our knowledge, learning structured prediction models with latent variables based on the max-margin principle remains largely an open problem. In this paper, we present a partially observed Maximum Entropy Discrimination Markov Network (PoMEN) model that attempts to combine the advantages of Bayesian and margin based paradigms for learning Markov networks from partially labeled data. PoMEN leads to an averaging prediction rule that resembles a Bayes predictor that is more robust to overfitting, but is also built on the desirable discriminative laws resemble those of the M3 N. We develop an EM-style algorithm utilizing existing convex optimization algorithms for M3 N as a subroutine. We demonstrate competent performance of PoMEN over existing methods on a real-world web data extraction task. 1

4 0.57810879 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization

Author: Liu Yang, Rong Jin, Rahul Sukthankar

Abstract: The cluster assumption is exploited by most semi-supervised learning (SSL) methods. However, if the unlabeled data is merely weakly related to the target classes, it becomes questionable whether driving the decision boundary to the low density regions of the unlabeled data will help the classification. In such case, the cluster assumption may not be valid; and consequently how to leverage this type of unlabeled data to enhance the classification accuracy becomes a challenge. We introduce “Semi-supervised Learning with Weakly-Related Unlabeled Data” (SSLW), an inductive method that builds upon the maximum-margin approach, towards a better usage of weakly-related unlabeled information. Although the SSLW could improve a wide range of classification tasks, in this paper, we focus on text categorization with a small training pool. The key assumption behind this work is that, even with different topics, the word usage patterns across different corpora tends to be consistent. To this end, SSLW estimates the optimal wordcorrelation matrix that is consistent with both the co-occurrence information derived from the weakly-related unlabeled documents and the labeled documents. For empirical evaluation, we present a direct comparison with a number of stateof-the-art methods for inductive semi-supervised learning and text categorization. We show that SSLW results in a significant improvement in categorization accuracy, equipped with a small training set and an unlabeled resource that is weakly related to the test domain.

5 0.57573688 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data

Author: Xuming He, Richard S. Zemel

Abstract: Extensive labeled data for image annotation systems, which learn to assign class labels to image regions, is difficult to obtain. We explore a hybrid model framework for utilizing partially labeled data that integrates a generative topic model for image appearance with discriminative label prediction. We propose three alternative formulations for imposing a spatial smoothness prior on the image labels. Tests of the new models and some baseline approaches on three real image datasets demonstrate the effectiveness of incorporating the latent structure. 1

6 0.57544714 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning

7 0.57387167 26 nips-2008-Analyzing human feature learning as nonparametric Bayesian inference

8 0.57366502 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations

9 0.57302171 4 nips-2008-A Scalable Hierarchical Distributed Language Model

10 0.57298779 194 nips-2008-Regularized Learning with Networks of Features

11 0.57193333 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE

12 0.57182652 113 nips-2008-Kernelized Sorting

13 0.5712136 231 nips-2008-Temporal Dynamics of Cognitive Control

14 0.56955481 245 nips-2008-Unlabeled data: Now it helps, now it doesn't

15 0.569314 197 nips-2008-Relative Performance Guarantees for Approximate Inference in Latent Dirichlet Allocation

16 0.56875199 64 nips-2008-DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification

17 0.56870407 130 nips-2008-MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of Images and Visual Features

18 0.56821227 179 nips-2008-Phase transitions for high-dimensional joint support recovery

19 0.56820524 201 nips-2008-Robust Near-Isometric Matching via Structured Learning of Graphical Models

20 0.56703806 246 nips-2008-Unsupervised Learning of Visual Sense Models for Polysemous Words