nips nips2004 nips2004-108 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Mark Craven, Joseph Bockhorst
Abstract: Many sequential prediction tasks involve locating instances of patterns in sequences. Generative probabilistic language models, such as hidden Markov models (HMMs), have been successfully applied to many of these tasks. A limitation of these models however, is that they cannot naturally handle cases in which pattern instances overlap in arbitrary ways. We present an alternative approach, based on conditional Markov networks, that can naturally represent arbitrarily overlapping elements. We show how to efficiently train and perform inference with these models. Experimental results from a genomics domain show that our models are more accurate at locating instances of overlapping patterns than are baseline models based on HMMs. 1
Reference: text
sentIndex sentText sentNum sentScore
1 A limitation of these models, however, is that they cannot naturally handle cases in which pattern instances overlap in arbitrary ways. [sent-9, score-0.185]
2 We present an alternative approach, based on conditional Markov networks, that can naturally represent arbitrarily overlapping elements. [sent-10, score-0.179]
3 Experimental results from a genomics domain show that our models are more accurate at locating instances of overlapping patterns than are baseline models based on HMMs. [sent-12, score-0.273]
4 1 Introduction Hidden Markov models (HMMs) and related probabilistic sequence models have been among the most accurate methods used for sequence-based prediction tasks in genomics, natural language processing and other problem domains. [sent-13, score-0.152]
5 One key limitation of these models, however, is that they cannot represent general overlaps among sequence elements in a concise and natural manner. [sent-14, score-0.198]
6 We present a novel approach to modeling and predicting overlapping sequence elements that is based on undirected Markov networks. [sent-15, score-0.298]
7 Our work is motivated by the task of predicting DNA sequence elements involved in the regulation of gene expression in bacteria. [sent-16, score-0.276]
8 Like HMM-based methods, our approach is able to represent and exploit relationships among different sequence elements of interest. [sent-17, score-0.176]
9 In contrast to HMMs, however, our approach can naturally represent sequence elements that overlap in arbitrary ways. [sent-18, score-0.252]
10 We describe and evaluate our approach in the context of predicting a bacterial genome’s genes and regulatory “signals” (together its regulatory elements). [sent-19, score-0.409]
11 Part of the process of understanding a given genome is to assemble a “parts list”, often using computational methods, of its regulatory elements. [sent-20, score-0.12]
12 Predictions, in this case, entail specifying the start and end coordinates of subsequences of interest. [sent-21, score-0.16]
13 It is common in bacterial genomes for these important sequence elements to overlap. [sent-22, score-0.275]
14 Figure 1: (a) Example arrangement of two genes, three promoters and one terminator in a DNA sequence. [sent-23, score-0.717]
15 Our approach to predicting overlapping sequence elements, which is based on discriminatively trained undirected graphical models called conditional Markov networks [5, 10] (also called conditional random fields), uses two key steps to make a set of predictions. [sent-27, score-0.363]
16 In the first step, candidate elements are generated by having a set of models independently make predictions. [sent-28, score-0.145]
17 Consider the task of predicting gene, promoter, and terminator elements encoded in bacterial DNA. [sent-30, score-0.478]
18 Regulatory elements often overlap each other, for example prom2 and prom3 or gene1 and prom2 in Figure 1. [sent-35, score-0.134]
19 One technique for predicting these elements is first to train a probabilistic sequence model for each element type (e. [sent-36, score-0.261]
20 Although this approach can predict overlapping elements, it is limited since it ignores inter-element dependencies. [sent-39, score-0.118]
21 Given an input sequence, this HMM defines a probability distribution over parses, partitionings of the sequence into subsequences corresponding to elements and the regions between them. [sent-44, score-0.215]
22 These models are not naturally suited to representing overlapping elements. [sent-45, score-0.139]
23 For the case shown in Figure 1(a), for example, even if the subsequences for gene1 and prom2 match their respective sub-models very well, there is a competition between the predictions of gene1 and prom2 because both elements cannot be in the same parse. [sent-46, score-0.247]
24 One could expand the state set to include states for specific overlap situations; however, the number of states increases exponentially with the number of overlap configurations. [sent-47, score-0.108]
25 These models, however, assume a fixed number of loosely connected processes evolving in parallel, which is not a good match to our genomics domain. [sent-49, score-0.136]
26 Like HMMs, our method, called CMN-OP (conditional Markov networks for overlapping patterns), employs element-specific sub-models and probabilistic constraints on neighboring elements qualitatively expressed in a graph. [sent-50, score-0.17]
27 The left side of (a) shows a sequence of length eight for which an HMM has predicted that an element of interest occupies two subsequences, [1:3] and [6:7]. [sent-55, score-0.129]
28 The darker subsequences, [4:5] and [8:8], represent sequence regions between predicted elements. [sent-56, score-0.126]
29 The left side of (b) shows four predicted elements made by a CMN-OP model. [sent-58, score-0.11]
30 As with standard Markov networks, a CMN consists of a qualitative graphical component G = (V, E) with vertex set V and edge set E that encodes a set of conditional independence assertions along with a quantitative component in the form of a set of potentials Φ over the cliques of G. [sent-62, score-0.121]
31 Each clique, q = (Xq , Yq ), in the clique set Q(G) has a potential function φq (xq , yq ) ∈ Φ that assigns a non-negative number to each of the joint settings of (Xq , Yq ). [sent-65, score-0.155]
32 A CMN (G, Φ) defines the conditional probability distribution Pr(y|x) = (1/Z(x)) ∏_{q∈Q(G)} φ_q(x_q, y_q), where Z(x) = Σ_{y′} ∏_{q∈Q(G)} φ_q(x_q, y′_q) is the x-dependent normalization factor called the partition function. [sent-66, score-0.141]
33 A common representation for the potentials φ_q(y_q, x_q) is a log-linear model: φ_q(y_q, x_q) = exp{ Σ_b w_q^b f_q^b(y_q, x_q) } = exp{ w_q^T · f_q(y_q, x_q) }. [sent-68, score-0.52]
34 Here w_q^b is the weight of feature f_q^b, and w_q and f_q are column vectors of q's weights and features. [sent-69, score-0.192]
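As a concrete illustration of this log-linear representation, the minimal sketch below evaluates a single clique potential from a weight vector and a feature vector. The function name and the example feature values are hypothetical, not taken from the paper.

```python
import math

def clique_potential(weights, features):
    """Log-linear clique potential: phi_q(y_q, x_q) = exp(w_q . f_q(y_q, x_q))."""
    assert len(weights) == len(features)
    return math.exp(sum(w * f for w, f in zip(weights, features)))

# Hypothetical example: two features active for a particular setting of (x_q, y_q).
w_q = [1.2, -0.4]
f_q = [1.0, 1.0]
print(clique_potential(w_q, f_q))   # exp(0.8) ~ 2.23
```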
35 Given a sequence x of length L, our task is to identify the types and locations of all instances of patterns in P = {P_1, ..., P_N}. [sent-71, score-0.115]
36 In the genomics domain x is a DNA sequence and P is a set of regulatory elements such as {gene, promoter, terminator}. [sent-75, score-0.316]
37 A match m of a pattern to x specifies a subsequence xi:j and a pattern type Pk ∈ P. [sent-76, score-0.304]
38 We denote the set of all matches of pattern types in P to x with M(P, x). [sent-77, score-0.179]
39 The a-th pattern match Y_a is conditionally independent of its non-neighbors given its neighbors X, Y_{a−1} and Y_{a+1}. [sent-83, score-0.181]
40 (b) The interaction graph we use in the regulatory element prediction task. [sent-84, score-0.228]
41 Vertices are the pattern types along with START and END. [sent-85, score-0.122]
42 Edges from START connect to pattern types that may be the first matches. Edges into END come from pattern types that may be the last matches. [sent-87, score-0.301]
43 to overlap; however, we assume that no two matches in C have the same start index. [sent-88, score-0.182]
44 Thus, the maximum size of a configuration C is L, and the elements of C may be ordered by start position such that m_a ≤ m_{a+1}. [sent-89, score-0.193]
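A match, and hence an element of a configuration, can be represented as a small record of its pattern type and coordinates. The sketch below is one possible encoding; the class and field names are illustrative, not taken from the paper.

```python
from typing import List, NamedTuple

class Match(NamedTuple):
    pattern: str   # e.g. "gene", "promoter", "terminator"
    start: int     # start index i of the subsequence x[i:j]
    end: int       # end index j

def order_configuration(matches: List[Match]) -> List[Match]:
    # A configuration orders its matches by start position; overlaps are allowed,
    # but no two matches share the same start index.
    return sorted(matches, key=lambda m: m.start)

config = order_configuration([Match("promoter", 120, 160), Match("gene", 150, 900)])
```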
45 Our models define a conditional probability distribution over configurations given an input sequence x. [sent-90, score-0.148]
46 Our models assume that a pattern match is independent of other matches given its neighbors. [sent-102, score-0.235]
47 We define the clique potential of q_a, for a ≥ 1, as the product of a pattern match term g(y_a, x) and a pattern interaction term h(y_a, y_{a+1}, x). [sent-108, score-0.454]
48 The first clique q1 includes an additional start placement term α(y1 , x) that scores the type and position of the first match y1 . [sent-110, score-0.326]
49 To ensure that real matches come before any null settings and that additional null settings do not affect Pr(y|x), we require that g(null, x) = 1, h(null, null, x) = 1 and h(null, y_a, x) = 0 for all x and y_a ≠ null. [sent-111, score-0.88]
50 The pattern match term measures the agreement between the matched subsequence and the pattern type associated with y_a. [sent-112, score-0.333]
51 In the genomics domain our representation of the sequence match term is based on regulatory element specific HMMs. [sent-113, score-0.358]
52 The pattern interaction term measures the compatibility between the types and spacing (or overlap) of adjacent matches. [sent-114, score-0.238]
53 Using the log-linear representation for g() and h() we have Pr(y|x) = (α(y_1)/Z(x)) exp{ Σ_{a=1}^{L} w_g^T · f_g(y_a, x) + w_h^T · f_h(y_a, y_{a+1}, x) }. [sent-116, score-0.245]
54 Here w_g, f_g, w_h and f_h are g() and h()'s weights and features. [sent-117, score-0.245]
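The unnormalized score of a padded configuration is thus, roughly, a product of one start placement term, the pattern match terms, and the pattern interaction terms between consecutive entries. A minimal sketch of that accumulation is below; alpha, g and h are passed in as callables because their definitions come from the element sub-models and the interaction graph, and the handling of the final boundary is simplified.

```python
def config_score(y, x, alpha, g, h):
    """Unnormalized score of a padded configuration y = (y_1, ..., y_L).

    alpha(y1, x): start placement term; g(ya, x): pattern match term;
    h(ya, yb, x): pattern interaction term. Dividing by Z(x) would give Pr(y|x).
    The null-padding conventions (g(null)=1, h(null,null)=1) are assumed to be
    implemented inside g and h.
    """
    score = alpha(y[0], x)
    for a in range(len(y)):
        score *= g(y[a], x)
        if a + 1 < len(y):
            score *= h(y[a], y[a + 1], x)
    return score
```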
55 1 Representation Our representation of the pattern match function g() is based on HMMs. [sent-121, score-0.151]
56 We construct an HMM with parameters Θk for each pattern type Pk along with a single background HMM with parameters ΘB . [sent-122, score-0.111]
57 The pattern match score of y_a ≠ null with subsequence x_{i:j} and pattern type P_k is the odds Pr(x_{i:j}|Θ_k) / Pr(x_{i:j}|Θ_B). [sent-123, score-1.112]
58 We have a feature f_g^k(y_a, x) for each pattern type P_k whose value is the logarithm of the odds if the pattern associated with y_a is P_k and zero otherwise. [sent-124, score-1.022]
59 So, w_g^T · f_g(y_a, x) = f_g^k(y_a, x) = log(Pr(x_{i:j}|Θ_k) / Pr(x_{i:j}|Θ_B)), where P_k is the pattern of y_a. [sent-126, score-1.007]
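A sketch of this log-odds pattern match score is below. The hmm_log_likelihood argument is a placeholder for whatever forward-algorithm implementation scores a subsequence under an element HMM or the background HMM; it is not an API from the paper.

```python
def pattern_match_log_odds(x, i, j, element_hmm, background_hmm, hmm_log_likelihood):
    """f_g^k for a match of x[i:j] to pattern type P_k:
    log( Pr(x[i:j] | Theta_k) / Pr(x[i:j] | Theta_B) )."""
    subseq = x[i:j]
    return (hmm_log_likelihood(subseq, element_hmm)
            - hmm_log_likelihood(subseq, background_hmm))
```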
60 Our representation of the pattern interaction function h() consists of two components: (i) a directed graph I called the interaction graph that contains a vertex for each pattern type in P along with special vertices START and END, and (ii) a set of weighted features for each edge in I. [sent-127, score-0.402]
61 The interaction graph encodes qualitative domain knowledge about allowable orderings of pattern types. [sent-128, score-0.199]
62 The value of h(y_a, y_{a+1}, x) = w_h^T · f_h(y_a, y_{a+1}, x) is non-zero only if there is an edge in I from the pattern type associated with y_a to the pattern type associated with y_{a+1}. [sent-129, score-1.127]
63 Figure 3(b) shows the interaction graph we use to predict bacterial regulatory elements. [sent-131, score-0.329]
64 It asserts that between the start positions of two genes there may be no element starts, a single terminator start or zero or more promoter starts, with the requirement that all promoters start after the start of the terminator. [sent-132, score-0.882]
65 Note that in CMN-OP models, the interaction graph indicates legal orderings over the start positions of matches, not over complete matches as in an HMM. [sent-133, score-0.344]
66 Each of the pattern interaction features f ∈ f_h is associated with an edge in the interaction graph I. [sent-134, score-0.33]
67 Each edge e in I has a single bias feature f_e^b and a set of distance features f_e^D. [sent-135, score-0.297]
68 The value of f_e^b(y_a, y_{a+1}, x) is 1 if the pattern types connected by e correspond to the types associated with y_a and y_{a+1}, and 0 otherwise. [sent-136, score-1.038]
69 The distance features for edge e provide a discretized representation of the distance between (or amount of overlap of) two adjacent matches of types consistent with e. [sent-137, score-0.212]
70 We associate each distance feature f_e^r ∈ f_e^D with a range r. [sent-138, score-0.262]
71 The value of f_e^r(y_a, y_{a+1}, x) is 1 if the (possibly negative) difference between the start position of y_{a+1} and the end position of y_a is in r, and 0 otherwise. [sent-139, score-1.06]
72 So, h(y_a, y_{a+1}, x) = exp(w_h^T · f_h(y_a, y_{a+1}, x)) = exp(w_e^b + w_e^r), where e is the edge for y_a and y_{a+1}, w_e^b is the weight of the bias feature f_e^b, and w_e^r is the weight of the single distance feature f_e^r whose range contains the spacing between the matches of y_a and y_{a+1}. [sent-141, score-1.909]
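The sketch below evaluates h() for one pair of adjacent matches, using the Match records sketched earlier. The dictionaries holding the edge bias weights, the per-bin distance weights, and the bin boundaries are hypothetical containers chosen for illustration; it is meant only to show how the bias weight and the single active distance weight combine.

```python
import math

def interaction_term(ya, yb, edge_bias_w, edge_dist_w, dist_bins):
    """h(y_a, y_{a+1}, x) = exp(w_e^b + w_e^r), or 0 if the interaction graph has no edge.

    ya, yb: Match records (pattern, start, end).
    edge_bias_w: dict mapping (type_a, type_b) -> bias weight w_e^b.
    edge_dist_w: dict mapping ((type_a, type_b), bin_index) -> distance weight w_e^r.
    dist_bins: sorted upper bounds of the discretized spacing ranges.
    """
    edge = (ya.pattern, yb.pattern)
    if edge not in edge_bias_w:
        return 0.0                              # no edge from ya's type to yb's type in I
    spacing = yb.start - ya.end                 # negative when the two matches overlap
    r = next((i for i, upper in enumerate(dist_bins) if spacing <= upper),
             len(dist_bins) - 1)                # spacing beyond all bounds falls in the last bin
    return math.exp(edge_bias_w[edge] + edge_dist_w[(edge, r)])
```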
73 Our inference procedure exploits two properties of our representation of the pattern interaction function h(). [sent-146, score-0.166]
74 First, we use the invariance of h(ya , ya+1 , x) to the start position of ya and the end position of ya+1 . [sent-147, score-0.929]
75 In this section, we make this explicit by writing h(y_a, y_{a+1}, x) as h(k, k′, d), where k and k′ are the pattern types of y_a and y_{a+1} respectively, and d is the distance between (or overlap of, if negative) y_a and y_{a+1}. [sent-148, score-1.666]
76 The forward pass fills an L × L × N matrix F where we define F(i, j, k) to be the sum of the scores of all partial configurations ỹ that end with y*, where y* is the match of x_{i:j} to P_k: F(i, j, k) ≡ g(y*, x) Σ_{ỹ} α(y_1, x) ∏_{y_a ∈ ỹ\y*} g(y_a, x) h(y_a, y_{a+1}, x). Here ỹ = (y_1, y_2, ... [sent-151, score-0.909]
77 The first term is for updating F_in and the second term is for the constant-time setting of the O(L^2 N) elements of F. [sent-162, score-0.138]
78 If the sequence length L dominates N and B, as it does in the gene regulation domain, the effective running time is O(L^2). [sent-163, score-0.143]
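A rough outline of the forward pass is given below. It fills F with a naive inner sum over all earlier matches; the running-sum update (F_in) that makes the paper's algorithm efficient is deliberately omitted, and alpha, g and h are placeholders for the start, match and interaction terms. This is a sketch of the recursion's structure, not the paper's implementation.

```python
def forward_pass(L, N, alpha, g, h):
    """Naive forward recursion over matches (i, j, k) = (start, end, pattern type):

    F[i][j][k] = g(i, j, k) * ( alpha(i, j, k)
                 + sum over earlier matches (i2, j2, k2) with i2 < i of
                   F[i2][j2][k2] * h(k2, k, i - j2) )
    """
    F = [[[0.0] * N for _ in range(L + 1)] for _ in range(L + 1)]
    for i in range(L):                         # start of the current match
        for j in range(i + 1, L + 1):          # current match occupies x[i:j]
            for k in range(N):
                total = alpha(i, j, k)         # current match is the first match
                for i2 in range(i):            # earlier matches start strictly before i
                    for j2 in range(i2 + 1, L + 1):
                        for k2 in range(N):
                            total += F[i2][j2][k2] * h(k2, k, i - j2)
                F[i][j][k] = g(i, j, k) * total
    return F
```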
79 An element d of D is a pair (x_d, y_d) where x_d is a fully observable sequence and y_d is a partially observable configuration for x_d. [sent-165, score-0.313]
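The excerpt does not spell out the training objective, but conditionally trained log-linear models of this form are typically fit by gradient ascent on the conditional log-likelihood. For a partially observable training configuration the gradient has the familiar two-expectation form below; this is the generic result for conditional log-linear models, stated as background rather than as the paper's exact procedure.

```latex
\frac{\partial}{\partial \mathbf{w}} \log \Pr(\hat{y}_d \mid x_d)
  \;=\; \mathbb{E}_{y \sim \Pr(\cdot \mid x_d,\, \hat{y}_d)}\!\left[\mathbf{f}(x_d, y)\right]
  \;-\; \mathbb{E}_{y \sim \Pr(\cdot \mid x_d)}\!\left[\mathbf{f}(x_d, y)\right]
```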
80 4 Empirical Evaluation In this section we evaluate our Markov network approach by applying it to recognize regulatory signals in the E. coli genome. [sent-174, score-0.131]
81 Our hypothesis is that the CMN-OP models will provide more accurate predictions than either of two baselines: (i) predicting the signals independently, and (ii) predicting the signals using an HMM. [sent-176, score-0.245]
82 Figure 4: Precision-recall curves for the CMN-OP, HMM and SCAN models on (a) the promoter localization task, (b) the terminator localization task and (c) the terminator localization task for terminators known to overlap genes or promoters. [sent-202, score-1.011]
83 candidate promoters and the second submodel is a stochastic context free grammar (SCFG) that is used to predict candidate terminators. [sent-203, score-0.261]
84 The first baseline approach, which we refer to as SCAN, involves “scanning” a promoter model and a terminator model along each sequence being processed, and at each position producing a score indicating the likelihood that a promoter or terminator starts at that position. [sent-204, score-0.965]
85 We have the HMM and CMN-OP models make terminator and promoter predictions for each position in each test sequence. [sent-208, score-0.515]
86 We do this using posterior decoding, which involves having a model compute the probability that a promoter (terminator) ends at a specified position, given that the model somehow explains the sequence. [sent-209, score-0.225]
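Posterior decoding here amounts to marginalizing over all configurations that place an element boundary at a given position. The schematic sketch below assumes forward and backward matrices F and B defined so that their product, normalized by Z(x), gives the marginal of a single match; the exact bookkeeping of which factors live in F versus B is not reproduced from the paper.

```python
def end_position_posteriors(F, B, Z, L, k):
    """Posterior probability that an element of type k ends at position j (1..L),
    obtained by summing the marginal of every match (i, j, k) over start i."""
    return [sum(F[i][j][k] * B[i][j][k] for i in range(j)) / Z
            for j in range(1, L + 1)]
```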
87 coli genome that collectively contain 471 known promoters and 211 known terminators. [sent-211, score-0.174]
88 Using tenfold cross-validation, we evaluate the three methods by considering how well each method is able to localize predicted promoters and terminators in the test sequences. [sent-212, score-0.253]
89 Under this evaluation criterion, a correct prediction predicts a promoter (terminator) within k bases of an actual promoter (terminator). [sent-213, score-0.389]
90 We set k to 10 for promoters and to 25 for terminators. [sent-214, score-0.118]
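Under this criterion, precision and recall can be computed by matching each predicted site to a not-yet-used actual site within k bases. The sketch below is one reasonable reading of the criterion (greedy matching on sorted positions), not necessarily the exact procedure used in the experiments.

```python
def localization_pr(predicted, actual, k):
    """predicted, actual: lists of site positions; a prediction is correct if it
    lies within k bases of some not-yet-matched actual site."""
    unmatched = sorted(actual)
    correct = 0
    for p in sorted(predicted):
        hit = next((a for a in unmatched if abs(p - a) <= k), None)
        if hit is not None:
            unmatched.remove(hit)
            correct += 1
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(actual) if actual else 0.0
    return precision, recall

# k = 10 for promoters, k = 25 for terminators.
```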
91 Figures 4(a) and 4(b) show PR curves for the promoter and terminator localization tasks, respectively. [sent-217, score-0.456]
92 The difference is not so marked for promoter localization, however. [sent-221, score-0.183]
93 Overall, we conclude that these results show the benefits of representing relationships among predicted signals (as is done in the HMMs and CMN-OP models) and being able to represent and predict overlapping signals. [sent-223, score-0.206]
94 Figure 4(c) shows the PR curves specifically for a set of filtered test sets in which each actual terminator overlaps either a gene or a promoter. [sent-224, score-0.315]
95 5 Conclusion We have presented an approach, based on Markov networks, able to naturally represent and predict overlapping sequence elements. [sent-226, score-0.236]
96 Our approach first generates a set of candidate elements by having a set of models independently make predictions. [sent-227, score-0.145]
97 We have empirically validated our approach by using it to recognize promoter and terminator “signals” in a bacterial genome. [sent-229, score-0.528]
98 Predicting bacterial transcription units using sequence and expression data. [sent-244, score-0.243]
99 Characterization of prokaryotic and eukaryotic promoters using hidden Markov models. [sent-297, score-0.118]
100 A novel bacterial gene-finding system with improved accuracy in locating start codons. [sent-317, score-0.221]
wordName wordTfidf (topN-words)
[('ya', 0.745), ('terminator', 0.225), ('promoter', 0.183), ('hmm', 0.179), ('pr', 0.131), ('fe', 0.131), ('bacterial', 0.12), ('promoters', 0.118), ('terminators', 0.105), ('yq', 0.095), ('xq', 0.094), ('regulatory', 0.094), ('overlapping', 0.09), ('pattern', 0.082), ('elements', 0.08), ('hmms', 0.08), ('sequence', 0.075), ('start', 0.071), ('scan', 0.07), ('match', 0.069), ('gene', 0.068), ('dna', 0.067), ('genomics', 0.067), ('markov', 0.066), ('fh', 0.065), ('xd', 0.063), ('pk', 0.062), ('interaction', 0.061), ('cmn', 0.06), ('prom', 0.06), ('wg', 0.06), ('wh', 0.06), ('subsequences', 0.06), ('clique', 0.06), ('fg', 0.06), ('matches', 0.057), ('overlap', 0.054), ('predicting', 0.053), ('fin', 0.052), ('erence', 0.048), ('fq', 0.048), ('wq', 0.048), ('transcription', 0.048), ('localization', 0.048), ('genes', 0.048), ('conditional', 0.046), ('cmns', 0.045), ('yd', 0.044), ('position', 0.042), ('subsequence', 0.042), ('qa', 0.042), ('cliques', 0.04), ('types', 0.04), ('submodel', 0.039), ('null', 0.039), ('di', 0.039), ('candidate', 0.038), ('predictions', 0.038), ('signals', 0.037), ('yl', 0.037), ('guration', 0.035), ('edge', 0.035), ('forward', 0.034), ('cw', 0.033), ('pass', 0.032), ('baseline', 0.032), ('ath', 0.03), ('bockhorst', 0.03), ('coli', 0.03), ('orderings', 0.03), ('submodels', 0.03), ('locating', 0.03), ('predicted', 0.03), ('type', 0.029), ('term', 0.029), ('end', 0.029), ('con', 0.029), ('predict', 0.028), ('models', 0.027), ('adjacent', 0.026), ('placement', 0.026), ('baselines', 0.026), ('discriminatively', 0.026), ('recall', 0.026), ('graph', 0.026), ('genome', 0.026), ('triple', 0.026), ('nes', 0.025), ('odds', 0.024), ('craven', 0.024), ('pictorial', 0.024), ('element', 0.024), ('precision', 0.023), ('inference', 0.023), ('prediction', 0.023), ('overlaps', 0.022), ('naturally', 0.022), ('represent', 0.021), ('madison', 0.021), ('gurations', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 108 nips-2004-Markov Networks for Detecting Overalpping Elements in Sequence Data
Author: Mark Craven, Joseph Bockhorst
Abstract: Many sequential prediction tasks involve locating instances of patterns in sequences. Generative probabilistic language models, such as hidden Markov models (HMMs), have been successfully applied to many of these tasks. A limitation of these models however, is that they cannot naturally handle cases in which pattern instances overlap in arbitrary ways. We present an alternative approach, based on conditional Markov networks, that can naturally represent arbitrarily overlapping elements. We show how to efficiently train and perform inference with these models. Experimental results from a genomics domain show that our models are more accurate at locating instances of overlapping patterns than are baseline models based on HMMs. 1
2 0.12024547 37 nips-2004-Co-Training and Expansion: Towards Bridging Theory and Practice
Author: Maria-florina Balcan, Avrim Blum, Ke Yang
Abstract: Co-training is a method for combining labeled and unlabeled data when examples can be thought of as containing two distinct sets of features. It has had a number of practical successes, yet previous theoretical analyses have needed very strong assumptions on the data that are unlikely to be satisfied in practice. In this paper, we propose a much weaker “expansion” assumption on the underlying data distribution, that we prove is sufficient for iterative cotraining to succeed given appropriately strong PAC-learning algorithms on each feature set, and that to some extent is necessary as well. This expansion assumption in fact motivates the iterative nature of the original co-training algorithm, unlike stronger assumptions (such as independence given the label) that allow a simpler one-shot co-training to succeed. We also heuristically analyze the effect on performance of noise in the data. Predicted behavior is qualitatively matched in synthetic experiments on expander graphs. 1
3 0.06586159 162 nips-2004-Semi-Markov Conditional Random Fields for Information Extraction
Author: Sunita Sarawagi, William W. Cohen
Abstract: We describe semi-Markov conditional random fields (semi-CRFs), a conditionally trained version of semi-Markov chains. Intuitively, a semiCRF on an input sequence x outputs a “segmentation” of x, in which labels are assigned to segments (i.e., subsequences) of x rather than to individual elements xi of x. Importantly, features for semi-CRFs can measure properties of segments, and transitions within a segment can be non-Markovian. In spite of this additional power, exact learning and inference algorithms for semi-CRFs are polynomial-time—often only a small constant factor slower than conventional CRFs. In experiments on five named entity recognition problems, semi-CRFs generally outperform conventional CRFs. 1
4 0.064810082 87 nips-2004-Integrating Topics and Syntax
Author: Thomas L. Griffiths, Mark Steyvers, David M. Blei, Joshua B. Tenenbaum
Abstract: Statistical approaches to language learning typically focus on either short-range syntactic dependencies or long-range semantic dependencies between words. We present a generative model that uses both kinds of dependencies, and can be used to simultaneously find syntactic classes and semantic topics despite having no representation of syntax or semantics beyond statistical dependency. This model is competitive on tasks like part-of-speech tagging and document classification with models that exclusively use short- and long-range dependencies respectively. 1
5 0.061680309 177 nips-2004-Supervised Graph Inference
Author: Jean-philippe Vert, Yoshihiro Yamanishi
Abstract: We formulate the problem of graph inference where part of the graph is known as a supervised learning problem, and propose an algorithm to solve it. The method involves the learning of a mapping of the vertices to a Euclidean space where the graph is easy to infer, and can be formulated as an optimization problem in a reproducing kernel Hilbert space. We report encouraging results on the problem of metabolic network reconstruction from genomic data. 1
6 0.057063289 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification
7 0.052719459 128 nips-2004-Neural Network Computation by In Vitro Transcriptional Circuits
8 0.052431263 13 nips-2004-A Three Tiered Approach for Articulated Object Action Modeling and Recognition
9 0.047639225 202 nips-2004-VDCBPI: an Approximate Scalable Algorithm for Large POMDPs
10 0.047196869 51 nips-2004-Detecting Significant Multidimensional Spatial Clusters
11 0.046293646 6 nips-2004-A Hidden Markov Model for de Novo Peptide Sequencing
12 0.046050746 90 nips-2004-Joint Probabilistic Curve Clustering and Alignment
13 0.045270715 43 nips-2004-Conditional Models of Identity Uncertainty with Application to Noun Coreference
14 0.044324216 44 nips-2004-Conditional Random Fields for Object Recognition
15 0.04427135 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach
16 0.042395789 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
17 0.04037375 156 nips-2004-Result Analysis of the NIPS 2003 Feature Selection Challenge
18 0.03969441 136 nips-2004-On Semi-Supervised Classification
19 0.039046645 107 nips-2004-Making Latin Manuscripts Searchable using gHMM's
20 0.038683478 149 nips-2004-Probabilistic Inference of Alternative Splicing Events in Microarray Data
topicId topicWeight
[(0, -0.122), (1, 0.005), (2, 0.011), (3, -0.052), (4, 0.018), (5, 0.056), (6, 0.027), (7, 0.104), (8, -0.026), (9, -0.078), (10, 0.008), (11, -0.013), (12, -0.064), (13, 0.117), (14, 0.013), (15, -0.016), (16, 0.03), (17, -0.036), (18, 0.044), (19, -0.002), (20, -0.069), (21, 0.073), (22, -0.142), (23, -0.054), (24, -0.047), (25, 0.066), (26, -0.109), (27, -0.057), (28, -0.048), (29, 0.161), (30, -0.101), (31, 0.058), (32, 0.039), (33, 0.006), (34, -0.081), (35, 0.02), (36, -0.015), (37, 0.081), (38, 0.026), (39, -0.06), (40, 0.022), (41, -0.028), (42, 0.094), (43, -0.107), (44, 0.083), (45, 0.077), (46, 0.056), (47, 0.002), (48, -0.103), (49, -0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.94870359 108 nips-2004-Markov Networks for Detecting Overalpping Elements in Sequence Data
Author: Mark Craven, Joseph Bockhorst
Abstract: Many sequential prediction tasks involve locating instances of patterns in sequences. Generative probabilistic language models, such as hidden Markov models (HMMs), have been successfully applied to many of these tasks. A limitation of these models however, is that they cannot naturally handle cases in which pattern instances overlap in arbitrary ways. We present an alternative approach, based on conditional Markov networks, that can naturally represent arbitrarily overlapping elements. We show how to efficiently train and perform inference with these models. Experimental results from a genomics domain show that our models are more accurate at locating instances of overlapping patterns than are baseline models based on HMMs. 1
2 0.60398084 37 nips-2004-Co-Training and Expansion: Towards Bridging Theory and Practice
Author: Maria-florina Balcan, Avrim Blum, Ke Yang
Abstract: Co-training is a method for combining labeled and unlabeled data when examples can be thought of as containing two distinct sets of features. It has had a number of practical successes, yet previous theoretical analyses have needed very strong assumptions on the data that are unlikely to be satisfied in practice. In this paper, we propose a much weaker “expansion” assumption on the underlying data distribution, that we prove is sufficient for iterative cotraining to succeed given appropriately strong PAC-learning algorithms on each feature set, and that to some extent is necessary as well. This expansion assumption in fact motivates the iterative nature of the original co-training algorithm, unlike stronger assumptions (such as independence given the label) that allow a simpler one-shot co-training to succeed. We also heuristically analyze the effect on performance of noise in the data. Predicted behavior is qualitatively matched in synthetic experiments on expander graphs. 1
3 0.52718616 128 nips-2004-Neural Network Computation by In Vitro Transcriptional Circuits
Author: Jongmin Kim, John Hopfield, Erik Winfree
Abstract: The structural similarity of neural networks and genetic regulatory networks to digital circuits, and hence to each other, was noted from the very beginning of their study [1, 2]. In this work, we propose a simple biochemical system whose architecture mimics that of genetic regulation and whose components allow for in vitro implementation of arbitrary circuits. We use only two enzymes in addition to DNA and RNA molecules: RNA polymerase (RNAP) and ribonuclease (RNase). We develop a rate equation for in vitro transcriptional networks, and derive a correspondence with general neural network rate equations [3]. As proof-of-principle demonstrations, an associative memory task and a feedforward network computation are shown by simulation. A difference between the neural network and biochemical models is also highlighted: global coupling of rate equations through enzyme saturation can lead to global feedback regulation, thus allowing a simple network without explicit mutual inhibition to perform the winner-take-all computation. Thus, the full complexity of the cell is not necessary for biochemical computation: a wide range of functional behaviors can be achieved with a small set of biochemical components. 1
4 0.50847828 162 nips-2004-Semi-Markov Conditional Random Fields for Information Extraction
Author: Sunita Sarawagi, William W. Cohen
Abstract: We describe semi-Markov conditional random fields (semi-CRFs), a conditionally trained version of semi-Markov chains. Intuitively, a semiCRF on an input sequence x outputs a “segmentation” of x, in which labels are assigned to segments (i.e., subsequences) of x rather than to individual elements xi of x. Importantly, features for semi-CRFs can measure properties of segments, and transitions within a segment can be non-Markovian. In spite of this additional power, exact learning and inference algorithms for semi-CRFs are polynomial-time—often only a small constant factor slower than conventional CRFs. In experiments on five named entity recognition problems, semi-CRFs generally outperform conventional CRFs. 1
5 0.45307818 80 nips-2004-Identifying Protein-Protein Interaction Sites on a Genome-Wide Scale
Author: Haidong Wang, Eran Segal, Asa Ben-Hur, Daphne Koller, Douglas L. Brutlag
Abstract: Protein interactions typically arise from a physical interaction of one or more small sites on the surface of the two proteins. Identifying these sites is very important for drug and protein design. In this paper, we propose a computational method based on probabilistic relational model that attempts to address this task using high-throughput protein interaction data and a set of short sequence motifs. We learn the model using the EM algorithm, with a branch-and-bound algorithm as an approximate inference for the E-step. Our method searches for motifs whose presence in a pair of interacting proteins can explain their observed interaction. It also tries to determine which motif pairs have high affinity, and can therefore lead to an interaction. We show that our method is more accurate than others at predicting new protein-protein interactions. More importantly, by examining solved structures of protein complexes, we find that 2/3 of the predicted active motifs correspond to actual interaction sites. 1
6 0.43370533 149 nips-2004-Probabilistic Inference of Alternative Splicing Events in Microarray Data
7 0.42186788 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling
8 0.41882476 202 nips-2004-VDCBPI: an Approximate Scalable Algorithm for Large POMDPs
9 0.40555206 177 nips-2004-Supervised Graph Inference
10 0.40208942 43 nips-2004-Conditional Models of Identity Uncertainty with Application to Noun Coreference
11 0.3874532 95 nips-2004-Large-Scale Prediction of Disulphide Bond Connectivity
12 0.3605167 52 nips-2004-Discrete profile alignment via constrained information bottleneck
13 0.35867935 74 nips-2004-Harmonising Chorales by Probabilistic Inference
14 0.35819858 122 nips-2004-Modelling Uncertainty in the Game of Go
15 0.34266609 51 nips-2004-Detecting Significant Multidimensional Spatial Clusters
16 0.33044633 159 nips-2004-Schema Learning: Experience-Based Construction of Predictive Action Models
17 0.32932386 107 nips-2004-Making Latin Manuscripts Searchable using gHMM's
18 0.31877407 29 nips-2004-Beat Tracking the Graphical Model Way
19 0.30703637 205 nips-2004-Who's In the Picture
20 0.30146337 87 nips-2004-Integrating Topics and Syntax
topicId topicWeight
[(13, 0.103), (15, 0.085), (17, 0.012), (26, 0.062), (31, 0.027), (32, 0.019), (33, 0.175), (35, 0.02), (39, 0.016), (50, 0.021), (51, 0.018), (90, 0.335)]
simIndex simValue paperId paperTitle
same-paper 1 0.77061594 108 nips-2004-Markov Networks for Detecting Overalpping Elements in Sequence Data
Author: Mark Craven, Joseph Bockhorst
Abstract: Many sequential prediction tasks involve locating instances of patterns in sequences. Generative probabilistic language models, such as hidden Markov models (HMMs), have been successfully applied to many of these tasks. A limitation of these models however, is that they cannot naturally handle cases in which pattern instances overlap in arbitrary ways. We present an alternative approach, based on conditional Markov networks, that can naturally represent arbitrarily overlapping elements. We show how to efficiently train and perform inference with these models. Experimental results from a genomics domain show that our models are more accurate at locating instances of overlapping patterns than are baseline models based on HMMs. 1
2 0.73421663 150 nips-2004-Proximity Graphs for Clustering and Manifold Learning
Author: Richard S. Zemel, Miguel Á. Carreira-Perpiñán
Abstract: Many machine learning algorithms for clustering or dimensionality reduction take as input a cloud of points in Euclidean space, and construct a graph with the input data points as vertices. This graph is then partitioned (clustering) or used to redefine metric information (dimensionality reduction). There has been much recent work on new methods for graph-based clustering and dimensionality reduction, but not much on constructing the graph itself. Graphs typically used include the fullyconnected graph, a local fixed-grid graph (for image segmentation) or a nearest-neighbor graph. We suggest that the graph should adapt locally to the structure of the data. This can be achieved by a graph ensemble that combines multiple minimum spanning trees, each fit to a perturbed version of the data set. We show that such a graph ensemble usually produces a better representation of the data manifold than standard methods; and that it provides robustness to a subsequent clustering or dimensionality reduction algorithm based on the graph. 1
3 0.69761354 28 nips-2004-Bayesian inference in spiking neurons
Author: Sophie Deneve
Abstract: We propose a new interpretation of spiking neurons as Bayesian integrators accumulating evidence over time about events in the external world or the body, and communicating to other neurons their certainties about these events. In this model, spikes signal the occurrence of new information, i.e. what cannot be predicted from the past activity. As a result, firing statistics are close to Poisson, albeit providing a deterministic representation of probabilities. We proceed to develop a theory of Bayesian inference in spiking neural networks, recurrent interactions implementing a variant of belief propagation. Many perceptual and motor tasks performed by the central nervous system are probabilistic, and can be described in a Bayesian framework [4, 3]. A few important but hidden properties, such as direction of motion, or appropriate motor commands, are inferred from many noisy, local and ambiguous sensory cues. These evidences are combined with priors about the sensory world and body. Importantly, because most of these inferences should lead to quick and irreversible decisions in a perpetually changing world, noisy cues have to be integrated on-line, but in a way that takes into account unpredictable events, such as a sudden change in motion direction or the appearance of a new stimulus. This raises the question of how this temporal integration can be performed at the neural level. It has been proposed that single neurons in sensory cortices represent and compute the log probability that a sensory variable takes on a certain value (eg Is visual motion in the neuron’s preferred direction?) [9, 7]. Alternatively, to avoid normalization issues and provide an appropriate signal for decision making, neurons could represent the log probability ratio of a particular hypothesis (eg is motion more likely to be towards the right than towards the left) [7, 6]. Log probabilities are convenient here, since under some assumptions, independent noisy cues simply combine linearly. Moreover, there are physiological evidence for the neural representation of log probabilities and log probability ratios [9, 6, 7]. However, these models assume that neurons represent probabilities in their firing rates. We argue that it is important to study how probabilistic information are encoded in spikes. Indeed, it seems spurious to marry the idea of an exquisite on-line integration of noisy cues with an underlying rate code that requires averaging on large populations of noisy neurons and long periods of time. In particular, most natural tasks require this integration to take place on the time scale of inter-spike intervals. Spikes are more efficiently signaling events ∗ Institute of Cognitive Science, 69645 Bron, France than analog quantities. In addition, a neural theory of inference with spikes will bring us closer to the physiological level and generate more easily testable predictions. Thus, we propose a new theory of neural processing in which spike trains provide a deterministic, online representation of a log-probability ratio. Spikes signals events, eg that the log-probability ratio has exceeded what could be predicted from previous spikes. This form of coding was loosely inspired by the idea of ”energy landscape” coding proposed by Hinton and Brown [2]. 
However, contrary to [2] and other theories using rate-based representation of probabilities, this model is self-consistent and does not require different models for encoding and decoding: As output spikes provide new, unpredictable, temporally independent evidence, they can be used directly as an input to other Bayesian neurons. Finally, we show that these neurons can be used as building blocks in a theory of approximate Bayesian inference in recurrent spiking networks. Connections between neurons implement an underlying Bayesian network, consisting of coupled hidden Markov models. Propagation of spikes is a form of belief propagation in this underlying graphical model. Our theory provides computational explanations of some general physiological properties of cortical neurons, such as spike frequency adaptation, Poisson statistics of spike trains, the existence of strong local inhibition in cortical columns, and the maintenance of a tight balance between excitation and inhibition. Finally, we discuss the implications of this model for the debate about temporal versus rate-based neural coding. 1 Spikes and log posterior odds 1.1 Synaptic integration seen as inference in a hidden Markov chain We propose that each neuron codes for an underlying ”hidden” binary variable, xt , whose state evolves over time. We assume that xt depends only on the state at the previous time step, xt−dt , and is conditionally independent of other past states. The state xt can switch 1 from 0 to 1 with a constant rate ron = dt limdt→0 P (xt = 1|xt−dt = 0), and from 1 to 0 with a constant rate roff . For example, these transition rates could represent how often motion in a preferred direction appears the receptive field and how long it is likely to stay there. The neuron infers the state of its hidden variable from N noisy synaptic inputs, considered to be observations of the hidden state. In this initial version of the model, we assume that these inputs are conditionally independent homogeneous Poisson processes, synapse i i emitting a spike between time t and t + dt (si = 1) with constant probability qon dt if t i xt = 1, and another constant probability qoff dt if xt = 0. The synaptic spikes are assumed to be otherwise independent of previous synaptic spikes, previous states and spikes at other synapses. The resulting generative model is a hidden Markov chain (figure 1-A). However, rather than estimating the state of its hidden variable and communicating this estimate to other neurons (for example by emitting a spike when sensory evidence for xt = 1 goes above a threshold) the neuron reports and communicates its certainty that the current state is 1. This certainty takes the form of the log of the ratio of the probability that the hidden state is 1, and the probability that the state is 0, given all the synaptic inputs P (xt =1|s0→t ) received so far: Lt = log P (xt =0|s0→t ) . We use s0→t as a short hand notation for the N synaptic inputs received at present and in the past. We will refer to it as the log odds ratio. Thanks to the conditional independencies assumed in the generative model, we can compute this Log odds ratio iteratively. Taking the limit as dt goes to zero, we get the following differential equation: ˙ L = ron 1 + e−L − roff 1 + eL + i wi δ(si − 1) − θ t B. A. xt ron .roff dt qon , qoff st xt ron .roff i t st dt s qon , qoff qon , qoff st dt xt j st Ot It Gt Ot Lt t t dt C. E. 2 0 -2 -4 D. 
500 1000 1500 2000 2500 2 3000 Count Log odds 4 20 Lt 0 -2 0 500 1000 1500 2000 2500 Time Ot 3000 0 200 400 600 ISI Figure 1: A. Generative model for the synaptic input. B. Schematic representation of log odds ratio encoding and decoding. The dashed circle represents both eventual downstream elements and the self-prediction taking place inside the model neuron. A spike is fired only when Lt exceeds Gt . C. One example trial, where the state switches from 0 to 1 (shaded area) and back to 0. plain: Lt , dotted: Gt . Black stripes at the top: corresponding spikes train. D. Mean Log odds ratio (dark line) and mean output firing rate (clear line). E. Output spike raster plot (1 line per trial) and ISI distribution for the neuron shown is C. and D. Clear line: ISI distribution for a poisson neuron with the same rate. wi , the synaptic weight, describe how informative synapse i is about the state of the hidden i qon variable, e.g. wi = log qi . Each synaptic spike (si = 1) gives an impulse to the log t off odds ratio, which is positive if this synapse is more active when the hidden state if 1 (i.e it increases the neuron’s confidence that the state is 1), and negative if this synapse is more active when xt = 0 (i.e it decreases the neuron’s confidence that the state is 1). The bias, θ, is determined by how informative it is not to receive any spike, e.g. θ = i i i qon − qoff . By convention, we will consider that the ”bias” is positive or zero (if not, we need simply to invert the status of the state x). 1.2 Generation of output spikes The spike train should convey a sparse representation of Lt , so that each spike reports new information about the state xt that is not redundant with that reported by other, preceding, spikes. This proposition is based on three arguments: First, spikes, being metabolically expensive, should be kept to a minimum. Second, spikes conveying redundant information would require a decoding of the entire spike train, whereas independent spike can be taken into account individually. And finally, we seek a self consistent model, with the spiking output having a similar semantics to its spiking input. To maximize the independence of the spikes (conditioned on xt ), we propose that the neuron fires only when the difference between its log odds ratio Lt and a prediction Gt of this log odds ratio based on the output spikes emitted so far reaches a certain threshold. Indeed, supposing that downstream elements predicts Lt as best as they can, the neuron only needs to fire when it expects that prediction to be too inaccurate (figure 1-B). In practice, this will happen when the neuron receives new evidence for xt = 1. Gt should thereby follow the same dynamics as Lt when spikes are not received. The equation for Gt and the output Ot (Ot = 1 when an output spike is fired) are given by: ˙ G = Ot = ron 1 + e−L − roff 1 + eL + go δ(Ot − 1) go 1. when Lt > Gt + , 0 otherwise, 2 (1) (2) Here go , a positive constant, is the only free parameter, the other parameters being constrained by the statistics of the synaptic input. 1.3 Results Figure 1-C plots a typical trial, showing the behavior of L, G and O before, during and after presentation of the stimulus. As random synaptic inputs are integrated, L fluctuates and eventually exceeds G + 0.5, leading to an output spike. Immediately after a spike, G jumps to G + go , which prevents (except in very rare cases) a second spike from immediately following the first. Thus, this ”jump” implements a relative refractory period. 
However, ron G decays as it tends to converge back to its stable level gstable = log roff . Thus L eventually exceeds G again, leading to a new spike. This threshold crossing happens more often during stimulation (xt = 1) as the net synaptic input alters to create a higher overall level of certainty, Lt . Mean Log odds ratio and output firing rate ¯ The mean firing rate Ot of the Bayesian neuron during presentation of its preferred stimulus (i.e. when xt switches from 0 to 1 and back to 0) is plotted in figure 1-D, together with the ¯ mean log posterior ratio Lt , both averaged over trials. Not surprisingly, the log-posterior ratio reflects the leaky integration of synaptic evidence, with an effective time constant that depends on the transition probabilities ron , roff . If the state is very stable (ron = roff ∼ 0), synaptic evidence is integrated over almost infinite time periods, the mean log posterior ratio tending to either increase or decrease linearly with time. In the example in figure 1D, the state is less stable, so ”old” synaptic evidence are discounted and Lt saturates. ¯ In contrast, the mean output firing rate Ot tracks the state of xt almost perfectly. This is because, as a form of predictive coding, the output spikes reflect the new synaptic i evidence, It = i δ(st − 1) − θ, rather than the log posterior ratio itself. In particular, the mean output firing rate is a rectified linear function of the mean input, e. g. + ¯ ¯ wi q i −θ . O= 1I= go i on(off) Analogy with a leaky integrate and fire neuron We can get an interesting insight into the computation performed by this neuron by linearizing L and G around their mean levels over trials. Here we reduce the analysis to prolonged, statistically stable periods when the state is constant (either ON or OFF). In this case, the ¯ ¯ mean level of certainty L and its output prediction G are also constant over time. We make the rough approximation that the post spike jump, go , and the input fluctuations are small ¯ compared to the mean level of certainty L. Rewriting Vt = Lt − Gt + go 2 as the ”membrane potential” of the Bayesian neuron: ˙ V = −kL V + It − ∆go − go Ot ¯ ¯ ¯ where kL = ron e−L + roff eL , the ”leak” of the membrane potential, depends on the overall ¯ level of certainty. ∆go is positive and a monotonic increasing function of go . A. s t1 dt s t1 s t1 dt B. C. x t1 x t3 dt x t3 x t3 dt x t1 x t1 x t1 x t2 x t3 x t1 … x tn x t3 x t2 … x tn … dt dt Lx2 D. x t2 dt s t2 dt x t2 s t2 x t2 dt s t2 dt Log odds 10 No inh -0.5 -1 -1 -1.5 -2 5 Feedback 500 1000 1500 2000 Tiger Stripes 0 -5 -10 500 1000 1500 2000 2500 Time Figure 2: A. Bayesian causal network for yt (tiger), x1 (stripes) and x2 (paws). B. A nett t work feedforward computing the log posterior for x1 . C. A recurrent network computing t the log posterior odds for all variables. D. Log odds ratio in a simulated trial with the net2 1 1 work in C (see text). Thick line: Lx , thin line: Lx , dash-dotted: Lx without inhibition. t t t 2 Insert: Lx averaged over trials, showing the effect of feedback. t The linearized Bayesian neuron thus acts in its stable regime as a leaky integrate and fire (LIF) neuron. The membrane potential Vt integrates its input, Jt = It − ∆go , with a leak kL . The neuron fires when its membrane potential reaches a constant threshold go . After ¯ each spikes, Vt is reset to 0. Interestingly, for appropriately chosen compression factor go , the mean input to the lin¯ ¯ earized neuron J = I − ∆go ≈ 0 1 . 
This means that the membrane potential is purely driven to its threshold by input fluctuations, or a random walk in membrane potential. As a consequence, the neuron’s firing will be memoryless, and close to a Poisson process. In particular, we found Fano factor close to 1 and quasi-exponential ISI distribution (figure 1E) on the entire range of parameters tested. Indeed, LIF neurons with balanced inputs have been proposed as a model to reproduce the statistics of real cortical neurons [8]. This balance is implemented in our model by the neuron’s effective self-inhibition, even when the synaptic input itself is not balanced. Decoding As we previously said, downstream elements could predict the log odds ratio Lt by computing Gt from the output spikes (Eq 1, fig 1-B). Of course, this requires an estimate of the transition probabilities ron , roff , that could be learned from the observed spike trains. However, we show next that explicit decoding is not necessary to perform bayesian inference in spiking networks. Intuitively, this is because the quantity that our model neurons receive and transmit, eg new information, is exactly what probabilistic inference algorithm propagate between connected statistical elements. 1 ¯ Even if go is not chosen optimally, the influence of the drift J is usually negligible compared to the large fluctuations in membrane potential. 2 Bayesian inference in cortical networks The model neurons, having the same input and output semantics, can be used as building blocks to implement more complex generative models consisting of coupled Markov chains. Consider, for example, the example in figure 2-A. Here, a ”parent” variable x1 t (the presence of a tiger) can cause the state of n other ”children” variables ([xk ]k=2...n ), t of whom two are represented (the presence of stripes,x2 , and motion, x3 ). The ”chilt t dren” variables are Bayesian neurons identical to those described previously. The resulting bayesian network consist of n + 1 coupled hidden Markov chains. Inference in this architecture corresponds to computing the log posterior odds ratio for the tiger, x1 , and the log t posterior of observing stripes or motion, ([xk ]k=2...n ), given the synaptic inputs received t by the entire network so far, i.e. s2 , . . . , sk . 0→t 0→t Unfortunately, inference and learning in this network (and in general in coupled Markov chains) requires very expensive computations, and cannot be performed by simply propagating messages over time and among the variable nodes. In particular, the state of a child k variable xt depends on xk , sk , x1 and the state of all other children at the previous t t t−dt time step, [xj ]2
4 0.5632931 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss
Author: Liam Paninski
Abstract: We develop a family of upper and lower bounds on the worst-case expected KL loss for estimating a discrete distribution on a finite number m of points, given N i.i.d. samples. Our upper bounds are approximationtheoretic, similar to recent bounds for estimating discrete entropy; the lower bounds are Bayesian, based on averages of the KL loss under Dirichlet distributions. The upper bounds are convex in their parameters and thus can be minimized by descent methods to provide estimators with low worst-case error; the lower bounds are indexed by a one-dimensional parameter and are thus easily maximized. Asymptotic analysis of the bounds demonstrates the uniform KL-consistency of a wide class of estimators as c = N/m → ∞ (no matter how slowly), and shows that no estimator is consistent for c bounded (in contrast to entropy estimation). Moreover, the bounds are asymptotically tight as c → 0 or ∞, and are shown numerically to be tight within a factor of two for all c. Finally, in the sparse-data limit c → 0, we find that the Dirichlet-Bayes (add-constant) estimator with parameter scaling like −c log(c) optimizes both the upper and lower bounds, suggesting an optimal choice of the “add-constant” parameter in this regime.
5 0.5607363 102 nips-2004-Learning first-order Markov models for control
Author: Pieter Abbeel, Andrew Y. Ng
Abstract: First-order Markov models have been successfully applied to many problems, for example in modeling sequential data using Markov chains, and modeling control problems using the Markov decision processes (MDP) formalism. If a first-order Markov model’s parameters are estimated from data, the standard maximum likelihood estimator considers only the first-order (single-step) transitions. But for many problems, the firstorder conditional independence assumptions are not satisfied, and as a result the higher order transition probabilities may be poorly approximated. Motivated by the problem of learning an MDP’s parameters for control, we propose an algorithm for learning a first-order Markov model that explicitly takes into account higher order interactions during training. Our algorithm uses an optimization criterion different from maximum likelihood, and allows us to learn models that capture longer range effects, but without giving up the benefits of using first-order Markov models. Our experimental results also show the new algorithm outperforming conventional maximum likelihood estimation in a number of control problems where the MDP’s parameters are estimated from data. 1
6 0.5581736 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms
7 0.55551863 64 nips-2004-Experts in a Markov Decision Process
8 0.55531353 45 nips-2004-Confidence Intervals for the Area Under the ROC Curve
9 0.55523622 131 nips-2004-Non-Local Manifold Tangent Learning
10 0.55515075 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification
11 0.55485153 7 nips-2004-A Large Deviation Bound for the Area Under the ROC Curve
12 0.55464429 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
13 0.55454975 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
14 0.55452532 116 nips-2004-Message Errors in Belief Propagation
15 0.55418146 86 nips-2004-Instance-Specific Bayesian Model Averaging for Classification
16 0.55412436 163 nips-2004-Semi-parametric Exponential Family PCA
17 0.55393165 124 nips-2004-Multiple Alignment of Continuous Time Series
18 0.55386209 23 nips-2004-Analysis of a greedy active learning strategy
19 0.55380768 69 nips-2004-Fast Rates to Bayes for Kernel Machines
20 0.55346113 207 nips-2004-ℓ₀-norm Minimization for Basis Selection