nips nips2006 nips2006-54 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jade P. Vinson, David Decaprio, Matthew D. Pearson, Stacey Luoma, James E. Galagan
Abstract: Computational gene prediction using generative models has reached a plateau, with several groups converging to a generalized hidden Markov model (GHMM) incorporating phylogenetic models of nucleotide sequence evolution. Further improvements in gene calling accuracy are likely to come through new methods that incorporate additional data, both comparative and species specific. Conditional Random Fields (CRFs), which directly model the conditional probability P (y|x) of a vector of hidden states conditioned on a set of observations, provide a unified framework for combining probabilistic and non-probabilistic information and have been shown to outperform HMMs on sequence labeling tasks in natural language processing. We describe the use of CRFs for comparative gene prediction. We implement a model that encapsulates both a phylogenetic-GHMM (our baseline comparative model) and additional non-probabilistic features. We tested our model on the genome sequence of the fungal human pathogen Cryptococcus neoformans. Our baseline comparative model displays accuracy comparable to the the best available gene prediction tool for this organism. Moreover, we show that discriminative training and the incorporation of non-probabilistic evidence significantly improve performance. Our software implementation, Conrad, is freely available with an open source license at http://www.broad.mit.edu/annotation/conrad/. 1
Reference: text
sentIndex sentText sentNum sentScore
1 edu The Broad Institute of MIT and Harvard Cambridge, MA 02142 Abstract Computational gene prediction using generative models has reached a plateau, with several groups converging to a generalized hidden Markov model (GHMM) incorporating phylogenetic models of nucleotide sequence evolution. [sent-14, score-1.325]
2 Further improvements in gene calling accuracy are likely to come through new methods that incorporate additional data, both comparative and species specific. [sent-15, score-0.852]
3 We describe the use of CRFs for comparative gene prediction. [sent-17, score-0.644]
4 We implement a model that encapsulates both a phylogenetic-GHMM (our baseline comparative model) and additional non-probabilistic features. [sent-18, score-0.301]
5 We tested our model on the genome sequence of the fungal human pathogen Cryptococcus neoformans. [sent-19, score-0.414]
6 Our baseline comparative model displays accuracy comparable to the the best available gene prediction tool for this organism. [sent-20, score-0.962]
7 Moreover, we show that discriminative training and the incorporation of non-probabilistic evidence significantly improve performance. [sent-21, score-0.23]
8 1 Introduction Gene prediction is the task of labeling nucleotide sequences to identify the location and components of genes (Figure 1). [sent-26, score-0.464]
9 The accurate automated prediction of genes is essential to both downstream bioinformatic analyses and the interpretation of biological experiments. [sent-27, score-0.23]
10 Currently, the most accurate approach to computational gene prediction is generative modeling. [sent-28, score-0.619]
11 In this approach, one models the joint probability of the hidden gene structure y and the observed nucleotide sequence x. [sent-29, score-0.97]
12 Given a new set of observations x, one predicts genes by selecting the path of hidden labels y that maximizes P rθ (y, x). [sent-31, score-0.306]
13 Several independent groups have converged on the same generative model: a phylogenetic generalized hidden Markov model with explicit state durations (phylo-GHMM) [1, 2, 3, 4]. [sent-32, score-0.386]
14 com Figure 1: The processes of RNA transcription, intron splicing, and protein translation (panel A) and a state diagram for gene structure (panel B). [sent-34, score-0.617]
15 The 3-periodicity in the state diagram corresponds to the translation of nucleotide triplets into amino acids. [sent-36, score-0.277]
16 One approach to combining multiple sources of information for gene prediction, conditional maximum likelihood, was proposed in 1994 by Stormo and Haussler [5] and later implemented in the program GAZE by Howe et. [sent-39, score-0.579]
17 In this approach, one defines a Boltzmann distribution where the probability of each hidden sequence is proportional to the exponential of a weighted sum of different types of evidence. [sent-42, score-0.234]
18 One then trains the weights to maximize the conditional probability P rw (y|x) of the hidden sequence given the observations in the training data. [sent-43, score-0.578]
19 Like the earlier work in gene prediction, CRFs assign a probability to each hidden sequence that is proportional to the exponential of a weighted sum, and the weights are trained to maximize the conditional probability of the training data. [sent-45, score-1.045]
20 1 and [8]) is one of the strengths of this approach, but was not noticed in the earlier work on gene prediction. [sent-47, score-0.455]
21 We introduce a novel strategy for feature selection, allowing us to directly incorporate the best existing generative models with additional sources of evidence in the same theoretical framework. [sent-50, score-0.292]
22 First, we use probabalistic features based on generative models whenever well-developed models are available. [sent-51, score-0.219]
23 Second, we add non-probabilistic features for information that is not easily modeled generatively, such as alignments of expressed sequence tags (ESTs). [sent-53, score-0.219]
24 We developed Conrad, a gene predictor and highly optimized CRF engine. [sent-54, score-0.489]
25 We applied Conrad to predict genes in the fungal human pathogen Cryptococcus neoformans. [sent-59, score-0.323]
26 Our baseline comparative model is as accurate as Twinscan [9, 10], the most accurate gene predictor trained for C. [sent-60, score-0.836]
27 Training the weights of our model discriminatively further improves prediction accuracy, indicating that discriminatively trained models can outperform generatively trained models on the same data. [sent-62, score-0.421]
28 The addition of non-probabilistic features further improves prediction accuracy. [sent-63, score-0.215]
29 Figure 2: Graphical models for a first-order chain-structured conditional random field (panel A) and a first-order hidden Markov model (panel B). [sent-64, score-0.324]
30 2 Conditional Random Fields Conditional random fields are a framework for expressing the conditional probability Pr(y|x) of hidden states y = (y1 , y2 , . [sent-67, score-0.374]
31 The conditional probabilities can be viewed as a Boltzman distribution where the pairwise energy between two positions i − 1 and i is a weighted sum of the feature functions fj . [sent-72, score-0.39]
32 The feature functions fj (yi−1 , y, x, i) can be any real-valued functions defined for all possible hidden states yi−1 and y, observations x, and positions i. [sent-74, score-0.52]
33 An alternative viewpoint comes by reversing the order of summation in Equation 1 and expressing the conditional probability using feature sums Fj that depend on the entire hidden sequence y: n 1 Pr(y|x) = fj (yi−1 , yi , x, i). [sent-77, score-0.754]
34 These theoretical derivations also apply to generalizations of CRFs, such as semi-Markov CRFs [11], in which one modifies the formula expressing the feature sums Fj in terms of the feature functions fj . [sent-79, score-0.402]
35 1 Inference and Training Given a CRF and observations x, the inference problem is to determine the sequence of hidden states with the highest conditional probability, ymax = argmaxy (Pr(y|x)). [sent-81, score-0.423]
36 For a linear-chain CRF, each feature function fj depends only on pairs of adjacent hidden states and there is an efficient Viterbi algorithm for solving the inference problem. [sent-82, score-0.414]
37 In the first step, free parameters associated with individual feature functions are fixed using the training data. [sent-84, score-0.185]
38 In the second step, the weights λ are selected to maximize the conditional log-likelihood: λmax = argmax (log (P rλ (y|x))) λ The log-likelihood is a concave function of λ (its Hessian is the negative covariance matrix of the random variables Fj relative to the Boltzmann distribution). [sent-86, score-0.231]
39 Using the weights obtained by training, the resulting probability distribution on P rλ (·|x) is the maximum entropy distribution subject to the constraints that the expected value of each feature sum Fj is equal to its value in the training data. [sent-88, score-0.202]
40 For example, one can maximize the expected value GAOF (λ) = EP rλ (S(y, y 0 , x)) of the similarity between the actual hidden sequence y 0 and a random hidden sequence y selected according to Equation 1. [sent-90, score-0.522]
41 In this paper we consider this simplest possible alternate objective function, where the local similarity function is 1 at position i if the hidden sequence is correct and 0 otherwise. [sent-95, score-0.395]
42 Tyi ,yi+1 P r(y, x) = πy1 i=1 i=1 Given an observation sequence x, the conditional probabilities implied by this HMM can be expressed as a CRF by defining the following three features and setting all weights to 1. [sent-101, score-0.33]
43 One of the most important extensions for gene prediction is to explicitly model state durations: for many species the lengths of some components are tightly constrained, such as introns in fungi. [sent-103, score-0.672]
44 However, for gene prediction there are well-developed probabilistic models that can serve as a starting point in the design of a CRF. [sent-107, score-0.655]
45 We propose a new approach to CRF feature selection with the following guiding principle: use probabilistic models for feature functions when possible and add non-probabistic features only where necessary. [sent-108, score-0.337]
46 The CRF training algorithm determines the relative contributions of these features through discriminative training, without having to assume independence between the features or explicitly model dependencies. [sent-109, score-0.259]
47 Our approach to gene prediction is implemented as Conrad, a highly configurable Java executable. [sent-110, score-0.566]
48 Because the CRF engine is a separate module with a well-defined interface for feature functions, it can also be used for applications other than gene prediction. [sent-112, score-0.525]
49 1 The baseline comparative model: a phylogenetic GHMM Phylogenetic generalized hidden Markov models are now the standard approach to gene prediction using generative models [1, 2, 3, 4], and capture many of the signals for resolving gene structure (e. [sent-114, score-1.798]
50 splice models or phylogenetic models of nucleotide sequence evolution). [sent-116, score-0.628]
51 0, reproduce the phylo-GHMM that we refer to as our baseline comparative model. [sent-118, score-0.301]
52 Our baseline comparative model is based on the state diagram of Figure 1, enforces the basic gene constraints (e. [sent-119, score-0.799]
53 open reading frames and GT-AG splice junctions), explicitly models intron length using a mixture model, and uses a set of multiply aligned genomes (including the reference genome to be annotated) as observations. [sent-121, score-0.402]
54 = δ(yi = exon3) log(P r(multiple alignment column | reference nucleotide )), using a phylogenetic evolutionary model trained by ML. [sent-127, score-0.485]
55 Non-probabalistic features For many signals useful in resolving gene structure (e. [sent-128, score-0.559]
56 Indels are known to be relevant to gene prediction: evolution preserves the functions of most genes and an indel that is not a multiple of three would dirsupt the translation of a protein. [sent-133, score-0.678]
57 Thus, indels not a multiple of three provide evidence against a position being part of an exon. [sent-134, score-0.194]
58 We introduce the features fGAP,1 fGAP,2 = δ(yi = exon & an indel of length 0 mod 3 has a boundary at position i ) = δ(yi = exon & an indel of length 1 or 2 mod 3 has a boundary at position i ), plus the four analogous features for introns and intergenic regions. [sent-135, score-0.599]
59 4 Results We evaluated our model using the genome of fungal human pathogen Cryptococcus neoformans strain JEC21 [18]. [sent-138, score-0.401]
60 neoformans is an ideal test case due to the availability of genomes for four closely related strains for use as comparative data and a high-quality manual annotation with deep EST sequencing. [sent-140, score-0.341]
61 To determine an absolute benchmark, we compared our baseline comparative model to Twinscan [9, 10], the most accurate comparative gene predictor trained for C. [sent-141, score-1.025]
62 At the locations containing both an EST and a gene prediction, the accuracy of our model is comparable to (or better than) that of Twinscan. [sent-144, score-0.55]
63 Table 1: Comparing the prediction accuracy of our baseline comparative model with that of Twinscan. [sent-146, score-0.507]
64 Accuracy statistics are collected at loci where an EST overlaps with a gene prediction. [sent-147, score-0.455]
65 20 Figure 3: Gene prediction accuracy increases with additional features and with the training of feature weights. [sent-156, score-0.474]
66 All models were trained with the alternate objective function (see text), with the exception of models labeled “weights fixed”. [sent-157, score-0.251]
67 We next measured the relative effects of different sets of features and methods for training the feature weights. [sent-162, score-0.221]
68 First, we created a set of 1190 trusted genes by selecting those genes which had EST support along their entire length. [sent-163, score-0.238]
69 We then performed 10 cross-validation replicates for several combinations of a set of features and a method for training weights, and a training set sizes (50, 100, 200, 400, or 800 genes). [sent-164, score-0.266]
70 For each set of replicates, we record the average nucleotide accuracy. [sent-165, score-0.234]
71 As expected, the testing accuracy increases with larger training sets, while the training accuracy decreases. [sent-167, score-0.395]
72 1 Adding features improves accuracy The effect of including additional features is shown in Figure 3. [sent-170, score-0.271]
73 As can be seen in each case, model accuracy improves as new evidence is added. [sent-171, score-0.203]
74 For a 400 gene training set, adding the EST features increases the accuracy of the baseline single species model from 89. [sent-172, score-0.927]
75 Adding the gap features increases the accuracy of the baseline comparative model from 93. [sent-175, score-0.515]
76 Finally, adding both types of evidence together increases accuracy more than either addition in isolation: adding EST and gap features to the baseline comparative model increases accuracy from 93. [sent-178, score-0.733]
77 2 Training using an alternate objective function The standard training of weights for CRFs seeks to maximize the conditional log probability of the training data. [sent-183, score-0.5]
78 Previous work in natural language processing found no accuracy benefit to changing the objective function [19]. [sent-185, score-0.196]
79 However, relative to the usual training to maximize conditional log-likelihood, we observed about 2% greater nucleotide accuracy in testing data using models trained to maximize an alternative objective function (the expected nucleotide accuracy of a random sequence on training data). [sent-186, score-1.271]
80 For example, for a 400 gene training set, training the weights increases the accuracy of the baseline single species model from 87. [sent-190, score-0.987]
81 2% to 89% and the baseline comparative model from 90. [sent-191, score-0.301]
82 5 Concluding Remarks CRFs are a promising framework for gene prediction. [sent-194, score-0.455]
83 CRFs offer several advantages relative to standard HMM-based gene prediction methods including the ability to capture long-range dependencies and to incorporate heterogeneous data within a single framework. [sent-195, score-0.646]
84 We have implemented a semi-Markov CRF by explicitly expressing a phylogenetic GHMM within a CRF framework and extending this baseline with non-probabilisitic evidence. [sent-196, score-0.322]
85 When used to predict genes in the fungal human pathogen C. [sent-197, score-0.323]
86 neoformans, our model displays accuracy comparable to the best existing gene prediction tools. [sent-198, score-0.661]
87 We adopt the following guiding principle: we use probabilistic models as features where possible and incorporate non-probabilistic features only when necessary. [sent-201, score-0.279]
88 Our approach also differs from an initial study of using CRFs for gene prediction [20], which does not use a probabilistic model as the baseline. [sent-203, score-0.608]
89 CRFs offer a solution to an important problem in gene prediction: how to combine probabilistic models of nucleotide sequences with additional evidence from diverse sources. [sent-204, score-0.888]
90 Prior research in this direction has focused on either handcrafted heuristics for a particular type of feature [21], a mixture-of-experts approach applied at each nucleotide position [22], and decision trees [23]. [sent-205, score-0.399]
91 CRFs offer an alternative approach in which probabilistic features and non-probabilistic evidence are both incorporated in the same framework. [sent-206, score-0.224]
92 In all these examples, as with gene predictions, CRFs provide the ability to incorporate supplementary evidence not captured in current generative models. [sent-209, score-0.63]
93 Combining phylogenetic and hidden Markov models in biosequence analysis. [sent-215, score-0.344]
94 Multiple-sequence functional annotation and the generalized hidden Markov phylogeny. [sent-218, score-0.198]
95 Gene finding with a hidden Markov model of genome structure and evolution. [sent-221, score-0.282]
96 Gene prediction and verification in a compact genome with numerous small introns. [sent-250, score-0.24]
97 A tutorial on hidden markov models and selected applications in speech recognition. [sent-265, score-0.241]
98 The genome of the basidiomycetous yeast and human pathogen Cryptococcus neoformans. [sent-286, score-0.242]
99 Computational inference of homologous gene structures in the human genome. [sent-295, score-0.455]
100 JIGSAW: integration of multiple sources of evidence for gene prediction. [sent-301, score-0.531]
wordName wordTfidf (topN-words)
[('gene', 0.455), ('crfs', 0.332), ('crf', 0.244), ('nucleotide', 0.234), ('comparative', 0.189), ('fj', 0.16), ('hidden', 0.153), ('est', 0.144), ('phylogenetic', 0.144), ('conrad', 0.136), ('genome', 0.129), ('conditional', 0.124), ('genes', 0.119), ('ests', 0.113), ('pathogen', 0.113), ('baseline', 0.112), ('prediction', 0.111), ('yi', 0.1), ('accuracy', 0.095), ('cryptococcus', 0.091), ('fungal', 0.091), ('twinscan', 0.091), ('sequence', 0.081), ('intron', 0.079), ('training', 0.079), ('evidence', 0.076), ('splice', 0.075), ('features', 0.072), ('feature', 0.07), ('ghmm', 0.068), ('indel', 0.068), ('indels', 0.068), ('neoformans', 0.068), ('species', 0.067), ('expressing', 0.066), ('alternate', 0.062), ('alignment', 0.061), ('stormo', 0.059), ('hmm', 0.059), ('res', 0.055), ('exon', 0.054), ('maximize', 0.054), ('weights', 0.053), ('generative', 0.053), ('language', 0.052), ('position', 0.05), ('objective', 0.049), ('bioinformatics', 0.048), ('hmms', 0.047), ('pr', 0.047), ('increases', 0.047), ('models', 0.047), ('trained', 0.046), ('incorporate', 0.046), ('gaof', 0.045), ('gurable', 0.045), ('handcrafted', 0.045), ('howe', 0.045), ('siepel', 0.045), ('annotation', 0.045), ('diagram', 0.043), ('probabilistic', 0.042), ('markov', 0.041), ('protein', 0.04), ('generatively', 0.039), ('genomes', 0.039), ('introns', 0.039), ('jade', 0.039), ('license', 0.039), ('randall', 0.039), ('incorporation', 0.039), ('richard', 0.039), ('andrew', 0.037), ('discriminative', 0.036), ('brown', 0.036), ('homology', 0.036), ('recomb', 0.036), ('replicates', 0.036), ('mod', 0.036), ('durations', 0.036), ('functions', 0.036), ('michael', 0.034), ('offer', 0.034), ('predictor', 0.034), ('david', 0.034), ('observations', 0.034), ('gaps', 0.034), ('kevin', 0.034), ('tags', 0.034), ('aligned', 0.033), ('improves', 0.032), ('fields', 0.032), ('alignments', 0.032), ('freely', 0.032), ('gaze', 0.032), ('sep', 0.032), ('resolving', 0.032), ('gross', 0.032), ('samuel', 0.032), ('states', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000019 54 nips-2006-Comparative Gene Prediction using Conditional Random Fields
Author: Jade P. Vinson, David Decaprio, Matthew D. Pearson, Stacey Luoma, James E. Galagan
Abstract: Computational gene prediction using generative models has reached a plateau, with several groups converging to a generalized hidden Markov model (GHMM) incorporating phylogenetic models of nucleotide sequence evolution. Further improvements in gene calling accuracy are likely to come through new methods that incorporate additional data, both comparative and species specific. Conditional Random Fields (CRFs), which directly model the conditional probability P (y|x) of a vector of hidden states conditioned on a set of observations, provide a unified framework for combining probabilistic and non-probabilistic information and have been shown to outperform HMMs on sequence labeling tasks in natural language processing. We describe the use of CRFs for comparative gene prediction. We implement a model that encapsulates both a phylogenetic-GHMM (our baseline comparative model) and additional non-probabilistic features. We tested our model on the genome sequence of the fungal human pathogen Cryptococcus neoformans. Our baseline comparative model displays accuracy comparable to the the best available gene prediction tool for this organism. Moreover, we show that discriminative training and the incorporation of non-probabilistic evidence significantly improve performance. Our software implementation, Conrad, is freely available with an open source license at http://www.broad.mit.edu/annotation/conrad/. 1
2 0.22528115 108 nips-2006-Large Scale Hidden Semi-Markov SVMs
Author: Gunnar Rätsch, Sören Sonnenburg
Abstract: We describe Hidden Semi-Markov Support Vector Machines (SHM SVMs), an extension of HM SVMs to semi-Markov chains. This allows us to predict segmentations of sequences based on segment-based features measuring properties such as the length of the segment. We propose a novel technique to partition the problem into sub-problems. The independently obtained partial solutions can then be recombined in an efficient way, which allows us to solve label sequence learning problems with several thousands of labeled sequences. We have tested our algorithm for predicting gene structures, an important problem in computational biology. Results on a well-known model organism illustrate the great potential of SHM SVMs in computational biology. 1
3 0.20786308 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy
Author: Samuel S. Gross, Olga Russakovsky, Chuong B. Do, Serafim Batzoglou
Abstract: We consider the problem of training a conditional random field (CRF) to maximize per-label predictive accuracy on a training set, an approach motivated by the principle of empirical risk minimization. We give a gradient-based procedure for minimizing an arbitrarily accurate approximation of the empirical risk under a Hamming loss function. In experiments with both simulated and real data, our optimization procedure gives significantly better testing performance than several current approaches for CRF training, especially in situations of high label noise. 1
4 0.11061219 100 nips-2006-Information Bottleneck for Non Co-Occurrence Data
Author: Yevgeny Seldin, Noam Slonim, Naftali Tishby
Abstract: We present a general model-independent approach to the analysis of data in cases when these data do not appear in the form of co-occurrence of two variables X, Y , but rather as a sample of values of an unknown (stochastic) function Z(X, Y ). For example, in gene expression data, the expression level Z is a function of gene X and condition Y ; or in movie ratings data the rating Z is a function of viewer X and movie Y . The approach represents a consistent extension of the Information Bottleneck method that has previously relied on the availability of co-occurrence statistics. By altering the relevance variable we eliminate the need in the sample of joint distribution of all input variables. This new formulation also enables simple MDL-like model complexity control and prediction of missing values of Z. The approach is analyzed and shown to be on a par with the best known clustering algorithms for a wide range of domains. For the prediction of missing values (collaborative filtering) it improves the currently best known results. 1
5 0.098274022 132 nips-2006-Modeling Dyadic Data with Binary Latent Factors
Author: Edward Meeds, Zoubin Ghahramani, Radford M. Neal, Sam T. Roweis
Abstract: We introduce binary matrix factorization, a novel model for unsupervised matrix decomposition. The decomposition is learned by fitting a non-parametric Bayesian probabilistic model with binary latent variables to a matrix of dyadic data. Unlike bi-clustering models, which assign each row or column to a single cluster based on a categorical hidden feature, our binary feature model reflects the prior belief that items and attributes can be associated with more than one latent cluster at a time. We provide simple learning and inference rules for this new model and show how to extend it to an infinite model in which the number of features is not a priori fixed but is allowed to grow with the size of the data. 1 Distributed representations for dyadic data One of the major goals of probabilistic unsupervised learning is to discover underlying or hidden structure in a dataset by using latent variables to describe a complex data generation process. In this paper we focus on dyadic data: our domains have two finite sets of objects/entities and observations are made on dyads (pairs with one element from each set). Examples include sparse matrices of movie-viewer ratings, word-document counts or product-customer purchases. A simple way to capture structure in this kind of data is to do “bi-clustering” (possibly using mixture models) by grouping the rows and (independently or simultaneously) the columns[6, 13, 9]. The modelling assumption in such a case is that movies come in à types and viewers in Ä types and that knowing the type of movie and type of viewer is sufficient to predict the response. Clustering or mixture models are quite restrictive – their major disadvantage is that they do not admit a componential or distributed representation because items cannot simultaneously belong to several classes. (A movie, for example, might be explained as coming from a cluster of “dramas” or “comedies”; a viewer as a “single male” or as a “young mother”.) We might instead prefer a model (e.g. [10, 5]) in which objects can be assigned to multiple latent clusters: a movie might be a drama and have won an Oscar and have subtitles; a viewer might be single and female and a university graduate. Inference in such models falls under the broad area of factorial learning (e.g. [7, 1, 3, 12]), in which multiple interacting latent causes explain each observed datum. In this paper, we assume that both data items (rows) and attributes (columns) have this kind of componential structure: each item (row) has associated with it an unobserved vector of à binary features; similarly each attribute (column) has a hidden vector of Ä binary features. Knowing the features of the item and the features of the attribute are sufficient to generate (before noise) the response at that location in the matrix. In effect, we are factorizing a real-valued data (response) , where and are binary feature matrix into (a distribution defined by) the product is a real-valued weight matrix. Below, we develop this binary matrix factorization matrices, and Ï ÍÏÎ Í Î , ÛÓ « K L ÛÐ Ü Ù I Ð ÚÐ =f Ï Í Î J (A) (B) Figure 1: (A) The graphical model representation of the linear-Gaussian BMF model. The concentration parameter and Beta weights for the columns of are represented by the symbols and Ð . (B) BMF shown pictorally. (BMF) model using Bayesian non-parametric priors over the number and values of the unobserved binary features and the unknown weights. 2 BMF model description Binary matrix factorization is a model of an Á ¢  dyadic data matrix with exchangeable rows and columns. The entries of can be real-valued, binary, or categorial; BMF models suitable for each type are described below. Associated with each row is a latent binary feature vector ; similarly each column has an unobserved binary vector . The primary parameters are represented of interaction weights. is generated by a fixed observation process ´¡µ applied by a matrix (elementwise) to the linear inner product of the features and weights, which is the “factorization” or approximation of the data: Ù Ú Ï ÍÎÏ ´ÍÏÎ ¢µ (1) where ¢ are extra parameters specific to the model variant. Three possible parametric forms for and covariance ´½ µ ; the noise (observation) distribution are: Gaussian, with mean logistic, with mean ½ ´½ · ÜÔ´ µµ; and Poisson, with mean (and variance) . Other parametric forms are also possible. For illustrative purposes, we will use the linear-Gaussian model throughout this paper; this can be thought of as a two-sided version of the linear-Gaussian model found in [5]. ÍÏÎ ÍÏÎ ÍÏÎ Á To complete the description of the model, we need to specify prior distributions over the feature matrices and the weights . We adopt the same priors over binary matrices as previously described in [5]. For finite sized matrices with Á rows and à columns, we generate a bias independently for each column using a Beta prior (denoted ) and then conditioned on this bias generate the entries in column independently from a Bernoulli with mean . ÍÎ Ï «Ã Í È Á Í ´« à ¬ µ à ½ ½ Ù « ´½ µ½ Ù « « à ½ ´ Ò « «µ ´½ µÁ Ò where Ò Ù . The hyperprior on the concentration « is a Gamma distribution (denoted ), whose shape and scale hyperparameters control the expected fraction of zeros/ones in the matrix. The biases are easily integrated out, which creates dependencies between the rows, although they remain exchangeable. The resulting prior depends only on the number Ò of active features in each column. An identical prior is used on , with  rows and Ä columns, but with different concentration prior . The variable ¬ was set to ½ for all experiments. Î The appropriate prior distribution over weights depends on the observation distribution is a matrix normal with prior mean the linear-Gaussian variant, a convenient prior on Ï ´¡µ. For ÏÓ and µ Á. covariance ´½ hyperpriors: The scale of the weights and output precision Ï ÏÓ (if needed) have Gamma Æ ´ÏÓ ´½ µ Áµ ´ µ ´ µ In certain cases, when the prior on the weights is conjugate to the output distribution model , the weights may be analytically integrated out, expressing the marginal distribution of the data only in terms of the binary features. This is true, for example, when we place a Gaussian prior on the weights and use a linear-Gaussian output process. ÍÎ Í Î Remarkably, the Beta-Bernoulli prior distribution over (and similarly ) can easily be extended to the case where à ½, creating a distribution over binary matrices with a fixed number Á of exchangeable rows and a potentially infinite number of columns (although the expected number of columns which are not entirely zero remains finite). Such a distribution, the Indian Buffet Process (IBP) was described by [5] and is analogous to the Dirichlet process and the associated Chinese restaurant process (CRP) [11]. Fortunately, as we will see, inference with this infinite prior is not only tractable, but is also nearly as efficient as the finite version. 3 Inference of features and parameters Í As with many other complex hierarchical Bayesian models, exact inference of the latent variables and in the BMF model is intractable (ie there is no efficient way to sample exactly from the posterior nor to compute its exact marginals). However, as with many other non-parametric Bayesian models, we can employ Markov Chain Monte Carlo (MCMC) methods to create an iterative procedure which, if run for sufficiently long, will produce correct posterior samples. Î 3.1 Finite binary latent feature matrices Í Î The posterior distribution of a single entry in (or ) given all other model parameters is proportional to the product of the conditional prior and the data likelihood. The conditional prior comes from integrating out the biases in the Beta-Bernoulli model and is proportional the number of active entries in other rows of the same column plus a term for new activations. Gibbs sampling for single entries of (or ) can be done using the following updates: È ´Ù È ´Ù where Ò on « à and Í Î ½ Í Î Ï µ ´ « à · Ò µ È ´ Í Ù ½ Î Ï µ (2) ¼ Í Î Ï µ ´¬ · ´Á ½µ Ò µ È ´ Í Ù ¼ Πϵ (3) È Ù , Í excludes entry , and is a normalizing constant. (Conditioning is implicit.) When conditioning on Ï, we only need to calculate the ratio of likeli- hoods corresponding to row . (Note that this is not the case when the weights are integrated out.) È ½) and This ratio is a simple function of the model’s predictions Ü· Ð Ù Ú Ð Û Ð (when Ù È Ü Ù Ú Ð Û Ð (when Ù ¼). In the linear-Gaussian case: Ð ÐÓ È ´Ù È ´Ù ½ ¼ Í Î Ï Í Î Ï µ µ ÐÓ ´« à · Ò µ ¬ · ´Á ½µ Ò ´ µ ½ ¾ ¢ Ü ´ Ü· µ¾ ´Ü Ü µ¾ £ In the linear-Gaussian case, we can easily derive analogous Gibbs sampling updates for the weights and hyperparameters. To simplify the presentation, we consider a “vectorized” representation of our variables. Let be an Á column vector taken column-wise from , be a ÃÄ column vector taken column-wise from and be a Á ¢ ÃÄ binary matrix which is the kronecker product ª . (In “Matlab notation”, ´µ ´ µ and ÖÓÒ´ µ.) In this notation, the data distribution is written as: Æ´ ´½ µ µ. Given values for and , samples can be drawn for , , and using the following posterior distributions (where conditioning on Ó is implicit): Æ ´ · µ ½ ´ · Óµ ´ · µ ½ Ï Î Í Û Ü Û Û Ü Ï Ü Ü Û Û Ï Û Á Û ÎÍ Á Ü Û Í Á Î Û Û Ü · ÃÄ ¾ · Á ¾ · ½ ´Û ÛÓ µ ´Û ÛÓ µ ¾ ½ ´Ü Ûµ ´Ü Ûµ ·¾ Note that we do not have to explicitly compute the matrix . For computing the posterior of linearGaussian weights, the matrix can be computed as ÖÓÒ´ µ. Similarly, the expression is constructed by computing and taking the elements column-wise. Ü Î ÎÍ Í Í Î 3.2 Infinite binary latent feature matrices One of the most elegant aspects of non-parametric Bayesian modeling is the ability to use a prior which allows a countably infinite number of latent features. The number of instantiated features is automatically adjusted during inference and depends on the amount of data and how many features it supports. Remarkably, we can do MCMC sampling using such infinite priors with essentially no computational penalty over the finite case. To derive these updates (e.g. for row of the matrix ), it is useful to consider partitioning the columns of into two sets as shown below. Let set A have at least one non-zero entry in rows other than . Let set B be all other set A set B columns, including the set of columns where 0 1 0 0 1 0 0 0 0 0 ¡¡¡ the only non-zero entries are found in row 0 0 1 0 0 0 0 0 0 0 ¡¡¡ and the countably infinite number of all-zero 1 1 0 0 1 0 0 0 0 0 ¡¡¡ 1 0 0 1 1 0 0 0 0 0 ¡¡¡ columns. Sampling values for elements in row 1 1 0 0 1 0 1 0 1 0 row of set A given everything else is straightfor0 1 0 0 0 0 0 0 0 0 ¡¡¡ ward, and involves Gibbs updates almost iden0 0 0 1 0 0 0 0 0 0 ¡¡¡ tical to those in the finite case handled by equa1 0 0 0 1 0 0 0 0 0 ¡¡¡ tions (2) and (3); as à ½ and in set A we get: Í Í ½ Í Î Ï µ ¼ Í Î Ï µ È ´Ù È ´Ù ¡ Ò È ´ Í Ù ½ Î Ï µ ¡ ´¬ · Á ½ Ò µ È ´ Í Ù ¼ Πϵ (4) (5) When sampling new values for set B, the columns are exchangeable, and so we are really only interested in the number of entries Ò in set B which will be turned on in row . Sampling the number of entries set to ½ can be done with Metropolis-Hastings updates. Let  ´Ò Ò µ Poisson ´Ò « ´¬ · Á ½µµ be the proposal distribution for a move which replaces the current Ò active entries with Ò active entries in set B. The reverse proposal is  ´Ò Ò µ. The acceptance ¡ probability is Ñ Ò ½ ÖÒ Ò , where ÖÒ Ò is È ´Ò È ´Ò µ  ´Ò Ò µ µ  ´Ò Ò µ È´ Ò È´ Ò µ Poisson´Ò « ´¬ · Á ½µµÂ ´Ò Ò µ µ Poisson´Ò « ´¬ · Á ½µµÂ ´Ò Ò µ Ï È´ Ò È´ Ò µ µ (6) This assumes a conjugate situation in which the weights are explicitly integrated out of the model to compute the marginal likelihood È ´ Ò µ. In the non-conjugate case, a more complicated proposal is required. Instead of proposing Ò , we jointly propose Ò and associated feature parameters from their prior distributions. In the linear-Gaussian model, where is a set of weights for features in set B, the proposal distribution is: Û Û « ´¬ · Á ½µµ Normal ´Û Ò µ (7) We need actually sample only the finite portion of Û where Ù ½. As in the conjugate case, the  ´Ò Û Ò Ûµ Poisson ´Ò acceptance ratio reduces to the ratio of data likelihoods: ÖÒ Û Ò Û È´ Ò È´ Ò Ûµ Ûµ ÍÎ Ï (8) 3.3 Faster mixing transition proposals are the simplest moves we could The Gibbs updates described above for the entries of , and make in a Markov Chain Monte Carlo inference procedure for the BMF model. However, these limited local updates may result in extremely slow mixing. In practice, we often implement larger moves in indicator space using, for example, Metropolis-Hastings proposals on multiple features for row simultaneously. For example, we can propose new values for several columns in row of matrix by sampling feature values independently from their conditional priors. To compute the reverse proposal, we imagine forgetting the current configuration of those features for row and compute the probability under the conditional prior of proposing the current configuration. The acceptance probability of such a proposal is (the maximum of unity and) the ratio of likelihoods between the new proposed configuration and the current configuration. Í Split-merge moves may also be useful for efficiently sampling from the posterior distribution of the binary feature matrices. Jain and Neal [8] describe split-merge algorithms for Dirichlet process mixture models with non-conjugate component distributions. We have developed and implemented similar split-merge proposals for binary matrices with IBP priors. Due to space limitations, we present here only a sketch of the procedure. Two nonzero entries in are selected uniformly at random. If they are in the same column, we propose splitting that column; if they are in different columns, we propose merging their columns. The key difference between this algorithm and the Jain and Neal algorithm is that the binary features are not constrained to sum to unity in each row. Our split-merge algorithm also performs restricted Gibbs scans on columns of to increase acceptance probability. Í Í 3.4 Predictions A major reason for building generative models of data is to be able to impute missing data values given some observations. In the linear-Gaussian model, the predictive distribution at each iteration of the Markov chain is a Gaussian distribution. The interaction weights can be analytically integrated out at each iteration, also resulting in a Gaussian posterior, removing sampling noise contributed by having the weights explicitly represented. Computing the exact predictive distribution, however, conditional only on the model hyperparameters, is analytically intractable: it requires integrating over all binary matrices and , and all other nuisance parameters (e.g., the weights and precisions). Instead we integrate over these parameters implicitly by averaging predictive distributions from many MCMC iterations. This posterior, which is conditional only on the observed data and hyperparameters, is highly complex, potentially multimodal, and non-linear function of the observed variables. Í Î Í Î and . In our By averaging predictive distributions, our algorithm implicitly integrates over experiments, we show samples from the posteriors of and to help explain what the model is doing, but we stress that the posterior may have significant mass on many possible binary matrices. The number of features and their degrees of overlap will vary over MCMC iterations. Such variation will depend, for example, on the current value of « and (higher values will result in more features) and precision values (higher weight precision results in less variation in weights). Í Î 4 Experiments 4.1 Modified “bars” problem A toy problem commonly used to illustrate additive feature or multiple cause models is the bars problem ([2, 12, 1]). Vertical and horizontal bars are combined in some way to generate data samples. The goal of the illustration is to show recovery of the latent structure in the form of bars. We have modified the typical usage of bars to accommodate the linear-Gaussian BMF with infinite features. Data consists of Á vectors of size ¾ where each vector can be reshaped into a square image. The generation process is as follows: since has the same number of rows as the dimension of the images, is fixed to be a set of vertical and horizontal bars (when reshaped into an image). is sampled from the IBP, and global precisions and are set to ½ ¾. The weights are sampled from zero mean Gaussians. Model estimates of and were initialized from an IBP prior. Î Î Í Ï Î Í In Figure 2 we demonstrate the performance of the linear-Gaussian BMF on the bars data. We train the BMF with 200 training examples of the type shown in the top row in Figure 2. Some examples have their bottom halves labeled missing and are shown in the Figure with constant grey values. To handle this, we resample their values at each iteration of the Markov chain. The bottom row shows . Despite the relatively high the expected reconstruction using MCMC samples of , , and ÍÎ Ï noise levels in the data, the model is able to capture the complex relationships between bars and weights. The reconstruction of vertical bars is very good. The reconstruction of horizontal bars is good as well, considering that the model has no information regarding the existence of horizontal bars on the bottom half. (A) Data samples (B) Noise-free data (C) Initial reconstruction (D) Mean reconstruction (E) Nearest neighbour Figure 2: Bars reconstruction. (A) Bars randomly sampled from the complete dataset. The bottom half of these bars were removed and labeled missing during learning. (B) Noise-free versions of the same data. (C) The initial reconstruction. The missing values have been set to their expected value, ¼, to highlight the missing region. (D) The average MCMC reconstruction of the entire image. (E) Based solely on the information in the top-half of the original data, these are the noise-free nearest neighbours in pixel space. Î ÎÏ Î ÏÎ Figure 3: Bars features. The top row shows values of and used to generate the data. The second row shows a sample of and from the Markov chain. can be thought of as a set of basis images which can be added together with binary coefficients ( ) to create images. Î ÏÎ ÏÎ Í By examining the features captured by the model, we can understand the performance just described. In Figure 3 we show the generating, or true, values of and along with one sample of those basis features from the Markov chain. Because the model is generated by adding multiple images shown on the right of Figure 3, multiple bars are used in each image. This is reflected in the captured features. The learned are fairly similar to the generating , but the former are composed of overlapping bar structure (learned ). Î ÏÎ ÏÎ ÏÎ ÏÎ Î 4.2 Digits In Section 2 we briefly stated that BMF can be applied to data models other than the linear-Gaussian model. We demonstrate this with a logistic BMF applied to binarized images of handwritten digits. We train logistic BMF with 100 examples each of digits ½, ¾, and ¿ from the USPS dataset. In the first five rows of Figure 4 we again illustrate the ability of BMF to impute missing data values. The top row shows all 16 samples from the dataset which had their bottom halves labeled missing. Missing values are filled-in at each iteration of the Markov chain. In the third and fourth rows we show the mean and mode (È ´Ü ½µ ¼ ) of the BMF reconstruction. In the bottom row we have shown the nearest neighbors, in pixel space, to the training examples based only on the top halves of the original digits. In the last three rows of Figure 4 we show the features captured by the model. In row F, we show the average image of the data which have each feature in on. It is clear that some row features are shown. have distinct digit forms and others are overlapping. In row G, the basis images By adjusting the features that are non-zero in each row of , images are composed by adding basis images together. Finally, in row H we show . These pixel features mask out different regions in Î Í Í ÏÎ pixel space, which are weighted together to create the basis images. Note that there are à features in rows F and G, and Ä features in row H. (A) (B) (C) (D) (E) (F) (G) (H) Figure 4: Digits reconstruction. (A) Digits randomly sampled from the complete dataset. The bottom half of these digits were removed and labeled missing during learning. (B) The data shown to the algorithm. The top half is the original data value. (C) The mean of the reconstruction for the bottom halves. (D) The mode reconstruction of the bottom halves. (E) The nearest neighbours of the original data are shown in the bottom half, and were found based solely on the information from the top halves of the images. (F) The average of all digits for each feature. (G) The feature reshaped in the form of digits. By adding these features together, which the features do, reconstructions of the digits is possible. (H) reshaped into the form of digits. The first image represents a bias feature. ÏÎ Í Î Í 4.3 Gene expression data Gene expression data is able to exhibit multiple and overlapping clusters simultaneously; finding models for such complex data is an interesting and active research area ([10], [13]). The plaid model[10], originally introduced for analysis of gene expression data, can be thought of as a nonBayesian special case of our model in which the matrix is diagonal and the number of binary features is fixed. Our goal in this experiment is merely to illustrate qualitatively the ability of BMF to find multiple clusters in gene expression data, some of which are overlapping, others non-overlapping. The data in this experiment consists of rows corresponding to genes and columns corresponding to patients; the patients suffer from one of two types of acute Leukemia [4]. In Figure 5 we show the factorization produced by the final state in the Markov chain. The rows and columns of the data and its expected reconstruction are ordered such that contiguous regions in were observable. Some of the many feature pairings are highlighted. The BMF clusters consist of broad, overlapping clusters, and small, non-overlapping clusters. One of the interesting possibilities of using BMF to model gene expression data would be to fix certain columns of or with knowledge gained from experiments or literature, and to allow the model to add new features that help explain the data in more detail. Ï Í Î 5 Conclusion We have introduced a new model, binary matrix factorization, for unsupervised decomposition of dyadic data matrices. BMF makes use of non-parametric Bayesian methods to simultaneously discover binary distributed representations of both rows and columns of dyadic data. The model explains each row and column entity using a componential code composed of multiple binary latent features along with a set of parameters describing how the features interact to create the observed responses at each position in the matrix. BMF is based on a hierarchical Bayesian model and can be naturally extended to make use of a prior distribution which permits an infinite number of features, at very little extra computational cost. We have given MCMC algorithms for posterior inference of both the binary factors and the interaction parameters conditioned on some observed data, and (A) (B) Figure 5: Gene expression results. (A) The top-left is sorted according to contiguous features in the final and in the Markov chain. The bottom-left is and the top-right is . The bottomright is . (B) The same as (A), but the expected value of , . We have highlighted regions that have both Ù and Ú Ð on. For clarity, we have only shown the (at most) two largest contiguous regions for each feature pair. Í Ï Î Î ÍÏÎ Í demonstrated the model’s ability to capture overlapping structure and model complex joint distributions on a variety of data. BMF is fundamentally different from bi-clustering algorithms because of its distributed latent representation and from factorial models with continuous latent variables which interact linearly to produce the observations. This allows a much richer latent structure, which we believe makes BMF useful for many applications beyond the ones we outlined in this paper. References [1] P. Dayan and R. S. Zemel. Competition and multiple cause models. Neural Computation, 7(3), 1995. [2] P. Foldiak. Forming sparse representations by local anti-Hebbian learning. Biological Cybernetics, 64, 1990. [3] Z. Ghahramani. Factorial learning and the EM algorithm. In NIPS, volume 7. MIT Press, 1995. [4] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286(5439), 1999. [5] T. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. In NIPS, volume 18. MIT Press, 2005. [6] J. A. Hartigan. Direct clustering of a data matrix. Journal of the American Statistical Association, 67, 1972. [7] G. Hinton and R. S. Zemel. Autoencoders, minimum description length, and Helmholtz free energy. In NIPS, volume 6. Morgan Kaufmann, 1994. [8] S. Jain and R. M. Neal. Splitting and merging for a nonconjugate Dirichlet process mixture model. To appear in Bayesian Analysis. [9] C. Kemp, J. B. Tenebaum, T. L. Griffiths, T. Yamada, and N. Ueda. Learning systems of concepts with an infinite relational model. Proceedings of the Twenty-First National Conference on Artificial Intelligence, 2006. [10] L. Lazzeroni and A. Owen. Plaid models for gene expression data. Statistica Sinica, 12, 2002. [11] J. Pitman. Combinatorial stochastic processes. Lecture Notes for St. Flour Course, 2002. [12] E. Saund. A multiple cause mixture model for unsupervised learning. Neural Computation, 7(1), 1994. [13] R. Tibshirani, T. Hastie, M. Eisen, D. Ross, D. Botstein, and P. Brown. Clustering methods for the analysis of DNA microarray data. Technical report, Stanford University, 1999. Department of Statistics.
6 0.090481915 62 nips-2006-Correcting Sample Selection Bias by Unlabeled Data
7 0.088053286 135 nips-2006-Modelling transcriptional regulation using Gaussian Processes
8 0.085023902 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization
9 0.084399126 101 nips-2006-Isotonic Conditional Random Fields and Local Sentiment Flow
10 0.082486175 134 nips-2006-Modeling Human Motion Using Binary Latent Variables
11 0.081990726 172 nips-2006-Scalable Discriminative Learning for Natural Language Parsing and Translation
12 0.077111334 118 nips-2006-Learning to Model Spatial Dependency: Semi-Supervised Discriminative Random Fields
13 0.077105761 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation
14 0.075319797 64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors
15 0.070976049 106 nips-2006-Large Margin Hidden Markov Models for Automatic Speech Recognition
16 0.065345488 130 nips-2006-Max-margin classification of incomplete data
17 0.064512558 199 nips-2006-Unsupervised Learning of a Probabilistic Grammar for Object Detection and Parsing
18 0.062630966 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation
19 0.060804293 40 nips-2006-Bayesian Detection of Infrequent Differences in Sets of Time Series with Shared Structure
20 0.060292009 122 nips-2006-Learning to parse images of articulated bodies
topicId topicWeight
[(0, -0.2), (1, 0.031), (2, 0.063), (3, -0.109), (4, 0.033), (5, 0.066), (6, 0.014), (7, -0.064), (8, -0.02), (9, -0.059), (10, -0.259), (11, -0.014), (12, -0.172), (13, 0.117), (14, -0.14), (15, 0.041), (16, 0.08), (17, -0.042), (18, -0.015), (19, -0.078), (20, 0.162), (21, -0.023), (22, -0.115), (23, -0.219), (24, 0.081), (25, -0.099), (26, 0.001), (27, -0.056), (28, 0.115), (29, 0.04), (30, -0.067), (31, 0.021), (32, -0.128), (33, 0.084), (34, 0.112), (35, -0.021), (36, -0.048), (37, -0.145), (38, 0.049), (39, 0.042), (40, -0.009), (41, -0.117), (42, -0.152), (43, -0.039), (44, 0.088), (45, -0.121), (46, 0.014), (47, 0.002), (48, -0.02), (49, -0.012)]
simIndex simValue paperId paperTitle
same-paper 1 0.95369399 54 nips-2006-Comparative Gene Prediction using Conditional Random Fields
Author: Jade P. Vinson, David Decaprio, Matthew D. Pearson, Stacey Luoma, James E. Galagan
Abstract: Computational gene prediction using generative models has reached a plateau, with several groups converging to a generalized hidden Markov model (GHMM) incorporating phylogenetic models of nucleotide sequence evolution. Further improvements in gene calling accuracy are likely to come through new methods that incorporate additional data, both comparative and species specific. Conditional Random Fields (CRFs), which directly model the conditional probability P (y|x) of a vector of hidden states conditioned on a set of observations, provide a unified framework for combining probabilistic and non-probabilistic information and have been shown to outperform HMMs on sequence labeling tasks in natural language processing. We describe the use of CRFs for comparative gene prediction. We implement a model that encapsulates both a phylogenetic-GHMM (our baseline comparative model) and additional non-probabilistic features. We tested our model on the genome sequence of the fungal human pathogen Cryptococcus neoformans. Our baseline comparative model displays accuracy comparable to the the best available gene prediction tool for this organism. Moreover, we show that discriminative training and the incorporation of non-probabilistic evidence significantly improve performance. Our software implementation, Conrad, is freely available with an open source license at http://www.broad.mit.edu/annotation/conrad/. 1
2 0.717022 108 nips-2006-Large Scale Hidden Semi-Markov SVMs
Author: Gunnar Rätsch, Sören Sonnenburg
Abstract: We describe Hidden Semi-Markov Support Vector Machines (SHM SVMs), an extension of HM SVMs to semi-Markov chains. This allows us to predict segmentations of sequences based on segment-based features measuring properties such as the length of the segment. We propose a novel technique to partition the problem into sub-problems. The independently obtained partial solutions can then be recombined in an efficient way, which allows us to solve label sequence learning problems with several thousands of labeled sequences. We have tested our algorithm for predicting gene structures, an important problem in computational biology. Results on a well-known model organism illustrate the great potential of SHM SVMs in computational biology. 1
3 0.70839071 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy
Author: Samuel S. Gross, Olga Russakovsky, Chuong B. Do, Serafim Batzoglou
Abstract: We consider the problem of training a conditional random field (CRF) to maximize per-label predictive accuracy on a training set, an approach motivated by the principle of empirical risk minimization. We give a gradient-based procedure for minimizing an arbitrarily accurate approximation of the empirical risk under a Hamming loss function. In experiments with both simulated and real data, our optimization procedure gives significantly better testing performance than several current approaches for CRF training, especially in situations of high label noise. 1
4 0.66279775 101 nips-2006-Isotonic Conditional Random Fields and Local Sentiment Flow
Author: Yi Mao, Guy Lebanon
Abstract: We examine the problem of predicting local sentiment flow in documents, and its application to several areas of text analysis. Formally, the problem is stated as predicting an ordinal sequence based on a sequence of word sets. In the spirit of isotonic regression, we develop a variant of conditional random fields that is wellsuited to handle this problem. Using the M¨ bius transform, we express the model o as a simple convex optimization problem. Experiments demonstrate the model and its applications to sentiment prediction, style analysis, and text summarization. 1
5 0.52606106 135 nips-2006-Modelling transcriptional regulation using Gaussian Processes
Author: Neil D. Lawrence, Guido Sanguinetti, Magnus Rattray
Abstract: Modelling the dynamics of transcriptional processes in the cell requires the knowledge of a number of key biological quantities. While some of them are relatively easy to measure, such as mRNA decay rates and mRNA abundance levels, it is still very hard to measure the active concentration levels of the transcription factor proteins that drive the process and the sensitivity of target genes to these concentrations. In this paper we show how these quantities for a given transcription factor can be inferred from gene expression levels of a set of known target genes. We treat the protein concentration as a latent function with a Gaussian process prior, and include the sensitivities, mRNA decay rates and baseline expression levels as hyperparameters. We apply this procedure to a human leukemia dataset, focusing on the tumour repressor p53 and obtaining results in good accordance with recent biological studies.
6 0.42235032 118 nips-2006-Learning to Model Spatial Dependency: Semi-Supervised Discriminative Random Fields
7 0.39085799 132 nips-2006-Modeling Dyadic Data with Binary Latent Factors
8 0.39025745 100 nips-2006-Information Bottleneck for Non Co-Occurrence Data
9 0.38944846 172 nips-2006-Scalable Discriminative Learning for Natural Language Parsing and Translation
10 0.35375845 199 nips-2006-Unsupervised Learning of a Probabilistic Grammar for Object Detection and Parsing
11 0.34201932 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation
12 0.34106871 117 nips-2006-Learning on Graph with Laplacian Regularization
13 0.3371346 139 nips-2006-Multi-dynamic Bayesian Networks
14 0.31690732 122 nips-2006-Learning to parse images of articulated bodies
15 0.3161476 134 nips-2006-Modeling Human Motion Using Binary Latent Variables
16 0.31374517 106 nips-2006-Large Margin Hidden Markov Models for Automatic Speech Recognition
17 0.31174022 81 nips-2006-Game Theoretic Algorithms for Protein-DNA binding
18 0.30646071 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization
19 0.30293131 90 nips-2006-Hidden Markov Dirichlet Process: Modeling Genetic Recombination in Open Ancestral Space
20 0.29847917 174 nips-2006-Similarity by Composition
topicId topicWeight
[(1, 0.075), (3, 0.02), (7, 0.06), (9, 0.033), (22, 0.053), (44, 0.067), (57, 0.076), (65, 0.032), (69, 0.036), (71, 0.019), (76, 0.027), (90, 0.42)]
simIndex simValue paperId paperTitle
same-paper 1 0.81894791 54 nips-2006-Comparative Gene Prediction using Conditional Random Fields
Author: Jade P. Vinson, David Decaprio, Matthew D. Pearson, Stacey Luoma, James E. Galagan
Abstract: Computational gene prediction using generative models has reached a plateau, with several groups converging to a generalized hidden Markov model (GHMM) incorporating phylogenetic models of nucleotide sequence evolution. Further improvements in gene calling accuracy are likely to come through new methods that incorporate additional data, both comparative and species specific. Conditional Random Fields (CRFs), which directly model the conditional probability P (y|x) of a vector of hidden states conditioned on a set of observations, provide a unified framework for combining probabilistic and non-probabilistic information and have been shown to outperform HMMs on sequence labeling tasks in natural language processing. We describe the use of CRFs for comparative gene prediction. We implement a model that encapsulates both a phylogenetic-GHMM (our baseline comparative model) and additional non-probabilistic features. We tested our model on the genome sequence of the fungal human pathogen Cryptococcus neoformans. Our baseline comparative model displays accuracy comparable to the the best available gene prediction tool for this organism. Moreover, we show that discriminative training and the incorporation of non-probabilistic evidence significantly improve performance. Our software implementation, Conrad, is freely available with an open source license at http://www.broad.mit.edu/annotation/conrad/. 1
2 0.78738868 1 nips-2006-A Bayesian Approach to Diffusion Models of Decision-Making and Response Time
Author: Michael D. Lee, Ian G. Fuss, Daniel J. Navarro
Abstract: We present a computational Bayesian approach for Wiener diffusion models, which are prominent accounts of response time distributions in decision-making. We first develop a general closed-form analytic approximation to the response time distributions for one-dimensional diffusion processes, and derive the required Wiener diffusion as a special case. We use this result to undertake Bayesian modeling of benchmark data, using posterior sampling to draw inferences about the interesting psychological parameters. With the aid of the benchmark data, we show the Bayesian account has several advantages, including dealing naturally with the parameter variation needed to account for some key features of the data, and providing quantitative measures to guide decisions about model construction. 1
3 0.43792629 101 nips-2006-Isotonic Conditional Random Fields and Local Sentiment Flow
Author: Yi Mao, Guy Lebanon
Abstract: We examine the problem of predicting local sentiment flow in documents, and its application to several areas of text analysis. Formally, the problem is stated as predicting an ordinal sequence based on a sequence of word sets. In the spirit of isotonic regression, we develop a variant of conditional random fields that is wellsuited to handle this problem. Using the M¨ bius transform, we express the model o as a simple convex optimization problem. Experiments demonstrate the model and its applications to sentiment prediction, style analysis, and text summarization. 1
4 0.43296355 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space
Author: Wenye Li, Kin-hong Lee, Kwong-sak Leung
Abstract: Kernel-based regularized learning seeks a model in a hypothesis space by minimizing the empirical error and the model’s complexity. Based on the representer theorem, the solution consists of a linear combination of translates of a kernel. This paper investigates a generalized form of representer theorem for kernel-based learning. After mapping predefined features and translates of a kernel simultaneously onto a hypothesis space by a specific way of constructing kernels, we proposed a new algorithm by utilizing a generalized regularizer which leaves part of the space unregularized. Using a squared-loss function in calculating the empirical error, a simple convex solution is obtained which combines predefined features with translates of the kernel. Empirical evaluations have confirmed the effectiveness of the algorithm for supervised learning tasks.
5 0.40257838 108 nips-2006-Large Scale Hidden Semi-Markov SVMs
Author: Gunnar Rätsch, Sören Sonnenburg
Abstract: We describe Hidden Semi-Markov Support Vector Machines (SHM SVMs), an extension of HM SVMs to semi-Markov chains. This allows us to predict segmentations of sequences based on segment-based features measuring properties such as the length of the segment. We propose a novel technique to partition the problem into sub-problems. The independently obtained partial solutions can then be recombined in an efficient way, which allows us to solve label sequence learning problems with several thousands of labeled sequences. We have tested our algorithm for predicting gene structures, an important problem in computational biology. Results on a well-known model organism illustrate the great potential of SHM SVMs in computational biology. 1
6 0.39923733 174 nips-2006-Similarity by Composition
7 0.39853749 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy
8 0.38681456 118 nips-2006-Learning to Model Spatial Dependency: Semi-Supervised Discriminative Random Fields
9 0.38366938 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation
10 0.37898335 8 nips-2006-A Nonparametric Approach to Bottom-Up Visual Saliency
11 0.37711191 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis
12 0.37670913 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization
13 0.37370092 106 nips-2006-Large Margin Hidden Markov Models for Automatic Speech Recognition
14 0.37284726 43 nips-2006-Bayesian Model Scoring in Markov Random Fields
15 0.36912316 16 nips-2006-A Theory of Retinal Population Coding
16 0.36743662 100 nips-2006-Information Bottleneck for Non Co-Occurrence Data
17 0.36650074 115 nips-2006-Learning annotated hierarchies from relational data
18 0.36645439 15 nips-2006-A Switched Gaussian Process for Estimating Disparity and Segmentation in Binocular Stereo
19 0.36550629 29 nips-2006-An Information Theoretic Framework for Eukaryotic Gradient Sensing
20 0.3646999 91 nips-2006-Hierarchical Dirichlet Processes with Random Effects