nips nips2005 nips2005-103 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jean-philippe Vert, Robert Thurman, William S. Noble
Abstract: We describe a hierarchy of motif-based kernels for multiple alignments of biological sequences, particularly suitable to process regulatory regions of genes. The kernels incorporate progressively more information, with the most complex kernel accounting for a multiple alignment of orthologous regions, the phylogenetic tree relating the species, and the prior knowledge that relevant sequence patterns occur in conserved motif blocks. These kernels can be used in the presence of a library of known transcription factor binding sites, or de novo by iterating over all k-mers of a given length. In the latter mode, a discriminative classifier built from such a kernel not only recognizes a given class of promoter regions, but as a side effect simultaneously identifies a collection of relevant, discriminative sequence motifs. We demonstrate the utility of the motif-based multiple alignment kernels by using a collection of aligned promoter regions from five yeast species to recognize classes of cell-cycle regulated genes. Supplementary data is available at http://noble.gs.washington.edu/proj/pkernel. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Kernels for gene regulatory regions. Jean-Philippe Vert, Geostatistics Center, Ecole des Mines de Paris - ParisTech. [sent-1, score-0.448]
2 Abstract. We describe a hierarchy of motif-based kernels for multiple alignments of biological sequences, particularly suitable to process regulatory regions of genes. [sent-7, score-0.585]
3 The kernels incorporate progressively more information, with the most complex kernel accounting for a multiple alignment of orthologous regions, the phylogenetic tree relating the species, and the prior knowledge that relevant sequence patterns occur in conserved motif blocks. [sent-8, score-1.346]
4 These kernels can be used in the presence of a library of known transcription factor binding sites, or de novo by iterating over all k-mers of a given length. [sent-9, score-0.461]
5 In the latter mode, a discriminative classifier built from such a kernel not only recognizes a given class of promoter regions, but as a side effect simultaneously identifies a collection of relevant, discriminative sequence motifs. [sent-10, score-0.841]
6 We demonstrate the utility of the motif-based multiple alignment kernels by using a collection of aligned promoter regions from five yeast species to recognize classes of cell-cycle regulated genes. [sent-11, score-1.414]
7 These switches typically contain multiple binding site motifs, each of length 5–15 nucleotides, for a class of DNA-binding proteins known as transcription factors. [sent-18, score-0.388]
8 As a result, the detection of such regulatory motifs proximal to a gene provides important clues about its regulation and, therefore, its function. [sent-19, score-0.666]
9 These regulatory motifs, however, usually represent a tiny fraction of the intergenic sequence, and their automatic detection remains extremely challenging. [sent-21, score-0.395]
10 For well-studied transcription factors, libraries of known binding site motifs can be used to scan the intergenic sequence. [sent-22, score-0.761]
11 A common approach for the de novo detection of regulatory motifs is to start from a set of genes known to be similarly regulated, for example by clustering gene expression data, and search for over-represented short sequences in their proximal intergenic regions. [sent-23, score-1.284]
12 Alternatively, some authors have proposed to represent each intergenic sequence by its content of short sequences (k-mers), and to correlate this representation with gene expression data [1]. [sent-24, score-0.473]
13 Finally, additional information to characterize regulatory motifs can be gained by comparing the intergenic sequences of orthologous genes, i.e., [sent-25, score-0.961]
14 genes from different species that have evolved from a common ancestor, because regulatory motifs are more conserved than non-functional intergenic DNA [2]. [sent-27, score-1.174]
15 We propose in this paper a hierarchy of increasingly complex representations for intergenic sequences. [sent-28, score-0.245]
16 Each representation yields a positive definite kernel between intergenic sequences. [sent-29, score-0.346]
17 While various motif-based sequence kernels have been described in the literature (e.g., [sent-30, score-0.22]
18 [3, 4, 5]), these kernels typically operate on sequences from a single species, ignoring relevant information from orthologous sequences. [sent-32, score-0.467]
19 In contrast, our hierarchy of motif-based kernels accounts for a multiple alignment of orthologous regions, the phylogenetic tree relating the species, and the prior knowledge that relevant sequence patterns occur in conserved motif blocks. [sent-33, score-1.218]
20 These kernels can be used in the presence of a library of known transcription factor binding sites, or de novo by iterating over all k-mers of a given length. [sent-34, score-0.461]
21 In the latter mode, a discriminative classifier built from such a kernel not only recognizes a given class of regulatory sequences, but as a side effect simultaneously identifies a collection of discriminative sequence motifs. [sent-35, score-0.552]
22 We demonstrate the utility of the motif-based multiple alignment kernels by using a collection of aligned intergenic regions from five yeast species to recognize classes of co-regulated genes. [sent-36, score-1.14]
23 All kernels were designed before any experiment was conducted, and we then performed an objective empirical evaluation of each kernel without further parameter optimization. [sent-39, score-0.298]
24 2 Kernels for intergenic sequences. In a complex eukaryotic genome, regulatory switches may occur anywhere within a relatively large genomic region near a given gene. [sent-43, score-0.624]
25 In this work we focus on a well-studied model organism, the budding yeast Saccharomyces cerevisiae, in which the typical intergenic region is less than 1000 bases long. [sent-44, score-0.371]
26 We refer to the intergenic region upstream of a yeast gene as its promoter region. [sent-45, score-0.987]
27 Denoting the four-letter set of nucleotides as A = {A, C, G, T}, the promoter region of a gene is a finite-length sequence of nucleotides x ∈ A∗ = ∪_{i=0}^{∞} A^i. [sent-46, score-0.901]
28 Given several sequenced organisms, in silico comparison of genes between organisms often allows the detection of orthologous genes, that is, genes that evolved from a common ancestor. [sent-47, score-0.518]
29 If the species are evolutionarily close, as are different yeast strains, then the promoter regions are usually quite similar and can be represented as a multiple alignment. [sent-48, score-0.977]
30 Each position in this alignment represents one letter in the shared ancestor’s promoter region. [sent-49, score-0.705]
31 Mathematically speaking, a multiple alignment of length n of p sequences is a sequence of columns c = c_1, c_2, …, c_n, where each column c_i contains one (possibly gap) symbol from each of the p sequences. [sent-50, score-0.377]
32 We are now in a position to describe a family of representations and kernels for promoter regions, incorporating an increasing amount of prior knowledge about the properties of regulatory motifs. [sent-58, score-0.91]
33 All kernels below are simple inner products between vector representations of promoter regions. [sent-59, score-0.636]
34 A promoter region P (either a single sequence or a multiple alignment) is therefore always represented by a vector Φ_M(P) = (Φ_a(P))_{a∈M}. [sent-63, score-0.636]
35 Motif kernel on a single sequence. The simplest approach to index a single promoter region x ∈ A∗ with a motif alphabet M is to define Φ_a^{Spectrum}(x) = n_a(x), ∀a ∈ M, where n_a(x) counts the number of occurrences of a in x. [sent-64, score-0.755]
36 When M = A^d (the set of all k-mers of length d), the resulting kernel is the spectrum kernel [3] between single promoter regions. [sent-65, score-0.866]
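As a concrete illustration, the sketch below (a minimal sketch, not the authors' code; the helper names and toy sequences are hypothetical) computes the motif count vector and the corresponding kernel as a plain inner product, either over a supplied motif library M or over all k-mers of a fixed length d.

```python
from itertools import product


def motif_counts(x, motifs):
    """Count n_a(x), the number of occurrences of each motif a in the sequence x."""
    return {a: sum(1 for i in range(len(x) - len(a) + 1) if x[i:i + len(a)] == a)
            for a in motifs}


def motif_kernel(x1, x2, motifs):
    """K(x1, x2) = <Phi(x1), Phi(x2)>, the inner product of the two count vectors."""
    c1, c2 = motif_counts(x1, motifs), motif_counts(x2, motifs)
    return sum(c1[a] * c2[a] for a in motifs)


if __name__ == "__main__":
    # M = A^d with d = 5: iterate over all 5-mers (the de novo mode described above)
    all_5mers = ["".join(p) for p in product("ACGT", repeat=5)]
    print(motif_kernel("ACGTACGTAAACG", "TTACGTACGAAAC", all_5mers))
```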
37 Motif kernel on multiple sequences. When a gene has p orthologs in other species, then a set of p promoter regions {x_1, x_2, …, x_p} is available; the summation kernel represents this set by the sum of the individual spectrum vectors, Φ_a^{Summation} = ∑_{k=1}^{p} Φ_a^{Spectrum}(x_k), ∀a ∈ M. [sent-66, score-1.096]
38 It is essentially the spectrum kernel on the concatenation of the available promoter regions—ignoring, however, k-mers that overlap different sequences in the concatenation. [sent-75, score-0.842]
39 This summation is attractive for at least two reasons. First, if all promoters contain common functional motifs and randomly varying non-functional motifs, then the signal-to-noise ratio of the relevant regulatory features relative to irrelevant non-functional features increases when we take the sum (or mean) of the individual feature vectors. [sent-77, score-0.697]
40 Second, even functional motifs representing transcription factor binding sites are known to have some variability in some positions, and merging the occurrences of a similar motif in different sequences is a way to model this flexibility in the framework of a vector representation. [sent-78, score-0.992]
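A minimal sketch of the summation representation under these assumptions (helper and variable names are hypothetical): the per-species count vectors are simply added before the inner product is taken.

```python
def count_occurrences(x, a):
    """Number of (possibly overlapping) occurrences of motif a in sequence x."""
    return sum(1 for i in range(len(x) - len(a) + 1) if x[i:i + len(a)] == a)


def summation_features(orthologous_promoters, motifs):
    """Phi_a^Summation = sum over the available orthologous sequences of n_a(x_k)."""
    return {a: sum(count_occurrences(x, a) for x in orthologous_promoters)
            for a in motifs}


def summation_kernel(promoters1, promoters2, motifs):
    """Inner product between the two summed count vectors."""
    p1 = summation_features(promoters1, motifs)
    p2 = summation_features(promoters2, motifs)
    return sum(p1[a] * p2[a] for a in motifs)
```

Taking the mean instead of the sum only rescales the feature vectors and leaves the induced classifier essentially unchanged.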
41 Marginalized motif kernel on a multiple alignment. The summation kernel might suffer from at least two limitations. [sent-79, score-0.757]
42 Suppose, for example, that two of three species being compared diverged from each other only very recently. Then the promoter regions of two out of three orthologs would be virtually identical, giving an unjustified double weight to this duplicated species compared to the third one in the summation kernel. [sent-82, score-0.878]
43 In order to overcome these limitations, we propose to transform the set of promoter regions into a multiple alignment. [sent-85, score-0.65]
44 We therefore assume that a fixed number q of species has been selected, and that a probabilistic model p(h, c̄), relating an ancestral nucleotide h ∈ A to an aligned column c̄ ∈ Ā^q (where Ā is the alphabet A extended with the gap symbol), has been tuned on these species. [sent-86, score-0.197]
45 We also assume that, for each group of orthologous genes, the q promoter regions from the q species have been turned into a multiple alignment. [sent-89, score-1.135]
46 Given a multiple alignment c = c_1, c_2, …, c_n, suppose for the moment that we know the corresponding true sequence of nucleotides of the common ancestor, h = h_1, h_2, …, h_n. [sent-93, score-0.233]
47 Then the spectrum of the sequence h, that is, Φ_M^{Spectrum}(h), would be a good summary of the multiple alignment, and the inner product between two such spectra would be a candidate kernel between multiple alignments. [sent-97, score-0.409]
48 The sequence h being of course unknown, we propose to estimate its conditional probability given the multiple alignment c, under the model where all columns are independent and identically distributed according to the evolutionary model, that is, p(h|c) = ∏_{i=1}^{n} p(h_i | c_i). [sent-98, score-0.317]
49 Under this probabilistic model, it is now possible to define the representation of the multiple alignment as the expectation of the spectrum representation of h with respect to this conditional probability, that is: Φ_a^{Marginalized}(c) = ∑_{h ∈ A^n} Φ_a^{Spectrum}(h) p(h|c), ∀a ∈ M. [sent-99, score-0.328]
50 We call the resulting kernel the marginalized kernel because it corresponds to the marginalization of the spectrum kernel under the phylogenetic probabilistic model [7]. [sent-115, score-0.831]
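Because the columns are modelled independently, this expectation can be computed window by window. The sketch below is a simplified illustration, assuming the per-column posteriors p(h_i | c_i) have already been derived from the phylogenetic model and are passed in as dictionaries; all names are hypothetical.

```python
def marginalized_features(column_posteriors, motifs):
    """Expected motif counts of the hidden ancestral sequence h, given the alignment.

    column_posteriors[i] is a dict {"A", "C", "G", "T"} -> p(h_i = nucleotide | c_i).
    Under column independence, the expected number of occurrences of motif a is
    sum_i  prod_j  p(h_{i+j} = a_j | c_{i+j}).
    """
    n = len(column_posteriors)
    phi = {}
    for a in motifs:
        d = len(a)
        expected = 0.0
        for i in range(n - d + 1):
            window = 1.0
            for j, letter in enumerate(a):
                window *= column_posteriors[i + j].get(letter, 0.0)
            expected += window
        phi[a] = expected
    return phi
```

The marginalized kernel between two alignments is then the plain inner product of these expected count vectors.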
51 Marginalized motif kernel with phylogenetic shadowing. The marginalized kernel is expected to be useful when relevant information is distributed along the entire length of the sequences analyzed. [sent-116, score-1.005]
52 In the case of promoter regions, however, the relevant information is more likely to be located within a few short motifs. [sent-117, score-0.574]
53 Because only a small fraction of each promoter region lies within such motifs, this information is likely to be lost when the whole sequence is represented by its spectrum. [sent-118, score-0.663]
54 In order to overcome this limitation, we exploit the observation that relevant motifs are more evolutionarily conserved on average than the surrounding sequence. [sent-119, score-0.521]
55 This hypothesis has been confirmed by many studies that show that functional parts, being under more evolutionary pressure, are more conserved than non-functional ones. [sent-120, score-0.23]
56 Given a multiple alignment c, let us assume (temporarily) that we know which parts are relevant. [sent-121, score-0.197]
57 This knowledge can be encoded as a sequence s = s_1, s_2, …, s_n ∈ {0, 1}^n, where s_i = 1 means that the ith position is relevant, and s_i = 0 that it is irrelevant. [sent-125, score-0.256]
58 A typical such sequence for a promoter region consists primarily of 0's, except for a few positions marking the transcription factor binding motifs. [sent-126, score-0.907]
59 Then it would make sense to use a spectrum kernel based on the spectrum of h restricted to the relevant positions only. [sent-128, score-0.461]
60 In other words, all positions where si = 0 could be thrown away, in order to focus only on the relevant positions. [sent-129, score-0.213]
61 Given only a multiple alignment c, the sequences h and s are not known but can be estimated. [sent-137, score-0.314]
62 This is where the hypothesis that relevant nucleotides are more conserved than irrelevant ones can be encoded, by using two models of evolution with different mutation rates, as in phylogenetic shadowing [2]. [sent-138, score-0.605]
63 Given these two models of evolution, let us also define a prior probability p(s) that a position is relevant or not (related to the proportion of relevant parts we expect in a promoter region), and prior probabilities for the ancestor nucleotide p(h|s = 0) and p(h|s = 1). [sent-141, score-0.849]
64 The joint probability of being in state s, having an ancestor nucleotide h and a resulting alignment c is then p(c, h, s) = p(s)p(h|s)p(c|h, s). [sent-142, score-0.287]
65 The posterior probability that a given position is relevant is then obtained by Bayes' rule: p(s = 1|c) = p(s = 1)p(c|s = 1) / [p(s = 0)p(c|s = 0) + p(s = 1)p(c|s = 1)]. Moreover, it can easily be seen that, like the marginalized kernel, the shadow kernel is the marginalization of the kernel corresponding to Φ^{Relevant} with respect to p(h, s|c). [sent-152, score-0.559]
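The sketch below illustrates, under the independence assumptions of this section, how the per-column quantities used by the shadow kernel might be computed. It is not the authors' implementation: the likelihoods of a column under the slow (conserved) and fast (neutral) models of evolution are left abstract (in practice they would be evaluated on the phylogenetic tree), and all names are hypothetical.

```python
NUCLEOTIDES = "ACGT"


def shadow_column_posteriors(column, prior_relevant, p_h_given_s, column_likelihood):
    """Posterior relevance and ancestor distribution for one alignment column.

    prior_relevant            : p(s = 1), prior that a position is relevant
    p_h_given_s[s][h]         : prior on the ancestral nucleotide given relevance s
    column_likelihood(c, h, s): p(c | h, s) under the slow (s = 1) or fast (s = 0)
                                model of evolution
    """
    # p(c | s) = sum_h p(h | s) p(c | h, s)
    p_c_given_s = {s: sum(p_h_given_s[s][h] * column_likelihood(column, h, s)
                          for h in NUCLEOTIDES)
                   for s in (0, 1)}
    evidence = ((1.0 - prior_relevant) * p_c_given_s[0]
                + prior_relevant * p_c_given_s[1])
    # p(s = 1 | c): posterior probability that this position is relevant
    post_relevant = prior_relevant * p_c_given_s[1] / evidence
    # p(h, s = 1 | c): joint posterior used when accumulating motif weights
    post_h_and_relevant = {h: prior_relevant * p_h_given_s[1][h]
                              * column_likelihood(column, h, 1) / evidence
                           for h in NUCLEOTIDES}
    return post_relevant, post_h_and_relevant
```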
66 Incorporating Markov dependencies between positions. The probabilistic model used in the shadow kernel treats each position independently of the others. [sent-153, score-0.435]
67 As a result, a conserved position makes the same contribution to the shadow kernel whether it is surrounded by other conserved positions or by varying ones. [sent-154, score-0.593]
68 Once again, this feature vector can be computed as a sum of window weights over the alignment: Φ_a^{Markov}(c) = ∑_{i=1}^{n−d+1} p(s_i = 1|c) p(h_i = a_1 | c_i, s_i = 1) × ∏_{j=1}^{d−1} p(h_{i+j} = a_{j+1}, s_{i+j} = 1 | c_{i+j}, s_{i+j−1} = 1). [sent-157, score-0.22]
The main difference from the computation of the shadow kernel is the need to compute the term p(s_i = 1|c), which can be done using the general sum-product algorithm. [sent-158, score-0.305]
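A minimal sketch of that sum-product computation for a two-state relevance chain follows (a standard forward-backward recursion; the interface and variable names are assumptions, and no rescaling is included, so long alignments would need log-space or normalized recursions).

```python
def relevance_posteriors(column_likelihoods, initial, transition):
    """Posterior marginals p(s_i = 1 | c) for a two-state Markov chain of relevance.

    column_likelihoods[i][s] : p(c_i | s_i = s)
    initial[s]               : p(s_1 = s)
    transition[s_prev][s]    : p(s_{i+1} = s | s_i = s_prev)
    """
    n = len(column_likelihoods)
    # forward pass: alpha[i][s] = p(c_1..c_i, s_i = s)
    alpha = [[initial[s] * column_likelihoods[0][s] for s in (0, 1)]]
    for i in range(1, n):
        alpha.append([column_likelihoods[i][s]
                      * sum(alpha[i - 1][t] * transition[t][s] for t in (0, 1))
                      for s in (0, 1)])
    # backward pass: beta[i][s] = p(c_{i+1}..c_n | s_i = s)
    beta = [[1.0, 1.0] for _ in range(n)]
    for i in range(n - 2, -1, -1):
        beta[i] = [sum(transition[s][t] * column_likelihoods[i + 1][t] * beta[i + 1][t]
                       for t in (0, 1))
                   for s in (0, 1)]
    # combine: p(s_i = 1 | c) is proportional to alpha[i][1] * beta[i][1]
    posteriors = []
    for i in range(n):
        normalizer = sum(alpha[i][s] * beta[i][s] for s in (0, 1))
        posteriors.append(alpha[i][1] * beta[i][1] / normalizer)
    return posteriors
```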
70 3 Experiments. We measure the utility of our hierarchy of kernels in a cross-validated, supervised learning framework. [sent-159, score-0.237]
71 We hypothesize that co-expression implies co-regulation of a given group of genes by a common set of transcription factors. [sent-162, score-0.299]
72 Hence, the promoter regions of these genes should be enriched for a common set of transcription factor binding motifs. [sent-163, score-0.851]
73 We test the ability of a support vector machine (SVM) classifier to learn to recapitulate the co-expression classes, based only upon the promoter regions. [sent-164, score-0.479]
74 Our results show that the SVM performance improves as we incorporate more prior knowledge into the promoter kernel. [sent-165, score-0.54]
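For illustration, this evaluation protocol can be reproduced with any SVM implementation that accepts a precomputed kernel matrix; the snippet below is a sketch using scikit-learn (which postdates this work) and random stand-in data in place of the promoter kernels and gene-class labels.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Random stand-in data: in the experiments, K would be one of the promoter kernels
# described above and y the +/-1 membership labels for one class of co-expressed genes.
rng = np.random.default_rng(0)
features = rng.normal(size=(60, 40))          # placeholder feature vectors Phi(P_i)
K = features @ features.T                     # precomputed kernel matrix
y = np.where(features[:, 0] > 0.0, 1, -1)     # placeholder class labels

svm = SVC(kernel="precomputed")
roc_scores = cross_val_score(svm, K, y, cv=3, scoring="roc_auc")
print("mean ROC score:", roc_scores.mean())
```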
75 We collected the promoter regions from five closely related yeast species [9, 10]. [sent-166, score-0.893]
76 Promoter regions from orthologous genes were aligned using ClustalW, discarding promoter regions that aligned with less than 30% sequence identity relative to the other sequences in the alignment. [sent-167, score-1.277]
77 For the phylogenetic kernels, we inferred a phylogenetic tree among the five yeast species from alignments of four highly conserved proteins—MCM2, MCM3, CDC47 and MCM6. [sent-169, score-0.796]
78 For every gene class, the worst-performing kernel is one of the three simplest kernels: “simple,” “summation,” or “marginalization.” [sent-181, score-0.278]
The mean ROC scores across all eight classes for these three kernels are reported in Table 1. [sent-182, score-0.261]
80 Classification performance improves dramatically using the shadow kernel with either a small (2) or large (5) ratio of fast-to-slow evolutionary rates. [sent-186, score-0.362]
81 Furthermore, for five of the eight gene classes, one of the two shadow kernels is the best-performing kernel. [sent-190, score-0.493]
82 The Markov kernel performs approximately as well as the shadow kernel. [sent-191, score-0.305]
83 Although further tuning might improve these scores, recall that all kernels were evaluated without additional parameter optimization. Table 1: Mean ROC scores for SVMs trained using various kernels to recognize classes of co-expressed yeast genes. [sent-196, score-0.384]
84 The classes of genes (columns) are, respectively, ATP synthesis, DNA replication, glycolysis, mitochondrial ribosome, proteasome, spindle-pole body, splicing and TCA cycle. [sent-203, score-0.239]
85 For the shadow and Markov kernels, the values “2” and “5” refer to the ratio of fast to slow evolutionary rates. [sent-205, score-0.221]
Table 1 rows (kernels): single, summation, marginalized, shadow 2, shadow 5, Markov 2 90/90, Markov 2 90/99, Markov 2 99/99, Markov 5 90/90, Markov 5 90/99, Markov 5 99/99 (numeric ROC entries omitted). [sent-207, score-0.497]
To assess the biological relevance of the motifs identified by the SVM, we used known transcription factor binding models from the JASPAR database, searching each class of promoter regions for motif occurrences using MONKEY. [sent-311, score-0.627]
88 For each gene class, we identify the three JASPAR motifs that occur most frequently within that class, and we create a list of all 5-mers that appear within those motif occurrences. [sent-315, score-0.671]
89 Table 2 indicates that the discriminative 5-mers identified by the SVM are significantly enriched in 5-mers that appear within biologically significant motif regions, with significant p-values for all eight gene classes (see caption for details). [sent-319, score-0.498]
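One standard way to recover such discriminative k-mers from a trained SVM, when the kernel has an explicit finite feature map over k-mers, is to reconstruct the primal weight vector w_a = ∑_i α_i y_i Φ_a(P_i) and rank the k-mers by weight. The sketch below assumes a scikit-learn SVC fitted on a precomputed kernel built from an explicit feature matrix Phi; it is an illustration under these assumptions, not necessarily the authors' exact procedure.

```python
import numpy as np
from sklearn.svm import SVC


def discriminative_kmers(Phi, y, motifs, top=10):
    """Train an SVM on the precomputed kernel Phi Phi^T and rank k-mers by weight.

    Phi    : (n_sequences, n_motifs) array of feature values Phi_a(P_i)
    y      : +/-1 labels for one class of co-expressed genes
    motifs : the k-mers indexing the columns of Phi
    """
    K = Phi @ Phi.T
    svm = SVC(kernel="precomputed").fit(K, y)
    # Primal weights: w = sum_i alpha_i y_i Phi(P_i); dual_coef_ holds alpha_i * y_i
    w = svm.dual_coef_[0] @ Phi[svm.support_]
    ranking = np.argsort(w)[::-1][:top]
    return [(motifs[i], float(w[i])) for i in ranking]
```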
90 4 Conclusion. We have described and demonstrated the utility of a class of kernels for characterizing gene regulatory regions. [sent-320, score-0.551]
91 These kernels allow us to incorporate prior knowledge about the evolution of a set of orthologous sequences and the conservation of transcription factor binding site motifs. [sent-321, score-0.753]
92 We have also demonstrated that the motifs identified by an SVM trained using these kernels correspond to biologically significant motif regions. [sent-322, score-0.691]
93 Our future work will focus on automating the process of agglomerating the identified k-mers into a smaller set of motif models, and on applying these kernels in combination with gene expression, protein-protein interaction and other genome-wide data sets. [sent-323, score-0.516]
94 In Table 2, row two gives the number of 5-mers constructed from JASPAR motif occurrences in the 5-species alignments. [sent-327, score-0.25]
95 Visualizing associations between genome sequences and gene expression data using genome-mean expression profiles. [sent-355, score-0.366]
96 Phylogenetic shadowing of primate sequences to find functional regions of the human genome. [sent-368, score-0.343]
97 The spectrum kernel: A string kernel for SVM protein classification. [sent-375, score-0.246]
98 Finding functional features in Saccharomyces genomes by phylogenetic footprinting. [sent-431, score-0.218]
99 Sequencing and comparison of yeast species to identify genes and regulatory elements. [sent-434, score-0.66]
100 fastDNAmL: a tool for construction of phylogenetic trees of DNA sequences using maximum likelihood. [sent-437, score-0.281]
wordName wordTfidf (topN-words)
[('promoter', 0.479), ('motifs', 0.312), ('motif', 0.222), ('intergenic', 0.205), ('regulatory', 0.19), ('genes', 0.177), ('species', 0.171), ('shadow', 0.164), ('phylogenetic', 0.164), ('kernels', 0.157), ('alignment', 0.147), ('kernel', 0.141), ('orthologous', 0.137), ('gene', 0.137), ('yeast', 0.122), ('transcription', 0.122), ('regions', 0.121), ('conserved', 0.119), ('sequences', 0.117), ('marginalized', 0.113), ('hi', 0.112), ('spectrum', 0.105), ('si', 0.103), ('binding', 0.095), ('nucleotides', 0.089), ('ci', 0.086), ('ancestor', 0.081), ('svm', 0.073), ('dna', 0.072), ('sequence', 0.063), ('nucleotide', 0.059), ('roc', 0.058), ('evolutionary', 0.057), ('summation', 0.056), ('relevant', 0.056), ('functional', 0.054), ('genome', 0.054), ('positions', 0.054), ('jaspar', 0.051), ('nonfunctional', 0.051), ('novo', 0.051), ('orthologs', 0.051), ('shadowing', 0.051), ('tca', 0.051), ('multiple', 0.05), ('position', 0.05), ('markov', 0.049), ('atp', 0.045), ('region', 0.044), ('sites', 0.042), ('aj', 0.041), ('hierarchy', 0.04), ('utility', 0.04), ('short', 0.039), ('discriminative', 0.038), ('evolution', 0.037), ('scores', 0.037), ('recognize', 0.036), ('library', 0.036), ('eight', 0.035), ('identi', 0.035), ('enriched', 0.034), ('eukaryotic', 0.034), ('evolutionarily', 0.034), ('fastdnaml', 0.034), ('fulton', 0.034), ('glyc', 0.034), ('leslie', 0.034), ('noble', 0.034), ('ovcharenko', 0.034), ('pmarkov', 0.034), ('promoters', 0.034), ('prot', 0.034), ('ribo', 0.034), ('saccharomyces', 0.034), ('splic', 0.034), ('ve', 0.034), ('switches', 0.034), ('prior', 0.034), ('proteins', 0.033), ('classes', 0.032), ('bioinformatics', 0.031), ('ad', 0.031), ('aligned', 0.031), ('intersection', 0.03), ('splicing', 0.03), ('letter', 0.029), ('tree', 0.029), ('expression', 0.029), ('collection', 0.028), ('occurrences', 0.028), ('site', 0.027), ('class', 0.027), ('organisms', 0.027), ('mutations', 0.027), ('alignments', 0.027), ('proximal', 0.027), ('recognizes', 0.027), ('incorporate', 0.027), ('probabilistic', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 103 nips-2005-Kernels for gene regulatory regions
2 0.10189565 175 nips-2005-Sequence and Tree Kernels with Statistical Feature Mining
Author: Jun Suzuki, Hideki Isozaki
Abstract: This paper proposes a new approach to feature selection based on a statistical feature mining technique for sequence and tree kernels. Since natural language data take discrete structures, convolution kernels, such as sequence and tree kernels, are advantageous for both the concept and accuracy of many natural language processing tasks. However, experiments have shown that the best results can only be achieved when limited small sub-structures are dealt with by these kernels. This paper discusses this issue of convolution kernels and then proposes a statistical feature selection that enable us to use larger sub-structures effectively. The proposed method, in order to execute efficiently, can be embedded into an original kernel calculation process by using sub-structure mining algorithms. Experiments on real NLP tasks confirm the problem in the conventional method and compare the performance of a conventional method to that of the proposed method.
3 0.086274467 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
Author: Tong Zhang, Rie Kubota Ando
Abstract: We consider a framework for semi-supervised learning using spectral decomposition based un-supervised kernel design. This approach subsumes a class of previously proposed semi-supervised learning methods on data graphs. We examine various theoretical properties of such methods. In particular, we derive a generalization performance bound, and obtain the optimal kernel design by minimizing the bound. Based on the theoretical analysis, we are able to demonstrate why spectral kernel design based methods can often improve the predictive performance. Experiments are used to illustrate the main consequences of our analysis.
4 0.086006165 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning
Author: Andreas Argyriou, Mark Herbster, Massimiliano Pontil
Abstract: A foundational problem in semi-supervised learning is the construction of a graph underlying the data. We propose to use a method which optimally combines a number of differently constructed graphs. For each of these graphs we associate a basic graph kernel. We then compute an optimal combined kernel. This kernel solves an extended regularization problem which requires a joint minimization over both the data and the set of graph kernels. We present encouraging results on different OCR tasks where the optimal combined kernel is computed from graphs constructed with a variety of distances functions and the ‘k’ in nearest neighbors. 1
5 0.070006475 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables
Author: Y. Altun, D. McAllester, M. Belkin
Abstract: Many real-world classification problems involve the prediction of multiple inter-dependent variables forming some structural dependency. Recent progress in machine learning has mainly focused on supervised classification of such structured variables. In this paper, we investigate structured classification in a semi-supervised setting. We present a discriminative approach that utilizes the intrinsic geometry of input patterns revealed by unlabeled data points and we derive a maximum-margin formulation of semi-supervised learning for structured variables. Unlike transductive algorithms, our formulation naturally extends to new test points. 1
6 0.069534734 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm
7 0.068696588 51 nips-2005-Correcting sample selection bias in maximum entropy density estimation
8 0.059942778 185 nips-2005-Subsequence Kernels for Relation Extraction
9 0.058949463 14 nips-2005-A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification
10 0.054875996 47 nips-2005-Consistency of one-class SVM and related algorithms
11 0.053867333 78 nips-2005-From Weighted Classification to Policy Search
12 0.053631961 127 nips-2005-Mixture Modeling by Affinity Propagation
13 0.050429732 105 nips-2005-Large-Scale Multiclass Transduction
14 0.048716489 148 nips-2005-Online Discovery and Learning of Predictive State Representations
15 0.047850739 77 nips-2005-From Lasso regression to Feature vector machine
16 0.046527725 198 nips-2005-Using ``epitomes'' to model genetic diversity: Rational design of HIV vaccine cocktails
17 0.046309415 151 nips-2005-Pattern Recognition from One Example by Chopping
18 0.045932185 184 nips-2005-Structured Prediction via the Extragradient Method
19 0.045857146 79 nips-2005-Fusion of Similarity Data in Clustering
20 0.045766361 196 nips-2005-Two view learning: SVM-2K, Theory and Practice
topicId topicWeight
[(0, 0.152), (1, 0.043), (2, -0.023), (3, -0.005), (4, -0.002), (5, 0.061), (6, 0.102), (7, 0.097), (8, 0.024), (9, 0.054), (10, -0.008), (11, -0.065), (12, -0.02), (13, -0.032), (14, -0.05), (15, 0.028), (16, 0.044), (17, -0.009), (18, 0.011), (19, 0.008), (20, -0.019), (21, -0.1), (22, 0.031), (23, -0.121), (24, 0.042), (25, -0.014), (26, -0.0), (27, -0.155), (28, 0.007), (29, 0.118), (30, -0.027), (31, -0.073), (32, -0.095), (33, 0.034), (34, 0.017), (35, -0.005), (36, 0.074), (37, -0.126), (38, -0.071), (39, 0.044), (40, -0.13), (41, -0.067), (42, 0.003), (43, -0.069), (44, 0.156), (45, -0.106), (46, 0.074), (47, -0.158), (48, -0.018), (49, 0.035)]
simIndex simValue paperId paperTitle
same-paper 1 0.95325726 103 nips-2005-Kernels for gene regulatory regions
2 0.71349245 175 nips-2005-Sequence and Tree Kernels with Statistical Feature Mining
Author: Jun Suzuki, Hideki Isozaki
Abstract: This paper proposes a new approach to feature selection based on a statistical feature mining technique for sequence and tree kernels. Since natural language data take discrete structures, convolution kernels, such as sequence and tree kernels, are advantageous for both the concept and accuracy of many natural language processing tasks. However, experiments have shown that the best results can only be achieved when limited small sub-structures are dealt with by these kernels. This paper discusses this issue of convolution kernels and then proposes a statistical feature selection that enable us to use larger sub-structures effectively. The proposed method, in order to execute efficiently, can be embedded into an original kernel calculation process by using sub-structure mining algorithms. Experiments on real NLP tasks confirm the problem in the conventional method and compare the performance of a conventional method to that of the proposed method.
3 0.51873082 185 nips-2005-Subsequence Kernels for Relation Extraction
Author: Raymond J. Mooney, Razvan C. Bunescu
Abstract: We present a new kernel method for extracting semantic relations between entities in natural language text, based on a generalization of subsequence kernels. This kernel uses three types of subsequence patterns that are typically employed in natural language to assert relationships between two entities. Experiments on extracting protein interactions from biomedical corpora and top-level relations from newspaper corpora demonstrate the advantages of this approach. 1
4 0.48660704 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
Author: Tong Zhang, Rie Kubota Ando
Abstract: We consider a framework for semi-supervised learning using spectral decomposition based un-supervised kernel design. This approach subsumes a class of previously proposed semi-supervised learning methods on data graphs. We examine various theoretical properties of such methods. In particular, we derive a generalization performance bound, and obtain the optimal kernel design by minimizing the bound. Based on the theoretical analysis, we are able to demonstrate why spectral kernel design based methods can often improve the predictive performance. Experiments are used to illustrate the main consequences of our analysis.
5 0.47149366 198 nips-2005-Using ``epitomes'' to model genetic diversity: Rational design of HIV vaccine cocktails
Author: Nebojsa Jojic, Vladimir Jojic, Christopher Meek, David Heckerman, Brendan J. Frey
Abstract: We introduce a new model of genetic diversity which summarizes a large input dataset into an epitome, a short sequence or a small set of short sequences of probability distributions capturing many overlapping subsequences from the dataset. The epitome as a representation has already been used in modeling real-valued signals, such as images and audio. The discrete sequence model we introduce in this paper targets applications in genetics, from multiple alignment to recombination and mutation inference. In our experiments, we concentrate on modeling the diversity of HIV where the epitome emerges as a natural model for producing relatively small vaccines covering a large number of immune system targets known as epitopes. Our experiments show that the epitome includes more epitopes than other vaccine designs of similar length, including cocktails of consensus strains, phylogenetic tree centers, and observed strains. We also discuss epitome designs that take into account uncertainty about Tcell cross reactivity and epitope presentation. In our experiments, we find that vaccine optimization is fairly robust to these uncertainties. 1
6 0.45892316 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables
7 0.45737824 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm
8 0.45737287 4 nips-2005-A Bayesian Spatial Scan Statistic
9 0.4167473 171 nips-2005-Searching for Character Models
10 0.40948302 44 nips-2005-Computing the Solution Path for the Regularized Support Vector Regression
11 0.37521023 200 nips-2005-Variable KD-Tree Algorithms for Spatial Pattern Search and Discovery
12 0.37040886 45 nips-2005-Conditional Visual Tracking in Kernel Space
13 0.36754084 18 nips-2005-Active Learning For Identifying Function Threshold Boundaries
14 0.34232309 127 nips-2005-Mixture Modeling by Affinity Propagation
15 0.33905277 77 nips-2005-From Lasso regression to Feature vector machine
16 0.33795312 51 nips-2005-Correcting sample selection bias in maximum entropy density estimation
17 0.32804 105 nips-2005-Large-Scale Multiclass Transduction
18 0.31898805 130 nips-2005-Modeling Neuronal Interactivity using Dynamic Bayesian Networks
19 0.31771594 37 nips-2005-Benchmarking Non-Parametric Statistical Tests
20 0.31644768 102 nips-2005-Kernelized Infomax Clustering
topicId topicWeight
[(3, 0.03), (10, 0.017), (27, 0.025), (31, 0.033), (34, 0.1), (41, 0.457), (55, 0.024), (65, 0.014), (69, 0.056), (73, 0.023), (88, 0.063), (91, 0.037)]
simIndex simValue paperId paperTitle
1 0.8769874 107 nips-2005-Large scale networks fingerprinting and visualization using the k-core decomposition
Author: J. I. Alvarez-hamelin, Luca Dall'asta, Alain Barrat, Alessandro Vespignani
Abstract: We use the k-core decomposition to develop algorithms for the analysis of large scale complex networks. This decomposition, based on a recursive pruning of the least connected vertices, allows to disentangle the hierarchical structure of networks by progressively focusing on their central cores. By using this strategy we develop a general visualization algorithm that can be used to compare the structural properties of various networks and highlight their hierarchical structure. The low computational complexity of the algorithm, O(n + e), where n is the size of the network, and e is the number of edges, makes it suitable for the visualization of very large sparse networks. We show how the proposed visualization tool allows to find specific structural fingerprints of networks. 1
same-paper 2 0.85504675 103 nips-2005-Kernels for gene regulatory regions
3 0.72248816 79 nips-2005-Fusion of Similarity Data in Clustering
Author: Tilman Lange, Joachim M. Buhmann
Abstract: Fusing multiple information sources can yield significant benefits to successfully accomplish learning tasks. Many studies have focussed on fusing information in supervised learning contexts. We present an approach to utilize multiple information sources in the form of similarity data for unsupervised learning. Based on similarity information, the clustering task is phrased as a non-negative matrix factorization problem of a mixture of similarity measurements. The tradeoff between the informativeness of data sources and the sparseness of their mixture is controlled by an entropy-based weighting mechanism. For the purpose of model selection, a stability-based approach is employed to ensure the selection of the most self-consistent hypothesis. The experiments demonstrate the performance of the method on toy as well as real world data sets. 1
4 0.38923791 178 nips-2005-Soft Clustering on Graphs
Author: Kai Yu, Shipeng Yu, Volker Tresp
Abstract: We propose a simple clustering framework on graphs encoding pairwise data similarities. Unlike usual similarity-based methods, the approach softly assigns data to clusters in a probabilistic way. More importantly, a hierarchical clustering is naturally derived in this framework to gradually merge lower-level clusters into higher-level ones. A random walk analysis indicates that the algorithm exposes clustering structures in various resolutions, i.e., a higher level statistically models a longer-term diffusion on graphs and thus discovers a more global clustering structure. Finally we provide very encouraging experimental results. 1
5 0.36621591 170 nips-2005-Scaling Laws in Natural Scenes and the Inference of 3D Shape
Author: Tai-sing Lee, Brian R. Potetz
Abstract: This paper explores the statistical relationship between natural images and their underlying range (depth) images. We look at how this relationship changes over scale, and how this information can be used to enhance low resolution range data using a full resolution intensity image. Based on our findings, we propose an extension to an existing technique known as shape recipes [3], and the success of the two methods are compared using images and laser scans of real scenes. Our extension is shown to provide a two-fold improvement over the current method. Furthermore, we demonstrate that ideal linear shape-from-shading filters, when learned from natural scenes, may derive even more strength from shadow cues than from the traditional linear-Lambertian shading cues. 1
6 0.35581437 198 nips-2005-Using ``epitomes'' to model genetic diversity: Rational design of HIV vaccine cocktails
7 0.35397291 114 nips-2005-Learning Rankings via Convex Hull Separation
8 0.34220368 105 nips-2005-Large-Scale Multiclass Transduction
9 0.34220064 86 nips-2005-Generalized Nonnegative Matrix Approximations with Bregman Divergences
10 0.34007159 169 nips-2005-Saliency Based on Information Maximization
11 0.33712313 184 nips-2005-Structured Prediction via the Extragradient Method
12 0.33421594 46 nips-2005-Consensus Propagation
13 0.3327406 23 nips-2005-An Application of Markov Random Fields to Range Sensing
14 0.32940096 127 nips-2005-Mixture Modeling by Affinity Propagation
15 0.32773933 60 nips-2005-Dynamic Social Network Analysis using Latent Space Models
16 0.32753637 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations
17 0.32526791 171 nips-2005-Searching for Character Models
18 0.32487091 175 nips-2005-Sequence and Tree Kernels with Statistical Feature Mining
19 0.32364616 133 nips-2005-Nested sampling for Potts models
20 0.32254818 51 nips-2005-Correcting sample selection bias in maximum entropy density estimation