nips nips2008 nips2008-183 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Paolo Frasconi, Andrea Passerini
Abstract: Metal binding is important for the structural and functional characterization of proteins. Previous prediction efforts have only focused on bonding state, i.e. deciding which protein residues act as metal ligands in some binding site. Identifying the geometry of metal-binding sites, i.e. deciding which residues are jointly involved in the coordination of a metal ion is a new prediction problem that has been never attempted before from protein sequence alone. In this paper, we formulate it in the framework of learning with structured outputs. Our solution relies on the fact that, from a graph theoretical perspective, metal binding has the algebraic properties of a matroid, enabling the application of greedy algorithms for learning structured outputs. On a data set of 199 non-redundant metalloproteins, we obtained precision/recall levels of 75%/46% correct ligand-ion assignments, which improves to 88%/88% in the setting where the metal binding state is known. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Metal binding is important for the structural and functional characterization of proteins. [sent-6, score-0.249]
2 Previous prediction efforts have only focused on bonding state, i.e. [sent-7, score-0.238]
3 deciding which protein residues act as metal ligands in some binding site. [sent-9, score-1.288]
4 Identifying the geometry of metal-binding sites, i.e. deciding which residues are jointly involved in the coordination of a metal ion, is a new prediction problem that has never been attempted before from protein sequence alone. [sent-12, score-1.179]
5 In this paper, we formulate it in the framework of learning with structured outputs. [sent-13, score-0.126]
6 Our solution relies on the fact that, from a graph theoretical perspective, metal binding has the algebraic properties of a matroid, enabling the application of greedy algorithms for learning structured outputs. [sent-14, score-0.957]
7 On a data set of 199 non-redundant metalloproteins, we obtained precision/recall levels of 75%/46% correct ligand-ion assignments, which improves to 88%/88% in the setting where the metal binding state is known. [sent-15, score-0.757]
8 1 Introduction Metal ions play important roles in protein function and structure and metalloproteins are involved in a number of diseases for which medicine is still seeking effective treatment, including cancer, Parkinson, dementia, and AIDS [10]. [sent-16, score-0.421]
9 A metal binding site typically consists of an ion bound to one or more protein residues (called ligands). [sent-17, score-1.307]
10 In some cases, the ion is embedded in a prosthetic group (e.g. a heme group). [sent-18, score-0.216]
11 Among the 20 amino acids, the four most common ligands are cysteine (C), histidine (H), aspartic acid (D), and glutamic acid (E). [sent-21, score-0.455]
12 Highly conserved residues are more likely to be involved in the coordination of a metal ion, although in the case of cysteines, conservation is also often associated with the presence of a disulfide bridge (a covalent bond between the sulfur atoms of two cysteines) [8]. [sent-22, score-0.739]
13 Predicting metal binding from sequence alone can be very useful in genomic annotation, for characterizing the function and structure of proteins that have not been structurally determined, but also during the experimental determination of new metalloproteins. [sent-23, score-0.761]
14 Current high-throughput experimental technologies only annotate whole proteins as metal binding [13], but cannot determine the involved ligands. [sent-24, score-0.859]
15 Most of the research for understanding metal binding has focused on finding sequence patterns that characterize binding sites [8]. [sent-25, score-1.129]
16 The easiest task to formulate in this context is bonding state prediction, which is a binary classification problem: either a residue is involved in the coordination of a metal ion or it is free (in the case of cysteines, a third class can also be introduced for disulfide bridges). [sent-27, score-1.029]
17 This prediction task has been addressed in a number of recent works in the case of cysteines only [6], in the case of transition metals (for C and H residues) [12], and in the special but important case of zinc proteins (for C, H, D, and E residues) [11, 14]. [sent-28, score-0.347]
18 However, classification of individual residues does not provide sufficient information about a binding site. [sent-29, score-0.457]
19 Many proteins bind to several ions in their holo form, and a complete characterization requires us to identify the site geometry, i.e. which residues are jointly involved in the coordination of each ion. [sent-30, score-0.361]
20 This problem has only been studied assuming knowledge of the protein 3D structure (e.g. [sent-33, score-0.187]
21 [5, 1]), limiting its applicability to structurally determined proteins or their close homologs; it has not been addressed from sequence alone. [sent-35, score-0.193]
22 Abstracting away the biology, this is a structured output prediction problem where the input consists of a string of protein residues and the output is a labeling of each residue with the corresponding ion identifier (specific details are given in the next section). [sent-36, score-0.987]
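To make this input/output representation concrete, here is a tiny Python illustration (the residue string and ion labels are invented for illustration, not taken from the paper):

    # Toy example: the input is the string of candidate ligands extracted from
    # the protein sequence; the output labels each position with an ion
    # identifier, with "nil" marking residues not bound to any ion.
    x = list("CHDECHDC")                                       # candidate ligands
    y = ["i1", "nil", "i1", "nil", "i1", "nil", "nil", "i1"]   # ion assignment per position

    assert len(x) == len(y)
    site_1 = [pos for pos, ion in enumerate(y) if ion == "i1"]
    print("residues coordinating ion i1:", site_1)             # -> [0, 2, 4, 7]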
23 The supervised learning problem with structured outputs has recently received a considerable amount of attention (see [2] for an overview). [sent-37, score-0.159]
24 Different structured output learners deal with this issue by exploiting specific domain properties for the application at hand. [sent-40, score-0.186]
25 Another solution is to construct the structured output in a suitable Hilbert space of features and seek the corresponding pre-image for obtaining the desired discrete structure [17]. [sent-47, score-0.222]
26 We borrow ideas from [15] and [4] but specifically take advantage of the fact that, from a graph theoretical perspective, the metal binding problem has the algebraic structure of a matroid, enabling the application of greedy algorithms. [sent-50, score-0.867]
27 2 A formalization of the metal binding sites prediction problem A protein sequence s is a string in the alphabet of the 20 amino acids. [sent-51, score-1.235]
28 Since only some of the 20 amino acids that exist in nature can act as ligands, we begin by extracting from s the subsequence x obtained by deleting characters corresponding to amino acids that never (or very rarely) act as ligands. [sent-52, score-0.422]
29 By using T = {C, H, D, E} as the set of candidate ligands, we cover 92% of the ligands of structurally known proteins. [sent-53, score-0.292]
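A minimal sketch of this preprocessing step (the function name and example sequence are ours, not the paper's):

    CANDIDATE_LIGANDS = set("CHDE")   # T = {C, H, D, E}

    def extract_candidates(sequence: str) -> str:
        """Return the subsequence x of residues that can act as metal ligands."""
        return "".join(aa for aa in sequence if aa in CANDIDATE_LIGANDS)

    print(extract_candidates("MKCAHCDDLE"))   # -> "CHCDDE"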
30 We also introduce the set I of symbols associated with metal ion identifiers. [sent-57, score-0.665]
31 The goal is to predict the coordination relation between amino acids in x and metal ion identifiers in I. [sent-59, score-0.85]
32 Ideally, it would also be interesting to predict the chemical element of the bound metal ion. [sent-61, score-0.479]
33 Hence, ion identifiers will have no chemical element attribute attached. [sent-63, score-0.246]
34 In practice, we fix a maximum number m of possible ions (m = 4 in the subsequent experiments, covering 93% of structurally known proteins) and let I = {nil, ι1, . . . , ιm}. [sent-64, score-0.191]
35 The number of admissible binding geometries for a given protein chain having n candidate ligands is n! / (k1! · · · km! (n − k1 − · · · − km)!),
36 where m is the number of ions and ki is the number of ligands for ion ιi. [sent-73, score-0.581]
37 In practice, each ion is coordinated by a variable number of ligands (typically ranging from 1 to 4, but occasionally more), and each protein chain binds a variable number of ions (typically ranging from 1 to 4). [sent-74, score-0.76]
38 The number of candidate ligands n grows linearly with the protein chain. [sent-75, score-0.412]
39 For example, in the case of PDB chain 1H0Hb (see Figure 1), there are n = 52 candidate ligands and m = 3 ions coordinated by 4 residues each, yielding a set of 7 · 10^15 admissible conformations. [sent-76, score-0.657]
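This count can be reproduced from the multinomial coefficient above; the short check below (ours, not the paper's code) treats ion slots as distinct and leaves the remaining residues free:

    from math import comb

    def admissible_geometries(n, ks):
        """Count the ways of choosing k_i of the n candidate ligands for each ion i;
        the n - sum(ks) residues left over are free (assigned to nil)."""
        total, remaining = 1, n
        for k in ks:
            total *= comb(remaining, k)
            remaining -= k
        return total

    print(admissible_geometries(52, [4, 4, 4]))   # about 7.15e15, matching 7 * 10^15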
40 In this view, the string x should be regarded as a set of vertices labeled with the corresponding amino acid in T . [sent-78, score-0.218]
41 Let x and I be two sets of vertices (associated with candidate ligands and metal ion identifiers, respectively). [sent-82, score-0.952]
42 We say that a bipartite edge set y ⊂ x × I satisfies the metal binding geometry (MBG) property if the degree of each vertex in x in the graph (x ∪ I, y) is at most 1. [sent-83, score-0.798]
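A small sketch of this degree check (the edge representation and names are ours):

    from collections import Counter

    def satisfies_mbg(edges):
        """edges: iterable of (residue_index, ion_id) pairs.
        The MBG property holds iff every residue vertex has degree at most 1."""
        degree = Counter(residue for residue, _ in edges)
        return all(d <= 1 for d in degree.values())

    print(satisfies_mbg([(0, "i1"), (2, "i1"), (4, "i2")]))   # True
    print(satisfies_mbg([(0, "i1"), (0, "i2")]))              # False: residue 0 has degree 2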
43 Figure 1: Metal binding structure of PDB entry 1H0Hb (the candidate ligand sequence, with each residue assigned to one of the ions ι1, ι2, ι3 or to nil). [sent-87, score-0.375]
44 For readability, only a few connections from free residues to the nil symbol are shown. [sent-88, score-0.298]
45 Note that the MBG problem is not a matching problem (such as those studied in [15]) since more than one edge can be incident to vertices belonging to I. [sent-89, score-0.101]
46 As discussed above, we are not interested in distinguishing metal ions based on the element type. [sent-90, score-0.609]
47 Hence, any two label-isomorphic bipartite graphs (obtained by exchanging two non-nil metal ion vertices) should be regarded as equivalent. [sent-91, score-0.706]
48 In this view, the binding geometry consists of a very shallow “parse tree” for string x, as exemplified in Figure 1. [sent-96, score-0.364]
49 A difficulty that is immediately apparent is that the underlying grammar needs to be context-sensitive in order to capture the crossing dependencies between bound amino acids. [sent-97, score-0.102]
50 In real data, when representing metal bonding state in this way, crossing edges are very common. [sent-98, score-0.697]
51 This view highlights a difficulty that would be encountered by attempting to solve the structured output problem with a generative model as in [16]. [sent-99, score-0.213]
52 Weighted matroids can be seen as a kind of discrete counterpart of concave functions: thanks to the above theorem, if M is a weighted matroid, then the following greedy algorithm is guaranteed to find the optimal structure, i.e. a basis of maximum total weight. [sent-123, score-0.125]
53 If F is a consistent objective function then, for each matroid on S, all greedy bases are optimal. [sent-136, score-0.329]
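For reference, the classical greedy scheme behind this result can be sketched as follows (the independence oracle and weights are placeholders; this is not yet the MBG-specific construction, which is given below):

    def greedy_basis(elements, weight, is_independent):
        """Generic matroid greedy: scan elements by decreasing weight and keep
        each element whose addition preserves independence."""
        basis = []
        for e in sorted(elements, key=weight, reverse=True):
            if is_independent(basis + [e]):
                basis.append(e)
        return basis

    # Example on a uniform matroid of rank 2 (any set of at most 2 elements is independent).
    print(greedy_basis([1, 2, 3, 4], weight=lambda e: e, is_independent=lambda s: len(s) <= 2))
    # -> [4, 3]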
54 The condition of Theorem 3 is also necessary for a slightly more general class of algebraic structures that include matroids, called matroid embeddings [7]. [sent-138, score-0.26]
55 We now show that the MBG problem is a suitable candidate for a greedy algorithmic solution. [sent-139, score-0.149]
56 We can finally formulate the greedy algorithm for constructing the structured output in the MBG problem. [sent-149, score-0.279]
57 Given the input x, we begin by forming the associated MBG matroid Mx and a corresponding objective function Fx : Yx → R+ (in the next section we will show how to learn the objective function from data). [sent-150, score-0.28]
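A possible sketch of this greedy construction (function and variable names are ours; the scoring function is assumed to be given, e.g. learned as in the next section):

    def greedy_mbg(num_candidates, ions, score):
        """Grow the edge set y one edge at a time, always adding the highest-scoring
        extension, until every candidate residue is assigned to an ion or to nil."""
        y = []
        unassigned = set(range(num_candidates))
        while unassigned:
            extensions = [(r, i) for r in unassigned for i in ions]
            best = max(extensions, key=lambda e: score(y + [e]))
            y.append(best)
            unassigned.discard(best[0])
        return y

    # Toy run with a trivial score that simply prefers nil assignments.
    print(greedy_mbg(3, ["nil", "i1"], score=lambda y: sum(1 for _, i in y if i == "nil")))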
58 4 Learning the greedy objective function A data set for the MBG problem consists of pairs D = {(xi, yi)}, where xi is a string in T* and yi is a bipartite graph. [sent-158, score-0.393]
59 For any input string x and (partial) output structure y ∈ Y, let Fx(y) = w^T φx(y), where w is a weight vector and φx(y) is a feature vector for (x, y). [sent-161, score-0.152]
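In code, such a linear objective is simply a dot product between a weight vector and a feature map; the two-dimensional feature map below is a hypothetical stand-in, not the paper's feature representation.

    import numpy as np

    def make_linear_objective(w, feature_map):
        """Return F with F(x, y) = w^T phi_x(y) for the given feature map."""
        def F(x, y):
            return float(np.dot(w, feature_map(x, y)))
        return F

    # Hypothetical features: total number of edges and number of nil edges.
    phi = lambda x, y: np.array([len(y), sum(1 for _, ion in y if ion == "nil")], dtype=float)
    F = make_linear_objective(np.array([1.0, -0.5]), phi)
    print(F("CHDE", [(0, "i1"), (1, "nil")]))   # -> 1.5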
60 Fxi(y'' ∪ {e}) − Fxi(y'' ∪ {e'}) ≥ 1, for i = 1, . . . , |D|, ∀y' ⊂ yi, ∀e ∈ ext(y') ∩ yi, ∀e' ∈ ext(y') \ yi, ∀y'' : y' ⊂ y'' ⊂ Sx. [sent-165, score-0.189]
61 These constraints state that “correct” extensions (i.e. edges that actually belong to the target output structure yi) receive a higher weight than “wrong” extensions (i.e. edges that do not belong to yi). [sent-169, score-0.189]
62 For this purpose, we will use an online active learner that samples constraints chosen by the execution of the greedy construction algorithm. [sent-180, score-0.16]
63 For each epoch, the algorithm maintains the current highest-scoring correct partial output ŷi ⊆ yi for each example, initialized with the empty MBG structure, where the score is computed by the current objective function F. [sent-181, score-0.263]
64 It also performs a predefined number L of lookaheads by picking a random superset of ŷi which is included in the target yi, evaluating it and updating the best MBG structure if needed, and adding a corresponding consistency constraint (see Eq. [sent-186, score-0.099]
65 for l = 1, . . . , L do: randomly choose y' : ŷi ⊂ y' ⊂ yi and e, e' ∈ Sx \ y'; FORCE-CONSTRAINT(Fxi(y' ∪ {e}) − Fxi(y' ∪ {e'}) ≥ 1); if F(ŷi) < F(y' ∪ {e}) then ŷi ← y' ∪ {e} [sent-197, score-0.126]
66 Possible candidates include perceptron-like or ALMA-like update rules like those proposed in [4] for structured output learning (in our case the update would depend on the difference between feature vectors of correctly and incorrectly extended structures in the inner loop of GREEDYEPOCH). [sent-198, score-0.241]
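A minimal perceptron-style sketch of such an update (our simplification, not the ALMA variant of [4]): whenever a sampled constraint F(y' ∪ {e}) − F(y' ∪ {e'}) ≥ 1 is violated, the weight vector is moved toward the feature vector of the correct extension and away from that of the wrong one.

    import numpy as np

    def force_constraint(w, phi_correct, phi_wrong, margin=1.0, lr=0.1):
        """Perceptron-like update enforcing w . (phi_correct - phi_wrong) >= margin."""
        if np.dot(w, phi_correct - phi_wrong) < margin:
            w = w + lr * (phi_correct - phi_wrong)
        return w

    w = np.zeros(2)
    w = force_constraint(w, np.array([1.0, 0.0]), np.array([0.0, 1.0]))
    print(w)   # -> [ 0.1 -0.1]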
67 We will therefore rewrite the objective function F using a kernel k(z, z') = ⟨φx(y), φx'(y')⟩ between two structured instances z = (x, y) and z' = (x', y'), so that Fx(y) = F(z) = Σi αi k(z, zi). [sent-204, score-0.17]
68 Let σi (z) denote the set of edges incident on ion ιi ∈ I \ nil and n(z) the number of non-nil ion identifiers that have at least one incident edge. [sent-205, score-0.642]
69 kmbs measures the similarity between individual sites (two sites are orthogonal if they have a different number of ligands, a choice that is supported by protein functional considerations). [sent-208, score-0.509]
70 kglob ensures that two structures are orthogonal unless they have the same number of sites, and down-weights their similarity when their numbers of candidate ligands differ. [sent-209, score-0.492]
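The exact definitions of kmbs and kglob are not reproduced here; the sketch below only mimics the two stated properties (orthogonality of sites with different ligand counts, orthogonality of structures with different numbers of sites, and down-weighting by the mismatch in candidate-ligand counts), using a placeholder residue-level kernel k_res.

    def k_site(site_a, site_b, k_res):
        """Site-level similarity: zero if the sites have different numbers of
        ligands, otherwise an aggregate of residue-level similarities."""
        if len(site_a) != len(site_b):
            return 0.0
        return sum(k_res(a, b) for a in site_a for b in site_b)

    def k_glob(sites_a, sites_b, n_a, n_b, k_res):
        """Global similarity: zero unless both structures have the same number of
        sites; down-weighted when the candidate-ligand counts n_a, n_b differ."""
        if len(sites_a) != len(sites_b):
            return 0.0
        weight = min(n_a, n_b) / max(n_a, n_b)
        return weight * sum(k_site(sa, sb, k_res) for sa in sites_a for sb in sites_b)

    # Example with a trivial residue kernel that just checks amino-acid identity.
    k_res = lambda a, b: 1.0 if a == b else 0.0
    print(k_glob([["C", "C", "H"]], [["C", "H", "H"]], n_a=10, n_b=12, k_res=k_res))   # -> 3.33...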
71 5 Experiments We tested the method on a dataset of non-redundant proteins previously used in [12] for metal bonding state prediction (http://www. [sent-210, score-0.848]
72 Proteins that do not bind metal ions (used in [12] as negative examples) are of no interest in the present case and were removed, resulting in a set of 199 metalloproteins binding transition metals. [sent-215, score-0.938]
73 The first 220 attributes consist of multiple alignment profiles in a window of 11 amino acids centered around each candidate ligand (the window was formed from the original protein sequence, not from the substring xi of candidate ligands). [sent-221, score-0.458]
74 Two prediction tasks were considered, from unknown and from known metal bonding state (a similar distinction is also customary for the related task of disulfide bond prediction). [sent-227, score-0.741]
75 In the latter case, the input x only contains actual ligands and no nil symbol is needed. [sent-230, score-0.295]
76 PS and RS are the metal binding site precision and recall, respectively (ratio of correctly predicted sites to the number of predicted/actual sites). [sent-236, score-0.939]
77 Finally, PB and RB are precision and recall for metal bonding state prediction (as in binary classification, with “bonded” as the positive class). [sent-237, score-0.738]
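A sketch of how edge-level measures of this kind could be computed from predicted and true edge sets (names are ours, and the label-isomorphism of ion identifiers discussed in Section 2 is ignored here for simplicity; the site-level PS/RS would compare whole sites instead):

    def edge_precision_recall(predicted, target):
        """Fraction of predicted (residue, ion) edges that are correct, and
        fraction of target edges recovered; only non-nil edges are counted."""
        pred = {e for e in predicted if e[1] != "nil"}
        true = {e for e in target if e[1] != "nil"}
        tp = len(pred & true)
        precision = tp / len(pred) if pred else 0.0
        recall = tp / len(true) if true else 0.0
        return precision, recall

    p, r = edge_precision_recall([(0, "i1"), (2, "i1"), (4, "nil")],
                                 [(0, "i1"), (2, "i2"), (4, "i2")])
    print(p, r)   # -> 0.5 0.333...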
78 Table 2 reports the breakdown of these performance measures for proteins binding different numbers of metal ions (for L = 10). [sent-238, score-1.021]
79 Results show that enforcing consistency constraints tends to improve recall, especially for bonding state prediction, i.e. [sent-239, score-0.26]
80 it helps the predictor to assign a residue to a metal ion identifier rather than to nil. [sent-241, score-0.729]
81 Correct prediction of whole sites is very challenging and correct prediction of whole chains even more difficult (given the enormous number of alternatives to be compared). [sent-243, score-0.336]
82 Correct edge assignment, however, appears satisfactory and reasonably good when the bonding state is given. [sent-246, score-0.218]
83 As in [15], our method is based on a large-margin approach for solving a structured output prediction problem. [sent-258, score-0.232]
84 Disulfide connectivity is a (perfect) matching problem, since each cysteine is either bound to exactly one other cysteine (known bonding state, yielding a perfect matching) or can be bound to another cysteine or be free (unknown bonding state, yielding a non-perfect matching). [sent-260, score-0.654]
85 The MBG problem is not a matching problem but has the structure of a matroid and our formulation allows us to control the number of effectively enforced constraints by taking advantage of a greedy algorithm. [sent-263, score-0.393]
86 LaSO aims to solve a much broader class of structured output problems where good output structures can be generated by AI-style search algorithms such as beam search or A*. [sent-265, score-0.324]
87 The generation of a fresh set of siblings in LaSO when the search is stuck with a frontier of wrong candidates (essentially a backtrack) is costly compared to our greedy selection procedure and (at least in principle) unnecessary when working on matroids. [sent-266, score-0.118]
88 Another general way to deal with the exponential growth of the search space is to introduce a generative model so that arg max_y F(x, y) can be computed efficiently, e.g. by dynamic programming. [sent-267, score-0.097]
89 Indeed, an alternative view of the MBG problem is supervised sequence labeling, where the output string consists of symbols in I. [sent-275, score-0.143]
90 A (higher-order) hidden Markov model or chain-structured conditional random field could be used as the underlying generative model for structured output learning. [sent-276, score-0.213]
91 Unfortunately, these approaches are unlikely to be very accurate, since models that are structured as linear chains of dependencies cannot easily capture long-range interactions such as those occurring in the example. [sent-277, score-0.182]
92 In our preliminary experiments, SVMHMM [16] systematically assigned all bonded residues to the same ion, thus never correctly predicting the geometry except in trivial cases. [sent-278, score-0.326]
93 7 Conclusions We have reported the first successful solution to the challenging problem of predicting protein metal binding geometry from sequence alone. [sent-279, score-0.935]
94 Learning with structured outputs is a fairly difficult task and in spite of the fact that several methodologies have been proposed, no single general approach can effectively solve every possible application problem. [sent-281, score-0.159]
95 The solution proposed in this paper draws on several previous ideas and specifically leverages the existence of a matroid for the metal binding problem. [sent-282, score-0.89]
96 Other problems that formally exhibit a greedy structure might benefit from similar solutions. [sent-283, score-0.129]
97 Prediction of transition metal-binding sites from apo protein structures. [sent-291, score-0.306]
98 Learning as search optimization: Approximate large margin methods for structured prediction. [sent-312, score-0.178]
99 Identifying cysteines and histidines in transition-metal-binding sites using support vector machines and neural networks. [sent-384, score-0.283]
100 Metalloproteomics: high-throughput structural and functional annotation of proteins in structural genomics. [sent-397, score-0.135]
wordName wordTfidf (topN-words)
[('metal', 0.449), ('mbg', 0.304), ('binding', 0.249), ('ion', 0.216), ('residues', 0.208), ('ligands', 0.205), ('bonding', 0.192), ('matroid', 0.192), ('yx', 0.168), ('ions', 0.16), ('sites', 0.155), ('protein', 0.151), ('proteins', 0.135), ('ext', 0.128), ('structured', 0.126), ('fx', 0.119), ('amino', 0.102), ('cysteines', 0.096), ('disul', 0.096), ('greedy', 0.093), ('nil', 0.09), ('reedy', 0.084), ('acids', 0.083), ('cysteine', 0.08), ('ag', 0.076), ('pe', 0.076), ('ps', 0.073), ('residue', 0.064), ('lasvm', 0.064), ('onstraint', 0.064), ('orce', 0.064), ('passerini', 0.064), ('yi', 0.063), ('rs', 0.062), ('output', 0.06), ('geometry', 0.059), ('candidate', 0.056), ('sx', 0.056), ('fxi', 0.056), ('coordination', 0.056), ('string', 0.056), ('chains', 0.056), ('mx', 0.051), ('helman', 0.048), ('kglob', 0.048), ('kmbs', 0.048), ('laso', 0.048), ('metalloproteins', 0.048), ('onstruct', 0.048), ('poch', 0.048), ('prediction', 0.046), ('epoch', 0.045), ('incident', 0.045), ('maxy', 0.045), ('objective', 0.044), ('constraints', 0.042), ('lexicographically', 0.042), ('zinc', 0.042), ('bipartite', 0.041), ('algebraic', 0.04), ('structure', 0.036), ('site', 0.034), ('acid', 0.034), ('correct', 0.033), ('outputs', 0.033), ('xi', 0.033), ('bind', 0.032), ('bonded', 0.032), ('coordinations', 0.032), ('dianna', 0.032), ('firenze', 0.032), ('histidines', 0.032), ('kres', 0.032), ('matroids', 0.032), ('structurally', 0.031), ('matching', 0.03), ('edges', 0.03), ('chemical', 0.03), ('bs', 0.03), ('structures', 0.028), ('breakdown', 0.028), ('coordinated', 0.028), ('degli', 0.028), ('studi', 0.028), ('bonds', 0.028), ('metals', 0.028), ('pdb', 0.028), ('identi', 0.028), ('sequence', 0.027), ('generative', 0.027), ('margin', 0.027), ('correctly', 0.027), ('state', 0.026), ('vertices', 0.026), ('involved', 0.026), ('act', 0.026), ('search', 0.025), ('precision', 0.025), ('learner', 0.025), ('rb', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999997 183 nips-2008-Predicting the Geometry of Metal Binding Sites from Protein Sequence
Author: Paolo Frasconi, Andrea Passerini
Abstract: Metal binding is important for the structural and functional characterization of proteins. Previous prediction efforts have only focused on bonding state, i.e. deciding which protein residues act as metal ligands in some binding site. Identifying the geometry of metal-binding sites, i.e. deciding which residues are jointly involved in the coordination of a metal ion is a new prediction problem that has been never attempted before from protein sequence alone. In this paper, we formulate it in the framework of learning with structured outputs. Our solution relies on the fact that, from a graph theoretical perspective, metal binding has the algebraic properties of a matroid, enabling the application of greedy algorithms for learning structured outputs. On a data set of 199 non-redundant metalloproteins, we obtained precision/recall levels of 75%/46% correct ligand-ion assignments, which improves to 88%/88% in the setting where the metal binding state is known. 1
2 0.10780049 108 nips-2008-Integrating Locally Learned Causal Structures with Overlapping Variables
Author: David Danks, Clark Glymour, Robert E. Tillman
Abstract: In many domains, data are distributed among datasets that share only some variables; other recorded variables may occur in only one dataset. While there are asymptotically correct, informative algorithms for discovering causal relationships from a single dataset, even with missing values and hidden variables, there have been no such reliable procedures for distributed data with overlapping variables. We present a novel, asymptotically correct procedure that discovers a minimal equivalence class of causal DAG structures using local independence information from distributed data of this form and evaluate its performance using synthetic and real-world data against causal discovery algorithms for single datasets and applying Structural EM, a heuristic DAG structure learning procedure for data with missing values, to the concatenated data.
3 0.10004342 239 nips-2008-Tighter Bounds for Structured Estimation
Author: Olivier Chapelle, Chuong B. Do, Choon H. Teo, Quoc V. Le, Alex J. Smola
Abstract: Large-margin structured estimation methods minimize a convex upper bound of loss functions. While they allow for efficient optimization algorithms, these convex formulations are not tight and sacrifice the ability to accurately model the true loss. We present tighter non-convex bounds based on generalizing the notion of a ramp loss from binary classification to structured estimation. We show that a small modification of existing optimization algorithms suffices to solve this modified problem. On structured prediction tasks such as protein sequence alignment and web page ranking, our algorithm leads to improved accuracy. 1
4 0.099017948 225 nips-2008-Supervised Bipartite Graph Inference
Author: Yoshihiro Yamanishi
Abstract: We formulate the problem of bipartite graph inference as a supervised learning problem, and propose a new method to solve it from the viewpoint of distance metric learning. The method involves the learning of two mappings of the heterogeneous objects to a unified Euclidean space representing the network topology of the bipartite graph, where the graph is easy to infer. The algorithm can be formulated as an optimization problem in a reproducing kernel Hilbert space. We report encouraging results on the problem of compound-protein interaction network reconstruction from chemical structure data and genomic sequence data. 1
5 0.075628988 203 nips-2008-Scalable Algorithms for String Kernels with Inexact Matching
Author: Pavel P. Kuksa, Pai-hsi Huang, Vladimir Pavlovic
Abstract: We present a new family of linear time algorithms for string comparison with mismatches under the string kernels framework. Based on sufficient statistics, our algorithms improve theoretical complexity bounds of existing approaches while scaling well in sequence alphabet size, the number of allowed mismatches and the size of the dataset. In particular, on large alphabets and under loose mismatch constraints our algorithms are several orders of magnitude faster than the existing algorithms for string comparison under the mismatch similarity measure. We evaluate our algorithms on synthetic data and real applications in music genre classification, protein remote homology detection and protein fold prediction. The scalability of the algorithms allows us to consider complex sequence transformations, modeled using longer string features and larger numbers of mismatches, leading to a state-of-the-art performance with significantly reduced running times. 1
6 0.075392343 47 nips-2008-Clustered Multi-Task Learning: A Convex Formulation
7 0.053436615 176 nips-2008-Partially Observed Maximum Entropy Discrimination Markov Networks
8 0.051695284 134 nips-2008-Mixed Membership Stochastic Blockmodels
9 0.051588934 19 nips-2008-An Empirical Analysis of Domain Adaptation Algorithms for Genomic Sequence Analysis
10 0.051526766 49 nips-2008-Clusters and Coarse Partitions in LP Relaxations
11 0.049235884 70 nips-2008-Efficient Inference in Phylogenetic InDel Trees
12 0.046150148 113 nips-2008-Kernelized Sorting
13 0.043760344 78 nips-2008-Exact Convex Confidence-Weighted Learning
14 0.043204878 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models
15 0.0425236 194 nips-2008-Regularized Learning with Networks of Features
16 0.042443786 201 nips-2008-Robust Near-Isometric Matching via Structured Learning of Graphical Models
17 0.040947434 21 nips-2008-An Homotopy Algorithm for the Lasso with Online Observations
18 0.039929725 152 nips-2008-Non-stationary dynamic Bayesian networks
19 0.03891518 196 nips-2008-Relative Margin Machines
20 0.038832609 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE
topicId topicWeight
[(0, -0.144), (1, -0.025), (2, -0.037), (3, 0.023), (4, 0.015), (5, -0.04), (6, 0.036), (7, -0.005), (8, 0.017), (9, -0.049), (10, 0.001), (11, 0.042), (12, -0.012), (13, -0.046), (14, 0.079), (15, 0.024), (16, 0.028), (17, -0.041), (18, -0.04), (19, 0.067), (20, 0.024), (21, 0.033), (22, -0.013), (23, -0.062), (24, 0.061), (25, -0.004), (26, -0.093), (27, 0.01), (28, -0.043), (29, 0.047), (30, -0.036), (31, 0.005), (32, 0.007), (33, -0.048), (34, 0.055), (35, 0.086), (36, -0.046), (37, 0.056), (38, -0.013), (39, 0.073), (40, -0.049), (41, 0.192), (42, 0.071), (43, -0.032), (44, 0.048), (45, 0.115), (46, -0.076), (47, -0.097), (48, 0.019), (49, -0.055)]
simIndex simValue paperId paperTitle
same-paper 1 0.89907193 183 nips-2008-Predicting the Geometry of Metal Binding Sites from Protein Sequence
Author: Paolo Frasconi, Andrea Passerini
Abstract: Metal binding is important for the structural and functional characterization of proteins. Previous prediction efforts have only focused on bonding state, i.e. deciding which protein residues act as metal ligands in some binding site. Identifying the geometry of metal-binding sites, i.e. deciding which residues are jointly involved in the coordination of a metal ion is a new prediction problem that has been never attempted before from protein sequence alone. In this paper, we formulate it in the framework of learning with structured outputs. Our solution relies on the fact that, from a graph theoretical perspective, metal binding has the algebraic properties of a matroid, enabling the application of greedy algorithms for learning structured outputs. On a data set of 199 non-redundant metalloproteins, we obtained precision/recall levels of 75%/46% correct ligand-ion assignments, which improves to 88%/88% in the setting where the metal binding state is known. 1
2 0.61346167 203 nips-2008-Scalable Algorithms for String Kernels with Inexact Matching
Author: Pavel P. Kuksa, Pai-hsi Huang, Vladimir Pavlovic
Abstract: We present a new family of linear time algorithms for string comparison with mismatches under the string kernels framework. Based on sufficient statistics, our algorithms improve theoretical complexity bounds of existing approaches while scaling well in sequence alphabet size, the number of allowed mismatches and the size of the dataset. In particular, on large alphabets and under loose mismatch constraints our algorithms are several orders of magnitude faster than the existing algorithms for string comparison under the mismatch similarity measure. We evaluate our algorithms on synthetic data and real applications in music genre classification, protein remote homology detection and protein fold prediction. The scalability of the algorithms allows us to consider complex sequence transformations, modeled using longer string features and larger numbers of mismatches, leading to a state-of-the-art performance with significantly reduced running times. 1
3 0.57942647 225 nips-2008-Supervised Bipartite Graph Inference
Author: Yoshihiro Yamanishi
Abstract: We formulate the problem of bipartite graph inference as a supervised learning problem, and propose a new method to solve it from the viewpoint of distance metric learning. The method involves the learning of two mappings of the heterogeneous objects to a unified Euclidean space representing the network topology of the bipartite graph, where the graph is easy to infer. The algorithm can be formulated as an optimization problem in a reproducing kernel Hilbert space. We report encouraging results on the problem of compound-protein interaction network reconstruction from chemical structure data and genomic sequence data. 1
4 0.55777472 239 nips-2008-Tighter Bounds for Structured Estimation
Author: Olivier Chapelle, Chuong B. Do, Choon H. Teo, Quoc V. Le, Alex J. Smola
Abstract: Large-margin structured estimation methods minimize a convex upper bound of loss functions. While they allow for efficient optimization algorithms, these convex formulations are not tight and sacrifice the ability to accurately model the true loss. We present tighter non-convex bounds based on generalizing the notion of a ramp loss from binary classification to structured estimation. We show that a small modification of existing optimization algorithms suffices to solve this modified problem. On structured prediction tasks such as protein sequence alignment and web page ranking, our algorithm leads to improved accuracy. 1
5 0.54726493 70 nips-2008-Efficient Inference in Phylogenetic InDel Trees
Author: Alexandre Bouchard-côté, Dan Klein, Michael I. Jordan
Abstract: Accurate and efficient inference in evolutionary trees is a central problem in computational biology. While classical treatments have made unrealistic site independence assumptions, ignoring insertions and deletions, realistic approaches require tracking insertions and deletions along the phylogenetic tree—a challenging and unsolved computational problem. We propose a new ancestry resampling procedure for inference in evolutionary trees. We evaluate our method in two problem domains—multiple sequence alignment and reconstruction of ancestral sequences—and show substantial improvement over the current state of the art. 1
6 0.5045585 108 nips-2008-Integrating Locally Learned Causal Structures with Overlapping Variables
7 0.456002 113 nips-2008-Kernelized Sorting
8 0.42879415 152 nips-2008-Non-stationary dynamic Bayesian networks
9 0.42083308 19 nips-2008-An Empirical Analysis of Domain Adaptation Algorithms for Genomic Sequence Analysis
10 0.39956671 86 nips-2008-Finding Latent Causes in Causal Networks: an Efficient Approach Based on Markov Blankets
11 0.39471546 134 nips-2008-Mixed Membership Stochastic Blockmodels
12 0.38863206 185 nips-2008-Privacy-preserving logistic regression
13 0.3876242 176 nips-2008-Partially Observed Maximum Entropy Discrimination Markov Networks
14 0.37645349 196 nips-2008-Relative Margin Machines
15 0.36788955 11 nips-2008-A spatially varying two-sample recombinant coalescent, with applications to HIV escape response
16 0.36679664 228 nips-2008-Support Vector Machines with a Reject Option
17 0.36543548 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
18 0.34794781 49 nips-2008-Clusters and Coarse Partitions in LP Relaxations
19 0.34516209 104 nips-2008-Improved Moves for Truncated Convex Models
20 0.34291688 163 nips-2008-On the Efficient Minimization of Classification Calibrated Surrogates
topicId topicWeight
[(6, 0.039), (7, 0.042), (12, 0.023), (15, 0.01), (28, 0.116), (57, 0.029), (59, 0.012), (63, 0.018), (64, 0.01), (71, 0.017), (77, 0.024), (83, 0.57)]
simIndex simValue paperId paperTitle
1 0.93705362 6 nips-2008-A ``Shape Aware'' Model for semi-supervised Learning of Objects and its Context
Author: Abhinav Gupta, Jianbo Shi, Larry S. Davis
Abstract: We present an approach that combines bag-of-words and spatial models to perform semantic and syntactic analysis for recognition of an object based on its internal appearance and its context. We argue that while object recognition requires modeling relative spatial locations of image features within the object, a bag-of-word is sufficient for representing context. Learning such a model from weakly labeled data involves labeling of features into two classes: foreground(object) or “informative” background(context). We present a “shape-aware” model which utilizes contour information for efficient and accurate labeling of features in the image. Our approach iterates between an MCMC-based labeling and contour based labeling of features to integrate co-occurrence of features and shape similarity. 1
2 0.93525124 241 nips-2008-Transfer Learning by Distribution Matching for Targeted Advertising
Author: Steffen Bickel, Christoph Sawade, Tobias Scheffer
Abstract: We address the problem of learning classifiers for several related tasks that may differ in their joint distribution of input and output variables. For each task, small – possibly even empty – labeled samples and large unlabeled samples are available. While the unlabeled samples reflect the target distribution, the labeled samples may be biased. This setting is motivated by the problem of predicting sociodemographic features for users of web portals, based on the content which they have accessed. Here, questionnaires offered to a portion of each portal’s users produce biased samples. We derive a transfer learning procedure that produces resampling weights which match the pool of all examples to the target distribution of any given task. Transfer learning enables us to make predictions even for new portals with few or no training data and improves the overall prediction accuracy. 1
3 0.9109292 225 nips-2008-Supervised Bipartite Graph Inference
Author: Yoshihiro Yamanishi
Abstract: We formulate the problem of bipartite graph inference as a supervised learning problem, and propose a new method to solve it from the viewpoint of distance metric learning. The method involves the learning of two mappings of the heterogeneous objects to a unified Euclidean space representing the network topology of the bipartite graph, where the graph is easy to infer. The algorithm can be formulated as an optimization problem in a reproducing kernel Hilbert space. We report encouraging results on the problem of compound-protein interaction network reconstruction from chemical structure data and genomic sequence data. 1
same-paper 4 0.89440835 183 nips-2008-Predicting the Geometry of Metal Binding Sites from Protein Sequence
Author: Paolo Frasconi, Andrea Passerini
Abstract: Metal binding is important for the structural and functional characterization of proteins. Previous prediction efforts have only focused on bonding state, i.e. deciding which protein residues act as metal ligands in some binding site. Identifying the geometry of metal-binding sites, i.e. deciding which residues are jointly involved in the coordination of a metal ion is a new prediction problem that has been never attempted before from protein sequence alone. In this paper, we formulate it in the framework of learning with structured outputs. Our solution relies on the fact that, from a graph theoretical perspective, metal binding has the algebraic properties of a matroid, enabling the application of greedy algorithms for learning structured outputs. On a data set of 199 non-redundant metalloproteins, we obtained precision/recall levels of 75%/46% correct ligand-ion assignments, which improves to 88%/88% in the setting where the metal binding state is known. 1
5 0.87797642 68 nips-2008-Efficient Direct Density Ratio Estimation for Non-stationarity Adaptation and Outlier Detection
Author: Takafumi Kanamori, Shohei Hido, Masashi Sugiyama
Abstract: We address the problem of estimating the ratio of two probability density functions (a.k.a. the importance). The importance values can be used for various succeeding tasks such as non-stationarity adaptation or outlier detection. In this paper, we propose a new importance estimation method that has a closed-form solution; the leave-one-out cross-validation score can also be computed analytically. Therefore, the proposed method is computationally very efficient and numerically stable. We also elucidate theoretical properties of the proposed method such as the convergence rate and approximation error bound. Numerical experiments show that the proposed method is comparable to the best existing method in accuracy, while it is computationally more efficient than competing approaches. 1
6 0.75224286 32 nips-2008-Bayesian Kernel Shaping for Learning Control
7 0.6114974 95 nips-2008-Grouping Contours Via a Related Image
8 0.59127063 26 nips-2008-Analyzing human feature learning as nonparametric Bayesian inference
9 0.58660364 128 nips-2008-Look Ma, No Hands: Analyzing the Monotonic Feature Abstraction for Text Classification
10 0.57646918 120 nips-2008-Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text
11 0.56838542 194 nips-2008-Regularized Learning with Networks of Features
12 0.56191528 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
13 0.55714822 91 nips-2008-Generative and Discriminative Learning with Unknown Labeling Bias
14 0.55542076 142 nips-2008-Multi-Level Active Prediction of Useful Image Annotations for Recognition
15 0.55453688 42 nips-2008-Cascaded Classification Models: Combining Models for Holistic Scene Understanding
16 0.54101568 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models
17 0.52864146 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
18 0.52860701 19 nips-2008-An Empirical Analysis of Domain Adaptation Algorithms for Genomic Sequence Analysis
20 0.5172447 130 nips-2008-MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of Images and Visual Features