nips nips2004 nips2004-80 knowledge-graph by maker-knowledge-mining

80 nips-2004-Identifying Protein-Protein Interaction Sites on a Genome-Wide Scale

Source: pdf

Author: Haidong Wang, Eran Segal, Asa Ben-Hur, Daphne Koller, Douglas L. Brutlag

Abstract: Protein interactions typically arise from a physical interaction of one or more small sites on the surface of the two proteins. Identifying these sites is very important for drug and protein design. In this paper, we propose a computational method based on probabilistic relational model that attempts to address this task using high-throughput protein interaction data and a set of short sequence motifs. We learn the model using the EM algorithm, with a branch-and-bound algorithm as an approximate inference for the E-step. Our method searches for motifs whose presence in a pair of interacting proteins can explain their observed interaction. It also tries to determine which motif pairs have high afﬁnity, and can therefore lead to an interaction. We show that our method is more accurate than others at predicting new protein-protein interactions. More importantly, by examining solved structures of protein complexes, we ﬁnd that 2/3 of the predicted active motifs correspond to actual interaction sites. 1

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 In this paper, we propose a computational method based on probabilistic relational model that attempts to address this task using high-throughput protein interaction data and a set of short sequence motifs. [sent-11, score-0.541]

2 Our method searches for motifs whose presence in a pair of interacting proteins can explain their observed interaction. [sent-13, score-0.747]

3 It also tries to determine which motif pairs have high afﬁnity, and can therefore lead to an interaction. [sent-14, score-0.683]

4 More importantly, by examining solved structures of protein complexes, we ﬁnd that 2/3 of the predicted active motifs correspond to actual interaction sites. [sent-16, score-1.03]

5 Discovering the protein interaction map can therefore help to better understand the workings of the cell. [sent-18, score-0.497]

6 Interactions between two proteins arise from physical interactions between small regions on the surface of the proteins [4] (see Fig. [sent-20, score-0.561]

7 The small elements denote motif occurrences on proteins, with red denoting active and gray denoting inactive motifs. [sent-31, score-0.868]

8 The dependencies involving inactive motif pairs were removed from the graph because they do not affect the rest of the model. [sent-34, score-0.759]

9 and the sites at which the interactions take place, which uses as input only high-throughput protein-protein interaction data and the protein sequences. [sent-35, score-0.705]

10 In particular, our method assumes no knowledge of the 3D protein structure, or of the sites at which binding occurs. [sent-36, score-0.548]

11 Our approach is based on the assumption that interaction sites can be described using a limited repertoire of conserved sequence motifs [11]. [sent-37, score-0.654]

12 This is a reasonable assumption since interaction sites are signiﬁcantly more conserved than the rest of the protein surface [12]. [sent-38, score-0.594]

13 Given a protein interaction map, our method tries to explain the observed interactions by identifying a set of sites of motif occurrence on every pair of interacting proteins through which the interaction is mediated. [sent-39, score-1.833]

14 Here, the interaction pattern of the protein P1 can best be explained using the motif pair a, b, where a appears in P1 and b in the proteins P2 , P3 , P4 but not in P5 . [sent-42, score-1.358]

15 By contrast, the motif pair d, b is not as good an explanation, because d also appears in P5 , which has a different interaction pattern. [sent-43, score-0.812]

16 In general, our method aims to identify motif pairs that have high afﬁnity, potentially leading to interaction between protein pairs that contain them. [sent-44, score-1.218]

17 However, a sequence motif might be used for a different purpose, and not give rise to an active binding site; it might also be buried inside the protein, and thus be inaccessible for interaction. [sent-45, score-0.88]

18 Thus, the appearance of an appropriate motif does not always imply interaction. [sent-46, score-0.647]

19 A key feature of our approach is that we allow each motif occurrence in a protein to be either active or inactive. [sent-47, score-1.116]

20 Interactions are then induced only by the interactions of highafﬁnity active motifs in the two proteins. [sent-48, score-0.682]

21 Thus, in our example, the motif d in p2 is inactive, and hence does not lead to an interaction between p2 and p4 , despite the afﬁnity between the motif pair c, d. [sent-49, score-1.459]

22 [8] proposed a somewhat related method for genome-wide analysis of protein interaction data, based on protein domains. [sent-51, score-0.836]

23 Our goal is thus to identify two components: the afﬁnities between pairs of motifs, and the activity of the occurrences of motifs in different proteins. [sent-53, score-0.543]

24 Our algorithm addresses this problem by using the framework of Bayesian networks [13] and probabilistic relational models [14], which allows us to handle the inherent noise in the protein interaction data and the uncertain relationship between interactions and motif pairs. [sent-54, score-1.309]

25 We construct a model encoding our assumption that protein interactions are induced by the interactions of active motif pairs. [sent-55, score-1.432]

26 We then use the EM algorithm [15], to ﬁll in the details of the model, learning both the motif afﬁnities and activities from the observed data of protein-protein interactions and protein motif occurrences. [sent-56, score-1.829]

27 We evaluated our model on protein-protein interactions in yeast and Prosite motifs [11]. [sent-58, score-0.628]

28 Finally, we examined co-crystallized protein pairs where the 3D structure of the interaction is known, so that we can determine the sites at which the interaction took place. [sent-62, score-0.702]

29 We show that our active motifs are more likely to participate in interactions. [sent-63, score-0.547]

30 2 The Probabilistic Model The basic entities in our probabilistic model are the proteins and the set of sequence motifs that can mediate protein interactions. [sent-64, score-1.041]

31 Each protein P is associated with the set of motifs that occur in it, denoted by P. [sent-69, score-0.773]

32 As we discussed, a key premise of our approach is that a speciﬁc occurrence of a sequence motif may or may not be active. [sent-71, score-0.685]

33 A a , which takes the value true if Aa is active in protein P and false otherwise. [sent-74, score-0.54]

34 M | the number of active motifs in a protein is roughly a constant fraction of the total number of motifs in the protein, but that even proteins with few motifs tend to have at least some number of active motifs. [sent-80, score-1.998]

35 A pair of active motifs in two proteins can potentially bind and induce an interaction between the corresponding proteins. [sent-81, score-0.931]

36 Thus, in our model, a pair of proteins interact if each contains an active motif, and this pair of motifs bind to each other. [sent-82, score-0.862]

37 The probability with which two active motif occurrences bind is their afﬁnity. [sent-90, score-0.85]

38 We model the binding event between two motif occurrences using a variable Tij . [sent-91, score-0.834]

39 This model reﬂects our assumption that two motif occurrences can bind only if they are both active, but their actual binding probability depends on their afﬁnity. [sent-97, score-0.881]

40 Note that this afﬁnity is a feature of the motif pair and does not depend on the proteins in which they appear. [sent-98, score-0.874]

41 We must also account for interactions that are not explained by our set of motifs, whether because of false positives in the data, or because of inadequacies of our model or of our motif set. [sent-99, score-0.905]

42 Finally, we observe an interaction between two proteins if and only if some form of binding occurs, whether by a motif pair or a spurious binding. [sent-104, score-1.137]

43 O, which represents whether protein i was observed to interact with protein j, to be a deterministic OR of all the binding variables Tij . [sent-106, score-0.89]

44 O is a noisy-OR [13] of all motif pair variables Tij . [sent-110, score-0.734]

45 Note that our model accounts for both types of errors in the protein interaction data. [sent-112, score-0.486]

46 False negatives (missing interactions) in the data are addressed through the fact that the presence of an active motif pair only implies that binding takes place with some probability. [sent-113, score-0.904]

47 In a typical setting, we are given as input a protein interaction data set, specifying a set of proteins P and a set of observed interacting pairs T. [sent-138, score-0.782]

48 We are also given a set of potentially relevant motifs, and the occurrences of these motifs in the different proteins in P. [sent-140, score-0.679]

49 S; we also need to ﬁnd a setting of the model parameters Θ, which specify the motif afﬁnities. [sent-146, score-0.659]

50 We use a variant of the EM algorithm [15] to ﬁnd both an assignment to the parameters Θ, and an assignment to the motif variables P. [sent-147, score-0.932]

51 Note that, to maximize this objective, we search for a MAP assignment to the motif activity variables, but sum out over the other hidden variables. [sent-151, score-0.794]

52 This design decision is reasonable in our setting, where determining motif activities is an important goal; it is a key assumption for our computational procedure. [sent-152, score-0.669]

53 In our model, any two motif variables (both within the same protein and across different proteins) are correlated, as there exists a path of inﬂuence between them in the underlying Bayesian network (see Fig. [sent-154, score-1.043]

54 Our ﬁrst observation is that the only variables that correlate the different protein pairs Tij are the motif variables P. [sent-160, score-1.124]

55 Given an assignment to these activity variables, the network decomposes into a set of independent subnetworks, one for each protein pair. [sent-162, score-0.498]

56 In the ﬁrst, we ﬁnd an assignment to the motif variables in each protein, P. [sent-164, score-0.812]

57 A; in the second, we compute the posterior probability over the binding motif pair variables T. [sent-165, score-0.851]

58 We observe that, as all the motif pair variables, T. [sent-169, score-0.689]

59 A, are fully determined by the motif variables, the only variables left to reason about are the binding variables T. [sent-170, score-0.854]

60 P (O | A, Θ) The ﬁrst term in the numerator is simply the motif afﬁnity θab ; the second term is 1 if O = true and 0 otherwise. [sent-179, score-0.687]

61 We now turn to the ﬁrst phase, of ﬁnding a setting to all of the motif variables. [sent-182, score-0.647]

62 Our method iterates over proteins, ﬁnding in each iteration the optimal assignment to the motif variables of each protein given the current assignment to the motif activities in the remaining proteins. [sent-186, score-1.963]

63 However, the computation for each protein is still exponential in the number of motifs it contains, which can be large (e. [sent-190, score-0.773]

64 However, in our speciﬁc model, we can apply the following branch-and-bound algorithm (similar to an approach proposed by Henrion [17] for BN2O networks) to ﬁnd the globally optimal assignment to the motif variables of each protein. [sent-193, score-0.812]

65 We can show that if making a motif active relative to one assignment does not improve the objective, it will also not improve the objective relative to a large set of other assignments. [sent-196, score-0.878]

66 A is the ﬁxed assignment to motif variables in all proteins except Pi . [sent-201, score-0.997]

67 A−a denote the assignment to all the motif variables in Pi except for Aa . [sent-203, score-0.812]

68 Ab =true (1 − θab ) denote the probability that motif a does not bind with any active motif in Pj . [sent-207, score-1.439]

69 O=true where g is the prior probability for a motif in protein Pi to be active. [sent-217, score-0.998]

70 Now, consider a different point in the search, where our current motif activity assignment is Pi . [sent-218, score-0.794]

71 A−a ) It follows that, if switching motif a from inactive to active relative to Pi . [sent-239, score-0.81]

72 Our algorithm keeps a set V of viable candidates for motif assignments. [sent-243, score-0.647]

73 For presentation, we encode assignments via the set of active motifs they contain. [sent-244, score-0.554]

74 We start out by considering motif assignments with a single active motif. [sent-246, score-0.779]

75 We continue this process for all assignments of size k: For each assignment with active motif set S, we test whether S − {a} ∈ V for all a ∈ S; if we compare f (S) to each f (S − {a}), and add it if it dominates all of them. [sent-252, score-0.911]

76 However, in our setting, a protein with many motifs has a low prior probability that each of them is active. [sent-258, score-0.773]

77 O, Θ): The assignment has higher probability than any assignment that changes any of the motif variables for any single protein. [sent-264, score-0.932]

78 cerevisiae protein interactions data from MIPS [2] and DIP [3] databases. [sent-269, score-0.513]

79 We used sequence motifs from the Prosite database [11] resulting in a dataset of 516 different motifs with an average of 7. [sent-274, score-0.876]

80 If a motif pair doesn’t appear between any pair of interacting proteins, we initialize its afﬁnity to be 0 to maximize the joint likelihood. [sent-276, score-0.818]

81 We set the initial afﬁnity for the remaining 8475 motif pairs to 0. [sent-278, score-0.683]

82 We train our model with motifs initialized to be either all active (P. [sent-280, score-0.532]

83 Our branch-and-bound algorithm is able to signiﬁcantly reduce the number of motif activity assignments that need to be evaluated. [sent-285, score-0.708]

84 We thus compare our method to a baseline method that ranks pairs of proteins on the basis of the maximum enrichment of over-represented motif pairs (see [18] for details). [sent-294, score-0.946]

85 For completeness, we compare the two variants of the model using data on the domain (Pfam and ProDom [19]) content of the proteins as well as the Prosite motif content. [sent-297, score-0.844]

86 Y-axis is the proportion of all interacting protein pairs in the training data predicted to interact. [sent-315, score-0.528]

87 (b) Two protein chains that form a part of the 1ryp complex in PDB, interacting at the site of two short sequence motifs. [sent-319, score-0.493]

88 2(a) show that our method outperforms the other methods, and that the additional degree of freedom of allowing motifs to be inactive is essential. [sent-321, score-0.498]

89 From the residues that are in contact between two chains (distance < 5 Angstr¨ ms), we infer which protein motifs participate in o interactions. [sent-329, score-0.82]

90 On these proteins, our data contained a total of 620 motif occurrences, of which 386 are predicted to be active. [sent-331, score-0.683]

91 On the residue level, our predicted active motifs consist of 3736 amino acids, and 1388 of them (37. [sent-337, score-0.569]

92 In comparison, our predicted inactive motifs consist of 3506 amino acids, and only 588 of them (16. [sent-339, score-0.536]

93 This signiﬁcant enrichment provides support for the ability of our method to detect motifs that participate in interactions. [sent-341, score-0.495]

94 In fact, the set of interactions in PDB is only a subset of the interactions those proteins participate in. [sent-342, score-0.536]

95 Therefore, the actual rate of false positive active motifs is likely to be lower than we report here. [sent-343, score-0.583]

96 Our result shows that our method successfully uncovers motif activities and binding afﬁnities, and uses them to predict both protein interactions and speciﬁc binding sites. [sent-346, score-1.438]

97 We can also integrate protein interaction data across multiple species; for example, we might try to use the yeast interaction data to provide more accurate predictions for the protein-protein interactions in ﬂy [10]. [sent-350, score-0.78]

98 Dip ; the database of interacting proteins: a research tool for studying cellular networks of protein interactions. [sent-363, score-0.481]

99 Discovering molecular pathways from protein interaction and gene expression data. [sent-400, score-0.491]

100 Are protein protein interfaces more conserved in sequence than the rest of the protein surface? [sent-414, score-1.104]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('motif', 0.647), ('motifs', 0.422), ('protein', 0.351), ('tij', 0.218), ('proteins', 0.185), ('interactions', 0.162), ('pi', 0.162), ('interaction', 0.123), ('assignment', 0.12), ('binding', 0.117), ('active', 0.098), ('interacting', 0.087), ('sites', 0.069), ('bab', 0.068), ('inactive', 0.065), ('pj', 0.064), ('false', 0.063), ('af', 0.06), ('prosite', 0.059), ('occurrences', 0.058), ('bind', 0.047), ('variables', 0.045), ('acids', 0.044), ('nity', 0.043), ('pdb', 0.043), ('pair', 0.042), ('aab', 0.039), ('deng', 0.039), ('predicted', 0.036), ('pairs', 0.036), ('assignments', 0.034), ('ha', 0.029), ('cellular', 0.029), ('true', 0.028), ('ab', 0.028), ('activity', 0.027), ('nucleic', 0.027), ('participate', 0.027), ('interact', 0.026), ('spurious', 0.023), ('activities', 0.022), ('conserved', 0.022), ('et', 0.022), ('predicting', 0.021), ('yeast', 0.021), ('em', 0.021), ('occurrence', 0.02), ('chains', 0.02), ('brutlag', 0.02), ('enrichment', 0.02), ('haidong', 0.02), ('mediate', 0.02), ('mips', 0.02), ('segal', 0.02), ('sprinzak', 0.02), ('nities', 0.02), ('res', 0.019), ('entities', 0.019), ('sequence', 0.018), ('proportion', 0.018), ('aa', 0.018), ('surface', 0.018), ('asa', 0.017), ('eran', 0.017), ('ad', 0.017), ('site', 0.017), ('molecular', 0.017), ('koller', 0.016), ('scarce', 0.016), ('ability', 0.015), ('drug', 0.015), ('database', 0.014), ('probabilistic', 0.014), ('hb', 0.014), ('instantiation', 0.014), ('densely', 0.014), ('potentially', 0.014), ('identifying', 0.013), ('objective', 0.013), ('biology', 0.013), ('amino', 0.013), ('numerator', 0.012), ('relational', 0.012), ('domains', 0.012), ('understand', 0.012), ('model', 0.012), ('discovering', 0.012), ('add', 0.012), ('validate', 0.012), ('physical', 0.011), ('method', 0.011), ('fold', 0.011), ('genome', 0.011), ('evaluated', 0.011), ('positives', 0.011), ('map', 0.011), ('intuition', 0.011), ('rest', 0.011), ('predict', 0.011), ('explained', 0.01)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 80 nips-2004-Identifying Protein-Protein Interaction Sites on a Genome-Wide Scale

Author: Haidong Wang, Eran Segal, Asa Ben-Hur, Daphne Koller, Douglas L. Brutlag

2 0.089349262 146 nips-2004-Pictorial Structures for Molecular Modeling: Interpreting Density Maps

Author: Frank Dimaio, George Phillips, Jude W. Shavlik

Abstract: X-ray crystallography is currently the most common way protein structures are elucidated. One of the most time-consuming steps in the crystallographic process is interpretation of the electron density map, a task that involves finding patterns in a three-dimensional picture of a protein. This paper describes DEFT (DEFormable Template), an algorithm using pictorial structures to build a flexible protein model from the protein's amino-acid sequence. Matching this pictorial structure into the density map is a way of automating density-map interpretation. Also described are several extensions to the pictorial structure matching algorithm necessary for this automated interpretation. DEFT is tested on a set of density maps ranging from 2 to 4Å resolution, producing rootmean-squared errors ranging from 1.38 to 1.84Å. 1 In trod u ction An important question in molecular biology is what is the structure of a particular protein? Knowledge of a protein’s unique conformation provides insight into the mechanisms by which a protein acts. However, no algorithm exists that accurately maps sequence to structure, and one is forced to use

3 0.081002764 177 nips-2004-Supervised Graph Inference

Author: Jean-philippe Vert, Yoshihiro Yamanishi

Abstract: We formulate the problem of graph inference where part of the graph is known as a supervised learning problem, and propose an algorithm to solve it. The method involves the learning of a mapping of the vertices to a Euclidean space where the graph is easy to infer, and can be formulated as an optimization problem in a reproducing kernel Hilbert space. We report encouraging results on the problem of metabolic network reconstruction from genomic data. 1

4 0.074912496 95 nips-2004-Large-Scale Prediction of Disulphide Bond Connectivity

Author: Jianlin Cheng, Alessandro Vullo, Pierre F. Baldi

Abstract: The formation of disulphide bridges among cysteines is an important feature of protein structures. Here we develop new methods for the prediction of disulphide bond connectivity. We ﬁrst build a large curated data set of proteins containing disulphide bridges and then use 2-Dimensional Recursive Neural Networks to predict bonding probabilities between cysteine pairs. These probabilities in turn lead to a weighted graph matching problem that can be addressed efﬁciently. We show how the method consistently achieves better results than previous approaches on the same validation data. In addition, the method can easily cope with chains with arbitrary numbers of bonded cysteines. Therefore, it overcomes one of the major limitations of previous approaches restricting predictions to chains containing no more than 10 oxidized cysteines. The method can be applied both to situations where the bonded state of each cysteine is known or unknown, in which case bonded state can be predicted with 85% precision and 90% recall. The method also yields an estimate for the total number of disulphide bridges in each chain. 1

5 0.046919819 52 nips-2004-Discrete profile alignment via constrained information bottleneck

Author: Sean O'rourke, Gal Chechik, Robin Friedman, Eleazar Eskin

Abstract: Amino acid proﬁles, which capture position-speciﬁc mutation probabilities, are a richer encoding of biological sequences than the individual sequences themselves. However, proﬁle comparisons are much more computationally expensive than discrete symbol comparisons, making proﬁles impractical for many large datasets. Furthermore, because they are such a rich representation, proﬁles can be diﬃcult to visualize. To overcome these problems, we propose a discretization for proﬁles using an expanded alphabet representing not just individual amino acids, but common proﬁles. By using an extension of information bottleneck (IB) incorporating constraints and priors on the class distributions, we ﬁnd an informationally optimal alphabet. This discretization yields a concise, informative textual representation for proﬁle sequences. Also alignments between these sequences, while nearly as accurate as the full proﬁleproﬁle alignments, can be computed almost as quickly as those between individual or consensus sequences. A full pairwise alignment of SwissProt would take years using proﬁles, but less than 3 days using a discrete IB encoding, illustrating how discrete encoding can expand the range of sequence problems to which proﬁle information can be applied. 1

6 0.041634094 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss

7 0.040118162 40 nips-2004-Common-Frame Model for Object Recognition

8 0.038928043 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill

9 0.036295045 120 nips-2004-Modeling Conversational Dynamics as a Mixed-Memory Markov Process

10 0.03469754 82 nips-2004-Incremental Algorithms for Hierarchical Classification

11 0.033607297 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach

12 0.033026021 161 nips-2004-Self-Tuning Spectral Clustering

13 0.032854877 79 nips-2004-Hierarchical Eigensolver for Transition Matrices in Spectral Methods

14 0.031695534 128 nips-2004-Neural Network Computation by In Vitro Transcriptional Circuits

15 0.030920541 136 nips-2004-On Semi-Supervised Classification

16 0.030070251 23 nips-2004-Analysis of a greedy active learning strategy

17 0.029300045 6 nips-2004-A Hidden Markov Model for de Novo Peptide Sequencing

18 0.029172178 149 nips-2004-Probabilistic Inference of Alternative Splicing Events in Microarray Data

19 0.027730338 44 nips-2004-Conditional Random Fields for Object Recognition

20 0.027674608 143 nips-2004-PAC-Bayes Learning of Conjunctions and Classification of Gene-Expression Data

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.083), (1, 0.006), (2, -0.011), (3, -0.026), (4, 0.0), (5, 0.019), (6, -0.018), (7, 0.042), (8, -0.015), (9, -0.055), (10, -0.002), (11, 0.02), (12, -0.009), (13, 0.036), (14, 0.014), (15, -0.038), (16, -0.017), (17, -0.055), (18, 0.019), (19, -0.018), (20, 0.051), (21, 0.098), (22, -0.137), (23, -0.049), (24, 0.083), (25, -0.071), (26, -0.095), (27, -0.097), (28, 0.0), (29, 0.053), (30, -0.239), (31, 0.097), (32, 0.148), (33, -0.027), (34, -0.012), (35, 0.066), (36, 0.068), (37, 0.125), (38, 0.014), (39, 0.156), (40, -0.035), (41, -0.042), (42, 0.038), (43, 0.041), (44, -0.062), (45, 0.102), (46, -0.13), (47, -0.081), (48, 0.028), (49, 0.041)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96100789 80 nips-2004-Identifying Protein-Protein Interaction Sites on a Genome-Wide Scale

Author: Haidong Wang, Eran Segal, Asa Ben-Hur, Daphne Koller, Douglas L. Brutlag

2 0.69784123 95 nips-2004-Large-Scale Prediction of Disulphide Bond Connectivity

Author: Jianlin Cheng, Alessandro Vullo, Pierre F. Baldi

3 0.66046917 146 nips-2004-Pictorial Structures for Molecular Modeling: Interpreting Density Maps

Author: Frank Dimaio, George Phillips, Jude W. Shavlik

4 0.52832007 177 nips-2004-Supervised Graph Inference

Author: Jean-philippe Vert, Yoshihiro Yamanishi

5 0.4534286 6 nips-2004-A Hidden Markov Model for de Novo Peptide Sequencing

Author: Bernd Fischer, Volker Roth, Jonas Grossmann, Sacha Baginsky, Wilhelm Gruissem, Franz Roos, Peter Widmayer, Joachim M. Buhmann

Abstract: De novo Sequencing of peptides is a challenging task in proteome research. While there exist reliable DNA-sequencing methods, the highthroughput de novo sequencing of proteins by mass spectrometry is still an open problem. Current approaches suffer from a lack in precision to detect mass peaks in the spectrograms. In this paper we present a novel method for de novo peptide sequencing based on a hidden Markov model. Experiments effectively demonstrate that this new method significantly outperforms standard approaches in matching quality. 1

6 0.4130356 108 nips-2004-Markov Networks for Detecting Overalpping Elements in Sequence Data

7 0.40381932 149 nips-2004-Probabilistic Inference of Alternative Splicing Events in Microarray Data

8 0.38995862 52 nips-2004-Discrete profile alignment via constrained information bottleneck

9 0.34156707 128 nips-2004-Neural Network Computation by In Vitro Transcriptional Circuits

10 0.30692074 57 nips-2004-Economic Properties of Social Networks

11 0.28798363 120 nips-2004-Modeling Conversational Dynamics as a Mixed-Memory Markov Process

12 0.24386078 86 nips-2004-Instance-Specific Bayesian Model Averaging for Classification

13 0.23834373 40 nips-2004-Common-Frame Model for Object Recognition

14 0.23278376 35 nips-2004-Chemosensory Processing in a Spiking Model of the Olfactory Bulb: Chemotopic Convergence and Center Surround Inhibition

15 0.23078379 141 nips-2004-Optimal sub-graphical models

16 0.21292487 76 nips-2004-Hierarchical Bayesian Inference in Networks of Spiking Neurons

17 0.20961522 43 nips-2004-Conditional Models of Identity Uncertainty with Application to Noun Coreference

18 0.205559 38 nips-2004-Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms

19 0.20050351 136 nips-2004-On Semi-Supervised Classification

20 0.19347565 23 nips-2004-Analysis of a greedy active learning strategy

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.011), (13, 0.053), (15, 0.068), (17, 0.013), (25, 0.014), (26, 0.022), (31, 0.024), (33, 0.204), (35, 0.021), (39, 0.013), (50, 0.04), (58, 0.383)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.82990688 88 nips-2004-Intrinsically Motivated Reinforcement Learning

Author: Nuttapong Chentanez, Andrew G. Barto, Satinder P. Singh

Abstract: Psychologists call behavior intrinsically motivated when it is engaged in for its own sake rather than as a step toward solving a speciﬁc problem of clear practical value. But what we learn during intrinsically motivated behavior is essential for our development as competent autonomous entities able to efﬁciently solve a wide range of practical problems as they arise. In this paper we present initial results from a computational study of intrinsically motivated reinforcement learning aimed at allowing artiﬁcial agents to construct and extend hierarchies of reusable skills that are needed for competent autonomy. 1

same-paper 2 0.76441467 80 nips-2004-Identifying Protein-Protein Interaction Sites on a Genome-Wide Scale

Author: Haidong Wang, Eran Segal, Asa Ben-Hur, Daphne Koller, Douglas L. Brutlag

3 0.69930631 143 nips-2004-PAC-Bayes Learning of Conjunctions and Classification of Gene-Expression Data

Author: Mario Marchand, Mohak Shah

Abstract: We propose a “soft greedy” learning algorithm for building small conjunctions of simple threshold functions, called rays, deﬁned on single real-valued attributes. We also propose a PAC-Bayes risk bound which is minimized for classiﬁers achieving a non-trivial tradeoﬀ between sparsity (the number of rays used) and the magnitude of the separating margin of each ray. Finally, we test the soft greedy algorithm on four DNA micro-array data sets. 1

4 0.50000846 149 nips-2004-Probabilistic Inference of Alternative Splicing Events in Microarray Data

Author: Ofer Shai, Brendan J. Frey, Quaid D. Morris, Qun Pan, Christine Misquitta, Benjamin J. Blencowe

Abstract: Alternative splicing (AS) is an important and frequent step in mammalian gene expression that allows a single gene to specify multiple products, and is crucial for the regulation of fundamental biological processes. The extent of AS regulation, and the mechanisms involved, are not well understood. We have developed a custom DNA microarray platform for surveying AS levels on a large scale. We present here a generative model for the AS Array Platform (GenASAP) and demonstrate its utility for quantifying AS levels in different mouse tissues. Learning is performed using a variational expectation maximization algorithm, and the parameters are shown to correctly capture expected AS trends. A comparison of the results obtained with a well-established but low through-put experimental method demonstrate that AS levels obtained from GenASAP are highly predictive of AS levels in mammalian tissues. 1 Biological diversity through alternative splicing Current estimates place the number of genes in the human genome at approximately 30,000, which is a surprisingly small number when one considers that the genome of yeast, a singlecelled organism, has 6,000 genes. The number of genes alone cannot account for the complexity and cell specialization exhibited by higher eukaryotes (i.e. mammals, plants, etc.). Some of that added complexity can be achieved through the use of alternative splicing, whereby a single gene can be used to code for a multitude of products. Genes are segments of the double stranded DNA that contain the information required by the cell for protein synthesis. That information is coded using an alphabet of 4 (A, C, G, and T), corresponding to the four nucleotides that make up the DNA. In what is known as the central dogma of molecular biology, DNA is transcribed to RNA, which in turn is translated into proteins. Messenger RNA (mRNA) is synthesized in the nucleus of the cell and carries the genomic information to the ribosome. In eukaryotes, genes are generally comprised of both exons, which contain the information needed by the cell to synthesize proteins, and introns, sometimes referred to as spacer DNA, which are spliced out of the pre-mRNA to create mature mRNA. An estimated 35%-75% of human genes [1] can be C1 (a) C1 A C1 C2 C1 A 3’ C2 (b) C2 C1 A 3’ C1 A 5’ C2 A 5’ C2 C1 C1 C2 C1 (c) A2 A1 C1 A1 C1 A2 C1 C2 C2 C1 (d) C2 C2 A C2 C2 C2 C1 C2 Figure 1: Four types of AS. Boxes represent exons and lines represent introns, with the possible splicing alternatives indicated by the connectors. (a) Single cassette exon inclusion/exclusion. C1 and C2 are constitutive exons (exons that are included in all isoforms) and ﬂank a single alternative exon (A). The alternative exon is included in one isoform and excluded in the other. (b) Alternative 3’ (or donor) and alternative 5’ (acceptor) splicing sites. Both exons are constitutive, but may contain alternative donor and/or acceptor splicing sites. (c) Mutually exclusive exons. One of the two alternative exons (A1 and A2 ) may be included in the isoform, but not both. (d) Intron inclusion. An intron may be included in the mature mRNA strand. spliced to yield different combinations of exons (called isoforms), a phenomenon referred to as alternative splicing (AS). There are four major types of AS as shown in Figure 1. Many multi-exon genes may undergo more than one alternative splicing event, resulting in many possible isoforms from a single gene. [2] In addition to adding to the genetic repertoire of an organism by enabling a single gene to code for more than one protein, AS has been shown to be critical for gene regulation, contributing to tissue speciﬁcity, and facilitating evolutionary processes. Despite the evident importance of AS, its regulation and impact on speciﬁc genes remains poorly understood. The work presented here is concerned with the inference of single cassette exon AS levels (Figure 1a) based on data obtained from RNA expression arrays, also known as microarrays. 1.1 An exon microarray data set that probes alternative splicing events Although it is possible to directly analyze the proteins synthesized by a cell, it is easier, and often more informative, to instead measure the abundance of mRNA present. Traditionally, gene expression (abundance of mRNA) has been studied using low throughput techniques (such as RT-PCR or Northern blots), limited to studying a few sequences at a time and making large scale analysis nearly impossible. In the early 1990s, microarray technology emerged as a method capable of measuring the expression of thousands of DNA sequences simultaneously. Sequences of interest are deposited on a substrate the size of a small microscope slide, to form probes. The mRNA is extracted from the cell and reverse-transcribed back into DNA, which is labelled with red and green ﬂuorescent dye molecules (cy3 and cy5 respectively). When the sample of tagged DNA is washed over the slide, complementary strands of DNA from the sample hybridize to the probes on the array forming A-T and C-G pairings. The slide is then scanned and the ﬂuorescent intensity is measured at each probe. It is generally assumed that the intensity measure at the probe is linearly related to the abundance of mRNA in the cell over a wide dynamic range. Despite signiﬁcant improvements in microarray technologies in recent years, microarray data still presents some difﬁculties in analysis. Low measurements tend to have extremely low signal to noise ratio (SNR) [7] and probes often bind to sequences that are very similar, but not identical, to the one for which they were designed (a process referred to as cross- C1 A C1 A C1:A C1 C2 C2 3 Body probes A:C2 A C2 2 Inclusion junction probes C1:C2 C1 C2 1 Exclusion junction probe Figure 2: Each alternative splicing event is studied using six probes. Probes were chosen to measure the expression levels of each of the three exons involved in the event. Additionally, 3 probes are used that target the junctions that are formed by each of the two isoforms. The inclusion isoform would express the junctions formed by C1 and A, and A and C2 , while the exclusion isoform would express the junction formed by C1 and C2 hybridization). Additionally, probes exhibit somewhat varying hybridization efﬁciency, and sequences exhibit varying labelling efﬁciency. To design our data sets, we mined public sequence databases and identiﬁed exons that were strong candidates for exhibiting AS (the details of that analysis are provided elsewhere [4, 3]). Of the candidates, 3,126 potential AS events in 2,647 unique mouse genes were selected for the design of Agilent Custom Oligonucleotide microarray. The arrays were hybridized with unampliﬁed mRNA samples extracted from 10 wild-type mouse tissues (brain, heart, intestine, kidney, liver, lung, salivary gland, skeletal muscle, spleen, and testis). Each AS event has six target probes on the arrays, chosen from regions of the C1 exon, C2 exon, A exon, C1 :A splice junction, A:C2 splice junction, and C1 :C2 splice junction, as shown in Figure 2. 2 Unsupervised discovery of alternative splicing With the exception of the probe measuring the alternative exon, A (Figure 2), all probes measure sequences that occur in both isoforms. For example, while the sequence of the probe measuring the junction A:C1 is designed to measure the inclusion isoform, half of it corresponds to a sequence that is found in the exclusion isoform. We can therefore safely assume that the measured intensity at each probe is a result of a certain amount of both isoforms binding to the probe. Due to the generally assumed linear relationship between the abundance of mRNA hybridized at a probe and the ﬂuorescent intensity measured, we model the measured intensity as a weighted sum of the overall abundance of the two isoforms. A stronger assumption is that of a single, consistent hybridization proﬁle for both isoforms across all probes and all slides. Ideally, one would prefer to estimate an individual hybridization proﬁle for each AS event studied across all slides. However, in our current setup, the number of tissues is small (10), resulting in two difﬁculties. First, the number of parameters is very large when compared to the number of data point using this model, and second, a portion of the events do not exhibit tissue speciﬁc alternative splicing within our small set of tissues. While the ﬁrst hurdle could be accounted for using Baysian parameter estimation, the second cannot. 2.1 GenASAP - a generative model for alternative splicing array platform Using the setup described above, the expression vector x, containing the six microarray measurements as real numbers, can be decomposed as a linear combination of the abundance of the two splice isoforms, represented by the real vector s, with some added noise: x = Λs + noise, where Λ is a 6 × 2 weight matrix containing the hybridization proﬁles for s1 ^ xC ^ xC 1 2 s2 ^ xC :A ^ xA ^ xA:C 2 ^ xC1:C2 2 xC1:C2 1 r xC xC 2 xA xC :A xA:C oC1 oC2 oA oC :A oA:C 1 1 1 2 oC :C 1 2 Σn 2 Figure 3: Graphical model for alternative splicing. Each measurement in the observed expression proﬁle, x, is generated by either using a scale factor, r, on a linear combination of the isoforms, s, or drawing randomly from an outlier model. For a detailed description of the model, see text. the two isoforms across the six probes. Note that we may not have a negative amount of a given isoform, nor can the presence of an isoform deduct from the measured expression, and so both s and Λ are constrained to be positive. Expression levels measured by microarrays have previously been modelled as having expression-dependent noise [7]. To address this, we rewrite the above formulation as x = r(Λs + ε), (1) where r is a scale factor and ε is a zero-mean normally distributed random variable with a diagonal covariance matrix, Ψ, denoted as p(ε) = N (ε; 0, Ψ). The prior distribution for the abundance of the splice isoforms is given by a truncated normal distribution, denoted as p(s) ∝ N (s, 0, I)[s ≥ 0], where [·] is an indicator function such that [s ≥ 0] = 1 if ∀i, si ≥ 0, and [s ≥ 0] = 0 otherwise. Lastly, there is a need to account for aberrant observations (e.g. due to faulty probes, ﬂakes of dust, etc.) with an outlier model. The complete GenASAP model (shown in Figure 3) accounts for the observations as the outcome of either applying equation (1) or an outlier model. To avoid degenerate cases and ensure meaningful and interpretable results, the number of faulty probes considered for each AS event may not exceed two, as indicated by the ﬁlled-in square constraint node in Figure 3. The distribution of x conditional on the latent variables, s, r, and o, is: N (xi ; rΛi s, r2 Ψi )[oi =0] N (xi ; Ei , Vi )[oi =1] , p(x|s, r, o) = (2) i where oi ∈ {0, 1} is a bernoulli random variable indicating if the measurement at probe xi is the result of the AS model or the outlier model parameterized by p(oi = 1) = γi . The parameters of the outlier model, E and V, are not optimized and are set to the mean and variance of the data. 2.2 Variational learning in the GenASAP model To infer the posterior distribution over the splice isoform abundances while at the same time learning the model parameters we use a variational expectation-maximization algorithm (EM). EM maximizes the log likelihood of the data by iteratively estimating the posterior distribution of the model given the data in the expectation (E) step, and maximizing the log likelihood with respect to the parameters, while keeping the posterior ﬁxed, in the maximization (M) step. Variational EM is used when, as in the case of GenASAP, the exact posterior is intractable. Variational EM minimizes the free energy of the model, deﬁned as the KL-divergence between the joint distribution of the latent and observed variables and the approximation to the posterior under the model parameters [5, 6]. We approximate the true posterior using the Q distribution given by T Q({s(t) }, {o(t) }, {r(t) }) = (t) Q(r(t) )Q(o(t) |r(t) ) t=1 (t) Q(si |oi , r(t) ) i −1 T =Z (t) (3) d d ρ(t) ω (t) N (s(t) ; µ(t) , Σ(t) )[s(t) ≥ 0], ro ro t=1 where Z is a normalization constant, the superscript d indicates that Σ is constrained to be diagonal, and there are T iid AS events. For computational efﬁciency, r is selected from a ﬁnite set, r ∈ {r1 , r2 , . . . , rC } with uniform probability. The variational free energy is given by Q({s(t) }, {o(t) }, {r(t) }) . P ({s(t) }, {o(t) }, {r(t) }, {x(t) }) s r o (4) Variational EM minimizes the free energy by iteratively updating the Q distribution’s vari(t)d (t)d ational parameters (ρ(t) , ω (t) , µro , and Σro ) in the E-step, and the model parameters (Λ, Ψ, {r1 , r2 , . . . , rC }, and γ) in the M-step. The resulting updates are too long to be shown in the context of this paper and are discussed in detail elsewhere [3]. A few particular points regarding the E-step are worth covering in detail here. Q({s(t) }, {o(t) }, {r(t) }) log F(Q, P ) = If the prior on s was a full normal distribution, there would be no need for a variational approach, and exact EM is possible. For a truncated normal distribution, however, the mixing proportions, Q(r)Q(o|r) cannot be calculated analytically except for the case where s is scalar, necessitating the diagonality constraint. Note that if Σ was allowed to be a full covariance matrix, equation (3) would be the true posterior, and we could ﬁnd the sufﬁcient statistics of Q(s(t) |o(t) , r(t) ): −1 µ(t) = (I + ΛT (I − O(t) )T Ψ−1 (I − O(t) )Λ)−1 ΛT (I − O(t) )T Ψ−1 x(t) r(t) ro −1 Σ(t) ro = (I + ΛT (I − O(t) )T Ψ−1 (I − O(t) )Λ) (5) (6) where O is a diagonal matrix with elements Oi,i = oi . Furthermore, it can be easily shown that the optimal settings for µd and Σd approximating a normal distribution with full covariance Σ and mean µ is µd optimal = µ −1 Σd optimal = diag(Σ (7) −1 ) (8) In the truncated case, equation (8) is still true. Equation (7) does not hold, though, and µd optimal cannot be found analytically. In our experiments, we found that using equation (7) still decreases the free energy every E-step, and it is signiﬁcantly more efﬁcient than using, for example, a gradient decent method to compute the optimal µd . Intuitive Weigh Matrix Optimal Weight Matrix 50 50 40 40 30 30 20 20 10 0 10 Inclusion Isoform 0 Exclusion Isoform Inclusion Isoform (a) Exclusion Isoform (b) Figure 4: (a) An intuitive set of weights. Based on the biological background, one would expect to see the inclusion isoform hybridize to the probes measuring C1 , C2 , A, C1 :A, and A:C2 , while the exclusion isoform hybridizes to C1 , C2 , and C1 :C2 . (b) The learned set of weights closely agrees with the intuition, and captures cross hybridization between the probes Contribution of exclusion isoform Contribution of inclusion isoform AS model Original Data (a) RT−PCR AS model measurement prediction (% exclusion) (% exclusion) 14% 72% (b) 27% 70% 8% 22% outliers (c) Figure 5: Three examples of data cases and their predictions. (a) The data does not follow our notion of single cassette exon AS, but the AS level is predicted accurately by the model.(b) The probe C1 :A is marked as outlier, allowing the model to predict the other probes accurately. (c) Two probes are marked as outliers, and the model is still successful in predicting the AS levels. 3 Making biological predictions about alternative splicing The results presented in this paper were obtained using two stages of learning. In the ﬁrst step, the weight matrix, Λ, is learned on a subset of the data that is selected for quality. Two selection criteria were used: (a) sequencing data was used to select those cases for which, with high conﬁdence, no other AS event is present (Figure 1) and (b) probe sets were selected for high expression, as determined by a set of negative controls. The second selection criterion is motivated by the common assumption that low intensity measurements are of lesser quality (see Section 1.1). In the second step, Λ is kept ﬁxed, and we introduce the additional constraint that the noise is isotropic (Ψ = ψI) and learn on the entire data set. The constraint on the noise is introduced to prevent the model from using only a subset of the six probes for making the ﬁnal set of predictions. We show a typical learned set of weights in Figure 4. The weights ﬁt well with our intuition of what they should be to capture the presence of the two isoforms. Moreover, the learned weights account for the speciﬁc trends in the data. Examples of model prediction based on the microarray data are shown in Figure 5. Due to the nature of the microarray data, we do not expect all the inferred abundances to be equally good, and we devised a scoring criterion that ranks each AS event based on its ﬁt to the model. Intuitively, given two input vectors that are equivalent up to a scale factor, with inferred MAP estimations that are equal up to the same scale factor, we would like their scores to be identical. The scoring criterion used, therefore is k (xk − rΛk s)2 /(xk + Rank 500 1000 2000 5000 10000 15000 20000 30000 Pearson’s correlation coefﬁcient 0.94 0.95 0.95 0.79 0.79 0.78 0.75 0.65 False positive rate 0.11 0.08 0.05 0.2 0.25 0.29 0.32 0.42 Table 1: Model performance evaluated at various ranks. Using 180 RT-PCR measurements, we are able to predict the model’s performance at various ranks. Two evaluation criteria are used: Pearson’s correlation coefﬁcient between the model’s predictions and the RT-PCR measurements and false positive rate, where a prediction is considered to be false positive if it is more than 15% away from the RT-PCR measurement. rΛk s)2 , where the MAP estimations for r and s are used. This scoring criterion can be viewed as proportional to the sum of noise to signal ratios, as estimated using the two values given by the observation and the model’s best prediction of that observation. Since it is the relative amount of the isoforms that is of most interest, we need to use the inferred distribution of the isoform abundances to obtain an estimate for the relative levels of AS. It is not immediately clear how this should be done. We do, however, have RTPCR measurements for 180 AS events to guide us (see ﬁgure 6 for details). Using the top 50 ranked RT-PCR measurement, we ﬁt three parameters, {a1 , a2 , a3 }, such that the s2 proportion of excluded isoform present, p, is given by p = a1 s1 +a2 s2 + a3 , where s1 is the MAP estimation of the abundance of the inclusion isoform, s2 is the MAP estimation of the abundance of the exclusion isoform, and the RT-PCR measurement are used for target p. The parameters are ﬁtted using gradient descent on a least squared error (LSE) evaluation criterion. We used two criteria to evaluate the quality of the AS model predictions. Pearson’s correlation coefﬁcient (PCC) is used to evaluate the overall ability of the model to correctly estimate trends in the data. PCC is invariant to afﬁne transformation and so is independent of the transformation parameters a1 and a3 discussed above, while the parameter a2 was found to effect PCC very little. The PCC stays above 0.75 for the top two thirds ranked predictions. The second evaluation criterion used is the false positive rate, where a prediction is considered to be false positive if it is more than 15% away from the RT-PCR measurement. This allows us to say, for example, that if a prediction is within the top 10000, we are 75% conﬁdent that it is within 15% of the actual levels of AS. 4 Summary We designed a novel AS model for the inference of the relative abundance of two alternatively spliced isoforms from six measurements. Unsupervised learning in the model is performed using a structured variational EM algorithm, which correctly captures the underlying structure of the data, as suggested by its biological nature. The AS model, though presented here for a cassette exon AS events, can be used to learn any type of AS, and with a simple adjustment, multiple types. The predictions obtained from the AS model are currently being used to verify various claims about the role of AS in evolution and functional genomics, and to help identify sequences that affect the regulation of AS. % Exclusion isoform RT−PCR measurement Vs. AS model predictions RT−PCR measurements: 90 80 AS model prediction Int es Te tine sti Kid s n Sa ey liva Br ry ain Sp le Liv en er Mu sc Lu le ng 100 70 60 50 40 30 14 22 27 32 47 46 66 78 63 20 AS model prediction: 10 27 24 26 26 51 75 60 85 100 (a) 0 0 20 40 60 80 RT−PCR measurement 100 (b) Figure 6: (a) Sample RT-PCR. RNA extracted from the cell is reverse-transcribed to DNA, ampliﬁed and labelled with radioactive or ﬂuorescent molecules. The sample is pulled through a viscous gel in an electric ﬁeld (DNA, being an acid, is positively charged). Shorter strands travel further through the gel than longer ones, resulting in two distinct bands, corresponding to the two isoforms, when exposed to a photosensitive or x-ray ﬁlm. (b) A scatter plot showing the RT-PCR measurements as compared to the AS model predictions. The plot shows all available RT-PCR measurements with a rank of 8000 or better. The AS model presented assumes a single weight matrix for all data cases. This is an oversimpliﬁed view of the data, and current work is being carried out in identifying probe speciﬁc expression proﬁles. However, due to the low dimensionality of the problem (10 tissues, six probes per event), care must be taken to avoid overﬁtting and to ensure meaningful interpretations. Acknowledgments We would like to thank Wen Zhang, Naveed Mohammad, and Timothy Hughes for their contributions in generating the data set. This work was funded in part by an operating and infrastructure grants from the CIHR and CFI, and a operating grants from NSERC and a Premier’s Research Excellence Award. References [1] J. M. Johnson et al. Genome-wide survey of human alternative pre-mrna splicing with exon junction microarrays. Science, 302:2141–44, 2003. [2] L. Cartegni et al. Listening to silence and understanding nonsense: exonic mutations that affect splicing. Nature Gen. Rev., 3:285–98, 2002. [3] Q. Pan et al. Revealing global regulatory features of mammalian alternative splicing using a quantitative microarray platform. Molecular Cell, 16(6):929–41, 2004. [4] Q. Pan et al. Alternative splicing of conserved exons is frequently species speciﬁc in human and mouse. Trends Gen., In Press, 2005. [5] M. I. Jordan, Z. Ghahramani, T. Jaakkola, and Lawrence K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183– 233, 1999. [6] R. M. Neal and G. E. Hinton. A view of the em algorithm that justiﬁes incremental, sparse, and other variants. In Learning in Graphical Models. Cambridge, MIT Press, 1998. [7] D. M. Rocke and B. Durbin. A model for measurement error for gene expression arrays. Journal of Computational Biology, 8(6):557–69, 2001.

5 0.49817887 115 nips-2004-Maximum Margin Clustering

Author: Linli Xu, James Neufeld, Bryce Larson, Dale Schuurmans

Abstract: We propose a new method for clustering based on ﬁnding maximum margin hyperplanes through data. By reformulating the problem in terms of the implied equivalence relation matrix, we can pose the problem as a convex integer program. Although this still yields a difﬁcult computational problem, the hard-clustering constraints can be relaxed to a soft-clustering formulation which can be feasibly solved with a semidefinite program. Since our clustering technique only depends on the data through the kernel matrix, we can easily achieve nonlinear clusterings in the same manner as spectral clustering. Experimental results show that our maximum margin clustering technique often obtains more accurate results than conventional clustering methods. The real beneﬁt of our approach, however, is that it leads naturally to a semi-supervised training method for support vector machines. By maximizing the margin simultaneously on labeled and unlabeled training data, we achieve state of the art performance by using a single, integrated learning principle. 1

6 0.49689269 120 nips-2004-Modeling Conversational Dynamics as a Mixed-Memory Markov Process

7 0.49580985 126 nips-2004-Nearly Tight Bounds for the Continuum-Armed Bandit Problem

8 0.49553701 186 nips-2004-The Correlated Correspondence Algorithm for Unsupervised Registration of Nonrigid Surfaces

9 0.49535617 44 nips-2004-Conditional Random Fields for Object Recognition

10 0.49478361 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound

11 0.49442378 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss

12 0.49335191 139 nips-2004-Optimal Aggregation of Classifiers and Boosting Maps in Functional Magnetic Resonance Imaging

13 0.49311757 38 nips-2004-Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms

14 0.492551 207 nips-2004-ℓ₀-norm Minimization for Basis Selection

15 0.49216786 166 nips-2004-Semi-supervised Learning via Gaussian Processes

16 0.49173996 39 nips-2004-Coarticulation in Markov Decision Processes

17 0.49105477 77 nips-2004-Hierarchical Clustering of a Mixture Model

18 0.49059281 56 nips-2004-Dynamic Bayesian Networks for Brain-Computer Interfaces

19 0.49052322 99 nips-2004-Learning Hyper-Features for Visual Identification

20 0.49043205 86 nips-2004-Instance-Specific Bayesian Model Averaging for Classification