jmlr jmlr2007 jmlr2007-57 knowledge-graph by maker-knowledge-mining

57 jmlr-2007-Multi-class Protein Classification Using Adaptive Codes


Source: pdf

Author: Iain Melvin, Eugene Ie, Jason Weston, William Stafford Noble, Christina Leslie

Abstract: Predicting a protein’s structural class from its amino acid sequence is a fundamental problem in computational biology. Recent machine learning work in this domain has focused on developing new input space representations for protein sequences, that is, string kernels, some of which give state-of-the-art performance for the binary prediction task of discriminating between one class and all the others. However, the underlying protein classification problem is in fact a huge multiclass problem, with over 1000 protein folds and even more structural subcategories organized into a hierarchy. To handle this challenging many-class problem while taking advantage of progress on the binary problem, we introduce an adaptive code approach in the output space of one-vs-the-rest prediction scores. Specifically, we use a ranking perceptron algorithm to learn a weighting of binary classifiers that improves multi-class prediction with respect to a fixed set of output codes. We use a cross-validation set-up to generate output vectors for training, and we define codes that capture information about the protein structural hierarchy. Our code weighting approach significantly improves on the standard one-vs-all method for two difficult multi-class protein classification problems: remote homology detection and fold recognition. Our algorithm also outperforms a previous code learning approach due to Crammer and Singer, trained here using a perceptron, when the dimension of the code vectors is high and the number of classes is large. Finally, we compare against PSI-BLAST, one of the most widely used methods in protein sequence analysis, and find that our method strongly outperforms it on every structure classification problem that we consider. (The first two authors contributed equally to this work.) Supplementary data and source code are available at http://www.cs

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Recent machine learning work in this domain has focused on developing new input space representations for protein sequences, that is, string kernels, some of which give state-of-the-art performance for the binary prediction task of discriminating between one class and all the others. [sent-9, score-0.452]

2 However, the underlying protein classification problem is in fact a huge multiclass problem, with over 1000 protein folds and even more structural subcategories organized into a hierarchy. [sent-10, score-1.07]

3 We use a cross-validation set-up to generate output vectors for training, and we define codes that capture information about the protein structural hierarchy. [sent-13, score-0.729]

4 Our code weighting approach significantly improves on the standard one-vs-all method for two difficult multi-class protein classification problems: remote homology detection and fold recognition. [sent-14, score-1.264]

5 Introduction: Numerous statistical and supervised learning methods have been developed for detecting protein structural classes from primary sequence information alone. [sent-25, score-0.447]

6 , 1990; Smith and Waterman, 1981), generative models for protein families (Krogh et al. [sent-27, score-0.409]

7 The binary classification performance of semi-supervised discriminative methods, which incorporate unlabeled protein sequence data into the learning algorithm, is particularly strong (Kuang et al. [sent-37, score-0.414]

8 However, it is uncertain how best to leverage these accurate binary classifiers to solve the more important multi-class problem of classifying protein sequences into one of a vast number of structural classes. [sent-40, score-0.482]

9 , 1995) contains more than 1000 distinct 3D conformation classes called folds and even more structural subcategories (protein superfamilies and families). [sent-42, score-0.551]

10 Moreover, one-vs-all and the other standard output vector approaches do not take advantage of known relationships between classes, such as hierarchical relationships in the protein structural taxonomy, although there has been some recent work on hierarchical classification (Dekel et al. [sent-58, score-0.538]

11 In this work, we present a simple but effective multi-class method for protein structural classification that combines the predictions of state-of-the-art one-vs-the-rest SVM protein classifiers by supervised learning in the output space. [sent-63, score-0.846]

12 Instead of using ad hoc output codes as in ECOC, we design codes that are directly related to the structural hierarchy of a known taxonomy, such as SCOP, with components that correspond to fold, superfamily, and family detectors. [sent-65, score-0.674]

13 (2002) considered a more general and difficult problem of adapting codes and embeddings, that is, learning both the code vectors and the embedding of the vector of prediction scores in output space via a non-convex optimization problem. [sent-68, score-0.583]

14 For example, in protein classification, we can use the balanced loss, so that performance on the large classes does not dominate the results. [sent-72, score-0.73]

15 In Section 2, we provide background on the protein classification problem, including our choice of base classifiers and construction of hierarchical codes. [sent-74, score-0.411]

16 We then present our algorithmic approach for learning code weights in the output space using the ranking perceptron, describe different perceptron update rules, and compare to the code learning method of Crammer and Singer (2000) and other related work in Section 3. [sent-75, score-0.583]

17 We provide large-scale experimental results on the multi-class remote homology detection and fold recognition problems in Section 4, comparing our approach with a number of alternatives: standard one-vs-all, sigmoid fitting, PSI-BLAST (Altschul et al. [sent-76, score-0.91]

18 Our adaptive code algorithm is used in a newly deployed protein fold recognition web server called SVM-Fold, available at http://svm-fold. [sent-84, score-0.813]

19 Moreover, experimental methods for determining the full 3D structure of a protein (X-ray crystallography, NMR) are time consuming and difficult and cannot keep pace with the rapid accumulation of unannotated protein sequences from newly sequenced genomes. [sent-95, score-0.774]

20 5M protein sequences in the Non-redundant Database of protein sequences. [sent-97, score-0.774]

21 Second, prediction of a protein sequence’s structural class enables the selection of a template structure from the database, which can then be used with various comparative modeling techniques to predict a full 3D structure for the protein. [sent-98, score-0.474]

22 Note that template-based modeling approaches far outperform ab initio methods. (Figure 1: Two protein classification problems, remote homology detection and fold recognition.) [sent-100, score-0.747]

23 (Left) In the SCOP database, we simulate the remote homology detection problem by holding out a test family (shown in dark gray) from a superfamily and using the other families as positive training data (shown in light gray). [sent-101, score-0.852]

24 The task is to correctly predict the superfamily or fold membership of the held-out sequences. [sent-102, score-0.478]

25 (Right) We simulate the fold recognition problem by holding out a test superfamily (dark gray) from a fold and using the other superfamilies as training data (light gray). [sent-103, score-0.95]

26 In this work, we focus on two protein classification problems that are considered unsolved in the structural biology community: remote homology detection and fold recognition. [sent-106, score-1.228]

27 In remote homology detection, we wish to recognize when a new protein sequence has a distant evolutionary relationship to a protein sequence in a database (e. [sent-107, score-1.309]

28 Due to a distant common ancestor, the protein sequences exhibit subtle sequence similarities (remote homology) that cannot generally be detected by statistical, alignment-based methods (Altschul et al. [sent-110, score-0.46]

29 In fold recognition, we wish to recognize when a new protein sequence will exhibit the same fold as a protein from the structure database, even if there is no evidence of any evolutionary relationship between the proteins. [sent-112, score-1.225]

30 We can design experiments based on the SCOP hierarchy to test performance on both the remote homology detection and the fold recognition problem, as depicted in Figure 1. [sent-117, score-0.85]

31 We call these trained SVMs fold detectors, superfamily detectors, and family detectors. [sent-121, score-0.478]

32 The profile kernel is a function that measures the similarity of two protein sequence profiles based on their representation in a high-dimensional vector space indexed by all k-mers (k-length subsequences of amino acids). [sent-122, score-0.414]

33 A sequence profile is based on a multiple alignment of protein sequences to the input sequence and simply refers to the position-specific distribution of amino acids estimated from each column of the alignment. [sent-123, score-0.57]
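
A minimal Python sketch of this profile estimation step is given below. It is an illustration under simplifying assumptions (uniform pseudocounts, gap characters ignored), not the authors' implementation:

from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def estimate_profile(alignment, pseudocount=1.0):
    # alignment: list of equal-length aligned sequences ('-' marks gaps).
    length = len(alignment[0])
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS})
    return profile

# Toy three-sequence alignment; column 0 gives alanine the highest probability.
profile = estimate_profile(["ACDE", "ACDF", "A-DE"])
print(profile[0]["A"])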

34 A large variety of kernels have been designed specifically for protein sequences (e. [sent-143, score-0.423]

35 PSI-BLAST reports the significance of the similarity as an E-value (defined as the expected number of times that a similarity as strong or stronger than the observed similarity would be observed in a random protein sequence database of the given size), based on a profile-sequence alignment score. [sent-156, score-0.44]

36 For example, in either the remote homology detection or fold recognition setting, both fold and superfamily detectors may be relevant for making fold-level predictions on test examples. [sent-165, score-1.37]

37 Suppose the number of superfamilies and folds in a SCOP-based data set is k and q respectively. [sent-166, score-0.492]

38 The vector of prediction scores is f(x) = (f1(x), . . . , fk+q(x)), where the fi are binary SVM superfamily or fold detectors trained using profile string kernels as described above. [sent-170, score-0.569]

39 , the fold level for fold predictions) as well as sublevels of the target class. [sent-174, score-0.464]

40 For each class j in {1, . . . , k}, we define the code vectors Cj = (superfamj, foldj), where superfamj and foldj are vectors with length equal to the number of known superfamilies (k) and folds (q), and each of these two vectors has exactly one non-zero component corresponding to structural class identity. [sent-179, score-1.148]
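
The following Python sketch illustrates such hierarchical code vectors and the weighted code-matching prediction they support. The elementwise form of the embedding, the fold_of mapping, and the toy numbers are assumptions for illustration, not the paper's exact formulation:

import numpy as np

def make_code(j, fold_of, k, q):
    # Code vector C_j: one-hot superfamily part (length k) followed by a
    # one-hot part for the parent fold of class j (length q).
    superfam = np.zeros(k); superfam[j] = 1.0
    fold = np.zeros(q); fold[fold_of[j]] = 1.0
    return np.concatenate([superfam, fold])

def predict(f_x, codes, W):
    # f_x: length-(k+q) vector of binary detector outputs; W: learned weights.
    scores = [np.dot(W * f_x, C) for C in codes]
    return int(np.argmax(scores))

# Toy example: 3 superfamily classes living in 2 folds.
fold_of = {0: 0, 1: 0, 2: 1}
codes = [make_code(j, fold_of, k=3, q=2) for j in range(3)]
W = np.ones(3 + 2)                           # uniform weights before learning
f_x = np.array([0.2, 1.3, -0.5, 0.9, -1.1])  # hypothetical detector outputs
print(predict(f_x, codes, W))                # -> 1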

41 The final balanced empirical loss is (1/|Y|) Σ_i l_b(y_i, ŷ_i), where Y denotes the set of output labels. [sent-207, score-0.466]

42 Balanced loss is relevant to protein structure prediction because class sizes are unbalanced, and we do not want to perform well only on the largest classes. [sent-208, score-0.415]
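
As a concrete illustration, the balanced empirical loss above can be computed as in the following sketch (plain Python, toy labels assumed): the zero-one error is averaged within each true class before averaging across classes, so large classes do not dominate.

from collections import defaultdict

def balanced_loss(y_true, y_pred):
    errors, counts = defaultdict(float), defaultdict(int)
    for y, yhat in zip(y_true, y_pred):
        counts[y] += 1
        errors[y] += float(y != yhat)
    # Average per-class error rates over the set of observed labels.
    return sum(errors[y] / counts[y] for y in counts) / len(counts)

print(balanced_loss([0, 0, 0, 1], [0, 0, 1, 0]))  # (1/3 + 1) / 2 = 0.667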

43 The particular ranking perceptron training and prediction algorithms that we use are summarized in the pseudocode in Figure 2, including update rules for both zero-one and balanced loss. [sent-209, score-0.727]

44 3 The Friend/Foe and Mean Friend/Foe Update Rules When using codes representing multiple levels of the label hierarchy, we can also use relationships between codes to redefine the perceptron update rule. [sent-211, score-0.701]

45 For example, if we are using both superfamily and fold detectors for fold-level predictions and Cy = (superfamy , foldy ), the set of friends would be the codes for the superfamilies in the same fold as y (in particular, y itself belongs to friends(y)). [sent-213, score-1.309]
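
A rough Python sketch of one training pass of such a ranking perceptron is shown below. The elementwise embedding psi, the learning rate, and the update form are simplifying assumptions rather than the exact pseudocode of Figure 2:

import numpy as np

def psi(f_x, code):
    # Joint embedding of detector outputs and a code vector (elementwise product).
    return f_x * code

def ranking_perceptron_epoch(data, codes, W, friends=None, lr=0.1):
    # data: list of (f_x, y) pairs; codes: list of code vectors C_y;
    # friends: optional dict mapping y to the classes sharing y's parent in the
    # hierarchy (the mean friend/foe variant); None gives the class-based update.
    for f_x, y in data:
        scores = [np.dot(W, psi(f_x, C)) for C in codes]
        y_hat = int(np.argmax(scores))
        if y_hat != y:
            if friends is None:
                target = psi(f_x, codes[y])
            else:
                target = np.mean([psi(f_x, codes[z]) for z in friends[y]], axis=0)
            W = W + lr * (target - psi(f_x, codes[y_hat]))
    return W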

46 Intuitively, our definition of ψ defines the distance between two different protein embeddings in code space, and we are using large margin SVM methods to find the relative weighting of the dimensions in code space. [sent-265, score-0.645]

47 We found that the structured SVM gave similar performance to the ranking perceptron when used with our joint input-output embedding ψ, so we focus on perceptron approaches in the current study. [sent-268, score-0.423]

48 (2002) also considered the problem of adaptive code learning and proposed a general approach consisting of learning both code vectors and the embedding of the vector of prediction scores in output space. [sent-281, score-0.521]

49 Furthermore, by training our second-stage code learning by first running cross-validation on the first-stage predictors, we attempt to learn correcting codes which minimize the cross-validation classification error. [sent-284, score-0.414]
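
The cross-validation set-up that generates the second-stage training vectors can be sketched as follows; the train_detectors and score_detectors callables are hypothetical placeholders for the SVM detector training and scoring steps, not functions from the authors' code:

import numpy as np

def crossval_outputs(X, y, n_folds, train_detectors, score_detectors):
    # Each training sequence receives a discriminant vector f(x) produced by
    # detectors that were trained without it, mimicking test-time conditions.
    idx = np.random.permutation(len(X))
    folds = np.array_split(idx, n_folds)
    outputs = [None] * len(X)
    for held_out in folds:
        train_idx = np.setdiff1d(np.arange(len(X)), held_out)
        detectors = train_detectors([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i, f_x in zip(held_out, score_detectors(detectors, [X[i] for i in held_out])):
            outputs[i] = f_x
    return outputs   # (outputs[i], y[i]) pairs train the second-stage code weights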

50 Moreover, our approach also offers control of the loss function (such as using balanced loss) and use of hierarchical labels, which are not possible in one-vs-all. [sent-306, score-0.413]

51 Data Sets: We assembled benchmark data sets for the remote homology detection and fold recognition problems using sequences from the SCOP 1. [sent-310, score-0.885]

52 For the fold recognition problem, we designed our experiments so that the test set consists of held-out superfamilies belonging to folds that are represented in the training data. [sent-315, score-0.757]

53 We selected superfamilies for testing at random from the remaining superfamilies such that the test set for the superfamily contains no more than 40% of the remaining sequences for the fold. [sent-318, score-0.732]

54 If at least one suitable superfamily could not be found, then the fold was removed from the experiment. [sent-319, score-0.478]
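
A simplified sketch of this held-out-superfamily split is given below; the assumed data layout and the exact handling of the 40% rule and minimum class sizes are illustrative assumptions:

import random

def split_fold_recognition(folds, max_frac=0.4, seed=0):
    # folds: dict fold_id -> dict superfamily_id -> list of sequences (assumed layout).
    rng = random.Random(seed)
    train, test = {}, {}
    for fold, sfams in folds.items():
        if len(sfams) < 2:
            continue                      # need at least one superfamily left for training
        total = sum(len(seqs) for seqs in sfams.values())
        candidates = [sf for sf, seqs in sfams.items() if len(seqs) <= max_frac * total]
        if not candidates:
            continue                      # fold removed from the experiment
        held_out = rng.choice(candidates)
        test[fold] = {held_out: sfams[held_out]}
        train[fold] = {sf: seqs for sf, seqs in sfams.items() if sf != held_out}
    return train, test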

55 For remote homology detection, the test set should contain held-out families belonging to superfamilies that are represented in the training data. [sent-322, score-0.75]

56 One can evaluate performance for multi-class prediction of fold or superfamily levels, and it is natural to try different codes for these two tasks; therefore, we prepared a separate data set for remote homology superfamily and fold detection. [sent-323, score-1.764]

57 For the superfamily data set, we used the same selection scheme as for fold recognition, except the minimum number of sequences for the children of the superfamilies is relaxed to 3, and we selected random families for testing instead of superfamilies. [sent-324, score-0.815]

58 For the remote homology fold detection data set, we first removed all superfamilies with fewer than 2 families. [sent-327, score-0.987]

59 We selected families at random from each superfamily such that we never selected more than 40% of the parent superfamily for testing. [sent-329, score-0.55]

60 If a fold was then found to have no superfamilies with held out families for testing, it was removed from the data set. [sent-331, score-0.497]

61 The resulting remote homology detection set contains 44 folds, 424 superfamilies, and 809 families for training. [sent-332, score-0.606]

62 For fold recognition, this means that when we train superfamily or family detectors, we exclude negative example sequences that come from the parent fold. [sent-336, score-0.571]

63 Methods: We test our weight learning approach using the ranking perceptron with the class-based, friend/foe, and mean class update rules for a variety of code sets for the remote homology detection and fold recognition problems. [sent-339, score-1.23]

64 For all ranking perceptron experiments, we train the perceptron algorithm for 200 iterations. [sent-349, score-0.389]

65 For ranking perceptron on one-vs-all classifier codes with PSI-BLAST extension, we set the initial weights on the PSI-BLAST portion of the codes to 0. [sent-352, score-0.739]

66 Remote Homology Detection Results: For the remote homology detection data set, where the test set consists of held-out protein families that belong to superfamilies represented in the training data, we evaluate performance both for the superfamily-level and fold-level prediction tasks. [sent-361, score-1.228]

67 Results for multi-class superfamily and fold prediction are provided in Tables 1 and 2, respectively. [sent-362, score-0.542]

68 The last two tables use a balanced error measure by averaging the error rates over each prediction label before computing the significance test. [sent-364, score-0.443]

69 We compare our adaptive code method to PSI-BLAST, a standard homology detection method based on sequence alignment, as well as simple one-vs-all, sigmoid fitting, and the Crammer-Singer method, using various choices of code vectors. [sent-365, score-0.827]

70 In addition to reporting classification loss and balanced loss results, we give “top 5” classification and balanced loss performance, which evaluates whether the correct class was found in the top 5 class predictions. [sent-366, score-0.758]
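
The "top 5" criterion can be computed as in the following sketch (NumPy, with an assumed per-example score matrix); an example counts as correct if its true class appears among the five highest-scoring classes, and the balanced variant averages this error within each true class first:

import numpy as np

def top5_error(y_true, score_matrix):
    # score_matrix[i, j]: score assigned to class j for test example i.
    top5 = np.argsort(-score_matrix, axis=1)[:, :5]
    hits = np.array([y in row for y, row in zip(y_true, top5)])
    return 1.0 - hits.mean()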

71 For the superfamily prediction task, we find that the adaptive codes method significantly outperforms one-vs-all both in terms of classification and balanced error, even when superfamily-only codes are used, and performance improves as more elements are added to the codes. [sent-368, score-1.271]

72 Table 1: Results for multi-class superfamily prediction in the remote homology detection set-up.

73 Table 2: Results for multi-class fold prediction in the remote homology detection set-up.

74 Results for the adaptive codes method are reported for a SCOP benchmark data set (44 folds, 424 superfamilies, 809 families, with 381 test sequences) and compared to nearest neighbor using PSI-BLAST, standard one-vs-all, and a perceptron version of the Crammer and Singer method. [sent-453, score-0.47]

75 For the fold prediction task, we use a different set of codes, including code elements corresponding to protein fold detectors. [sent-458, score-1.012]

76 In this case, both Crammer-Singer and adaptive codes beat one-vs-all with respect to classification and balanced loss when fold-only codes are used; in fact, for fold-only codes, performance of Crammer-Singer is slightly better than adaptive codes. [sent-460, score-1.125]

77 However, as we add more code elements, the performance of Crammer-Singer degrades while adaptive codes continues to improve, so that the best result for our method (corresponding to the longest code that we tried) is better than the best result for Crammer-Singer (the shortest code). [sent-461, score-0.589]

78 Overall, we observe that when the individual code elements are helpful, as seems to be the case in remote homology detection, our adaptive code method can successfully improve performance by adding elements without overfitting. [sent-464, score-0.815]

79 Fold Recognition Results: For the more difficult fold recognition task, where the data set consists of held-out superfamilies from protein folds represented in the training data, we expect that code elements from subclasses (i.e., superfamilies and families) will provide less information. [sent-467, score-1.241]

80 This is because protein sequences from different superfamilies in principle have no detectable sequence similarity. [sent-469, score-0.874]

81 We find that the adaptive codes method can again beat one-vs-all and strongly outperform PSI-BLAST, but we see no trend of improvement as more code elements are added, with various length codes leading to similar error rates. [sent-473, score-0.815]

82 Interestingly, in this case, Crammer-Singer with fold-only codes outperforms the best adaptive codes result in terms of balanced loss, though the top 5 results for adaptive codes are uniformly better than Crammer-Singer by either loss function. [sent-475, score-1.284]

83 We conclude that in this case, since the code elements corresponding to subclasses are not very helpful, the adaptive code method cannot leverage longer codes to achieve much higher accuracy. [sent-476, score-0.589]

84 Results for the adaptive codes method are reported on a SCOP benchmark data set (26 folds, 303 superfamilies, 614 test sequences) and compared to nearest neighbor using PSI-BLAST, standard one-vs-all, and a perceptron version of the Crammer and Singer method. [sent-520, score-0.47]

85 Table 4: P-values from the Wilcoxon signed rank test for superfamily prediction in the remote homology setup.

86 Table 8: P-values from the Wilcoxon signed rank test for balanced fold prediction in the remote homology setup.

87 Table 9: P-values from the Wilcoxon signed rank test for balanced fold recognition.
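
P-values of this kind can in principle be obtained with a standard Wilcoxon signed rank test on paired per-class error rates, as in the sketch below; the error values shown are hypothetical placeholders, not numbers from the paper:

from scipy.stats import wilcoxon

# Per-class error rates for two methods on the same prediction labels (placeholders).
errors_method_a = [0.10, 0.35, 0.00, 0.50, 0.20]
errors_method_b = [0.15, 0.42, 0.04, 0.48, 0.30]
stat, p_value = wilcoxon(errors_method_a, errors_method_b)
print(p_value)   # small p-value -> the paired per-class errors differ significantly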

88 The same performance improvement is true when measured in terms of balanced error for the remote homology fold prediction task; however, for remote homology superfamily prediction, the improvement in balanced error only holds when the perceptron is also trained with balanced error. [sent-677, score-2.796]

89 In the case of fold recognition, previous results indicate that the subclass code elements are less useful, so we expect that update rules which respect the code structure may be less effective. [sent-678, score-0.561]

90 However, even for fold recognition, the new update rules significantly improve classification error when the perceptrons are trained using balanced loss. [sent-680, score-0.674]

91 Results for the friend/foe and mean friend/foe update rules are compared with the standard perceptron update rule for the fold recognition and remote homology fold and superfamily prediction tasks when using hierarchical codes. [sent-754, score-1.572]

92 Our application domain is the highly multi-class protein structural classification problem, where there are typically hundreds of classes arranged in a hierarchical taxonomy. [sent-759, score-0.444]

93 In this domain, we focus on two difficult subproblems: remote homology detection and fold recognition. [sent-760, score-0.78]

94 We then use fixed binary codes to represent the hierarchy of labels associated with each class, and we adapt our output vector embedding in order to improve classification relative to these fixed codes. [sent-762, score-0.389]

95 We also convincingly beat PSI-BLAST, which is a widely-used alignment-based method for detecting relationships between protein sequences. [sent-764, score-0.451]

96 A large body of recent work has focused on finding good representations for the inputs in the protein classification problem, in particular in the form of novel string kernels for protein sequence data. [sent-785, score-0.776]

97 As we scale to the full-scale protein classification problem, with on the order of 1000 folds, one issue with our approach is limited coverage: for many small SCOP folds, there are not enough labeled sequences to train an SVM fold detector. [sent-788, score-0.676]

98 Profile kernels for detecting remote protein homologs and discriminative motifs. [sent-878, score-0.562]

99 Combining pairwise sequence similarity and support vector machines for remote protein homology detection. [sent-898, score-0.873]

100 Profile-based direct kernels for remote homology detection and fold recognition. [sent-916, score-0.78]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('balanced', 0.379), ('protein', 0.351), ('homology', 0.3), ('folds', 0.285), ('codes', 0.259), ('superfamily', 0.246), ('fold', 0.232), ('superfamilies', 0.207), ('remote', 0.185), ('perceptron', 0.147), ('code', 0.133), ('scop', 0.108), ('beat', 0.1), ('daptive', 0.1), ('sigmoid', 0.097), ('elvin', 0.092), ('eslie', 0.092), ('eston', 0.092), ('oble', 0.092), ('odes', 0.092), ('rotein', 0.092), ('fitting', 0.087), ('ulti', 0.076), ('ranking', 0.074), ('sequences', 0.072), ('crammer', 0.068), ('pro', 0.065), ('adaptive', 0.064), ('sing', 0.064), ('prediction', 0.064), ('detection', 0.063), ('christina', 0.061), ('output', 0.06), ('structural', 0.059), ('families', 0.058), ('ie', 0.058), ('detectors', 0.054), ('beats', 0.054), ('friends', 0.054), ('lassification', 0.05), ('altschul', 0.046), ('noble', 0.046), ('dashes', 0.046), ('singer', 0.04), ('sf', 0.04), ('eugene', 0.039), ('biology', 0.038), ('gray', 0.038), ('hierarchy', 0.037), ('string', 0.037), ('sequence', 0.037), ('update', 0.036), ('le', 0.035), ('hierarchical', 0.034), ('scores', 0.034), ('svm', 0.033), ('embedding', 0.033), ('recognition', 0.033), ('signed', 0.032), ('cy', 0.032), ('ers', 0.032), ('weston', 0.031), ('kuang', 0.031), ('sfams', 0.031), ('wilcoxon', 0.031), ('leslie', 0.03), ('william', 0.03), ('classi', 0.029), ('cance', 0.029), ('jason', 0.029), ('margin', 0.028), ('yi', 0.027), ('rules', 0.027), ('molecular', 0.027), ('alignment', 0.026), ('amino', 0.026), ('klautau', 0.026), ('pushes', 0.026), ('background', 0.026), ('database', 0.026), ('discriminative', 0.026), ('arg', 0.025), ('predictions', 0.025), ('row', 0.024), ('multiclass', 0.024), ('brenner', 0.023), ('eleazar', 0.023), ('saigo', 0.023), ('stafford', 0.023), ('wyi', 0.023), ('structured', 0.022), ('tting', 0.022), ('recognize', 0.022), ('rifkin', 0.022), ('learn', 0.022), ('nec', 0.021), ('acids', 0.021), ('train', 0.021), ('coverage', 0.021), ('reweighting', 0.02)]
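
For reference, similarity scores of the kind listed in the "similar papers computed by tfidf model" section can be produced by ranking papers by cosine similarity between their tf-idf vectors, as in the following sketch; the document snippets are hypothetical stand-ins, not the corpus used to build this page:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = {
    "jmlr2007-57": "multi-class protein classification adaptive codes ranking perceptron",
    "jmlr2007-58": "noise tolerant variants of the perceptron algorithm",
    "jmlr2007-32": "euclidean embedding of co-occurrence data",
}
ids = list(docs)
X = TfidfVectorizer().fit_transform([docs[i] for i in ids])
sims = cosine_similarity(X[0], X).ravel()   # similarity of paper 57 to every paper
for pid, s in sorted(zip(ids, sims), key=lambda t: -t[1]):
    print(pid, round(float(s), 3))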

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000013 57 jmlr-2007-Multi-class Protein Classification Using Adaptive Codes

Author: Iain Melvin, Eugene Ie, Jason Weston, William Stafford Noble, Christina Leslie

Abstract: Predicting a protein’s structural class from its amino acid sequence is a fundamental problem in computational biology. Recent machine learning work in this domain has focused on developing new input space representations for protein sequences, that is, string kernels, some of which give state-of-the-art performance for the binary prediction task of discriminating between one class and all the others. However, the underlying protein classification problem is in fact a huge multiclass problem, with over 1000 protein folds and even more structural subcategories organized into a hierarchy. To handle this challenging many-class problem while taking advantage of progress on the binary problem, we introduce an adaptive code approach in the output space of one-vs-the-rest prediction scores. Specifically, we use a ranking perceptron algorithm to learn a weighting of binary classifiers that improves multi-class prediction with respect to a fixed set of output codes. We use a cross-validation set-up to generate output vectors for training, and we define codes that capture information about the protein structural hierarchy. Our code weighting approach significantly improves on the standard one-vs-all method for two difficult multi-class protein classification problems: remote homology detection and fold recognition. Our algorithm also outperforms a previous code learning approach due to Crammer and Singer, trained here using a perceptron, when the dimension of the code vectors is high and the number of classes is large. Finally, we compare against PSI-BLAST, one of the most widely used methods in protein sequence analysis, and find that our method strongly outperforms it on every structure classification problem that we consider. (The first two authors contributed equally to this work.) Supplementary data and source code are available at http://www.cs

2 0.083101287 58 jmlr-2007-Noise Tolerant Variants of the Perceptron Algorithm

Author: Roni Khardon, Gabriel Wachman

Abstract: A large number of variants of the Perceptron algorithm have been proposed and partially evaluated in recent work. One type of algorithm aims for noise tolerance by replacing the last hypothesis of the perceptron with another hypothesis or a vote among hypotheses. Another type simply adds a margin term to the perceptron in order to increase robustness and accuracy, as done in support vector machines. A third type borrows further from support vector machines and constrains the update function of the perceptron in ways that mimic soft-margin techniques. The performance of these algorithms, and the potential for combining different techniques, has not been studied in depth. This paper provides such an experimental study and reveals some interesting facts about the algorithms. In particular the perceptron with margin is an effective method for tolerating noise and stabilizing the algorithm. This is surprising since the margin in itself is not designed or used for noise tolerance, and there are no known guarantees for such performance. In most cases, similar performance is obtained by the voted-perceptron which has the advantage that it does not require parameter selection. Techniques using soft margin ideas are run-time intensive and do not give additional performance benefits. The results also highlight the difficulty with automatic parameter selection which is required with some of these variants. Keywords: perceptron algorithm, on-line learning, noise tolerance, kernel methods

3 0.067978963 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss

Author: Ofer Dekel, Philip M. Long, Yoram Singer

Abstract: We study the problem of learning multiple tasks in parallel within the online learning framework. On each online round, the algorithm receives an instance for each of the parallel tasks and responds by predicting the label of each instance. We consider the case where the predictions made on each round all contribute toward a common goal. The relationship between the various tasks is defined by a global loss function, which evaluates the overall quality of the multiple predictions made on each round. Specifically, each individual prediction is associated with its own loss value, and then these multiple loss values are combined into a single number using the global loss function. We focus on the case where the global loss function belongs to the family of absolute norms, and present several online learning algorithms for the induced problem. We prove worst-case relative loss bounds for all of our algorithms, and demonstrate the effectiveness of our approach on a large-scale multiclass-multilabel text categorization problem. Keywords: online learning, multitask learning, multiclass multilabel classification, perceptron

4 0.063559793 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data

Author: Amir Globerson, Gal Chechik, Fernando Pereira, Naftali Tishby

Abstract: Embedding algorithms search for a low dimensional continuous representation of data, but most algorithms only handle objects of a single type for which pairwise distances are specified. This paper describes a method for embedding objects of different types, such as images and text, into a single common Euclidean space, based on their co-occurrence statistics. The joint distributions are modeled as exponentials of Euclidean distances in the low-dimensional embedding space, which links the problem to convex optimization over positive semidefinite matrices. The local structure of the embedding corresponds to the statistical correlations via random walks in the Euclidean space. We quantify the performance of our method on two text data sets, and show that it consistently and significantly outperforms standard methods of statistical correspondence modeling, such as multidimensional scaling, IsoMap and correspondence analysis. Keywords: embedding algorithms, manifold learning, exponential families, multidimensional scaling, matrix factorization, semidefinite programming

5 0.042964987 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition

Author: Sébastien Gadat, Laurent Younes

Abstract: We introduce a new model addressing feature selection from a large dictionary of variables that can be computed from a signal or an image. Features are extracted according to an efficiency criterion, on the basis of specified classification or recognition tasks. This is done by estimating a probability distribution P on the complete dictionary, which distributes its mass over the more efficient, or informative, components. We implement a stochastic gradient descent algorithm, using the probability as a state variable and optimizing a multi-task goodness of fit criterion for classifiers based on variable randomly chosen according to P. We then generate classifiers from the optimal distribution of weights learned on the training set. The method is first tested on several pattern recognition problems including face detection, handwritten digit recognition, spam classification and micro-array analysis. We then compare our approach with other step-wise algorithms like random forests or recursive feature elimination. Keywords: stochastic learning algorithms, Robbins-Monro application, pattern recognition, classification algorithm, feature selection

6 0.041798566 1 jmlr-2007-"Ideal Parent" Structure Learning for Continuous Variable Bayesian Networks

7 0.038811728 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers     (Special Topic on Model Selection)

8 0.03815116 52 jmlr-2007-Margin Trees for High-dimensional Classification

9 0.036350287 80 jmlr-2007-Synergistic Face Detection and Pose Estimation with Energy-Based Models

10 0.03345393 23 jmlr-2007-Concave Learners for Rankboost

11 0.032734934 56 jmlr-2007-Multi-Task Learning for Classification with Dirichlet Process Priors

12 0.032705691 12 jmlr-2007-Attribute-Efficient and Non-adaptive Learning of Parities and DNF Expressions     (Special Topic on the Conference on Learning Theory 2005)

13 0.031403981 82 jmlr-2007-The Need for Open Source Software in Machine Learning

14 0.03118748 70 jmlr-2007-Ranking the Best Instances

15 0.031144956 16 jmlr-2007-Boosted Classification Trees and Class Probability Quantile Estimation

16 0.030669739 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study

17 0.029690493 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method

18 0.026270753 91 jmlr-2007-Very Fast Online Learning of Highly Non Linear Problems

19 0.025000978 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression

20 0.022804227 40 jmlr-2007-Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.149), (1, 0.085), (2, -0.022), (3, 0.057), (4, 0.003), (5, 0.072), (6, -0.144), (7, -0.062), (8, -0.034), (9, 0.014), (10, -0.075), (11, 0.004), (12, 0.061), (13, 0.219), (14, 0.143), (15, -0.229), (16, -0.105), (17, -0.039), (18, 0.002), (19, 0.055), (20, 0.004), (21, 0.118), (22, -0.182), (23, -0.058), (24, -0.306), (25, -0.187), (26, 0.055), (27, 0.094), (28, -0.209), (29, 0.126), (30, 0.018), (31, 0.065), (32, -0.047), (33, -0.084), (34, -0.074), (35, 0.055), (36, 0.112), (37, 0.032), (38, 0.112), (39, 0.082), (40, -0.038), (41, 0.051), (42, 0.005), (43, 0.061), (44, 0.063), (45, 0.019), (46, -0.11), (47, -0.097), (48, 0.099), (49, -0.05)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96404767 57 jmlr-2007-Multi-class Protein Classification Using Adaptive Codes

Author: Iain Melvin, Eugene Ie, Jason Weston, William Stafford Noble, Christina Leslie

Abstract: Predicting a protein’s structural class from its amino acid sequence is a fundamental problem in computational biology. Recent machine learning work in this domain has focused on developing new input space representations for protein sequences, that is, string kernels, some of which give state-of-the-art performance for the binary prediction task of discriminating between one class and all the others. However, the underlying protein classification problem is in fact a huge multiclass problem, with over 1000 protein folds and even more structural subcategories organized into a hierarchy. To handle this challenging many-class problem while taking advantage of progress on the binary problem, we introduce an adaptive code approach in the output space of one-vs-the-rest prediction scores. Specifically, we use a ranking perceptron algorithm to learn a weighting of binary classifiers that improves multi-class prediction with respect to a fixed set of output codes. We use a cross-validation set-up to generate output vectors for training, and we define codes that capture information about the protein structural hierarchy. Our code weighting approach significantly improves on the standard one-vs-all method for two difficult multi-class protein classification problems: remote homology detection and fold recognition. Our algorithm also outperforms a previous code learning approach due to Crammer and Singer, trained here using a perceptron, when the dimension of the code vectors is high and the number of classes is large. Finally, we compare against PSI-BLAST, one of the most widely used methods in protein sequence analysis, and find that our method strongly outperforms it on every structure classification problem that we consider. (The first two authors contributed equally to this work.) Supplementary data and source code are available at http://www.cs

2 0.63281381 58 jmlr-2007-Noise Tolerant Variants of the Perceptron Algorithm

Author: Roni Khardon, Gabriel Wachman

Abstract: A large number of variants of the Perceptron algorithm have been proposed and partially evaluated in recent work. One type of algorithm aims for noise tolerance by replacing the last hypothesis of the perceptron with another hypothesis or a vote among hypotheses. Another type simply adds a margin term to the perceptron in order to increase robustness and accuracy, as done in support vector machines. A third type borrows further from support vector machines and constrains the update function of the perceptron in ways that mimic soft-margin techniques. The performance of these algorithms, and the potential for combining different techniques, has not been studied in depth. This paper provides such an experimental study and reveals some interesting facts about the algorithms. In particular the perceptron with margin is an effective method for tolerating noise and stabilizing the algorithm. This is surprising since the margin in itself is not designed or used for noise tolerance, and there are no known guarantees for such performance. In most cases, similar performance is obtained by the voted-perceptron which has the advantage that it does not require parameter selection. Techniques using soft margin ideas are run-time intensive and do not give additional performance benefits. The results also highlight the difficulty with automatic parameter selection which is required with some of these variants. Keywords: perceptron algorithm, on-line learning, noise tolerance, kernel methods

3 0.37337908 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data

Author: Amir Globerson, Gal Chechik, Fernando Pereira, Naftali Tishby

Abstract: Embedding algorithms search for a low dimensional continuous representation of data, but most algorithms only handle objects of a single type for which pairwise distances are specified. This paper describes a method for embedding objects of different types, such as images and text, into a single common Euclidean space, based on their co-occurrence statistics. The joint distributions are modeled as exponentials of Euclidean distances in the low-dimensional embedding space, which links the problem to convex optimization over positive semidefinite matrices. The local structure of the embedding corresponds to the statistical correlations via random walks in the Euclidean space. We quantify the performance of our method on two text data sets, and show that it consistently and significantly outperforms standard methods of statistical correspondence modeling, such as multidimensional scaling, IsoMap and correspondence analysis. Keywords: embedding algorithms, manifold learning, exponential families, multidimensional scaling, matrix factorization, semidefinite programming

4 0.34013885 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss

Author: Ofer Dekel, Philip M. Long, Yoram Singer

Abstract: We study the problem of learning multiple tasks in parallel within the online learning framework. On each online round, the algorithm receives an instance for each of the parallel tasks and responds by predicting the label of each instance. We consider the case where the predictions made on each round all contribute toward a common goal. The relationship between the various tasks is defined by a global loss function, which evaluates the overall quality of the multiple predictions made on each round. Specifically, each individual prediction is associated with its own loss value, and then these multiple loss values are combined into a single number using the global loss function. We focus on the case where the global loss function belongs to the family of absolute norms, and present several online learning algorithms for the induced problem. We prove worst-case relative loss bounds for all of our algorithms, and demonstrate the effectiveness of our approach on a large-scale multiclass-multilabel text categorization problem. Keywords: online learning, multitask learning, multiclass multilabel classification, perceptron

5 0.28597859 1 jmlr-2007-"Ideal Parent" Structure Learning for Continuous Variable Bayesian Networks

Author: Gal Elidan, Iftach Nachman, Nir Friedman

Abstract: Bayesian networks in general, and continuous variable networks in particular, have become increasingly popular in recent years, largely due to advances in methods that facilitate automatic learning from data. Yet, despite these advances, the key task of learning the structure of such models remains a computationally intensive procedure, which limits most applications to parameter learning. This problem is even more acute when learning networks in the presence of missing values or hidden variables, a scenario that is part of many real-life problems. In this work we present a general method for speeding structure search for continuous variable networks with common parametric distributions. We efficiently evaluate the approximate merit of candidate structure modifications and apply time consuming (exact) computations only to the most promising ones, thereby achieving significant improvement in the running time of the search algorithm. Our method also naturally and efficiently facilitates the addition of useful new hidden variables into the network structure, a task that is typically considered both conceptually difficult and computationally prohibitive. We demonstrate our method on synthetic and real-life data sets, both for learning structure on fully and partially observable data, and for introducing new hidden variables during structure search. Keywords: Bayesian networks, structure learning, continuous variables, hidden variables

6 0.22151507 23 jmlr-2007-Concave Learners for Rankboost

7 0.20904154 80 jmlr-2007-Synergistic Face Detection and Pose Estimation with Energy-Based Models

8 0.20900255 52 jmlr-2007-Margin Trees for High-dimensional Classification

9 0.1771546 12 jmlr-2007-Attribute-Efficient and Non-adaptive Learning of Parities and DNF Expressions     (Special Topic on the Conference on Learning Theory 2005)

10 0.16826373 56 jmlr-2007-Multi-Task Learning for Classification with Dirichlet Process Priors

11 0.16824102 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition

12 0.16704896 16 jmlr-2007-Boosted Classification Trees and Class Probability Quantile Estimation

13 0.16600539 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers     (Special Topic on Model Selection)

14 0.15919155 53 jmlr-2007-Maximum Entropy Density Estimation with Generalized Regularization and an Application to Species Distribution Modeling

15 0.15323444 70 jmlr-2007-Ranking the Best Instances

16 0.14705524 82 jmlr-2007-The Need for Open Source Software in Machine Learning

17 0.14176318 61 jmlr-2007-On the Consistency of Multiclass Classification Methods     (Special Topic on the Conference on Learning Theory 2005)

18 0.13301176 15 jmlr-2007-Bilinear Discriminant Component Analysis

19 0.13231404 55 jmlr-2007-Minimax Regret Classifier for Imprecise Class Distributions

20 0.1308122 59 jmlr-2007-Nonlinear Boosting Projections for Ensemble Construction


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(8, 0.019), (10, 0.014), (12, 0.015), (15, 0.036), (20, 0.024), (22, 0.028), (28, 0.048), (40, 0.023), (45, 0.043), (48, 0.042), (60, 0.028), (78, 0.43), (80, 0.015), (85, 0.054), (98, 0.092)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.69798338 57 jmlr-2007-Multi-class Protein Classification Using Adaptive Codes

Author: Iain Melvin, Eugene Ie, Jason Weston, William Stafford Noble, Christina Leslie

Abstract: Predicting a protein’s structural class from its amino acid sequence is a fundamental problem in computational biology. Recent machine learning work in this domain has focused on developing new input space representations for protein sequences, that is, string kernels, some of which give state-of-the-art performance for the binary prediction task of discriminating between one class and all the others. However, the underlying protein classification problem is in fact a huge multiclass problem, with over 1000 protein folds and even more structural subcategories organized into a hierarchy. To handle this challenging many-class problem while taking advantage of progress on the binary problem, we introduce an adaptive code approach in the output space of one-vs-the-rest prediction scores. Specifically, we use a ranking perceptron algorithm to learn a weighting of binary classifiers that improves multi-class prediction with respect to a fixed set of output codes. We use a cross-validation set-up to generate output vectors for training, and we define codes that capture information about the protein structural hierarchy. Our code weighting approach significantly improves on the standard one-vs-all method for two difficult multi-class protein classification problems: remote homology detection and fold recognition. Our algorithm also outperforms a previous code learning approach due to Crammer and Singer, trained here using a perceptron, when the dimension of the code vectors is high and the number of classes is large. Finally, we compare against PSI-BLAST, one of the most widely used methods in protein sequence analysis, and find that our method strongly outperforms it on every structure classification problem that we consider. (The first two authors contributed equally to this work.) Supplementary data and source code are available at http://www.cs

2 0.28968167 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data

Author: Amir Globerson, Gal Chechik, Fernando Pereira, Naftali Tishby

Abstract: Embedding algorithms search for a low dimensional continuous representation of data, but most algorithms only handle objects of a single type for which pairwise distances are specified. This paper describes a method for embedding objects of different types, such as images and text, into a single common Euclidean space, based on their co-occurrence statistics. The joint distributions are modeled as exponentials of Euclidean distances in the low-dimensional embedding space, which links the problem to convex optimization over positive semidefinite matrices. The local structure of the embedding corresponds to the statistical correlations via random walks in the Euclidean space. We quantify the performance of our method on two text data sets, and show that it consistently and significantly outperforms standard methods of statistical correspondence modeling, such as multidimensional scaling, IsoMap and correspondence analysis. Keywords: embedding algorithms, manifold learning, exponential families, multidimensional scaling, matrix factorization, semidefinite programming

3 0.28888851 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection

Author: Wei Pan, Xiaotong Shen

Abstract: Variable selection in clustering analysis is both challenging and important. In the context of model-based clustering analysis with a common diagonal covariance matrix, which is especially suitable for “high dimension, low sample size” settings, we propose a penalized likelihood approach with an L1 penalty function, automatically realizing variable selection via thresholding and delivering a sparse solution. We derive an EM algorithm to fit our proposed model, and propose a modified BIC as a model selection criterion to choose the number of components and the penalization parameter. A simulation study and an application to gene function prediction with gene expression profiles demonstrate the utility of our method. Keywords: BIC, EM, mixture model, penalized likelihood, soft-thresholding, shrinkage

4 0.28808099 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition

Author: Sébastien Gadat, Laurent Younes

Abstract: We introduce a new model addressing feature selection from a large dictionary of variables that can be computed from a signal or an image. Features are extracted according to an efficiency criterion, on the basis of specified classification or recognition tasks. This is done by estimating a probability distribution P on the complete dictionary, which distributes its mass over the more efficient, or informative, components. We implement a stochastic gradient descent algorithm, using the probability as a state variable and optimizing a multi-task goodness of fit criterion for classifiers based on variable randomly chosen according to P. We then generate classifiers from the optimal distribution of weights learned on the training set. The method is first tested on several pattern recognition problems including face detection, handwritten digit recognition, spam classification and micro-array analysis. We then compare our approach with other step-wise algorithms like random forests or recursive feature elimination. Keywords: stochastic learning algorithms, Robbins-Monro application, pattern recognition, classification algorithm, feature selection

5 0.28740776 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation

Author: Guy Lebanon, Yi Mao, Joshua Dillon

Abstract: The popular bag of words assumption represents a document as a histogram of word occurrences. While computationally efficient, such a representation is unable to maintain any sequential information. We present an effective sequential document representation that goes beyond the bag of words representation and its n-gram extensions. This representation uses local smoothing to embed documents as smooth curves in the multinomial simplex thereby preserving valuable sequential information. In contrast to bag of words or n-grams, the new representation is able to robustly capture medium and long range sequential trends in the document. We discuss the representation and its geometric properties and demonstrate its applicability for various text processing tasks. Keywords: text processing, local smoothing

6 0.28488958 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification

7 0.28071499 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method

8 0.28069347 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss

9 0.28029171 58 jmlr-2007-Noise Tolerant Variants of the Perceptron Algorithm

10 0.27962497 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study

11 0.27814376 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time

12 0.27743569 39 jmlr-2007-Handling Missing Values when Applying Classification Models

13 0.27720773 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning

14 0.27705699 68 jmlr-2007-Preventing Over-Fitting during Model Selection via Bayesian Regularisation of the Hyper-Parameters     (Special Topic on Model Selection)

15 0.27595466 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression

16 0.2755385 72 jmlr-2007-Relational Dependency Networks

17 0.27337641 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption

18 0.27258998 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models

19 0.27183843 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis

20 0.27139455 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR