nips nips2004 nips2004-143 knowledge-graph by maker-knowledge-mining

143 nips-2004-PAC-Bayes Learning of Conjunctions and Classification of Gene-Expression Data


Source: pdf

Author: Mario Marchand, Mohak Shah

Abstract: We propose a “soft greedy” learning algorithm for building small conjunctions of simple threshold functions, called rays, defined on single real-valued attributes. We also propose a PAC-Bayes risk bound which is minimized for classifiers achieving a non-trivial tradeoff between sparsity (the number of rays used) and the magnitude of the separating margin of each ray. Finally, we test the soft greedy algorithm on four DNA micro-array data sets. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We propose a “soft greedy” learning algorithm for building small conjunctions of simple threshold functions, called rays, defined on single real-valued attributes. [sent-7, score-0.117]

2 We also propose a PAC-Bayes risk bound which is minimized for classifiers achieving a non-trivial tradeoff between sparsity (the number of rays used) and the magnitude of the separating margin of each ray. [sent-8, score-0.543]

3 Finally, we test the soft greedy algorithm on four DNA micro-array data sets. [sent-9, score-0.191]

4 1 Introduction An important challenge in the problem of classification of high-dimensional data is to design a learning algorithm that can often construct an accurate classifier that depends on the smallest possible number of attributes. [sent-10, score-0.031]

5 A filter is an algorithm used to “filter out” irrelevant attributes before using a base learning algorithm, such as the support vector machine (SVM), which was not designed to perform well in the presence of many irrelevant attributes. [sent-13, score-0.208]

6 A wrapper, on the other hand, is used in conjunction with the base learning algorithm: typically by recursively removing the attributes that have received a small “weight” from the classifier obtained from the base learner. [sent-14, score-0.255]

7 The recursive feature elimination method is an example of a wrapper that was used by Guyon et al. [sent-15, score-0.078]

8 (2002) in conjunction with the SVM for classification of micro-array data. [sent-16, score-0.117]

9 (2000) have used a filter which consists of ranking the attributes (gene expressions) as a function of the difference between the positive-example mean and the negative-example mean. [sent-18, score-0.138]

10 Both filters and wrappers have sometimes produced good empirical results but they are not theoretically justified. [sent-19, score-0.033]

11 What we really need is a learning algorithm that has provably good guarantees in the presence of many irrelevant attributes. [sent-20, score-0.035]

12 We also propose a PAC-Bayes risk bound which is minimized for classifiers achieving a non-trivial tradeoff between sparsity (the number of rays used) and the magnitude of the separating margin of each ray. [sent-25, score-0.543]

13 Finally, we test the proposed soft greedy algorithm on four DNA micro-array data sets. [sent-26, score-0.191]

14 , xn ) where each real-valued component xi ∈ [Ai , Bi ] for i = 1, . [sent-30, score-0.043]

15 Hence, Ai and Bi are, respectively, the a priori lower and upper bounds on values for xi . [sent-34, score-0.043]

16 The output space Y is the set of classification labels that can be assigned to any input vector x ∈ X . [sent-35, score-0.029]

17 The (true) risk R(f) of a classifier f : X → Y is defined to be the probability that f misclassifies z on a random draw according to D: R(f) =def Pr(x,y)∼D (f(x) ≠ y) = E(x,y)∼D I(f(x) ≠ y), where I(a) = 1 if predicate a is true and 0 otherwise. [sent-40, score-0.343]

18 , zm ) of m examples, the task of a learning algorithm is to construct a classifier with the smallest possible risk without any information about D. [sent-44, score-0.18]

19 To achieve this goal, the learner can compute the empirical risk RS(f) of any given classifier f according to: RS(f) =def (1/m) Σi=1..m I(f(xi) ≠ yi) = E(x,y)∼S I(f(x) ≠ y). We focus on learning algorithms that construct a conjunction of rays from a training set. [sent-45, score-0.946]

20 Each ray is just a threshold classifier defined on a single attribute (component) xi . [sent-46, score-0.57]

21 More formally, a ray is identified by an attribute index i ∈ {1, . [sent-47, score-0.499]

22 , n}, a threshold value t ∈ [Ai, Bi], and a direction d ∈ {−1, +1} (that specifies whether class 1 is on the largest or smallest values of xi). [sent-50, score-0.102]

23 Given any input example x, the output r_t^d(xi) of a ray is defined as: r_t^d(xi) = 1 if (xi − t)d > 0, and 0 if (xi − t)d ≤ 0. To specify a conjunction of rays, we first need to list all the attributes whose ray is present in the conjunction. [sent-51, score-1.691]
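
A minimal Python sketch of these definitions (the function names, and the convention that examples are rows of an array X with 0/1 labels, are assumptions made here for illustration):

```python
import numpy as np

def ray_output(x_i, t, d):
    """Single ray r_t^d on attribute value x_i: 1 if (x_i - t)*d > 0, else 0."""
    return 1 if (x_i - t) * d > 0 else 0

def conjunction_output(x, idx, thresholds, directions):
    """A conjunction of rays outputs 1 only if every one of its rays outputs 1."""
    return int(all(ray_output(x[i], t, d)
                   for i, t, d in zip(idx, thresholds, directions)))

def empirical_risk(X, y, idx, thresholds, directions):
    """R_S(f) = (1/m) * sum over the m training examples of I(f(x_i) != y_i)."""
    preds = np.array([conjunction_output(x, idx, thresholds, directions) for x in X])
    return float(np.mean(preds != np.asarray(y)))
```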

24 i1 < i2 < . . . < i|i|, where |i| is the number of indices present in i (and thus the number of rays in the conjunction). [sent-61, score-0.262]

25 To complete the specification of a conjunction of rays, we need a vector t = (ti1 , ti2 , . [sent-62, score-0.117]

26 , ti|i|) of threshold values and a vector d = (di1, di2, . [sent-65, score-0.028]

27 A remarkable result that came out of this line of research, known as the “PAC-Bayes theorem”, provides a tight upper bound on the risk of a stochastic classifier called the Gibbs classifier. [sent-79, score-0.19]

28 Given an input example x, the label GQ (x) assigned to x by the Gibbs classifier is defined by the following process. [sent-80, score-0.029]

29 We first choose a classifier h according to the posterior distribution Q and then use h to assign the label h(x) to x. [sent-81, score-0.037]

30 The risk of GQ is defined as the expected risk of classifiers drawn according to Q: R(GQ) =def Eh∼Q R(h) = Eh∼Q E(x,y)∼D I(h(x) ≠ y). The PAC-Bayes theorem was first proposed by McAllester (2003). [sent-82, score-0.532]
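
Because R(GQ) is an expectation over classifiers drawn from Q, it can be approximated by Monte Carlo sampling, as in the hedged sketch below; the `sample_classifier` callable (returning a classifier h drawn from Q) is a hypothetical interface, not something defined in the paper.

```python
import numpy as np

def gibbs_risk_mc(sample_classifier, X, y, n_draws=1000, seed=0):
    """Monte Carlo estimate of R(G_Q) = E_{h~Q} E_{(x,y)} I(h(x) != y),
    evaluated here on a finite sample (X, y)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    risks = []
    for _ in range(n_draws):
        h = sample_classifier(rng)           # draw a classifier h ~ Q
        preds = np.array([h(x) for x in X])  # label every example with h
        risks.append(np.mean(preds != y))    # empirical risk of this draw
    return float(np.mean(risks))
```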

31 The bound given by the PAC-Bayes theorem for the risk of Gibbs classifiers can be turned into a bound for the risk of Bayes classifiers in the following way. [sent-87, score-0.42]

32 Given a posterior distribution Q, the Bayes classifier BQ performs a majority vote (under measure Q) of binary classifiers in H. [sent-88, score-0.067]

33 In our case, we have seen that ray conjunctions are specified in terms of a mixture of discrete parameters i and d and continuous parameters t. [sent-92, score-0.445]

34 The first two factors come from the belief that the final classifier, constructed from the group of attributes specified by i, should depend only on the number |i| of attributes in this group. [sent-95, score-0.276]

35 If we have complete ignorance about the number of rays the final classifier is likely to have, we should choose p(e) = 1/(n + 1) for e ∈ {0, 1, . [sent-96, score-0.262]

36 However, we should choose a p that decreases as we increase e if we have reasons to believe that the number of rays of the final classifier will be much smaller than n. [sent-100, score-0.262]

37 The third factor of Pi,d (t) gives equal prior probabilities for each of the two possible values of direction dj . [sent-101, score-0.053]

38 Finally, for each ray, every possible threshold value t should have the same prior probability of being chosen if we do not have any prior knowledge that would favor some values over the others. [sent-102, score-0.028]

39 Since each attribute value xi is constrained, a priori, to be in [Ai , Bi ], we have chosen a uniform probability density on [Ai , Bi ] for each ti such that i ∈ i. [sent-103, score-0.235]

40 Given a training set S, the learner will choose an attribute group i and a direction vector d. [sent-105, score-0.173]

41 For each attribute xi ∈ [Ai , Bi ] : i ∈ i, a margin interval [ai , bi ] ⊆ [Ai , Bi ] will also be chosen by the learner. [sent-106, score-0.436]

42 A deterministic ray-conjunction classifier is then specified by choosing the threshold values ti ∈ [ai, bi]. [sent-107, score-0.271]

43 It is tempting at this point to choose ti = (ai + bi )/2 ∀i ∈ i (i. [sent-108, score-0.271]

44 However, we will see shortly that the PAC-Bayes theorem offers a better guarantee for another type of deterministic classifier. [sent-111, score-0.04]

45 The Gibbs classifier is defined with a posterior distribution Q having all its weight on the same i and d as chosen by the learner but where each ti is uniformly chosen in [ai , bi ]. [sent-112, score-0.338]

46 Furthermore, the KL divergence between the “discrete components” of Q and P is small for small values of |i| (whenever p(|i|) is not too small). [sent-114, score-0.047]

47 Hence, this KL divergence between our choices for Q and P exhibits a tradeoff between margins (large values of bi − ai ) and sparsity (small value of |i|) for Gibbs classifiers. [sent-115, score-0.506]
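
To make the margins-sparsity tradeoff concrete, here is a hedged sketch of KL(Q‖P) assuming the prior structure described above (a prior p(e) on the number of rays, a uniform choice among the C(n, e) attribute groups of that size, equal probabilities for the two directions, and uniform threshold densities on [Ai, Bi]), with a posterior concentrated on one (i, d) whose thresholds are uniform on [ai, bi]; the exact discrete factorization used in the paper may differ.

```python
from math import comb, log

def kl_q_p(A, B, a, b, idx, n, p_of_size):
    """KL(Q||P) for a ray-conjunction posterior under the assumptions stated above.
    A, B, a, b map each attribute index to its interval endpoints; idx is the
    chosen attribute group; p_of_size(e) is the prior on the number of rays."""
    e = len(idx)
    # discrete part (assumed factorization): -ln p(|i|) + ln C(n,|i|) + |i| ln 2
    kl_discrete = -log(p_of_size(e)) + log(comb(n, e)) + e * log(2.0)
    # continuous part: large margins (b_i - a_i) make this term small
    kl_continuous = sum(log((B[i] - A[i]) / (b[i] - a[i])) for i in idx)
    return kl_discrete + kl_continuous
```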

48 According to Theorem 1, the Gibbs classifier with the smallest guarantee of risk R(GQ) should minimize a non-trivial combination of KL(Q‖P) (margins-sparsity tradeoff) and empirical risk RS(GQ). [sent-116, score-0.329]

49 Since the posterior Q is identified by an attribute group vector i, a direction vector d, and intervals [ai, bi] ∀i ∈ i, we will refer to the Gibbs classifier GQ by G_ab^id, where a and b are the vectors formed by the unions of the ai's and bi's respectively. [sent-117, score-1.078]

50 We can obtain a closed-form expression for RS(G_ab^id) by first considering the risk R(x,y)(G_ab^id) on a single example (x, y), since RS(G_ab^id) = E(x,y)∼S R(x,y)(G_ab^id). [sent-118, score-0.46]

51 This occurs iff there exists a ray which outputs 0 on x. [sent-120, score-0.356]

52 We can also verify that the expression for R(x,y)(Ctd) is identical to the expression for R(x,y)(G_ab^id) except that the piece-wise linear functions σai,bi^di(xi) are replaced by the indicator functions I((xi − ti)di > 0). [sent-121, score-0.426]

53 The PAC-Bayes theorem provides a risk bound for the Gibbs classifier G_ab^id. [sent-122, score-0.23]

54 Since the Bayes classifier B_ab^id just performs a majority vote under the same posterior distribution as the one used by G_ab^id, we have that B_ab^id(x) = 1 iff the probability that G_ab^id classifies x as positive exceeds 1/2. [sent-123, score-1.045]

55 Hence, it follows that B_ab^id(x) = 1 if ∏i∈i σai,bi^di(xi) > 1/2, and 0 if ∏i∈i σai,bi^di(xi) ≤ 1/2 (3). Note that B_ab^id has a hyperbolic decision surface. [sent-124, score-0.656]
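
A small sketch of this decision rule, assuming that σai,bi^di(xi) is the probability that a ray whose threshold is drawn uniformly from [ai, bi] outputs 1 on xi, and that the Gibbs classifier labels x positive with probability equal to the product of these σ values; the names and the sign convention for di are assumptions.

```python
import numpy as np

def sigma(x_i, a_i, b_i, d_i):
    """Piece-wise linear ramp: probability that a ray with threshold uniform
    on [a_i, b_i] and direction d_i outputs 1 on x_i."""
    if d_i == +1:
        return float(np.clip((x_i - a_i) / (b_i - a_i), 0.0, 1.0))
    return float(np.clip((b_i - x_i) / (b_i - a_i), 0.0, 1.0))

def prob_positive(x, idx, a, b, d):
    """Probability that the Gibbs ray conjunction labels x as positive."""
    return float(np.prod([sigma(x[i], a[i], b[i], d[i]) for i in idx]))

def bayes_output(x, idx, a, b, d):
    """B_ab^id(x) = 1 iff the probability of a positive Gibbs label exceeds 1/2."""
    return int(prob_positive(x, idx, a, b, d) > 0.5)

def gibbs_example_risk(x, y, idx, a, b, d):
    """R_(x,y)(G_ab^id): probability of misclassifying the single example (x, y)."""
    p = prob_positive(x, idx, a, b, d)
    return 1.0 - p if y == 1 else p
```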

56 Consequently, B_ab^id is not representable as a conjunction of rays. [sent-125, score-0.117]

57 There is, however, no computational difficulty in obtaining the output of B_ab^id(x) for any x ∈ X. [sent-126, score-0.162]

58 From the relation between B_ab^id and G_ab^id, it also follows that R(x,y)(B_ab^id) ≤ 2 R(x,y)(G_ab^id) for any (x, y). [sent-127, score-0.732]

59 (Fragment of the Theorem 2 risk bound: terms involving ln((Bi − Ai)/(bi − ai)), |i|, ln n, 1/m, and ln((m+1)/δ), holding with probability ≥ 1 − δ.) 4 A Soft Greedy Learning Algorithm. Theorem 2 suggests that the learner should try to find the Bayes classifier B_ab^id that uses a small number of attributes (i. [sent-130, score-0.98]

60 , a small |i|), each with a large separating margin (bi − ai), while keeping the empirical Gibbs risk RS(G_ab^id) at a low value. [sent-132, score-0.419]

61 To achieve this goal, we have adapted the greedy algorithm for the set covering machine (SCM) proposed by Marchand and Shawe-Taylor (2002). [sent-133, score-0.412]

62 Once the feature with the largest Ui is found, we remove Qi and Pi from the training set S and then repeat (on the remaining examples) until either no more negative examples are present or a maximum number s of features has been reached. [sent-135, score-0.064]

63 In our case, however, we need to keep the Gibbs risk on S low instead of the risk of a deterministic classifier. [sent-136, score-0.298]

64 Since the Gibbs risk is a “soft measure” that uses the piece-wise linear functions σa,b^d instead of the “hard” indicator functions, we need a “softer” version of the utility function Ui. [sent-137, score-0.194]

65 Indeed, a negative example that falls in the linear region of a σa,b^d is in fact partly covered. [sent-138, score-0.032]

66 Following this observation, let k be the vector of indices of the attributes that we have used so far for the construction of the classifier. [sent-139, score-0.138]

67 Moreover, we want to decrease the “utility” of ray i by an amount which would become large whenever it has a small separating margin. [sent-142, score-0.462]

68 Our expression for KL(Q‖P) suggests that this amount should be proportional to ln((Bi − Ai)/(bi − ai)). [sent-143, score-0.302]

69 Furthermore, we should compare this margin term with the fraction of the remaining negative examples that ray i has covered (instead of the absolute number of negative examples covered). [sent-144, score-0.584]

70 For fixed values of these parameters, the “soft greedy” algorithm simply consists of adding, to the current Gibbs classifier, a ray with maximum added utility until either the maximum number s of rays has been reached or all the negative examples have been (totally) covered. [sent-148, score-0.727]
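
A schematic sketch of that loop follows; it is not the paper's exact procedure. The Ray interface (soft_coverage, margin_log_ratio, totally_covers) and the penalty parameter are hypothetical stand-ins for the quantities described above.

```python
import numpy as np

def soft_greedy(candidate_rays, negatives, s_max, penalty):
    """Add the ray with the largest soft utility until s_max rays are used
    or every remaining negative example is totally covered."""
    chosen, remaining = [], list(negatives)
    while remaining and len(chosen) < s_max:
        best_ray, best_u = None, -np.inf
        for ray in candidate_rays:
            # fraction of the remaining negatives this ray (partly) covers
            frac = ray.soft_coverage(remaining) / len(remaining)
            # margin penalty: ln((B_i - A_i)/(b_i - a_i)) grows as the margin shrinks
            u = frac - penalty * ray.margin_log_ratio()
            if u > best_u:
                best_ray, best_u = ray, u
        chosen.append(best_ray)
        remaining = [z for z in remaining if not best_ray.totally_covers(z)]
    return chosen
```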

71 It is understood that, during this soft greedy algorithm, we can remove an example (x, y) from S whenever it is totally covered. [sent-149, score-0.264]

72 (Definition fragment of the soft utility U_ab^kd(i).) 5 Results for Classification of DNA Micro-Arrays. We have tested the soft greedy learning algorithm on the four DNA micro-array data sets shown in Table 1. [sent-151, score-0.517]

73 , 1999) provides the expression levels of 40 tumor and 22 normal colon tissues measured for 6500 human genes. [sent-153, score-0.229]

74 , 1999) provides the expression levels of 7129 human genes for 47 samples of patients with acute lymphoblastic leukemia (ALL) and 25 samples of patients with acute myeloid leukemia (AML). [sent-155, score-0.278]

75 , 2002) are micro-array samples containing the expression levels of 6817 human genes. [sent-157, score-0.092]

76 Data set B contains 25 classic and 9 desmoplastic medulloblastomas, whereas data set C contains 39 medulloblastoma survivors and 21 treatment failures (non-survivors). [sent-158, score-0.074]

77 We have compared the soft greedy learning algorithm with a linear-kernel soft-margin SVM trained both on all the attributes (gene expressions) and on a subset of attributes chosen by the filter method of Golub et al. [sent-159, score-0.496]

78 It consists of ranking the attributes as a function of the difference between the positive-example mean and the negative-example mean and then using only the first ℓ attributes. [sent-161, score-0.138]

79 The resulting learning algorithm, named SVM+gs in Table 1, is basically the one used by Furey et al. [sent-162, score-0.029]

80 (2002) claimed to obtain better results with the recursive feature elimination method but, as pointed out by Ambroise and McLachlan (2002), their work contained a methodological flaw and, consequently, the superiority of this wrapper method is questionable. [sent-165, score-0.049]

81 The learning parameters of all algorithms and the gene subsets (for SVM+gs) were chosen from the training sets only. [sent-168, score-0.104]

82 For the gene subset selection procedure of SVM+gs, we have considered the first ℓ = 2^i genes (for i = 0, 1, . [sent-170, score-0.164]

83 , 12) ranked according to the criterion of Golub et al. [sent-173, score-0.029]

84 (1999) and have chosen the i value that gave the smallest 5-fold CV error on the training set. [sent-174, score-0.031]
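
A possible implementation of this selection protocol, sketched with scikit-learn; ranking by the (absolute) difference between class means follows the description above, while the SVM hyperparameters, the use of the absolute value, and the 0/1 label encoding are assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def select_gene_subset(X_train, y_train, max_power=12):
    """Rank genes by |mean over positives - mean over negatives|, then pick the
    subset size l = 2**i (i = 0..max_power) with the smallest 5-fold CV error
    of a linear-kernel SVM on the training set."""
    X_train, y_train = np.asarray(X_train), np.asarray(y_train)
    diff = X_train[y_train == 1].mean(axis=0) - X_train[y_train == 0].mean(axis=0)
    ranking = np.argsort(-np.abs(diff))    # most discriminative genes first
    best_l, best_err = 1, np.inf
    for i in range(max_power + 1):
        l = 2 ** i
        scores = cross_val_score(SVC(kernel="linear"),
                                 X_train[:, ranking[:l]], y_train, cv=5)
        err = 1.0 - scores.mean()
        if err < best_err:
            best_l, best_err = l, err
    return best_l, ranking[:best_l]
```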

85 Table 1 (flattened). Data sets (#exs): Colon 62, B MD 34, C MD 60, ALL/AML 72. SVM errs: 12, 12, 29, 18. SVM+gs (errs, size): (11, 256), (6, 32), (21, 1024), (10, 64). The Soft Greedy columns (errs, size, ratio, bound) are truncated in this extract. [sent-175, score-0.112]

86 For each algorithm, the “errs” columns of Table 1 contain the 5-fold CV error expressed as the sum of errors over the five testing sets and the “size” columns contain the number of attributes used by the classifier averaged over the five testing sets. [sent-180, score-0.138]

87 The “ratio” column refers to the average value of (bi − ai )/(Bi − Ai ) obtained for the rays used by classifiers and the “bound” column refers to the average risk bound of Theorem 2 multiplied by the total number of examples. [sent-182, score-0.66]

88 We see that the gene selection filter generally improves the error rate of SVM and that the Bayes error rate is slightly better than the Gibbs error rate. [sent-183, score-0.104]

89 Finally, the error rates of Bayes and SVM+gs are competitive but the number of genes selected by the soft greedy algorithm is always much smaller. [sent-184, score-0.251]

90 Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. [sent-195, score-0.306]

91 Selection bias in gene extraction on the basis of microarray gene-expression data. [sent-201, score-0.104]

92 Support vector machine classification and validation of cancer tissue samples using microarray expression data. [sent-216, score-0.162]

93 Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. [sent-223, score-0.169]

94 Gene selection for cancer classification using support vector machines. [sent-230, score-0.052]

95 Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. [sent-243, score-0.173]

96 Prediction of central nervous system embryonal tumour outcome based on gene expression. [sent-259, score-0.104]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ray', 0.356), ('gid', 0.277), ('rays', 0.262), ('ab', 0.246), ('bi', 0.222), ('ai', 0.208), ('def', 0.194), ('gkd', 0.187), ('bab', 0.179), ('id', 0.162), ('gq', 0.149), ('risk', 0.149), ('attribute', 0.143), ('attributes', 0.138), ('kd', 0.132), ('classi', 0.125), ('gibbs', 0.123), ('kl', 0.12), ('conjunction', 0.117), ('aj', 0.11), ('ln', 0.11), ('er', 0.104), ('gene', 0.104), ('greedy', 0.102), ('dna', 0.095), ('conjunctions', 0.089), ('soft', 0.089), ('rs', 0.088), ('di', 0.085), ('gs', 0.082), ('colon', 0.065), ('tradeo', 0.065), ('bq', 0.065), ('expression', 0.065), ('covering', 0.064), ('bj', 0.06), ('genes', 0.06), ('cab', 0.056), ('ctd', 0.056), ('eh', 0.056), ('errs', 0.056), ('furey', 0.056), ('gab', 0.056), ('marchand', 0.056), ('ers', 0.054), ('dj', 0.053), ('cancer', 0.052), ('svm', 0.051), ('cv', 0.05), ('ti', 0.049), ('nab', 0.049), ('wrapper', 0.049), ('golub', 0.048), ('misclassi', 0.047), ('divergence', 0.047), ('utility', 0.045), ('tissue', 0.045), ('xi', 0.043), ('whenever', 0.043), ('covered', 0.043), ('tumor', 0.042), ('bound', 0.041), ('theorem', 0.04), ('ambroise', 0.037), ('medulloblastomas', 0.037), ('ottawa', 0.037), ('pomeroy', 0.037), ('rtd', 0.037), ('rtj', 0.037), ('uab', 0.037), ('posterior', 0.037), ('md', 0.036), ('guyon', 0.036), ('bayes', 0.035), ('lter', 0.035), ('irrelevant', 0.035), ('separating', 0.034), ('leukemia', 0.033), ('mario', 0.033), ('dtj', 0.033), ('wrappers', 0.033), ('negative', 0.032), ('examples', 0.032), ('smallest', 0.031), ('ui', 0.031), ('totally', 0.03), ('acute', 0.03), ('tissues', 0.03), ('vote', 0.03), ('learner', 0.03), ('amount', 0.029), ('sparsity', 0.029), ('assigned', 0.029), ('et', 0.029), ('margin', 0.028), ('threshold', 0.028), ('mcallester', 0.028), ('alon', 0.028), ('xj', 0.027), ('levels', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 143 nips-2004-PAC-Bayes Learning of Conjunctions and Classification of Gene-Expression Data

Author: Mario Marchand, Mohak Shah

Abstract: We propose a “soft greedy” learning algorithm for building small conjunctions of simple threshold functions, called rays, defined on single real-valued attributes. We also propose a PAC-Bayes risk bound which is minimized for classifiers achieving a non-trivial tradeoff between sparsity (the number of rays used) and the magnitude of the separating margin of each ray. Finally, we test the soft greedy algorithm on four DNA micro-array data sets. 1

2 0.158783 90 nips-2004-Joint Probabilistic Curve Clustering and Alignment

Author: Scott J. Gaffney, Padhraic Smyth

Abstract: Clustering and prediction of sets of curves is an important problem in many areas of science and engineering. It is often the case that curves tend to be misaligned from each other in a continuous manner, either in space (across the measurements) or in time. We develop a probabilistic framework that allows for joint clustering and continuous alignment of sets of curves in curve space (as opposed to a fixed-dimensional featurevector space). The proposed methodology integrates new probabilistic alignment models with model-based curve clustering algorithms. The probabilistic approach allows for the derivation of consistent EM learning algorithms for the joint clustering-alignment problem. Experimental results are shown for alignment of human growth data, and joint clustering and alignment of gene expression time-course data.

3 0.080425881 139 nips-2004-Optimal Aggregation of Classifiers and Boosting Maps in Functional Magnetic Resonance Imaging

Author: Vladimir Koltchinskii, Manel Martínez-ramón, Stefan Posse

Abstract: We study a method of optimal data-driven aggregation of classifiers in a convex combination and establish tight upper bounds on its excess risk with respect to a convex loss function under the assumption that the solution of optimal aggregation problem is sparse. We use a boosting type algorithm of optimal aggregation to develop aggregate classifiers of activation patterns in fMRI based on locally trained SVM classifiers. The aggregation coefficients are then used to design a ”boosting map” of the brain needed to identify the regions with most significant impact on classification. 1

4 0.078045994 34 nips-2004-Breaking SVM Complexity with Cross-Training

Author: Léon Bottou, Jason Weston, Gökhan H. Bakir

Abstract: We propose to selectively remove examples from the training set using probabilistic estimates related to editing algorithms (Devijver and Kittler, 1982). This heuristic procedure aims at creating a separable distribution of training examples with minimal impact on the position of the decision boundary. It breaks the linear dependency between the number of SVs and the number of training examples, and sharply reduces the complexity of SVMs during both the training and prediction stages. 1

5 0.076689854 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound

Author: Dori Peleg, Ron Meir

Abstract: A novel linear feature selection algorithm is presented based on the global minimization of a data-dependent generalization error bound. Feature selection and scaling algorithms often lead to non-convex optimization problems, which in many previous approaches were addressed through gradient descent procedures that can only guarantee convergence to a local minimum. We propose an alternative approach, whereby the global solution of the non-convex optimization problem is derived via an equivalent optimization problem. Moreover, the convex optimization task is reduced to a conic quadratic programming problem for which efficient solvers are available. Highly competitive numerical results on both artificial and real-world data sets are reported. 1

6 0.070331082 156 nips-2004-Result Analysis of the NIPS 2003 Feature Selection Challenge

7 0.069365658 168 nips-2004-Semigroup Kernels on Finite Sets

8 0.066532016 32 nips-2004-Boosting on Manifolds: Adaptive Regularization of Base Classifiers

9 0.066139132 105 nips-2004-Log-concavity Results on Gaussian Process Methods for Supervised and Unsupervised Learning

10 0.065692686 178 nips-2004-Support Vector Classification with Input Data Uncertainty

11 0.065569468 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification

12 0.065512873 128 nips-2004-Neural Network Computation by In Vitro Transcriptional Circuits

13 0.064706303 23 nips-2004-Analysis of a greedy active learning strategy

14 0.064404629 136 nips-2004-On Semi-Supervised Classification

15 0.064314388 115 nips-2004-Maximum Margin Clustering

16 0.063888572 11 nips-2004-A Second Order Cone programming Formulation for Classifying Missing Data

17 0.060685765 19 nips-2004-An Application of Boosting to Graph Classification

18 0.060249176 49 nips-2004-Density Level Detection is Classification

19 0.059752047 187 nips-2004-The Entire Regularization Path for the Support Vector Machine

20 0.059143968 69 nips-2004-Fast Rates to Bayes for Kernel Machines


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.183), (1, 0.067), (2, -0.023), (3, 0.085), (4, 0.002), (5, 0.059), (6, 0.042), (7, 0.06), (8, 0.02), (9, 0.011), (10, -0.044), (11, 0.082), (12, 0.043), (13, -0.0), (14, 0.054), (15, -0.111), (16, -0.136), (17, 0.026), (18, 0.073), (19, 0.034), (20, -0.08), (21, -0.004), (22, -0.051), (23, 0.011), (24, 0.112), (25, 0.082), (26, -0.001), (27, -0.062), (28, -0.057), (29, -0.083), (30, -0.085), (31, 0.104), (32, -0.044), (33, -0.024), (34, -0.096), (35, 0.09), (36, -0.134), (37, -0.028), (38, 0.065), (39, -0.102), (40, 0.174), (41, -0.001), (42, 0.194), (43, 0.103), (44, -0.023), (45, -0.075), (46, 0.067), (47, -0.009), (48, -0.276), (49, -0.034)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93054432 143 nips-2004-PAC-Bayes Learning of Conjunctions and Classification of Gene-Expression Data

Author: Mario Marchand, Mohak Shah

Abstract: We propose a “soft greedy” learning algorithm for building small conjunctions of simple threshold functions, called rays, defined on single real-valued attributes. We also propose a PAC-Bayes risk bound which is minimized for classifiers achieving a non-trivial tradeoff between sparsity (the number of rays used) and the magnitude of the separating margin of each ray. Finally, we test the soft greedy algorithm on four DNA micro-array data sets. 1

2 0.60077006 128 nips-2004-Neural Network Computation by In Vitro Transcriptional Circuits

Author: Jongmin Kim, John Hopfield, Erik Winfree

Abstract: The structural similarity of neural networks and genetic regulatory networks to digital circuits, and hence to each other, was noted from the very beginning of their study [1, 2]. In this work, we propose a simple biochemical system whose architecture mimics that of genetic regulation and whose components allow for in vitro implementation of arbitrary circuits. We use only two enzymes in addition to DNA and RNA molecules: RNA polymerase (RNAP) and ribonuclease (RNase). We develop a rate equation for in vitro transcriptional networks, and derive a correspondence with general neural network rate equations [3]. As proof-of-principle demonstrations, an associative memory task and a feedforward network computation are shown by simulation. A difference between the neural network and biochemical models is also highlighted: global coupling of rate equations through enzyme saturation can lead to global feedback regulation, thus allowing a simple network without explicit mutual inhibition to perform the winner-take-all computation. Thus, the full complexity of the cell is not necessary for biochemical computation: a wide range of functional behaviors can be achieved with a small set of biochemical components. 1

3 0.55291915 90 nips-2004-Joint Probabilistic Curve Clustering and Alignment

Author: Scott J. Gaffney, Padhraic Smyth

Abstract: Clustering and prediction of sets of curves is an important problem in many areas of science and engineering. It is often the case that curves tend to be misaligned from each other in a continuous manner, either in space (across the measurements) or in time. We develop a probabilistic framework that allows for joint clustering and continuous alignment of sets of curves in curve space (as opposed to a fixed-dimensional featurevector space). The proposed methodology integrates new probabilistic alignment models with model-based curve clustering algorithms. The probabilistic approach allows for the derivation of consistent EM learning algorithms for the joint clustering-alignment problem. Experimental results are shown for alignment of human growth data, and joint clustering and alignment of gene expression time-course data.

4 0.4895826 149 nips-2004-Probabilistic Inference of Alternative Splicing Events in Microarray Data

Author: Ofer Shai, Brendan J. Frey, Quaid D. Morris, Qun Pan, Christine Misquitta, Benjamin J. Blencowe

Abstract: Alternative splicing (AS) is an important and frequent step in mammalian gene expression that allows a single gene to specify multiple products, and is crucial for the regulation of fundamental biological processes. The extent of AS regulation, and the mechanisms involved, are not well understood. We have developed a custom DNA microarray platform for surveying AS levels on a large scale. We present here a generative model for the AS Array Platform (GenASAP) and demonstrate its utility for quantifying AS levels in different mouse tissues. Learning is performed using a variational expectation maximization algorithm, and the parameters are shown to correctly capture expected AS trends. A comparison of the results obtained with a well-established but low through-put experimental method demonstrate that AS levels obtained from GenASAP are highly predictive of AS levels in mammalian tissues. 1

5 0.47622377 156 nips-2004-Result Analysis of the NIPS 2003 Feature Selection Challenge

Author: Isabelle Guyon, Steve Gunn, Asa Ben-Hur, Gideon Dror

Abstract: The NIPS 2003 workshops included a feature selection competition organized by the authors. We provided participants with five datasets from different application domains and called for classification results using a minimal number of features. The competition took place over a period of 13 weeks and attracted 78 research groups. Participants were asked to make on-line submissions on the validation and test sets, with performance on the validation set being presented immediately to the participant and performance on the test set presented to the participants at the workshop. In total 1863 entries were made on the validation sets during the development period and 135 entries on all test sets for the final competition. The winners used a combination of Bayesian neural networks with ARD priors and Dirichlet diffusion trees. Other top entries used a variety of methods for feature selection, which combined filters and/or wrapper or embedded methods using Random Forests, kernel methods, or neural networks as a classification engine. The results of the benchmark (including the predictions made by the participants and the features they selected) and the scoring software are publicly available. The benchmark is available at www.nipsfsc.ecs.soton.ac.uk for post-challenge submissions to stimulate further research. 1

6 0.43243659 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound

7 0.43240952 101 nips-2004-Learning Syntactic Patterns for Automatic Hypernym Discovery

8 0.40643468 136 nips-2004-On Semi-Supervised Classification

9 0.39268729 139 nips-2004-Optimal Aggregation of Classifiers and Boosting Maps in Functional Magnetic Resonance Imaging

10 0.37717944 14 nips-2004-A Topographic Support Vector Machine: Classification Using Local Label Configurations

11 0.36334062 137 nips-2004-On the Adaptive Properties of Decision Trees

12 0.34037048 49 nips-2004-Density Level Detection is Classification

13 0.33443335 108 nips-2004-Markov Networks for Detecting Overalpping Elements in Sequence Data

14 0.32946804 69 nips-2004-Fast Rates to Bayes for Kernel Machines

15 0.31424153 11 nips-2004-A Second Order Cone programming Formulation for Classifying Missing Data

16 0.31342584 106 nips-2004-Machine Learning Applied to Perception: Decision Images for Gender Classification

17 0.30943534 19 nips-2004-An Application of Boosting to Graph Classification

18 0.29895347 34 nips-2004-Breaking SVM Complexity with Cross-Training

19 0.29277244 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images

20 0.29275134 167 nips-2004-Semi-supervised Learning with Penalized Probabilistic Clustering


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.063), (15, 0.095), (26, 0.057), (31, 0.025), (32, 0.021), (33, 0.253), (35, 0.011), (39, 0.022), (50, 0.042), (58, 0.304), (86, 0.01)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.92850494 88 nips-2004-Intrinsically Motivated Reinforcement Learning

Author: Nuttapong Chentanez, Andrew G. Barto, Satinder P. Singh

Abstract: Psychologists call behavior intrinsically motivated when it is engaged in for its own sake rather than as a step toward solving a specific problem of clear practical value. But what we learn during intrinsically motivated behavior is essential for our development as competent autonomous entities able to efficiently solve a wide range of practical problems as they arise. In this paper we present initial results from a computational study of intrinsically motivated reinforcement learning aimed at allowing artificial agents to construct and extend hierarchies of reusable skills that are needed for competent autonomy. 1

2 0.88206398 80 nips-2004-Identifying Protein-Protein Interaction Sites on a Genome-Wide Scale

Author: Haidong Wang, Eran Segal, Asa Ben-Hur, Daphne Koller, Douglas L. Brutlag

Abstract: Protein interactions typically arise from a physical interaction of one or more small sites on the surface of the two proteins. Identifying these sites is very important for drug and protein design. In this paper, we propose a computational method based on probabilistic relational model that attempts to address this task using high-throughput protein interaction data and a set of short sequence motifs. We learn the model using the EM algorithm, with a branch-and-bound algorithm as an approximate inference for the E-step. Our method searches for motifs whose presence in a pair of interacting proteins can explain their observed interaction. It also tries to determine which motif pairs have high affinity, and can therefore lead to an interaction. We show that our method is more accurate than others at predicting new protein-protein interactions. More importantly, by examining solved structures of protein complexes, we find that 2/3 of the predicted active motifs correspond to actual interaction sites. 1

same-paper 3 0.84258372 143 nips-2004-PAC-Bayes Learning of Conjunctions and Classification of Gene-Expression Data

Author: Mario Marchand, Mohak Shah

Abstract: We propose a “soft greedy” learning algorithm for building small conjunctions of simple threshold functions, called rays, defined on single real-valued attributes. We also propose a PAC-Bayes risk bound which is minimized for classifiers achieving a non-trivial tradeoff between sparsity (the number of rays used) and the magnitude of the separating margin of each ray. Finally, we test the soft greedy algorithm on four DNA micro-array data sets. 1

4 0.67316192 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound

Author: Dori Peleg, Ron Meir

Abstract: A novel linear feature selection algorithm is presented based on the global minimization of a data-dependent generalization error bound. Feature selection and scaling algorithms often lead to non-convex optimization problems, which in many previous approaches were addressed through gradient descent procedures that can only guarantee convergence to a local minimum. We propose an alternative approach, whereby the global solution of the non-convex optimization problem is derived via an equivalent optimization problem. Moreover, the convex optimization task is reduced to a conic quadratic programming problem for which efficient solvers are available. Highly competitive numerical results on both artificial and real-world data sets are reported. 1

5 0.67057395 44 nips-2004-Conditional Random Fields for Object Recognition

Author: Ariadna Quattoni, Michael Collins, Trevor Darrell

Abstract: We present a discriminative part-based approach for the recognition of object classes from unsegmented cluttered scenes. Objects are modeled as flexible constellations of parts conditioned on local observations found by an interest operator. For each object class the probability of a given assignment of parts to local features is modeled by a Conditional Random Field (CRF). We propose an extension of the CRF framework that incorporates hidden variables and combines class conditional CRFs into a unified framework for part-based object recognition. The parameters of the CRF are estimated in a maximum likelihood framework and recognition proceeds by finding the most likely class under our model. The main advantage of the proposed CRF framework is that it allows us to relax the assumption of conditional independence of the observed data (i.e. local features) often used in generative approaches, an assumption that might be too restrictive for a considerable number of object classes.

6 0.66939265 72 nips-2004-Generalization Error and Algorithmic Convergence of Median Boosting

7 0.66768014 149 nips-2004-Probabilistic Inference of Alternative Splicing Events in Microarray Data

8 0.66759324 207 nips-2004-ℓ₀-norm Minimization for Basis Selection

9 0.66707176 161 nips-2004-Self-Tuning Spectral Clustering

10 0.66689634 45 nips-2004-Confidence Intervals for the Area Under the ROC Curve

11 0.66665202 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss

12 0.66611814 115 nips-2004-Maximum Margin Clustering

13 0.66562748 166 nips-2004-Semi-supervised Learning via Gaussian Processes

14 0.66521966 38 nips-2004-Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms

15 0.66483665 86 nips-2004-Instance-Specific Bayesian Model Averaging for Classification

16 0.66472453 23 nips-2004-Analysis of a greedy active learning strategy

17 0.6644817 77 nips-2004-Hierarchical Clustering of a Mixture Model

18 0.6633805 167 nips-2004-Semi-supervised Learning with Penalized Probabilistic Clustering

19 0.66308302 186 nips-2004-The Correlated Correspondence Algorithm for Unsupervised Registration of Nonrigid Surfaces

20 0.66292375 32 nips-2004-Boosting on Manifolds: Adaptive Regularization of Base Classifiers