nips nips2004 nips2004-90 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Scott J. Gaffney, Padhraic Smyth
Abstract: Clustering and prediction of sets of curves is an important problem in many areas of science and engineering. It is often the case that curves tend to be misaligned from each other in a continuous manner, either in space (across the measurements) or in time. We develop a probabilistic framework that allows for joint clustering and continuous alignment of sets of curves in curve space (as opposed to a fixed-dimensional featurevector space). The proposed methodology integrates new probabilistic alignment models with model-based curve clustering algorithms. The probabilistic approach allows for the derivation of consistent EM learning algorithms for the joint clustering-alignment problem. Experimental results are shown for alignment of human growth data, and joint clustering and alignment of gene expression time-course data.
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract Clustering and prediction of sets of curves is an important problem in many areas of science and engineering. [sent-3, score-0.161]
2 It is often the case that curves tend to be misaligned from each other in a continuous manner, either in space (across the measurements) or in time. [sent-4, score-0.246]
3 We develop a probabilistic framework that allows for joint clustering and continuous alignment of sets of curves in curve space (as opposed to a fixed-dimensional featurevector space). [sent-5, score-0.989]
4 The proposed methodology integrates new probabilistic alignment models with model-based curve clustering algorithms. [sent-6, score-0.665]
5 The probabilistic approach allows for the derivation of consistent EM learning algorithms for the joint clustering-alignment problem. [sent-7, score-0.199]
6 Experimental results are shown for alignment of human growth data, and joint clustering and alignment of gene expression time-course data. [sent-8, score-1.17]
7 1 Introduction We introduce a novel methodology for the clustering and prediction of sets of smoothly varying curves while jointly allowing for the learning of sets of continuous curve transformations. [sent-9, score-0.646]
8 Our approach is to formulate models for both the clustering and alignment sub-problems and integrate them into a unified probabilistic framework that allows for the derivation of consistent learning algorithms. [sent-10, score-0.564]
9 The alignment sub-problem is handled with the introduction of a novel curve alignment procedure employing model priors over the set of possible alignments leading to the derivation of EM learning algorithms that formalize the so-called Procrustes approach for curve data [1]. [sent-11, score-1.036]
10 These alignment models are then integrated into a finite mixture model setting in which the clustering is carried out. [sent-12, score-0.57]
11 We make use of both polynomial and spline regression mixture models to complete the joint clustering-alignment framework. [sent-13, score-0.553]
12 The following simple illustrative example demonstrates the importance of jointly handling the clustering-alignment problem as opposed to treating alignment and clustering separately. [sent-14, score-0.549]
13 Figure 1(a) shows a simulated set of curves which have been subjected to random translations in time. [sent-15, score-0.168]
14 The underlying generative model contains three clusters each described by a cubic polynomial (not shown). [sent-16, score-0.152]
15 Figure 1(b) shows the output of the proposed joint EM algorithm introduced in this paper, where curves have been simultaneously aligned and clustered. [sent-17, score-0.301]
16 The sequential approach results in significant misclassification and incorrect alignment demonstrating that a two-stage approach can be quite suboptimal when compared to a joint clustering-alignment methodology. [sent-21, score-0.422]
17 (Similar results, not shown, are obtained when the curves are first aligned and then clustered—see [2] for full details. [sent-22, score-0.174]
18 ) There has been little prior work on the specific problem of joint curve clustering and alignment, but there is related work in other areas. [sent-23, score-0.461]
19 For example, clustering of gene-expression time profiles with mixtures of splines was addressed in [3]. [sent-24, score-0.351]
20 However, alignment was only considered as a post-processing step to compare cluster results among related datasets. [sent-25, score-0.37]
21 In image analysis, the transformed mixture of Gaussians (TMG) model uses a probabilistic framework and an EM algorithm to jointly learn clustering and alignment of image patches subject to various forms of linear transformations [4]. [sent-26, score-0.73]
22 However, this model only considers sets of transformations in discrete pixel space, whereas we are focused on curve modelling that allows for arbitrary continuous alignment in time and space. [sent-27, score-0.622]
23 In earlier related work we developed general techniques for curve clustering (e. [sent-30, score-0.334]
24 , [7]) and also proposed techniques for transformation-invariant curve clustering with discrete time alignment and Gaussian mixture models for curves [8, 9]. [sent-32, score-0.897]
25 In this paper we provide a much more general framework that allows for continuous alignment in both time and measurement space for a general class of “cluster shape” models, including polynomials and splines. [sent-33, score-0.465]
26 2 Joint clustering and alignment It is useful to represent curves as variable-length vectors. [sent-34, score-0.624]
27 In this case, y i is a curve that consists of a sequence of n i observations or measurements. [sent-35, score-0.137]
28 The j-th measurement of y i is denoted by y ij and is usually taken to be univariate (the generalization to multivariate observations is straightforward). [sent-36, score-0.129]
29 x i is often thought of as time so that x ij gives the time at which y ij was observed. [sent-38, score-0.222]
30 Regression mixture models can be effectively used to cluster this type of curve data [10]. [sent-39, score-0.29]
31 In the standard setup, y i is modelled using a normal (Gaussian) regression model in which yi = Xi β + i , where β is a (p+1)×1 coefficient vector, i is a zero-mean Gaussian noise variable, and X i is the regression matrix. [sent-40, score-0.448]
32 The form of X i depends on the type of regression model employed. [sent-41, score-0.121]
33 For polynomial regression, X i is often associated with the standard Vandermonde matrix; and for spline regression, X i takes the form of a spline-basis matrix (see, e. [sent-42, score-0.227]
34 The mixture model is completed by repeating this model over K clusters and indexing the parameters by k so that, for example, y i = Xi β k + i gives the regression model for y i under the k-th cluster. [sent-45, score-0.294]
35 Using B-splines, the curve point y ij can be represented as the linear combination y ij = Bij c, in which the vector B ij gives the vector of B-spline basis functions evaluated at x ij , and c gives the spline coefficient vector [2]. [sent-47, score-0.531]
36 The full curve yi can then be written compactly as y i = Bi c in which the spline basis matrix takes the form Bi = [Bi1 · · · Bini ] . [sent-48, score-0.553]
37 Spline regression models can be easily integrated into the regression mixture model framework by equating the regression matrix X i with the spline basis matrix Bi . [sent-49, score-0.623]
38 1 Joint model definition The joint clustering-alignment model definition is based on a regression mixture model that has been augmented with up to four individual random transformation parameters or variables (ai , bi , ci , di ). [sent-52, score-0.997]
39 The ai and bi allow for scaling and translation in time, while the c i and di allow for scaling and translation in measurement space. [sent-53, score-0.848]
40 The model definition takes the form yi = ci ai xi − bi β k + di + i , (1) in which ai xi − bi represents the regression matrix X i (either spline or polynomial) evaluated at the transformed time a i xi − bi . [sent-54, score-2.216]
41 Below we use the matrix X i to denote ai xi − bi when parsimony is required. [sent-55, score-0.546]
42 The conditional density 2 pk (yi |ai , bi , ci , di ) = N (yi |ci ai xi − bi β k + di , σk I) (2) gives the probability density of y i when all the transformation parameters (as well as cluster membership) are known. [sent-57, score-1.683]
43 ) In general, the values for the transformation parameters are unknown. [sent-59, score-0.146]
44 Treating this as a standard hidden-data problem, it is useful to think of each of the transformation parameters as random variables that are curve-specific but with “population-level” prior probability distributions. [sent-60, score-0.146]
45 In this way, the transformation parameters and the model parameters can be learned simultaneously in an efficient manner using EM. [sent-61, score-0.175]
46 2 Transformation priors Priors are attached to each of the transformation variables in such a way that the identity transformation is the most likely transformation. [sent-63, score-0.312]
47 The time transformation priors are specified as 2 ai ∼ N (1, rk ), bi ∼ N (0, s2 ), (3) k and the measurement space priors are given as 2 ci ∼ N (1, u2 ) , di ∼ N (0, vk ). [sent-65, score-1.302]
48 (4) k Note that the identity transformation is indeed the most likely. [sent-66, score-0.117]
49 All of the variance parameters are cluster-specific in general; however, any subset of these parameters can be “tied” across clusters if desired in a specific application. [sent-67, score-0.188]
50 Note that these priors technically allow for negative scaling in time and in measurement space. [sent-68, score-0.241]
51 We do not make use of hyperpriors for these prior parameters; however, it is straightforward to extend the method to allow hyperpriors if desired. [sent-73, score-0.133]
52 3 Full probability model The joint density of y i and the set of transformation variables Φ i = {ai , bi , ci , di } can be written succinctly as pk (yi , Φi ) = pk (yi |Φi )pk (Φi ), (5) 2 2 where pk (Φi ) = N (ai |1, rk )N (bi |0, s2 )N (ci |1, u2 )N (di |0, vk ). [sent-75, score-1.705]
53 The space transformak k tion parameters can be integrated-out of (5) resulting in the marginal of y i conditioned only on the time transformation parameters. [sent-76, score-0.239]
54 This conditional marginal takes the form pk (yi |ai , bi ) = pk (yi , ci , di |ai , bi ) dci , ddi 2 = N (yi |X i βk , Uik + Vk − σk I), (6) 2 2 2 2 with Uik = uk X i β k βk X i + σk I and V k = vk 11 + σk I. [sent-77, score-1.451]
55 The unconditional (though, still cluster-dependent) marginal for y i cannot be computed analytically since a i , bi cannot be analytically integrated-out. [sent-78, score-0.47]
56 (8) (9) The log-likelihood of all n curves Y = {y i } follows directly from this approximation and takes the form (m) (m) log p(Y ) ≈ log αk pk (yi |ai , bi ) − n log M. [sent-85, score-0.691]
57 4 EM algorithm We derive an EM algorithm that simultaneously allows the learning of both the model parameters and the transformation variables Φ with time-complexity that is linear in the total number of data points N = i ni . [sent-87, score-0.183]
58 First, let zi give the cluster membership for curve yi . [sent-88, score-0.609]
59 Now, regard the transformation variables {Φ i } as well as the cluster memberships {z i } as being hidden. [sent-89, score-0.221]
60 The complete-data log-likelihood function is defined as the joint loglikelihood of Y and the hidden data {Φ i , zi }. [sent-90, score-0.292]
61 This can be written as the sum over all n curves of the log of the product of α zi and the cluster-dependent joint density in (5). [sent-91, score-0.456]
62 This function takes the form Lc = log αzi pzi (yi |Φi ) pzi (Φi ). [sent-92, score-0.208]
63 (11) i In the E-step, the posterior p(Φ i , zi |yi ) is calculated and then used to take the posterior expectation of Equation (11). [sent-93, score-0.269]
64 This expectation is then used in the M-step to calculate the 2 2 2 re-estimation equations for updating the model parameters {β k , σk , rk , s2 , u2 , vk }. [sent-94, score-0.246]
65 5 E-step The posterior p(Φ i , zi |yi ) can be factorized as p zi (Φ|yi )p(zi |yi ). [sent-96, score-0.32]
66 The second factor is the membership probability w ik that yi was generated by cluster k. [sent-97, score-0.535]
67 It can be rewritten as p(zi = k|yi ) ∝ pk (yi ) and evaluated using Equation (7). [sent-98, score-0.229]
68 Further factoring reveals that p zi (Φ|yi ) = pzi (ci , di |ai , bi , yi )pzi (ai , bi |yi ). [sent-100, score-1.196]
69 The new first factor p zi (ci , di |ai , bi , yi ) can be solved for exactly by noting that it is proportional to a bivariate normal distribution for each z i [2]. [sent-101, score-0.762]
70 The new second factor p zi (ai , bi |yi ) cannot, in general, be solved for analytically, so instead we use an approximation. [sent-102, score-0.464]
71 To make the approximation here, the vector (ˆ ik , ˆik ) representing the multi-dimensional a b (k) mode of p k (ai , bi |yi ), the covariance matrix V ai bi for (ˆ ik , ˆik ), and the separate variances a b Vaik , Vbik must be found. [sent-105, score-1.238]
72 The above calculations of the posterior p(Φ i , zi |yi ) allow the posterior expectation of the complete-data log-likelihood in Equation (11) to be solved for. [sent-108, score-0.298]
73 Although the derivation is quite complex, the Q-function can be calculated exactly for polynomial regression [2]; for spline regression, the basis functions do not afford an exact formula for the solution of the Q-function. [sent-110, score-0.384]
74 However, in the spline case, removal of a few problematic variance terms gives an efficient approximation (the interested reader is referred to [2] for more details). [sent-111, score-0.212]
75 The Q2 2 2 function is maximized over the set of parameters {β k , σk , rk , s2 , u2 , vk } for 1 ≤ k ≤ K. [sent-114, score-0.245]
76 3 Experimental results and conclusions The results of a simple demonstration of EM-based alignment (using splines and the learning algorithm of the previous section, but with no clustering) are shown in Figure 2. [sent-117, score-0.391]
77 In the left plot are a set of smoothed curves representing the acceleration of height for each of 39 boys whose heights were measured at 29 observation times over the ages of 1 to 18 [1]. [sent-118, score-0.393]
78 Notice that the curves share a similar shape but seem to be misaligned in time due to individual growth dynamics. [sent-119, score-0.3]
79 The right plot shows the same acceleration curves after processing from our spline alignment model using quartic splines with 8 uniformly spaced knots allowing for a maximum time translation of 2 units. [sent-120, score-1.047]
80 The aligned curves in the right plot of Figure 2 represent the average behavior in a much clearer way. [sent-122, score-0.204]
81 The results demonstrate that it is common for important features of curves to be randomly translated in time and that it is possible to use the data to recover these underlying hidden transformations using our alignment models. [sent-126, score-0.583]
82 Next we briefly present an application of the joint clustering-alignment model to the problem of gene expression clustering. [sent-127, score-0.351]
83 We analyze the alpha arrest data described in [13] that captures gene expression levels at 7 minute intervals for two consecutive cell cycles (totaling 17 measurements per gene). [sent-128, score-0.287]
84 Clustering is often used in gene expression analysis to reveal groups of genes with similar profiles that may be physically related to the same underlying biological process (e. [sent-129, score-0.274]
85 tant role in gene regulation, and thus, curves measured over time which represent the same process may often be misaligned from each other. [sent-133, score-0.386]
86 Since these gene expression data are already normalized, we did not allow for transformations in measurement space. [sent-135, score-0.396]
87 We only allowed for translations in time since experts do not expect scaling in time to be a factor in these data. [sent-136, score-0.152]
88 For the curve model, cubic splines with 6 uniformly spaced knots across the interval from −4 to 21 were chosen, allowing for a maximum time translation of 4 units. [sent-137, score-0.524]
89 Due to limited space, we present a single case of comparison between a standard spline regression mixture model (SRM) and an SRM that jointly allows for time translations. [sent-138, score-0.496]
90 The left column of the figure shows the output from the joint clustering-alignment model, while the right column shows the output from the standard cluster model. [sent-142, score-0.202]
91 The results also demonstrate the appearance of cluster-dependent alignment effects. [sent-145, score-0.295]
92 Out-of-sample experiments (not shown here) show that the joint model produces better predictive models than the standard clustering method. [sent-146, score-0.324]
93 Experimental results on a variety of other data sets are provided in [2], including applications to clustering of cyclone trajectories. [sent-147, score-0.226]
94 4 Conclusions We proposed a general probabilistic framework for joint clustering and alignment of sets of curves. [sent-148, score-0.684]
95 The experimental results indicate that the approach provides a new and useful tool for curve analysis in the face of underlying hidden transformations. [sent-149, score-0.168]
96 The resulting EM-based learning algorithms have time-complexity that is linear in the number of measurements—in contrast, many existing curve alignment algorithms themselves are O(n2 ) (e. [sent-150, score-0.432]
97 The incorporation of splines gives the method an overall non-parametric freedom which leads to general applicability. [sent-153, score-0.141]
98 A new approach to analyzing gene expression time series data. [sent-171, score-0.282]
99 Probabilistic models for joint clustering and timewarping of multi-dimensional curves. [sent-214, score-0.324]
100 Aligning gene expression time series with time warping algorithms. [sent-273, score-0.375]
wordName wordTfidf (topN-words)
[('bi', 0.33), ('wik', 0.312), ('alignment', 0.295), ('pk', 0.229), ('yi', 0.206), ('ik', 0.197), ('clustering', 0.197), ('ai', 0.184), ('spline', 0.182), ('curve', 0.137), ('zi', 0.134), ('curves', 0.132), ('gaffney', 0.13), ('joint', 0.127), ('regression', 0.121), ('gene', 0.118), ('transformation', 0.117), ('expression', 0.106), ('pzi', 0.104), ('vk', 0.103), ('ci', 0.103), ('em', 0.101), ('acceleration', 0.096), ('splines', 0.096), ('di', 0.092), ('rk', 0.083), ('priors', 0.078), ('boys', 0.078), ('misaligned', 0.078), ('vxcd', 0.078), ('vxi', 0.078), ('mixture', 0.078), ('measurement', 0.076), ('cluster', 0.075), ('age', 0.073), ('transformations', 0.067), ('clusters', 0.066), ('time', 0.058), ('alignments', 0.058), ('height', 0.057), ('jointly', 0.057), ('membership', 0.057), ('translation', 0.054), ('ij', 0.053), ('chudova', 0.052), ('dik', 0.052), ('hyperpriors', 0.052), ('uik', 0.052), ('vaik', 0.052), ('vbik', 0.052), ('vdik', 0.052), ('vxxi', 0.052), ('posterior', 0.052), ('genes', 0.05), ('aligning', 0.048), ('canonical', 0.047), ('pami', 0.046), ('knots', 0.045), ('cik', 0.045), ('incorporation', 0.045), ('polynomial', 0.045), ('axis', 0.043), ('aligned', 0.042), ('unconditional', 0.041), ('cubic', 0.041), ('ninth', 0.038), ('srm', 0.038), ('ni', 0.037), ('august', 0.037), ('probabilistic', 0.036), ('translations', 0.036), ('frey', 0.036), ('derivation', 0.036), ('continuous', 0.036), ('density', 0.035), ('acm', 0.035), ('marginal', 0.035), ('warping', 0.035), ('scott', 0.035), ('across', 0.034), ('measurements', 0.033), ('irvine', 0.033), ('xi', 0.032), ('growth', 0.032), ('analytically', 0.032), ('hidden', 0.031), ('expectation', 0.031), ('january', 0.031), ('variance', 0.03), ('cell', 0.03), ('plot', 0.03), ('sigkdd', 0.03), ('spaced', 0.03), ('maximized', 0.03), ('parameters', 0.029), ('allowing', 0.029), ('allow', 0.029), ('regard', 0.029), ('sets', 0.029), ('written', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000008 90 nips-2004-Joint Probabilistic Curve Clustering and Alignment
Author: Scott J. Gaffney, Padhraic Smyth
Abstract: Clustering and prediction of sets of curves is an important problem in many areas of science and engineering. It is often the case that curves tend to be misaligned from each other in a continuous manner, either in space (across the measurements) or in time. We develop a probabilistic framework that allows for joint clustering and continuous alignment of sets of curves in curve space (as opposed to a fixed-dimensional featurevector space). The proposed methodology integrates new probabilistic alignment models with model-based curve clustering algorithms. The probabilistic approach allows for the derivation of consistent EM learning algorithms for the joint clustering-alignment problem. Experimental results are shown for alignment of human growth data, and joint clustering and alignment of gene expression time-course data.
2 0.158783 143 nips-2004-PAC-Bayes Learning of Conjunctions and Classification of Gene-Expression Data
Author: Mario Marchand, Mohak Shah
Abstract: We propose a “soft greedy” learning algorithm for building small conjunctions of simple threshold functions, called rays, defined on single real-valued attributes. We also propose a PAC-Bayes risk bound which is minimized for classifiers achieving a non-trivial tradeoff between sparsity (the number of rays used) and the magnitude of the separating margin of each ray. Finally, we test the soft greedy algorithm on four DNA micro-array data sets. 1
3 0.15037368 161 nips-2004-Self-Tuning Spectral Clustering
Author: Lihi Zelnik-manor, Pietro Perona
Abstract: We study a number of open issues in spectral clustering: (i) Selecting the appropriate scale of analysis, (ii) Handling multi-scale data, (iii) Clustering with irregular background clutter, and, (iv) Finding automatically the number of groups. We first propose that a ‘local’ scale should be used to compute the affinity between each pair of points. This local scaling leads to better clustering especially when the data includes multiple scales and when the clusters are placed within a cluttered background. We further suggest exploiting the structure of the eigenvectors to infer automatically the number of groups. This leads to a new algorithm in which the final randomly initialized k-means stage is eliminated. 1
4 0.13205965 52 nips-2004-Discrete profile alignment via constrained information bottleneck
Author: Sean O'rourke, Gal Chechik, Robin Friedman, Eleazar Eskin
Abstract: Amino acid profiles, which capture position-specific mutation probabilities, are a richer encoding of biological sequences than the individual sequences themselves. However, profile comparisons are much more computationally expensive than discrete symbol comparisons, making profiles impractical for many large datasets. Furthermore, because they are such a rich representation, profiles can be difficult to visualize. To overcome these problems, we propose a discretization for profiles using an expanded alphabet representing not just individual amino acids, but common profiles. By using an extension of information bottleneck (IB) incorporating constraints and priors on the class distributions, we find an informationally optimal alphabet. This discretization yields a concise, informative textual representation for profile sequences. Also alignments between these sequences, while nearly as accurate as the full profileprofile alignments, can be computed almost as quickly as those between individual or consensus sequences. A full pairwise alignment of SwissProt would take years using profiles, but less than 3 days using a discrete IB encoding, illustrating how discrete encoding can expand the range of sequence problems to which profile information can be applied. 1
5 0.12078442 167 nips-2004-Semi-supervised Learning with Penalized Probabilistic Clustering
Author: Zhengdong Lu, Todd K. Leen
Abstract: While clustering is usually an unsupervised operation, there are circumstances in which we believe (with varying degrees of certainty) that items A and B should be assigned to the same cluster, while items A and C should not. We would like such pairwise relations to influence cluster assignments of out-of-sample data in a manner consistent with the prior knowledge expressed in the training set. Our starting point is probabilistic clustering based on Gaussian mixture models (GMM) of the data distribution. We express clustering preferences in the prior distribution over assignments of data points to clusters. This prior penalizes cluster assignments according to the degree with which they violate the preferences. We fit the model parameters with EM. Experiments on a variety of data sets show that PPC can consistently improve clustering results.
6 0.11766431 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
7 0.11374725 115 nips-2004-Maximum Margin Clustering
8 0.10763065 178 nips-2004-Support Vector Classification with Input Data Uncertainty
9 0.10160291 77 nips-2004-Hierarchical Clustering of a Mixture Model
10 0.094685636 164 nips-2004-Semi-supervised Learning by Entropy Minimization
11 0.093870096 11 nips-2004-A Second Order Cone programming Formulation for Classifying Missing Data
12 0.09360303 124 nips-2004-Multiple Alignment of Continuous Time Series
13 0.093337633 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
14 0.091374293 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning
15 0.09013395 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning
16 0.089699209 13 nips-2004-A Three Tiered Approach for Articulated Object Action Modeling and Recognition
17 0.088263549 198 nips-2004-Unsupervised Variational Bayesian Learning of Nonlinear Models
18 0.07933227 105 nips-2004-Log-concavity Results on Gaussian Process Methods for Supervised and Unsupervised Learning
19 0.077018902 50 nips-2004-Dependent Gaussian Processes
20 0.074085303 177 nips-2004-Supervised Graph Inference
topicId topicWeight
[(0, -0.241), (1, 0.069), (2, -0.088), (3, -0.054), (4, -0.046), (5, -0.016), (6, -0.135), (7, 0.133), (8, -0.114), (9, 0.014), (10, 0.097), (11, 0.147), (12, 0.004), (13, -0.039), (14, 0.128), (15, -0.044), (16, -0.168), (17, -0.048), (18, 0.033), (19, -0.058), (20, -0.089), (21, -0.065), (22, -0.124), (23, 0.022), (24, 0.161), (25, 0.099), (26, -0.083), (27, -0.119), (28, 0.038), (29, -0.039), (30, 0.019), (31, 0.119), (32, -0.067), (33, -0.077), (34, 0.039), (35, 0.029), (36, -0.082), (37, 0.019), (38, 0.042), (39, -0.156), (40, 0.094), (41, 0.029), (42, 0.065), (43, -0.016), (44, 0.07), (45, -0.056), (46, 0.105), (47, -0.048), (48, -0.066), (49, 0.005)]
simIndex simValue paperId paperTitle
same-paper 1 0.96926928 90 nips-2004-Joint Probabilistic Curve Clustering and Alignment
Author: Scott J. Gaffney, Padhraic Smyth
Abstract: Clustering and prediction of sets of curves is an important problem in many areas of science and engineering. It is often the case that curves tend to be misaligned from each other in a continuous manner, either in space (across the measurements) or in time. We develop a probabilistic framework that allows for joint clustering and continuous alignment of sets of curves in curve space (as opposed to a fixed-dimensional featurevector space). The proposed methodology integrates new probabilistic alignment models with model-based curve clustering algorithms. The probabilistic approach allows for the derivation of consistent EM learning algorithms for the joint clustering-alignment problem. Experimental results are shown for alignment of human growth data, and joint clustering and alignment of gene expression time-course data.
2 0.60353553 52 nips-2004-Discrete profile alignment via constrained information bottleneck
Author: Sean O'rourke, Gal Chechik, Robin Friedman, Eleazar Eskin
Abstract: Amino acid profiles, which capture position-specific mutation probabilities, are a richer encoding of biological sequences than the individual sequences themselves. However, profile comparisons are much more computationally expensive than discrete symbol comparisons, making profiles impractical for many large datasets. Furthermore, because they are such a rich representation, profiles can be difficult to visualize. To overcome these problems, we propose a discretization for profiles using an expanded alphabet representing not just individual amino acids, but common profiles. By using an extension of information bottleneck (IB) incorporating constraints and priors on the class distributions, we find an informationally optimal alphabet. This discretization yields a concise, informative textual representation for profile sequences. Also alignments between these sequences, while nearly as accurate as the full profileprofile alignments, can be computed almost as quickly as those between individual or consensus sequences. A full pairwise alignment of SwissProt would take years using profiles, but less than 3 days using a discrete IB encoding, illustrating how discrete encoding can expand the range of sequence problems to which profile information can be applied. 1
3 0.59956247 149 nips-2004-Probabilistic Inference of Alternative Splicing Events in Microarray Data
Author: Ofer Shai, Brendan J. Frey, Quaid D. Morris, Qun Pan, Christine Misquitta, Benjamin J. Blencowe
Abstract: Alternative splicing (AS) is an important and frequent step in mammalian gene expression that allows a single gene to specify multiple products, and is crucial for the regulation of fundamental biological processes. The extent of AS regulation, and the mechanisms involved, are not well understood. We have developed a custom DNA microarray platform for surveying AS levels on a large scale. We present here a generative model for the AS Array Platform (GenASAP) and demonstrate its utility for quantifying AS levels in different mouse tissues. Learning is performed using a variational expectation maximization algorithm, and the parameters are shown to correctly capture expected AS trends. A comparison of the results obtained with a well-established but low through-put experimental method demonstrate that AS levels obtained from GenASAP are highly predictive of AS levels in mammalian tissues. 1 Biological diversity through alternative splicing Current estimates place the number of genes in the human genome at approximately 30,000, which is a surprisingly small number when one considers that the genome of yeast, a singlecelled organism, has 6,000 genes. The number of genes alone cannot account for the complexity and cell specialization exhibited by higher eukaryotes (i.e. mammals, plants, etc.). Some of that added complexity can be achieved through the use of alternative splicing, whereby a single gene can be used to code for a multitude of products. Genes are segments of the double stranded DNA that contain the information required by the cell for protein synthesis. That information is coded using an alphabet of 4 (A, C, G, and T), corresponding to the four nucleotides that make up the DNA. In what is known as the central dogma of molecular biology, DNA is transcribed to RNA, which in turn is translated into proteins. Messenger RNA (mRNA) is synthesized in the nucleus of the cell and carries the genomic information to the ribosome. In eukaryotes, genes are generally comprised of both exons, which contain the information needed by the cell to synthesize proteins, and introns, sometimes referred to as spacer DNA, which are spliced out of the pre-mRNA to create mature mRNA. An estimated 35%-75% of human genes [1] can be C1 (a) C1 A C1 C2 C1 A 3’ C2 (b) C2 C1 A 3’ C1 A 5’ C2 A 5’ C2 C1 C1 C2 C1 (c) A2 A1 C1 A1 C1 A2 C1 C2 C2 C1 (d) C2 C2 A C2 C2 C2 C1 C2 Figure 1: Four types of AS. Boxes represent exons and lines represent introns, with the possible splicing alternatives indicated by the connectors. (a) Single cassette exon inclusion/exclusion. C1 and C2 are constitutive exons (exons that are included in all isoforms) and flank a single alternative exon (A). The alternative exon is included in one isoform and excluded in the other. (b) Alternative 3’ (or donor) and alternative 5’ (acceptor) splicing sites. Both exons are constitutive, but may contain alternative donor and/or acceptor splicing sites. (c) Mutually exclusive exons. One of the two alternative exons (A1 and A2 ) may be included in the isoform, but not both. (d) Intron inclusion. An intron may be included in the mature mRNA strand. spliced to yield different combinations of exons (called isoforms), a phenomenon referred to as alternative splicing (AS). There are four major types of AS as shown in Figure 1. Many multi-exon genes may undergo more than one alternative splicing event, resulting in many possible isoforms from a single gene. [2] In addition to adding to the genetic repertoire of an organism by enabling a single gene to code for more than one protein, AS has been shown to be critical for gene regulation, contributing to tissue specificity, and facilitating evolutionary processes. Despite the evident importance of AS, its regulation and impact on specific genes remains poorly understood. The work presented here is concerned with the inference of single cassette exon AS levels (Figure 1a) based on data obtained from RNA expression arrays, also known as microarrays. 1.1 An exon microarray data set that probes alternative splicing events Although it is possible to directly analyze the proteins synthesized by a cell, it is easier, and often more informative, to instead measure the abundance of mRNA present. Traditionally, gene expression (abundance of mRNA) has been studied using low throughput techniques (such as RT-PCR or Northern blots), limited to studying a few sequences at a time and making large scale analysis nearly impossible. In the early 1990s, microarray technology emerged as a method capable of measuring the expression of thousands of DNA sequences simultaneously. Sequences of interest are deposited on a substrate the size of a small microscope slide, to form probes. The mRNA is extracted from the cell and reverse-transcribed back into DNA, which is labelled with red and green fluorescent dye molecules (cy3 and cy5 respectively). When the sample of tagged DNA is washed over the slide, complementary strands of DNA from the sample hybridize to the probes on the array forming A-T and C-G pairings. The slide is then scanned and the fluorescent intensity is measured at each probe. It is generally assumed that the intensity measure at the probe is linearly related to the abundance of mRNA in the cell over a wide dynamic range. Despite significant improvements in microarray technologies in recent years, microarray data still presents some difficulties in analysis. Low measurements tend to have extremely low signal to noise ratio (SNR) [7] and probes often bind to sequences that are very similar, but not identical, to the one for which they were designed (a process referred to as cross- C1 A C1 A C1:A C1 C2 C2 3 Body probes A:C2 A C2 2 Inclusion junction probes C1:C2 C1 C2 1 Exclusion junction probe Figure 2: Each alternative splicing event is studied using six probes. Probes were chosen to measure the expression levels of each of the three exons involved in the event. Additionally, 3 probes are used that target the junctions that are formed by each of the two isoforms. The inclusion isoform would express the junctions formed by C1 and A, and A and C2 , while the exclusion isoform would express the junction formed by C1 and C2 hybridization). Additionally, probes exhibit somewhat varying hybridization efficiency, and sequences exhibit varying labelling efficiency. To design our data sets, we mined public sequence databases and identified exons that were strong candidates for exhibiting AS (the details of that analysis are provided elsewhere [4, 3]). Of the candidates, 3,126 potential AS events in 2,647 unique mouse genes were selected for the design of Agilent Custom Oligonucleotide microarray. The arrays were hybridized with unamplified mRNA samples extracted from 10 wild-type mouse tissues (brain, heart, intestine, kidney, liver, lung, salivary gland, skeletal muscle, spleen, and testis). Each AS event has six target probes on the arrays, chosen from regions of the C1 exon, C2 exon, A exon, C1 :A splice junction, A:C2 splice junction, and C1 :C2 splice junction, as shown in Figure 2. 2 Unsupervised discovery of alternative splicing With the exception of the probe measuring the alternative exon, A (Figure 2), all probes measure sequences that occur in both isoforms. For example, while the sequence of the probe measuring the junction A:C1 is designed to measure the inclusion isoform, half of it corresponds to a sequence that is found in the exclusion isoform. We can therefore safely assume that the measured intensity at each probe is a result of a certain amount of both isoforms binding to the probe. Due to the generally assumed linear relationship between the abundance of mRNA hybridized at a probe and the fluorescent intensity measured, we model the measured intensity as a weighted sum of the overall abundance of the two isoforms. A stronger assumption is that of a single, consistent hybridization profile for both isoforms across all probes and all slides. Ideally, one would prefer to estimate an individual hybridization profile for each AS event studied across all slides. However, in our current setup, the number of tissues is small (10), resulting in two difficulties. First, the number of parameters is very large when compared to the number of data point using this model, and second, a portion of the events do not exhibit tissue specific alternative splicing within our small set of tissues. While the first hurdle could be accounted for using Baysian parameter estimation, the second cannot. 2.1 GenASAP - a generative model for alternative splicing array platform Using the setup described above, the expression vector x, containing the six microarray measurements as real numbers, can be decomposed as a linear combination of the abundance of the two splice isoforms, represented by the real vector s, with some added noise: x = Λs + noise, where Λ is a 6 × 2 weight matrix containing the hybridization profiles for s1 ^ xC ^ xC 1 2 s2 ^ xC :A ^ xA ^ xA:C 2 ^ xC1:C2 2 xC1:C2 1 r xC xC 2 xA xC :A xA:C oC1 oC2 oA oC :A oA:C 1 1 1 2 oC :C 1 2 Σn 2 Figure 3: Graphical model for alternative splicing. Each measurement in the observed expression profile, x, is generated by either using a scale factor, r, on a linear combination of the isoforms, s, or drawing randomly from an outlier model. For a detailed description of the model, see text. the two isoforms across the six probes. Note that we may not have a negative amount of a given isoform, nor can the presence of an isoform deduct from the measured expression, and so both s and Λ are constrained to be positive. Expression levels measured by microarrays have previously been modelled as having expression-dependent noise [7]. To address this, we rewrite the above formulation as x = r(Λs + ε), (1) where r is a scale factor and ε is a zero-mean normally distributed random variable with a diagonal covariance matrix, Ψ, denoted as p(ε) = N (ε; 0, Ψ). The prior distribution for the abundance of the splice isoforms is given by a truncated normal distribution, denoted as p(s) ∝ N (s, 0, I)[s ≥ 0], where [·] is an indicator function such that [s ≥ 0] = 1 if ∀i, si ≥ 0, and [s ≥ 0] = 0 otherwise. Lastly, there is a need to account for aberrant observations (e.g. due to faulty probes, flakes of dust, etc.) with an outlier model. The complete GenASAP model (shown in Figure 3) accounts for the observations as the outcome of either applying equation (1) or an outlier model. To avoid degenerate cases and ensure meaningful and interpretable results, the number of faulty probes considered for each AS event may not exceed two, as indicated by the filled-in square constraint node in Figure 3. The distribution of x conditional on the latent variables, s, r, and o, is: N (xi ; rΛi s, r2 Ψi )[oi =0] N (xi ; Ei , Vi )[oi =1] , p(x|s, r, o) = (2) i where oi ∈ {0, 1} is a bernoulli random variable indicating if the measurement at probe xi is the result of the AS model or the outlier model parameterized by p(oi = 1) = γi . The parameters of the outlier model, E and V, are not optimized and are set to the mean and variance of the data. 2.2 Variational learning in the GenASAP model To infer the posterior distribution over the splice isoform abundances while at the same time learning the model parameters we use a variational expectation-maximization algorithm (EM). EM maximizes the log likelihood of the data by iteratively estimating the posterior distribution of the model given the data in the expectation (E) step, and maximizing the log likelihood with respect to the parameters, while keeping the posterior fixed, in the maximization (M) step. Variational EM is used when, as in the case of GenASAP, the exact posterior is intractable. Variational EM minimizes the free energy of the model, defined as the KL-divergence between the joint distribution of the latent and observed variables and the approximation to the posterior under the model parameters [5, 6]. We approximate the true posterior using the Q distribution given by T Q({s(t) }, {o(t) }, {r(t) }) = (t) Q(r(t) )Q(o(t) |r(t) ) t=1 (t) Q(si |oi , r(t) ) i −1 T =Z (t) (3) d d ρ(t) ω (t) N (s(t) ; µ(t) , Σ(t) )[s(t) ≥ 0], ro ro t=1 where Z is a normalization constant, the superscript d indicates that Σ is constrained to be diagonal, and there are T iid AS events. For computational efficiency, r is selected from a finite set, r ∈ {r1 , r2 , . . . , rC } with uniform probability. The variational free energy is given by Q({s(t) }, {o(t) }, {r(t) }) . P ({s(t) }, {o(t) }, {r(t) }, {x(t) }) s r o (4) Variational EM minimizes the free energy by iteratively updating the Q distribution’s vari(t)d (t)d ational parameters (ρ(t) , ω (t) , µro , and Σro ) in the E-step, and the model parameters (Λ, Ψ, {r1 , r2 , . . . , rC }, and γ) in the M-step. The resulting updates are too long to be shown in the context of this paper and are discussed in detail elsewhere [3]. A few particular points regarding the E-step are worth covering in detail here. Q({s(t) }, {o(t) }, {r(t) }) log F(Q, P ) = If the prior on s was a full normal distribution, there would be no need for a variational approach, and exact EM is possible. For a truncated normal distribution, however, the mixing proportions, Q(r)Q(o|r) cannot be calculated analytically except for the case where s is scalar, necessitating the diagonality constraint. Note that if Σ was allowed to be a full covariance matrix, equation (3) would be the true posterior, and we could find the sufficient statistics of Q(s(t) |o(t) , r(t) ): −1 µ(t) = (I + ΛT (I − O(t) )T Ψ−1 (I − O(t) )Λ)−1 ΛT (I − O(t) )T Ψ−1 x(t) r(t) ro −1 Σ(t) ro = (I + ΛT (I − O(t) )T Ψ−1 (I − O(t) )Λ) (5) (6) where O is a diagonal matrix with elements Oi,i = oi . Furthermore, it can be easily shown that the optimal settings for µd and Σd approximating a normal distribution with full covariance Σ and mean µ is µd optimal = µ −1 Σd optimal = diag(Σ (7) −1 ) (8) In the truncated case, equation (8) is still true. Equation (7) does not hold, though, and µd optimal cannot be found analytically. In our experiments, we found that using equation (7) still decreases the free energy every E-step, and it is significantly more efficient than using, for example, a gradient decent method to compute the optimal µd . Intuitive Weigh Matrix Optimal Weight Matrix 50 50 40 40 30 30 20 20 10 0 10 Inclusion Isoform 0 Exclusion Isoform Inclusion Isoform (a) Exclusion Isoform (b) Figure 4: (a) An intuitive set of weights. Based on the biological background, one would expect to see the inclusion isoform hybridize to the probes measuring C1 , C2 , A, C1 :A, and A:C2 , while the exclusion isoform hybridizes to C1 , C2 , and C1 :C2 . (b) The learned set of weights closely agrees with the intuition, and captures cross hybridization between the probes Contribution of exclusion isoform Contribution of inclusion isoform AS model Original Data (a) RT−PCR AS model measurement prediction (% exclusion) (% exclusion) 14% 72% (b) 27% 70% 8% 22% outliers (c) Figure 5: Three examples of data cases and their predictions. (a) The data does not follow our notion of single cassette exon AS, but the AS level is predicted accurately by the model.(b) The probe C1 :A is marked as outlier, allowing the model to predict the other probes accurately. (c) Two probes are marked as outliers, and the model is still successful in predicting the AS levels. 3 Making biological predictions about alternative splicing The results presented in this paper were obtained using two stages of learning. In the first step, the weight matrix, Λ, is learned on a subset of the data that is selected for quality. Two selection criteria were used: (a) sequencing data was used to select those cases for which, with high confidence, no other AS event is present (Figure 1) and (b) probe sets were selected for high expression, as determined by a set of negative controls. The second selection criterion is motivated by the common assumption that low intensity measurements are of lesser quality (see Section 1.1). In the second step, Λ is kept fixed, and we introduce the additional constraint that the noise is isotropic (Ψ = ψI) and learn on the entire data set. The constraint on the noise is introduced to prevent the model from using only a subset of the six probes for making the final set of predictions. We show a typical learned set of weights in Figure 4. The weights fit well with our intuition of what they should be to capture the presence of the two isoforms. Moreover, the learned weights account for the specific trends in the data. Examples of model prediction based on the microarray data are shown in Figure 5. Due to the nature of the microarray data, we do not expect all the inferred abundances to be equally good, and we devised a scoring criterion that ranks each AS event based on its fit to the model. Intuitively, given two input vectors that are equivalent up to a scale factor, with inferred MAP estimations that are equal up to the same scale factor, we would like their scores to be identical. The scoring criterion used, therefore is k (xk − rΛk s)2 /(xk + Rank 500 1000 2000 5000 10000 15000 20000 30000 Pearson’s correlation coefficient 0.94 0.95 0.95 0.79 0.79 0.78 0.75 0.65 False positive rate 0.11 0.08 0.05 0.2 0.25 0.29 0.32 0.42 Table 1: Model performance evaluated at various ranks. Using 180 RT-PCR measurements, we are able to predict the model’s performance at various ranks. Two evaluation criteria are used: Pearson’s correlation coefficient between the model’s predictions and the RT-PCR measurements and false positive rate, where a prediction is considered to be false positive if it is more than 15% away from the RT-PCR measurement. rΛk s)2 , where the MAP estimations for r and s are used. This scoring criterion can be viewed as proportional to the sum of noise to signal ratios, as estimated using the two values given by the observation and the model’s best prediction of that observation. Since it is the relative amount of the isoforms that is of most interest, we need to use the inferred distribution of the isoform abundances to obtain an estimate for the relative levels of AS. It is not immediately clear how this should be done. We do, however, have RTPCR measurements for 180 AS events to guide us (see figure 6 for details). Using the top 50 ranked RT-PCR measurement, we fit three parameters, {a1 , a2 , a3 }, such that the s2 proportion of excluded isoform present, p, is given by p = a1 s1 +a2 s2 + a3 , where s1 is the MAP estimation of the abundance of the inclusion isoform, s2 is the MAP estimation of the abundance of the exclusion isoform, and the RT-PCR measurement are used for target p. The parameters are fitted using gradient descent on a least squared error (LSE) evaluation criterion. We used two criteria to evaluate the quality of the AS model predictions. Pearson’s correlation coefficient (PCC) is used to evaluate the overall ability of the model to correctly estimate trends in the data. PCC is invariant to affine transformation and so is independent of the transformation parameters a1 and a3 discussed above, while the parameter a2 was found to effect PCC very little. The PCC stays above 0.75 for the top two thirds ranked predictions. The second evaluation criterion used is the false positive rate, where a prediction is considered to be false positive if it is more than 15% away from the RT-PCR measurement. This allows us to say, for example, that if a prediction is within the top 10000, we are 75% confident that it is within 15% of the actual levels of AS. 4 Summary We designed a novel AS model for the inference of the relative abundance of two alternatively spliced isoforms from six measurements. Unsupervised learning in the model is performed using a structured variational EM algorithm, which correctly captures the underlying structure of the data, as suggested by its biological nature. The AS model, though presented here for a cassette exon AS events, can be used to learn any type of AS, and with a simple adjustment, multiple types. The predictions obtained from the AS model are currently being used to verify various claims about the role of AS in evolution and functional genomics, and to help identify sequences that affect the regulation of AS. % Exclusion isoform RT−PCR measurement Vs. AS model predictions RT−PCR measurements: 90 80 AS model prediction Int es Te tine sti Kid s n Sa ey liva Br ry ain Sp le Liv en er Mu sc Lu le ng 100 70 60 50 40 30 14 22 27 32 47 46 66 78 63 20 AS model prediction: 10 27 24 26 26 51 75 60 85 100 (a) 0 0 20 40 60 80 RT−PCR measurement 100 (b) Figure 6: (a) Sample RT-PCR. RNA extracted from the cell is reverse-transcribed to DNA, amplified and labelled with radioactive or fluorescent molecules. The sample is pulled through a viscous gel in an electric field (DNA, being an acid, is positively charged). Shorter strands travel further through the gel than longer ones, resulting in two distinct bands, corresponding to the two isoforms, when exposed to a photosensitive or x-ray film. (b) A scatter plot showing the RT-PCR measurements as compared to the AS model predictions. The plot shows all available RT-PCR measurements with a rank of 8000 or better. The AS model presented assumes a single weight matrix for all data cases. This is an oversimplified view of the data, and current work is being carried out in identifying probe specific expression profiles. However, due to the low dimensionality of the problem (10 tissues, six probes per event), care must be taken to avoid overfitting and to ensure meaningful interpretations. Acknowledgments We would like to thank Wen Zhang, Naveed Mohammad, and Timothy Hughes for their contributions in generating the data set. This work was funded in part by an operating and infrastructure grants from the CIHR and CFI, and a operating grants from NSERC and a Premier’s Research Excellence Award. References [1] J. M. Johnson et al. Genome-wide survey of human alternative pre-mrna splicing with exon junction microarrays. Science, 302:2141–44, 2003. [2] L. Cartegni et al. Listening to silence and understanding nonsense: exonic mutations that affect splicing. Nature Gen. Rev., 3:285–98, 2002. [3] Q. Pan et al. Revealing global regulatory features of mammalian alternative splicing using a quantitative microarray platform. Molecular Cell, 16(6):929–41, 2004. [4] Q. Pan et al. Alternative splicing of conserved exons is frequently species specific in human and mouse. Trends Gen., In Press, 2005. [5] M. I. Jordan, Z. Ghahramani, T. Jaakkola, and Lawrence K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183– 233, 1999. [6] R. M. Neal and G. E. Hinton. A view of the em algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models. Cambridge, MIT Press, 1998. [7] D. M. Rocke and B. Durbin. A model for measurement error for gene expression arrays. Journal of Computational Biology, 8(6):557–69, 2001.
4 0.57778049 167 nips-2004-Semi-supervised Learning with Penalized Probabilistic Clustering
Author: Zhengdong Lu, Todd K. Leen
Abstract: While clustering is usually an unsupervised operation, there are circumstances in which we believe (with varying degrees of certainty) that items A and B should be assigned to the same cluster, while items A and C should not. We would like such pairwise relations to influence cluster assignments of out-of-sample data in a manner consistent with the prior knowledge expressed in the training set. Our starting point is probabilistic clustering based on Gaussian mixture models (GMM) of the data distribution. We express clustering preferences in the prior distribution over assignments of data points to clusters. This prior penalizes cluster assignments according to the degree with which they violate the preferences. We fit the model parameters with EM. Experiments on a variety of data sets show that PPC can consistently improve clustering results.
5 0.55779576 143 nips-2004-PAC-Bayes Learning of Conjunctions and Classification of Gene-Expression Data
Author: Mario Marchand, Mohak Shah
Abstract: We propose a “soft greedy” learning algorithm for building small conjunctions of simple threshold functions, called rays, defined on single real-valued attributes. We also propose a PAC-Bayes risk bound which is minimized for classifiers achieving a non-trivial tradeoff between sparsity (the number of rays used) and the magnitude of the separating margin of each ray. Finally, we test the soft greedy algorithm on four DNA micro-array data sets. 1
6 0.49681458 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
7 0.46273932 161 nips-2004-Self-Tuning Spectral Clustering
8 0.45458156 77 nips-2004-Hierarchical Clustering of a Mixture Model
9 0.43876958 198 nips-2004-Unsupervised Variational Bayesian Learning of Nonlinear Models
10 0.42933574 115 nips-2004-Maximum Margin Clustering
11 0.42281073 128 nips-2004-Neural Network Computation by In Vitro Transcriptional Circuits
12 0.42044774 10 nips-2004-A Probabilistic Model for Online Document Clustering with Application to Novelty Detection
13 0.40380779 50 nips-2004-Dependent Gaussian Processes
14 0.39241782 124 nips-2004-Multiple Alignment of Continuous Time Series
15 0.38972753 13 nips-2004-A Three Tiered Approach for Articulated Object Action Modeling and Recognition
16 0.38924918 158 nips-2004-Sampling Methods for Unsupervised Learning
17 0.36954924 61 nips-2004-Efficient Out-of-Sample Extension of Dominant-Set Clusters
18 0.36634803 108 nips-2004-Markov Networks for Detecting Overalpping Elements in Sequence Data
19 0.36516485 51 nips-2004-Detecting Significant Multidimensional Spatial Clusters
20 0.36384243 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform
topicId topicWeight
[(13, 0.102), (15, 0.153), (25, 0.014), (26, 0.038), (31, 0.036), (33, 0.2), (35, 0.011), (39, 0.042), (50, 0.046), (77, 0.272)]
simIndex simValue paperId paperTitle
1 0.88752818 141 nips-2004-Optimal sub-graphical models
Author: Mukund Narasimhan, Jeff A. Bilmes
Abstract: We investigate the problem of reducing the complexity of a graphical model (G, PG ) by finding a subgraph H of G, chosen from a class of subgraphs H, such that H is optimal with respect to KL-divergence. We do this by first defining a decomposition tree representation for G, which is closely related to the junction-tree representation for G. We then give an algorithm which uses this representation to compute the optimal H ∈ H. Gavril [2] and Tarjan [3] have used graph separation properties to solve several combinatorial optimization problems when the size of the minimal separators in the graph is bounded. We present an extension of this technique which applies to some important choices of H even when the size of the minimal separators of G are arbitrarily large. In particular, this applies to problems such as finding an optimal subgraphical model over a (k − 1)-tree of a graphical model over a k-tree (for arbitrary k) and selecting an optimal subgraphical model with (a constant) d fewer edges with respect to KL-divergence can be solved in time polynomial in |V (G)| using this formulation. 1 Introduction and Preliminaries The complexity of inference in graphical models is typically exponential in some parameter of the graph, such as the size of the largest clique. Therefore, it is often required to find a subgraphical model that has lower complexity (smaller clique size) without introducing a large error in inference results. The KL-divergence between the original probability distribution and the probability distribution on the simplified graphical model is often used to measure the impact on inference. Existing techniques for reducing the complexity of graphical models including annihilation and edge-removal [4] are greedy in nature and cannot make any guarantees regarding the optimality of the solution. This problem is NP-complete [9] and so, in general, one cannot expect a polynomial time algorithm to find the optimal solution. However, we show that when we restrict the problem to some sets of subgraphs, the optimal solution can be found quite quickly using a dynamic programming algorithm in time polynomial in the tree-width of the graph. 1.1 Notation and Terminology A graph G = (V, E) is said to be triangulated if every cycle of length greater than 3 has a chord. A clique of G is a non-empty set S ⊆ V such that {a, b} ∈ E for all ∗ This work was supported by NSF grant IIS-0093430 and an Intel Corporation Grant. {b, c, d} d {c, f, g} {b, c} {b, e, c} b c {f, c} {c, e} {e, c, f } g {b, e} a e f {a, b, e} Figure 1: A triangulated graph G and a junction-tree for G a, b ∈ S. A clique S is maximal if S is not properly contained in another clique. If α and β are non-adjacent vertices of G then a set of vertices S ⊆ V \ {α, β} is called an (α, β)-separator if α and β are in distinct components of G[V \ S]. S is a minimal (α, β)-separator if no proper subset of S is an (α, β)-separator. S is said to be a minimal separator if S is a minimal (α, β)-separator for some non adjacent a, b ∈ V . If T = (K, S) is a junction-tree for G (see [7]), then the nodes K of T correspond to the maximalcliques of G, while the links S correspond to minimal separators of G (We reserve the terms vertices/edges for elements of G, and nodes/links for the elements of T ). If G is triangulated, then the number of maximal cliques is at most |V |. For example, in the graph G shown in Figure 1, K = {{b, c, d} , {a, b, e} , {b, e, c} , {e, c, f } , {c, f, g}}. The links S of T correspond to minimal-separators of G in the following way. If Vi Vj ∈ S (where Vi , Vj ∈ K and hence are cliques of G), then Vi ∩ Vj = φ. We label each edge Vi Vj ∈ S with the set Vij = Vi ∩ Vj , which is a non-empty complete separator in G. The removal of any link Vi Vj ∈ S disconnects T into two subtrees which we denote T (i) and T (j) (chosen so that T (i) contains Vi ). We will let K(i) be the nodes of T (i) , and V (i) = ∪V ∈K (i) V be the set of vertices corresponding to the subtree T (i) . The junction tree property ensures that V (i) ∩ V (j) = Vi ∩ Vj = Vij . We will let G(i) be the subgraph induced by V (i) . A graphical model is a pair (G, P ) where P is the joint probability distribution for random variables X1 , X2 , . . . , Xn , and G is a graph with vertex set V (G) = {X1 , X2 , . . . , Xn } such that the separators in G imply conditional independencies in P (so P factors according to G). If G is triangulated, then the junction-tree algorithm can be used for exact inference in the probability distribution P . The complexity of this algorithm grows with the treewidth of G (which is one less than the size of the largest clique in G when G is triangulated). The growth is exponential when P is a discrete probability distribution, thus rendering exact inference for graphs with large treewidth impractical. Therefore, we seek another graphical model (H, PH ) which allows tractable inference (so H should have lower treewidth than G has). The general problem of finding a graphical model of tree-width at most k so as to minimize the KL-divergence from a specified probability distribution is NP complete for general k ([9]) However, it is known that this problem is solvable in polynomial time (in |V (G)|) for some special cases cases (such as when G has bounded treewidth or when k = 1 [1]). If (G, PG ) and (H, PH ) are graphical models, then we say that (H, PH ) is a subgraphical model of (G, PG ) if H is a spanning subgraph of G. Note in particular that separators in G are separators in H, and hence (G, PH ) is also a graphical model. 2 Graph Decompositions and Divide-and-Conquer Algorithms For the remainder of the paper, we will be assuming that G = (V, E) is some triangulated graph, with junction tree T = (K, S). As observed above, if Vi Vj ∈ S, then the removal {b, c, d} d {b, c} {b, e, c} b c c {c, f, g} {f, c} {e, c, f } g {b, e} a e e f {a, b, e} Figure 2: The graphs G(i) , G(j) and junction-trees T (i) and T (j) resulting from the removal of the link Vij = {c, e} of Vij = Vi ∩ Vj disconnects G into two (vertex-induced) subgraphs G(i) and G(j) which are both triangulated, with junction-trees T (i) and T (j) respectively. We can recursively decompose each of G(i) and G(j) into smaller and smaller subgraphs till the resulting subgraphs are cliques. When the size of all the minimal separators are bounded, we may use these decompositions to easily solve problems that are hard in general. For example, in [5] it is shown that NP-complete problems like vertex coloring, and finding maximum independent sets can be solved in polynomial time on graphs with bounded tree-width (which are equivalent to spanning graphs with bounded size separators). We will be interested in finding (triangulated) subgraphs of G that satisfy some conditions, such as a bound on the number of edges, or a bound on the tree-width and which optimize separable objective functions (described in Section 2) One reason why problems such as this can often be solved easily when the tree-width of G is bounded by some constant is this : If Vij is a separator decomposing G into G(i) and G(j) , then a divide-and-conquer approach would suggest that we try and find optimal subgraphs of G(i) and G(j) and then splice the two together to get an optimal subgraph of G. There are two issues with this approach. First, the optimal subgraphs of G (i) and G(j) need not necessarily match up on Vij , the set of common vertices. Second, even if the two subgraphs agree on the set of common vertices, the graph resulting from splicing the two subgraphs together need not be triangulated (which could happen even if the two subgraphs individually are triangulated). To rectify the situation, we can do the following. We partition the set of subgraphs of G(i) and G(j) into classes, so that any subgraph of G(i) and any subgraph G(j) corresponding to the same class are compatible in the sense that they match up on their intersection namely Vij , and so that by splicing the two subgraphs together, we get a subgraph of G which is acceptable (and in particular is triangulated). Then given optimal subgraphs of both G(i) and G(j) corresponding to each class, we can enumerate over all the classes and pick the best one. Of course, to ensure that we do not repeatedly solve the same problem, we need to work bottom-up (a.k.a dynamic programming) or memoize our solutions. This procedure can be carried out in polynomial (in |V |) time as long as we have only a polynomial number of classes. Now, if we have a polynomial number of classes, these classes need not actually be a partition of all the acceptable subgraphs, though the union of the classes must cover all acceptable subgraphs (so the same subgraph can be contained in more than one class). For our application, every class can be thought of to be the set of subgraphs that satisfy some constraint, and we need to pick a polynomial number of constraints that cover all possibilities. The bound on the tree-width helps us here. If k |Vij | = k, then in any subgraph H of G, H[Vij ] must be one of the 2(2) possible subgraphs k of G[Vij ]. So, if k is sufficiently small (so 2(2) is bounded by some polynomial in |V |), then this procedure results in a polynomial time algorithm. In this paper, we show that in some cases we can characterize the space H so that we still have a polynomial number of constraints even when the tree-width of G is not bounded by a small constant. 2.1 Separable objective functions For cases where exact inference in the graphical model (G, PG ) is intractable, it is natural to try to find a subgraphical model (H, PH ) such that D(PG PH ) is minimized, and inference using H is tractable. We will denote by H the set of subgraphs of G that are tractable for inference. For example, this set could be the set of subgraphs of G with treewidth one less than the treewidth of G, or perhaps the set of subgraphs of G with at d fewer edges. For a specified subgraph H of G, there is a unique probability distribution PH factoring over H that minimizes D(PG PH ). Hence, finding a optimal subgraphical model is equivalent to finding a subgraph H for which D(PG PH ) is minimized. If Vij is a separator of G, we will attempt to find optimal subgraphs of G by finding optimal subgraphs of G (i) and G(j) and splicing them together. However, to do this, we need to ensure that the objective criteria also decomposes along the separator Vij . Suppose that H is any triangulated subgraph of G. Let PG(i) and PG(j) be the (marginalized) distributions of PG on V (i) and V (j) respectively, and PH (i) and PH (j) be the (marginalized) distributions of the distribution PH on V (i) and V (j) where H (i) = H[V (i) ] and H (j) = H[V (j) ], The following result assures us that the KL-divergence also factors according to the separator Vij . Lemma 1. Suppose that (G, PG ) is a graphical model, H is a triangulated subgraph of G, and PH factors over H. Then D(PG PH ) = D(PG(i) PH (i) ) + D(PG(j) PH (j) ) − D(PG[Vij ] PH[Vij ] ). Proof. Since H is a subgraph of G, and Vij is a separator of G, Vij must also be a separator of H. Therefore, PH {Xv }v∈V = PH (i) ({Xv }v∈V (i) )·PH (j) ({Xv }v∈V (j) ) . PH[Vij ] ({Xv }v∈V ) The result ij follows immediately. Therefore, there is hope that we can reduce our our original problem of finding an optimal subgraph H ∈ H as one of finding subgraphs of H (i) ⊆ G(i) and H (j) ⊆ G(j) that are compatible, in the sense that they match up on the overlap Vij , and for which D(PG PH ) is minimized. Throughout this paper, for the sake of concreteness, we will assume that the objective criterion is to minimize the KL-divergence. However, all the results can be extended to other objective functions, as long as they “separate” in the sense that for any separator, the objective function is the sum of the objective functions of the two parts, possibly modulo some correction factor which is purely a function of the separator. Another example might be the complexity r(H) of representing the graphical model H. A very natural representation satisfies r(G) = r(G(i) ) + r(G(j) ) if G has a separator G(i) ∩ G(j) . Therefore, the representation cost reduction would satisfy r(G) − r(H) = (r(G (i) ) − r(H (i) )) + (r(G(j) ) − r(H (j) )), and so also factors according to the separators. Finally note that any linear combinations of such separable functions is also separable, and so this technique could also be used to determine tradeoffs (representation cost vs. KL-divergence loss for example). In Section 4 we discuss some issues regarding computing this function. 2.2 Decompositions and decomposition trees For the algorithms considered in this paper, we will be mostly interested in the decompositions that are specified by the junction tree, and we will represent these decompositions by a rooted tree called a decomposition tree. This representation was introduced in [2, 3], and is similar in spirit to Darwiche’s dtrees [6] which specify decompositions of directed acyclic graphs. In this section and the next, we show how a decomposition tree for a graph may be constructed, and show how it is used to solve a number of optimization problems. abd; ce; gf a; be; cd d; bc; e abe dbc ebc e; cf ; g cef cf g Figure 3: The separator tree corresponding to Figure 1 A decomposition tree for G is a rooted tree whose vertices correspond to separators and cliques of G. We describe the construction of the decomposition tree in terms of a junctiontree T = (K, S) for G. The interior nodes of the decomposition tree R(T ) correspond to S (the links of T and hence the minimal separators of G). The leaf or terminal nodes represent the elements of K (the nodes of T and hence the maximal cliques of G). R(T ) can be recursively constructed from T as follows : If T consists of just one node K, (and hence no edges), then R consists of just one node, which is given the label K as well. If however, T has more than one node, then T must contain at least one link. To begin, let Vi Vj ∈ S be any link in T . Then removal of the link Vi Vj results in two disjoint junctiontrees T (i) and T (j) . We label the root of R by the decomposition (V (i) ; Vij ; V (j) ). The rest of R is recursively built by successively picking links of T (i) and T (j) (decompositions of G(i) and G(j) ) to form the interior nodes of R. The effect of this procedure on the junction tree of Figure 1 is shown in Figure 3, where the decomposition associated with the interior nodes is shown inside the nodes. Let M be the set of all nodes of R(T ). For any interior node M induced by the the link Vi Vj ∈ S of T , then we will let M (i) and M (j) represent the left and right children of M , and R(i) and R(j) be the left and right trees below M . 3 3.1 Finding optimal subgraphical models Optimal sub (k − 1)-trees of k-trees Suppose that G is a k-tree. A sub (k − 1)-tree of G is a subgraph H of G that is (k − 1)tree. Now, if Vij is any minimal separator of G, then both G(i) and G(j) are k-trees on vertex sets V (i) and V (j) respectively. It is clear that the induced subgraphs H[V (i) ] and H[V (j) ] are subgraphs of G(i) and G(j) and are partial (k − 1)-trees. We will be interested in finding sub (k − 1)-trees of k trees and this problem is trivial by the result of [1] when k = 2. Therefore, we assume that k ≥ 3. The following result characterizes the various possibilities for H[Vij ] in this case. Lemma 2. Suppose that G is a k-tree, and S = Vij is a minimal separator of G corresponding to the link ij of the junction-tree T . In any (k − 1)-tree H ⊆ G either 1. There is a u ∈ S such that u is not connected to vertices in both V (i) \ S and V (j) \ S. Then S \ {u} is a minimal separator in H and hence is complete. 2. Every vertex in S is connected to vertices in both V (i) \S and V (j) \S. Then there are vertices {x, y} ⊆ S such that the edge H[S] is missing only the edge {x, y}. Further either H[V (i) ] or H[V (j) ] does not contain a unchorded x-y path. Proof. We consider two possibilities. In the first, there is some vertex u ∈ S such that u is not connected to vertices in both V (i) \S and V (j) \. Since the removal of S disconnects G, the removal of S must also disconnect H. Therefore, S must contain a minimal separator of H. Since H is a (k − 1)-tree, all minimal separators of H must contain k − 1 vertices which must therefore be S \{u}. This corresponds to case (1) above. Clearly this possiblity can occur. If there is no such u ∈ S, then every vertex in S is connected to vertices in both V (i) \ S and V (j) \ S. If x ∈ S is connected to some yi ∈ V (i) \ S and yj ∈ V (j) \ S, then x is contained in every minimal yi /yj separator (see [5]). Therefore, every vertex in S is part of a minimal separator. Since each minimal separator contains k − 1 vertices, there must be at least two distinct minimum separators contained in S. Let Sx = S \ {x} and Sy = S \ {y} be two distinct minimal separators. We claim that H[S] contains all edges except the edge {x, y}. To see this, note that if z, w ∈ S, with z = w and {z, w} = {x, y} (as sets), then either {z, w} ⊂ Sy or {z, w} ⊂ Sx . Since both Sx and Sy are complete in H, this edge must be present in H. The edge {x, y} is not present in H[S] because all minimal separators in H must be of size k − 1. Further, if both V (i) and V (j) contain an unchorded path between x and y, then by joining the two paths at x and y, we get a unchorded cycle in H which contradicts the fact that H is triangulated. Therefore, we may associate k · 2 + 2 · k constraints with each separator Vij of G as 2 follows. There are k possible constraints corresponding to case (1) above (one for each choice of x), and k · 2 choices corresponding to case (2) above. This is because for each 2 pair {x, y} corresponding to the missing edge, we have either V (i) contains no unchorded xy paths or V (j) contains no unchorded xy paths. More explicitly, we can encode the set of constraints CM associated with each separator S corresponding to an interior node M of the decomposition tree as follows: CM = { (x, y, s) : x ∈ S, y ∈ S, s ∈ {i, j}}. If y = x, then this corresponds to case (1) of the above lemma. If s = i, then x is connected only to H (i) and if s = j, then x is connected only to H (j) . If y = x, then this corresponds to case (2) in the above lemma. If s = i, then H (i) does not contain any unchorded path between x and y, and there is no constraint on H (j) . Similarly if s = j, then H (j) does not contain any unchorded path between x and y, and there is no constraint on H (i) . Now suppose that H (i) and H (j) are triangulated subgraphs of G(i) and G(j) respectively, then it is clear that if H (i) and H (j) both satisfy the same constraint they must match up on the common vertices Vij . Therefore to splice together two solutions corresponding to the same constraint, we only need to check that the graph obtained by splicing the graphs is triangulated. Lemma 3. Suppose that H (i) and H (j) are triangulated subgraphs of G(i) and G(j) respectively such that both of them satisfy the same constraint as described above. Then the graph H obtained by splicing H (i) and H (j) together is triangulated. Proof. Suppose that both H (i) and H (j) are both triangulated and both satisfy the same constraint. If both H (i) and H (j) satisfy the same constraint corresponding to case (1) in Lemma 2 and H has an unchorded cycle, then this cycle must involve elements of both H (i) and H (j) . Therefore, there must be two vertices of S \{u} on the cycle, and hence this cycle has a chord as S \ {u} is complete. This contradiction shows that H is triangulated. So assume that both of them satisfy the constraint corresponding to case (2) of Lemma 2. Then if H is not triangulated, there must be a t-cycle (for t ≥ 4) with no chord. Now, since {x, y} is the only missing edge of S in H, and because H (i) and H (j) are individually triangulated, the cycle must contain x, y and vertices of both V (i) \ S and V (j) \ S. We may split this unchorded cycle into two unchorded paths, one contained in V (i) and one in V (j) thus violating our assumption that both H (i) and H (j) satisfy the same constraint. If |S| = k, then there are 2k + 2 · k ∈ O(k 2 ) ∈ O(n2 ). We can use a divide and conquer 2 strategy to find the optimal sub (k − 1) tree once we have taken care of the base case, where G is just a single clique (of k + 1) elements. However, for this case, it is easily checked that any subgraph of G obtained by deleting exactly one edge results in a (k − 1) tree, and every sub (k−1)-tree results from this operation. Therefore, the optimal (k−1)-tree can be found using Algorithm 1, and in this case, the complexity of Algorithm 1 is O(n(k + 1) 2 ). This procedure can be generalized to find the optimal sub (k − d)- tree for any fixed d. However, the number of constraints grows exponentially with d (though it is still polynomial in n). Therefore for small, fixed values of d, we can compute the optimal sub (k − d)-tree of G. While we can compute (k − d)-trees of G by first going from a k tree to a (k − 1) tree, then from a (k − 1)-tree to a (k − 2)-tree, and so on in a greedy fashion, this will not be optimal in general. However, this might be a good algorithm to try when d is large. 3.2 Optimal triangulated subgraphs with |E(G)| − d edges Suppose that we are interested in a (triangulated) subgraph of G that contains d fewer edges that G does. That is, we want to find an optimal subgraph H ⊂ G such that |E(H)| = |E(G)| − d. Note that by the result of [4] there is always a triangulated subgraph with d fewer edges (if d < |E(G)|). Two possibilities for finding such an optimal subgraph are 1. Use the procedure described in [4]. This is a greedy procedure which works in d steps by deleting an edge at each step. At each state, the edge is picked from the set of edges whose deletion leaves a triangulated graph. Then the edge which causes the least increase in KL-divergence is picked at each stage. 2. For each possible subset A of E(G) of size d, whose deletion leaves a triangulated graph, compute the KL divergence using the formula above, and then pick the optimal one. Since there are |E(G)| such sets, this can be done in polynomial d time (in |V (G)|) when d is a constant. The first greedy algorithm is not guaranteed to yield the optimal solution. The second takes time that is O(n2d ). Now, let us solve this problem using the framework we’ve described. Let H be the set of subgraphs of G which may be obtained by deletion of d edges. For each M = ij ∈ M corresponding to the separator Vij , let CM = (l, r, c, s, A) : l + r − c = d, s a d bit string, A ∈ E(G[Vij ]) . The constraint reprec sented by (l, r, c, A) is this : A is a set of d edges of G[Vij ] that are missing in H, l edges are missing from the left subgraph, and r edges are missing from the right subgraph. c represents the double count, and so is subtracted from the total. If k is the size of the largest k clique, then the total number of such constraints is bounded by 2d · 2d · (2) ∈ O(k 2d ) d which could be better than O(n2d ) and is polynomial in |V | when d is constant. See [10] for additional details. 4 Conclusions Algorithm 1 will compute the optimal H ∈ H for the two examples discussed above and is polynomial (for fixed constant d) even if k is O(n). In [10] a generalization is presented which will allow finding the optimal solution for other classes of subgraphical models. Now, we assume an oracle model for computing KL-divergences of probability distributions on vertex sets of cliques. It is clear that these KL-divergences can be computed R ← separator-tree for G; for each vertex M of R in order of increasing height (bottom up) do for each constraint cM of M do if M is an interior vertex of R corresponding to edge ij of the junction tree then Let Ml and Mr be the left and right children of M ; Pick constraint cl ∈ CMl compatible with cM to minimize table[Ml , cl ]; Pick constraint cr ∈ CMr compatible with cM to minimize table[Mr , cr ]; loss ← D(PG [M ] PH [M ]); table[M, cM ] ← table[Ml , cl ] + table[Mr , cr ] − loss; else table[M, cM ] ← D(PG [M ] PH [M ]); end end end Algorithm 1: Finding optimal set of constraints efficiently for distributions like Gaussians, but for discrete distributions this may not be possible when k is large. However even in this case this algorithm will result in only polynomial calls to the oracle. The standard algorithm [3] which is exponential in the treewidth will make O(2k ) calls to this oracle. Therefore, when the cost of computing the KL-divergence is large, this algorithm becomes even more attractive as it results in exponential speedup over the standard algorithm. Alternatively, if we can compute approximate KL-divergences, or approximately optimal solutions, then we can compute an approximate solution by using the same algorithm. References [1] C. Chow and C. Liu, “Approximating discrete probability distributions with dependence trees”, IEEE Transactions on Information Theory, v. 14, 1968, Pages 462–467. [2] F. Gavril, “Algorithms on clique separable graphs”, Discrete Mathematics v. 9 (1977), pp. 159–165. [3] R. E. Tarjan. “Decomposition by Clique Separators”, Discrete Mathematics, v. 55 (1985), pp. 221–232. [4] U. Kjaerulff. “Reduction of computational complexity in Bayesian networks through removal of weak dependencies”, Proceedings of the Tenth Annual Conference on Uncertainty in Artificial Intelligence, pp. 374–382, 1994. [5] T. Kloks, “Treewidth: Computations and Approximations”, Springer-Verlag, 1994. [6] A. Darwiche and M. Hopkins. “Using recursive decomposition to construct elimination orders, jointrees and dtrees”, Technical Report D-122, Computer Science Dept., UCLA. [7] S. Lauritzen. “Graphical Models”, Oxford University Press, Oxford, 1996. [8] T. A. McKee and F. R. McMorris. “Topics in Intersection Graph Theory”, SIAM Monographs on Discrete Mathematics and Applications, 1999. [9] D. Karger and N. Srebro. “Learning Markov networks: Maximum bounded tree-width graphs.” In Symposium on Discrete Algorithms, 2001, Pages 391-401. [10] M. Narasimhan and J. Bilmes. “Optimization on separator-clique trees.”, Technical report UWEETR 2004-10, June 2004.
same-paper 2 0.85266733 90 nips-2004-Joint Probabilistic Curve Clustering and Alignment
Author: Scott J. Gaffney, Padhraic Smyth
Abstract: Clustering and prediction of sets of curves is an important problem in many areas of science and engineering. It is often the case that curves tend to be misaligned from each other in a continuous manner, either in space (across the measurements) or in time. We develop a probabilistic framework that allows for joint clustering and continuous alignment of sets of curves in curve space (as opposed to a fixed-dimensional featurevector space). The proposed methodology integrates new probabilistic alignment models with model-based curve clustering algorithms. The probabilistic approach allows for the derivation of consistent EM learning algorithms for the joint clustering-alignment problem. Experimental results are shown for alignment of human growth data, and joint clustering and alignment of gene expression time-course data.
3 0.8385666 38 nips-2004-Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms
Author: Omid Madani, David M. Pennock, Gary W. Flake
Abstract: In the context of binary classification, we define disagreement as a measure of how often two independently-trained models differ in their classification of unlabeled data. We explore the use of disagreement for error estimation and model selection. We call the procedure co-validation, since the two models effectively (in)validate one another by comparing results on unlabeled data, which we assume is relatively cheap and plentiful compared to labeled data. We show that per-instance disagreement is an unbiased estimate of the variance of error for that instance. We also show that disagreement provides a lower bound on the prediction (generalization) error, and a tight upper bound on the “variance of prediction error”, or the variance of the average error across instances, where variance is measured across training sets. We present experimental results on several data sets exploring co-validation for error estimation and model selection. The procedure is especially effective in active learning settings, where training sets are not drawn at random and cross validation overestimates error. 1
4 0.78571731 48 nips-2004-Convergence and No-Regret in Multiagent Learning
Author: Michael Bowling
Abstract: Learning in a multiagent system is a challenging problem due to two key factors. First, if other agents are simultaneously learning then the environment is no longer stationary, thus undermining convergence guarantees. Second, learning is often susceptible to deception, where the other agents may be able to exploit a learner’s particular dynamics. In the worst case, this could result in poorer performance than if the agent was not learning at all. These challenges are identifiable in the two most common evaluation criteria for multiagent learning algorithms: convergence and regret. Algorithms focusing on convergence or regret in isolation are numerous. In this paper, we seek to address both criteria in a single algorithm by introducing GIGA-WoLF, a learning algorithm for normalform games. We prove the algorithm guarantees at most zero average regret, while demonstrating the algorithm converges in many situations of self-play. We prove convergence in a limited setting and give empirical results in a wider variety of situations. These results also suggest a third new learning criterion combining convergence and regret, which we call negative non-convergence regret (NNR). 1
5 0.71728295 160 nips-2004-Seeing through water
Author: Alexei Efros, Volkan Isler, Jianbo Shi, Mirkó Visontai
Abstract: We consider the problem of recovering an underwater image distorted by surface waves. A large amount of video data of the distorted image is acquired. The problem is posed in terms of finding an undistorted image patch at each spatial location. This challenging reconstruction task can be formulated as a manifold learning problem, such that the center of the manifold is the image of the undistorted patch. To compute the center, we present a new technique to estimate global distances on the manifold. Our technique achieves robustness through convex flow computations and solves the “leakage” problem inherent in recent manifold embedding techniques. 1
6 0.71273208 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning
7 0.71210366 178 nips-2004-Support Vector Classification with Input Data Uncertainty
8 0.7106306 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
9 0.70963645 131 nips-2004-Non-Local Manifold Tangent Learning
10 0.70931464 150 nips-2004-Proximity Graphs for Clustering and Manifold Learning
11 0.70851964 163 nips-2004-Semi-parametric Exponential Family PCA
12 0.70746422 19 nips-2004-An Application of Boosting to Graph Classification
13 0.70590979 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach
14 0.70510286 70 nips-2004-Following Curved Regularized Optimization Solution Paths
15 0.70499599 102 nips-2004-Learning first-order Markov models for control
16 0.70470411 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
17 0.70432335 127 nips-2004-Neighbourhood Components Analysis
18 0.70421839 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
19 0.70385432 104 nips-2004-Linear Multilayer Independent Component Analysis for Large Natural Scenes
20 0.70358026 10 nips-2004-A Probabilistic Model for Online Document Clustering with Application to Novelty Detection