nips nips2006 nips2006-3 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Andrea Vedaldi, Stefano Soatto
Abstract: Image Congealing (IC) is a non-parametric method for the joint alignment of a collection of images affected by systematic and unwanted deformations. The method attempts to undo the deformations by minimizing a measure of complexity of the image ensemble, such as the averaged per-pixel entropy. This enables alignment without an explicit model of the aligned dataset as required by other methods (e.g. transformed component analysis). While IC is simple and general, it may introduce degenerate solutions when the transformations allow minimizing the complexity of the data by collapsing them to a constant. Such solutions need to be explicitly removed by regularization. In this paper we propose an alternative formulation which solves this regularization issue on a more principled ground. We make the simple observation that alignment should simplify the data while preserving the useful information carried by them. Therefore we trade off fidelity and complexity of the aligned ensemble rather than minimizing the complexity alone. This eliminates the need for an explicit regularization of the transformations, and has a number of other useful properties such as noise suppression. We show the modeling and computational benefits of the approach on some of the problems on which IC has been demonstrated. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Image Congealing (IC) is a non-parametric method for the joint alignment of a collection of images affected by systematic and unwanted deformations. [sent-3, score-0.486]
2 The method attempts to undo the deformations by minimizing a measure of complexity of the image ensemble, such as the averaged per-pixel entropy. [sent-4, score-0.258]
3 This enables alignment without an explicit model of the aligned dataset as required by other methods (e. [sent-5, score-0.559]
4 While IC is simple and general, it may introduce degenerate solutions when the transformations allow minimizing the complexity of the data by collapsing them to a constant. [sent-8, score-0.341]
5 We make the simple observation that alignment should simplify the data while preserving the useful information carried by them. [sent-11, score-0.443]
6 Therefore we trade off fidelity and complexity of the aligned ensemble rather than minimizing the complexity alone. [sent-12, score-0.56]
7 This eliminates the need for an explicit regularization of the transformations, and has a number of other useful properties such as noise suppression. [sent-13, score-0.102]
8 1 Introduction Joint pattern alignment attempts to remove from an ensemble of patterns the effect of nuisance transformations of a systematic nature. [sent-15, score-1.073]
9 The aligned patterns then have a simpler structure and can be processed more easily. [sent-16, score-0.385]
10 Joint pattern alignment is not the same problem as aligning one pattern to another; instead all the patterns are projected to a common “reference” (usually a subspace) which is unknown and needs to be discovered in the process. [sent-17, score-0.635]
11 Joint pattern alignment is useful in many applications and has been addressed by several authors. [sent-18, score-0.445]
12 Transformed Component Analysis [7] (TCA) explicitly models the aligned ensemble as a Gaussian linear subspace of patterns. [sent-20, score-0.366]
13 Unfortunately the method requires the space of transformations to be quantized and it is not clear how well the approach could scale to complex scenarios. [sent-23, score-0.2]
14 The idea is that, as the nuisance deformations should increase the complexity of the data, one should be able to identify and undo them by counteracting this effect. [sent-25, score-0.296]
15 With respect to TCA, IC results in a lighter formulation which enables addressing more complex transformations and makes fewer assumptions on the aligned ensemble. [sent-27, score-0.448]
16 An issue with the standard formulation of IC is that it does not require the aligned data to be a faithful representation of the original data. [sent-28, score-0.324]
17 Thus simplifying the data might not only remove the nuisance factors, but also the useful information carried by the patterns. [sent-29, score-0.152]
18 For example, if entropy is used to measure complexity, a typical degenerate solution is obtained by mapping all the data to a constant, which results in minimum (null) entropy. [sent-30, score-0.176]
19 One should instead search for an optimal compromise between complexity of the simplified data and preservation of the useful information (Sect. [sent-32, score-0.199]
20 3 we specialize our model to the problem of image alignment as done in [9]. [sent-43, score-0.402]
21 2 Problem formulation We formulate joint pattern alignment as the problem of finding a deformed pattern ensemble which is simpler but faithful to the original data. [sent-53, score-0.862]
22 A pattern (or data) ensemble x ∈ X is a random variable with density p(x). [sent-57, score-0.247]
23 Similarly, an aligned ensemble or alignment y ∈ X of the ensemble x is another variable y specified by the conditional density p(y|x). [sent-58, score-0.879]
24 We seek an alignment that is “simpler” than x but “faithful” to x. [sent-59, score-0.353]
25 The complexity R of the alignment y is measured by an operator R = H(y) such as, for example, the entropy of the random variable y (but we will see other options). [sent-60, score-0.609]
26 The cost of representing x by y is expressed by a distortion function d(x, y) ∈ R+ and the faithfulness of the alignment y is quantified as the expected distortion D = E[d(x, y)]. [sent-61, score-0.948]
27 Consider a class W of deformations w : X → X acting on the patterns X . [sent-62, score-0.22]
28 Figuring out the best alignment y boils down to optimizing p(y|x) for complexity and distortion. [sent-64, score-0.45]
29 However, this requires trading off complexity and distortion, and there is no unique way of doing so. [sent-65, score-0.373]
30 The distortion-complexity function D(R) gives the best distortion D that can be achieved by alignments of complexity R. [sent-66, score-0.423]
31 D(R) can be computed by optimizing the distortion D w. [sent-68, score-0.276]
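To make the trade-off explicit, the distortion-complexity function and its Lagrangian relaxation can be sketched as follows; this is a reconstruction consistent with the empirical objective written out later for image alignment, not a verbatim equation from the text:

```latex
D(R) \;=\; \min_{p(y\mid x)\,:\,H(y)\le R} \mathbb{E}\big[d(x,y)\big],
\qquad
\min_{p(y\mid x)} \; \mathbb{E}\big[d(x,y)\big] \;+\; \lambda\, H(y), \quad \lambda \ge 0 .
```

Sweeping λ from 0 upward traces out D(R): λ = 0 reproduces the data exactly, while large λ favors maximally simple (and eventually degenerate) alignments.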
32 1 Relation to rate-distortion and entropy constrained vector quantization If one chooses the mutual information I(x, y) as complexity measure H(y) in eq. [sent-75, score-0.27]
33 Therefore the alignment y of a pattern x is in general not unique. [sent-78, score-0.416]
34 In contrast, entropy constrained vector quantization [4, 3] assumes that y is finite (i. [sent-83, score-0.173]
35 Then it measures the complexity of y as the (discrete) entropy H(y). [sent-88, score-0.229]
36 Unlike rate-distortion, however, the aligned ensemble y is discrete even if the ensemble x is continuous. [sent-93, score-0.557]
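For intuition, a minimal NumPy sketch of a Lloyd-style entropy-constrained VQ iteration is given below; the squared-error distortion, the function name and all parameter values are illustrative choices, not taken from [4, 3]. Each point is assigned to the codeword minimizing distortion plus λ times its codelength −log p_k, which is the distortion-complexity trade-off restricted to a finite alignment y.

```python
import numpy as np

def ecvq(X, K, lam, n_iter=50, seed=0):
    """Lloyd-style entropy-constrained VQ on an (N, d) data array X: each point pays
    a squared-error distortion plus lam times its codelength -log p_k."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    codebook = X[rng.choice(len(X), K, replace=False)].copy()
    p = np.full(K, 1.0 / K)
    assign = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # assignment step: distortion + lam * codelength
        d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        assign = (d2 - lam * np.log(p + 1e-12)).argmin(axis=1)
        # centroid and probability updates
        for k in range(K):
            members = X[assign == k]
            if len(members):
                codebook[k] = members.mean(axis=0)
        p = np.bincount(assign, minlength=K) / len(X)
    return codebook, p, assign
```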
37 2 Relation to information bottleneck Information Bottleneck (IB) [2] is a special rate-distortion problem in which one compresses a variable x while preserving the information carried by x about another variable z, representing the task of interest. [sent-95, score-0.112]
38 By designing an appropriate distribution p(x, z) it may also be possible to obtain an alignment effect similar to the one we seek here. [sent-97, score-0.382]
39 3 Alternative measures of complexity Instead of the entropy H(y) or the mutual information I(x, y) we can use alternative measures of complexity that yield to more convenient computations. [sent-100, score-0.326]
40 An example is the averaged-per-pixel entropy introduced by IC [9] and discussed in Sect. [sent-101, score-0.132]
41 Generalizing this idea, we assume that the aligned data y depend functionally on the patterns x (i. [sent-103, score-0.41]
42 y = y(x)) and we express the complexity of y as the total entropy of lower dimensional projections φ1 (y), . [sent-105, score-0.229]
43 , xK ∈ X of patterns, we recover transformations w1 , . [sent-113, score-0.2]
44 , yK ∈ X that minimize $\frac{1}{K}\sum_{i=1}^{K} d(x_i, w_i(y_i)) \;-\; \lambda \sum_{j=1}^{M} \frac{1}{K}\sum_{i=1}^{K} \log p_j(\phi_j(y_i))$, where the densities pj (φj (y)) are estimated from the samples φj (y1 ), . [sent-119, score-0.54]
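A direct transcription of this empirical objective, with the warps w_i, the scalar projections φ_j and the distortion d passed in as generic callables, might look as follows; this is only a sketch, and the Parzen bandwidth and helper names are made up for illustration:

```python
import numpy as np

def parzen_logpdf(values, samples, sigma=0.1):
    """1-D Gaussian Parzen estimate of log p at `values`, built from `samples`."""
    diff = values[:, None] - samples[None, :]
    kern = np.exp(-0.5 * (diff / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)
    return np.log(kern.mean(axis=1) + 1e-300)

def alignment_objective(X, Y, warps, projections, lam, d):
    """(1/K) sum_i d(x_i, w_i(y_i)) - lam * sum_j (1/K) sum_i log p_j(phi_j(y_i)).
    X, Y: lists of patterns; warps[i]: callable w_i; projections: callables phi_j,
    each mapping a pattern to a scalar; d: distortion function."""
    K = len(X)
    D = np.mean([d(X[i], warps[i](Y[i])) for i in range(K)])
    C = 0.0
    for phi in projections:
        proj = np.array([phi(Y[i]) for i in range(K)])
        C += -parzen_logpdf(proj, proj).mean()   # entropy estimate of the j-th projection
    return D + lam * C
```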
45 4 Comparison to image congealing In IC [9], given data x1 , . [sent-126, score-0.121]
46 , xK ∈ X , one looks for transformations v : X → X , x → y such that the density p(y) estimated from samples y1 = v1 (x1 ), . [sent-129, score-0.248]
47 This prevents the differential entropy from taking arbitrarily large negative values. [sent-134, score-0.162]
48 Compared to IC, in our formulation: The distortion term E[d(x, y)] replaces the arbitrary regularization R(v). [sent-136, score-0.322]
49 The aligned patterns y are not obtained by deforming the patterns x; instead y is obtained as a simplification of x within an acceptable level of distortion. [sent-137, score-0.551]
50 The transformations w can be rather general, even non-invertible. [sent-140, score-0.2]
51 IC can use complex transformations too, but most likely these would need to be heavily regularized as they would tend to annihilate the patterns. [sent-141, score-0.2]
52 3 Application to joint image alignment We apply our model to the problem of removing a family of geometric distortions from images. [sent-142, score-0.446]
53 The images may be affected by parametric transformations wi (·) = w(·; qi ) : R2 → R2 , so that Ii (x) = Ti (wi x) + ni (x), x ∈ Λ, for templates (aligned ensemble) Ti (y), y ∈ Λ and residuals ni (x). [sent-148, score-0.957]
54 Here qi is the vector of parameters of the transformation wi (for example, wi might be a 2-D affine transformation y = Lx + l and qi the vector q = [L11 L21 L12 L22 l1 l2 ]). [sent-149, score-1.242]
55 The templates Ti (y), y ∈ Λ are digital images themselves. [sent-150, score-0.174]
56 Therefore the symbol Ti (wi x) really denotes the quantity T (wi x) = A(x; wi )Ti , x ∈ Λ, where A(x; wi ) is a row vector of mixing coefficients determined by wi and the interpolation method being used, and Ti is the vector obtained by stacking the pixels of the template Ti (y), y ∈ Λ. [sent-152, score-1.397]
57 We will also use the notation wi ◦ Ti = A(wi )Ti where the left hand side is the stacking of the warped template T (wi x), x ∈ Λ and A(wi ) is the matrix whose rows are the vectors A(x; wi ) for x ∈ Λ. [sent-153, score-0.92]
58 The distortion is defined to be the squared l2 norm of the residual, $d(I_i, w_i \circ T_i) = \sum_{x \in \Lambda} \big(I_i(x) - T_i(w_i x)\big)^2$. [sent-154, score-0.301]
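For concreteness, here is a small NumPy sketch of the warped template T_i(w_i x) for an affine warp with bilinear interpolation (this plays the role of applying A(w_i) to the stacked template) together with the squared-l2 distortion; the array layout and helper names are our own choices rather than the paper's:

```python
import numpy as np

def warp_affine(template, L, l):
    """Sample T(w x) for w(x) = L x + l with bilinear interpolation."""
    L, l = np.asarray(L, float), np.asarray(l, float)
    H, W = template.shape
    ys, xs = np.mgrid[0:H, 0:W]
    coords = np.stack([ys.ravel(), xs.ravel()]).astype(float)   # pixel grid, shape (2, H*W)
    wy, wx = L @ coords + l[:, None]                            # warped coordinates
    wy = np.clip(wy, 0, H - 1); wx = np.clip(wx, 0, W - 1)
    y0, x0 = np.floor(wy).astype(int), np.floor(wx).astype(int)
    y1, x1 = np.minimum(y0 + 1, H - 1), np.minimum(x0 + 1, W - 1)
    fy, fx = wy - y0, wx - x0
    out = (template[y0, x0] * (1 - fy) * (1 - fx) + template[y1, x0] * fy * (1 - fx)
           + template[y0, x1] * (1 - fy) * fx + template[y1, x1] * fy * fx)
    return out.reshape(H, W)

def squared_l2_distortion(I, T_warped):
    """d(I_i, w_i o T_i) = sum_x (I_i(x) - T_i(w_i x))^2."""
    return float(np.sum((I - T_warped) ** 2))
```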
59 The complexity of the aligned ensemble T (y), y ∈ Λ is computed as in Sect. [sent-155, score-0.463]
60 3 by projecting on the image pixels and averaging their entropies (this is equivalent to assuming that the pixels are statistically independent). [sent-157, score-0.162]
61 For each pixel y ∈ Λ a density p(T (y) = t), t ∈ [0, 1] is estimated non-parametrically from the data {T1 (y), . [sent-158, score-0.135]
62 The complexity of a pixel is thus $H(T(y)) = -\frac{1}{K}\sum_{i=1}^{K} \log p(T_i(y))$. [sent-164, score-0.184]
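The averaged per-pixel entropy can be estimated with a Gaussian Parzen window from the K template values at each pixel, as in [9]; a minimal sketch, with an arbitrary bandwidth, is:

```python
import numpy as np

def averaged_pixel_entropy(templates, sigma=0.05):
    """Average per-pixel entropy H(T(y)) = -(1/K) sum_i log p(T_i(y)), with p
    estimated by a Gaussian Parzen window; templates: array (K, H, W), values in [0, 1]."""
    K = templates.shape[0]
    T = templates.reshape(K, -1)                          # (K, n_pixels)
    diff = T[:, None, :] - T[None, :, :]                  # pairwise differences per pixel
    kern = np.exp(-0.5 * (diff / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)
    p = kern.mean(axis=1)                                 # p(T_i(y)) for every i and y
    H_pixel = -np.log(p + 1e-300).mean(axis=0)            # entropy estimate at each pixel
    return float(H_pixel.mean())
```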
63 2 the patterns xi are now the images Ii and the alignments y are the templates Ti . [sent-175, score-0.668]
64 1: Estimate the probabilities p(T (y)), y ∈ Λ from the templates {Ti (y) : i = 1, . [sent-176, score-0.1]
65 , K and for each component qji of the parameter vector qi , try a few values of qji . [sent-182, score-0.275]
66 Here we consider affine transformations for the sake of illustration, so that q is six-dimensional. [sent-186, score-0.2]
67 As a first order approximation (as the final result will be refined by Gauss-Newton as explained in the next Section), we bypass this problem and simply set $T_i = w_i^{-1} \circ I_i$, exploiting the fact that the affine transformations wi are invertible. [sent-196, score-1.028]
68 Thus the new algorithm is simply avoiding those transformations wi that would introduce excessive loss of fidelity. [sent-198, score-0.637]
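Putting the pieces together, the basic algorithm amounts to a coordinate search over the warp parameters, scored by distortion plus λ times the per-pixel entropy. The sketch below assumes hypothetical helpers warp_inverse (computing T_i = w_i^{-1} ∘ I_i, e.g. with the bilinear sampler above) and objective (distortion plus λ-weighted entropy); the step sizes and sweep counts are illustrative:

```python
import numpy as np

def basic_search(images, q, lam, warp_inverse, objective,
                 deltas=(-0.02, 0.0, 0.02), n_sweeps=5):
    """Coordinate search over the warp parameters q (shape K x P).
    warp_inverse(I_i, q_i): hypothetical helper returning T_i = w_i^{-1} o I_i.
    objective(images, templates, lam): hypothetical helper returning distortion
    plus lam times the averaged per-pixel entropy of the templates."""
    K, P = q.shape
    templates = [warp_inverse(images[i], q[i]) for i in range(K)]
    for _ in range(n_sweeps):
        for i in range(K):
            for j in range(P):
                best_cost, best_q = np.inf, q[i, j]
                for delta in deltas:                       # try a few values of q_ji
                    q_try = q[i].copy()
                    q_try[j] += delta
                    templates[i] = warp_inverse(images[i], q_try)
                    cost = objective(images, templates, lam)
                    if cost < best_cost:
                        best_cost, best_q = cost, q_try[j]
                q[i, j] = best_q                           # keep the best value found
                templates[i] = warp_inverse(images[i], q[i])
    return q, templates
```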
69 2 Gauss-Newton search With respect to IC, where only the transformations w1 , . [sent-200, score-0.241]
70 , wK are estimated, here we compute the templates T1 , . [sent-203, score-0.1]
71 We still process a single image at a time, iterating several times over the whole ensemble {I1 (x), . [sent-213, score-0.244]
72 For a given image Ii we update the warp parameters qi and the template Ti simultaneously. [sent-217, score-0.278]
73 We exploit the fact that, as the number K of images is usually big, the density p(T (y)) does not change significantly when only one of the templates Ti is being changed. [sent-218, score-0.162]
74 The gradient is given by $\frac{\partial L}{\partial T_i(y)} = \sum_{x\in\Lambda} 2\Delta_i(x)\,\big(A(x; w_i)\,\delta_y\big) - \frac{\dot p(T_i(y))}{p(T_i(y))}$, $y \in \Lambda$, and $\frac{\partial L}{\partial q_i} = \sum_{x\in\Lambda} 2\Delta_i(x)\,\dot T_i(w_i x)\,\frac{\partial w_i}{\partial q_i}(x)$, where ∆i (x) = Ti (wi x)−Ii (x) is the reconstruction residual, A(x; wi ) is the linear map introduced in Sect. [sent-220, score-0.828]
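As a simplified stand-in for the Gauss-Newton update of q_i, the sketch below performs one step for a pure-translation warp (two parameters), where the chain rule reduces to the gradient of the warped template; warp_translate is a hypothetical helper (for instance the bilinear sampler above restricted to w(x) = x + t), and the simultaneous template update and the entropy term are omitted for brevity:

```python
import numpy as np

def translation_gn_step(I, T, t, warp_translate):
    """One Gauss-Newton step on w(x) = x + t, minimizing sum_x (I(x) - T(x + t))^2."""
    Tw = warp_translate(T, t)                   # current warped template T(x + t)
    gy, gx = np.gradient(Tw)                    # gradient of the warped template
    r = (I - Tw).ravel()                        # residual (opposite sign of Delta_i above)
    J = np.stack([gy.ravel(), gx.ravel()], axis=1)
    # solve the normal equations (J^T J) dt = J^T r in the least-squares sense
    dt, *_ = np.linalg.lstsq(J, r, rcond=None)
    return t + dt
```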
75 We distort the patterns by applying translations drawn uniformly from the 8-shaped region (the center corresponds to the null translation). [sent-229, score-0.212]
76 We show the gradient-based algorithm while it gradually aligns the patterns by reducing the complexity of the alignment y. [sent-231, score-0.606]
77 The basic algorithm completely eliminates the effect of the nuisance transformations, doing a better job of avoiding local minima. [sent-235, score-0.393]
78 Although for this simple problem the basic search is more effective, in more difficult scenarios the extra complexity of the Gauss-Newton search pays off (see Sect. [sent-236, score-0.206]
79 where D1 is the discrete linear operator used to compute the derivative of Ti (y) along its first dimension and D2 the analogous operator for the second dimension. [sent-238, score-0.135]
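The operators D1 and D2 can be realized as finite-difference matrices; a dense NumPy sketch, assuming the template is stacked column by column (the stacking convention is our assumption), is:

```python
import numpy as np

def forward_difference(n):
    """n x n forward-difference matrix: (D f)[k] = f[k+1] - f[k], last row left at zero."""
    D = np.zeros((n, n))
    idx = np.arange(n - 1)
    D[idx, idx] = -1.0
    D[idx, idx + 1] = 1.0
    return D

# For an H x W template stacked column by column into a length-H*W vector,
# the derivative along the first (vertical) dimension is I_W kron D_H and
# the derivative along the second (horizontal) dimension is D_W kron I_H.
H, W = 8, 8
D1 = np.kron(np.eye(W), forward_difference(H))
D2 = np.kron(forward_difference(W), np.eye(H))
```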
80 samples from a 2-D Gaussian distribution and adding a random translation wi ∈ R2 to them. [sent-257, score-0.414]
81 The distribution of the translations wi is generic (in the example wi is drawn uniformly from an 8-shaped region of the plane): This is not a problem as we do not need to make any particular assumptions on w besides that it is a translation. [sent-258, score-0.86]
82 The distortion d(xi , yi ) is simply the sum of the Euclidean distances $\sum_{j=1}^{m} \| y_{ji} + w_i - x_{ji} \|^2$ between the patterns xi and the transformed codes wi (yi ) = (y1i + wi , . [sent-259, score-1.835]
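A minimal sketch of this toy setting, with a two-lobe translation distribution standing in for the 8-shaped region and the distortion written exactly as above (sizes, spreads and seeds are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
K, m = 200, 30

# each pattern x_i: m i.i.d. samples from a 2-D Gaussian, all shifted by a translation w_i
# drawn from a non-Gaussian distribution (a crude two-lobe stand-in for the 8-shaped region)
lobe = rng.integers(0, 2, size=K) * 4.0 - 2.0
w = np.stack([rng.normal(0.0, 0.5, K), lobe + rng.normal(0.0, 0.5, K)], axis=1)   # (K, 2)
X = rng.normal(size=(K, m, 2)) + w[:, None, :]                                    # (K, m, 2)

def distortion(x_i, y_i, w_i):
    """d(x_i, y_i) with transformed code w_i(y_i): sum_j ||y_ji + w_i - x_ji||^2."""
    return float(np.sum((y_i + w_i - x_i) ** 2))

# with the true translation recovered and y_i = x_i - w_i the distortion vanishes
print(distortion(X[0], X[0] - w[0], w[0]))   # -> 0.0
```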
83 Despite this, we did not observe any of the aligned patterns to collapse. [sent-274, score-0.362]
84 As λ is increased, the alignment complexity is reduced and the fidelity of the alignment is degraded. [sent-277, score-0.803]
85 By an appropriate choice of λ, the alignment can be regarded as a “restoration” or “canonization” of the pattern which abstracts from details of the specific instance. [sent-278, score-0.416]
86 (Figure 2 panels: expected value per pixel; entropy per pixel; distortion-rate diagram; distortion.) [sent-279, score-0.732]
87 Figure 2: Basic vs GN image alignment algorithms. [sent-297, score-0.448]
88 We show the results of applying the basic image alignment algorithm of Sect. [sent-299, score-0.429]
89 The patterns are zeroes from the NIST Special Database 19. [sent-302, score-0.156]
90 Basic algorithm: Results are very similar to [9], except that no regularization on the transformations is used. [sent-315, score-0.246]
91 GN algorithm: Patterns achieve a better alignment due to the more efficient search strategy; they also appear to be much more “regular” due to the noise cancellation effect discussed in Fig. [sent-317, score-0.423]
92 More examples of patterns before and after GN alignment. [sent-320, score-0.156]
93 5 Conclusions IC is a useful algorithm for joint pattern alignment, both robust and flexible. [sent-321, score-0.136]
94 In this paper we showed that the original formulation can be improved by realizing that alignment should result in a simplified representation of the useful information carried by the patterns rather than a simplification of the patterns. [sent-322, score-0.616]
95 This results in a formulation that does not require inventing regularization terms in order to prevent degenerate solutions. [sent-323, score-0.132]
96 We also showed that Gauss-Newton can be successfully applied to this problem for the case of image alignment and that this is in some regards more effective than the original IC algorithm. [sent-324, score-0.402]
97 Figure 4: Distortion-complexity balance. (b) Not aligned. (c) Aligned. [sent-330, score-0.206]
98 (b) We show the alignment T (wi x) of eight patterns (rows) as λ is increased (columns). [sent-335, score-0.509]
99 In order to reduce the entropy of the alignment, the algorithm “forgets” about specific details of each glyph. [sent-336, score-0.132]
100 Joint nonparametric alignment for analyzing spatial gene expression patterns in Drosophila imaginal discs. [sent-348, score-0.509]
wordName wordTfidf (topN-words)
[('wi', 0.414), ('ti', 0.364), ('alignment', 0.353), ('distortion', 0.276), ('ic', 0.248), ('aligned', 0.206), ('transformations', 0.2), ('qi', 0.179), ('ensemble', 0.16), ('patterns', 0.156), ('entropy', 0.132), ('gn', 0.102), ('templates', 0.1), ('complexity', 0.097), ('nuisance', 0.087), ('pixel', 0.087), ('ii', 0.079), ('faithful', 0.076), ('yji', 0.076), ('congealing', 0.072), ('tca', 0.072), ('parzen', 0.067), ('deformations', 0.064), ('wx', 0.064), ('pattern', 0.063), ('nist', 0.057), ('wk', 0.056), ('tk', 0.054), ('bottleneck', 0.051), ('pj', 0.051), ('alignments', 0.05), ('template', 0.05), ('image', 0.049), ('functionally', 0.048), ('ppca', 0.048), ('qji', 0.048), ('undo', 0.048), ('diagram', 0.046), ('regularization', 0.046), ('degenerate', 0.044), ('joint', 0.044), ('cost', 0.043), ('formulation', 0.042), ('af', 0.042), ('delity', 0.042), ('lossy', 0.042), ('scanline', 0.042), ('stacking', 0.042), ('yi', 0.041), ('search', 0.041), ('quantization', 0.041), ('pixels', 0.04), ('yk', 0.039), ('images', 0.038), ('deformed', 0.038), ('spans', 0.038), ('digital', 0.036), ('carried', 0.036), ('angeles', 0.035), ('sweeps', 0.035), ('per', 0.035), ('acceptable', 0.033), ('entropies', 0.033), ('ib', 0.033), ('los', 0.033), ('compromise', 0.032), ('deformation', 0.032), ('translations', 0.032), ('xk', 0.031), ('discrete', 0.031), ('differential', 0.03), ('effect', 0.029), ('useful', 0.029), ('transformation', 0.028), ('regularizing', 0.028), ('eliminates', 0.027), ('operator', 0.027), ('basic', 0.027), ('affected', 0.026), ('simpli', 0.026), ('modes', 0.026), ('pami', 0.026), ('derivative', 0.025), ('analogous', 0.025), ('preserving', 0.025), ('residual', 0.025), ('systematic', 0.025), ('null', 0.024), ('density', 0.024), ('estimated', 0.024), ('avoiding', 0.023), ('codes', 0.023), ('interpolation', 0.023), ('middle', 0.023), ('simpler', 0.023), ('database', 0.023), ('toy', 0.023), ('xi', 0.021), ('green', 0.021), ('ik', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment
Author: Andrea Vedaldi, Stefano Soatto
Abstract: Image Congealing (IC) is a non-parametric method for the joint alignment of a collection of images affected by systematic and unwanted deformations. The method attempts to undo the deformations by minimizing a measure of complexity of the image ensemble, such as the averaged per-pixel entropy. This enables alignment without an explicit model of the aligned dataset as required by other methods (e.g. transformed component analysis). While IC is simple and general, it may introduce degenerate solutions when the transformations allow minimizing the complexity of the data by collapsing them to a constant. Such solutions need to be explicitly removed by regularization. In this paper we propose an alternative formulation which solves this regularization issue on a more principled ground. We make the simple observation that alignment should simplify the data while preserving the useful information carried by them. Therefore we trade off fidelity and complexity of the aligned ensemble rather than minimizing the complexity alone. This eliminates the need for an explicit regularization of the transformations, and has a number of other useful properties such as noise suppression. We show the modeling and computational benefits of the approach to the some of the problems on which IC has been demonstrated. 1
2 0.21063465 175 nips-2006-Simplifying Mixture Models through Function Approximation
Author: Kai Zhang, James T. Kwok
Abstract: Finite mixture model is a powerful tool in many statistical learning problems. In this paper, we propose a general, structure-preserving approach to reduce its model complexity, which can bring significant computational benefits in many applications. The basic idea is to group the original mixture components into compact clusters, and then minimize an upper bound on the approximation error between the original and simplified models. By adopting the L2 norm as the distance measure between mixture models, we can derive closed-form solutions that are more robust and reliable than using the KL-based distance measure. Moreover, the complexity of our algorithm is only linear in the sample size and dimensionality. Experiments on density estimation and clustering-based image segmentation demonstrate its outstanding performance in terms of both speed and accuracy.
3 0.19780596 20 nips-2006-Active learning for misspecified generalized linear models
Author: Francis R. Bach
Abstract: Active learning refers to algorithmic frameworks aimed at selecting training data points in order to reduce the number of required training data points and/or improve the generalization performance of a learning method. In this paper, we present an asymptotic analysis of active learning for generalized linear models. Our analysis holds under the common practical situation of model misspecification, and is based on realistic assumptions regarding the nature of the sampling distributions, which are usually neither independent nor identical. We derive unbiased estimators of generalization performance, as well as estimators of expected reduction in generalization error after adding a new training data point, that allow us to optimize its sampling distribution through a convex optimization problem. Our analysis naturally leads to an algorithm for sequential active learning which is applicable for all tasks supported by generalized linear models (e.g., binary classification, multi-class classification, regression) and can be applied in non-linear settings through the use of Mercer kernels. 1
4 0.14442673 127 nips-2006-MLLE: Modified Locally Linear Embedding Using Multiple Weights
Author: Zhenyue Zhang, Jing Wang
Abstract: The locally linear embedding (LLE) is improved by introducing multiple linearly independent local weight vectors for each neighborhood. We characterize the reconstruction weights and show the existence of the linearly independent weight vectors at each neighborhood. The modified locally linear embedding (MLLE) proposed in this paper is much stable. It can retrieve the ideal embedding if MLLE is applied on data points sampled from an isometric manifold. MLLE is also compared with the local tangent space alignment (LTSA). Numerical examples are given that show the improvement and efficiency of MLLE. 1
5 0.13521469 40 nips-2006-Bayesian Detection of Infrequent Differences in Sets of Time Series with Shared Structure
Author: Jennifer Listgarten, Radford M. Neal, Sam T. Roweis, Rachel Puckrin, Sean Cutler
Abstract: We present a hierarchical Bayesian model for sets of related, but different, classes of time series data. Our model performs alignment simultaneously across all classes, while detecting and characterizing class-specific differences. During inference the model produces, for each class, a distribution over a canonical representation of the class. These class-specific canonical representations are automatically aligned to one another — preserving common sub-structures, and highlighting differences. We apply our model to compare and contrast solenoid valve current data, and also, liquid-chromatography-ultraviolet-diode array data from a study of the plant Arabidopsis thaliana. 1 Aligning Time Series From Different Classes Many practical problems over a wide range of domains require synthesizing information from several noisy examples of one or more categories in order to build a model which captures common structure and also learns the patterns of variability between categories. In time series analysis, these modeling goals manifest themselves in the tasks of alignment and difference detection. These tasks have diverse applicability, spanning speech & music processing, equipment & industrial plant diagnosis/monitoring, and analysis of biological time series such as microarray & liquid/gas chromatography-based laboratory data (including mass spectrometry and ultraviolet diode arrays). Although alignment and difference detection have been extensively studied as separate problems in the signal processing and statistical pattern recognition communities, to our knowledge, no existing model performs both tasks in a unified way. Single class alignment algorithms attempt to align a set of time series all together, assuming that variability across different time series is attributable purely to noise. In many real-world situations, however, we have time series from multiple classes (categories) and our prior belief is that there is both substantial shared structure between the class distributions and, simultaneously, systematic (although often rare) differences between them. While in some circumstances (if differences are small and infrequent), single class alignment can be applied to multi-class data, it is much more desirable to have a model which performs true multi-class alignment in a principled way, allowing for more refined and accurate modeling of the data. In this paper, we introduce a novel hierarchical Bayesian model which simultaneously solves the multi-class alignment and difference detection tasks in a unified manner, as illustrated in Figure 1. The single-class alignment shown in this figure coerces the feature in region A for class 1 to be inappropriately collapsed in time, and the overall width of the main broad peak in class 2 to be inappropriately narrowed. In contrast, our multi-class model handles these features correctly. Furthermore, because our algorithm does inference for a fully probabilistic model, we are able to obtain quantitative measures of the posterior uncertainty in our results, which, unlike the point estimates produced by most current approaches, allow us to assess our relative confidence in differences learned by the model. Our basic setup for multi-class alignment assumes the class labels are known for each time series, as is the case for most difference detection problems. However, as we discuss at the end of the paper, our model can be extended to the completely unsupervised case. 
6 0.12394959 187 nips-2006-Temporal Coding using the Response Properties of Spiking Neurons
7 0.10525024 135 nips-2006-Modelling transcriptional regulation using Gaussian Processes
8 0.10522375 69 nips-2006-Distributed Inference in Dynamical Systems
9 0.091317438 139 nips-2006-Multi-dynamic Bayesian Networks
10 0.089044571 31 nips-2006-Analysis of Contour Motions
11 0.088173561 164 nips-2006-Randomized PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension
12 0.084711209 43 nips-2006-Bayesian Model Scoring in Markov Random Fields
13 0.080771662 50 nips-2006-Chained Boosting
14 0.080082379 68 nips-2006-Dirichlet-Enhanced Spam Filtering based on Biased Samples
15 0.077796146 103 nips-2006-Kernels on Structured Objects Through Nested Histograms
16 0.068156525 99 nips-2006-Information Bottleneck Optimization and Independent Component Extraction with Spiking Neurons
17 0.067977585 186 nips-2006-Support Vector Machines on a Budget
18 0.067735672 37 nips-2006-Attribute-efficient learning of decision lists and linear threshold functions under unconcentrated distributions
19 0.063080586 100 nips-2006-Information Bottleneck for Non Co-Occurrence Data
20 0.055941433 41 nips-2006-Bayesian Ensemble Learning
topicId topicWeight
[(0, -0.209), (1, -0.01), (2, 0.044), (3, 0.006), (4, 0.076), (5, 0.039), (6, -0.039), (7, -0.061), (8, 0.009), (9, 0.188), (10, -0.088), (11, -0.062), (12, 0.365), (13, 0.255), (14, 0.018), (15, -0.061), (16, -0.015), (17, 0.091), (18, -0.044), (19, -0.009), (20, -0.0), (21, 0.036), (22, -0.227), (23, 0.003), (24, 0.039), (25, 0.052), (26, -0.13), (27, -0.124), (28, 0.154), (29, -0.007), (30, 0.042), (31, 0.019), (32, -0.046), (33, -0.004), (34, 0.017), (35, 0.079), (36, 0.028), (37, -0.077), (38, -0.013), (39, 0.111), (40, -0.093), (41, -0.042), (42, -0.023), (43, -0.081), (44, -0.041), (45, 0.048), (46, 0.06), (47, -0.107), (48, -0.061), (49, 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.98003858 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment
2 0.67978477 175 nips-2006-Simplifying Mixture Models through Function Approximation
3 0.64317882 127 nips-2006-MLLE: Modified Locally Linear Embedding Using Multiple Weights
Author: Zhenyue Zhang, Jing Wang
Abstract: The locally linear embedding (LLE) is improved by introducing multiple linearly independent local weight vectors for each neighborhood. We characterize the reconstruction weights and show the existence of the linearly independent weight vectors at each neighborhood. The modified locally linear embedding (MLLE) proposed in this paper is much more stable. It can retrieve the ideal embedding if MLLE is applied to data points sampled from an isometric manifold. MLLE is also compared with the local tangent space alignment (LTSA). Numerical examples are given that show the improvement and efficiency of MLLE. 1
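For reference, a minimal sketch of how reconstruction weights for a single neighborhood are commonly computed in standard LLE is given below; MLLE's extension to multiple linearly independent weight vectors per neighborhood is not shown, and the regularization constant and toy data are assumptions.

import numpy as np

def lle_weights(x, neighbors, reg=1e-3):
    # Reconstruction weights for one point x from its k neighbors (rows of `neighbors`).
    # Solves min_w ||x - sum_j w_j n_j||^2 subject to sum_j w_j = 1.
    diffs = neighbors - x                         # shape (k, d): n_j - x
    G = diffs @ diffs.T                           # local Gram matrix, shape (k, k)
    G += reg * np.trace(G) * np.eye(len(G))       # regularize for numerical stability
    w = np.linalg.solve(G, np.ones(len(G)))
    return w / w.sum()                            # enforce the sum-to-one constraint

# Toy usage: a point and three nearby samples in 2-D.
x = np.array([0.0, 0.0])
nbrs = np.array([[1.0, 0.1], [-0.9, 0.2], [0.1, -1.0]])
print(lle_weights(x, nbrs))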
4 0.61695606 20 nips-2006-Active learning for misspecified generalized linear models
Author: Francis R. Bach
Abstract: Active learning refers to algorithmic frameworks aimed at selecting training data points in order to reduce the number of required training data points and/or improve the generalization performance of a learning method. In this paper, we present an asymptotic analysis of active learning for generalized linear models. Our analysis holds under the common practical situation of model misspecification, and is based on realistic assumptions regarding the nature of the sampling distributions, which are usually neither independent nor identical. We derive unbiased estimators of generalization performance, as well as estimators of expected reduction in generalization error after adding a new training data point, that allow us to optimize its sampling distribution through a convex optimization problem. Our analysis naturally leads to an algorithm for sequential active learning which is applicable for all tasks supported by generalized linear models (e.g., binary classification, multi-class classification, regression) and can be applied in non-linear settings through the use of Mercer kernels. 1
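The generic importance-weighting idea behind such unbiased estimators can be sketched as follows; this is only an illustration of reweighting samples drawn from a sampling distribution q back to a data distribution p, with names and values of my own choosing, not the estimator derived in that paper.

import numpy as np

def weighted_risk(losses, p_data, q_sample):
    # Importance-weighted empirical risk: points drawn from q are reweighted by
    # p(x)/q(x) so that the average is unbiased for the risk under p.
    weights = p_data / q_sample
    return np.mean(weights * losses)

# Toy usage: losses observed on points drawn from a skewed sampling distribution q.
losses   = np.array([0.2, 0.9, 0.4, 0.1])
p_data   = np.array([0.25, 0.25, 0.25, 0.25])   # target density at each sampled point
q_sample = np.array([0.40, 0.10, 0.30, 0.20])   # active-learning sampling density
print(weighted_risk(losses, p_data, q_sample))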
5 0.45087069 40 nips-2006-Bayesian Detection of Infrequent Differences in Sets of Time Series with Shared Structure
Author: Jennifer Listgarten, Radford M. Neal, Sam T. Roweis, Rachel Puckrin, Sean Cutler
Abstract: We present a hierarchical Bayesian model for sets of related, but different, classes of time series data. Our model performs alignment simultaneously across all classes, while detecting and characterizing class-specific differences. During inference the model produces, for each class, a distribution over a canonical representation of the class. These class-specific canonical representations are automatically aligned to one another — preserving common sub-structures, and highlighting differences. We apply our model to compare and contrast solenoid valve current data, and also, liquid-chromatography-ultraviolet-diode array data from a study of the plant Arabidopsis thaliana. 1
6 0.43045464 135 nips-2006-Modelling transcriptional regulation using Gaussian Processes
7 0.38835984 139 nips-2006-Multi-dynamic Bayesian Networks
8 0.36995375 68 nips-2006-Dirichlet-Enhanced Spam Filtering based on Biased Samples
9 0.33148986 69 nips-2006-Distributed Inference in Dynamical Systems
10 0.32137293 31 nips-2006-Analysis of Contour Motions
11 0.31319278 50 nips-2006-Chained Boosting
12 0.30978552 174 nips-2006-Similarity by Composition
13 0.29308677 187 nips-2006-Temporal Coding using the Response Properties of Spiking Neurons
14 0.29070267 129 nips-2006-Map-Reduce for Machine Learning on Multicore
15 0.28857616 103 nips-2006-Kernels on Structured Objects Through Nested Histograms
16 0.28259602 180 nips-2006-Speakers optimize information density through syntactic reduction
17 0.2696932 160 nips-2006-Part-based Probabilistic Point Matching using Equivalence Constraints
18 0.26542559 164 nips-2006-Randomized PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension
19 0.26146081 155 nips-2006-Optimal Single-Class Classification Strategies
20 0.24236043 37 nips-2006-Attribute-efficient learning of decision lists and linear threshold functions under unconcentrated distributions
topicId topicWeight
[(1, 0.127), (3, 0.04), (7, 0.065), (9, 0.045), (20, 0.017), (22, 0.095), (32, 0.185), (44, 0.114), (57, 0.107), (65, 0.075), (69, 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 0.88130397 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment
Author: Andrea Vedaldi, Stefano Soatto
Abstract: Image Congealing (IC) is a non-parametric method for the joint alignment of a collection of images affected by systematic and unwanted deformations. The method attempts to undo the deformations by minimizing a measure of complexity of the image ensemble, such as the averaged per-pixel entropy. This enables alignment without an explicit model of the aligned dataset as required by other methods (e.g. transformed component analysis). While IC is simple and general, it may introduce degenerate solutions when the transformations allow minimizing the complexity of the data by collapsing them to a constant. Such solutions need to be explicitly removed by regularization. In this paper we propose an alternative formulation which solves this regularization issue on a more principled ground. We make the simple observation that alignment should simplify the data while preserving the useful information carried by them. Therefore we trade off fidelity and complexity of the aligned ensemble rather than minimizing the complexity alone. This eliminates the need for an explicit regularization of the transformations, and has a number of other useful properties such as noise suppression. We show the modeling and computational benefits of the approach to some of the problems on which IC has been demonstrated. 1
2 0.78357989 175 nips-2006-Simplifying Mixture Models through Function Approximation
Author: Kai Zhang, James T. Kwok
Abstract: Finite mixture model is a powerful tool in many statistical learning problems. In this paper, we propose a general, structure-preserving approach to reduce its model complexity, which can bring significant computational benefits in many applications. The basic idea is to group the original mixture components into compact clusters, and then minimize an upper bound on the approximation error between the original and simplified models. By adopting the L2 norm as the distance measure between mixture models, we can derive closed-form solutions that are more robust and reliable than using the KL-based distance measure. Moreover, the complexity of our algorithm is only linear in the sample size and dimensionality. Experiments on density estimation and clustering-based image segmentation demonstrate its outstanding performance in terms of both speed and accuracy.
3 0.78040028 20 nips-2006-Active learning for misspecified generalized linear models
Author: Francis R. Bach
Abstract: Active learning refers to algorithmic frameworks aimed at selecting training data points in order to reduce the number of required training data points and/or improve the generalization performance of a learning method. In this paper, we present an asymptotic analysis of active learning for generalized linear models. Our analysis holds under the common practical situation of model misspecification, and is based on realistic assumptions regarding the nature of the sampling distributions, which are usually neither independent nor identical. We derive unbiased estimators of generalization performance, as well as estimators of expected reduction in generalization error after adding a new training data point, that allow us to optimize its sampling distribution through a convex optimization problem. Our analysis naturally leads to an algorithm for sequential active learning which is applicable for all tasks supported by generalized linear models (e.g., binary classification, multi-class classification, regression) and can be applied in non-linear settings through the use of Mercer kernels. 1
4 0.77982426 65 nips-2006-Denoising and Dimension Reduction in Feature Space
Author: Mikio L. Braun, Klaus-Robert Müller, Joachim M. Buhmann
Abstract: We show that the relevant information about a classification problem in feature space is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem. Thus, kernels not only transform data sets such that good generalization can be achieved even by linear discriminant functions, but this transformation is also performed in a manner which makes economic use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data to perform classification. Practically, we propose an algorithm which enables us to recover the subspace and dimensionality relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to help in model selection, and to (3) de-noise in feature space in order to yield better classification results. 1
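A minimal sketch of the basic operation this abstract relies on, projecting data onto the leading kernel PCA components, is given below; the RBF kernel, its parameter, and the toy data are assumptions, and this is not the authors' subspace-recovery algorithm.

import numpy as np

def kernel_pca_scores(X, n_components, gamma=1.0):
    # Project data onto the leading kernel PCA components of an RBF kernel.
    sq = np.sum(X**2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))  # RBF kernel matrix
    n = len(X)
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                                                  # center in feature space
    vals, vecs = np.linalg.eigh(Kc)
    order = np.argsort(vals)[::-1][:n_components]                   # leading eigenpairs
    vals, vecs = vals[order], vecs[:, order]
    alphas = vecs / np.sqrt(np.maximum(vals, 1e-12))                # unit-norm feature-space directions
    return Kc @ alphas                                              # scores of each point on each component

# Toy usage: keep the 2 leading feature-space directions of random data.
X = np.random.RandomState(0).randn(20, 5)
print(kernel_pca_scores(X, n_components=2).shape)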
5 0.77383363 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization
Author: Rey Ramírez, Jason Palmer, Scott Makeig, Bhaskar D. Rao, David P. Wipf
Abstract: The ill-posed nature of the MEG/EEG source localization problem requires the incorporation of prior assumptions when choosing an appropriate solution out of an infinite set of candidates. Bayesian methods are useful in this capacity because they allow these assumptions to be explicitly quantified. Recently, a number of empirical Bayesian approaches have been proposed that attempt a form of model selection by using the data to guide the search for an appropriate prior. While seemingly quite different in many respects, we apply a unifying framework based on automatic relevance determination (ARD) that elucidates various attributes of these methods and suggests directions for improvement. We also derive theoretical properties of this methodology related to convergence, local minima, and localization bias and explore connections with established algorithms. 1
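For orientation, below is a minimal sketch of a generic automatic relevance determination (ARD) update for a linear-Gaussian model; it illustrates the mechanism the abstract refers to, but it is not the source-localization method itself, and the fixed noise precision and toy data are simplifying assumptions.

import numpy as np

def ard_linear(Phi, t, n_iter=50, beta=100.0):
    # ARD for t = Phi @ w + noise: each weight gets its own precision alpha_i, and
    # irrelevant columns are pruned as alpha_i grows large. Noise precision `beta`
    # is held fixed here for simplicity.
    n, d = Phi.shape
    alpha = np.ones(d)
    for _ in range(n_iter):
        Sigma = np.linalg.inv(np.diag(alpha) + beta * Phi.T @ Phi)  # posterior covariance
        mu = beta * Sigma @ Phi.T @ t                               # posterior mean
        gamma = 1.0 - alpha * np.diag(Sigma)                        # "well-determined" measure per weight
        alpha = gamma / (mu**2 + 1e-12)                             # MacKay-style evidence update
    return mu, alpha

# Toy usage: only the first two of five columns actually generate the targets.
rng = np.random.RandomState(0)
Phi = rng.randn(100, 5)
t = Phi[:, 0] * 2.0 - Phi[:, 1] + 0.05 * rng.randn(100)
mu, alpha = ard_linear(Phi, t)
print(np.round(mu, 2), np.round(alpha, 1))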
6 0.76756358 61 nips-2006-Convex Repeated Games and Fenchel Duality
7 0.76487422 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation
8 0.76430345 118 nips-2006-Learning to Model Spatial Dependency: Semi-Supervised Discriminative Random Fields
9 0.76380432 79 nips-2006-Fast Iterative Kernel PCA
10 0.76212037 117 nips-2006-Learning on Graph with Laplacian Regularization
11 0.76165742 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming
12 0.75924855 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning
13 0.75893849 194 nips-2006-Towards a general independent subspace analysis
14 0.75892907 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation
15 0.75810164 19 nips-2006-Accelerated Variational Dirichlet Process Mixtures
16 0.75656044 21 nips-2006-AdaBoost is Consistent
17 0.75555021 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy
18 0.75522321 101 nips-2006-Isotonic Conditional Random Fields and Local Sentiment Flow
19 0.75520366 34 nips-2006-Approximate Correspondences in High Dimensions
20 0.75369394 193 nips-2006-Tighter PAC-Bayes Bounds