
51 nips-2009-Clustering sequence sets for motif discovery


Source: pdf

Author: Jong K. Kim, Seungjin Choi

Abstract: Most existing methods for DNA motif discovery consider only a single set of sequences to find an over-represented motif. In contrast, we consider multiple sets of sequences where we group sets associated with the same motif into a cluster, assuming that each set involves a single motif. Clustering sets of sequences yields clusters of coherent motifs, improving signal-to-noise ratio or enabling us to identify multiple motifs. We present a probabilistic model for DNA motif discovery where we identify multiple motifs through searching for patterns which are shared across multiple sets of sequences. Our model infers cluster-indicating latent variables and learns motifs simultaneously, where these two tasks interact with each other. We show that our model can handle various motif discovery problems, depending on how to construct multiple sets of sequences. Experiments on three different problems for discovering DNA motifs emphasize the useful behavior and confirm the substantial gains over existing methods where only a single set of sequences is considered.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Most existing methods for DNA motif discovery consider only a single set of sequences to find an over-represented motif. [sent-3, score-0.758]

2 In contrast, we consider multiple sets of sequences where we group sets associated with the same motif into a cluster, assuming that each set involves a single motif. [sent-4, score-0.8]

3 Clustering sets of sequences yields clusters of coherent motifs, improving signal-to-noise ratio or enabling us to identify multiple motifs. [sent-5, score-0.376]

4 We present a probabilistic model for DNA motif discovery where we identify multiple motifs through searching for patterns which are shared across multiple sets of sequences. [sent-6, score-1.147]

5 Our model infers cluster-indicating latent variables and learns motifs simultaneously, where these two tasks interact with each other. [sent-7, score-0.544]

6 We show that our model can handle various motif discovery problems, depending on how to construct multiple sets of sequences. [sent-8, score-0.611]

7 Experiments on three different problems for discovering DNA motifs emphasize the useful behavior and confirm the substantial gains over existing methods where only a single set of sequences is considered. [sent-9, score-0.756]

8 1 Introduction Discovering how DNA-binding proteins called transcription factors (TFs) regulate gene expression programs in living cells is fundamental to understanding transcriptional regulatory networks controlling development, cancer, and many human diseases. [sent-10, score-0.167]

9 TFs that bind to specific cis-regulatory elements in DNA sequences are essential for mediating this transcriptional control. [sent-11, score-0.25]

10 The first step toward deciphering this complex network is to identify functional binding sites of TFs referred to as motifs. [sent-12, score-0.402]

11 We address the problem of discovering sequence motifs that are enriched in a given target set of sequences, compared to a background model (or a set of background sequences). [sent-13, score-0.793]

12 There has been extensive research on statistical modeling of this problem (see [1] for a review), and recent works have focused on improving the motif-finding performance by integrating additional information into comparative [2] and discriminative motif discovery [3]. [sent-14, score-0.569]

13 Despite the relatively long history and the critical roles of motif discovery in bioinformatics, many issues are still unsolved and controversial. [sent-15, score-0.543]

14 First, the target set of sequences is assumed to have only one motif, but this assumption is often incorrect. [sent-16, score-0.264]

15 For example, a recent study examining the binding specificities of 104 mouse TFs observed that nearly half of the TFs recognize multiple sequence motifs [4]. [sent-17, score-0.835]

16 Second, it is unclear how to select the target set on which over-represented motifs are returned. [sent-18, score-0.55]

17 The target set of sequences is often constructed from genome-wide binding location data (ChIP-chip or ChIP-seq) or gene expression microarray data. [sent-19, score-0.532]

18 Third, a unified algorithm which is applicable to diverse motif discovery problems is sorely needed to provide a principled framework for developing more complex models. [sent-21, score-0.543]

19 [Figure 1: Notation illustration. Sequence sets $S_1, \ldots, S_M$; each set $S_m$ contains sequences $s_{m,1}, \ldots, s_{m,L_m}$; each start position $j$ carries a binding-site indicator $z_{m,ij} = [0,1]^\top$ and a width-$W$ subsequence $s^W_{m,ij} = (s_{m,ij}, s_{m,i(j+1)}, \ldots, s_{m,i(j+W-1)})$.] [sent-22, score-0.291]

20 These considerations motivate us to develop a generative probabilistic framework for learning multiple motifs on multiple sets of sequences. [sent-23, score-0.622]

21 One can view our framework as an extension of the classic sequence models such as the two-component mixture (TCM) [5] and the zero or one occurrence per sequence (ZOOPS) [6] models in which sequences are partitioned into two clusters, depending on whether or not they contain a motif. [sent-24, score-0.44]

22 In this paper, we make use of a finite mixture model to partition the multiple sequence sets into clusters having distinct sequence motifs, which improves the motif-finding performance over the classic models by enhancing the signal-to-noise ratio of input sequences. [sent-25, score-0.408]

23 We also show how our algorithm can be applied to three different problems by simply changing the way of constructing multiple sets from input sequences, without any algorithmic modifications. [sent-26, score-0.283]

24 2 Problem formulation We are given $M$ sets of DNA sequences $\mathcal{S} = \{S_1, \ldots, S_M\}$. [sent-27, score-0.248]

25 These sets are to be grouped according to the type of motif involved, in which each set is associated with only a single motif but multiple binding sites are present in each sequence. [sent-30, score-1.405]

26 To allow for a variable number of binding sites per sequence, we represent each sequence $s_{m,i}$ as a set of overlapping subsequences $s^W_{m,ij} = (s_{m,ij}, s_{m,i(j+1)}, \ldots, s_{m,i(j+W-1)})$. [sent-35, score-0.52]
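
As an illustration of this representation, the following minimal Python sketch (not the authors' code; the function name is ours) enumerates the overlapping width-W subsequences of one sequence, one candidate binding site per start position.

```python
# Minimal sketch (not the authors' code): enumerate the overlapping width-W
# subsequences s^W_{m,ij} of a single sequence, one per start position j.
def extract_subsequences(sequence, W):
    """Return every overlapping subsequence of length W (candidate binding sites)."""
    return [sequence[j:j + W] for j in range(len(sequence) - W + 1)]

# Example: a width-3 window over a short sequence.
print(extract_subsequences("ACGTAC", 3))  # ['ACG', 'CGT', 'GTA', 'TAC']
```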

27 We introduce a latent variable matrix $z_{m,i} \in \mathbb{R}^{2 \times |I_{m,i}|}$ in which the $j$th column vector $z_{m,ij}$ is a 2-dimensional binary random vector $[z_{m,ij1}, z_{m,ij2}]^\top$ such that $z_{m,ij} = [0, 1]^\top$ if a binding site starts at position $j \in I_{m,i}$, and $z_{m,ij} = [1, 0]^\top$ otherwise. [sent-43, score-0.346]

28 We also introduce set-level cluster indicators $t_m$ for $m = 1, \ldots, M$, which involve partitioning the sequence sets $\mathcal{S}$ into $K$ disjoint clusters, where sets in the same cluster are associated with the same common motif. [sent-47, score-0.17]

29 For a motif model, we use a position-frequency matrix whose entries correspond to probability distributions (over the alphabet Σ) of each position within a binding site. [sent-48, score-0.771]

30 We denote by $\Theta_k \in \mathbb{R}^{W \times 4}$ the $k$th motif model of length $W$ over $\Sigma$, where $\Theta_{k,w}$ represents row $w$; each entry is non-negative, $\Theta_{k,wl} \geq 0$ for all $w, l$, and $\sum_{l=1}^{4} \Theta_{k,wl} = 1$ for all $w$. [sent-49, score-0.508]
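
A small sketch of what such a position-frequency matrix looks like in code, assuming a fixed A/C/G/T column order; the helper names and the random initialization are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Illustrative sketch of a motif model Theta_k: a W x 4 position-frequency matrix
# whose rows are probability distributions over the alphabet {A, C, G, T}.
ALPHABET = {"A": 0, "C": 1, "G": 2, "T": 3}

def random_pfm(W, rng=np.random.default_rng(0)):
    """Draw a W x 4 matrix with non-negative entries whose rows sum to 1."""
    theta = rng.gamma(shape=1.0, size=(W, 4))
    return theta / theta.sum(axis=1, keepdims=True)

def motif_log_prob(subseq, theta):
    """log p(subsequence | Theta_k): product over positions of the column frequencies."""
    return float(sum(np.log(theta[w, ALPHABET[c]]) for w, c in enumerate(subseq)))

theta = random_pfm(W=8)
print(motif_log_prob("ACGTACGT", theta))
```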

31 Our goal is to construct a probabilistic model for DNA motif discovery where we identify multiple motifs through searching for patterns which are shared across multiple sets of sequences. [sent-51, score-1.147]

32 Our model infers cluster-indicating latent variables (to find a good partition of S) and learns motifs (inferring binding site-indicating latent variables zm,i ) simultaneously, where these two tasks interact with each other. [sent-52, score-0.84]

33 [Figure 2: Graphical representation of our mixture model for $M$ sequence sets, showing the variables $\pi$, $z_{m,ij}$, $\theta_0$, $s^W_{m,ij}$, $\beta$, $\Theta_k$, $t_m$, $\alpha$, $v$ and plates over $|I_{m,i}|$, $L_m$, $K$, and $M$.] [sent-53, score-0.278]

34 3 Mixture model for motif discovery We assume that the distribution of S is modeled as a mixture of K components, where it is not known in advance which mixture component underlies a particular set of sequences. [sent-54, score-0.663]

35 We also assume that the conditional distribution of the subsequence $s^W_{m,ij}$ given $t_m$ is modeled as a mixture of two components, each of which corresponds to the motif and the background models, respectively. [sent-55, score-0.698]

36 Then, the joint distribution of the observed sequence sets $\mathcal{S}$ and (unobserved) latent variables $Z$ and $T$ conditioned on parameters $\Phi$ is written as:
$$p(\mathcal{S}, Z, T \mid \Phi) = \prod_{m=1}^{M} p(t_m \mid \Phi) \prod_{i=1}^{L_m} \prod_{j \in I_{m,i}} p(s^W_{m,ij} \mid z_{m,ij}, t_m, \Phi)\, p(z_{m,ij} \mid \Phi), \qquad (1)$$
where $Z = \{z_{m,ij}\}$ and $T = \{t_m\}$. [sent-56, score-0.249]
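
The factorization in Eq. (1) corresponds to a simple generative story: draw a cluster indicator for each set, then emit each candidate window from either that cluster's motif model or the background. The sketch below simulates a toy version of that story with a zero-order background and non-overlapping sites; these simplifications and all names are ours, for intuition only, not the paper's implementation.

```python
import numpy as np

# Toy generative sketch of Eq. (1) for one sequence set: each width-W window is a
# binding site with probability pi (z_{m,ij2} = 1) and is then emitted from the
# cluster's motif model, otherwise from a zero-order background.
rng = np.random.default_rng(1)
BASES = list("ACGT")

def sample_set(theta_k, theta0, pi, L, seq_len, W):
    sequences = []
    for _ in range(L):
        seq = list(rng.choice(BASES, size=seq_len, p=theta0))
        for j in range(0, seq_len - W + 1, W):
            if rng.random() < pi:  # plant a binding site starting at position j
                seq[j:j + W] = [rng.choice(BASES, p=theta_k[w]) for w in range(W)]
        sequences.append("".join(seq))
    return sequences

W = 4
theta_k = np.full((W, 4), 0.05)
np.fill_diagonal(theta_k, 0.85)          # a sharp toy motif resembling "ACGT"
theta0 = np.full(4, 0.25)                # uniform background frequencies
print(sample_set(theta_k, theta0, pi=0.3, L=2, seq_len=20, W=W))
```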

37 The chosen $k$th motif model $\Theta_k$ is drawn from a product of Dirichlet distributions:
$$p(\Theta_k \mid \beta) = \prod_{w=1}^{W} p(\Theta_{k,w} \mid \beta) \propto \prod_{w=1}^{W} \prod_{l=1}^{4} \Theta_{k,wl}^{\beta_l - 1}, \qquad (3)$$
where $\beta = [\beta_1, \ldots, \beta_4]$. [sent-68, score-0.504]

38 The latent variables $z_{m,ij}$ indicating the starting positions of binding sites are governed by the prior distribution:
$$p(z_{m,ij} \mid \pi) = \prod_{r=1}^{2} \pi_r^{z_{m,ijr}}, \qquad (4)$$
where the mixture weights $\pi = [\pi_1, \pi_2]$ satisfy $\pi_1, \pi_2 \geq 0$ and $\pi_1 + \pi_2 = 1$. [sent-72, score-0.523]

39 First, the width W of the motif model and the number K of set clusters are assumed to be known and fixed. [sent-76, score-0.579]

40 Extension to double stranded DNA sequences is obvious and omitted here due to the lack of space. [sent-80, score-0.215]

41 Our model builds upon the existing TCM model proposed by [5] where the EM algorithm is applied to learn a motif on a single target set. [sent-81, score-0.533]

42 This model actually generates subsequences instead of sequences themselves. [sent-82, score-0.265]

43 An alternative model which explicitly generates sequences has been proposed based on Gibbs sampling [7, 8]. [sent-83, score-0.235]

44 The main difference is that they focus on clustering motifs already discovered, and in our formulation, we try to cluster sequence sets and discover motifs simultaneously. [sent-86, score-1.191]

45 We will derive a Gibbs sampler for our generative model in which the set mixture weights $v$ and motif models $\{\Theta_k\}_{k=1}^{K}$ are integrated out to improve the convergence rate and the cost per iteration [8]. [sent-89, score-0.607]

46 The first term represents the predictive distribution of $t_m$ given the other set cluster assignments $T_{\backslash m}$, and is given by marginalizing the set mixture weights $v$:
$$p(t_{m,k} = 1 \mid T_{\backslash m}, \alpha) = \int p(t_{m,k} = 1 \mid v)\, p(v \mid T_{\backslash m}, \alpha)\, dv = \frac{N_k^{-m} + \alpha_k}{M - 1 + \sum_{k=1}^{K} \alpha_k}, \qquad (8)$$
where $N_k^{-m} = \sum_{n \neq m} \delta(t_{n,k}, 1)$. [sent-94, score-0.219]
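
Equation (8) is the standard Dirichlet-multinomial predictive for a finite mixture. A compact sketch with variable names of our own choosing:

```python
import numpy as np

# Sketch of Eq. (8): the predictive probability of assigning set m to cluster k
# after integrating out the mixture weights v.
def cluster_predictive(T_minus_m, alpha):
    """T_minus_m: (M-1) x K one-hot matrix of the other sets' cluster assignments."""
    counts = T_minus_m.sum(axis=0)                      # N_k^{-m}
    return (counts + alpha) / (counts.sum() + alpha.sum())

# Toy check with K = 3 clusters and four other sequence sets.
T_other = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]])
print(cluster_predictive(T_other, alpha=np.ones(3)))    # [3/7, 2/7, 2/7]
```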

47 Note that $N_{wl}^{-m}$ counts the number of occurrences of letter $l$ at position $w$ within currently assigned binding sites, excluding those of the $m$th set. [sent-97, score-0.496]

48 Similarly, $N_{wl}^{m}$ denotes the number of occurrences of letter $l$ at position $w$ within binding sites of the $m$th set. [sent-98, score-0.265]

49 Note that $N_{wl}$ counts the number of occurrences of letter $l$ at position $w$ within currently assigned binding sites other than $z_{m,ij}$. [sent-102, score-0.47]
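
These counts enter the collapsed sampler through position-wise Dirichlet-multinomial predictives. The sketch below shows the form such a predictive would be expected to take, combining the counts N_{wl} with the pseudo-counts beta; the exact expression used in the paper is not reproduced in this summary, so treat this as an assumption.

```python
import numpy as np

# Assumed form of the collapsed predictive for a width-W subsequence under the
# motif component, built from the letter counts N_{wl} and pseudo-counts beta.
ALPHABET = {"A": 0, "C": 1, "G": 2, "T": 3}

def collapsed_motif_log_prob(subseq, counts, beta):
    """Position-wise Dirichlet-multinomial predictive:
    prod_w (N_{w,l(w)} + beta_{l(w)}) / (sum_l N_{wl} + sum_l beta_l)."""
    logp = 0.0
    for w, c in enumerate(subseq):
        l = ALPHABET[c]
        logp += np.log((counts[w, l] + beta[l]) / (counts[w].sum() + beta.sum()))
    return logp

counts = np.zeros((8, 4))             # N_{wl}: current letter counts per motif position
counts[:, ALPHABET["A"]] = 10.0       # pretend ten A-rich sites are currently assigned
beta = 0.5 * np.ones(4)
print(collapsed_motif_log_prob("AAAAAAAA", counts, beta))
```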

50 5 Results We evaluated our motif-finding algorithm on the three different tasks: (1) filtering out undesirable noisy sequences, (2) incorporating evolutionary conservation information, and (3) clustering DNA sequences based on the learned motifs (Fig. [sent-105, score-0.938]

51 1 Data sets and evaluation criteria We first examined the yeast ChIP-chip data published by [10] to investigate the effect of filtering out noisy sequences from input sequences on identifying true binding sites. [sent-110, score-0.842]

52 We compiled 156 sequence-sets by choosing TFs having consensus motifs in the literature [11]. [sent-111, score-0.549]

53 For each sequence-set, we defined its sequences to be probe sequences that are bound with P -value ≤ 0. [sent-112, score-0.448]

54 [Figure 3: Three different ways of constructing multiple sequence sets: (a) filtering out noisy sequences, (b) evolutionary conservation, (c) motif-based clustering.] [sent-114, score-0.49]

55 To apply our algorithm to the comparative motif discovery problem, we compiled orthologous sequences for each probe sequence of the yeast ChIP-chip data based on the multiple alignments of seven species of Saccharomyces (S. [sent-116, score-1.167]

56 In the experiments using the ChIP-chip data, the motif width was set to 8 and a fifth-order Markov chain estimated from the whole yeast intergenic sequences was used to describe the background model. [sent-124, score-0.892]
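
For intuition, a k-th order Markov background of this kind could be estimated and used to score a candidate window roughly as follows; this is a hedged sketch with our own helper names and simple pseudo-count smoothing, not the paper's estimator.

```python
from collections import defaultdict
import math

# Hedged sketch of a k-th order Markov background model.
def fit_markov_background(sequences, order=5, pseudo=1.0):
    """Count each base given its preceding `order` bases; return a log-prob function."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1.0

    def log_prob(context, base):
        ctx = counts[context]
        return math.log((ctx[base] + pseudo) / (sum(ctx.values()) + 4.0 * pseudo))

    return log_prob

def background_log_prob(subseq, preceding, log_prob, order=5):
    """Score a candidate window given the `order` bases that precede it."""
    full = preceding + subseq
    return sum(log_prob(full[i - order:i], full[i]) for i in range(order, len(full)))

bg = fit_markov_background(["ACGTACGTACGTACGT", "TTTTACGTACGTAAAA"], order=5)
print(background_log_prob("ACGTACGT", "TACGT", bg, order=5))
```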

57 We next constructed the ChIP-seq data for human neuron-restrictive silencer factor (NRSF) to determine whether our algorithm can be applied to partition DNA sequences into biologically meaningful clusters [13]. [sent-127, score-0.347]

58 The data consist of 200 sequence segments of length 100 from all peak sites with the top 10% binding intensity (≥ 500 ChIP-seq reads), where most sequences have canonical NRSFbinding sites. [sent-128, score-0.724]

59 We also added 13 sequence segments extracted from peak sites (≥ 300 reads) known to have noncanonical NRSF-binding sites, resulting in 213 sequences. [sent-129, score-0.305]

60 In the experiment using the ChIP-seq data, the motif width was set to 30 and a zero-order Markov chain estimated from the 213 sequence segments was used to describe the background model. [sent-130, score-0.66]

61 In the experiments using the yeast ChIP-chip data, we used the inter-motif distance to measure the quality of discovered motifs [10]. [sent-133, score-0.636]

62 Specifically, an algorithm will be called successful on a sequence set only if at least one of the position-frequency matrices constructed from the identified binding sites is at a distance less than 0. [sent-134, score-0.488]
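
The precise inter-motif distance and threshold are not reproduced here (the threshold is truncated above), but a common choice is a column-averaged Euclidean distance between position-frequency matrices, scaled to [0, 1]. The sketch below implements that choice as an assumption and omits the offset and reverse-complement search typically used in this evaluation.

```python
import numpy as np

# Assumed inter-motif distance: average per-column Euclidean distance between two
# W x 4 position-frequency matrices, scaled to lie in [0, 1].
def intermotif_distance(pfm_a, pfm_b):
    assert pfm_a.shape == pfm_b.shape
    per_column = np.linalg.norm(pfm_a - pfm_b, axis=1) / np.sqrt(2.0)
    return float(per_column.mean())

uniform = np.full((8, 4), 0.25)
print(intermotif_distance(uniform, uniform))   # 0.0 for identical motifs
```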

63 2 Filtering out noisy sequences Selecting target sequences from the ChIP-chip measurements is largely left to users and this choice is often unclear. [sent-137, score-0.522]

64 Our strategy of constructing sequence-sets based on the binding P-value cutoff risks including many irrelevant sequences. [sent-138, score-0.231]

65 In practice, the inclusion of noisy sequences in the target set is a serious obstacle in the success of motif discovery. [sent-139, score-0.791]

66 One possible solution is to cluster input sequences into two smaller sets of target and noisy sequences based on sequence similarity, and predict motifs from the clustered target sequences with the improved signal-to-noise ratio. [sent-140, score-1.424]

67 This two-step approach has been applied to only protein sequences because DNA sequences do not share much similarity for effective clustering [15]. [sent-141, score-0.482]

68 To this end, we constructed multiple sets by treating each sequence of a particular yeast ChIP-chip sequence-set as one set (Fig. [sent-143, score-0.259]

69 We examined the ability of our algorithm to find a correct motif with two different numbers of clusters: K = 1 (without filtering) and K = 2 (clustering into two subsets of true and noisy sequences). [sent-145, score-0.527]

70 Note that the ZOOPS or TCM models can also handle noisy sequences by modeling them with only a background model [5, 6]. [sent-148, score-0.307]

71 But we allow noisy sequences to have a decoy motif (randomly occurring sequence patterns or repeating elements) which is modeled with a motif model. [Figure 4: Effect of filtering out noisy sequences on the number of successfully identified motifs on the yeast ChIP-chip data.] [sent-149, score-1.674]

73 Because our model can be reduced to these classic models by setting K = 1, we concluded that noisy sequences were better represented by our clustering approach than the previous ones using the background model (Fig. [sent-152, score-0.388]

74 We expected that our model would perform better than these four methods because they try to remove noisy sequences based on the classic models. [sent-157, score-0.287]

75 Second, we also compared our model with DRIM specifically designed to dynamically select the target set from the list of sorted sequences according to the binding P -values of ChIP-chip measurements. [sent-160, score-0.495]

76 Because DRIM does not produce any motifs when they are not statistically enriched at the top of the ranked list, we counted the number of successfully identified motifs on the sequence-sets where DRIM generated significant motifs. [sent-162, score-1.039]

77 3 Detecting evolutionarily conserved motifs The comparative approach using evolutionary conservation information has been widely used to improve the performance of motif-finding algorithms because functional TF binding sites are likely to be conserved in orthologous sequences. [sent-165, score-1.327]

78 To incorporate conservation information into our clustering framework, orthologous sequences of each sequence of a particular yeast ChIP-chip sequence-set were considered as one set and the number of clusters was set to 2 (Fig. [sent-166, score-0.684]

79 The constructed sets contain at most 7 sequences because we only used seven species of Saccharomyces. [sent-168, score-0.286]

80 We used the single result with the highest objective function value of (11) among five runs and compared it with the results of five conservation-based motif finding algorithms on the same data set: MEME c [10], PhyloCon [19], PhyMe [20], PhyloGibbs [21], PRIORITY-C [11]. [sent-169, score-0.484]

81 Table 1 presents the motif-finding performance in terms of the number of correctly identified motifs for each algorithm. [sent-172, score-0.501]

82 We see that our algorithm greatly outperforms the four alignment-based methods which rely on multiple or pair-wise alignments of orthologous sequences to search for motifs that are conserved across the aligned blocks of orthologous sequences. [sent-173, score-1.037]

83 In our opinion, it is because diverged regions other than the short conserved binding sites may prevent a correct alignment. [sent-174, score-0.479]

84 The two motifs correspond directly to the previously known motifs (canonical and non-canonical NRSF motifs). [sent-187, score-1.002]

85 However, other motif-finding algorithms such as MEME could not return the noncanonical motif enriched in a very small set of sequences. [sent-188, score-0.567]

86 These observations suggest that our motif-driven clustering approach is effective at inferring latent clusters of DNA sequences and can be used to find unexpected novel motifs. [sent-189, score-0.384]

87 6 Conclusions In this paper, we have presented a generative probabilistic framework for DNA motif discovery using multiple sets of sequences where we cluster DNA sequences and learn motifs interactively. [sent-190, score-1.596]

88 We have presented a finite mixture model with two different types of latent variables, in which one is associated with cluster-indicators and the other corresponds to motifs (transcription factor binding sites). [sent-191, score-0.835]

89 Our empirical results show that the proposed method can be applied to various motif discovery problems, depending on how to construct the multiple sets. [sent-193, score-0.578]

90 For example, it would be interesting to examine the possibility of learning the number of clusters from data based on Dirichlet process mixture models, or to extend our probabilistic framework for discriminative motif discovery. [sent-195, score-0.618]

91 Acknowledgments: We thank Raluca Gordân for providing the literature consensus motifs and the script to compute the inter-motif distance. [sent-196, score-0.528]

92 Fitting a mixture model by expectation maximization to discover motifs in biopolymers. [sent-253, score-0.561]

93 The value of prior knowledge in discovering motifs with MEME. [sent-259, score-0.541]

94 A fast, alignment-free, conservation-based method for transcription factor binding site discovery. [sent-336, score-0.299]

95 Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. [sent-362, score-0.182]

96 igibbs: improving gibbs motif sampler for proteins by sequence clustering and iterative pattern sampling. [sent-384, score-0.714]

97 Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. [sent-395, score-0.771]

98 An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. [sent-404, score-0.421]

99 PhyME: a probabilistic algorithm for finding motifs in sets of orthologous sequences. [sent-423, score-0.627]

100 PhyloGibbs: a gibbs sampling motif finder that incorporates phylogeny. [sent-430, score-0.558]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('motifs', 0.501), ('motif', 0.484), ('nwl', 0.309), ('binding', 0.231), ('sequences', 0.215), ('dna', 0.187), ('sites', 0.171), ('sw', 0.141), ('nrsf', 0.108), ('tm', 0.105), ('yeast', 0.105), ('sm', 0.097), ('meme', 0.093), ('orthologous', 0.093), ('tfs', 0.093), ('lm', 0.084), ('conservation', 0.077), ('conserved', 0.077), ('drim', 0.077), ('clusters', 0.074), ('ec', 0.07), ('sequence', 0.068), ('mixture', 0.06), ('discovery', 0.059), ('regulatory', 0.055), ('tcm', 0.054), ('gibbs', 0.054), ('clustering', 0.052), ('evolutionary', 0.05), ('subsequences', 0.05), ('background', 0.049), ('target', 0.049), ('transcription', 0.048), ('ltering', 0.047), ('alignace', 0.046), ('mdscan', 0.046), ('noncanonical', 0.046), ('phylogibbs', 0.046), ('phyme', 0.046), ('zm', 0.045), ('biology', 0.045), ('noisy', 0.043), ('latent', 0.043), ('letter', 0.04), ('molecular', 0.04), ('discovering', 0.04), ('enriched', 0.037), ('nc', 0.036), ('cluster', 0.036), ('multiple', 0.035), ('nk', 0.035), ('transcriptional', 0.035), ('bioinformatics', 0.034), ('sets', 0.033), ('rectangles', 0.031), ('gordan', 0.031), ('narlikar', 0.031), ('neuwald', 0.031), ('phylocon', 0.031), ('pohang', 0.031), ('zoops', 0.031), ('discovered', 0.03), ('successes', 0.03), ('classic', 0.029), ('proteins', 0.029), ('alphabet', 0.028), ('position', 0.028), ('consensus', 0.027), ('biotechnology', 0.027), ('hughes', 0.027), ('sampler', 0.027), ('mth', 0.026), ('comparative', 0.026), ('reads', 0.025), ('bailey', 0.025), ('korea', 0.025), ('th', 0.024), ('ln', 0.023), ('alignments', 0.023), ('partition', 0.022), ('width', 0.021), ('compiled', 0.021), ('nding', 0.021), ('sampling', 0.02), ('segments', 0.02), ('site', 0.02), ('species', 0.02), ('initializations', 0.02), ('reported', 0.02), ('identi', 0.02), ('microarray', 0.019), ('ratio', 0.019), ('canonical', 0.019), ('filtering', 0.019), ('generative', 0.018), ('constructed', 0.018), ('weights', 0.018), ('chain', 0.018), ('probe', 0.018), ('biologically', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 51 nips-2009-Clustering sequence sets for motif discovery

Author: Jong K. Kim, Seungjin Choi

Abstract: Most of existing methods for DNA motif discovery consider only a single set of sequences to find an over-represented motif. In contrast, we consider multiple sets of sequences where we group sets associated with the same motif into a cluster, assuming that each set involves a single motif. Clustering sets of sequences yields clusters of coherent motifs, improving signal-to-noise ratio or enabling us to identify multiple motifs. We present a probabilistic model for DNA motif discovery where we identify multiple motifs through searching for patterns which are shared across multiple sets of sequences. Our model infers cluster-indicating latent variables and learns motifs simultaneously, where these two tasks interact with each other. We show that our model can handle various motif discovery problems, depending on how to construct multiple sets of sequences. Experiments on three different problems for discovering DNA motifs emphasize the useful behavior and confirm the substantial gains over existing methods where only a single set of sequences is considered.

2 0.068379164 168 nips-2009-Non-stationary continuous dynamic Bayesian networks

Author: Marco Grzegorczyk, Dirk Husmeier

Abstract: Dynamic Bayesian networks have been applied widely to reconstruct the structure of regulatory processes from time series data. The standard approach is based on the assumption of a homogeneous Markov chain, which is not valid in many realworld scenarios. Recent research efforts addressing this shortcoming have considered undirected graphs, directed graphs for discretized data, or over-flexible models that lack any information sharing among time series segments. In the present article, we propose a non-stationary dynamic Bayesian network for continuous data, in which parameters are allowed to vary among segments, and in which a common network structure provides essential information sharing across segments. Our model is based on a Bayesian multiple change-point process, where the number and location of the change-points is sampled from the posterior distribution. 1

3 0.061483085 246 nips-2009-Time-Varying Dynamic Bayesian Networks

Author: Le Song, Mladen Kolar, Eric P. Xing

Abstract: Directed graphical models such as Bayesian networks are a favored formalism for modeling the dependency structures in complex multivariate systems such as those encountered in biology and neural science. When a system is undergoing dynamic transformation, temporally rewiring networks are needed for capturing the dynamic causal influences between covariates. In this paper, we propose time-varying dynamic Bayesian networks (TV-DBN) for modeling the structurally varying directed dependency structures underlying non-stationary biological/neural time series. This is a challenging problem due the non-stationarity and sample scarcity of time series data. We present a kernel reweighted 1 -regularized auto-regressive procedure for this problem which enjoys nice properties such as computational efficiency and provable asymptotic consistency. To our knowledge, this is the first practical and statistically sound method for structure learning of TVDBNs. We applied TV-DBNs to time series measurements during yeast cell cycle and brain response to visual stimuli. In both cases, TV-DBNs reveal interesting dynamics underlying the respective biological systems. 1

4 0.048636641 97 nips-2009-Free energy score space

Author: Alessandro Perina, Marco Cristani, Umberto Castellani, Vittorio Murino, Nebojsa Jojic

Abstract: A score function induced by a generative model of the data can provide a feature vector of a fixed dimension for each data sample. Data samples themselves may be of differing lengths (e.g., speech segments, or other sequence data), but as a score function is based on the properties of the data generation process, it produces a fixed-length vector in a highly informative space, typically referred to as a “score space”. Discriminative classifiers have been shown to achieve higher performance in appropriately chosen score spaces than is achievable by either the corresponding generative likelihood-based classifiers, or the discriminative classifiers using standard feature extractors. In this paper, we present a novel score space that exploits the free energy associated with a generative model. The resulting free energy score space (FESS) takes into account latent structure of the data at various levels, and can be trivially shown to lead to classification performance that at least matches the performance of the free energy classifier based on the same generative model, and the same factorization of the posterior. We also show that in several typical vision and computational biology applications the classifiers optimized in FESS outperform the corresponding pure generative approaches, as well as a number of previous approaches to combining discriminating and generative models.

5 0.046336353 226 nips-2009-Spatial Normalized Gamma Processes

Author: Vinayak Rao, Yee W. Teh

Abstract: Dependent Dirichlet processes (DPs) are dependent sets of random measures, each being marginally DP distributed. They are used in Bayesian nonparametric models when the usual exchangeability assumption does not hold. We propose a simple and general framework to construct dependent DPs by marginalizing and normalizing a single gamma process over an extended space. The result is a set of DPs, each associated with a point in a space such that neighbouring DPs are more dependent. We describe Markov chain Monte Carlo inference involving Gibbs sampling and three different Metropolis-Hastings proposals to speed up convergence. We report an empirical study of convergence on a synthetic dataset and demonstrate an application of the model to topic modeling through time. 1

6 0.045281231 57 nips-2009-Conditional Random Fields with High-Order Features for Sequence Labeling

7 0.04389086 58 nips-2009-Constructing Topological Maps using Markov Random Fields and Loop-Closure Detection

8 0.043306645 9 nips-2009-A Game-Theoretic Approach to Hypergraph Clustering

9 0.037695687 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs

10 0.036187757 155 nips-2009-Modelling Relational Data using Bayesian Clustered Tensor Factorization

11 0.035515498 217 nips-2009-Sharing Features among Dynamical Systems with Beta Processes

12 0.033916697 243 nips-2009-The Ordered Residual Kernel for Robust Motion Subspace Clustering

13 0.033837318 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models

14 0.033750299 102 nips-2009-Graph-based Consensus Maximization among Multiple Supervised and Unsupervised Models

15 0.031875659 80 nips-2009-Efficient and Accurate Lp-Norm Multiple Kernel Learning

16 0.031846944 5 nips-2009-A Bayesian Model for Simultaneous Image Clustering, Annotation and Object Segmentation

17 0.031334363 133 nips-2009-Learning models of object structure

18 0.031312283 4 nips-2009-A Bayesian Analysis of Dynamics in Free Recall

19 0.031302389 234 nips-2009-Streaming k-means approximation

20 0.030885706 123 nips-2009-Large Scale Nonparametric Bayesian Inference: Data Parallelisation in the Indian Buffet Process


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.098), (1, -0.033), (2, -0.012), (3, -0.047), (4, 0.029), (5, -0.049), (6, 0.018), (7, -0.005), (8, -0.008), (9, -0.009), (10, -0.015), (11, -0.03), (12, -0.0), (13, -0.034), (14, -0.049), (15, -0.005), (16, 0.002), (17, -0.049), (18, -0.042), (19, 0.001), (20, 0.003), (21, 0.008), (22, 0.023), (23, -0.026), (24, -0.003), (25, -0.007), (26, 0.038), (27, 0.069), (28, 0.045), (29, 0.009), (30, 0.046), (31, 0.025), (32, 0.033), (33, -0.034), (34, 0.035), (35, 0.008), (36, 0.026), (37, 0.031), (38, 0.035), (39, -0.04), (40, -0.071), (41, -0.042), (42, -0.027), (43, 0.113), (44, 0.026), (45, -0.014), (46, 0.069), (47, -0.129), (48, 0.169), (49, -0.059)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91368246 51 nips-2009-Clustering sequence sets for motif discovery

Author: Jong K. Kim, Seungjin Choi

Abstract: Most of existing methods for DNA motif discovery consider only a single set of sequences to find an over-represented motif. In contrast, we consider multiple sets of sequences where we group sets associated with the same motif into a cluster, assuming that each set involves a single motif. Clustering sets of sequences yields clusters of coherent motifs, improving signal-to-noise ratio or enabling us to identify multiple motifs. We present a probabilistic model for DNA motif discovery where we identify multiple motifs through searching for patterns which are shared across multiple sets of sequences. Our model infers cluster-indicating latent variables and learns motifs simultaneously, where these two tasks interact with each other. We show that our model can handle various motif discovery problems, depending on how to construct multiple sets of sequences. Experiments on three different problems for discovering DNA motifs emphasize the useful behavior and confirm the substantial gains over existing methods where only a single set of sequences is considered.

2 0.59688461 168 nips-2009-Non-stationary continuous dynamic Bayesian networks

Author: Marco Grzegorczyk, Dirk Husmeier

Abstract: Dynamic Bayesian networks have been applied widely to reconstruct the structure of regulatory processes from time series data. The standard approach is based on the assumption of a homogeneous Markov chain, which is not valid in many realworld scenarios. Recent research efforts addressing this shortcoming have considered undirected graphs, directed graphs for discretized data, or over-flexible models that lack any information sharing among time series segments. In the present article, we propose a non-stationary dynamic Bayesian network for continuous data, in which parameters are allowed to vary among segments, and in which a common network structure provides essential information sharing across segments. Our model is based on a Bayesian multiple change-point process, where the number and location of the change-points is sampled from the posterior distribution. 1

3 0.49386156 226 nips-2009-Spatial Normalized Gamma Processes

Author: Vinayak Rao, Yee W. Teh

Abstract: Dependent Dirichlet processes (DPs) are dependent sets of random measures, each being marginally DP distributed. They are used in Bayesian nonparametric models when the usual exchangeability assumption does not hold. We propose a simple and general framework to construct dependent DPs by marginalizing and normalizing a single gamma process over an extended space. The result is a set of DPs, each associated with a point in a space such that neighbouring DPs are more dependent. We describe Markov chain Monte Carlo inference involving Gibbs sampling and three different Metropolis-Hastings proposals to speed up convergence. We report an empirical study of convergence on a synthetic dataset and demonstrate an application of the model to topic modeling through time. 1

4 0.47812462 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs

Author: Francois Caron, Arnaud Doucet

Abstract: Over recent years Dirichlet processes and the associated Chinese restaurant process (CRP) have found many applications in clustering while the Indian buffet process (IBP) is increasingly used to describe latent feature models. These models are attractive because they ensure exchangeability (over samples). We propose here extensions of these models where the dependency between samples is given by a known decomposable graph. These models have appealing properties and can be easily learned using Monte Carlo techniques. 1 Motivation The CRP and IBP have found numerous applications in machine learning over recent years [5, 10]. We consider here the case where the data we are interested in are ‘locally’ dependent; these dependencies being represented by a known graph G where each data point/object is associated to a vertex. These local dependencies can correspond to any conceptual or real (e.g. space, time) metric. For example, in the context of clustering, we might want to propose a prior distribution on partitions enforcing that data which are ‘close’ in the graph are more likely to be in the same cluster. Similarly, in the context of latent feature models, we might be interested in a prior distribution on features enforcing that data which are ‘close’ in the graph are more likely to possess similar features. The ‘standard’ CRP and IBP correspond to the case where the graph G is complete; that is it is fully connected. In this paper, we generalize the CRP and IBP to decomposable graphs. The resulting generalized versions of the CRP and IBP enjoy attractive properties. Each clique of the graph follows marginally a CRP or an IBP process and explicit expressions for the joint prior distribution on the graph is available. It makes it easy to learn those models using straightforward generalizations of Markov chain Monte Carlo (MCMC) or Sequential Monte Carlo (SMC) algorithms proposed to perform inference for the CRP and IBP [5, 10, 14]. The rest of the paper is organized as follows. In Section 2, we review the popular Dirichlet multinomial allocation model and the Dirichlet Process (DP) partition distribution. We propose an extension of these two models to decomposable graphical models. In Section 3 we discuss nonparametric latent feature models, reviewing briefly the construction in [5] and extending it to decomposable graphs. We demonstrate these models in Section 4 on two applications: an alternative to the hierarchical DP model [12] and a time-varying matrix factorization problem. 2 Prior distributions for partitions on decomposable graphs Assume we have n observations. When performing clustering, we associate to each of this observation an allocation variable zi ∈ [K] = {1, . . . , K}. Let Πn be the partition of [n] = {1, . . . , n} defined by the equivalence relation i ↔ j ⇔ zi = zj . The resulting partition Πn = {A1 , . . . , An(Πn ) } 1 is an unordered collection of disjoint non-empty subsets Aj of [n], j = 1, . . . , n(Πn ), where ∪j Aj = [n] and n(Πn ) is the number of subsets for partition Πn . We also denote by Pn be the set of all partitions of [n] and let nj , j = 1, . . . , n(Πn ), be the size of the subset Aj . Each allocation variable zi is associated to a vertex/site of an undirected graph G, which is assumed to be known. In the standard case where the graph G is complete, we first review briefly here two popular prior distributions on z1:n , equivalently on Πn . 
We then extend these models to undirected decomposable graphs; see [2, 8] for an introduction to decomposable graphs. Finally we briefly discuss the directed case. Note that the models proposed here are completely different from the hyper multinomial-Dirichlet in [2] and its recent DP extension [6]. 2.1 Dirichlet multinomial allocation model and DP partition distribution Assume for the time being that K is finite. When the graph is complete, a popular choice for the allocation variables is to consider a Dirichlet multinomial allocation model [11] θ θ , . . . , ), zi |π ∼ π (1) K K where D is the standard Dirichlet distribution and θ > 0. Integrating out π, we obtain the following Dirichlet multinomial prior distribution π ∼ D( Pr(z1:n ) = K j=1 Γ(θ) Γ(nj + θ K) (2) θ Γ(θ + n)Γ( K )K and then, using the straightforward equality Pr(Πn ) = PK where PK = {Πn ∈ Pn |n(Πn ) ≤ K}, we obtain K! (K−n(Πn ))! Pr(z1:n ) valid for for all Πn ∈ n(Π ) Pr(Πn ) = θ Γ(θ) j=1n Γ(nj + K ) K! . θ (K − n(Πn ))! Γ(θ + n)Γ( K )n(Πn ) (3) DP may be seen as a generalization of the Dirichlet multinomial model when the number of components K → ∞; see for example [10]. In this case the distribution over the partition Πn of [n] is given by [11] n(Π ) θn(Πn ) j=1n Γ(nj ) . (4) Pr(Πn ) = n i=1 (θ + i − 1) Let Π−k = {A1,−k , . . . , An(Π−k ),−k } be the partition induced by removing item k to Πn and nj,−k be the size of cluster j for j = 1, . . . , n(Π−k ). It follows from (4) that an item k is assigned to an existing cluster j, j = 1, . . . , n(Π−k ), with probability proportional to nj,−k / (n − 1 + θ) and forms a new cluster with probability θ/ (n − 1 + θ). This property is the basis of the CRP. We now extend the Dirichlet multinomial allocation and the DP partition distribution models to decomposable graphs. 2.2 Markov combination of Dirichlet multinomial and DP partition distributions Let G be a decomposable undirected graph, C = {C1 , . . . , Cp } a perfect ordering of the cliques and S = {S2 , . . . , Cp } the associated separators. It can be easily checked that if the marginal distribution of zC for each clique C ∈ C is defined by (2) then these distributions are consistent as they yield the same distribution (2) over the separators. Therefore, the unique Markov distribution over G with Dirichlet multinomial distribution over the cliques is defined by [8] Pr(zC ) S∈S Pr(zS ) C∈C Pr(z1:n ) = (5) where for each complete set B ⊆ G, we have Pr(zB ) given by (2). It follows that we have for any Πn ∈ PK Γ(θ) K! Pr(Πn ) = (K − n(Πn ))! C∈C Γ(θ) S∈S 2 K j=1 θ Γ(nj,C + K ) θ Γ(θ+nC )Γ( K )K K j=1 θ Γ(nj,S + K ) θ Γ(θ+nS )Γ( K )K (6) where for each complete set B ⊆ G, nj,B is the number of items associated to cluster j, j = 1, . . . , K in B and nB is the total number of items in B. Within each complete set B, the allocation variables define a partition distributed according to the Dirichlet-multinomial distribution. We now extend this approach to DP partition distributions; that is we derive a joint distribution over Πn such that the distribution of ΠB over each complete set B of the graph is given by (4) with θ > 0. Such a distribution satisfies the consistency condition over the separators as the restriction of any partition distributed according to (4) still follows (4) [7]. G Proposition. Let Pn be the set of partitions Πn ∈ Pn such that for each decomposition A, B, and any (i, j) ∈ A × B, i ↔ j ⇒ ∃k ∈ A ∩ B such that k ↔ i ↔ j. 
As K → ∞, the prior distribution G over partitions (6) is given for each Πn ∈ Pn by Pr(Πn ) = θn(Πn ) n(ΠC ) Γ(nj,C ) j=1 nC i=1 (θ+i−1) n(ΠS ) Γ(nj,S ) j=1 nS (θ+i−1) i=1 C∈C S∈S (7) where n(ΠB ) is the number of clusters in the complete set B. Proof. From (6), we have θ n(ΠC ) K(K − 1) . . . (K − n(Πn ) + 1) Pr(Πn ) = K C∈C n(ΠC )− S∈S n(ΠS ) C∈C θ n(ΠS ) S∈S n(ΠC ) θ Γ(nj,C + K ) j=1 nC (θ+i−1) i=1 n(ΠS ) θ Γ(nj,S + K ) j=1 nS (θ+i−1) i=1 Thus when K → ∞, we obtain (7) if n(Πn ) = C∈C n(ΠC ) − S∈S n(ΠS ) and 0 otherwise. We have n(Πn ) ≤ C∈C n(ΠC ) − S∈S n(ΠS ) for any Πn ∈ Pn and the subset of Pn verifying G n(Πn ) = C∈C n(ΠC ) − S∈S n(ΠS ) corresponds to the set Pn . Example. Let the notation i ∼ j (resp. i j) indicates an edge (resp. no edge) between two sites. Let n = 3 and G be the decomposable graph defined by the relations 1 ∼ 2, 2 ∼ 3 and 1 3. G The set P3 is then equal to {{{1, 2, 3}}; {{1, 2}, {3}}; {{1}, {2, 3}}; {{1}, {2}, {3}}}. Note that G the partition {{1, 3}, {2}} does not belong to P3 . Indeed, as there is no edge between 1 and 3, they cannot be in the same cluster if 2 is in another cluster. The cliques are C1 = {1, 2} and C2 = {2, 3} Pr(ΠC1 ) Pr(ΠC2 ) hence we can and the separator is S2 = {2}. The distribution is given by Pr(Π3 ) = Pr(ΠS ) 2 check that we obtain Pr({1, 2, 3}) = (θ + 1)−2 , Pr({1, 2}, {3}) = Pr({1, 2}, {3}) = θ(θ + 1)−2 and Pr({1}, {2}, {3}) = θ2 (θ + 1)−2 . Let now define the full conditional distributions. Based on (7) the conditional assignment of an item k is proportional to the conditional over the cliques divided by the conditional over the separators. G Let denote G−k the undirected graph obtained by removing vertex k from G. Suppose that Πn ∈ Pn . G−k If Π−k ∈ Pn−1 , then do not change the value of item k. Otherwise, item k is assigned to cluster j / where j = 1, . . . , n(Π−k ) with probability proportional to {C∈C|n−k,j,C >0} n−k,j,C {S∈S|n−k,j,S >0} n−k,j,S (8) and to a new cluster with probability proportional to θ, where n−k,j,C is the number of items in the set C \ {k} belonging to cluster j. The updating process is illustrated by the Chinese wedding party process1 in Fig. 1. The results of this section can be extended to the Pitman-Yor process, and more generally to species sampling models. Example (continuing). Given Π−2 = {A1 = {1}, A2 = {3}}, we have −1 Pr( item 2 assigned to A1 = {1}| Π−2 ) = Pr( item 2 assigned to A2 = {3}| Π−2 ) = (θ + 2) −1 and Pr( item 2 assigned to new cluster A3 | Π−2 ) = θ (θ + 2) . Given Π−2 = {A1 = {1, 3}}, item 2 is assigned to A1 with probability 1. 1 Note that this representation describes the full conditionals while the CRP represents the sequential updat- ing. 3 (a) (b) (d) (c) (e) Figure 1: Chinese wedding party. Consider a group of n guests attending a wedding party. Each of the n guests may belong to one or several cliques, i.e. maximal groups of people such that everybody knows everybody. The belonging of each guest to the different cliques is represented by color patches on the figures, and the graphical representation of the relationship between the guests is represented by the graphical model (e). (a) Suppose that the guests are already seated such that two guests cannot be together at the same table is they are not part of the same clique, or if there does not exist a group of other guests such that they are related (“Any friend of yours is a friend of mine”). 
(b) The guest number k leaves his table and either (c) joins a table where there are guests from the same clique as him, with probability proportional to the product of the number of guests from each clique over the product of the number of guests belonging to several cliques on that table or (d) he joins a new table with probability proportional to θ. 2.3 Monte Carlo inference 2.3.1 MCMC algorithm Using the full conditionals, a single site Gibbs sampler can easily be designed to approximate the posterior distribution Pr(Πn |z1:n ). Given a partition Πn , an item k is taken out of the partition. If G−k Π−k ∈ Pn−1 , item k keeps the same value. Otherwise, the item will be assigned to a cluster j, / j = 1, . . . , n(Π−k ), with probability proportional to p(z{k}∪Aj,−k ) × p(zAj,−k ) {C∈C|n−k,j,C >0} n−k,j,C {S∈S|n−k,j,S >0} n−k,j,S (9) and the item will be assigned to a new cluster with probability proportional to p(z{k} ) × θ. Similarly to [3], we can also define a procedure to sample from p(θ|n(Πn ) = k)). We assume that θ ∼ G(a, b) and use p auxiliary variables x1 , . . . , xp . The procedure is as follows. • For j = 1, . . . , p, sample xj |k, θ ∼ Beta(θ + nSj , nCj − nSj ) • Sample θ|k, x1:p ∼ G(a + k, b − j log xj ) 2.3.2 Sequential Monte Carlo We have so far only treated the case of an undirected decomposable graph G. We can formulate a sequential updating rule for the corresponding perfect directed version D of G. Indeed, let (a1 , . . . a|V | ) be a perfect ordering and pa(ak ) be the set of parents of ak which is by definition complete. Let Πk−1 = {A1,k−1 , . . . , An(Πk−1 ),k−1 } denote the partition of the first k−1 vertices a1:k−1 and let nj,pa(ak ) be the number of elements with value j in the set pa(ak ), j = 1, . . . , n(Πk−1 ). Then the vertex ak joins the set j with probability nj,pa(ak ) / θ + cluster with probability θ/ θ + q q nq,pa(ak ) and creates a new nq,pa(ak ) . One can then design a particle filter/SMC method in a similar fashion as [4]. Consider a set of (i) (i) (i) (i) N N particles Πk−1 with weights wk−1 ∝ Pr(Πk−1 , z1:k−1 ) ( i=1 wk−1 = 1) that approximate (i) the posterior distribution Pr(Πk−1 |z1:k−1 ). For each particle i, there are n(Πk−1 ) + 1 possible 4 (i,j) allocations for component ak . We denote Πk the partition obtained by associating component ak (i,j) to cluster j. The weight associated to Πk is given by  nj,pa(ak ) (i)  if j = 1, . . . , n(Πk−1 ) θ+ q nq,pa(ak ) (i,j) (i) p(z{ak }∪Aj,k−1 ) wk−1 = wk−1 × (10) (i) θ  θ+ n p(zAj,k−1 ) if j = n(Πk−1 ) + 1 q q,pa(ak ) (i,j) Then we can perform a deterministic resampling step by keeping the N particles Πk with highest (i,j) (i) (i) weights wk−1 . Let Πk be the resampled particles and wk the associated normalized weights. 3 Prior distributions for infinite binary matrices on decomposable graphs Assume we have n objects; each of these objects being associated to the vertex of a graph G. To K each object is associated a K-dimensional binary vector zn = (zn,1 , . . . , zn,K ) ∈ {0, 1} where zn,i = 1 if object n possesses feature i and zn,i = 0 otherwise. These vectors zt form a binary n × K matrix denoted Z1:n . We denote by ξ1:n the associated equivalence class of left-ordered matrices and let EK be the set of left-ordered matrices with at most K features. In the standard case where the graph G is complete, we review briefly here two popular prior distributions on Z1:n , equivalently on ξ1:n : the Beta-Bernoulli model and the IBP [5]. We then extend these models to undirected decomposable graphs. 
This can be used for example to define a time-varying IBP as illustrated in Section 4. 3.1 Beta-Bernoulli and IBP distributions The Beta-Bernoulli distribution over the allocation Z1:n is K Pr(Z1:n ) = α + K )Γ(n − nj + 1) α Γ(n + 1 + K ) α K Γ(nj j=1 (11) where nj is the number of objects having feature j. It follows that Pr(ξ1:n ) = K K! 2n −1 h=0 α K Γ(nj α + K )Γ(n − nj + 1) α Γ(n + 1 + K ) Kh ! j=1 (12) where Kh is the number of features possessing the history h (see [5] for details). The nonparametric model is obtained by taking the limit when K → ∞ Pr(ξ1:n ) = αK K+ + 2n −1 h=1 Kh ! exp(−αHn ) where K + is the total number of features and Hn = 3.2 (n − nj )!(nj − 1)! n! j=1 n 1 k=1 k . (13) The IBP follows from (13). Markov combination of Beta-Bernoulli and IBP distributions Let G be a decomposable undirected graph, C = {C1 , . . . , Cp } a perfect ordering of the cliques and S = {S2 , . . . , Cp } the associated separators. As in the Dirichlet-multinomial case, it is easily seen that if for each clique C ∈ C, the marginal distribution is defined by (11), then these distributions are consistent as they yield the same distribution (11) over the separators. Therefore, the unique Markov distribution over G with Beta-Bernoulli distribution over the cliques is defined by [8] Pr(ZC ) S∈S Pr(ZS ) C∈C Pr(Z1:n ) = (14) where Pr(ZB ) given by (11) for each complete set B ⊆ G. The prior over ξ1:n is thus given, for ξ1:n ∈ EK , by Pr(ξ1:n ) = K! 2n −1 h=0 Kh ! α K α Γ(nj,C + K )Γ(nC −nj,C +1) α Γ(nC +1+ K ) α α Γ(nj,S + K )Γ(nS −nj,S +1) K K α j=1 Γ(nS +1+ K ) K j=1 C∈C S∈S 5 (15) where for each complete set B ⊆ G, nj,B is the number of items having feature j, j = 1, . . . , K in the set B and nB is the whole set of objects in set B. Taking the limit when K → ∞, we obtain after a few calculations Pr(ξ1:n ) = α + K[n] exp [−α ( C HnC − 2n −1 h=1 Kh ! HnS )] × C∈C + KC (nC −nj,C )!(nj,C −1)! j=1 nC ! S∈S S + KS (nS −nj,S )!(nj,S −1)! j=1 nS ! + + + + if K[n] = C KC − S KS and 0 otherwise, where KB is the number of different features possessed by objects in B. G Let En be the subset of En such that for each decomposition A, B and any (u, v) ∈ A × B: {u and v possess feature j} ⇒ ∃k ∈ A ∩ B such that {k possesses feature j}. Let ξ−k be the left-ordered + matrix obtained by removing object k from ξn and K−k be the total number of different features in G−k + ξ−k . For each feature j = 1, . . . , K−k , if ξ−k ∈ En−1 then we have   b C∈C nj,C if i = 1 S∈C nj,S Pr(ξk,j = i) = (16)  b C∈C (nC −nj,C ) if i = 0 (nS −nj,S ) S∈C nS where b is the appropriate normalizing constant then the customer k tries Poisson α {S∈S|k∈S} nC {C∈C|k∈C} new dishes. We can easily generalize this construction to a directed version D of G using arguments similar to those presented in Section 2; see Section 4 for an application to time-varying matrix factorization. 4 4.1 Applications Sharing clusters among relative groups: An alternative to HDP Consider that we are given d groups with nj data yi,j in each group, i = 1, . . . , nj , j = 1, . . . , d. We consider latent cluster variables zi,j that define the partition of the data. We will use alternatively the notation θi,j = Uzi,j in the following. Hierarchical Dirichlet Process [12] (HDP) is a very popular model for sharing clusters among related groups. It is based on a hierarchy of DPs G0 ∼ DP (γ, H), Gj |G0 ∼ DP (α, G0 ) j = 1, . . . d θi,j |Gj ∼ Gj , yi,j |θi,j ∼ f (θi,j ) i = 1, . . . , nj . 
Under conjugacy assumptions, G0 , Gj and U can be integrated out and we can approximate the marginal posterior of (zi,j ) given y = (yi,j ) with Gibbs sampling using the Chinese restaurant franchise to sample from the full conditional p(zi,j |z−{i,j} , y). Using the graph formulation defined in Section 2, we propose an alternative to HDP. Let θ0,1 , . . . , θ0,N be N auxiliary variables belonging to what we call group 0. We define each clique Cj (j = 1, . . . , d) to be composed of elements from group j and elements from group 0. This defines a decomposable graphical model whose separator is given by the elements of group 0. We can rewrite the model in a way quite similar to HDP G0 ∼ DP (α, H), θ0,i |G0 ∼ G0 i = 1, ..., N α α Gj |θ0,1 , . . . , θ0,N ∼ DP (α + N, α+N H + α+N θi,j |Gj ∼ Gj , yi,j |θi,j ∼ f (θi,j ) i = 1, . . . , nj N i=1 δθ0,i ) j = 1, . . . d, N For any subset A and j = k ∈ {1, . . . , p} we have corr(Gj (A), Gk (A)) = α+N . Again, under conjugacy conditions, we can integrate out G0 , Gj and U and approximate the marginal posterior distribution over the partition using the Chinese wedding party process defined in Section 2. Note that for latent variables zi,j , j = 1, . . . , d, associated to data, this is the usual CRP update. As in HDP, multiple layers can be added to the model. Figures 2 (a) and (b) resp. give the graphical DP alternative to HDP and 2-layer HDP. 6 z0 root z0 root corpora docs z1 z2 z1 z2 z3 z1,1 z1,2 z2,1 z2,2 z2,3 docs (a) Graphical DP alternative to HDP (b) Graphical DP alternative to 2-layer HDP Figure 2: Hierarchical Graphs of dependency with (a) one layer and (b) two layers of hierarchy. If N = 0, then Gj ∼ DP (α, H) for all j and this is equivalent to setting γ → ∞ in HDP. If N → ∞ then Gj = G0 for all j, G0 ∼ DP (α, H). This is equivalent to setting α → ∞ in the HDP. One interesting feature of the model is that, contrary to HDP, the marginal distribution of Gj at any layer of the tree is DP (α, H). As a consequence, the total number of clusters scales logarithmically (as in the usual DP) with the size of each group, whereas it scales doubly logarithmically in HDP. Contrary to HDP, there are at most N clusters shared between different groups. Our model is in that sense reminiscent of [9] where only a limited number of clusters can be shared. Note however that contrary to [9] we have a simple CRP-like process. The proposed methodology can be straightforwardly extended to the infinite HMM [12]. The main issue of the proposed model is the setting of the number N of auxiliary parameters. Another issue is that to achieve high correlation, we need a large number of auxiliary variables. Nonetheless, the computational time used to sample from auxiliary variables is negligible compared to the time used for latent variables associated to data. Moreover, it can be easily parallelized. The model proposed offers a far richer framework and ensures that at each level of the tree, the marginal distribution of the partition is given by a DP partition model. 4.2 Time-varying matrix factorization Let X1:n be an observed matrix of dimension n × D. We want to find a representation of this matrix in terms of two latent matrices Z1:n of dimension n × K and Y of dimension K × D. Here Z1:n 2 is a binary matrix whereas Y is a matrix of latent features. 
By assuming that Y ∼ N 0, σY IK×D and 2 X1:n = Z1:n Y + σX εn where εn ∼ N 0, σX In×D , we obtain p(X1:n |Z1:n ) ∝ −D/2 2 2 + Z+T Z+ + σX /σY IKn 1:n 1:n + (n−Kn )D σX exp − + Kn D σY 2 2 + where Σ−1 = I − Z+ Z+T Z+ + σX /σY IKn n 1:n 1:n 1:n −1 1 T −1 2 tr X1:n Σn X1:n 2σX (17) + Z+T , Kn the number of non-zero columns of 1:n + Z1:n and Z+ is the first Kn columns of Z1:n . To avoid having to set K, [5, 14] assume that Z1:n 1:n follows an IBP. The resulting posterior distribution p(Z1:n |X1:n ) can be estimated through MCMC [5] or SMC [14]. We consider here a different model where the object Xt is assumed to arrive at time index t and we want a prior distribution on Z1:n ensuring that objects close in time are more likely to possess similar features. To achieve this, we consider the simple directed graphical model D of Fig. 3 where the site numbering corresponds to a time index in that case and a perfect numbering of D is (1, 2, . . .). The set of parents pa(t) is composed of the r preceding sites {{t − r}, . . . , {t − 1}}. The time-varying IBP to sample from p(Z1:n ) associated to this directed graph follows from (16) and proceeds as follows. At time t = 1 + new new • Sample K1 ∼Poisson(α), set z1,i = 1 for i = 1, ..., K1 and set K1 = Knew . At times t = 2, . . . , r n + new ∼Poisson( α ). • For k = 1, . . . Kt , sample zt,k ∼ Ber( 1:t−1,k ) and Kt t t 7   ?  ? - t−r - t−r+1 - . . . - t−1 - t - t+1        6 6 Figure 3: Directed graph. At times t = r + 1, . . . , n n + α new ∼Poisson( r+1 ). • For k = 1, . . . Kt , sample zt,k ∼ Ber( t−r:t−1,k ) and Kt r+1 + Here Kt is the total number of features appearing from time max(1, t − r) to t − 1 and nt−r:t−1,k the restriction of n1:t−1 to the r last customers. Using (17) and the prior distribution of Z1:n which can be sampled using the time-varying IBP described above, we can easily design an SMC method to sample from p(Z1:n |X1:n ). We do not detail it here. Note that contrary to [14], our algorithm does not require inverting a matrix whose dimension grows linearly with the size of the data but only a matrix of dimension r × r. In order to illustrate the model and SMC algorithm, we create 200 6 × 6 images using a ground truth Y consisting of 4 different 6 × 6 latent images. The 200 × 4 binary matrix was generated from Pr(zt,k = 1) = πt,k , where πt = ( .6 .5 0 0 ) if t = 1, . . . , 30, πt = ( .4 .8 .4 0 ) if t = 31, . . . , 50 and πt = ( 0 .3 .6 .6 ) if t = 51, . . . , 200. The order of the model is set to r = 50. The feature occurences Z1:n and true features Y and their estimates are represented in Figure 4. Two spurious features are detected by the model (features 2 and 5 on Fig. 3(c)) but quickly discarded (Fig. 4(d)). The algorithm is able to correctly estimate the varying prior occurences of the features over time. Feature1 Feature2 Feature1 Feature2 Feature3 20 20 40 40 60 60 Feature4 80 100 Feature4 Feature5 Feature6 Time Feature3 Time 80 100 120 120 140 140 160 160 180 200 180 1 2 3 200 4 Feature (a) 1 2 3 4 5 6 Feature (b) (c) (d) Figure 4: (a) True features, (b) True features occurences, (c) MAP estimate ZM AP and (d) associated E[Y|ZM AP ] t=20 t=50 t=20 t=50 t=100 t=200 t=100 t=200 (a) (b) Figure 5: (a) E[Xt |πt , Y] and (b) E[Xt |X1:t−1 ] at t = 20, 50, 100, 200. 5 Related work and Discussion The fixed-lag version of the time-varying DP of Caron et al. [1] is a special case of the proposed model when G is given by Fig. 3. The bivariate DP of Walker and Muliere [13] is also a special case when G has only two cliques. 
In this paper, we have assumed that the structure of the graph was known beforehand, and we have shown that many flexible models arise from this framework. It would be interesting in the future to investigate the case where the graphical structure is unknown and must be estimated from the data. Acknowledgment The authors thank the reviewers for their comments, which helped to improve the writing of the paper. References [1] F. Caron, M. Davy, and A. Doucet. Generalized Polya urn for time-varying Dirichlet process mixtures. In Uncertainty in Artificial Intelligence, 2007. [2] A.P. Dawid and S.L. Lauritzen. Hyper Markov laws in the statistical analysis of decomposable graphical models. The Annals of Statistics, 21:1272–1317, 1993. [3] M.D. Escobar and M. West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90:577–588, 1995. [4] P. Fearnhead. Particle filters for mixture models with an unknown number of components. Statistics and Computing, 14:11–21, 2004. [5] T.L. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems, 2006. [6] D. Heinz. Building hyper Dirichlet processes for graphical models. Electronic Journal of Statistics, 3:290–315, 2009. [7] J.F.C. Kingman. Random partitions in population genetics. Proceedings of the Royal Society of London, 361:1–20, 1978. [8] S.L. Lauritzen. Graphical Models. Oxford University Press, 1996. [9] P. Müller, F. Quintana, and G. Rosner. A method for combining inference across related nonparametric Bayesian models. Journal of the Royal Statistical Society B, 66:735–749, 2004. [10] R.M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9:249–265, 2000. [11] J. Pitman. Exchangeable and partially exchangeable random partitions. Probability Theory and Related Fields, 102:145–158, 1995. [12] Y.W. Teh, M.I. Jordan, M.J. Beal, and D.M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101:1566–1581, 2006. [13] S. Walker and P. Muliere. A bivariate Dirichlet process. Statistics and Probability Letters, 64:1–7, 2003. [14] F. Wood and T.L. Griffiths. Particle filtering for nonparametric Bayesian matrix factorization. In Advances in Neural Information Processing Systems, 2007.
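Before the remaining related-paper entries, a small Pólya-urn simulation may help make the auxiliary-variable construction from the earlier part of this excerpt concrete: group 0 contributes N shared atoms drawn from G0 ∼ DP(α, H), and each group j then draws from DP(α + N, (αH + Σi δθ0,i)/(α + N)). The sketch below is ours, not the authors' code; base_sampler is a hypothetical callable that draws a fresh atom from H, and the inference side (the Chinese wedding party process) is not shown.

```python
import numpy as np

def sample_groups(n_per_group, alpha, N, base_sampler, seed=0):
    """Pólya-urn simulation of the auxiliary-variable DP construction."""
    rng = np.random.default_rng(seed)

    # Group 0: N auxiliary variables theta_{0,1..N} from G0 ~ DP(alpha, H),
    # drawn with the usual Blackwell-MacQueen urn.
    aux = []
    for _ in range(N):
        if aux and rng.random() < len(aux) / (len(aux) + alpha):
            aux.append(aux[rng.integers(len(aux))])   # copy a previous draw
        else:
            aux.append(base_sampler(rng))             # fresh atom from H

    # Each group j: urn for DP(alpha + N, (alpha*H + sum_i delta_{theta_0,i}) / (alpha + N)).
    groups = []
    for n_j in n_per_group:
        draws = []
        for _ in range(n_j):
            u = rng.random() * (len(draws) + alpha + N)
            if u < len(draws):
                draws.append(draws[rng.integers(len(draws))])   # reuse a draw from this group
            elif u < len(draws) + N:
                draws.append(aux[rng.integers(N)])              # one of the N shared atoms
            else:
                draws.append(base_sampler(rng))                 # fresh atom from H
        groups.append(draws)
    return aux, groups
```

For example, sample_groups([50, 50], alpha=1.0, N=5, base_sampler=lambda rng: rng.normal()) can share at most the 5 auxiliary atoms between the two groups, matching the at-most-N-shared-clusters property noted above.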

5 0.4611685 246 nips-2009-Time-Varying Dynamic Bayesian Networks

Author: Le Song, Mladen Kolar, Eric P. Xing

Abstract: Directed graphical models such as Bayesian networks are a favored formalism for modeling the dependency structures in complex multivariate systems such as those encountered in biology and neural science. When a system is undergoing dynamic transformation, temporally rewiring networks are needed for capturing the dynamic causal influences between covariates. In this paper, we propose time-varying dynamic Bayesian networks (TV-DBN) for modeling the structurally varying directed dependency structures underlying non-stationary biological/neural time series. This is a challenging problem due to the non-stationarity and sample scarcity of time series data. We present a kernel-reweighted ℓ1-regularized auto-regressive procedure for this problem, which enjoys nice properties such as computational efficiency and provable asymptotic consistency. To our knowledge, this is the first practical and statistically sound method for structure learning of TV-DBNs. We applied TV-DBNs to time series measurements during the yeast cell cycle and brain response to visual stimuli. In both cases, TV-DBNs reveal interesting dynamics underlying the respective biological systems.
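The core estimator described in this abstract — a separate ℓ1-penalized autoregression per time point, with observations downweighted by their temporal distance — can be sketched in a few lines. The snippet below is our illustration, not the authors' code: the Gaussian kernel, the scikit-learn Lasso solver and all parameter names are assumptions, and the paper's exact weighting and normalization may differ.

```python
import numpy as np
from sklearn.linear_model import Lasso

def tv_dbn_edges(X, t_star, bandwidth, lam):
    """Kernel-reweighted l1-regularized autoregression localized around time t_star.
    X: (T, p) multivariate time series.  Returns a (p, p) matrix whose (j, i) entry
    estimates the influence of variable i at time t-1 on variable j at time t."""
    T, p = X.shape
    centers = np.arange(1, T)                                  # time points that have a predecessor
    w = np.exp(-0.5 * ((centers - t_star) / bandwidth) ** 2)   # Gaussian kernel weights
    sw = np.sqrt(w / w.sum())[:, None]
    X_past, X_now = X[:-1] * sw, X[1:] * sw                    # reweighted regression pairs
    A = np.zeros((p, p))
    for j in range(p):                                         # one lasso per target variable
        A[j] = Lasso(alpha=lam, fit_intercept=False, max_iter=5000).fit(X_past, X_now[:, j]).coef_
    return A
```

Sweeping t_star over 1, . . . , T and thresholding the resulting coefficient matrices yields a sequence of directed graphs, i.e. a temporally rewiring network of the kind the abstract describes.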

6 0.39795172 75 nips-2009-Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models

7 0.39364851 197 nips-2009-Randomized Pruning: Efficiently Calculating Expectations in Large Dynamic Programs

8 0.39252049 9 nips-2009-A Game-Theoretic Approach to Hypergraph Clustering

9 0.38078466 155 nips-2009-Modelling Relational Data using Bayesian Clustered Tensor Factorization

10 0.37151048 243 nips-2009-The Ordered Residual Kernel for Robust Motion Subspace Clustering

11 0.35886684 203 nips-2009-Replacing supervised classification learning by Slow Feature Analysis in spiking neural networks

12 0.35010159 182 nips-2009-Optimal Scoring for Unsupervised Learning

13 0.34886727 129 nips-2009-Learning a Small Mixture of Trees

14 0.34751803 150 nips-2009-Maximum likelihood trajectories for continuous-time Markov chains

15 0.34192866 62 nips-2009-Correlation Coefficients are Insufficient for Analyzing Spike Count Dependencies

16 0.34011036 69 nips-2009-Discrete MDL Predicts in Total Variation

17 0.33291054 153 nips-2009-Modeling Social Annotation Data with Content Relevance using a Topic Model

18 0.33120406 234 nips-2009-Streaming k-means approximation

19 0.32973379 42 nips-2009-Bayesian Sparse Factor Models and DAGs Inference and Comparison

20 0.32358378 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(7, 0.014), (21, 0.039), (24, 0.04), (25, 0.06), (35, 0.049), (36, 0.102), (39, 0.073), (58, 0.048), (71, 0.065), (81, 0.014), (86, 0.043), (89, 0.352), (91, 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.76889491 51 nips-2009-Clustering sequence sets for motif discovery

Author: Jong K. Kim, Seungjin Choi

Abstract: Most of existing methods for DNA motif discovery consider only a single set of sequences to find an over-represented motif. In contrast, we consider multiple sets of sequences where we group sets associated with the same motif into a cluster, assuming that each set involves a single motif. Clustering sets of sequences yields clusters of coherent motifs, improving signal-to-noise ratio or enabling us to identify multiple motifs. We present a probabilistic model for DNA motif discovery where we identify multiple motifs through searching for patterns which are shared across multiple sets of sequences. Our model infers cluster-indicating latent variables and learns motifs simultaneously, where these two tasks interact with each other. We show that our model can handle various motif discovery problems, depending on how to construct multiple sets of sequences. Experiments on three different problems for discovering DNA motifs emphasize the useful behavior and confirm the substantial gains over existing methods where only a single set of sequences is considered.

2 0.63612282 55 nips-2009-Compressed Least-Squares Regression

Author: Odalric Maillard, Rémi Munos

Abstract: We consider the problem of learning, from K data, a regression function in a linear space of high dimension N using projections onto a random subspace of lower dimension M . From any algorithm minimizing the (possibly penalized) empirical risk, we provide bounds on the excess risk of the estimate computed in the projected subspace (compressed domain) in terms of the excess risk of the estimate built in the high-dimensional space (initial domain). We show that solving the problem in the compressed domain instead of the initial domain reduces the estimation error at the price of an increased (but controlled) approximation error. We apply the analysis to Least-Squares (LS) regression and discuss the excess risk and numerical complexity of the resulting “Compressed Least Squares Re√ gression” (CLSR) in terms of N , K, and M . When we choose M = O( K), we √ show that CLSR has an estimation error of order O(log K/ K). 1 Problem setting We consider a regression problem where we observe data DK = ({xk , yk }k≤K ) (where xk ∈ X and yk ∈ R) are assumed to be independently and identically distributed (i.i.d.) from some distribution P , where xk ∼ PX and yk = f ∗ (xk ) + ηk (xk ), where f ∗ is the (unknown) target function, and ηk a centered independent noise of variance σ 2 (xk ). For a given class of functions F, and f ∈ F, we define the empirical (quadratic) error def LK (f ) = 1 K K [yk − f (xk )]2 , k=1 and the generalization (quadratic) error def L(f ) = E(X,Y )∼P [(Y − f (X))2 ]. Our goal is to return a regression function f ∈ F with lowest possible generalization error L(f ). Notations: In the sequel we will make use of the following notations about norms: for h : X → R, we write ||h||P for the L2 norm of h with respect to (w.r.t.) the measure P , ||h||PK for the L2 norm n 2 1/2 of h w.r.t. the empirical measure PK , and for u ∈ Rn , ||u|| denotes by default . i=1 ui The measurable function minimizing the generalization error is f ∗ , but it may be the case that f ∗ ∈ F. For any regression function f , we define the excess risk / L(f ) − L(f ∗ ) = ||f − f ∗ ||2 , P which decomposes as the sum of the estimation error L(f ) − inf f ∈F L(f ) and the approximation error inf f ∈F L(f ) − L(f ∗ ) = inf f ∈F ||f − f ∗ ||2 which measures the distance between f ∗ and the P function space F. 1 In this paper we consider a class of linear functions FN defined as the span of a set of N functions def def N {ϕn }1≤n≤N called features. Thus: FN = {fα = n=1 αn ϕn , α ∈ RN }. When the number of data K is larger than the number of features N , the ordinary Least-Squares Regression (LSR) provides the LS solution fα which is the minimizer of the empirical risk LK (f ) b 1 in FN . Note that here LK (fα ) rewrites K ||Φα − Y ||K where Φ is the K × N matrix with elements (ϕn (xk ))1≤n≤N,1≤k≤K and Y the K-vector with components (yk )1≤k≤K . Usual results provide bound on the estimation error as a function of the capacity of the function space and the number of data. In the case of linear approximation, the capacity measures (such as covering numbers [23] or the pseudo-dimension [16]) depend on the number of features (for example the pseudo-dimension is at most N + 1). For example, let fα be a LS estimate (minimizer of LK b in FN ), then (a more precise statement will be stated later in Subsection 3) the expected estimation error is bounded as: N log K E L(fα ) − inf L(f ) ≤ cσ2 , (1) b f ∈FN K def where c is a universal constant, σ = supx∈X σ(x), and the expectation is taken with respect to P . 
Now, the excess risk is the sum of this estimation error and the approximation error inf f ∈FN ||f − f ∗ ||P of the class FN . Since the later usually decreases when the number of features N increases [13] (e.g. when N FN is dense in L2 (P )), we see the usual tradeoff between small estimation error (low N ) and small approximation error (large N ). In this paper we are interested in the setting when N is large so that the approximation error is small. Whenever N is larger than K we face the overfitting problem since there are more parameters than actual data (more variables than constraints), which is illustrated in the bound (1) which provides no information about the generalization ability of any LS estimate. In addition, there are many minimizers (in fact a vector space of same dimension as the null space of ΦT Φ) of the empirical risk. To overcome the problem, several approaches have been proposed in the literature: • LS solution with minimal norm: The solution is the minimizer of the empirical error with minimal (l1 or l2 )-norm: α = arg minΦα=Y ||α||1 or 2 , (or a robust solution arg min||Φα−Y ||2 ≤ε ||α||1 ). The choice of 2 -norm yields the ordinary LS solution. The choice of 1 -norm has been used for generating sparse solutions (e.g. the Basis Pursuit [10]), and assuming that the target function admits a sparse decomposition, the field of Compressed Sensing [9, 21] provides sufficient conditions for recovering the exact solution. However, such conditions (e.g. that Φ possesses a Restricted Isometric Property (RIP)) does not hold in general in this regression setting. On another aspect, solving these problems (both for l1 or l2 -norm) when N is large is numerically expensive. • Regularization. The solution is the minimizer of the empirical error plus a penalty term, for example f = arg min LK (f ) + λ||f ||p , for p = 1 or 2. p f ∈FN where λ is a parameter and usual choices for the norm are 2 (ridge-regression [20]) and 1 (LASSO [19]). A close alternative is the Dantzig selector [8, 5] which solves: α = arg min||α||1 ≤λ ||ΦT (Y − Φα)||∞ . The numerical complexity and generalization bounds of those methods depend on the sparsity of the target function decomposition in FN . Now if we possess a sequence of function classes (FN )N ≥1 with increasing capacity, we may perform structural risk minimization [22] by solving in each model the empirical risk penalized by a term that depends on the size of the model: fN = arg minf ∈FN ,N ≥1 LK (f ) + pen(N, K), where the penalty term measures the capacity of the function space. In this paper we follow another approach where instead of searching in the large space FN (where N > K) for a solution that minimizes the empirical error plus a penalty term, we simply search for the empirical error minimizer in a (randomly generated) lower dimensional subspace GM ⊂ FN (where M < K). Our contribution: We consider a set of M random linear combinations of the initial N features and perform our favorite LS regression algorithm (possibly regularized) using those “compressed 2 features”. This is equivalent to projecting the K points {ϕ(xk ) ∈ RN , k = 1..K} from the initial domain (of size N ) onto a random subspace of dimension M , and then performing the regression in the “compressed domain” (i.e. span of the compressed features). This is made possible because random projections approximately preserve inner products between vectors (by a variant of the Johnson-Lindenstrauss Lemma stated in Proposition 1. 
Our main result is a bound on the excess risk of a linear estimator built in the compressed domain in terms of the excess risk of the linear estimator built in the initial domain (Section 2). We further detail the case of ordinary Least-Squares Regression (Section 3) and discuss, in terms of M , N , K, the different tradeoffs concerning the excess risk (reduced estimation error in the compressed domain versus increased approximation error introduced by the random projection) and the numerical complexity (reduced complexity of solving the LSR in the compressed domain versus the additional load of performing the projection). √ As a consequence, we show that by choosing M = O( K) projections we define a Compressed Least-Squares Regression which uses O(N K 3/2 ) elementary operations to compute a regression √ function with estimation error (relatively to the initial function space FN ) of order log K/ K up to a multiplicative factor which depends on the best approximation of f ∗ in FN . This is competitive with the best methods, up to our knowledge. Related works: Using dimension reduction and random projections in various learning areas has received considerable interest over the past few years. In [7], the authors use a SVM algorithm in a compressed space for the purpose of classification and show that their resulting algorithm has good generalization properties. In [25], the authors consider a notion of compressed linear regression. For data Y = Xβ + ε, where β is the target and ε a standard noise, they use compression of the set of data, thus considering AY = AXβ + Aε, where A has a Restricted Isometric Property. They provide an analysis of the LASSO estimator built from these compressed data, and discuss a property called sparsistency, i.e. the number of random projections needed to recover β (with high probability) when it is sparse. These works differ from our approach in the fact that we do not consider a compressed (input and/or output) data space but a compressed feature space instead. In [11], the authors discuss how compressed measurements may be useful to solve many detection, classification and estimation problems without having to reconstruct the signal ever. Interestingly, they make no assumption about the signal being sparse, like in our work. In [6, 17], the authors show how to map a kernel k(x, y) = ϕ(x) · ϕ(y) into a low-dimensional space, while still approximately preserving the inner products. Thus they build a low-dimensional feature space specific for (translation invariant) kernels. 2 Linear regression in the compressed domain We remind that the initial set of features is {ϕn : X → def N FN = {fα = n=1 αn ϕn , α ∈ components (ϕn (x))n≤N . Let us R, 1 ≤ n ≤ N } and the initial domain R } is the span of those features. We write ϕ(x) the N -vector of N now define the random projection. Let A be a M × N matrix of i.i.d. elements drawn for some distribution ρ. Examples of distributions are: • Gaussian random variables N (0, 1/M ), √ • ± Bernoulli distributions, i.e. which takes values ±1/ M with equal probability 1/2, • Distribution taking values ± 3/M with probability 1/6 and 0 with probability 2/3. The following result (proof in the supplementary material) states the property that inner-product are approximately preserved through random projections (this is a simple consequence of the JohnsonLindenstrauss Lemma): Proposition 1 Let (uk )1≤k≤K and v be vectors of RN . Let A be a M × N matrix of i.i.d. elements drawn from one of the previously defined distributions. 
For any ε > 0, δ > 0, for M ≥ ε2 1 ε3 log 4K , we have, with probability at least 1 − δ, for all k ≤ K, δ 4 − 6 |Auk · Av − uk · v| ≤ ε||uk || ||v||. 3 def We now introduce the set of M compressed features (ψm )1≤m≤M such that ψm (x) = N We also write ψ(x) the M -vector of components (ψm (x))m≤M . Thus n=1 Am,n ϕn (x). ψ(x) = Aϕ(x). We define the compressed domain GM = {gβ = m=1 βm ψm , β ∈ RM } the span of the compressed features (vector space of dimension at most M ). Note that each ψm ∈ FN , thus GM is a subspace of FN . def 2.1 M Approximation error We now compare the approximation error assessed in the compressed domain GM versus in the initial space FN . This applies to the linear algorithms mentioned in the introduction such as ordinary LS regression (analyzed in details in Section 3), but also its penalized versions, e.g. LASSO and ridge regression. Define α+ = arg minα∈RN L(fα ) − L(f ∗ ) the parameter of the best regression function in FN . Theorem 1 For any δ > 0, any M ≥ 15 log(8K/δ), let A be a random M × N matrix defined like in Proposition 1, and GM be the compressed domain resulting from this choice of A. Then with probability at least 1 − δ, inf ||g−f ∗ ||2 ≤ P g∈GM 8 log(8K/δ) + 2 ||α || M E ||ϕ(X)||2 +2 sup ||ϕ(x)||2 x∈X log 4/δ + inf ||f −f ∗ ||2 . P f ∈FN 2K (2) This theorem shows the tradeoff in terms of estimation and approximation errors for an estimator g obtained in the compressed domain compared to an estimator f obtained in the initial domain: • Bounds on the estimation error of g in GM are usually smaller than that of f in FN when M < N (since the capacity of FN is larger than that of GM ). • Theorem 1 says that the approximation error assessed in GM increases by at most O( log(K/δ) )||α+ ||2 E||ϕ(X)||2 compared to that in FN . M def def Proof: Let us write f + = fα+ = arg minf ∈FN ||f − f ∗ ||P and g + = gAα+ . The approximation error assessed in the compressed domain GM is bounded as inf ||g − f ∗ ||2 P g∈GM ≤ ||g + − f ∗ ||2 = ||g + − f + ||2 + ||f + − f ∗ ||2 , P P P (3) since f + is the orthogonal projection of f ∗ on FN and g + belongs to FN . We now bound ||g + − def def f + ||2 using concentration inequalities. Define Z(x) = Aα+ · Aϕ(x) − α+ · ϕ(x). Define ε2 = P log(8K/δ) 8 M log(8K/δ). For M ≥ 15 log(8K/δ) we have ε < 3/4 thus M ≥ ε2 /4−ε3 /6 . Proposition 1 applies and says that on an event E of probability at least 1 − δ/2, we have for all k ≤ K, def |Z(xk )| ≤ ε||α+ || ||ϕ(xk )|| ≤ ε||α+ || sup ||ϕ(x)|| = C (4) x∈X On the event E, we have with probability at least 1 − δ , ||g + − f + ||2 P = ≤ ≤ EX∼PX |Z(X)|2 ≤ ε2 ||α+ ||2 ε2 ||α+ ||2 1 K 1 K K |Z(xk )|2 + C 2 k=1 K ||ϕ(xk )||2 + sup ||ϕ(x)||2 x∈X k=1 E ||ϕ(X)||2 + 2 sup ||ϕ(x)||2 x∈X log(2/δ ) 2K log(2/δ ) 2K log(2/δ ) . 2K where we applied two times Chernoff-Hoeffding’s inequality. Combining with (3), unconditioning, and setting δ = δ/2 then with probability at least (1 − δ/2)(1 − δ ) ≥ 1 − δ we have (2). 4 2.2 Computational issues We now discuss the relative computational costs of a given algorithm applied either in the initial or in the compressed domain. Let us write Cx(DK , FN , P ) the complexity (e.g. number of elementary operations) of an algorithm A to compute the regression function f when provided with the data DK and function space FN . 
We plot in the table below, both for the initial and the compressed versions of the algorithm A, the order of complexity for (i) the cost for building the feature matrix, (ii) the cost for computing the estimator, (iii) the cost for making one prediction (i.e. computing f (x) for any x): Construction of the feature matrix Computing the regression function Making one prediction Initial domain NK Cx(DK , FN , P ) N Compressed domain N KM Cx(DK , GM , P ) NM Note that the values mentioned for the compressed domain are upper-bounds on the real complexity and do not take into account the possible sparsity of the projection matrix A (which would speed up matrix computations, see e.g. [2, 1]). 3 Compressed Least-Squares Regression We now analyze the specific case of Least-Squares Regression. 3.1 Excess risk of ordinary Least Squares regression In order to bound the estimation error, we follow the approach of [13] which truncates (up to the level ±L where L is a bound, assumed to be known, on ||f ∗ ||∞ ) the prediction of the LS regression function. The ordinary LS regression provides the regression function fα where b α= argmin α∈argminα ∈ RN ||α||. ||Y −Φα || Note that ΦΦT α = ΦT Y , hence α = Φ† Y ∈ RN where Φ† is the Penrose pseudo-inverse of Φ1 . def Then the truncated predictor is: fL (x) = TL [fα (x)], where b def TL (u) = u if |u| ≤ L, L sign(u) otherwise. Truncation after the computation of the parameter α ∈ RN , which is the solution of an unconstrained optimization problem, is easier than solving an optimization problem under the constraint that ||α|| is small (which is the approach followed in [23]) and allows for consistency results and prediction bounds. Indeed, the excess risk of fL is bounded as 1 + log K E(||f − f ∗ ||2 ) ≤ c max{σ2 , L2 } N + 8 inf ||f − f ∗ ||2 (5) P P f ∈FN K where a bound on c is 9216 (see [13]). We have a simpler bound when we consider the expectation EY conditionally on the input data: N EY (||f − f ∗ ||2 K ) ≤ σ2 + inf ||f − f ∗ ||2 K (6) P P K f ∈F Remark: Note that because we use the quadratic loss function, by following the analysis in [3], or by deriving tight bounds on the Rademacher complexity [14] and following Theorem 5.2 of Koltchinskii’s Saint Flour course, it is actually possible to state assumptions under which we can remove the log K term in (5). We will not further detail such bounds since our motivation here is not to provide the tightest possible bounds, but rather to show how the excess risk bound for LS regression in the initial domain extends to the compressed domain. 1 In the full rank case, Φ† = (ΦT Φ)−1 ΦT when K ≥ N and Φ† = ΦT (ΦΦT )−1 when K ≤ N 5 3.2 Compressed Least-Squares Regression (CLSR) CLSR is defined as the ordinary LSR in the compressed domain. Let β = Ψ† Y ∈ RM , where Ψ is the K × M matrix with elements (ψm (xk ))1≤m≤M,1≤k≤K . The CLSR estimate is defined as def gL (x) = TL [gβ (x)]. From Theorem 1, (5) and (6), we deduce the following excess risk bounds for b the CLSR estimate: √ ||α+ || E||ϕ(X)||2 K log(8K/δ) Corollary 1 For any δ > 0, set M = 8 max(σ,L) c (1+log K) . Then whenever M ≥ 15 log(8K/δ), with probability at least 1 − δ, the expected excess risk of the CLSR estimate is bounded as √ E(||gL − f ∗ ||2 ) ≤ 16 c max{σ, L}||α+ || E||ϕ(X)||2 P × 1+ supx ||ϕ(x)||2 E||ϕ(X)||2 (1 + log K) log(8K/δ) K log 4/δ + 8 inf ||f − f ∗ ||2 . P f ∈FN 2K (7) √ ||α+ || E||ϕ(X)||2 Now set M = 8K log(8K/δ). Assume N > K and that the features (ϕk )1≤k≤K σ are linearly independent. 
Then whenever M ≥ 15 log(8K/δ), with probability at least 1 − δ, the expected excess risk of the CLSR estimate conditionally on the input samples is upper bounded as 2 log(8K/δ) supx ||ϕ(x)||2 1+ K E||ϕ(X)||2 EY (||gL − f ∗ ||2 K ) ≤ 4σ||α+ || E||ϕ(X)||2 P log 4/δ . 2K Proof: Whenever M ≥ 15 log(8K/δ) we deduce from Theorem 1 and (5) that the excess risk of gL is bounded as E(||gL − f ∗ ||2 ) ≤ c max{σ2 , L2 } P +8 8 log(8K/δ) + 2 ||α || M 1 + log K M K E||ϕ(X)||2 + 2 sup ||ϕ(x)||2 x log 4/δ + inf ||f − f ∗ ||2 . P f ∈FN 2K By optimizing on M , we deduce (7). Similarly, using (6) we deduce the following bound on EY (||gL − f ∗ ||2 K ): P σ2 8 M + log(8K/δ)||α+ ||2 K M E||ϕ(X)||2 + 2 sup ||ϕ(x)||2 x log 4/δ + inf ||f − f ∗ ||2 K . P f ∈FN 2K By optimizing on M and noticing that inf f ∈FN ||f − f ∗ ||2 K = 0 whenever N > K and the features P (ϕk )1≤k≤K are linearly independent, we deduce the second result. Remark 1 Note that the second term in the parenthesis of (7) is negligible whenever K Thus we have the expected excess risk log K/δ + inf ||f − f ∗ ||2 . P f ∈FN K E(||gL − f ∗ ||2 ) = O ||α+ || E||ϕ(X)||2 √ P log 1/δ. (8) The choice of M in the previous corollary depends on ||α+ || and E||ϕ(X)|| which are a priori unknown (since f ∗ and PX are unknown). If we set M independently of ||α+ ||, then an additional multiplicative factor of ||α+ || appears in the bound, and if we replace E||ϕ(X)|| by its bound supx ||ϕ(x)|| (which is known) then this latter factor will appear instead of the former in the bound. Complexity of CLSR: The complexity of LSR for computing the regression function in the compressed domain only depends on M and K, and is (see e.g. [4]) Cx(DK , GM , P ) = O(M K 2 ) which √ is of order O(K 5/2 ) when we choose the optimized number of projections M = O( K). However the leading term when using CLSR is the cost for building the Ψ matrix: O(N K 3/2 ). 6 4 4.1 Discussion The factor ||α+ || E||ϕ(X)||2 In light of Corollary 1, the important factor which will determine whether the CLSR provides low generalization error or not is ||α+ || E||ϕ(X)||2 . This factor indicates that a good set of features (for CLSR) should be such that the norm of those features as well as the norm of the parameter α+ of the projection of f ∗ onto the span of those features should be small. A natural question is whether this product can be made small for appropriate choices of features. We now provide two specific cases for which this is actually the case: (1) when the features are rescaled orthonormal basis functions, and (2) when the features are specific wavelet functions. In both cases, we relate the bound to an assumption of regularity on the function f ∗ , and show that the dependency w.r.t. N decreases when the regularity increases, and may even vanish. Rescaled Orthonormal Features: Consider a set of orthonormal functions (ηi )i≥1 w.r.t a measure µ, i.e. ηi , ηj µ = δi,j . In addition we assume that the law of the input data is dominated by µ, i.e. PX ≤ Cµ where C is a constant. For instance, this is the case when the set X is compact, µ is the uniform measure and PX has bounded density. def We define the set of N features as: ϕi = ci ηi , where ci > 0, for i ∈ {1, . . . , N }. Then any f ∈ FN decomposes as f = 2 we have: ||α|| = ||α+ ||2 E||ϕ||2 ≤ C N bi 2 i=1 ( ci ) N bi 2 i=1 ( ci ) and N i=1 N bi i=1 ci ϕi , where N 2 2 i=1 ci X ηi (x)dPX (x) f, ηi ηi = E||ϕ|| = 2 def bi = f, ηi . Thus ≤ C N 2 i=1 ci . Thus N 2 i=1 ci . 
Now, linear approximation theory (Jackson-type theorems) tells us that assuming a function f ∗ ∈ L2 (µ) is smooth, it may be decomposed onto the span of the N first (ηi )i∈{1,...,N } functions with decreasing coefficients |bi | ≤ i−λ for some λ ≥ 0 that depends on the smoothness of f ∗ . For example the class of functions with bounded total variation may be decomposed with Fourier basis (in dimension 1) with coefficients |bi | ≤ ||f ||V /(2πi). Thus here λ = 1. Other classes (such as Sobolev spaces) lead to larger values of λ related to the order of differentiability. √ N By choosing ci = i−λ/2 , we have ||α+ || E||ϕ||2 ≤ C i=1 i−λ . Thus if λ > 1, then this term is bounded by a constant that does not depend on N . If λ = 1 then it is bounded by O(log N ), and if 0 < λ < 1, then it is bounded by O(N 1−λ ). However any orthonormal basis, even rescaled, would not necessarily yield a small ||α+ || E||ϕ||2 term (this is all the more true when the dimension of X is large). The desired property that the coefficients (α+ )i of the decomposition of f ∗ rapidly decrease to 0 indicates that hierarchical bases, such as wavelets, that would decompose the function at different scales, may be interesting. Wavelets: Consider an infinite family of wavelets in [0, 1]: (ϕ0 ) = (ϕ0 ) (indexed by n ≥ 1 or n h,l equivalently by the scale h ≥ 0 and translation 0 ≤ l ≤ 2h − 1) where ϕ0 (x) = 2h/2 ϕ0 (2h x − l) h,l and ϕ0 is the mother wavelet. Then consider N = 2H features (ϕh,l )1≤h≤H defined as the rescaled def wavelets ϕh,l = ch 2−h/2 ϕ0 , where ch > 0 are some coefficients. Assume the mother wavelet h,l is C p (for p ≥ 1), has at least p vanishing moments, and that for all h ≥ 0, supx l ϕ0 (2h x − l)2 ≤ 1. Then the following result (proof in the supplementary material) provides a bound on supx∈X ||ϕ(x)||2 (thus on E||ϕ(X)||2 ) by a constant independent of N : Proposition 2 Assume that f ∗ is (L, γ)-Lipschitz (i.e. for all v ∈ X there exists a polynomial pv of degree γ such that for all u ∈ X , |f (u) − pv (u)| ≤ L|u − v|γ ) with 1/2 < γ ≤ p. Then setting γ 1 ch = 2h(1−2γ)/4 , we have ||α+ || supx ||ϕ(x)|| ≤ L 1−22 |ϕ0 |, which is independent of N . 1/2−γ 0 Notice that the Haar walevets has p = 1 vanishing moment but is not C 1 , thus the Proposition does not apply directly. However direct computations show that if f ∗ is L-Lipschitz (i.e. γ = 1) then L 0 αh,l ≤ L2−3h/2−2 , and thus ||α+ || supx ||ϕ(x)|| ≤ 4(1−2−1/2 ) with ch = 2−h/4 . 7 4.2 Comparison with other methods In the case when the factor ||α+ || E||ϕ(X)||2 does not depend on N (such as in the previous example), the bound (8) on the excess risk of CLSR states that the estimation error (assessed in √ √ terms of FN ) of CLSR is O(log K/ K). It is clear that whenever N > K (which is the case of interest here), this is better than the ordinary LSR in the initial domain, whose estimation error is O(N log K/K). It is difficult to compare this result with LASSO (or the Dantzig selector that has similar properties [5]) for which an important aspect is to design sparse regression functions or to recover a solution assumed to be sparse. From [12, 15, 24] one deduces that under some assumptions, the estimation error of LASSO is of order S log N where S is the sparsity (number of non-zero coefficients) of the K√ best regressor f + in FN . If S < K then LASSO is more interesting than CLSR in terms of excess risk. 
Otherwise CLSR may be an interesting alternative although this method does not make any assumption about the sparsity of f + and its goal is not to recover a possible sparse f + but only to make good predictions. However, in some sense our method finds a sparse solution in the fact that the regression function gL lies in a space GM of small dimension M N and can thus be expressed using only M coefficients. Now in terms of numerical complexity, CLSR requires O(N K 3/2 ) operations to build the matrix and compute the regression function, whereas according to [18], the (heuristical) complexity of the LASSO algorithm is O(N K 2 ) in the best cases (assuming that the number of steps required for convergence is O(K), which is not proved theoretically). Thus CLSR seems to be a good and simple competitor to LASSO. 5 Conclusion We considered the case when the number of features N is larger than the number of data K. The result stated in Theorem 1 enables to analyze the excess risk of any linear regression algorithm (LS or its penalized versions) performed in the compressed domain GM versus in the initial space FN . In the compressed domain the estimation error is reduced but an additional (controlled) approximation error (when compared to the best regressor in FN ) comes into the picture. In the case of LS regression, when the term ||α+ || E||ϕ(X)||2 has a mild dependency on N , then by choosing a √ random subspace of dimension M = O( K), CLSR has an estimation error (assessed in terms of √ FN ) bounded by O(log K/ K) and has numerical complexity O(N K 3/2 ). In short, CLSR provides an alternative to usual penalization techniques where one first selects a random subspace of lower dimension and then performs an empirical risk minimizer in this subspace. Further work needs to be done to provide additional settings (when the space X is of dimension > 1) for which the term ||α+ || E||ϕ(X)||2 is small. Acknowledgements: The authors wish to thank Laurent Jacques for numerous comments and Alessandro Lazaric and Mohammad Ghavamzadeh for exciting discussions. This work has been supported by French National Research Agency (ANR) through COSINUS program (project EXPLO-RA, ANR-08-COSI-004). References [1] Dimitris Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671–687, June 2003. [2] Nir Ailon and Bernard Chazelle. Approximate nearest neighbors and the fast JohnsonLindenstrauss transform. In STOC ’06: Proceedings of the thirty-eighth annual ACM symposium on Theory of computing, pages 557–563, New York, NY, USA, 2006. ACM. [3] Jean-Yves Audibert and Olivier Catoni. Risk bounds in linear regression through pac-bayesian truncation. Technical Report HAL : hal-00360268, 2009. [4] David Bau III and Lloyd N. Trefethen. Numerical linear algebra. Philadelphia: Society for Industrial and Applied Mathematics, 1997. 8 [5] Peter J. Bickel, Ya’acov Ritov, and Alexandre B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. To appear in Annals of Statistics, 2008. [6] Avrim Blum. Random projection, margins, kernels, and feature-selection. Subspace, Latent Structure and Feature Selection, pages 52–68, 2006. [7] Robert Calderbank, Sina Jafarpour, and Robert Schapire. Compressed learning: Universal sparse dimensionality reduction and learning in the measurement domain. Technical Report, 2009. [8] Emmanuel Candes and Terence Tao. The Dantzig selector: Statistical estimation when p is much larger than n. 
Annals of Statistics, 35:2313, 2007. [9] Emmanuel J. Candes and Justin K. Romberg. Signal recovery from random projections. volume 5674, pages 76–86. SPIE, 2005. [10] S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20:33–61, 1998. [11] Mark A. Davenport, Michael B. Wakin, and Richard G. Baraniuk. Detection and estimation with compressive measurements. Technical Report TREE 0610, Department of Electrical and Computer Engineering, Rice University, 2006. [12] E. Greenshtein and Y. Ritov. Persistency in high dimensional linear predictor-selection and the virtue of over-parametrization. Bernoulli, 10:971–988, 2004. [13] L. Gy¨ rfi, M. Kohler, A. Krzy˙ ak, and H. Walk. A distribution-free theory of nonparametric o z regression. Springer-Verlag, 2002. [14] Sham M. Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Daphne Koller, Dale Schuurmans, Yoshua Bengio, and Leon Bottou, editors, Neural Information Processing Systems, pages 793– 800. MIT Press, 2008. [15] Yuval Nardi and Alessandro Rinaldo. On the asymptotic properties of the group Lasso estimator for linear models. Electron. J. Statist., 2:605–633, 2008. [16] D. Pollard. Convergence of Stochastic Processes. Springer Verlag, New York, 1984. [17] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. Neural Information Processing Systems, 2007. [18] Saharon Rosset and Ji Zhu. Piecewise linear regularized solution paths. Annals of Statistics, 35:1012, 2007. [19] Robert Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1994. [20] A. N. Tikhonov. Solution of incorrectly formulated problems and the regularization method. Soviet Math Dokl 4, pages 1035–1038, 1963. [21] Yaakov Tsaig and David L. Donoho. Compressed sensing. IEEE Trans. Inform. Theory, 52:1289–1306, 2006. [22] Vladimir N. Vapnik. The nature of statistical learning theory. Springer-Verlag New York, Inc., New York, NY, USA, 1995. [23] Tong Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2:527–550, 2002. [24] Tong Zhang. Some sharp performance bounds for least squares regression with L1 regularization. To appear in Annals of Statistics, 2009. [25] Shuheng Zhou, John D. Lafferty, and Larry A. Wasserman. Compressed regression. In John C. Platt, Daphne Koller, Yoram Singer, and Sam T. Roweis, editors, Neural Information Processing Systems. MIT Press, 2007. 9
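The procedure analysed in this excerpt — project the N-dimensional feature vectors onto M random directions, then run ordinary least squares in the compressed domain — is simple to write down. The sketch below is ours; the function names and defaults are assumptions, and no attempt is made to reproduce the paper's exact truncation or choice of M beyond the M = O(√K) guideline.

```python
import numpy as np

def clsr_fit(Phi, Y, M=None, seed=0):
    """Compressed least-squares regression on a K x N feature matrix Phi.
    Projects the features with a random M x N Gaussian matrix (entries N(0, 1/M)),
    then solves ordinary least squares in the M-dimensional compressed domain."""
    K, N = Phi.shape
    if M is None:
        M = max(1, int(np.sqrt(K)))                        # the M = O(sqrt(K)) guideline
    rng = np.random.default_rng(seed)
    A = rng.normal(0.0, 1.0 / np.sqrt(M), size=(M, N))     # random projection matrix
    beta, *_ = np.linalg.lstsq(Phi @ A.T, Y, rcond=None)   # LS fit on compressed features
    return A, beta

def clsr_predict(phi_x, A, beta, L=None):
    """Predict for one input with feature vector phi_x; optional truncation at level L."""
    g = (A @ phi_x) @ beta
    return float(np.clip(g, -L, L)) if L is not None else float(g)
```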

3 0.57848424 142 nips-2009-Locality-sensitive binary codes from shift-invariant kernels

Author: Maxim Raginsky, Svetlana Lazebnik

Abstract: This paper addresses the problem of designing binary codes for high-dimensional data such that vectors that are similar in the original space map to similar binary strings. We introduce a simple distribution-free encoding scheme based on random projections, such that the expected Hamming distance between the binary codes of two vectors is related to the value of a shift-invariant kernel (e.g., a Gaussian kernel) between the vectors. We present a full theoretical analysis of the convergence properties of the proposed scheme, and report favorable experimental performance as compared to a recent state-of-the-art method, spectral hashing.
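One natural way to realize the scheme sketched in this abstract, for a Gaussian kernel, is to binarize random Fourier features with random thresholds. The snippet below is a hedged illustration of that construction, not necessarily the paper's exact encoding; the frequency scaling, the thresholds and all names are our assumptions.

```python
import numpy as np

def shift_invariant_binary_codes(X, n_bits, gamma, seed=0):
    """Binary codes whose Hamming distances track a Gaussian kernel
    k(x, y) = exp(-gamma * ||x - y||^2).  One random Fourier feature plus one
    random threshold per bit; X has shape (n, d)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(0.0, np.sqrt(2.0 * gamma), size=(n_bits, d))  # frequencies from the kernel's spectrum
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_bits)               # random phases
    t = rng.uniform(-1.0, 1.0, size=n_bits)                      # random binarization thresholds
    return (np.cos(X @ W.T + b) + t > 0).astype(np.uint8)
```

Averaging the bitwise disagreements of two codes then gives an empirical Hamming distance whose expectation depends on the two points only through the kernel value, which is the property the abstract refers to.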

4 0.44659436 135 nips-2009-Learning to Hash with Binary Reconstructive Embeddings

Author: Brian Kulis, Trevor Darrell

Abstract: Fast retrieval methods are increasingly critical for many large-scale analysis tasks, and there have been several recent methods that attempt to learn hash functions for fast and accurate nearest neighbor searches. In this paper, we develop an algorithm for learning hash functions based on explicitly minimizing the reconstruction error between the original distances and the Hamming distances of the corresponding binary embeddings. We develop a scalable coordinate-descent algorithm for our proposed hashing objective that is able to efficiently learn hash functions in a variety of settings. Unlike existing methods such as semantic hashing and spectral hashing, our method is easily kernelized and does not require restrictive assumptions about the underlying distribution of the data. We present results over several domains to demonstrate that our method outperforms existing state-of-the-art techniques.
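As a rough illustration of the objective this abstract describes — squared error between original distances and Hamming distances of the learned codes — here is a simplified evaluation routine. The linear hash parameterization, the distance scaling and all names are our assumptions; the paper's kernelized formulation and its coordinate-descent updates are not reproduced here.

```python
import numpy as np

def reconstruction_error(X, W, pairs, d_orig):
    """Sum of squared differences between target distances d_orig (assumed rescaled
    to [0, 1]) and normalized Hamming distances of binary codes obtained from
    hypothetical linear hash functions sign(X @ W)."""
    B = (X @ W > 0)                        # (n, b) binary codes
    err = 0.0
    for (i, j), d in zip(pairs, d_orig):
        d_ham = np.mean(B[i] != B[j])      # normalized Hamming distance
        err += (d - d_ham) ** 2
    return err
```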

5 0.43342507 111 nips-2009-Hierarchical Modeling of Local Image Features through $L p$-Nested Symmetric Distributions

Author: Matthias Bethge, Eero P. Simoncelli, Fabian H. Sinz

Abstract: We introduce a new family of distributions, called Lp-nested symmetric distributions, whose densities are expressed in terms of a hierarchical cascade of Lp norms. This class generalizes the family of spherically and Lp-spherically symmetric distributions which have recently been successfully used for natural image modeling. Similar to those distributions it allows for a nonlinear mechanism to reduce the dependencies between its variables. With suitable choices of the parameters and norms, this family includes the Independent Subspace Analysis (ISA) model as a special case, which has been proposed as a means of deriving filters that mimic complex cells found in mammalian primary visual cortex. Lp-nested distributions are relatively easy to estimate and allow us to explore the variety of models between ISA and the Lp-spherically symmetric models. By fitting the generalized Lp-nested model to 8 × 8 image patches, we show that the subspaces obtained from ISA are in fact more dependent than the individual filter coefficients within a subspace. When first applying contrast gain control as preprocessing, however, there are no dependencies left that could be exploited by ISA. This suggests that complex cell modeling can only be useful for redundancy reduction in larger image patches.
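The "hierarchical cascade of Lp norms" in this abstract has a direct recursive implementation. The sketch below is ours: the tree encoding (an integer index for a leaf, a pair (p, children) for an inner node) is a hypothetical convention chosen purely for illustration.

```python
import numpy as np

def lp_nested(x, node):
    """Evaluate an Lp-nested function at x.  A leaf is an integer index into x;
    an inner node is a pair (p, children) and returns the p-norm of its children's values."""
    if isinstance(node, int):
        return abs(x[node])
    p, children = node
    vals = np.array([lp_nested(x, child) for child in children])
    return float((vals ** p).sum() ** (1.0 / p))

# Example: an outer L2 node over coordinate 0, an inner L1.5 node over coordinates (1, 2),
# and coordinate 3.
tree = (2.0, [0, (1.5, [1, 2]), 3])
value = lp_nested(np.array([1.0, -2.0, 3.0, 0.5]), tree)
```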

6 0.43017238 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs

7 0.42963499 226 nips-2009-Spatial Normalized Gamma Processes

8 0.42896104 129 nips-2009-Learning a Small Mixture of Trees

9 0.42861959 31 nips-2009-An LP View of the M-best MAP problem

10 0.42848402 174 nips-2009-Nonparametric Latent Feature Models for Link Prediction

11 0.42767712 217 nips-2009-Sharing Features among Dynamical Systems with Beta Processes

12 0.42751572 188 nips-2009-Perceptual Multistability as Markov Chain Monte Carlo Inference

13 0.42664897 260 nips-2009-Zero-shot Learning with Semantic Output Codes

14 0.4247613 168 nips-2009-Non-stationary continuous dynamic Bayesian networks

15 0.42423815 97 nips-2009-Free energy score space

16 0.4235141 154 nips-2009-Modeling the spacing effect in sequential category learning

17 0.4232007 246 nips-2009-Time-Varying Dynamic Bayesian Networks

18 0.42239502 131 nips-2009-Learning from Neighboring Strokes: Combining Appearance and Context for Multi-Domain Sketch Recognition

19 0.42172557 41 nips-2009-Bayesian Source Localization with the Multivariate Laplace Prior

20 0.42127755 28 nips-2009-An Additive Latent Feature Model for Transparent Object Recognition