nips nips2011 nips2011-14 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Martin O. Larsson, Johan Ugander
Abstract: Latent variable mixture models are a powerful tool for exploring the structure in large datasets. A common challenge for interpreting such models is a desire to impose sparsity, the natural assumption that each data point only contains few latent features. Since mixture distributions are constrained in their L1 norm, typical sparsity techniques based on L1 regularization become toothless, and concave regularization becomes necessary. Unfortunately concave regularization typically results in EM algorithms that must perform problematic non-concave M-step maximizations. In this work, we introduce a technique for circumventing this difficulty, using the so-called Mountain Pass Theorem to provide easily verifiable conditions under which the M-step is well-behaved despite the lacking concavity. We also develop a correspondence between logarithmic regularization and what we term the pseudo-Dirichlet distribution, a generalization of the ordinary Dirichlet distribution well-suited for inducing sparsity. We demonstrate our approach on a text corpus, inferring a sparse topic mixture model for 2,406 weblogs. 1
Reference: text
sentIndex sentText sentNum sentScore
1 A concave regularization technique for sparse mixture models Martin Larsson School of Operations Research and Information Engineering Cornell University mol23@cornell. [sent-1, score-0.19]
2 A common challenge for interpreting such models is a desire to impose sparsity, the natural assumption that each data point only contains few latent features. [sent-4, score-0.082]
3 Since mixture distributions are constrained in their L1 norm, typical sparsity techniques based on L1 regularization become toothless, and concave regularization becomes necessary. [sent-5, score-0.259]
4 Unfortunately concave regularization typically results in EM algorithms that must perform problematic non-concave M-step maximizations. [sent-6, score-0.093]
5 We also develop a correspondence between logarithmic regularization and what we term the pseudo-Dirichlet distribution, a generalization of the ordinary Dirichlet distribution well-suited for inducing sparsity. [sent-8, score-0.116]
6 We demonstrate our approach on a text corpus, inferring a sparse topic mixture model for 2,406 weblogs. [sent-9, score-0.152]
7 Specific contexts for this demand include latent semantic models for organizing text corpora, image feature extraction models for navigating large photo datasets, and community detection in social networks for optimizing content delivery. [sent-11, score-0.127]
8 Mixture models identify such latent structure, helping to categorize unstructured data. [sent-12, score-0.068]
9 In this work we explore an additional sparsity assumption, namely that individual elements only incorporate a small subset of the |Z| classes, so that each element arises as a mixture of far fewer than |Z| classes.
10 Our primary context for mixture modeling in this work will be latent semantic models of text data, where elements d are documents, words w are literal words, and classes z are vocabulary topics. [sent-19, score-0.198]
11 Sparse inference as a rule targets point estimation, which makes PLSA-style models appropriate since they are inherently frequentist, deriving point-estimated models via likelihood maximization. [sent-23, score-0.059]
12 Sparse inference is commonly achieved through two largely equivalent techniques: regularization or MAP inference. [sent-25, score-0.063]
13 Regularization modifies ordinary likelihood maximization with a penalty on the magnitudes of the parameters. [sent-26, score-0.069]
14 Maximum a posteriori (MAP) inference employs priors concentrated towards small parameter values. [sent-27, score-0.071]
15 MAP PLSA is an established technique [7], but earlier work has been limited to log-concave prior distributions (corresponding to convex regularization functions) that make a concave contribution to the posterior log-likelihood. [sent-28, score-0.202]
16 In this work we resolve this difficulty by showing how, even though concavity fails in general, we are able to derive simple checkable conditions that guarantee a unique stationary point to the M-step objective function that serves as the unique global maximum. [sent-31, score-0.204]
17 Section 3 discusses priors appropriate for inducing sparsity, and introduces a generalization of the Dirichlet distribution which we term the pseudo-Dirichlet distribution. [sent-34, score-0.094]
18 Section 4 contains our main result, a tractable EM algorithm for PLSA under sparse pseudo-Dirichlet priors using the Mountain Pass Theorem. [sent-35, score-0.074]
19 Following [1], the corresponding data log-likelihood can be written ℓ_0(θ) = Σ_{w,d} n(w, d) log Σ_z P(w | z) P(z | d) + Σ_d n(d) log P(d), (2) where n(w, d) is the number of occurrences of word w in document d, and n(d) = Σ_w n(w, d) is the total number of words in d.
20 This is accomplished using the EM algorithm, iterating between the following two steps: E-step: Find P (z | w, d, θ ), the posterior distribution of the latent variable z, given (w, d) and a current parameter estimate θ . [sent-44, score-0.083]
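For concreteness, the E-step posterior is available in closed form: P(z | w, d, θ′) ∝ P(w | z) P(z | d), normalized over z. The sketch below is our own illustration of this step (the array layout and function name are assumptions, not from the paper):

```python
import numpy as np

def e_step(p_w_given_z, p_z_given_d):
    """PLSA E-step: posterior P(z | w, d) for every (w, d) pair.

    p_w_given_z : array of shape (|W|, |Z|), columns sum to 1
    p_z_given_d : array of shape (|Z|, |D|), columns sum to 1
    Returns an array of shape (|W|, |Z|, |D|) holding P(z | w, d, theta').
    """
    # Unnormalized posterior: P(w | z) P(z | d) for each (w, z, d).
    joint = p_w_given_z[:, :, None] * p_z_given_d[None, :, :]
    # Normalize over z; strictly positive parameters keep the denominator nonzero.
    return joint / joint.sum(axis=1, keepdims=True)
```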
21 This formulation is less interesting in our context, since our sparsity assumption is intended as a statement about the vectors (P (z | d) : z ∈ Z), d ∈ D. [sent-49, score-0.09]
22 MAP PLSA The standard MAP extension of PLSA is to introduce a prior density P(θ) on the parameter vector, and then maximize the posterior data log-likelihood ℓ(θ) = ℓ_0(θ) + log P(θ) via the EM algorithm.
23 That is, P(θ) = ∏_{z∈Z} f_z(P(w | z) : w ∈ W) × ∏_{d∈D} g_d(P(z | d) : z ∈ Z) × h(P(d) : d ∈ D), for densities f_z, g_d and h on the simplexes in R^{|W|}, R^{|Z|} and R^{|D|}, respectively.
24 As a comment, notice that if the densities f_z, g_d, or h are log-concave then F_z, G_d, and H are concave in θ.
25 The most well-known family of distributions on the simplex is the Dirichlet family, which has many properties that make it useful in Bayesian statistics [8]. [sent-59, score-0.055]
26 Unfortunately the Dirichlet distribution is not a suitable prior for modeling sparsity for PLSA, as we shall see, and to address this we introduce a generalization of the Dirichlet distribution which we call the pseudo-Dirichlet distribution. [sent-60, score-0.115]
27 To illustrate why the Dirichlet distribution is unsuitable in the present context, consider placing a symmetric Dirichlet prior on (P (z | d) : z ∈ Z) for each document d. [sent-61, score-0.1]
28 That is, for each d ∈ D, g_d(P(z | d) : z ∈ Z) ∝ ∏_{z∈Z} P(z | d)^{α−1}, where α > 0 is the concentration parameter.
29 The relevant case for sparsity is when α < 1, which concentrates the density toward the (relative) boundary of the simplex. [sent-64, score-0.096]
30 A bigger problem, however, is that for α < 1 the density of the symmetric Dirichlet distribution is unbounded and the MAP likelihood problem does not have a well-defined solution, as the following result shows. [sent-67, score-0.104]
31 Proposition 1 Under the above assumptions on f_z, g_d and h there are infinitely many sequences (θ_m)_{m≥1}, converging to distinct limits, such that lim_{m→∞} Q(θ_m | θ_m) = ∞.
32 This proposition is a formal statement of the observation that when the Dirichlet prior is unbounded, any single zero element in P (z|d) leads to an infinite posterior likelihood, and so the optimization problem is not well-posed. [sent-77, score-0.062]
33 To overcome these unbounded Dirichlet priors while retaining their sparsity-inducing properties, we introduce the following class of distributions on the simplex. [sent-78, score-0.098]
34 Definition 1 A random vector confined to the simplex in R^p is said to follow a pseudo-Dirichlet distribution with concentration parameter α = (α_1, ..., α_p) ∈ R^p and perturbation parameter ε = (ε_1, ..., ε_p) ∈ R^p_+ if it has a density on the simplex proportional to ∏_{i=1}^p (ε_i + x_i)^{α_i − 1}.
36 The pseudo-Dirichlet distribution can be viewed as a bounded perturbation of the Dirichlet distribution, and for small values of the perturbation parameter ε, many of the properties of the original Dirichlet distribution hold approximately.
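As a concrete illustration of the definition, the following sketch evaluates the unnormalized log-density of a pseudo-Dirichlet vector; the function name and interface are ours, and the normalizing constant is omitted:

```python
import numpy as np

def pseudo_dirichlet_logpdf_unnorm(x, alpha, eps):
    """Unnormalized log-density log prod_i (eps_i + x_i)^(alpha_i - 1)
    for a point x on the simplex.  Unlike the Dirichlet with alpha < 1,
    this stays bounded as any x_i -> 0, because eps_i > 0."""
    x, alpha, eps = map(np.asarray, (x, alpha, eps))
    assert np.all(eps > 0) and np.all(x >= 0) and np.isclose(x.sum(), 1.0)
    return float(np.sum((alpha - 1.0) * np.log(eps + x)))
```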
37 In our discussion section we offer a justification for allowing α ≤ 0, framed within a regularization approach. [sent-95, score-0.058]
38 4 EM under pseudo-Dirichlet priors We now derive an EM algorithm for PLSA under sparse pseudo-Dirichlet priors. [sent-96, score-0.074]
39 While our M-step will not offer a closed-form maximization, we are able to derive simple checkable conditions under which the M-step has a stationary point that is also the global maximum. [sent-99, score-0.185]
40 Because our sparsity assumption focuses on the parameters P (z|d), we perform our main analysis on Gd , but for completeness we state the corresponding result for Fz . [sent-102, score-0.069]
41 We use symmetric pseudo-Dirichlet priors with parameters α_d = (α_d, ..., α_d) and ε_d = (ε_d, ..., ε_d) for each document d.
42 Since each G_d is treated separately, let us fix d and write x_z = P(z | d) and c_z = Σ_w P(z | w, d, θ′) n(w, d), where the dependence on d is suppressed in the notation.
43 For x = (x_z : z ∈ Z) and a fixed θ′, we write G_d(x) = G_d(θ | θ′), which yields, up to an additive constant, G_d(x) = Σ_z [(α_d − 1) log(ε_d + x_z) + c_z log x_z].
44 The task is to maximize G_d, subject to Σ_z x_z = 1 and x_z ≥ 0 for all z.
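Written out per document, the objective is easy to evaluate directly, which is convenient for monitoring the fixed-point iteration introduced later in this section; the helper below is a sketch of ours, not code from the paper:

```python
import numpy as np

def m_step_objective(x, c, alpha_d, eps_d):
    """M-step objective G_d(x), up to an additive constant, for one document.

    x       : current estimate of (P(z | d) : z in Z), a point on the simplex
    c       : vector with c_z = sum_w P(z | w, d, theta') n(w, d)
    alpha_d : concentration parameter of the symmetric pseudo-Dirichlet prior
    eps_d   : perturbation parameter of the prior
    """
    return float(np.sum((alpha_d - 1.0) * np.log(eps_d + x) + c * np.log(x)))
```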
45 Assuming that every word w is observed in at least one document d and that all components of θ are strictly positive, Lemma 1 below implies that any M-step optimizer must have strictly positive components. [sent-115, score-0.136]
46 The non-negativity constraint is therefore never binding, so the appropriate Lagrangian for this problem is L_d(x; λ) = G_d(x) + λ(1 − Σ_z x_z), and it suffices to consider its stationary points.
47 Lemma 1 Assume that every word w has been observed in at least one document d, and that P (z | w, d; θ ) > 0 for all (w, d, z). [sent-117, score-0.064]
48 If xz → 0 for some z, and the nonnegativity and normalization constraints are maintained, then Gd (x) → −∞. [sent-118, score-0.431]
49 Therefore, since log(ε_d + x_z) and log x_z are bounded from above for all z when θ stays in the feasible region, x_z → 0 leads to G_d(x) → −∞.
50 The next lemma establishes a property of the stationary points of the Lagrangian Ld which will be the key to proving our main result. [sent-122, score-0.154]
51 Lemma 2 Let (x, λ) be any stationary point of Ld such that xz > 0 for all z. [sent-123, score-0.556]
52 Since ∂L_d/∂x_z(x, λ) = 0 at the stationary point, we get λ x_z = c_z − (1 − α_d) x_z/(ε_d + x_z) ≥ c_z − (1 − α_d), which, after summing over z and using that Σ_z x_z = 1, yields λ ≥ Σ_z c_z − (1 − α_d)|Z|.
53 Furthermore, Σ_z c_z = Σ_w n(w, d) Σ_z P(z | w, d, θ′) = n(d), so λ ≥ n(d) − (1 − α_d)|Z|.
54 For the second assertion, using once again that ∂L_d/∂x_z(x, λ) = 0 at the stationary point, a calculation shows that ∂²G_d/∂x_z²(x) = −(λ x_z² + c_z ε_d) / (x_z² (ε_d + x_z)).
55 The assumptions imply that c_z > 0, so it suffices to prove that λ ≥ 0.
56 Theorem 1 Assume that (i) every word w has been observed in at least one document d, (ii) P (z | w, d, θ ) > 0 for all (w, d, z), and (iii) n(d) > (1 − αd )|Z| for each d. [sent-134, score-0.064]
57 Then each Lagrangian Ld has a unique stationary point, which is the global maximum of the corresponding optimization problem, and whose components are strictly positive. [sent-135, score-0.157]
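Condition (iii) is the only one that ties the data to the prior, and it is trivial to check per document before running the M-step; a minimal sketch (the naming is ours):

```python
def theorem1_condition_iii(n_d, alpha_d, num_topics):
    """Check n(d) > (1 - alpha_d) |Z|, which together with (i) and (ii) guarantees
    a unique, strictly positive global maximizer of the M-step objective G_d."""
    return n_d > (1.0 - alpha_d) * num_topics
```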
58 If φ has two distinct strict local maxima, it must have a third stationary point that is not a strict local maximum. [sent-140, score-0.215]
59 We first prove that the corresponding Lagrangian Ld can have at most one stationary point. [sent-147, score-0.121]
60 The following facts are readily verified: (i) If (x, λ) is a stationary point of L_d, then (x_1, ..., x_{K−1}) is a stationary point of G_d, viewed as a function of its first K − 1 components.
61 (ii) Conversely, if (x_1, ..., x_{K−1}) is a stationary point of G_d, then (x, λ) is a stationary point of L_d, where x_K = 1 − Σ_{k=1}^{K−1} x_k and λ = c_K/x_K − (1 − α_d)/(ε_d + x_K).
62 Now, suppose that (x, λ) is a stationary point of L_d.
63 Then, by (i) and Lemma 2, (x_1, ..., x_{K−1}) is a stationary point of G_d and ∇²G_d is negative definite there, so it is a strict local maximum of G_d; the same applies to any other stationary point of L_d.
64 By Lemma 1, G_d tends to −∞ near the boundary of O, so we may apply the mountain pass theorem to get the existence of a third point (x̃_1, ..., x̃_{K−1}), stationary for G_d, that is not a strict local maximum.
66 But by (ii), this x̃ yields a corresponding stationary point (x̃, λ̃) of L_d, which by the argument above would again be a strict local maximum, a contradiction.
67 We deduce that L_d has at most one stationary point.
68 Finally, the continuity of Gd together with its boundary behavior (Lemma 1) implies that a maximizer exists and has strictly positive components. [sent-191, score-0.08]
69 But the maximizer must be a stationary point of Ld , so together with the previously established uniqueness, the result follows. [sent-192, score-0.161]
70 Condition (i) in Theorem 1 is not a real restriction, since a word that does not appear in any document typically will be removed from the vocabulary. [sent-193, score-0.064]
71 While there are various methods available for finding the stationary point of Ld , we have found that the following fixed-point type iterative scheme produces satisfactory results. [sent-198, score-0.141]
72 x_z ← [c_z − (1 − α_d) x_z/(ε_d + x_z)] / [n(d) − (1 − α_d) Σ_{z'} x_{z'}/(ε_d + x_{z'})]. (6) To motivate this particular update rule, recall that ∂L_d/∂x_z = c_z/x_z − (1 − α_d)/(ε_d + x_z) − λ and ∂L_d/∂λ = 1 − Σ_z x_z.
73 z At the stationary point, λxz = cz − 1−αd xz , so by summing over z and using that z xz = 1 d +xz d and z cz = n(d), we get λ = n(d) − (1 − αd ) z dxz z . [sent-201, score-1.384]
74 Notice that Lemma 2 ensures that the denominator stays strictly positive. [sent-203, score-0.054]
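A sketch of the resulting per-document M-step is given below. The update follows (6) as written above; the convergence test, the guard against transient negative numerators, and all names are our own additions rather than details from the paper:

```python
import numpy as np

def m_step_document(c, n_d, alpha_d, eps_d, num_iters=100, tol=1e-10):
    """Fixed-point iteration (6) for one document: returns x with x_z = P(z | d).

    c       : length-|Z| vector, c_z = sum_w P(z | w, d, theta') n(w, d)
    n_d     : total word count n(d) of the document (equals c.sum())
    alpha_d : concentration parameter of the symmetric pseudo-Dirichlet prior
    eps_d   : perturbation parameter of the prior
    """
    K = len(c)
    assert n_d > (1.0 - alpha_d) * K, "condition (iii) of Theorem 1 is violated"
    x = c / c.sum()                              # start from the unregularized update
    for _ in range(num_iters):
        shrink = (1.0 - alpha_d) * x / (eps_d + x)
        denom = n_d - shrink.sum()               # this is lambda; positive by Lemma 2
        x_new = (c - shrink) / denom
        x_new = np.clip(x_new, 1e-300, None)     # numerical guard, not part of (6)
        x_new /= x_new.sum()
        if np.max(np.abs(x_new - x)) < tol:
            return x_new
        x = x_new
    return x
```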
75 We impose a symmetric pseudo-Dirichlet prior on the vector (P (w | z) : w ∈ W) for each z ∈ Z. [sent-207, score-0.062]
76 The objective function F_z(y) = F_z(θ | θ′) is then given by F_z(y) = Σ_w [(α_z − 1) log(ε_z + y_w) + b_w log y_w], where b_w = Σ_d P(z | w, d, θ′) n(w, d).
77 5 Empirical evaluation To evaluate our framework for sparse mixture model inference, we develop a MAP PLSA topic model for a corpus of 2,406 blogger.com weblogs.
78 Unigram frequencies for the blogs were built using the python NLTK toolkit [12]. [sent-219, score-0.101]
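A rough sketch of this preprocessing step is shown below; the exact tokenization, casing and filtering choices used in the paper are not specified, so those details are assumptions (NLTK's 'punkt' tokenizer data must be downloaded beforehand):

```python
from collections import Counter
import nltk

def unigram_counts(text, vocabulary=None):
    """Unigram frequencies n(w, d) for one blog, with an optional restriction
    to a fixed vocabulary such as the 2,000 most common words."""
    # Assumed preprocessing: word tokenization, lower-casing, alphabetic tokens only.
    tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]
    counts = Counter(tokens)
    if vocabulary is not None:
        counts = Counter({w: c for w, c in counts.items() if w in vocabulary})
    return counts
```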
79 Inference was run on the document-word distribution of 2,406 blogs and 2,000 most common words, as determined by the aggregate frequencies across the entire corpus. [sent-220, score-0.121]
80 The implication of Section 4 is that in order to adapt PLSA for sparse MAP inference, we simply need to replace equation (4) from PLSA’s ordinary M-step with an iteration of (6).
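Putting the pieces together, a sketch of the full sparse MAP PLSA loop might look as follows, reusing the e_step and m_step_document sketches above. The random initialization, the iteration count, and the flat priors on P(w | z) are simplifying assumptions of ours; P(d) is omitted because, with h constant, its estimate n(d)/N does not interact with the other parameters:

```python
import numpy as np

def sparse_map_plsa(n_wd, num_topics, alpha_d, eps_d, num_em_iters=100, seed=0):
    """EM for MAP PLSA with a symmetric pseudo-Dirichlet prior on P(z | d) only.

    n_wd : word-document count matrix of shape (|W|, |D|)
    Returns (p_w_given_z, p_z_given_d) of shapes (|W|, |Z|) and (|Z|, |D|).
    """
    rng = np.random.default_rng(seed)
    W, D = n_wd.shape
    p_w_given_z = rng.dirichlet(np.ones(W), size=num_topics).T   # (W, Z)
    p_z_given_d = rng.dirichlet(np.ones(num_topics), size=D).T   # (Z, D)
    n_d = n_wd.sum(axis=0)                                       # n(d) per document
    for _ in range(num_em_iters):
        post = e_step(p_w_given_z, p_z_given_d)                  # (W, Z, D)
        # Ordinary M-step for P(w | z): proportional to sum_d P(z | w, d) n(w, d).
        pw = (post * n_wd[:, None, :]).sum(axis=2)
        p_w_given_z = pw / pw.sum(axis=0, keepdims=True)
        # Regularized M-step for P(z | d): closed form replaced by iteration (6).
        for d in range(D):
            c = (post[:, :, d] * n_wd[:, d][:, None]).sum(axis=0)
            p_z_given_d[:, d] = m_step_document(c, n_d[d], alpha_d, eps_d)
    return p_w_given_z, p_z_given_d
```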
81 The user-provided topic labels are quite noisy, and so in order to have cleaner ground truth data for evaluating our model we chose to also construct a synthetic, sparse dataset. [sent-224, score-0.072]
82 To generate our synthetic data, we ran PLSA on our text corpus and extracted the inferred P (w|z) and P (d) distributions, while creating 2,406 synthetic P (z|d) distributions where each synthetic blog was a uniform mixture of between 1 and 4 topics. [sent-226, score-0.325]
83 These distributions were then used to construct a ground-truth word-document distribution Q(w, d), which we then sampled N times, where N is the total number of words in our true corpus. [sent-227, score-0.064]
84 In this way we were able to generate a realistic synthetic dataset with a sparse and known document-topic distribution. [sent-228, score-0.063]
85 We evaluate the quality of each model by calculating the model perplexity of the reconstructed worddocument distribution as compared to the underlying ground truth distribution used to generate the synthetic data. [sent-229, score-0.179]
86 Here model perplexity is given by P(P(w, d)) = 2^(−Σ_{w,d} Q(w,d) log_2 P(w,d)), where Q(w, d) is the true document-word distribution used to generate the synthetic dataset and P(w, d) is the reconstructed matrix inferred by the model.
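The perplexity computation itself is short; the sketch below assumes Q and P are dense arrays over the same word-document grid, with zero entries of Q contributing nothing to the sum:

```python
import numpy as np

def model_perplexity(Q, P):
    """Perplexity 2^(-sum_{w,d} Q(w,d) log2 P(w,d)) of a reconstructed
    word-document distribution P against the ground truth Q."""
    mask = Q > 0
    return float(2.0 ** (-np.sum(Q[mask] * np.log2(P[mask]))))
```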
87 In Figure 2 we see that sparse inference indeed results in P (z|d) distributions with significantly sparser support. [sent-234, score-0.07]
88 Furthermore, we can more easily see how certain categories of blogs cluster in their usage of certain topics. [sent-235, score-0.085]
89 For example, a majority of the blogs self-categorized as pertaining to ‘religion’ employ almost exclusively the second topic vocabulary of the model. [sent-236, score-0.158]
90 6 Discussion We have shown how certain latent variable mixture models can be tractably extended with sparsityinducing priors using what we call the pseudo-Dirichlet distribution. [sent-238, score-0.146]
91 The use of log-convex priors (equivalently, concave regularization functions) to encourage sparsity is particularly relevant when the parameters of the model correspond to probability distributions. [sent-241, score-0.192]
92 The pseudo-Dirichlet prior we introduce corresponds to a concave regularization of the form Σ_i log(x_i + ε).
93 Figure 1: Model perplexity for inferred models with k = 8 topics as a function of the concentration parameter α of the pseudo-Dirichlet prior, shown from the algorithm’s lower bound α = 1 − n(d)/k to the uniform prior case of α = 1.
94 Three different choices of ε are shown, as well as the baseline PLSA perplexity corresponding to a uniform prior.
95 The dashed line indicates the perplexity P(Q(w, d)), which should be interpreted as a lower bound. [sent-250, score-0.085]
96 We note in passing that the same sum-log regularization has also been used for sparse signal recovery in [13].
97 Indeed, Theorem 1 shows that, no different from ordinary PLSA, the estimated parameters for MAP PLSA will all be strictly positive. [sent-255, score-0.066]
98 Next, let us comment on the possibility of allowing the concentration parameter α_d to be negative, assuming for simplicity that f_z and h are constant.
99 This indicates that the quantity (1 − αd )/N , which plays the role of a regularization ‘gain’ in the normalized problem, must be non-negligible in order for the regularization to have an effect. [sent-259, score-0.084]
100 Nonnegative matrix factorization and probabilistic latent semantic indexing: Equivalence, chi-square statistic, and a hybrid method.
wordName wordTfidf (topN-words)
[('gd', 0.593), ('plsa', 0.428), ('xz', 0.415), ('fz', 0.233), ('cz', 0.209), ('ld', 0.161), ('stationary', 0.121), ('dirichlet', 0.117), ('xk', 0.092), ('blogs', 0.085), ('perplexity', 0.085), ('mountain', 0.069), ('map', 0.068), ('pass', 0.052), ('bw', 0.052), ('concave', 0.051), ('mixture', 0.05), ('priors', 0.05), ('corpus', 0.05), ('sparsity', 0.049), ('lz', 0.049), ('nmf', 0.049), ('religion', 0.048), ('topic', 0.048), ('latent', 0.046), ('em', 0.046), ('regularization', 0.042), ('synthetic', 0.039), ('lagrangian', 0.038), ('strict', 0.037), ('blog', 0.036), ('strictly', 0.036), ('concavity', 0.035), ('document', 0.034), ('lemma', 0.033), ('schler', 0.032), ('word', 0.03), ('text', 0.03), ('ordinary', 0.03), ('simplex', 0.03), ('media', 0.029), ('log', 0.028), ('checkable', 0.028), ('tourism', 0.028), ('weblogs', 0.028), ('semantic', 0.028), ('theorem', 0.027), ('prior', 0.026), ('perturbation', 0.025), ('vocabulary', 0.025), ('concentration', 0.025), ('distributions', 0.025), ('yw', 0.024), ('boundary', 0.024), ('yk', 0.024), ('inducing', 0.024), ('sparse', 0.024), ('demand', 0.023), ('unbounded', 0.023), ('technique', 0.023), ('density', 0.023), ('intended', 0.022), ('unstructured', 0.022), ('inference', 0.021), ('tends', 0.021), ('maximization', 0.021), ('maximizer', 0.02), ('completeness', 0.02), ('point', 0.02), ('distribution', 0.02), ('comment', 0.02), ('gender', 0.02), ('symmetric', 0.02), ('iii', 0.019), ('words', 0.019), ('statement', 0.019), ('nonnegative', 0.019), ('rp', 0.018), ('likelihood', 0.018), ('contribution', 0.018), ('veri', 0.018), ('cornell', 0.018), ('internet', 0.018), ('lda', 0.018), ('stays', 0.018), ('posterior', 0.017), ('factorization', 0.017), ('inferred', 0.017), ('impose', 0.016), ('normalization', 0.016), ('maximize', 0.016), ('frequencies', 0.016), ('law', 0.016), ('allocation', 0.016), ('offer', 0.016), ('applicable', 0.016), ('reconstructed', 0.015), ('topics', 0.015), ('summing', 0.015), ('densities', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 14 nips-2011-A concave regularization technique for sparse mixture models
Author: Martin O. Larsson, Johan Ugander
Abstract: Latent variable mixture models are a powerful tool for exploring the structure in large datasets. A common challenge for interpreting such models is a desire to impose sparsity, the natural assumption that each data point only contains few latent features. Since mixture distributions are constrained in their L1 norm, typical sparsity techniques based on L1 regularization become toothless, and concave regularization becomes necessary. Unfortunately concave regularization typically results in EM algorithms that must perform problematic non-concave M-step maximizations. In this work, we introduce a technique for circumventing this difficulty, using the so-called Mountain Pass Theorem to provide easily verifiable conditions under which the M-step is well-behaved despite the lacking concavity. We also develop a correspondence between logarithmic regularization and what we term the pseudo-Dirichlet distribution, a generalization of the ordinary Dirichlet distribution well-suited for inducing sparsity. We demonstrate our approach on a text corpus, inferring a sparse topic mixture model for 2,406 weblogs. 1
2 0.10939275 58 nips-2011-Complexity of Inference in Latent Dirichlet Allocation
Author: David Sontag, Dan Roy
Abstract: We consider the computational complexity of probabilistic inference in Latent Dirichlet Allocation (LDA). First, we study the problem of finding the maximum a posteriori (MAP) assignment of topics to words, where the document’s topic distribution is integrated out. We show that, when the e↵ective number of topics per document is small, exact inference takes polynomial time. In contrast, we show that, when a document has a large number of topics, finding the MAP assignment of topics to words in LDA is NP-hard. Next, we consider the problem of finding the MAP topic distribution for a document, where the topic-word assignments are integrated out. We show that this problem is also NP-hard. Finally, we briefly discuss the problem of sampling from the posterior, showing that this is NP-hard in one restricted setting, but leaving open the general question. 1
3 0.064925179 281 nips-2011-The Doubly Correlated Nonparametric Topic Model
Author: Dae I. Kim, Erik B. Sudderth
Abstract: Topic models are learned via a statistical model of variation within document collections, but designed to extract meaningful semantic structure. Desirable traits include the ability to incorporate annotations or metadata associated with documents; the discovery of correlated patterns of topic usage; and the avoidance of parametric assumptions, such as manual specification of the number of topics. We propose a doubly correlated nonparametric topic (DCNT) model, the first model to simultaneously capture all three of these properties. The DCNT models metadata via a flexible, Gaussian regression on arbitrary input features; correlations via a scalable square-root covariance representation; and nonparametric selection from an unbounded series of potential topics via a stick-breaking construction. We validate the semantic structure and predictive performance of the DCNT using a corpus of NIPS documents annotated by various metadata. 1
4 0.056544147 258 nips-2011-Sparse Bayesian Multi-Task Learning
Author: Shengbo Guo, Onno Zoeter, Cédric Archambeau
Abstract: We propose a new sparse Bayesian model for multi-task regression and classification. The model is able to capture correlations between tasks, or more specifically a low-rank approximation of the covariance matrix, while being sparse in the features. We introduce a general family of group sparsity inducing priors based on matrix-variate Gaussian scale mixtures. We show the amount of sparsity can be learnt from the data by combining an approximate inference approach with type II maximum likelihood estimation of the hyperparameters. Empirical evaluations on data sets from biology and vision demonstrate the applicability of the model, where on both regression and classification tasks it achieves competitive predictive performance compared to previously proposed methods. 1
5 0.053556085 129 nips-2011-Improving Topic Coherence with Regularized Topic Models
Author: David Newman, Edwin V. Bonilla, Wray Buntine
Abstract: Topic models have the potential to improve search and browsing by extracting useful semantic themes from web pages and other text documents. When learned topics are coherent and interpretable, they can be valuable for faceted browsing, results set diversity analysis, and document retrieval. However, when dealing with small collections or noisy text (e.g. web search result snippets or blog posts), learned topics can be less coherent, less interpretable, and less useful. To overcome this, we propose two methods to regularize the learning of topic models. Our regularizers work by creating a structured prior over words that reflect broad patterns in the external data. Using thirteen datasets we show that both regularizers improve topic coherence and interpretability while learning a faithful representation of the collection of interest. Overall, this work makes topic models more useful across a broader range of text data. 1
6 0.051974673 104 nips-2011-Generalized Beta Mixtures of Gaussians
7 0.047695622 134 nips-2011-Infinite Latent SVM for Classification and Multi-task Learning
8 0.04729623 259 nips-2011-Sparse Estimation with Structured Dictionaries
9 0.046646126 63 nips-2011-Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization
10 0.046362966 116 nips-2011-Hierarchically Supervised Latent Dirichlet Allocation
11 0.043448765 266 nips-2011-Spatial distance dependent Chinese restaurant processes for image segmentation
12 0.042877838 115 nips-2011-Hierarchical Topic Modeling for Analysis of Time-Evolving Personal Choices
13 0.040641293 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding
14 0.038960986 302 nips-2011-Variational Learning for Recurrent Spiking Networks
15 0.038437262 221 nips-2011-Priors over Recurrent Continuous Time Processes
16 0.038360383 191 nips-2011-Nonnegative dictionary learning in the exponential noise model for adaptive music signal representation
17 0.03502322 110 nips-2011-Group Anomaly Detection using Flexible Genre Models
18 0.034844343 161 nips-2011-Linearized Alternating Direction Method with Adaptive Penalty for Low-Rank Representation
19 0.033991892 239 nips-2011-Robust Lasso with missing and grossly corrupted observations
20 0.033141818 260 nips-2011-Sparse Features for PCA-Like Linear Regression
topicId topicWeight
[(0, 0.11), (1, 0.023), (2, -0.005), (3, -0.027), (4, -0.037), (5, -0.122), (6, 0.052), (7, 0.034), (8, -0.045), (9, 0.053), (10, 0.048), (11, 0.021), (12, 0.008), (13, -0.002), (14, -0.007), (15, -0.008), (16, 0.014), (17, -0.038), (18, -0.008), (19, 0.01), (20, 0.012), (21, 0.01), (22, -0.006), (23, -0.002), (24, -0.006), (25, -0.02), (26, 0.023), (27, 0.005), (28, 0.004), (29, -0.0), (30, 0.014), (31, -0.009), (32, -0.045), (33, -0.01), (34, -0.032), (35, -0.023), (36, 0.006), (37, -0.028), (38, -0.038), (39, -0.088), (40, -0.033), (41, 0.012), (42, -0.086), (43, -0.013), (44, 0.021), (45, 0.054), (46, 0.1), (47, -0.013), (48, 0.014), (49, 0.06)]
simIndex simValue paperId paperTitle
same-paper 1 0.88720316 14 nips-2011-A concave regularization technique for sparse mixture models
Author: Martin O. Larsson, Johan Ugander
Abstract: Latent variable mixture models are a powerful tool for exploring the structure in large datasets. A common challenge for interpreting such models is a desire to impose sparsity, the natural assumption that each data point only contains few latent features. Since mixture distributions are constrained in their L1 norm, typical sparsity techniques based on L1 regularization become toothless, and concave regularization becomes necessary. Unfortunately concave regularization typically results in EM algorithms that must perform problematic non-concave M-step maximizations. In this work, we introduce a technique for circumventing this difficulty, using the so-called Mountain Pass Theorem to provide easily verifiable conditions under which the M-step is well-behaved despite the lacking concavity. We also develop a correspondence between logarithmic regularization and what we term the pseudo-Dirichlet distribution, a generalization of the ordinary Dirichlet distribution well-suited for inducing sparsity. We demonstrate our approach on a text corpus, inferring a sparse topic mixture model for 2,406 weblogs. 1
2 0.65457773 58 nips-2011-Complexity of Inference in Latent Dirichlet Allocation
Author: David Sontag, Dan Roy
Abstract: We consider the computational complexity of probabilistic inference in Latent Dirichlet Allocation (LDA). First, we study the problem of finding the maximum a posteriori (MAP) assignment of topics to words, where the document’s topic distribution is integrated out. We show that, when the e↵ective number of topics per document is small, exact inference takes polynomial time. In contrast, we show that, when a document has a large number of topics, finding the MAP assignment of topics to words in LDA is NP-hard. Next, we consider the problem of finding the MAP topic distribution for a document, where the topic-word assignments are integrated out. We show that this problem is also NP-hard. Finally, we briefly discuss the problem of sampling from the posterior, showing that this is NP-hard in one restricted setting, but leaving open the general question. 1
3 0.63443559 281 nips-2011-The Doubly Correlated Nonparametric Topic Model
Author: Dae I. Kim, Erik B. Sudderth
Abstract: Topic models are learned via a statistical model of variation within document collections, but designed to extract meaningful semantic structure. Desirable traits include the ability to incorporate annotations or metadata associated with documents; the discovery of correlated patterns of topic usage; and the avoidance of parametric assumptions, such as manual specification of the number of topics. We propose a doubly correlated nonparametric topic (DCNT) model, the first model to simultaneously capture all three of these properties. The DCNT models metadata via a flexible, Gaussian regression on arbitrary input features; correlations via a scalable square-root covariance representation; and nonparametric selection from an unbounded series of potential topics via a stick-breaking construction. We validate the semantic structure and predictive performance of the DCNT using a corpus of NIPS documents annotated by various metadata. 1
4 0.63125116 116 nips-2011-Hierarchically Supervised Latent Dirichlet Allocation
Author: Adler J. Perotte, Frank Wood, Noemie Elhadad, Nicholas Bartlett
Abstract: We introduce hierarchically supervised latent Dirichlet allocation (HSLDA), a model for hierarchically and multiply labeled bag-of-word data. Examples of such data include web pages and their placement in directories, product descriptions and associated categories from product hierarchies, and free-text clinical records and their assigned diagnosis codes. Out-of-sample label prediction is the primary goal of this work, but improved lower-dimensional representations of the bagof-word data are also of interest. We demonstrate HSLDA on large-scale data from clinical document labeling and retail product categorization tasks. We show that leveraging the structure from hierarchical labels improves out-of-sample label prediction substantially when compared to models that do not. 1
Author: Shuai Huang, Jing Li, Jieping Ye, Teresa Wu, Kewei Chen, Adam Fleisher, Eric Reiman
Abstract: Diagnosis of Alzheimer’s disease (AD) at the early stage of the disease development is of great clinical importance. Current clinical assessment that relies primarily on cognitive measures proves low sensitivity and specificity. The fast growing neuroimaging techniques hold great promise. Research so far has focused on single neuroimaging modality. However, as different modalities provide complementary measures for the same disease pathology, fusion of multi-modality data may increase the statistical power in identification of disease-related brain regions. This is especially true for early AD, at which stage the disease-related regions are most likely to be weakeffect regions that are difficult to be detected from a single modality alone. We propose a sparse composite linear discriminant analysis model (SCLDA) for identification of disease-related brain regions of early AD from multi-modality data. SCLDA uses a novel formulation that decomposes each LDA parameter into a product of a common parameter shared by all the modalities and a parameter specific to each modality, which enables joint analysis of all the modalities and borrowing strength from one another. We prove that this formulation is equivalent to a penalized likelihood with non-convex regularization, which can be solved by the DC (difference of convex functions) programming. We show that in using the DC programming, the property of the nonconvex regularization in terms of preserving weak-effect features can be nicely revealed. We perform extensive simulations to show that SCLDA outperforms existing competing algorithms on feature selection, especially on the ability for identifying weak-effect features. We apply SCLDA to the Magnetic Resonance Imaging (MRI) and Positron Emission Tomography (PET) images of 49 AD patients and 67 normal controls (NC). Our study identifies disease-related brain regions consistent with findings in the AD literature. 1 In tro du cti on Alzheimer’s disease (AD) is a fatal, neurodegenerative disorder that currently affects over five million people in the U.S. It leads to substantial, progressive neuron damage that is irreversible, which eventually causes death. Early diagnosis of AD is of great clinical importance, because disease-modifying therapies given to patients at the early stage of their disease development will have a much better effect in slowing down the disease progression and helping preserve some cognitive functions of the brain. However, current clinical assessment that majorly relies on cognitive measures proves low sensitivity and specificity in early diagnosis of AD. This is because these cognitive measures are vulnerable to the confounding effect from some non-AD related factors such as patients’ mood, and presence of other illnesses or major life events [1]. The confounding effect is especially severe in the diagnosis of early AD, at which time cognitive 1 impairment is not yet apparent. On the other hand, fast growing neuroimaging techniques, such as Magnetic Resonance Imaging (MRI) and Positron Emission Tomography (PET), provide great opportunities for improving early diagnosis of AD, due to their ability for overcoming the limitations of conventional cognitive measures. There are two major categories of neuroimaging techniques, i.e., functional and structure neuroimaging. MRI is a typical structural neuroimaging technique, which allows for visualization of brain anatomy. 
PET is a typical functional neuroimaging technique, which measures the cerebral metabolic rate for glucose. Both techniques have been extensively applied to AD studies. For example, studies based on MRI have consistently revealed brain atrophy that involves the hippocampus and entorhinal cortex [2-6]; studies based on PET have revealed functional abnormality that involves the posterior temporal and parietal association cortices [8-10], posterior cingulate, precuneus, and medial temporal cortices [11-14]. There is overlap between the disease-related brain regions detected by MRI and those by PET, such as regions in the hippocampus area and the mesia temporal lobe [15-17]. This is not surprising since MRI and PET are two complementary measures for the same disease pathology, i.e., it starts mainly in the hippocampus and entorhinal cortex, and subsequently spreads throughout temporal and orbiogrontal cortext, poseterior cingulated, and association cortex [7]. However, most existing studies only exploited structural and functional alterations in separation, which ignore the potential interaction between them. The fusion of MRI and PET imaging modalities will increase the statistical power in identification of disease-related brain regions, especially for early AD, at which stage the disease-related regions are most likely to be weakeffect regions that are difficult to be detected from MRI or PET alone. Once a good set of diseaserelated brain regions is identified, they can be further used to build an effective classifier (i.e., a biomarker from the clinical perspective) to enable AD diagnose with high sensitivity and specificity. The idea of multi-modality data fusion in the research of neurodegenerative disorders has been exploited before. For example, a number of models have been proposed to combine electroencephalography (EEG) and functional MRI (fMRI), including parallel EEG-fMRI independent component analysis [18]-[19], EEG-informed fMRI analysis [18] [20], and variational Bayesian methods [18] [21]. The purpose of these studies is different from ours, i.e., they aim to combine EEG, which has high temporal resolution but low spatial resolution, and fMRI, which has low temporal resolution but high spatial resolution, so as to obtain an accurate picture for the whole brain with both high spatial and high temporal resolutions [18]-[21]. Also, there have been some studies that include both MRI and PET data for classification [15], [22][25]. However, these studies do not make use of the fact that MRI and PET measure the same underlying disease pathology from two complementary perspectives (i.e., structural and functional perspectives), so that the analysis of one imaging modality can borrow strength from the other. In this paper, we focus on the problem of identifying disease-related brain regions from multimodality data. This is actually a variable selection problem. Because MRI and PET data are highdimensional, regularization techniques are needed for effective variable selection, such as the L1regularization technique [25]-[30] and the L2/L1-regularization technique [31]. In particular, L2/L1-regularization has been used for variable selection jointly on multiple related datasets, also known as multitask feature selection [31], which has a similar nature to our problem. Note that both L1- and L2/L1-regularizations are convex regularizations, which have gained them popularity in the literature. 
On the other hand, there is increasing evidence that these convex regularizations tend to produce too severely shrunken parameter estimates. Therefore, these convex regularizations could lead to miss-identification of the weak-effect disease-related brain regions, which unfortunately make up a large portion of the disease-related brain regions especially in early AD. Also, convex regularizations tend to select many irrelevant variables to compensate for the overly severe shrinkage in the parameters of the relevant variables. Considering these limitations of convex regularizations, we study non-convex regularizations [33]-[35] [39], which have the advantage of producing mildly or slightly shrunken parameter estimates so as to be able to preserve weak-effect disease-related brain regions and the advantage of avoiding selecting many disease-irrelevant regions. Specifically in this paper, we propose a sparse composite linear discriminant analysis model, called SCLDA, for identification of disease-related brain regions from multi-modality data. The contributions of our paper include: 2 • • • 2 Formulation: We propose a novel formulation that decomposes each LDA parameter into a product of a common parameter shared by all the data sources and a parameter specific to each data source, which enables joint analysis of all the data sources and borrowing strength from one another. We further prove that this formulation is equivalent to a penalized likelihood with non-convex regularization. Algorithm: We show that the proposed non-convex optimization can be solved by the DC (difference of convex functions) programming [39]. More importantly, we show that in using the DC programming, the property of the non-convex regularization in terms of preserving weak-effect features can be nicely revealed. Application: We apply the proposed SCLDA to the PET and MRI data of early AD patients and normal controls (NC). Our study identifies disease-related brain regions that are consistent with the findings in the AD literature. AD vs. NC classification based on these identified regions achieves high accuracy, which makes the proposed method a useful tool for clinical diagnosis of early AD. In contrast, the convex-regularization based multitask feature selection method [31] identifies more irrelevant brain regions and yields a lower classification accuracy. Rev iew o f L D A a nd it s v ari an ts ! Denote đ?’ = đ?‘?!, đ?‘?! , ‌ , đ?‘?! as the variables and assume there are đ??˝ classes. Denote đ?‘ ! as the ! sample size of class đ?‘— and đ?‘ = !!! đ?‘ ! is the total sample size. Let đ??ł = đ?’›! , đ?’›! , ‌ , đ?’›! ! be the đ?‘ Ă—đ?‘? sample matrix, where đ?’›! is the đ?‘– !! sample and đ?‘” đ?‘– is its associated class index. Let ! ! ! ! đ?›?! = !!! đ?’›! be the overall sample mean, !!!,! ! !! đ?’›! be the sample mean of class đ?‘—, đ?›? = !! ! ! đ?’› − đ?›? ! !!! ! ! ! đ??–! = !! !!!,! ! !! ! ! đ?‘ đ??– be the ! !!! ! ! đ??“= ! đ?’›! − đ?›? đ?’›! − đ?›?! ! be the total normalized sum of squares and products (SSQP), đ?’›! − đ?›?! ! be the normalized class SSQP of class đ?‘—, and đ??–= overall normalized class SSQP. The objective of LDA is to seek for a đ?‘?Ă—đ?‘ž linear transformation matrix, đ?›‰! , with which đ?›‰! đ?‘? ! retains the maximum amount of class discrimination information in đ?‘?. To achieve this objective, one approach is to seek for the đ?›‰! that maximizes the between-class variance of đ?›‰! đ?‘?, which can ! be measured by tr(đ?›‰! đ??“đ?›‰! ), while minimizing the within-class variance of đ?›‰! 
đ?‘?, which can be ! ! measured by tr(đ?›‰! đ??–đ?›‰! ). Here tr() is the matrix trace operator. This is equivalent to solving the ! following optimization problem:  đ?›‰! = argmax đ?›‰! đ??đ??Ť(đ?›‰! đ??“đ?›‰! ) ! đ??đ??Ť(đ?›‰! đ??–đ?›‰! ) ! . (1) Note that đ?›‰! corresponds to the right eigenvector of đ??– !! đ??“ and đ?‘ž = đ??˝ − 1. Another approach used for finding the đ?›‰! is to use the maximum likelihood estimation for Gaussian populations that have different means and a common covariance matrix. Specifically, as in [36], this approach is developed by assuming the class distributions are Gaussian with a common covariance matrix, and their mean differences lie in a đ?‘ž-dimensional subspace of the đ?‘?dimensional original variable space. Hastie [37] further generalized this approach by assuming that class distributions are a mixture of Gaussians, which has more flexibility than LDA. However, both approaches assume a common covariance matrix for all the classes, which is too strict in many practical applications, especially in high-dimensional problems where the covariance matrices of different classes tend to be different. Consequently, the linear transformation explored by LDA may not be effective. In [38], a heterogeneous LDA (HLDA) is developed to relax this assumption. The HLDA seeks for a đ?‘?Ă—đ?‘? linear transformation matrix, đ?›‰, in which only the first đ?‘ž columns (đ?›‰! ) contain discrimination information and the remaining đ?‘? − đ?‘ž columns (đ?›‰!!! ) contain no discrimination information. For Gaussian models, assuming lack of discrimination information is equivalent to assuming that the means and the covariance matrices of the class distributions are the same for all 3 classes, in the đ?‘? − đ?‘ž dimensional subspace. Following this, the log-likelihood function of đ?›‰ can be written as below [38]: ! đ?‘™ đ?›‰|đ??™ = − log đ?›‰! đ??“đ?›‰!!! − !!! ! !! ! !!! ! log đ?›‰! đ??–! đ?›‰! + đ?‘ log đ?›‰ , ! (2) Here đ??€ denotes the determinant of matrix đ??€. There is no closed-form solution for đ?›‰. As a result, numeric methods are needed to derive the maximum likelihood estimate for đ?›‰. It is worth mentioning that the LDA in the form of (1) is a special case of the HLDA [38]. 3 T he p ro po sed SC L DA Suppose that there are multiple data sources, đ??™ ! , đ??™ ! , ‌ , đ??™ ! , with each data source capturing one aspect of the same set of physical variables, e.g., the MRI and PET capture the structural and functional aspects of the same brain regions. For each data source, đ??™ ! , there is a linear transformation matrix đ?›‰ ! , which retains the maximum amount of class discrimination information in đ??™ ! . A naive way for estimating đ?šŻ = đ?›‰ ! , đ?›‰ ! , ‌ , đ?›‰ ! is to separately estimate each đ?›‰ ! based on đ??™ ! . Apparently, this approach does not take advantage of the fact that all the data sources measure the same physical process. Also, when the sample size of each data source is small, this approach may lead to unreliable estimates for the đ?›‰ ! ’s. To tackle these problems, we propose a composite parameterization following the line as [40]. ! Specifically, let đ?œƒ!,! be the element at the k-th row and l-th column of đ?›‰ ! . We treat ! ! ! ! ! ! đ?œƒ!,! , đ?œƒ!,! , ‌ , đ?œƒ!,! as an interrelated group and parameterize each đ?œƒ!,! as đ?œƒ!,! = đ?›ż! đ?›ž!,! , for 1 ≤ đ?‘˜ ≤ đ?‘?, 1 ≤ đ?‘™ ≤ đ?‘? and 1 ≤ đ?‘š ≤ đ?‘€. In order to assure identifiability, we restrict each đ?›ż! ≼ 0. Here, đ?›ż! 
represents the common information shared by all the data sources about variable đ?‘˜, while ! đ?›ž!,! represents the specific information only captured by the đ?‘š !! data source. For example, for disease-related brain region identification, if đ?›ż! = 0, it means that all the data sources indicate variable đ?‘˜ is not a disease-related brain region; otherwise, variable đ?‘˜ is a disease-related brain ! region. đ?›ž!,! ≠0 means that the đ?‘š !! data source supports this assertion. The log-likelihood function of đ?šŻ is: đ?‘™! đ?šŻ| đ??™ ! , đ??™ ! , ‌ , đ??™ ! = ! !!! − ! ! ! đ?‘ ! ! log đ?›‰!!! đ??“ ! log đ?›‰ ! ! ! ! đ?›‰!!! − !! ! !!! ! ! log đ?›‰! đ??–! ! ! đ?›‰! +  , which follows the same line as (2). However, our formulation includes the following constraints on đ?šŻ: ! ! đ?œƒ!,! = đ?›ż! đ?›ž!,! , đ?›ż! ≼ 0, 1 ≤ đ?‘˜, đ?‘™ ≤ đ?‘?, 1 ≤ đ?‘š ≤ đ?‘€. Let ! đ?šŞ = đ?›ž!,! , 1 ≤ đ?‘˜ ≤ đ?‘?, 1 ≤ đ?‘™ ≤ đ?‘?, 1 ≤ đ?‘š ≤ đ?‘€ and (3) đ?šż = đ?›ż! , 1 ≤ đ?‘˜ ≤ đ?‘? . An intuitive choice for estimation of đ?šŞ and đ?šż is to maximize the đ?‘™! đ?šŻ| đ??™ ! , đ??™ ! , ‌ , đ??™ !   subject to the constraints in (3). However, it can be anticipated that no element in the estimated đ?šŞ and đ?šż will be exactly zero, resulting in a model which is not interpretable, i.e., poor identification of diseaserelated regions. Thus, we encourage the estimation of đ?šż and the first đ?‘ž columns of đ?šŞ (i.e., the columns containing discrimination information) to be sparse, by imposing the L1-penalty on đ?šŞ and đ?šż. By doing so, we obtain the following optimization problem for the proposed SCLDA: đ?šŻ = argmin đ?šŻ đ?‘™! đ?šŻ| đ??™ ! , đ??™ ! , ‌ , đ??™ ! ! đ?œƒ!,! = = argmin đ?šŻ −đ?‘™! đ?šŻ| đ??™ ! , đ??™ ! , ‌ , đ??™ ! ! đ?œ†! !,!,! đ?›ž!,!  , subject to ! đ?›ż! đ?›ž!,! , đ?›ż! ≼ 0, 1 ≤ đ?‘˜, đ?‘™ ≤ +  đ?œ†! ! đ?›ż! + đ?‘?, 1 ≤ đ?‘š ≤ đ?‘€.                               (4) Here, đ?œ†! and đ?œ†! control the degrees of sparsity of đ?šż and đ?šŞ, respectively. Tuning of two regularization parameters is difficult. Fortunately, we prove the following Theorem which indicates that formulation (4) is equivalent to a simpler optimization problem involving only one regularization parameter. 4 Theorem 1: The optimization problem (4) is equivalent to the following optimization problem: đ?šŻ = argmin đ?šŻ đ?‘™! đ?šŻ| đ??™ = argmin đ?šŻ −đ?‘™! đ?šŻ| đ??™ with đ?œ† = 2 ! ! , đ??™ ! ! , đ??™ ,‌, đ??™ ! ! ,‌, đ??™ +  đ?œ† ! ! ! !!! ! !!! ! đ?œƒ!,!  , (5) ! đ?œ†! đ?œ†! , i.e., đ?œƒ!,! = đ?œƒ!,! . The proof can be found in the supplementary document. It can also be found in the supplementary material how this formulation will serve the purpose of the composite parameterization, i.e., common information and specific information can be estimated separately and simultaneously. The optimization problem (5) is a non-convex optimization problem that is difficult to solve. We address this problem by using an iterative two-stage procedure known as Difference of Convex functions (DC) programming [39]. A full description of the algorithm can be found in the supplemental material. 4 S im ula tion s tu d ies In this section, we conduct experiments to compare the performance of the proposed SCLDA with sparse LDA (SLDA) [42] and multitask feature selection [31]. Specifically, as we focus on LDA, we use the multitask feature selection method developed in [31] on LDA, denoted as MSLDA. Both SLDA and MSLDA adopt convex regularizations. 
Specifically, SLDA selects features from one single data source with L1-regularization; MSLDA selects features from multiple data sources with L2/L1 regularization. We evaluate the performances of these three methods across various parameters settings, including the number of variables, đ?‘?, the number of features, đ?‘™, the number of data sources, M, sample size, đ?‘›, and the degree of overlapping of the features across different data sources, s% (the larger the đ?‘ %, the more shared features among the datasets). Definition of đ?‘ % can be found in the simulation procedure that is included in the supplemental material. For each specification of the parameters settings, đ?‘€ datasets can be generated following the simulation procedure. We apply the proposed SCLDA to the đ?‘€ datasets, and identify one feature vector đ?›‰(!) for each dataset, with đ?œ† and đ?‘ž chosen by the method described in section 3.3. The result can be described by the number of true positives (TPs) as well as the number of false positives (FPs). Here, true positives are the non-zero elements in the learned feature vector đ?›‰(!) which are also non-zero in the đ?›ƒ! ; false positives are the non-zero elements in đ?›‰(!) , which are actually zero in đ?›ƒ! . As there are đ?‘š pairs of the TPs and FPs for the đ?‘€ datasets, the average TP over the M datasets and the average FP over the M datasets are used as the performance measures. This procedure (i.e., from data simulation, to SCLDA, to TPs and FPs generation) can be repeated for đ??ľ times, and đ??ľ pairs of average TP and average FP are collected for SCLDA. In a similar way, we can obtain đ??ľ pairs of average TP and average FP for both SLDA and MSLDA. Figures 1 (a) and (b) show comparison between SCLDA, SLDA and MSLDA by scattering the average TP against the average FP for each method. Each point corresponds to one of the N repetitions. The comparison is across various parameters settings, including the number of variables (đ?‘? = 100,200,500), the number of data sources (đ?‘š = 2,5,10), and the degree of overlapping of the features across different data sources (đ?‘ % = 90%, 70%). Additionally, đ?‘› đ?‘? is kept constant, đ?‘› đ?‘? = 1. A general observation is that SCLDA is better than SLDA and MSLDA across all the parameter settings. Some specific trends can be summarized as follows: (i) Both SCLDA and MSLDA outperform SLDA in terms of TPs; SCLDA further outperforms MSLDA in terms of FPs. (ii) In Figure 2 (a), rows correspond to different numbers of data sources, i.e., đ?‘š = 2,5,10 respectively. It is clear that the advantage of SCLDA over both SLDA and MSLDA is more significant when there are more data sources. Also, MSLDA performs consistently better than SLDA. Similar phenomena are shown in Figure 2 (b). This demonstrates that in analyzing each data source, both SCLDA and MSLDA are able to make use of the information contained in other data sources. SCLDA can use this information more efficiently, as SCLDA can produce less shrunken parameter estimates than MSLDA and thus it is able to preserve weak-effect features. (iii) Comparing Figures 2 (a) and (b), it can be seen that the advantage of SCLDA or MSLDA over SLDA is more significant as the data sources have more degree of overlapping in their 5 features. Finally, although not presented here, our simulation shows that the three methods perform similarly when đ?‘ % = 40 or less. (a) (b) Figure 1: Average numbers of TPs vs FPs for SCLDA (green symbols “+â€?), SLDA (blue symbols “*â€?) 
and MSLDA (red symbols “oâ€?) (a) đ?‘ % = 90%, đ?‘› đ?‘? = 1; (b) đ?‘ % = 70%, đ?‘› đ?‘? = 1 5 C ase st ud y 5.1 Data preprocessing Our study includes 49 AD patient and 67 age-matched normal controls (NC), with each subject of AD or NC being scanned both by PET and MRI. The PET and MRI images can be downloaded from the database by the Alzheimer’s Disease Neuroimaging Initiative. In what follows, we outline the data preprocessing steps. Each image is spatially normalized to the Montreal Neurological Institute (MNI) template, using the affine transformation and subsequent non-linear wraping algorithm [43] implemented in the SPM MATLAB toolbox. This is to ensure that each voxel is located in the same anatomical region for all subjects, so that spatial locations can be reported and interpreted in a consistent manner. Once all the images in the MNI template, we further apply the Automated Anatomical Labeling (AAL) technique [43] to segment the whole brain of each subject into 116 brain regions. The 90 regions that belong to the cerebral cortex are selected for the later analysis, as the other regions are not included in the cerebral cortex are rarely considered related with AD in the literature. The measurement of each region in the PET data is regional cerebral blood flow (rCBF); the measurement of each region in the MRI data is the structural volume of the region. 5.2 Disease-related brain regions SCLDA is applied to the preprocessed PET and MRI data of AD and NC with the penalty parameter selected by the AIC method mentioned in section 3. 26 disease-related brain regions are identified from PET and 21 from MRI (see Table 1 for their names). The maps of the diseaserelated brain regions identified from PET and MRI are highlighted in Figure 2 (a) and (b), respectively, with different colors given to neighboring regions in order to distinguish them. Each figure is a set of horizontal cut away slices of the brain as seen from the top, which aims to provide a full view of locations of the regions. One major observation is that the identified disease-related brain regions from MRI are in the hippocampus, parahippocampus, temporal lobe, frontal lobe, and precuneus, which is consistent with the existing literature that reports structural atrophy in these brain areas. [3-6,12-14]. The identified disease-related brain regions from PET are in the temporal, frontal and parietal lobes, which is consistent with many functional neuroimaging studies that report reduced rCBF or 6 reduced cortical glucose metabolism in these areas [8-10, 12-14]. Many of these identified disease-related regions can be explained in terms of the AD pathology. For example, hippocampus is a region affected by AD the earliest and severely [6] Also, as regions in the temporal lobe are essential for memory, damage on these regions by AD can explain the memory loss which is a major clinic symptom of AD. The consistency of our findings with the AD literature supports effectiveness of the proposed SCLDA. Another finding is that there is a large overlap between the identified disease-related regions from PET and those from MRI, which implies strong interaction between functional and structural alterations in these regions. Although well-accepted biological mechanisms underlying this interaction are still not very clear, there are several explanations existing in the literature. 
The first explanation is that both functional and structural alterations could be consequences of the loss of dendritic arborizations that results from intracellular accumulation of PHF-tau and further leads to neuron death and grey matter loss [14]. The second explanation is that the AD pathology may include a vascular component, which may result in reduced rCBF due to limited blood supply and may ultimately result in structural alterations such as brain atrophy [45].

Figure 2: Locations of disease-related brain regions identified from (a) MRI; (b) PET.

5.3 Classification accuracy

As one of our primary goals is to distinguish AD from NC, the disease-related brain regions identified by SCLDA are further used to build a classification model. Specifically, for each subject, the rCBF values of the 26 disease-related brain regions identified from PET and the structural volumes of the 21 disease-related brain regions identified from MRI are combined into a joint spatial pattern of brain physiology and structure, so that each subject is represented by a vector of 47 features. A linear SVM (support vector machine) is employed as the classifier, and the classification accuracy based on 10-fold cross-validation is 94.3% (a brief code sketch of this procedure is given below, after the regression analysis of Section 5.4). For comparison, MSLDA is also applied; it identifies 45 and 38 disease-related brain regions from PET and MRI, respectively. A linear SVM applied to these 45 + 38 features gives a classification accuracy of only 85.8%. Note that MSLDA identifies a much larger number of disease-related brain regions than SCLDA, but some of the regions identified by MSLDA may in fact be disease-irrelevant, and including them deteriorates the classification.

5.4 Relationship between structural atrophy, abnormal rCBF, and severity of cognitive impairment in AD

In addition to classification, it is of interest to verify the relevance of the identified disease-related regions to AD in an alternative way. One approach is to investigate the degree to which these regions relate to cognitive impairment, as measured by the Alzheimer's Disease Assessment Scale – cognitive subscale (ADAS-cog). ADAS measures the severity of the most important symptoms of AD, while its subscale, ADAS-cog, is the most widely used cognitive testing instrument in clinical trials. The ADAS-cog consists of 11 items measuring disturbances of memory, language, praxis, attention and other cognitive abilities that are often affected by AD. As the total score of these 11 items provides an overall assessment of cognitive impairment, we regress the ADAS-cog total score (the response) against the rCBF or structural volume measurement (the predictor) of each identified brain region, using simple linear regression. The regression results are listed in Table 1. It is not surprising that some regions in the hippocampus area and temporal lobes are among the best predictors, as these regions are extensively reported in the literature as the most severely affected by AD [3-6]. Most of these brain regions are weak-effect predictors: each explains only a small portion of the variability in the ADAS-cog total score, i.e., many R^2 values in Table 1 are below 10%. Although the effects are weak, most of them are significant, i.e., most of the p-values in Table 1 are smaller than 0.05.
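The two analyses above — the 10-fold cross-validated linear SVM on the 47 joint features (Section 5.3) and the per-region simple regressions against ADAS-cog — can be sketched as follows. This is a minimal illustration, not the original pipeline: the variable names features (an n_subjects × 47 array of rCBF values and structural volumes), labels, adas_cog and region_names are placeholders for data prepared as in Section 5.1, and the feature standardization is an added convenience not specified in the text.

```python
import numpy as np
from scipy import stats
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def classification_accuracy(features, labels):
    """10-fold cross-validated accuracy of a linear SVM on the joint
    PET/MRI regional features (labels: 1 = AD, 0 = NC)."""
    clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
    scores = cross_val_score(clf, features, labels, cv=10)
    return scores.mean()

def region_regressions(features, adas_cog, region_names):
    """Simple linear regression of the ADAS-cog total score on each
    regional measurement, reporting R^2 and the p-value of the slope."""
    rows = []
    for j, name in enumerate(region_names):
        res = stats.linregress(features[:, j], adas_cog)
        rows.append((name, res.rvalue ** 2, res.pvalue))
    return rows
```

The multiple-regression figures quoted next (70.22% and 49.72% of ADAS-cog variability) would be obtained analogously by regressing the total score on all identified regions of a modality at once.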
Furthermore, it is worth noting that 70.22% of the variability in ADAS-cog can be explained by taking all 26 brain regions identified from PET as predictors in a multiple regression model, and 49.72% of the variability can be explained by taking all 21 brain regions identified from MRI as predictors in a multiple regression model. These findings imply that the disease-related brain regions are weak-effect features when considered individually, but that jointly they play a strong role in characterizing AD. This verifies the suitability of the proposed SCLDA for AD studies, as SCLDA can preserve weak-effect features.

Table 1: Explanatory power of regional rCBF and structural volume for variability in ADAS-cog ("~" indicates that the region was not identified from PET (or MRI) as a disease-related region by SCLDA).

Brain region        PET R^2   PET p-value   MRI R^2   MRI p-value
Precentral_L        0.003     0.503         0.027     0.077
Precentral_R        0.044     0.022         ~         ~
Frontal_Sup_L       0.051     0.013         0.047     0.018
Frontal_Sup_R       0.044     0.023         ~         ~
Frontal_Mid_R       0.056     0.010         0.072     0.003
Frontal_M_O_L       0.036     0.040         0.086     0.001
Frontal_M_O_R       0.019     0.138         0.126     0.000
Insula_L            0.016     0.171         0.163     <10^-4
Insula_R            ~         ~             0.125     0.000
Cingulum_A_R        0.004     0.497         0.082     0.001
Cingulum_Mid_L      0.001     0.733         0.040     0.030
Cingulum_Post_L     0.184     <10^-4        ~         ~
Hippocampus_L       0.158     <10^-4        ~         ~
Hippocampus_R       ~         ~             0.242     <10^-4
ParaHippocamp_L     0.206     <10^-4        ~         ~
Amygdala_L          0.090     0.001         0.313     <10^-4
Calcarine_L         0.038     0.034         0.028     0.070
Lingual_L           0.066     0.005         0.044     0.023
Postcentral_L       0.038     0.035         0.026     0.081
Parietal_Sup_R      0.001     0.677         ~         ~
Angular_R           0.173     <10^-4        0.063     0.006
Precuneus_R         0.063     0.006         0.025     0.084
Paracentr_Lobu_L    0.035     0.043         0.000     0.769
Pallidum_L          0.082     0.001         ~         ~
Pallidum_R          ~         ~             0.020     0.122
Heschl_L            0.001     0.640         ~         ~
Heschl_R            0.000     0.744         0.111     0.000
Temporal_P_S_R      0.008     0.336         0.071     0.003
Temporal_Inf_R      0.187     <10^-4        0.147     <10^-4
All regions         0.702     <10^-4        0.497     <10^-4

6 Conclusion

In this paper we proposed the SCLDA model for identifying disease-related brain regions of AD from multi-modality data; because it imposes less shrinkage on its parameters, SCLDA is able to preserve weak-effect disease-related brain regions. We applied SCLDA to PET and MRI data of early AD patients and normal controls. As MRI and PET measure two complementary aspects of the same AD pathology (structural and functional, respectively), fusing the two imaging modalities makes effective use of their interaction and thus improves the statistical power in identifying disease-related brain regions. Our findings are consistent with the literature and also reveal aspects that may warrant further investigation in future neuroimaging research.

References

[1] deToledo-Morrell, L., Stoub, T.R., Bulgakova, M. 2004. MRI-derived entorhinal volume is a good predictor of conversion from MCI to AD. Neurobiol. Aging 25, 1197–1203.
[2] Morra, J.H., Tu, Z. 2008. Validation of automated hippocampal segmentation method. NeuroImage 43, 59–68.
[3] Morra, J.H., Tu, Z. 2009a. Automated 3D mapping of hippocampal atrophy. Hum. Brain Map. 30, 2766–2788.
[4] Morra, J.H., Tu, Z. 2009b. Automated mapping of hippocampal atrophy in 1-year repeat MRI data. NeuroImage 45, 213–221.
[5] Schroeter, M.L., Stein, T. 2009. Neural correlates of AD and MCI. NeuroImage 47, 1196–1206.
[6] Braak, H., Braak, E. 1991. Neuropathological stageing of Alzheimer-related changes. Acta Neuro. 82, 239–259.
[7] Bradley, K.M., O'Sullivan. 2002. Cerebral perfusion SPET correlated with Braak pathological stage in AD.
Brain 125, 1772–1781.
[8] Keilp, J.G., Alexander, G.E. 1996. Inferior parietal perfusion, lateralization, and neuropsychological dysfunction in AD. Brain Cogn. 32, 365–383.
[9] Schroeter, M.L., Stein, T. 2009. Neural correlates of AD and MCI. NeuroImage 47, 1196–1206.
[10] Asllani, I., Habeck, C. 2008. Multivariate and univariate analysis of continuous arterial spin labeling perfusion MRI in AD. J. Cereb. Blood Flow Metab. 28, 725–736.
[11] Du, A.T., Jahng, G.H. 2006. Hypoperfusion in frontotemporal dementia and AD. Neurology 67, 1215–1220.
[12] Ishii, K., Kitagaki, H. 1996. Decreased medial temporal oxygen metabolism in AD. J. Nucl. Med. 37, 1159–1165.
[13] Johnson, N.A., Jahng, G.H. 2005. Pattern of cerebral hypoperfusion in AD. Radiology 234, 851–859.
[14] Wolf, H., Jelic, V. 2003. A critical discussion of the role of neuroimaging in MCI. Acta Neurol. Scand. 107 (4), 52–76.
[15] Tosun, D., Mojabi, P. 2010. Joint analysis of structural and perfusion MRI for cognitive assessment and classification of AD and normal aging. NeuroImage 52, 186–197.
[16] Alsop, D., Casement, M. 2008. Hippocampal hyperperfusion in Alzheimer's disease. NeuroImage 42, 1267–1274.
[17] Mosconi, L., Tsui, W.-H. 2005. Reduced hippocampal metabolism in MCI and AD. Neurology 64, 1860–1867.
[18] Mulert, C., Lemieux, L. 2010. EEG-fMRI: Physiological Basis, Technique and Applications. Springer.
[19] Xu, L., Qiu, C., Xu, P., Yao, D. 2010. A parallel framework for simultaneous EEG/fMRI analysis: methodology and simulation. NeuroImage 52(3), 1123–1134.
[20] Philiastides, M., Sajda, P. 2007. EEG-informed fMRI reveals spatiotemporal characteristics of perceptual decision making. Journal of Neuroscience 27(48), 13082–13091.
[21] Daunizeau, J., Grova, C. 2007. Symmetrical event-related EEG/fMRI information fusion. NeuroImage 36, 69–87.
[22] Jagust, W. 2006. PET and MRI in the diagnosis and prediction of dementia. Alzheimer's Dement. 2, 36–42.
[23] Kawachi, T., Ishii, K., Sakamoto, S. 2006. Comparison of the diagnostic performance of FDG-PET and VBM. Eur. J. Nucl. Med. Mol. Imaging 33, 801–809.
[24] Matsunari, I., Samuraki, M. 2007. Comparison of 18F-FDG PET and optimized voxel-based morphometry for detection of AD. J. Nucl. Med. 48, 1961–1970.
[25] Schmidt, M., Fung, G., Rosales, R. 2007. Fast optimization methods for L1-regularization: a comparative study and two new approaches. ECML 2007.
[26] Liu, J., Ji, S., Ye, J. 2009. SLEP: Sparse Learning with Efficient Projections. Arizona State University.
[27] Tibshirani, R. 1996. Regression shrinkage and selection via the Lasso. JRSS, Series B, 58(1), 267–288.
[28] Friedman, J., Hastie, T., Tibshirani, R. 2007. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 8(1), 1–10.
[29] Zou, H., Hastie, T., Tibshirani, R. 2006. Sparse PCA. J. of Comp. and Graphical Statistics 15(2), 262–286.
[30] Qiao, Z., Zhou, L., Huang, J. 2006. Sparse LDA with applications to high dimensional low sample size data. IAENG Applied Mathematics 39(1).
[31] Argyriou, A., Evgeniou, T., Pontil, M. 2008. Convex multi-task feature learning. Machine Learning 73(3), 243–272.
[32] Huang, S., Li, J., et al. 2010. Learning brain connectivity of AD by sparse inverse covariance estimation. NeuroImage 50, 935–949.
[33] Candes, E., Wakin, M., Boyd, S. 2008. Enhancing sparsity by reweighted L1 minimization. Journal of Fourier Analysis and Applications 14(5), 877–905.
[34] Mazumder, R., Friedman, J. 2009.
SparseNet: Coordinate descent with non-convex penalties. Manuscript.
[35] Zhang, T. 2008. Multi-stage convex relaxation for learning with sparse regularization. NIPS 2008.
[36] Campbell, N. 1984. Canonical variate analysis: a general formulation. Australian Journal of Statistics 26, 86–96.
[37] Hastie, T., Tibshirani, R. 1994. Discriminant analysis by Gaussian mixtures. Technical report, AT&T Bell Laboratories.
[38] Kumar, N., Andreou, G. 1998. Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition. Speech Communication 26(4), 283–297.
[39] Gasso, G., Rakotomamonjy, A., Canu, S. 2009. Recovering sparse signals with non-convex penalties and DC programming. IEEE Trans. Signal Processing 57(12), 4686–4698.
[40] Guo, J., Levina, E., Michailidis, G., Zhu, J. 2011. Joint estimation of multiple graphical models. Biometrika 98(1), 1–15.
[41] Bertsekas, D. 1982. Projected Newton methods for optimization problems with simple constraints. SIAM J. Control Optim. 20, 221–246.
[42] Clemmensen, L., Hastie, T., Witten, D., Ersboll, B. 2011. Sparse discriminant analysis. Technometrics (in press).
[43] Friston, K.J., Ashburner, J. 1995. Spatial registration and normalization of images. HBM 2, 89–165.
[44] Tzourio-Mazoyer, N., et al. 2002. Automated anatomical labelling of activations in SPM. NeuroImage 15, 273–289.
[45] Bidzan, L. 2005. Vascular factors in dementia. Psychiatr. Pol. 39, 977–986.