nips nips2013 nips2013-70 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: James Y. Zou, Daniel Hsu, David C. Parkes, Ryan P. Adams
Abstract: In many natural settings, the analysis goal is not to characterize a single data set in isolation, but rather to understand the difference between one set of observations and another. For example, given a background corpus of news articles together with writings of a particular author, one may want a topic model that explains word patterns and themes specific to the author. Another example comes from genomics, in which biological signals may be collected from different regions of a genome, and one wants a model that captures the differential statistics observed in these regions. This paper formalizes this notion of contrastive learning for mixture models, and develops spectral algorithms for inferring mixture components specific to a foreground data set when contrasted with a background data set. The method builds on recent moment-based estimators and tensor decompositions for latent variable models, and has the intuitive feature of using background data statistics to appropriately modify moments estimated from foreground data. A key advantage of the method is that the background data need only be coarsely modeled, which is important when the background is too complex, noisy, or not of interest. The method is demonstrated on applications in contrastive topic modeling and genomic sequence analysis.
Reference: text
sentIndex sentText sentNum sentScore
1 For example, given a background corpus of news articles together with writings of a particular author, one may want a topic model that explains word patterns and themes specific to the author. [sent-2, score-0.555]
2 This paper formalizes this notion of contrastive learning for mixture models, and develops spectral algorithms for inferring mixture components specific to a foreground data set when contrasted with a background data set. [sent-4, score-1.43]
3 The method builds on recent moment-based estimators and tensor decompositions for latent variable models, and has the intuitive feature of using background data statistics to appropriately modify moments estimated from foreground data. [sent-5, score-1.131]
4 A key advantage of the method is that the background data need only be coarsely modeled, which is important when the background is too complex, noisy, or not of interest. [sent-6, score-0.452]
5 The method is demonstrated on applications in contrastive topic modeling and genomic sequence analysis. [sent-7, score-0.599]
6 Instead, such a model will simply learn about English syntactic structure and invent topics that reflect uninteresting statistical correlations between stop words [3]. [sent-13, score-0.256]
7 To answer this question, we develop a new set of techniques that we refer to as contrastive learning methods. [sent-15, score-0.405]
8 These methods differentiate between foreground and background data and seek to learn a latent variable model that captures statistical relationships that appear in the foreground but do not appear in the background. [sent-16, score-1.577]
9 Revisiting the previous scientific topics example, contrastive learning could treat computer science papers as a foreground corpus and (say) English-language news articles as a background corpus. [sent-17, score-1.558]
10 As both corpora share the same broad syntactic structure, a contrastive foreground topic model would be more likely to discover semantic relationships between words that are specific to computer science. [sent-18, score-1.24]
11 This intuition has broad applicability in other models and domains. Figure 1 (panels: (a) PCA, (b) Linear contrastive analysis): These figures show foreground and background data from Gaussian distributions. [sent-19, score-1.258]
12 The foreground data has greater variance in its minor direction, but the same variance in its major direction. [sent-20, score-0.627]
13 For example, in genomics one might use a contrastive hidden Markov model to amplify the signal of a particular class of sequences, relative to the broader genome. [sent-24, score-0.459]
14 Note that the objective of contrastive learning is not to discriminate between foreground and background data, but to learn an interpretable generative model that captures the differential statistics between the two data sets. [sent-25, score-1.36]
15 To clarify this difference, consider the difference between principal component analysis and contrastive analysis. [sent-26, score-0.423]
16 Principal component analysis finds the linear projection that maximally preserves variance without regard to foreground versus background. [sent-27, score-0.682]
17 A contrastive approach, however, would try to find a linear projection that maximally preserves the foreground variance that is not explained by the background. [sent-28, score-1.069]
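The PCA-versus-contrastive comparison in the sentences above can be sketched numerically. The snippet below is only an illustration and not the paper's algorithm: it assumes the contrastive projection is taken as the top eigenvector of cov(foreground) − γ·cov(background). The function name `contrastive_directions`, the synthetic Gaussian data, and the choice γ = 1 are all hypothetical.

```python
import numpy as np

def contrastive_directions(fg, bg, gamma=1.0, k=1):
    """Top-k eigenvectors of cov(fg) - gamma * cov(bg).

    Illustrative assumption: gamma = 0 reduces to ordinary PCA on the
    foreground, while gamma > 0 discounts variance shared with the background.
    """
    cov_f = np.cov(fg, rowvar=False)
    cov_b = np.cov(bg, rowvar=False)
    evals, evecs = np.linalg.eigh(cov_f - gamma * cov_b)
    order = np.argsort(evals)[::-1][:k]  # largest eigenvalues first
    return evecs[:, order], evals[order]

# Synthetic data: same variance along axis 0 (major direction),
# larger foreground variance along axis 1 (minor direction).
rng = np.random.default_rng(0)
bg = rng.normal(size=(5000, 2)) * np.array([3.0, 0.5])
fg = rng.normal(size=(5000, 2)) * np.array([3.0, 1.5])

pca_dir, _ = contrastive_directions(fg, bg, gamma=0.0)  # PCA on foreground
con_dir, _ = contrastive_directions(fg, bg, gamma=1.0)  # contrastive variant
```

Under these assumptions the γ = 0 direction aligns with the shared major axis, whereas the γ = 1 direction aligns with the minor axis along which only the foreground has extra variance.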
18 We formalize the concept of contrastive learning for mixture models and present new spectral contrast algorithms. [sent-32, score-0.518]
19 We prove that by appropriately “subtracting” background moments from the foreground moments, our algorithms recover the model for the foreground-specific data. [sent-33, score-0.975]
20 We demonstrate the effectiveness, robustness, and scalability of our method in contrastive topic modeling and contrastive genomics. [sent-35, score-0.982]
21 The general mixture model has the form $p(\{x_n\}_{n=1}^{N}; \{(\mu_j, w_j)\}_{j=1}^{J}) = \prod_{n=1}^{N} \sum_{j=1}^{J} w_j f(x_n \mid \mu_j)$ (1), where $\{\mu_j\}$ are the parameters of the mixture components, $\{w_j\}$ are the mixture weights, and $f(\cdot \mid \mu_j)$ is the density of the $j$-th mixture component. [sent-37, score-0.414]
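As a hedged illustration of the mixture likelihood in Equation (1), the sketch below evaluates its logarithm with a log-sum-exp for numerical stability. The helper name `mixture_log_likelihood` and the categorical component density used in the example are assumptions made for illustration; the paper's estimators are moment-based and do not evaluate this likelihood directly.

```python
import numpy as np

def mixture_log_likelihood(x, mus, weights, log_density):
    """log p({x_n}) = sum_n log sum_j w_j f(x_n | mu_j), computed stably."""
    # log_f[n, j] = log f(x_n | mu_j)
    log_f = np.array([[log_density(xn, mu) for mu in mus] for xn in x])
    a = log_f + np.log(np.asarray(weights, dtype=float))
    m = a.max(axis=1, keepdims=True)  # per-observation shift for stability
    return float(np.sum(m[:, 0] + np.log(np.exp(a - m).sum(axis=1))))

# Hypothetical example: observations are word indices and each component
# is a categorical distribution over a two-word vocabulary.
def categorical_log_density(word, mu):
    return np.log(mu[word])

mus = [np.array([0.5, 0.5]), np.array([0.9, 0.1])]
ll = mixture_log_likelihood([0, 1], mus, [0.5, 0.5], categorical_log_density)
```

Here the two observations have marginal probabilities 0.7 and 0.3, so the total log-likelihood is log(0.21).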
22 In many applications, we have two sets of observations $\{x_n^f\}$ and $\{x_n^b\}$, which we call the foreground data and the background data, respectively. [sent-39, score-0.873]
23 The foreground and background are generated by two possibly overlapping sets of mixture components. [sent-40, score-0.942]
24 The foreground $\{x_n^f\}$ is generated from the mixture model $\{(\mu_j, w_j^f)\}_{j \in A \cup B}$, and the background $\{x_n^b\}$ is generated from $\{(\mu_j, w_j^b)\}_{j \in B \cup C}$. [sent-42, score-1.145]
25 The goal of contrastive learning is to infer the parameters $\{(\mu_j, w_j^f)\}_{j \in A}$, which we call the foreground-specific model. [sent-43, score-0.513]
26 However, this involves explicitly learning a model for the background data, which is undesirable if the background is too complex, if $\{x_n^b\}$ is too noisy, or if we do not want to devote computational power to learning the background. [sent-45, score-0.536]
27 In many applications, we are only interested in learning a generative model for the difference between the foreground and background, because that contrast is the interesting signal. [sent-46, score-0.686]
28 Our approach is based on a method of moments that uses higher-order tensor decompositions for estimation [5]; we generalize the tensor decomposition technique to deal with our task of contrastive learning. [sent-48, score-0.678]
29 [...] (e.g., [6–13]), but their parameter estimation cannot account for the asymmetry between foreground and background. [sent-51, score-0.627]
30 We demonstrate spectral contrastive learning through two concrete applications: contrastive topic modeling and contrastive genomics. [sent-52, score-1.422]
31 In contrastive topic modeling we are given a foreground corpus of documents and a background corpus. [sent-53, score-1.59]
32 We want to learn a fully generative topic model that explains the foreground-specific documents (the contrast). [sent-54, score-0.35]
33 We show that even when the background is extremely sparse—too noisy to learn a good background topic model—our spectral contrast algorithm still recovers foreground-specific topics. [sent-55, score-0.734]
34 In contrastive genomics, sequence data is modeled by HMMs. [sent-56, score-0.405]
35 The foreground data is generated by a mixture of two HMMs; one is foreground-specific, and the other captures some background process. [sent-57, score-0.964]
36 The background data is generated by this second HMM. [sent-58, score-0.255]
37 Section 3, Contrastive topic modeling: To illustrate contrastive analysis and introduce tensor methods, we consider a simple topic model where each document is generated by exactly one topic. [sent-60, score-0.957]
38 The generative topic model for a document is as follows. [sent-63, score-0.269]
39 Moment decompositions: We use the symbol $\otimes$ to denote the tensor product of vectors, so $a \otimes b$ is the matrix whose $(i, j)$-th entry is $a_i b_j$, and $a \otimes b \otimes c$ is the third-order tensor whose $(i, j, k)$-th entry is $a_i b_j c_k$. [sent-81, score-0.55]
40 Given a third-order tensor $T \in \mathbb{R}^{d_1 \times d_2 \times d_3}$ and vectors $a \in \mathbb{R}^{d_1}$, $b \in \mathbb{R}^{d_2}$, and $c \in \mathbb{R}^{d_3}$, we let $T(I, b, c) \in \mathbb{R}^{d_1}$ denote the vector whose $i$-th entry is $\sum_{j,k} T_{i,j,k} b_j c_k$, and $T(a, b, c)$ denote the scalar $\sum_{i,j,k} T_{i,j,k} a_i b_j c_k$. [sent-82, score-0.324]
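The two tensor contractions defined above map directly onto `numpy.einsum`. The following is a small sketch; the function names `contract_Ibc` and `contract_abc` are hypothetical and chosen only to mirror the notation T(I, b, c) and T(a, b, c).

```python
import numpy as np

def contract_Ibc(T, b, c):
    """T(I, b, c): vector whose i-th entry is sum_{j,k} T[i,j,k] * b[j] * c[k]."""
    return np.einsum("ijk,j,k->i", T, b, c)

def contract_abc(T, a, b, c):
    """T(a, b, c): scalar sum_{i,j,k} T[i,j,k] * a[i] * b[j] * c[k]."""
    return np.einsum("ijk,i,j,k->", T, a, b, c)

# Rank-one example: T = a ⊗ b ⊗ c, whose (i, j, k) entry is a_i * b_j * c_k.
a = np.array([1.0, 2.0])
b = np.array([0.5, -1.0, 3.0])
c = np.array([2.0, 1.0])
T = np.einsum("i,j,k->ijk", a, b, c)

vec = contract_Ibc(T, b, c)        # equals a * <b, b> * <c, c>
scalar = contract_abc(T, a, b, c)  # equals <a, a> * <b, b> * <c, c>
```

For a rank-one tensor the contractions factor into inner products, which gives a quick sanity check on the index conventions.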
41 Let x1 , x2 , x3 ∈ [D] be three random words sampled from a random document generated by the foreground model (the discussion here also applies to the background model). [sent-86, score-0.955]
42 Similarly, the third-order (cross) moment tensor $M_3 := E[e_{x_1} \otimes e_{x_2} \otimes e_{x_3}]$ is the [...] Algorithm 1 (Contrastive Topic Model estimator). Input: foreground and background documents $\{c_n^f\}$, $\{c_n^b\}$; parameter $\gamma > 0$; number of topics $K$. [sent-88, score-0.701]
43 1: Let $\hat{M}_2^f$ and $\hat{M}_3^f$ ($\hat{M}_2^b$ and $\hat{M}_3^b$) be the foreground (background) second- and third-order moment estimates based on $\{c_n^f\}$ ($\{c_n^b\}$), and let $\hat{M}_2 := \hat{M}_2^f - \gamma \hat{M}_2^b$ and $\hat{M}_3 := \hat{M}_3^f - \gamma \hat{M}_3^b$. [sent-90, score-0.677]
44 Observe that for any $t \in A \cup B$, the $i$-th entry of $E[e_{x_1} \mid \text{topic} = t]$ is precisely the probability that $x_1 = i$ given topic $= t$, which is the $i$-th entry of $\mu_t$. [sent-94, score-0.278]
45 We can similarly decompose the background moments $M_2^b$ and $M_3^b$ in terms of tensor products of $\{\mu_t\}_{t \in B \cup C}$. [sent-103, score-0.345]
46 Let $M_2^f$, $M_3^f$ and $M_2^b$, $M_3^b$ be the second- and third-order moments from the foreground and background data, respectively. [sent-106, score-0.945]
47 Proposition 1 implies that the modified moments $M_2$ and $M_3$ have low-rank decompositions in which the components $t$ with positive multipliers $\omega_t$ correspond to the foreground-specific topics $\{(\mu_t, w_t^f)\}_{t \in A}$. [sent-110, score-0.387]
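The sign pattern described above can be checked on population moments of a toy single-topic model. This sketch assumes that the second-order moment decomposes as a weighted sum of outer products of topic vectors, so that the modified moment has multipliers equal to the foreground weight minus γ times the background weight; Proposition 1 itself is not reproduced in this listing, so that form is an assumption. The toy topics, the weights, and the helper `second_moment` are hypothetical.

```python
import numpy as np

def second_moment(topics, weights):
    """Population second moment sum_t w_t * mu_t mu_t^T (single-topic model)."""
    return sum(w * np.outer(mu, mu) for mu, w in zip(topics, weights))

# Toy topics over a 6-word vocabulary with disjoint supports.
mu_A = np.array([0.5, 0.5, 0.0, 0.0, 0.0, 0.0])  # foreground-specific
mu_B = np.array([0.0, 0.0, 0.5, 0.5, 0.0, 0.0])  # shared
mu_C = np.array([0.0, 0.0, 0.0, 0.0, 0.5, 0.5])  # background-specific

M2_f = second_moment([mu_A, mu_B], [0.4, 0.6])  # foreground: topics A and B
M2_b = second_moment([mu_B, mu_C], [0.5, 0.5])  # background: topics B and C

gamma = 0.6 / 0.5  # ratio of foreground to background weight for shared topic B
M2 = M2_f - gamma * M2_b

evals, evecs = np.linalg.eigh(M2)         # eigenvalues in ascending order
n_positive = int(np.sum(evals > 1e-9))    # components with positive multiplier
top = evecs[:, -1] / evecs[:, -1].sum()   # rescale top eigenvector to sum to 1
```

In this toy setting only topic A keeps a positive multiplier, so the modified moment has a single positive eigenvalue whose eigenvector, once rescaled, matches the foreground-specific topic.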
48 We argue that under some natural conditions, the generalized power method is robust to large perturbations in $M_2^b$ and $M_3^b$, which suggests that foreground-specific topics can be learned even when it is not possible to accurately model the background. [sent-112, score-0.25]
49 We use the generalized tensor power method to estimate the foreground-specific topics in our Contrastive Topic Model estimator (Algorithm 1). [sent-113, score-0.372]
50 Where possible in practice, we recommend using prior belief about foreground and background compositions to estimate $\max_{j \in B} w_j^f / w_j^b$, and then vary $\gamma$ as part of the exploratory analysis. [sent-116, score-0.985]
51 Experiments with contrastive topic modeling: We test our contrastive topic models on the RCV1 dataset, which consists of ≈ 800,000 news articles. [sent-118, score-1.182]
52 The corpus spans a large set of complex and overlapping categories, making this a good dataset to validate our contrastive learning algorithm. [sent-124, score-0.467]
53 In one set of experiments, we take documents associated with one region as the foreground corpus, and documents associated with a general theme, such as economics, as the background. [sent-125, score-0.823]
54 The goal of the contrast is to find the region-specific topics which are not relevant to the background theme. [sent-126, score-0.425]
55 We first set the contrast parameter γ = 0 in Algorithm 1; this learns the topics from the foreground dataset alone. [sent-417, score-0.848]
56 Due to the composition of the corpus, the foreground topics for USA are dominated by topics relevant to stock markets and trade; representative topics and keywords are shown on the left of Table 1. [sent-418, score-1.216]
57 The topics involving market and trade are also present in the background corpus, so their weights are reduced through contrast. [sent-421, score-0.479]
58 A similar experiment with China-related articles as foreground, and the same economics-themed background, is shown in the bottom of Table 1. [sent-423, score-0.31]
59 Using the RCV1 labels, we partition the foreground USA documents into two disjoint groups: documents with any economics-related labels (group 0) and the rest (group 1). [sent-426, score-0.852]
60 Because Algorithm 1 learns the full probabilistic model, we use the inferred topic parameters to compute the marginal likelihood for each foreground document given the model. [sent-427, score-0.877]
61 We then use the likelihood value to classify each foreground document as belonging to group 0 or 1. [sent-428, score-0.683]
62 We first set γ = 0 and compute the AUC score, which corresponds to how well a topic model learned from only the foreground can distinguish between the two groups. [sent-430, score-0.799]
63 The hope is that by using the background data, the contrastive model can better identify the documents that are generated by foreground-specific topics. [sent-432, score-0.758]
64 For γ > 2 we find that the foreground-specific topics do not change qualitatively. [sent-434, score-0.808]
65 A major advantage of our approach is that we do not need to learn a very accurate background model to learn the contrast. [sent-435, score-0.304]
66 To validate this, we downsample the background corpus to 1000, 100, and 50 documents. [sent-436, score-0.288]
67 This simulates settings where the background is very sparsely sampled, so it is not possible to learn a background model very accurately. [sent-437, score-0.491]
68 Qualitatively, we observe that even with only 50 randomly sampled background documents, Algorithm 1 still recovers topics specific to USA and not related to Economics. [sent-438, score-0.425]
69 This is supported by the specificity test, where contrastive topic models with sparse background better identify foreground-specific documents relative to the γ = 0 (foreground data-only) model. [sent-440, score-0.901]
70 Consider a simple generative process where foreground data are generated by a mixture of two HMMs, $(1 - \gamma)\,\mathrm{HMM}_A + \gamma\,\mathrm{HMM}_B$, and background data are generated by $\mathrm{HMM}_B$. [sent-445, score-1.012]
71 As we did for topic models, we can estimate a contrastive HMM by taking appropriate combinations of observable moments. [sent-447, score-0.577]
72 be a random emission sequence in $\mathbb{R}^D$ generated by the foreground model $(1 - \gamma)\,\mathrm{HMM}_A + \gamma\,\mathrm{HMM}_B$, and $x_1^b, x_2^b, x_3^b, \ldots$ [sent-451, score-0.906]
73 be the sequence generated by the background model $\mathrm{HMM}_B$. [sent-454, score-0.255]
74 Following [5], we estimate the following cross moment matrices and tensors: $M_{1,2}^f := E[x_1^f \otimes x_2^f]$, $M_{1,3}^f := E[x_1^f \otimes x_3^f]$, $M_{2,3}^f := E[x_2^f \otimes x_3^f]$, $M_{1,2,3}^f := E[x_1^f \otimes x_2^f \otimes x_3^f]$, as well as the corresponding moments for the background model. [sent-455, score-1.323]
75 Then, analogous to Proposition 1, we define the contrastive moments as $M_{u,v} := M_{u,v}^f - \gamma M_{u,v}^b$ (for $\{u, v\} \subset \{1, 2, 3\}$) and $M_{1,2,3} := M_{1,2,3}^f - \gamma M_{1,2,3}^b$. [sent-458, score-0.497]
76 The key technical difference from contrastive LDA lies in the asymmetric generalization of the Tensor Power Method of Algorithm 2. [sent-461, score-0.405]
77 For many biological problems, it is important to understand how signals in certain data are enriched relative to some related background data. [sent-463, score-0.28]
78 For instance, we may want to contrast foreground data composed of gene expressions (or mutation rates, protein levels, etc.) from one population against background data taken from (say) a control experiment, a different cell type, or a different time point. [sent-464, score-0.871]
79 The contrastive analysis methods developed here can be a powerful exploratory tool for biology. [sent-465, score-0.45]
80 The human genome consists of ≈ 3 billion DNA bases, and it has recently been shown that these bases can be naturally segmented into a handful of chromatin states [15, 16]. [sent-467, score-0.312]
81 Each chromatin state corresponds to a latent state, characterized by a vector of 10 emission probabilities. [sent-475, score-0.283]
82 We take as foreground data the observations from exons, introns and promoters, which account for about 30% of the genome; as background data, we take observations from intergenic regions. [sent-476, score-0.926]
83 Because exons and introns are transcribed, we expect the foreground to be a mixture of functional chromatin states and spurious states due to noise, and expect more of the background observations to be due to non-functional processes. [sent-477, score-1.238]
84 The contrastive HMM should capture biologically meaningful signals in the foreground data. [sent-478, score-1.049]
85 In Figure 2(b), we show the emission matrix for the foreground HMM and for the contrastive HMM. [sent-479, score-1.093]
86 The foreground states recover the known biological chromatin states from literature [16]. [sent-488, score-0.933]
87 In the contrastive HMM, most of the states are the same as before. [sent-490, score-0.441]
88 Interestingly, state 7, which is associated with feature K20me1, drops from the largest component of the foreground to a very small component of the contrast. [sent-491, score-0.682]
89 Section 5, Generalized tensor power method: We now describe our general approach for tensor decomposition used in Algorithm 1. [sent-493, score-0.291]
90 Robustness to sparse background sampling: Algorithm 1 can recover the foreground-specific $\{\mu_t\}_{t \in A}$ even with a relatively small amount of background data. [sent-517, score-0.482]
91 We can illustrate this robustness under the assumption that the support of the foreground-specific topics $S_0 := \cup_{t \in A} \mathrm{supp}(\mu_t)$ is disjoint from that of the other topics $S_1 := \cup_{t \in B \cup C} \mathrm{supp}(\mu_t)$ (similar to Brown clusters [19]). [sent-518, score-0.417]
92 Suppose that $M_2^f$ is estimated accurately using a large sample of foreground documents. [sent-519, score-0.651]
93 Then because $S_0$ and $S_1$ are disjoint, Algorithm 1 (using sufficiently large $\gamma$) will accurately recover the topics $\{(\mu_t, w_t^f) : t \in A\}$ in $\widehat{\mathrm{Topics}}_f$. [sent-520, score-0.305]
94 The remaining concern is that sampling errors will cause Algorithm 1 to mistakenly return additional topics in $\widehat{\mathrm{Topics}}_f$, namely the topics $t \in B \cup C$. [sent-521, score-0.362]
95 We first discuss how to represent empirical estimates of the second- and third-order moments $M_2^f$ and $M_3^f$ for the foreground documents (the same will hold for the background documents). [sent-531, score-1.043]
96 Let document $n \in [N]$ have length $\ell_n$, and let $c_n \in \mathbb{N}^D$ be its word count vector (its $i$-th entry $c_n(i)$ is the number of times word $i$ appears in document $n$). [sent-532, score-0.739]
97 For any distinct $i, j \in [D]$, $E[(c_n(i)^2 - c_n(i)) / (\ell_n(\ell_n - 1))] = [M_2^f]_{i,i}$ and $E[c_n(i) c_n(j) / (\ell_n(\ell_n - 1))] = [M_2^f]_{i,j}$. [sent-535, score-0.249]
98 By Proposition 3, an unbiased estimator of $M_2^f$ is $\hat{M}_2^f := N^{-1} \sum_{n=1}^{N} (\ell_n(\ell_n - 1))^{-1} (c_n \otimes c_n - \mathrm{diag}(c_n))$. [sent-536, score-0.272]
99 For any distinct $i, j, k \in [D]$, $E[(c_n(i)^3 - 3 c_n(i)^2 + 2 c_n(i)) / (\ell_n(\ell_n - 1)(\ell_n - 2))] = [M_3^f]_{i,i,i}$, $E[(c_n(i)^2 c_n(j) - c_n(i) c_n(j)) / (\ell_n(\ell_n - 1)(\ell_n - 2))] = [M_3^f]_{i,i,j}$, and $E[(c_n(i) c_n(j) c_n(k)) / (\ell_n(\ell_n - 1)(\ell_n - 2))] = [M_3^f]_{i,j,k}$. [sent-541, score-0.498]
100 By Proposition 4, an unbiased estimator of $M_3^f(I, v, v)$ for any vector $v \in \mathbb{R}^D$ is $\hat{M}_3^f(I, v, v) := N^{-1} \sum_{n=1}^{N} (\ell_n(\ell_n - 1)(\ell_n - 2))^{-1} (\langle c_n, v \rangle^2 c_n - 2 \langle c_n, v \rangle (c_n \circ v) - \langle c_n, v \circ v \rangle c_n + 2\, c_n \circ v \circ v)$ (where $\circ$ denotes component-wise product of vectors). [sent-542, score-0.272]
wordName wordTfidf (topN-words)
[('foreground', 0.627), ('contrastive', 0.405), ('cn', 0.249), ('background', 0.226), ('xf', 0.191), ('topics', 0.181), ('topic', 0.172), ('chromatin', 0.167), ('tensor', 0.123), ('documents', 0.098), ('moments', 0.092), ('wj', 0.087), ('china', 0.078), ('hmm', 0.073), ('wt', 0.07), ('hmma', 0.067), ('hmmb', 0.067), ('topicsf', 0.067), ('xb', 0.063), ('corpus', 0.062), ('proposition', 0.061), ('emission', 0.061), ('mixture', 0.06), ('document', 0.056), ('economics', 0.055), ('genome', 0.054), ('entry', 0.053), ('moment', 0.05), ('ation', 0.049), ('trade', 0.045), ('exploratory', 0.045), ('power', 0.045), ('hmms', 0.045), ('rd', 0.045), ('ai', 0.044), ('generative', 0.041), ('learn', 0.039), ('word', 0.038), ('biological', 0.037), ('states', 0.036), ('latent', 0.036), ('genomics', 0.035), ('spectral', 0.035), ('bases', 0.035), ('auc', 0.035), ('basketball', 0.033), ('exons', 0.033), ('introns', 0.033), ('nation', 0.033), ('harvard', 0.031), ('bj', 0.031), ('recover', 0.03), ('transcribed', 0.029), ('week', 0.029), ('generated', 0.029), ('disjoint', 0.029), ('articles', 0.029), ('year', 0.028), ('maxj', 0.028), ('news', 0.028), ('percent', 0.028), ('million', 0.027), ('bond', 0.027), ('keywords', 0.027), ('usa', 0.027), ('market', 0.027), ('tensors', 0.027), ('decompositions', 0.027), ('robustness', 0.026), ('nnz', 0.026), ('pseudoinverse', 0.026), ('lda', 0.024), ('accurately', 0.024), ('estimator', 0.023), ('captures', 0.022), ('genomic', 0.022), ('learns', 0.022), ('ck', 0.021), ('infer', 0.021), ('billion', 0.02), ('observations', 0.02), ('score', 0.02), ('cb', 0.02), ('bank', 0.02), ('projection', 0.02), ('syntactic', 0.019), ('chemical', 0.019), ('supp', 0.019), ('state', 0.019), ('stock', 0.019), ('hidden', 0.019), ('contrast', 0.018), ('component', 0.018), ('recovers', 0.018), ('ect', 0.018), ('city', 0.017), ('maximally', 0.017), ('words', 0.017), ('components', 0.017), ('signals', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 70 nips-2013-Contrastive Learning Using Spectral Methods
2 0.17237426 155 nips-2013-Learning Hidden Markov Models from Non-sequence Data via Tensor Decomposition
Author: Tzu-Kuo Huang, Jeff Schneider
Abstract: Learning dynamic models from observed data has been a central issue in many scientific studies or engineering tasks. The usual setting is that data are collected sequentially from trajectories of some dynamical system operation. In quite a few modern scientific modeling tasks, however, it turns out that reliable sequential data are rather difficult to gather, whereas out-of-order snapshots are much easier to obtain. Examples include the modeling of galaxies, chronic diseases such Alzheimer’s, or certain biological processes. Existing methods for learning dynamic model from non-sequence data are mostly based on Expectation-Maximization, which involves non-convex optimization and is thus hard to analyze. Inspired by recent advances in spectral learning methods, we propose to study this problem from a different perspective: moment matching and spectral decomposition. Under that framework, we identify reasonable assumptions on the generative process of non-sequence data, and propose learning algorithms based on the tensor decomposition method [2] to provably recover firstorder Markov models and hidden Markov models. To the best of our knowledge, this is the first formal guarantee on learning from non-sequence data. Preliminary simulation results confirm our theoretical findings. 1
Author: Anima Anandkumar, Daniel Hsu, Majid Janzamin, Sham M. Kakade
Abstract: Overcomplete latent representations have been very popular for unsupervised feature learning in recent years. In this paper, we specify which overcomplete models can be identified given observable moments of a certain order. We consider probabilistic admixture or topic models in the overcomplete regime, where the number of latent topics can greatly exceed the size of the observed word vocabulary. While general overcomplete topic models are not identifiable, we establish generic identifiability under a constraint, referred to as topic persistence. Our sufficient conditions for identifiability involve a novel set of “higher order” expansion conditions on the topic-word matrix or the population structure of the model. This set of higher-order expansion conditions allow for overcomplete models, and require the existence of a perfect matching from latent topics to higher order observed words. We establish that random structured topic models are identifiable w.h.p. in the overcomplete regime. Our identifiability results allow for general (non-degenerate) distributions for modeling the topic proportions, and thus, we can handle arbitrarily correlated topics in our framework. Our identifiability results imply uniqueness of a class of tensor decompositions with structured sparsity which is contained in the class of Tucker decompositions, but is more general than the Candecomp/Parafac (CP) decomposition. Keywords: Overcomplete representation, admixture models, generic identifiability, tensor decomposition.
4 0.11742663 301 nips-2013-Sparse Additive Text Models with Low Rank Background
Author: Lei Shi
Abstract: The sparse additive model for text modeling involves the sum-of-exp computing, whose cost is consuming for large scales. Moreover, the assumption of equal background across all classes/topics may be too strong. This paper extends to propose sparse additive model with low rank background (SAM-LRB) and obtains simple yet efficient estimation. Particularly, employing a double majorization bound, we approximate log-likelihood into a quadratic lower-bound without the log-sumexp terms. The constraints of low rank and sparsity are then simply embodied by nuclear norm and ℓ1 -norm regularizers. Interestingly, we find that the optimization task of SAM-LRB can be transformed into the same form as in Robust PCA. Consequently, parameters of supervised SAM-LRB can be efficiently learned using an existing algorithm for Robust PCA based on accelerated proximal gradient. Besides the supervised case, we extend SAM-LRB to favor unsupervised and multifaceted scenarios. Experiments on three real data demonstrate the effectiveness and efficiency of SAM-LRB, compared with a few state-of-the-art models. 1
5 0.1101885 274 nips-2013-Relevance Topic Model for Unstructured Social Group Activity Recognition
Author: Fang Zhao, Yongzhen Huang, Liang Wang, Tieniu Tan
Abstract: Unstructured social group activity recognition in web videos is a challenging task due to 1) the semantic gap between class labels and low-level visual features and 2) the lack of labeled training data. To tackle this problem, we propose a “relevance topic model” for jointly learning meaningful mid-level representations upon bagof-words (BoW) video representations and a classifier with sparse weights. In our approach, sparse Bayesian learning is incorporated into an undirected topic model (i.e., Replicated Softmax) to discover topics which are relevant to video classes and suitable for prediction. Rectified linear units are utilized to increase the expressive power of topics so as to explain better video data containing complex contents and make variational inference tractable for the proposed model. An efficient variational EM algorithm is presented for model parameter estimation and inference. Experimental results on the Unstructured Social Activity Attribute dataset show that our model achieves state of the art performance and outperforms other supervised topic model in terms of classification accuracy, particularly in the case of a very small number of labeled training videos. 1
6 0.1042442 11 nips-2013-A New Convex Relaxation for Tensor Completion
7 0.10207536 174 nips-2013-Lexical and Hierarchical Topic Regression
8 0.085709333 160 nips-2013-Learning Stochastic Feedforward Neural Networks
9 0.079451762 113 nips-2013-Exact and Stable Recovery of Pairwise Interaction Tensors
10 0.079047799 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models
11 0.076930955 224 nips-2013-On the Sample Complexity of Subspace Learning
12 0.075251564 295 nips-2013-Simultaneous Rectification and Alignment via Robust Recovery of Low-rank Tensors
13 0.0734277 74 nips-2013-Convex Tensor Decomposition via Structured Schatten Norm Regularization
14 0.072926626 98 nips-2013-Documents as multiple overlapping windows into grids of counts
15 0.069321543 10 nips-2013-A Latent Source Model for Nonparametric Time Series Classification
16 0.067766808 179 nips-2013-Low-Rank Matrix and Tensor Completion via Adaptive Sampling
17 0.06487871 263 nips-2013-Reasoning With Neural Tensor Networks for Knowledge Base Completion
18 0.063804545 298 nips-2013-Small-Variance Asymptotics for Hidden Markov Models
19 0.063719787 247 nips-2013-Phase Retrieval using Alternating Minimization
20 0.061861083 65 nips-2013-Compressive Feature Learning
topicId topicWeight
[(0, 0.144), (1, 0.078), (2, 0.016), (3, 0.079), (4, 0.057), (5, -0.098), (6, 0.066), (7, 0.035), (8, 0.138), (9, 0.013), (10, -0.019), (11, 0.062), (12, -0.014), (13, -0.004), (14, 0.003), (15, -0.014), (16, 0.163), (17, -0.006), (18, -0.002), (19, -0.022), (20, -0.036), (21, 0.001), (22, -0.002), (23, 0.007), (24, 0.016), (25, -0.123), (26, 0.001), (27, 0.074), (28, -0.195), (29, -0.027), (30, 0.092), (31, 0.029), (32, -0.045), (33, -0.014), (34, 0.043), (35, -0.025), (36, 0.011), (37, -0.025), (38, 0.072), (39, -0.069), (40, 0.029), (41, 0.035), (42, -0.08), (43, -0.043), (44, 0.03), (45, 0.045), (46, 0.03), (47, 0.002), (48, 0.056), (49, 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.95441031 70 nips-2013-Contrastive Learning Using Spectral Methods
Author: Anima Anandkumar, Daniel Hsu, Majid Janzamin, Sham M. Kakade
Abstract: Overcomplete latent representations have been very popular for unsupervised feature learning in recent years. In this paper, we specify which overcomplete models can be identified given observable moments of a certain order. We consider probabilistic admixture or topic models in the overcomplete regime, where the number of latent topics can greatly exceed the size of the observed word vocabulary. While general overcomplete topic models are not identifiable, we establish generic identifiability under a constraint, referred to as topic persistence. Our sufficient conditions for identifiability involve a novel set of “higher order” expansion conditions on the topic-word matrix or the population structure of the model. This set of higher-order expansion conditions allow for overcomplete models, and require the existence of a perfect matching from latent topics to higher order observed words. We establish that random structured topic models are identifiable w.h.p. in the overcomplete regime. Our identifiability results allow for general (non-degenerate) distributions for modeling the topic proportions, and thus, we can handle arbitrarily correlated topics in our framework. Our identifiability results imply uniqueness of a class of tensor decompositions with structured sparsity which is contained in the class of Tucker decompositions, but is more general than the Candecomp/Parafac (CP) decomposition. Keywords: Overcomplete representation, admixture models, generic identifiability, tensor decomposition.
3 0.74456549 174 nips-2013-Lexical and Hierarchical Topic Regression
Author: Viet-An Nguyen, Jordan Boyd-Graber, Philip Resnik
Abstract: Inspired by a two-level theory from political science that unifies agenda setting and ideological framing, we propose supervised hierarchical latent Dirichlet allocation (SHLDA), which jointly captures documents’ multi-level topic structure and their polar response variables. Our model extends the nested Chinese restaurant process to discover tree-structured topic hierarchies and uses both per-topic hierarchical and per-word lexical regression parameters to model response variables. SHLDA improves prediction on political affiliation and sentiment tasks in addition to providing insight into how topics under discussion are framed.

1 Introduction: Agenda Setting and Framing in Hierarchical Models

How do liberal-leaning bloggers talk about immigration in the US? What do conservative politicians have to say about education? How do Fox News and MSNBC differ in their language about the gun debate? Such questions concern not only what, but how things are talked about. In political communication, the question of “what” falls under the heading of agenda setting theory, which concerns the issues introduced into political discourse (e.g., by the mass media) and their influence over public priorities [1]. The question of “how” concerns framing: the way the presentation of an issue reflects or encourages a particular perspective or interpretation [2]. For example, the rise of the “innocence frame” in the death penalty debate, emphasizing the irreversible consequence of mistaken convictions, has led to a sharp decline in the use of capital punishment in the US [3]. In its concern with the subjects or issues under discussion in political discourse, agenda setting maps neatly to topic modeling [4] as a means of discovering and characterizing those issues [5].
Interestingly, one line of communication theory seeks to unify agenda setting and framing by viewing frames as a second-level kind of agenda [1]: just as agenda setting is about which objects of discussion are salient, framing is about the salience of attributes of those objects. The key is that what communications theorists consider an attribute in a discussion can itself be an object as well. For example, “mistaken convictions” is one attribute of the death penalty discussion, but it can also be viewed as an object of discussion in its own right.

This two-level view leads naturally to the idea of using a hierarchical topic model to formalize both agendas and frames within a uniform setting. In this paper, we introduce a new model to do exactly that. The model is predictive: it represents the idea of alternative or competing perspectives via a continuous-valued response variable. Although inspired by the study of political discourse, associating texts with “perspectives” is more general and has been studied in sentiment analysis, discovery of regional variation, and value-sensitive design. We show experimentally that the model’s hierarchical structure improves prediction of perspective in both a political domain and on sentiment analysis tasks, and we argue that the topic hierarchies exposed by the model are indeed capturing structure in line with the theory that motivated the work.

1. For each node $k \in [1, \infty)$ in the tree
   (a) Draw topic $\phi_k \sim \mathrm{Dir}(\beta_k)$
   (b) Draw regression parameter $\eta_k \sim \mathcal{N}(\mu, \sigma)$
2. For each word $v \in [1, V]$, draw $\tau_v \sim \mathrm{Laplace}(0, \omega)$
3. For each document $d \in [1, D]$
   (a) Draw level distribution $\theta_d \sim \mathrm{GEM}(m, \pi)$
   (b) Draw table distribution $\psi_d \sim \mathrm{GEM}(\alpha)$
   (c) For each table $t \in [1, \infty)$, draw a path $c_{d,t} \sim \mathrm{nCRP}(\gamma)$
   (d) For each sentence $s \in [1, S_d]$, draw a table indicator $t_{d,s} \sim \mathrm{Mult}(\psi_d)$
       i. For each token $n \in [1, N_{d,s}]$
          A. Draw level $z_{d,s,n} \sim \mathrm{Mult}(\theta_d)$
          B. Draw word $w_{d,s,n} \sim \mathrm{Mult}(\phi_{c_{d,t_{d,s}},\, z_{d,s,n}})$
   (e) Draw response $y_d \sim \mathcal{N}(\eta^T \bar{z}_d + \tau^T \bar{w}_d, \rho)$:
       i. $\bar{z}_{d,k} = \frac{1}{N_{d,\cdot}} \sum_{s=1}^{S_d} \sum_{n=1}^{N_{d,s}} I[k_{d,s,n} = k]$
       ii. $\bar{w}_{d,v} = \frac{1}{N_{d,\cdot}} \sum_{s=1}^{S_d} \sum_{n=1}^{N_{d,s}} I[w_{d,s,n} = v]$

Figure 1: SHLDA’s generative process and plate diagram. Words $w$ are explained by topic hierarchy $\phi$, and response variables $y$ are explained by per-topic regression coefficients $\eta$ and global lexical coefficients $\tau$.

2 SHLDA: Combining Supervision and Hierarchical Topic Structure

Jointly capturing supervision and hierarchical topic structure falls under a class of models called supervised hierarchical latent Dirichlet allocation. These models take as input a set of $D$ documents, each of which is associated with a response variable $y_d$, and output a hierarchy of topics which is informed by $y_d$. Zhang et al. [6] introduce the SHLDA family, focusing on a categorical response. In contrast, our novel model (which we call SHLDA for brevity) uses continuous responses.

At its core, SHLDA’s document generative process resembles a combination of hierarchical latent Dirichlet allocation [7, HLDA] and the hierarchical Dirichlet process [8, HDP]. HLDA uses the nested Chinese restaurant process (nCRP($\gamma$)), combined with an appropriate base distribution, to induce an unbounded tree-structured hierarchy of topics: general topics at the top, specific at the bottom. A document is generated by traversing this tree, at each level creating a new child (hence a new path) with probability proportional to $\gamma$ or otherwise respecting the “rich-get-richer” property of a CRP. A drawback of HLDA, however, is that each document is restricted to only a single path in the tree. Recent work relaxes this restriction through different priors: nested HDP [9], nested Chinese franchises [10] or recursive CRPs [11].
In this paper, we address this problem by allowing documents to have multiple paths through the tree, leveraging information at the sentence level through the two-level structure used in HDP. More specifically, in the HDP’s Chinese restaurant franchise metaphor, customers (i.e., tokens) are grouped by sitting at tables and each table takes a dish (i.e., topic) from a flat global menu. In SHLDA, dishes are organized in a tree-structured global menu by using the nCRP as prior. Each path in the tree is a collection of $L$ dishes (one for each level) and is called a combo. SHLDA groups sentences of a document by assigning them to tables and associates each table with a combo, and thus models each document as a distribution over combos (footnote 1).

In SHLDA’s metaphor, customers come into a restaurant and sit at a table in groups, where each group is a sentence. A sentence $w_{d,s}$ enters restaurant $d$ and selects a table $t$ (and its associated combo) with probability proportional to the number of sentences $S_{d,t}$ at that table; or, it sits at a new table with probability proportional to $\alpha$. After choosing the table (indexed by $t_{d,s}$), if the table is new, the group will select a combo of dishes (i.e., a path, indexed by $c_{d,t}$) from the tree menu. Once a combo is in place, each token in the sentence chooses a “level” (indexed by $z_{d,s,n}$) in the combo, which specifies the topic ($\phi_{k_{d,s,n}} \equiv \phi_{c_{d,t_{d,s}},\, z_{d,s,n}}$) producing the associated observation (Figure 2).

SHLDA also draws on supervised LDA [12, SLDA], associating each document $d$ with an observable continuous response variable $y_d$ that represents the author’s perspective toward a topic, e.g., positive vs. negative sentiment, conservative vs. liberal ideology, etc. This lets us infer a multi-level topic structure informed by how topics are “framed” with respect to positions along the $y_d$ continuum.
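The table-selection rule of the restaurant metaphor above can be sketched in a few lines of Python. This is only a minimal illustration of the prior, not the authors' implementation: the function names `table_prior` and `sample_table` are hypothetical, and the full sampler additionally weights each table by the likelihood of the sentence's words and the document's response.

```python
import random


def table_prior(sentence_counts, alpha):
    """Normalized CRP prior over tables for one incoming sentence.

    sentence_counts[t] is the number of sentences already seated at
    existing table t; the final entry of the result is the probability
    of opening a new table, which is proportional to alpha.
    """
    weights = list(sentence_counts) + [alpha]
    total = sum(weights)
    return [w / total for w in weights]


def sample_table(sentence_counts, alpha, rng):
    """Draw a table index; index len(sentence_counts) means 'new table'."""
    probs = table_prior(sentence_counts, alpha)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    # Two existing tables with 3 and 1 sentences, concentration alpha = 1.
    print(table_prior([3, 1], 1.0))
    print(sample_table([3, 1], 1.0, random.Random(0)))
```

With two existing tables holding 3 and 1 sentences and $\alpha = 1$, the weights 3, 1 and 1 normalize to 0.6, 0.2 and 0.2, so popular tables attract more sentences while a new table always remains possible.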
Footnote 1: We emphasize that, unlike in HDP where each table is assigned to a single dish, each table in our metaphor is associated with a combo, a collection of $L$ dishes. We also use combo and path interchangeably.

Figure 2: SHLDA’s restaurant franchise metaphor. Diagram labels: dish, combo (path), table, customer (token), group (sentence), restaurant (document).

Table 1: Notation used in this paper
$S_d$                    # sentences in document $d$
$S_{d,t}$                # groups (i.e., sentences) sitting at table $t$ in restaurant $d$
$N_{d,s}$                # tokens $w_{d,s}$
$N_{d,\cdot,l}$          # tokens in $w_d$ assigned to level $l$
$N_{d,\cdot,>l}$         # tokens in $w_d$ assigned to level $> l$
$N_{d,\cdot,\ge l}$      $\equiv N_{d,\cdot,l} + N_{d,\cdot,>l}$
$M_{c,l}$                # tables at level $l$ on path $c$
$C_{c,l,v}$              # word type $v$ assigned to level $l$ on path $c$
$C_{d,x,l,v}$            # word type $v$ in $v_{d,x}$ assigned to level $l$
$\phi_k$                 Topic at node $k$
$\eta_k$                 Regression parameter at node $k$
$\tau_v$                 Regression parameter of word type $v$
$c_{d,t}$                Path assignment for table $t$ in restaurant $d$
$t_{d,s}$                Table assignment for group $w_{d,s}$
$z_{d,s,n}$              Level assignment for $w_{d,s,n}$
$k_{d,s,n}$              Node assignment for $w_{d,s,n}$ (i.e., node at level $z_{d,s,n}$ on path $c_{d,t_{d,s}}$)
$L$                      Height of the tree
$C^+$                    Set of all possible paths (including new ones) of the tree

Unlike SLDA, we model the response variables using a normal linear regression that contains both per-topic hierarchical and per-word lexical regression parameters. The hierarchical regression parameters are just like topics’ regression parameters in SLDA: each topic $k$ (here, a tree node) has a parameter $\eta_k$, and the model uses the empirical distribution over the nodes that generated a document as the regressors. However, the hierarchy in SHLDA makes it possible to discover relationships between topics and the response variable that SLDA’s simple latent space obscures. Consider, for example, a topic model trained on Congressional debates.
Vanilla LDA would likely discover a healthcare category. SLDA [12] could discover a pro-Obamacare topic and an anti-Obamacare topic. SHLDA could do that and capture the fact that there are alternative perspectives, i.e., that the healthcare issue is being discussed from two ideological perspectives, along with characterizing how the higher level topic is discussed by those on both sides of that ideological debate.

Sometimes, of course, words are strongly associated with extremes on the response variable continuum regardless of underlying topic structure. Therefore, in addition to hierarchical regression parameters, we include global lexical regression parameters to model the interaction between specific words and response variables. We denote the regression parameter associated with a word type $v$ in the vocabulary as $\tau_v$, and use the normalized frequency of $v$ in the documents to be its regressor.

Including both hierarchical and lexical parameters is important. For detecting ideology in the US, “liberty” is an effective indicator of conservative speakers regardless of context; however, “cost” is a conservative-leaning indicator in discussions about environmental policy but liberal-leaning in debates about foreign policy. For sentiment, “wonderful” is globally a positive word; however, “unexpected” is a positive descriptor of books but a negative one of a car’s steering. SHLDA captures these properties in a single model.

3 Posterior Inference and Optimization

Given documents with observed words $w = \{w_{d,s,n}\}$ and response variables $y = \{y_d\}$, the inference task is to find the posterior distribution over: the tree structure including topic $\phi_k$ and regression parameter $\eta_k$ for each node $k$, combo assignment $c_{d,t}$ for each table $t$ in document $d$, table assignment $t_{d,s}$ for each sentence $s$ in a document $d$, and level assignment $z_{d,s,n}$ for each token $w_{d,s,n}$.
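Before turning to inference, the response model of Section 2 can be made concrete with a small sketch. It computes the mean of the response, $\eta^T \bar{z}_d + \tau^T \bar{w}_d$, from the node and word assignments of one document. The function name `predict_response`, the dictionary-based parameter storage, and the toy node labels and weights are all assumptions made for illustration; they do not come from the paper.

```python
from collections import Counter


def predict_response(token_nodes, token_words, eta, tau):
    """Mean response for one document: eta . z_bar + tau . w_bar.

    token_nodes[i] is the tree node that generated token i, and
    token_words[i] is its word type. z_bar and w_bar are the normalized
    frequencies of nodes and word types in the document. Missing
    parameters are treated as zero (an assumption of this sketch).
    """
    n_tokens = len(token_words)
    node_counts = Counter(token_nodes)
    word_counts = Counter(token_words)
    topic_part = sum(eta.get(k, 0.0) * c / n_tokens for k, c in node_counts.items())
    lexical_part = sum(tau.get(v, 0.0) * c / n_tokens for v, c in word_counts.items())
    return topic_part + lexical_part


if __name__ == "__main__":
    # Hypothetical toy document with four tokens.
    nodes = ["root", "tax", "tax", "tax/social"]
    words = ["tax", "cuts", "children", "education"]
    eta = {"root": 0.0, "tax": 0.4, "tax/social": -2.0}
    tau = {"children": -1.0}
    print(predict_response(nodes, words, eta, tau))
```

In this toy document the topic part is $0.4 \cdot 2/4 - 2.0 \cdot 1/4 = -0.3$ and the lexical part is $-1.0 \cdot 1/4 = -0.25$, giving a mean of about $-0.55$; the same word can therefore shift the prediction both through its topic node and through its own lexical weight.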
We approximate SHLDA’s posterior using stochastic EM, which alternates between a Gibbs sampling E-step and an optimization M-step. More specifically, in the E-step, we integrate out $\psi$, $\theta$ and $\phi$ to construct a Markov chain over $(t, c, z)$ and alternate sampling each of them from their conditional distributions. In the M-step, we optimize the regression parameters $\eta$ and $\tau$ using L-BFGS [13]. Before describing each step in detail, let us define the following probabilities. For more thorough derivations, please see the supplement.

• First, define $v_{d,x}$ as a set of tokens (e.g., a token, a sentence or a set of sentences) in document $d$. The conditional density of $v_{d,x}$ being assigned to path $c$ given all other assignments is
\[
f_c^{-d,x}(v_{d,x}) = \prod_{l=1}^{L} \frac{\Gamma(C^{-d,x}_{c,l,\cdot} + V\beta_l)}{\Gamma(C^{-d,x}_{c,l,\cdot} + C_{d,x,l,\cdot} + V\beta_l)} \prod_{v=1}^{V} \frac{\Gamma(C^{-d,x}_{c,l,v} + C_{d,x,l,v} + \beta_l)}{\Gamma(C^{-d,x}_{c,l,v} + \beta_l)} \tag{1}
\]
where superscript $-d,x$ denotes the same count excluding assignments of $v_{d,x}$; marginal counts are represented by $\cdot$’s. For a new path $c^{new}$, if the node does not exist, $C^{-d,x}_{c^{new},l,v} = 0$ for all word types $v$.
• Second, define the conditional density of the response variable $y_d$ of document $d$ given $v_{d,x}$ being assigned to path $c$ and all other assignments as
\[
g_c^{-d,x}(y_d) = \mathcal{N}\left( \frac{1}{N_{d,\cdot}} \left( \sum_{w_{d,s,n} \in \{w_d \setminus v_{d,x}\}} \eta_{c_{d,t_{d,s}},\, z_{d,s,n}} + \sum_{l=1}^{L} \eta_{c,l} \cdot C_{d,x,l,\cdot} + \sum_{s=1}^{S_d} \sum_{n=1}^{N_{d,s}} \tau_{w_{d,s,n}} \right),\ \rho \right) \tag{2}
\]
where $N_{d,\cdot}$ is the total number of tokens in document $d$. For a new node at level $l$ on a new path $c^{new}$, we integrate over all possible values of $\eta_{c^{new},l}$.

Sampling t: For each group $w_{d,s}$ we need to sample a table $t_{d,s}$. The conditional distribution of a table $t$ given $w_{d,s}$ and other assignments is proportional to the number of sentences sitting at $t$ times the probability of $w_{d,s}$ and $y_d$ being observed under this assignment. This is
\[
P(t_{d,s} = t \mid \text{rest}) \propto P(t_{d,s} = t \mid t_d^{-s}) \cdot P(w_{d,s}, y_d \mid t_{d,s} = t, w^{-d,s}, t^{-d,s}, z, c, \eta)
\]
\[
\propto \begin{cases} S_{d,t}^{-d,s} \cdot f_{c_{d,t}}^{-d,s}(w_{d,s}) \cdot g_{c_{d,t}}^{-d,s}(y_d), & \text{for existing table } t; \\ \alpha \cdot \sum_{c \in C^+} P(c_{d,t^{new}} = c \mid c^{-d,s}) \cdot f_c^{-d,s}(w_{d,s}) \cdot g_c^{-d,s}(y_d), & \text{for new table } t^{new}. \end{cases} \tag{3}
\]
For a new table $t^{new}$, we need to sum over all possible paths $C^+$ of the tree, including new ones. For example, the set $C^+$ for the tree shown in Figure 2 consists of four existing paths (ending at one of the four leaf nodes) and three possible new paths (a new leaf off of one of the three internal nodes). The prior probability of path $c$ is:
\[
P(c_{d,t^{new}} = c \mid c^{-d,s}) \propto \begin{cases} \prod_{l=2}^{L} \dfrac{M^{-d,s}_{c,l}}{M^{-d,s}_{c,l-1} + \gamma_{l-1}}, & \text{for an existing path } c; \\[2ex] \dfrac{\gamma_{l^*}}{M^{-d,s}_{c^{new},l^*} + \gamma_{l^*}} \prod_{l=2}^{l^*} \dfrac{M^{-d,s}_{c^{new},l}}{M^{-d,s}_{c^{new},l-1} + \gamma_{l-1}}, & \text{for a new path } c^{new}, \end{cases} \tag{4}
\]
where a new path $c^{new}$ consists of an existing path from the root to a node at level $l^*$ and a new node.

Sampling z: After assigning a sentence $w_{d,s}$ to a table, we assign each token $w_{d,s,n}$ to a level to choose a dish from the combo. The probability of assigning $w_{d,s,n}$ to level $l$ is
\[
P(z_{d,s,n} = l \mid \text{rest}) \propto P(z_{d,s,n} = l \mid z_d^{-s,n}) \cdot P(w_{d,s,n}, y_d \mid z_{d,s,n} = l, w^{-d,s,n}, z^{-d,s,n}, t, c, \eta) \tag{5}
\]
The first factor captures the probability that a customer in restaurant $d$ is assigned to level $l$, conditioned on the level assignments of all other customers in restaurant $d$, and is equal to
\[
P(z_{d,s,n} = l \mid z_d^{-s,n}) = \frac{m\pi + N^{-d,s,n}_{d,\cdot,l}}{\pi + N^{-d,s,n}_{d,\cdot,\ge l}} \prod_{j=1}^{l-1} \frac{(1-m)\pi + N^{-d,s,n}_{d,\cdot,>j}}{\pi + N^{-d,s,n}_{d,\cdot,\ge j}}.
\]
The second factor is the probability of observing $w_{d,s,n}$ and $y_d$, given that $w_{d,s,n}$ is assigned to level $l$:
\[
P(w_{d,s,n}, y_d \mid z_{d,s,n} = l, w^{-d,s,n}, z^{-d,s,n}, t, c, \eta) = f_{c_{d,t_{d,s}}}^{-d,s,n}(w_{d,s,n}) \cdot g_{c_{d,t_{d,s}}}^{-d,s,n}(y_d).
\]

Sampling c: After assigning customers to tables and levels, we also sample path assignments for all tables. This is important since it can change the assignments of all customers sitting at a table, which leads to a well-mixed Markov chain and faster convergence.
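As a brief illustration of the first factor in the Sampling z step, the stick-breaking level prior can be computed directly from per-level token counts. The sketch below is an assumption-laden illustration rather than the authors' code: the names `level_prior` and `normalize` are hypothetical, levels are indexed from 0, and the tree is truncated to `len(level_counts)` levels, so the weights are renormalized over those levels.

```python
def level_prior(level_counts, m, pi):
    """Unnormalized prior weight of each level for one token.

    level_counts[l] is the number of *other* tokens in the document that
    are currently assigned to level l (the token being resampled is
    excluded). m and pi are the GEM(m, pi) hyperparameters.
    """
    weights = []
    reach = 1.0                    # product over shallower levels of "pass this level"
    remaining = sum(level_counts)  # tokens at the current level or deeper
    for n_l in level_counts:
        n_ge = remaining           # count at levels >= l
        n_gt = n_ge - n_l          # count at levels > l
        weights.append(reach * (m * pi + n_l) / (pi + n_ge))
        reach *= ((1.0 - m) * pi + n_gt) / (pi + n_ge)
        remaining = n_gt
    return weights


def normalize(weights):
    """Rescale non-negative weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Empty document, m = 0.5, pi = 1: the prior halves at each level.
    print(level_prior([0, 0, 0], 0.5, 1.0))
    print(normalize(level_prior([2, 1, 1], 0.5, 2.0)))
```

In the sampler these prior weights would be multiplied by the word and response likelihoods $f$ and $g$ before normalizing; the sketch covers only the prior.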
The probability of assigning table $t$ in restaurant $d$ to a path $c$ is
\[
P(c_{d,t} = c \mid \text{rest}) \propto P(c_{d,t} = c \mid c^{-d,t}) \cdot P(w_{d,t}, y_d \mid c_{d,t} = c, w^{-d,t}, c^{-d,t}, t, z, \eta) \tag{6}
\]
where we slightly abuse the notation by using $w_{d,t} \equiv \cup_{\{s \mid t_{d,s} = t\}} w_{d,s}$ to denote the set of customers in all the groups sitting at table $t$ in restaurant $d$. The first factor is the prior probability of a path given all tables’ path assignments $c^{-d,t}$, excluding table $t$ in restaurant $d$, and is given in Equation 4. The second factor in Equation 6 is the probability of observing $w_{d,t}$ and $y_d$ given the new path assignments,
\[
P(w_{d,t}, y_d \mid c_{d,t} = c, w^{-d,t}, c^{-d,t}, t, z, \eta) = f_c^{-d,t}(w_{d,t}) \cdot g_c^{-d,t}(y_d).
\]

Optimizing η and τ: We optimize the regression parameters $\eta$ and $\tau$ via the likelihood,
\[
\mathcal{L}(\eta, \tau) = -\frac{1}{2\rho} \sum_{d=1}^{D} (y_d - \eta^T \bar{z}_d - \tau^T \bar{w}_d)^2 - \frac{1}{2\sigma} \sum_{k=1}^{K^+} (\eta_k - \mu)^2 - \frac{1}{\omega} \sum_{v=1}^{V} |\tau_v|, \tag{7}
\]
where $K^+$ is the number of nodes in the tree (footnote 2). This maximization is performed using L-BFGS [13].

4 Data: Congress, Products, Films

We conduct our experiments using three datasets: Congressional floor debates, Amazon product reviews, and movie reviews. For all datasets, we remove stopwords, add bigrams to the vocabulary, and filter the vocabulary using tf-idf (footnote 3).

• U.S. Congressional floor debates: We downloaded debates of the 109th US Congress from GovTrack (footnote 4) and preprocessed them as in Thomas et al. [14]. To remove uninteresting, non-polarized debates, we ignore bills with less than 20% “Yea” votes or less than 20% “Nay” votes. Each document $d$ is a turn (a continuous utterance by a single speaker, i.e. speech segment [14]), and its response variable $y_d$ is the first dimension of the speaker’s DW-NOMINATE score [15], which captures the traditional left-right political distinction (footnote 5). After processing, our corpus contains 5,201 turns in the House, 3,060 turns in the Senate, and 5,000 words in the vocabulary (footnote 6).
• Amazon product reviews: From a set of Amazon reviews of manufactured products such as computers, MP3 players, GPS devices, etc. [16], we focused on the 50 most frequently reviewed products. After filtering, this corpus contains 37,191 reviews with a vocabulary of 5,000 words. We use the rating associated with each review as the response variable $y_d$ (footnote 7).
• Movie reviews: Our third corpus is a set of 5,006 reviews of movies [17], again using review ratings as the response variable $y_d$, although in this corpus the ratings are normalized to the range from 0 to 1. After preprocessing, the vocabulary contains 5,000 words.

5 Evaluating Prediction

SHLDA’s response variable predictions provide a formally rigorous way to assess whether it is an improvement over prior methods. We evaluate effectiveness in predicting values of the response variables for unseen documents in the three datasets. For comparison we consider these baselines:

• Multiple linear regression (MLR) models the response variable as a linear function of multiple features (or regressors). Here, we consider two types of features: topic-based features and lexically-based features. Topic-based MLR, denoted by MLR-LDA, uses the topic distributions learned by vanilla LDA as features [12], while lexically-based MLR, denoted by MLR-VOC, uses the frequencies of words in the vocabulary as features. MLR-LDA-VOC uses both features.
• Support vector regression (SVM) is a discriminative method [18] that uses LDA topic distributions (SVM-LDA), word frequencies (SVM-VOC), and both (SVM-LDA-VOC) as features (footnote 8).
• Supervised topic model (SLDA): we implemented SLDA using Gibbs sampling.
  The version of SLDA we use is slightly different from the original SLDA described in [12], in that we place a Gaussian prior $\mathcal{N}(0, 1)$ over the regression parameters to perform L2-norm regularization (footnote 9).

Footnote 2: The superscript + is to denote that this number is unbounded and varies during the sampling process.
Footnote 3: To find bigrams, we begin with bigram candidates that occur at least 10 times in the corpus and use Pearson’s $\chi^2$-test to filter out those that have $\chi^2$-value less than 5, which corresponds to a significance level of 0.025. We then treat selected bigrams as single word types and add them to the vocabulary.
Footnote 4: http://www.govtrack.us/data/us/109/
Footnote 5: Scores were downloaded from http://voteview.com/dwnomin_joint_house_and_senate.htm
Footnote 6: Data will be available after blind review.
Footnote 7: The ratings can range from 1 to 5, but skew positive.
Footnote 8: http://svmlight.joachims.org/
Footnote 9: This performs better than unregularized SLDA in our experiments.

Table 2: Regression results for Pearson’s correlation coefficient (PCC, higher is better (↑)) and mean squared error (MSE, lower is better (↓)). Results on Amazon product reviews and movie reviews are averaged over 5 folds. Subscripts denote the number of topics for parametric models. For SVM-LDA-VOC and MLR-LDA-VOC, only best results across K ∈ {10, 30, 50} are reported. Best results are marked with *.

                Floor Debates                         Amazon Reviews     Movie Reviews
                House-Senate       Senate-House
Models          PCC ↑    MSE ↓     PCC ↑    MSE ↓     PCC ↑    MSE ↓     PCC ↑    MSE ↓
SVM-LDA_10      0.173    0.861     0.08     1.247     0.157    1.241     0.327    0.970
SVM-LDA_30      0.172    0.840     0.155    1.183     0.277    1.091     0.365    0.938
SVM-LDA_50      0.169    0.832     0.215    1.135     0.245    1.130     0.395    0.906
SVM-VOC         0.336    1.549     0.131    1.467     0.373    0.972     0.584    0.681
SVM-LDA-VOC     0.256    0.784     0.246    1.101     0.371    0.965     0.585    0.678
MLR-LDA_10      0.163    0.735     0.068    1.151     0.143    1.034     0.328    0.957
MLR-LDA_30      0.160    0.737     0.162    1.125     0.258    1.065     0.367    0.936
MLR-LDA_50      0.150    0.741     0.248    1.081     0.234    1.114     0.389    0.914
MLR-VOC         0.322    0.889     0.191    1.124     0.408    0.869     0.568    0.721
MLR-LDA-VOC     0.319    0.873     0.194    1.120     0.410    0.860*    0.581    0.702
SLDA_10         0.154    0.729*    0.090    1.145     0.270    1.113     0.383    0.953
SLDA_30         0.174    0.793     0.128    1.188     0.357    1.146     0.433    0.852
SLDA_50         0.254    0.897     0.245    1.184     0.241    1.939     0.503    0.772
SHLDA           0.356*   0.753     0.303*   1.076*    0.413*   0.891     0.597*   0.673*

For parametric models (LDA and SLDA), which require the number of topics K to be specified beforehand, we use K ∈ {10, 30, 50}. We use symmetric Dirichlet priors in both LDA and SLDA, initialize the Dirichlet hyperparameters to 0.5, and use slice sampling [19] for updating hyperparameters. For SLDA, the variance of the regression is set to 0.5. For SHLDA, we use trees with maximum depth of three. We slice sample m, π, β and γ, and fix µ = 0, σ = 0.5, ω = 0.5 and ρ = 0.5. We found that the following set of initial hyperparameters works reasonably well for all the datasets in our experiments: m = 0.5, π = 100, β = (1.0, 0.5, 0.25), γ = (1, 1), α = 1.
We also set the regression parameter of the root node to zero. This speeds inference, since the root is associated with every document, and it is reasonable to assume that the root would not change the response variable. To compare the performance of different methods, we compute Pearson’s correlation coefficient (PCC) and mean squared error (MSE) between the true and predicted values of the response variables and average over 5 folds. For the Congressional debate corpus, following Yu et al. [20], we use documents in the House to train and test on documents in the Senate and vice versa.

Results and analysis

Table 2 shows the performance of all models on our three datasets. Methods that only use topic-based features such as SVM-LDA and MLR-LDA do poorly. Methods based only on lexical features, like SVM-VOC and MLR-VOC, significantly outperform methods based only on topic features for the two review datasets, but are comparable or worse on congressional debates. This suggests that reviews have more highly discriminative words than political speeches (Table 3). Combining topic-based and lexically-based features improves performance, which supports our choice of incorporating both per-topic and per-word regression parameters in SHLDA.

In all cases, SHLDA achieves strong performance results. For the two cases where SHLDA was second best in MSE score (Amazon reviews and House-Senate), it outperforms other methods in PCC. Doing well in PCC for these two datasets is important since achieving low MSE is relatively easier due to the response variables’ bimodal distribution in the floor debates and positively-skewed distribution in Amazon reviews. For the floor debate dataset, the results of the House-Senate experiment are generally better than those of the Senate-House experiment, which is consistent with previous results [20] and is explained by the greater number of debates in the House.
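The two evaluation metrics used in this section can be sketched in plain Python. This is a generic illustration of Pearson's correlation coefficient and mean squared error, not the authors' evaluation code; the function names `mse` and `pearson` and the toy values are assumptions.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error between true and predicted responses."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson(y_true, y_pred):
    """Pearson's correlation coefficient between two sequences."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    if var_t == 0 or var_p == 0:
        # Correlation is undefined when either sequence is constant.
        raise ValueError("PCC is undefined for a constant sequence")
    return cov / math.sqrt(var_t * var_p)


if __name__ == "__main__":
    y_true = [1.0, 2.0, 3.0]
    y_pred = [2.0, 4.0, 6.0]  # right ordering, wrong scale
    print(pearson(y_true, y_pred), mse(y_true, y_pred))
```

The toy prediction has the right ordering but the wrong scale, so its correlation is perfect while its squared error is large. The two metrics can disagree in the other direction as well, which is why both are reported.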
6 Qualitative Analysis: Agendas and Framing/Perspective

Although a formal coherence evaluation [21] remains a goal for future work, a qualitative look at the topic hierarchy uncovered by the model suggests that it is indeed capturing agenda/framing structure as discussed in Section 1. In Figure 3, a portion of the topic hierarchy induced from the Congressional debate corpus, Nodes A and E illustrate agendas (issues introduced into political discourse) associated with a particular ideology: Node A focuses on the hardships of the poorer victims of Hurricane Katrina and is associated with Democrats, and text associated with Node E discusses a proposed constitutional amendment to ban flag burning and is associated with Republicans. Nodes C and D, children of a neutral “tax” topic, reveal how parties frame taxes as gains in terms of new social services (Democrats) and losses for job creators (Republicans).

Figure 3: Topics discovered from Congressional floor debates. Many first-level topics are bipartisan (purple), while lower level topics are associated with specific ideologies (Democrats blue, Republicans red). For example, the “tax” topic (B) is bipartisan, but its Democratic-leaning child (D) focuses on social goals supported by taxes (“children”, “education”, “health care”), while its Republican-leaning child (C) focuses on business implications (“death tax”, “jobs”, “businesses”). The number below each topic denotes the magnitude of the learned regression parameter associated with that topic.

Colors and the numbers beneath each topic show the regression parameter η associated with the topic. Figure 4 shows the topic structure discovered by SHLDA in the review corpus. Nodes at higher levels are relatively neutral, with relatively small regression parameters (footnote 10). These nodes have general topics with no specific polarity. However, the bottom level clearly illustrates polarized positive/negative perspective. For example, Node A concerns washbasins for infants, and has two polarized children nodes: reviewers take a positive perspective when their children enjoy the product (Node B: “loves”, “splash”, “play”) but have negative reactions when it leaks (Node C: “leak(s/ed/ing)”).
Figure 4: Topics discovered from Amazon reviews. Higher topics are general, while lower topics are more specific. The polarity of the review is encoded in the color: red (negative) to blue (positive). Many of the first-level topics have no specific polarity and are associated with a broad class of products such as “routers” (Node D). However, the lowest topics in the hierarchy are often polarized; one child topic of “router” focuses on upgradable firmware such as “tomato” and “ddwrt” (Node E, positive) while another focuses on poor “tech support” and “customer service” (Node F, negative). The number below each topic is the regression parameter learned with that topic.

In addition to the per-topic regression parameters, SHLDA also associates each word with a lexical regression parameter τ. Table 3 shows the top ten words with highest and lowest τ. The results are unsurprising, although the lexical regression for the Congressional debates is less clear-cut than for the other datasets. As we saw in Section 5, for similar datasets, SHLDA’s context-specific regression is more useful when global lexical weights do not readily differentiate documents.

Footnote 10: All of the nodes at the second level have slightly negative values for the regression parameters, mainly due to the very skewed distribution of the review ratings in Amazon.
Table 3: Top words based on the global lexical regression coefficient, τ. For the floor debates, positive τ’s are Republican-leaning while negative τ’s are Democrat-leaning.
Dataset | Top 10 words with positive weights | Top 10 words with negative weights
Floor Debates | bringing, private property, illegally, tax relief, regulation, mandates, constitutional, committee report, illegal alien | bush administration, strong opposition, ranking, republicans, republican leadership, secret, discriminate, majority, undermine
Amazon Reviews | highly recommend, pleased, love, loves, perfect, easy, excellent, amazing, glad, happy | waste, returned, return, stopped, leak, junk, useless, returning, refund, terrible
Movie Reviews | hilarious, fast, schindler, excellent, motion pictures, academy award, perfect, journey, fortunately, ability | bad, unfortunately, supposed, waste, mess, worst, acceptable, awful, suppose, boring
7 Related Work
SHLDA joins a family of LDA extensions that introduce hierarchical topics, supervision, or both. Owing to limited space, we focus here on related work that combines the two. Petinot et al. [22] propose hierarchical Labeled LDA (hLLDA), which leverages an observed document ontology to learn topics in a tree structure; however, hLLDA assumes that the underlying tree structure is known a priori. SSHLDA [23] generalizes hLLDA by allowing the document hierarchy labels to be partially observed, with unobserved labels and topic tree structure then inferred from the data. Boyd-Graber and Resnik [24] used hierarchical distributions within topics to learn topics across languages. In addition to these “upstream” models [25], Perotte et al. [26] propose a “downstream” model called HSLDA, which jointly models documents’ hierarchy of labels and topics. HSLDA’s topic structure is flat, however, and the response variable is a hierarchy of labels associated with each document, unlike SHLDA’s continuous response variable.
Finally, another related body of work includes models that jointly capture topics and other facets such as ideologies/perspectives [27, 28] and sentiments/opinions [29], albeit with discrete rather than continuously valued responses. Computational modeling of sentiment polarity is a voluminous field [30], and many computational political science models describe agendas [5] and ideology [31]. Looking at framing or bias at the sentence level, Greene and Resnik [32] investigate the role of syntactic structure in framing, Yano et al. [33] look at lexical indications of sentence-level bias, and Recasens et al. [34] develop linguistically informed sentence-level features for identifying bias-inducing words.
8 Conclusion
We have introduced SHLDA, a model that associates a continuously valued response variable with hierarchical topics to capture both the issues under discussion and alternative perspectives on those issues. The two-level structure improves predictive performance over existing models on multiple datasets, while also adding potentially insightful hierarchical structure to the topic analysis. Based on a preliminary qualitative analysis, the topic hierarchy exposed by the model plausibly captures the idea of agenda setting, which is related to the issues that get discussed, and framing, which is related to authors’ perspectives on those issues. We plan to analyze the topic structure produced by SHLDA with political science collaborators and, more generally, to study how SHLDA and related models can help analyze and discover useful insights from political discourse.
Acknowledgments
This research was supported in part by NSF under grant #1211153 (Resnik) and #1018625 (Boyd-Graber and Resnik). Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsor.
References
[1] McCombs, M. The agenda-setting role of the mass media in the shaping of public opinion. North, 2009(05-12):21, 2002.
[2] McCombs, M., S. Ghanem. The convergence of agenda setting and framing. In Framing public life. 2001.
[3] Baumgartner, F. R., S. L. De Boef, A. E. Boydstun. The decline of the death penalty and the discovery of innocence. Cambridge University Press, 2008.
[4] Blei, D. M., A. Ng, M. Jordan. Latent Dirichlet allocation. JMLR, 3, 2003.
[5] Grimmer, J. A Bayesian hierarchical topic model for political texts: Measuring expressed agendas in Senate press releases. Political Analysis, 18(1):1–35, 2010.
[6] Zhang, J. Explore objects and categories in unexplored environments based on multimodal data. Ph.D. thesis, University of Hamburg, 2012.
[7] Blei, D. M., T. L. Griffiths, M. I. Jordan. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. J. ACM, 57(2), 2010.
[8] Teh, Y. W., M. I. Jordan, M. J. Beal, et al. Hierarchical Dirichlet processes. JASA, 101(476), 2006.
[9] Paisley, J. W., C. Wang, D. M. Blei, et al. Nested hierarchical Dirichlet processes. arXiv:1210.6738, 2012.
[10] Ahmed, A., L. Hong, A. Smola. The nested Chinese restaurant franchise process: User tracking and document modeling. In ICML. 2013.
[11] Kim, J. H., D. Kim, S. Kim, et al. Modeling topic hierarchies with the recursive Chinese restaurant process. In CIKM, pages 783–792. 2012.
[12] Blei, D. M., J. D. McAuliffe. Supervised topic models. In NIPS. 2007.
[13] Liu, D., J. Nocedal. On the limited memory BFGS method for large scale optimization. Math. Prog., 1989.
[14] Thomas, M., B. Pang, L. Lee. Get out the vote: Determining support or opposition from Congressional floor-debate transcripts. In EMNLP. 2006.
[15] Lewis, J. B., K. T. Poole. Measuring bias and uncertainty in ideal point estimates via the parametric bootstrap. Political Analysis, 12(2), 2004.
[16] Jindal, N., B. Liu. Opinion spam and analysis. In WSDM. 2008.
[17] Pang, B., L. Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In ACL. 2005.
[18] Joachims, T. Making large-scale SVM learning practical. In Adv. in Kernel Methods - SVM. 1999.
[19] Neal, R. M. Slice sampling. Annals of Statistics, 31:705–767, 2003.
[20] Yu, B., D. Diermeier, S. Kaufmann. Classifying party affiliation from political speech. JITP, 2008.
[21] Chang, J., J. Boyd-Graber, C. Wang, et al. Reading tea leaves: How humans interpret topic models. In NIPS. 2009.
[22] Petinot, Y., K. McKeown, K. Thadani. A hierarchical model of web summaries. In HLT. 2011.
[23] Mao, X., Z. Ming, T.-S. Chua, et al. SSHLDA: A semi-supervised hierarchical topic model. In EMNLP. 2012.
[24] Boyd-Graber, J., P. Resnik. Holistic sentiment analysis across languages: Multilingual supervised latent Dirichlet allocation. In EMNLP. 2010.
[25] Mimno, D. M., A. McCallum. Topic models conditioned on arbitrary features with Dirichlet-multinomial regression. In UAI. 2008.
[26] Perotte, A. J., F. Wood, N. Elhadad, et al. Hierarchically supervised latent Dirichlet allocation. In NIPS. 2011.
[27] Ahmed, A., E. P. Xing. Staying informed: Supervised and semi-supervised multi-view topical analysis of ideological perspective. In EMNLP. 2010.
[28] Eisenstein, J., A. Ahmed, E. P. Xing. Sparse additive generative models of text. In ICML. 2011.
[29] Jo, Y., A. H. Oh. Aspect and sentiment unification model for online review analysis. In WSDM. 2011.
[30] Pang, B., L. Lee. Opinion Mining and Sentiment Analysis. Now Publishers Inc, 2008.
[31] Monroe, B. L., M. P. Colaresi, K. M. Quinn. Fightin’ words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis, 16(4):372–403, 2008.
[32] Greene, S., P. Resnik. More than words: Syntactic packaging and implicit sentiment. In NAACL. 2009.
[33] Yano, T., P. Resnik, N. A. Smith. Shedding (a thousand points of) light on biased language. In NAACL-HLT Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk. 2010.
[34] Recasens, M., C. Danescu-Niculescu-Mizil, D. Jurafsky. Linguistic models for analyzing and detecting biased language. In ACL. 2013.
4 0.66931856 98 nips-2013-Documents as multiple overlapping windows into grids of counts
Author: Alessandro Perina, Nebojsa Jojic, Manuele Bicego, Andrzej Truski
Abstract: In text analysis, documents are often represented as disorganized bags of words; models of such count features are typically based on mixing a small number of topics [1, 2]. Recently, it has been observed that for many text corpora documents evolve into one another in a smooth way, with some features dropping and new ones being introduced. The counting grid [3] models this spatial metaphor literally: it is a grid of word distributions learned in such a way that a document’s own distribution of features can be modeled as the sum of the histograms found in a window into the grid. The major drawback of this method is that it is essentially a mixture and all the content must be generated by a single contiguous area on the grid. This may be problematic, especially for lower-dimensional grids. In this paper, we overcome this issue by introducing the Componential Counting Grid, which brings the componential nature of topic models to the basic counting grid. We evaluated our approach on document classification and multimodal retrieval, obtaining state-of-the-art results on standard benchmarks.
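The window idea in the abstract above can be sketched in a few lines: a document's word distribution is taken as the normalized sum of the per-cell word distributions inside one window of the grid. The function name `window_histogram`, the array layout, and the wrap-around at the grid edges are assumptions made for illustration, not details stated in the abstract.

```python
import numpy as np

def window_histogram(grid, top, left, height, width):
    """Average the word distributions inside one window of a counting grid.

    grid: array of shape (rows, cols, vocab); each cell holds a distribution
          over the vocabulary (non-negative, summing to 1).
    The window wraps around the grid edges (a toroidal layout, assumed here).
    """
    rows, cols, _ = grid.shape
    # Row and column indices covered by the window, with wrap-around.
    r = (top + np.arange(height)) % rows
    c = (left + np.arange(width)) % cols
    cells = grid[np.ix_(r, c)]  # shape (height, width, vocab)
    # Averaging the cell distributions is the sum normalized by the window size.
    return cells.mean(axis=(0, 1))

# Example: a random 4x6 grid over a 5-word vocabulary.
rng = np.random.default_rng(0)
example_grid = rng.dirichlet(np.ones(5), size=(4, 6))
example_doc = window_histogram(example_grid, top=3, left=5, height=2, width=2)
```

Because every document is tied to a single window, all of its content has to come from one contiguous area, which is the limitation the abstract says the componential variant addresses.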
5 0.64284396 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models
Author: Jianfei Chen, Jun Zhu, Zi Wang, Xun Zheng, Bo Zhang
Abstract: Logistic-normal topic models can effectively discover correlation structures among latent topics. However, their inference remains a challenge because of the non-conjugacy between the logistic-normal prior and multinomial topic mixing proportions. Existing algorithms either make restricting mean-field assumptions or are not scalable to large-scale applications. This paper presents a partially collapsed Gibbs sampling algorithm that approaches the provably correct distribution by exploring the ideas of data augmentation. To improve time efficiency, we further present a parallel implementation that can deal with large-scale applications and learn the correlation structures of thousands of topics from millions of documents. Extensive empirical results demonstrate the promise.
6 0.63745004 301 nips-2013-Sparse Additive Text Models with Low Rank Background
7 0.61410046 274 nips-2013-Relevance Topic Model for Unstructured Social Group Activity Recognition
8 0.59315062 155 nips-2013-Learning Hidden Markov Models from Non-sequence Data via Tensor Decomposition
9 0.52979046 88 nips-2013-Designed Measurements for Vector Count Data
10 0.47921893 312 nips-2013-Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex
11 0.45272386 92 nips-2013-Discovering Hidden Variables in Noisy-Or Networks using Quartet Tests
12 0.45169869 345 nips-2013-Variance Reduction for Stochastic Gradient Optimization
13 0.44537503 10 nips-2013-A Latent Source Model for Nonparametric Time Series Classification
14 0.42488953 74 nips-2013-Convex Tensor Decomposition via Structured Schatten Norm Regularization
15 0.41883191 203 nips-2013-Multilinear Dynamical Systems for Tensor Time Series
16 0.40514156 197 nips-2013-Moment-based Uniform Deviation Bounds for $k$-means and Friends
17 0.39170077 5 nips-2013-A Deep Architecture for Matching Short Texts
18 0.39163023 36 nips-2013-Annealing between distributions by averaging moments
19 0.38485101 295 nips-2013-Simultaneous Rectification and Alignment via Robust Recovery of Low-rank Tensors
20 0.36208257 11 nips-2013-A New Convex Relaxation for Tensor Completion
topicId topicWeight
[(16, 0.021), (33, 0.112), (34, 0.086), (41, 0.016), (49, 0.374), (56, 0.151), (70, 0.021), (85, 0.032), (89, 0.024), (93, 0.06)]
simIndex simValue paperId paperTitle
1 0.96868891 323 nips-2013-Synthesizing Robust Plans under Incomplete Domain Models
Author: Tuan A. Nguyen, Subbarao Kambhampati, Minh Do
Abstract: Most current planners assume complete domain models and focus on generating correct plans. Unfortunately, domain modeling is a laborious and error-prone task, so real-world agents have to plan with incomplete domain models. While domain experts cannot guarantee completeness, often they are able to circumscribe the incompleteness of the model by providing annotations as to which parts of the domain model may be incomplete. In such cases, the goal should be to synthesize plans that are robust with respect to any known incompleteness of the domain. In this paper, we first introduce annotations expressing the knowledge of the domain incompleteness and formalize the notion of plan robustness with respect to an incomplete domain model. We then show an approach to compiling the problem of finding robust plans to the conformant probabilistic planning problem, and present experimental results with the Probabilistic-FF planner.
2 0.92849571 6 nips-2013-A Determinantal Point Process Latent Variable Model for Inhibition in Neural Spiking Data
Author: Jasper Snoek, Richard Zemel, Ryan P. Adams
Abstract: Point processes are popular models of neural spiking behavior as they provide a statistical distribution over temporal sequences of spikes and help to reveal the complexities underlying a series of recorded action potentials. However, the most common neural point process models, the Poisson process and the gamma renewal process, do not capture interactions and correlations that are critical to modeling populations of neurons. We develop a novel model based on a determinantal point process over latent embeddings of neurons that effectively captures and helps visualize complex inhibitory and competitive interaction. We show that this model is a natural extension of the popular generalized linear model to sets of interacting neurons. The model is extended to incorporate gain control or divisive normalization, and the modulation of neural spiking based on periodic phenomena. Applied to neural spike recordings from the rat hippocampus, we see that the model captures inhibitory relationships, a dichotomy of classes of neurons, and a periodic modulation by the theta rhythm known to be present in the data.
3 0.89983338 274 nips-2013-Relevance Topic Model for Unstructured Social Group Activity Recognition
Author: Fang Zhao, Yongzhen Huang, Liang Wang, Tieniu Tan
Abstract: Unstructured social group activity recognition in web videos is a challenging task due to 1) the semantic gap between class labels and low-level visual features and 2) the lack of labeled training data. To tackle this problem, we propose a “relevance topic model” for jointly learning meaningful mid-level representations upon bag-of-words (BoW) video representations and a classifier with sparse weights. In our approach, sparse Bayesian learning is incorporated into an undirected topic model (i.e., Replicated Softmax) to discover topics which are relevant to video classes and suitable for prediction. Rectified linear units are utilized to increase the expressive power of topics so as to better explain video data containing complex contents and make variational inference tractable for the proposed model. An efficient variational EM algorithm is presented for model parameter estimation and inference. Experimental results on the Unstructured Social Activity Attribute dataset show that our model achieves state-of-the-art performance and outperforms other supervised topic models in terms of classification accuracy, particularly in the case of a very small number of labeled training videos.
4 0.89029783 266 nips-2013-Recurrent linear models of simultaneously-recorded neural populations
Author: Marius Pachitariu, Biljana Petreska, Maneesh Sahani
Abstract: Population neural recordings with long-range temporal structure are often best understood in terms of a common underlying low-dimensional dynamical process. Advances in recording technology provide access to an ever-larger fraction of the population, but the standard computational approaches available to identify the collective dynamics scale poorly with the size of the dataset. We describe a new, scalable approach to discovering low-dimensional dynamics that underlie simultaneously recorded spike trains from a neural population. We formulate the Recurrent Linear Model (RLM) by generalising the Kalman-filter-based likelihood calculation for latent linear dynamical systems to incorporate a generalised-linear observation process. We show that RLMs describe motor-cortical population data better than either directly-coupled generalised-linear models or latent linear dynamical system models with generalised-linear observations. We also introduce the cascaded generalised-linear model (CGLM) to capture low-dimensional instantaneous correlations in neural populations. The CGLM describes the cortical recordings better than either Ising or Gaussian models and, like the RLM, can be fit exactly and quickly. The CGLM can also be seen as a generalisation of a low-rank Gaussian model, in this case factor analysis. The computational tractability of the RLM and CGLM allows both to scale to very high-dimensional neural data.
5 0.86707765 131 nips-2013-Geometric optimisation on positive definite matrices for elliptically contoured distributions
Author: Suvrit Sra, Reshad Hosseini
Abstract: Hermitian positive definite (hpd) matrices recur throughout machine learning, statistics, and optimisation. This paper develops (conic) geometric optimisation on the cone of hpd matrices, which allows us to globally optimise a large class of nonconvex functions of hpd matrices. Specifically, we first use the Riemannian manifold structure of the hpd cone for studying functions that are nonconvex in the Euclidean sense but are geodesically convex (g-convex), hence globally optimisable. We then go beyond g-convexity, and exploit the conic geometry of hpd matrices to identify another class of functions that remain amenable to global optimisation without requiring g-convexity. We present key results that help recognise g-convexity and also the additional structure alluded to above. We illustrate our ideas by applying them to likelihood maximisation for a broad family of elliptically contoured distributions: for this maximisation, we derive novel, parameter-free fixed-point algorithms. To our knowledge, ours are the most general results on geometric optimisation of hpd matrices known so far. Experiments show the advantages of using our fixed-point algorithms.
same-paper 6 0.86419851 70 nips-2013-Contrastive Learning Using Spectral Methods
7 0.83439827 221 nips-2013-On the Expressive Power of Restricted Boltzmann Machines
8 0.77886868 303 nips-2013-Sparse Overlapping Sets Lasso for Multitask Learning and its Application to fMRI Analysis
9 0.76209611 345 nips-2013-Variance Reduction for Stochastic Gradient Optimization
10 0.74187434 121 nips-2013-Firing rate predictions in optimal balanced networks
11 0.69081265 262 nips-2013-Real-Time Inference for a Gamma Process Model of Neural Spiking
12 0.64551395 141 nips-2013-Inferring neural population dynamics from multiple partial recordings of the same neural circuit
13 0.64177018 64 nips-2013-Compete to Compute
14 0.63299841 246 nips-2013-Perfect Associative Learning with Spike-Timing-Dependent Plasticity
16 0.62036842 236 nips-2013-Optimal Neural Population Codes for High-dimensional Stimulus Variables
17 0.61110801 304 nips-2013-Sparse nonnegative deconvolution for compressive calcium imaging: algorithms and phase transitions
18 0.60562849 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization
19 0.60542244 301 nips-2013-Sparse Additive Text Models with Low Rank Background
20 0.6049251 148 nips-2013-Latent Maximum Margin Clustering