
176 nips-2011-Multi-View Learning of Word Embeddings via CCA


Source: pdf

Author: Paramveer Dhillon, Dean P. Foster, Lyle H. Ungar

Abstract: Recently, there has been substantial interest in using large amounts of unlabeled data to learn word representations which can then be used as features in supervised classifiers for NLP tasks. However, most current approaches are slow to train, do not model the context of the word, and lack theoretical grounding. In this paper, we present a new learning method, Low Rank Multi-View Learning (LR-MVL), which uses a fast spectral method to estimate low dimensional context-specific word representations from unlabeled data. These representation features can then be used with any supervised learner. LR-MVL is extremely fast, gives guaranteed convergence to a global optimum, is theoretically elegant, and achieves state-of-the-art performance on named entity recognition (NER) and chunking problems.

1 Introduction and Related Work

Over the past decade there has been increased interest in using unlabeled data to supplement the labeled data in semi-supervised learning settings, in order to overcome the inherent data sparsity and get improved generalization accuracies in high dimensional domains like NLP. Approaches like [1, 2] have been empirically very successful and have achieved excellent accuracies on a variety of NLP tasks. However, it is often difficult to adapt these approaches for use in conjunction with an existing supervised NLP system, as they enforce a particular choice of model.

An increasingly popular alternative is to learn representational embeddings for words from a large collection of unlabeled data (typically using a generative model), and to use these embeddings to augment the feature set of a supervised learner. Embedding methods produce features in low dimensional spaces or over a small vocabulary size, unlike the traditional approach of working in the original high dimensional vocabulary space with only one dimension "on" at a given time. Broadly, these embedding methods fall into two categories:

1. Clustering based word representations: Clustering methods, often hierarchical, are used to group distributionally similar words based on their contexts. The two dominant approaches are Brown Clustering [3] and [4]. As recently shown, HMMs can also be used to induce a multinomial distribution over possible clusters [5].

2. Dense representations: These representations are dense, low dimensional and real-valued. Each dimension of these representations captures latent information about a combination of syntactic and semantic word properties. They can either be induced using neural networks, like C&W embeddings [6] and Hierarchical log-linear (HLBL) embeddings [7], or by eigen-decomposition of the word co-occurrence matrix, e.g. Latent Semantic Analysis/Latent Semantic Indexing (LSA/LSI) [8].

Unfortunately, most of these representations are 1) slow to train, 2) sensitive to the scaling of the embeddings (especially ℓ2-based approaches like LSA/PCA), 3) liable to get stuck in local optima (like an EM-trained HMM), and 4) limited to a single embedding per word type; i.e. all occurrences of the word "bank" will have the same embedding, irrespective of whether the context of the word suggests it means "a financial institution" or "a river bank".

In this paper, we propose a novel context-specific word embedding method called Low Rank Multi-View Learning, LR-MVL, which is fast to train and is guaranteed to converge to the optimal solution.
As presented here, our LR-MVL embeddings are context-specific, but context-oblivious embeddings (like the ones used by [6, 7]) can be trivially obtained from our model. Furthermore, building on recent advances in spectral learning for sequence models like HMMs [9, 10, 11], we show that LR-MVL has strong theoretical grounding. In particular, we show that LR-MVL estimates low dimensional context-specific word embeddings which preserve all the information in the data if the data were generated by an HMM. Moreover, LR-MVL, being linear, does not face the danger of getting stuck in local optima as is the case for an EM-trained HMM. LR-MVL falls into category (2) mentioned above; it learns real-valued context-specific word embeddings by performing Canonical Correlation Analysis (CCA) [12] between the past and future views of low rank approximations of the data. However, LR-MVL is more general than those methods, which work on bigram or trigram co-occurrence matrices, in that it uses longer word sequence information to estimate context-specific embeddings, and also for the reasons mentioned in the last paragraph.

The remainder of the paper is organized as follows. In the next section we give a brief overview of CCA, which forms the core of our method. Section 3 describes our proposed LR-MVL algorithm in detail and gives theory supporting its performance. Section 4 demonstrates the effectiveness of LR-MVL on the NLP tasks of Named Entity Recognition and Chunking. We conclude with a brief summary in Section 5.

2 Brief Review: Canonical Correlation Analysis (CCA)

CCA [12] is the analog to Principal Component Analysis (PCA) for pairs of matrices. PCA computes the directions of maximum covariance between elements in a single matrix, whereas CCA computes the directions of maximal correlation between a pair of matrices. Unlike PCA, CCA does not depend on how the observations are scaled. This invariance of CCA to linear data transformations allows proofs that keeping the dominant singular vectors (those with largest singular values) will faithfully capture any state information.

More specifically, given a set of n paired observation vectors {(l1, r1), ..., (ln, rn)} — in our case the two matrices are the left (L) and right (R) context matrices of a word — we would like to simultaneously find the directions Φl and Φr that maximize the correlation of the projections of L onto Φl with the projections of R onto Φr. This is expressed as

    max over Φl, Φr of   E[⟨L, Φl⟩⟨R, Φr⟩] / √( E[⟨L, Φl⟩²] E[⟨R, Φr⟩²] )    (1)

where E denotes the empirical expectation. We use the notation Clr (Cll) to denote the cross (auto) covariance matrices between L and R (i.e. L'R and L'L respectively). The left and right canonical correlates are the solutions Φl, Φr of the following equations:

    Cll⁻¹ Clr Crr⁻¹ Crl Φl = λΦl
    Crr⁻¹ Crl Cll⁻¹ Clr Φr = λΦr    (2)

3 Low Rank Multi-View Learning (LR-MVL)

In LR-MVL, we compute the CCA between the past and future views of the data on a large unlabeled corpus to find the common latent structure, i.e., the hidden state associated with each token. These induced representations of the tokens can then be used as features in a supervised classifier (typically discriminative).

The context around a word, consisting of the h words to the right and left of it, sits in a high dimensional space, since for a vocabulary of size v, each of the h words in the context requires an indicator function of dimension v. The key move in LR-MVL is to project the v-dimensional word space down to a k-dimensional state space; a minimal numerical sketch of the CCA computation in Eq. (2), which the steps below rely on repeatedly, is given next.
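The sketch below is an illustrative NumPy implementation written for this summary, not the authors' code; the cca helper name, the small ridge term, and the toy matrices standing in for the L and R context matrices are all assumptions made for the example. It solves Eq. (2) via an SVD of the whitened cross-covariance, which is a standard, numerically stable route to the same canonical correlates.

```python
import numpy as np

def cca(L, R, k, ridge=1e-6):
    """Return the top-k left/right canonical correlates (Phi_l, Phi_r) of L and R.

    Equivalent to the eigenvalue problems of Eq. (2):
        Cll^-1 Clr Crr^-1 Crl Phi_l = lambda Phi_l
        Crr^-1 Crl Cll^-1 Clr Phi_r = lambda Phi_r
    computed here through the SVD of the whitened cross-covariance.
    """
    n = L.shape[0]
    L = L - L.mean(axis=0)
    R = R - R.mean(axis=0)
    Cll = L.T @ L / n + ridge * np.eye(L.shape[1])   # auto-covariance of L (regularized)
    Crr = R.T @ R / n + ridge * np.eye(R.shape[1])   # auto-covariance of R (regularized)
    Clr = L.T @ R / n                                # cross-covariance

    def inv_sqrt(C):
        # Symmetric inverse square root, used as a whitening transform.
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Wl, Wr = inv_sqrt(Cll), inv_sqrt(Crr)
    U, s, Vt = np.linalg.svd(Wl @ Clr @ Wr)          # singular values = canonical correlations
    Phi_l = Wl @ U[:, :k]
    Phi_r = Wr @ Vt[:k, :].T
    return Phi_l, Phi_r

# Toy usage with hypothetical sizes: 1000 tokens, 40-dim left/right views, k = 10.
rng = np.random.RandomState(0)
H = rng.randn(1000, 10)                              # shared "hidden state"
L = H @ rng.randn(10, 40) + 0.1 * rng.randn(1000, 40)
R = H @ rng.randn(10, 40) + 0.1 * rng.randn(1000, 40)
Phi_l, Phi_r = cca(L, R, k=10)
print(Phi_l.shape, Phi_r.shape)                      # (40, 10) (40, 10)
```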
Because of this projection, all eigenvector computations are done in a space that is v/k times smaller than the original space. Since a typical vocabulary contains at least 50,000 words, and we use state spaces of order k ≈ 50 dimensions, this gives a 1,000-fold reduction in the size of the calculations that are needed. The core of our LR-MVL algorithm is a fast spectral method for learning a v × k matrix A which maps each of the v words in the vocabulary to a k-dimensional state vector. We call this matrix the "eigenfeature dictionary".

We now describe the LR-MVL method, give a theorem that provides intuition into how it works, and formally present the LR-MVL algorithm. The Experiments section then shows that this low rank approximation allows us to achieve state-of-the-art performance on NLP tasks.

3.1 The LR-MVL method

Given an unlabeled token sequence w = {w0, w1, ..., wn} we want to learn a low (k)-dimensional state vector {z0, z1, ..., zn} for each observed token. The key is to find a v × k matrix A (Algorithm 1) that maps each of the v words in the vocabulary to a reduced rank k-dimensional state vector, which is later used to induce context specific embeddings for the tokens (Algorithm 2). For supervised learning, these context specific embeddings are supplemented with other information about each token wt, such as its identity, orthographic features such as prefixes and suffixes, or membership in domain-specific lexicons, and used as features in a classifier.

Section 3.4 gives the algorithm more formally, but the key steps in the algorithm are, in general terms:

• Take the h words to the left and to the right of each target word wt (the "Left" and "Right" contexts), and project them each down to k dimensions using A.
• Take the CCA between the reduced rank left and right contexts, and use the resulting model to estimate a k-dimensional state vector (the "hidden state") for each token.
• Take the CCA between the hidden states and the tokens wt. The singular vectors associated with wt form a new estimate of the eigenfeature dictionary.

LR-MVL can be viewed as a type of co-training [13]: the state of each token wt is similar to that of the tokens both before and after it, and it is also similar to the states of the other occurrences of the same word elsewhere in the document (used in the outer iteration). LR-MVL takes advantage of these two different types of similarity by alternately estimating word state using CCA on the smooths of the states of the words before and after each target token, and using the average over the states associated with all other occurrences of that word.

3.2 Theoretical Properties of LR-MVL

We now present the theory behind the LR-MVL algorithm; in particular, we show that the reduced rank matrix A allows a significant data reduction while preserving the information in our data, and that the estimated state does the best possible job of capturing any label information that can be inferred by a linear model.

Let L be an n × hv matrix giving the words in the left context of each of the n tokens, where the context is of length h, let R be the corresponding n × hv matrix for the right context, and let W be an n × v matrix of indicator functions for the words themselves. We will use the following assumptions at various points in our proof:

Assumption 1. L, W, and R come from a rank k HMM, i.e. it has a rank k observation matrix and a rank k transition matrix, both of which have the same domain.
For example, if the dimension of the hidden state is k and the vocabulary size is v, then the observation matrix, which is k × v, has rank k. This rank condition is similar to the one used by [10].

Assumption 1A. For the three views L, W and R, assume that there exists a "hidden state" H of dimension n × k, where each row Hi has the same non-singular variance-covariance matrix, and such that E(Li | Hi) = Hi βL^T, E(Ri | Hi) = Hi βR^T and E(Wi | Hi) = Hi βW^T, where all β's are of rank k and Li, Ri and Wi are the rows of L, R and W respectively. Assumption 1A follows from Assumption 1.

Assumption 2. ρ(L, W), ρ(L, R) and ρ(W, R) all have rank k, where ρ(X1, X2) is the expected correlation between X1 and X2. Assumption 2 is a rank condition similar to that in [9].

Assumption 3. ρ([L, R], W) has k distinct singular values. Assumption 3 just makes the proof a little cleaner, since if there are repeated singular values, then the singular vectors are not unique. Without it, we would have to phrase results in terms of subspaces with identical singular values.

We also need to define the CCA function that computes the left and right singular vectors for a pair of matrices:

Definition 1 (CCA). Compute the CCA between two matrices X1 and X2. Let ΦX1 be a matrix containing the d largest singular vectors for X1 (sorted from the largest on down), and likewise for ΦX2. Define the function CCAd(X1, X2) = [ΦX1, ΦX2]. When we want just one of these Φ's, we will use CCAd(X1, X2)left = ΦX1 for the left singular vectors and CCAd(X1, X2)right = ΦX2 for the right singular vectors.

Note that the resulting singular vectors [ΦX1, ΦX2] can be used to give two redundant estimates, X1ΦX1 and X2ΦX2, of the "hidden" state relating X1 and X2, if such a hidden state exists.

Definition 2. Define the symbol "≈" to mean X1 ≈ X2 ⟺ lim_{n→∞} X1 = lim_{n→∞} X2, where n is the sample size.

Lemma 1. Define A by the following limit of the right singular vectors: CCAk([L, R], W)right ≈ A. Then, under assumptions 2, 3 and 1A, if CCAk(L, R) ≡ [ΦL, ΦR], then CCAk([LΦL, RΦR], W)right ≈ A.

Lemma 1 shows that instead of finding the CCA between the full context and the words, we can take the CCA between the Left and Right contexts, estimate a k-dimensional state from them, take the CCA of that state with the words, and get the same result. See the supplementary material for the proof.

Let Ãh denote a matrix formed by stacking h copies of A on top of each other. Right multiplying L or R by Ãh projects each of the words in that context into the k-dimensional reduced rank space. The following theorem addresses the core of the LR-MVL algorithm, showing that there is an A which gives the desired dimensionality reduction. Specifically, it shows that the previous lemma also holds in the reduced rank space.

Theorem 1. Under assumptions 1, 2 and 3 there exists a unique matrix A such that if CCAk(LÃh, RÃh) ≡ [Φ̃L, Φ̃R], then CCAk([LÃh Φ̃L, RÃh Φ̃R], W)right ≈ A, where Ãh is the stacked form of A. See the supplementary material for the proof.

It is worth noting that our matrix A corresponds to the matrix Û used by [9, 10]. They showed that Û is sufficient to compute the probability of a sequence of words generated by an HMM; although we do not show it here (due to limited space), our A provides a more statistically efficient estimate of Û than theirs, and hence can also be used to estimate the sequence probabilities.
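To make this two-stage construction concrete, the fragment below (an illustrative sketch written for this summary, not the authors' code, and reusing the hypothetical cca helper from the earlier sketch) walks through Lemma 1: first a CCA between the left and right contexts to get a reduced state estimate, then a CCA between that state and the word indicators, whose correlates for W play the role of the eigenfeature dictionary A.

```python
import numpy as np

# Assumed available from the earlier sketch: cca(X1, X2, k) -> (Phi_1, Phi_2).
# Hypothetical shapes: n tokens, hv-dimensional contexts L and R, n x v word-indicator matrix W.
def estimate_eigenfeature_dictionary(L, R, W, k):
    # Stage 1: CCA between the left and right contexts gives two redundant
    # estimates of the hidden state; concatenate them as the state estimate.
    Phi_L, Phi_R = cca(L, R, k)
    Z = np.hstack([L @ Phi_L, R @ Phi_R])   # n x 2k state estimate

    # Stage 2: CCA between the estimated state and the word-indicator matrix W.
    # The correlates on the W side map each of the v word types to a
    # k-dimensional eigenfeature vector, i.e. they play the role of A.
    _, Phi_W = cca(Z, W, k)                 # Phi_W is v x k
    return Phi_W
```

Because W contains one-hot indicators, in practice this second CCA reduces to averaging state vectors per word type (as Step 8 of Algorithm 1 below notes); the explicit CCA call here is only to mirror the statement of the lemma.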
Under the above assumptions, there is asymptotically (in the limit of infinite data) no benefit to first estimating state by finding the CCA between the left and right contexts and then finding the CCA between the estimated state and the words; one could instead just directly find the CCA between the combined left and right contexts and the words. However, because of the Zipfian distribution of words, many words are rare or even unique, and hence one is not in the asymptotic limit. In this case, CCA between the rare words and context will not be informative, whereas finding the CCA between the left and right contexts gives a good state vector estimate even for unique words. One can then fruitfully find the CCA between the contexts and the estimated state vector for their associated words.

3.3 Using Exponential Smooths

In practice, we replace the projected left and right contexts with exponential smooths of them at a few different time scales (a weighted average of the previous (or next) token's state, i.e. Zt−1 (or Zt+1), and the previous (or next) token's smoothed state, i.e. St−1 (or St+1)), thus giving a further dimension reduction by a factor of the context length h (say 100 words) divided by the number of smooths (often 5-7). We use a mixture of both very short and very long contexts, which capture the short and long range dependencies required by NLP problems such as NER, Chunking, WSD etc. Since exponential smooths are linear, we preserve the linearity of our method.

3.4 The LR-MVL Algorithm

The LR-MVL algorithm (using exponential smooths) is given in Algorithm 1; it computes the pair of CCAs described above in Theorem 1.

Algorithm 1: LR-MVL Algorithm - Learning from Large Amounts of Unlabeled Data
1: Input: Token sequence Wn×v, state space size k, smoothing rates αj
2: Initialize the eigenfeature dictionary A to random values N(0, 1).
3: repeat
4:   Set the state Zt (1 < t ≤ n) of each token wt to the eigenfeature vector of the corresponding word: Zt = (Aw : w = wt)
5:   Smooth the state estimates before and after each token to get a pair of views for each smoothing rate αj:
       St^(l,j) = (1 − αj) St−1^(l,j) + αj Zt−1    // left view L
       St^(r,j) = (1 − αj) St+1^(r,j) + αj Zt+1    // right view R
     where the t-th rows of L and R are, respectively, concatenations of the smooths St^(l,j) and St^(r,j) for each of the αj's.
6:   Find the left and right canonical correlates, which are the eigenvectors Φl and Φr of
       (L'L)⁻¹ L'R (R'R)⁻¹ R'L Φl = λΦl
       (R'R)⁻¹ R'L (L'L)⁻¹ L'R Φr = λΦr
7:   Project the left and right views onto the space spanned by the top k/2 left and right CCAs respectively:
       Xl = L Φl^(k/2)  and  Xr = R Φr^(k/2)
     where Φl^(k/2), Φr^(k/2) are matrices composed of the singular vectors of Φl, Φr with the k/2 largest magnitude singular values. Estimate the state for each word wt as the union of the left and right estimates: Z = [Xl, Xr]
8:   Estimate the eigenfeatures of each word type w as the average of the states estimated for that word: Aw = avg(Zt : wt = w)
9:   Compute the change in A from the previous iteration
10: until |ΔA| < ε (a small convergence threshold)
11: Output: Φl^k, Φr^k, A

A few iterations (∼5) of the above algorithm are sufficient to converge to the solution. (Since the problem is convex, there is a single solution, so there is no issue of local minima.) As [14] show for PCA, one can start with a random matrix that is only slightly larger than the true rank k of the correlation matrix, and with extremely high likelihood converge in a few iterations to within a small distance of the true principal components.
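The following is a compact, illustrative sketch of Algorithm 1's main loop, written for this summary under simplifying assumptions rather than being the authors' implementation: it assumes k is even, reuses the hypothetical cca helper from the earlier sketch, represents the corpus as an array of word ids, and runs a fixed number of iterations in place of the convergence test in Steps 9-10.

```python
import numpy as np

def lr_mvl_train(word_ids, v, k, alphas=(0.1, 0.5), n_iters=5, seed=0):
    """Simplified sketch of Algorithm 1: learn a v x k eigenfeature dictionary A."""
    word_ids = np.asarray(word_ids)
    rng = np.random.RandomState(seed)
    n = len(word_ids)
    A = rng.randn(v, k)                        # Step 2: random init of eigenfeature dictionary
    for _ in range(n_iters):                   # Step 3: repeat
        Z = A[word_ids]                        # Step 4: current state of every token (n x k)
        left_views, right_views = [], []
        for a in alphas:                       # Step 5: exponential smooths at several rates
            Sl = np.zeros((n, k))
            Sr = np.zeros((n, k))
            for t in range(1, n):              # left view: smooth of states before token t
                Sl[t] = (1 - a) * Sl[t - 1] + a * Z[t - 1]
            for t in range(n - 2, -1, -1):     # right view: smooth of states after token t
                Sr[t] = (1 - a) * Sr[t + 1] + a * Z[t + 1]
            left_views.append(Sl)
            right_views.append(Sr)
        L = np.hstack(left_views)              # t-th row concatenates the smooths for each alpha
        R = np.hstack(right_views)
        Phi_l, Phi_r = cca(L, R, k // 2)       # Steps 6-7: CCA, keep top k/2 correlates per side
        Z_new = np.hstack([L @ Phi_l, R @ Phi_r])   # union of left/right state estimates (n x k)
        for w in range(v):                     # Step 8: average states per word type
            mask = word_ids == w
            if mask.any():
                A[w] = Z_new[mask].mean(axis=0)
    return A, Phi_l, Phi_r
```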
In our case, if the assumptions detailed above (1, 1A, 2 and 3) are satisfied, our method converges equally rapidly to the true canonical variates.

As mentioned earlier, we get further dimensionality reduction in Step 5 by replacing the Left and Right context matrices with a set of exponentially smoothed values of the reduced rank projections of the context words. Step 6 finds the CCA between the Left and Right contexts. Step 7 estimates the state by combining the estimates from the left and right contexts, since we don't know which will best estimate the state. Step 8 takes the CCA between the estimated state Z and the matrix of words W. Because W is a matrix of indicator functions, this CCA takes the trivial form of a set of averages.

Once we have estimated the CCA model, it is used to generate context specific embeddings for the tokens from the training, development and test sets (as described in Algorithm 2). These embeddings are further supplemented with other baseline features and used in a supervised learner to predict the label of the token.

Algorithm 2: LR-MVL Algorithm - Inducing Context Specific Embeddings for Train/Dev/Test Data
1: Input: Model (Φl^k, Φr^k, A) output from the above algorithm, and token sequences Wtrain, (Wdev, Wtest)
2: Project the left and right views L and R after smoothing onto the space spanned by the top k left and right CCAs respectively,
     Xl = L Φl^k  and  Xr = R Φr^k,
   and the words onto the eigenfeature dictionary, Xw = Wtrain A
3: Form the final embedding matrix Xtrain:embed by concatenating these three estimates of state: Xtrain:embed = [Xl, Xw, Xr]
4: Output: The embedding matrices Xtrain:embed, (Xdev:embed, Xtest:embed) with context-specific representations for the tokens.

These embeddings are augmented with the baseline set of features mentioned in Sections 4.1.1 and 4.1.2 before learning the final classifier. Note that we can get context "oblivious" embeddings, i.e. one embedding per word type, just by using the eigenfeature dictionary (Av×k) output by Algorithm 1.

4 Experimental Results

In this section we present the experimental results of LR-MVL on the Named Entity Recognition (NER) and Syntactic Chunking tasks. We compare LR-MVL to state-of-the-art semi-supervised approaches like [1] (Alternating Structure Optimization (ASO)) and [2] (a semi-supervised extension of CRFs), as well as to embeddings like C&W, HLBL and Brown Clustering.

4.1 Datasets and Experimental Setup

For the NER experiments we used the data from the CoNLL 2003 shared task, and for the Chunking experiments we used the CoNLL 2000 shared task data, with standard training, development and testing set splits. The CoNLL '03 and CoNLL '00 datasets had ∼204K/51K/46K and ∼212K/−/47K tokens respectively for the Train/Dev./Test sets. (More details about the data and competitions are available at http://www.cnts.ua.ac.be/conll2003/ner/ and http://www.cnts.ua.ac.be/conll2000/chunking/.)

4.1.1 Named Entity Recognition (NER)

We use the same set of baseline features as used by [15, 16] in their experiments. The detailed list of features is as below (an illustrative sketch of assembling such features follows the list):

• Current word wi; its type information: all-capitalized, is-capitalized, all-digits and so on; prefixes and suffixes of wi.
• Word tokens in a window of 2 around the current word, i.e. d = (wi−2, wi−1, wi, wi+1, wi+2); and the capitalization pattern in the window.
• Previous two predictions yi−1 and yi−2, and the conjunction of d and yi−1.
• Embedding features (LR-MVL, C&W, HLBL, Brown etc.) in a window of 2 around the current word (if applicable).

Following [17] we use a regularized averaged perceptron model with the above set of baseline features for the NER task.
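As a concrete illustration of the baseline-plus-embedding feature set described above, the hypothetical sketch below was written for this summary; the exact feature templates follow [15, 16, 17] and are not reproduced verbatim here, and the function and parameter names are assumptions made for the example.

```python
import numpy as np

def token_features(words, i, prev_tags, embeddings, window=2, scale=0.5):
    """Assemble sparse features for token i plus its dense embedding features.

    words:      list of token strings for one sentence
    prev_tags:  tags already predicted for positions < i (greedy decoding)
    embeddings: dict mapping a word to its (e.g. 150-dim) LR-MVL embedding vector
    """
    w = words[i]
    feats = {
        "w=" + w: 1.0,
        "is_cap=" + str(w[:1].isupper()): 1.0,
        "all_caps=" + str(w.isupper()): 1.0,
        "all_digits=" + str(w.isdigit()): 1.0,
        "prefix3=" + w[:3]: 1.0,
        "suffix3=" + w[-3:]: 1.0,
    }
    # Words in a window of 2 around the current word.
    for o in range(-window, window + 1):
        if 0 <= i + o < len(words):
            feats[f"w[{o}]=" + words[i + o]] = 1.0
    # Previous two predictions and a simple conjunction with the current word.
    if i >= 1:
        feats["y[-1]=" + prev_tags[i - 1]] = 1.0
        feats["w+y[-1]=" + w + "_" + prev_tags[i - 1]] = 1.0
    if i >= 2:
        feats["y[-2]=" + prev_tags[i - 2]] = 1.0
    # Dense embedding features in the same window, scaled as described in Section 4.1.1
    # (unit l2 norm, then multiplied by a tuned constant such as 0.5).
    dense = []
    for o in range(-window, window + 1):
        if 0 <= i + o < len(words) and words[i + o] in embeddings:
            e = np.asarray(embeddings[words[i + o]], dtype=float)
            dense.append(scale * e / (np.linalg.norm(e) + 1e-12))
    return feats, dense
```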
We also used their BILOU text chunk representation and fast greedy inference, as these were shown to give superior performance. We further augment the above set of baseline features with gazetteers, as is standard practice in NER experiments.

We tuned our free parameter, namely the size of the LR-MVL embedding, on the development set, and scaled our embedding features to have an ℓ2 norm of 1 for each token, further multiplying them by a normalization constant (also chosen by cross validation) so that, when they are used in conjunction with other categorical features in a linear classifier, they do not exert extra influence. The size of LR-MVL embeddings (state-space) that gave the best performance on the development set was k = 50 (50 each for Xl, Xw, Xr in Algorithm 2), i.e. the total size of the embeddings was 50 × 3, and the best normalization constant was 0.5. We omit validation plots due to paucity of space.

4.1.2 Chunking

For our chunking experiments we use a similar base set of features as above:

• Current word wi and word tokens in a window of 2 around the current word, i.e. d = (wi−2, wi−1, wi, wi+1, wi+2).
• POS tags ti in a window of 2 around the current word.
• Word conjunction features wi ∩ wi+1, i ∈ {−1, 0}, and tag conjunction features ti ∩ ti+1, i ∈ {−2, −1, 0, 1} and ti ∩ ti+1 ∩ ti+2, i ∈ {−2, −1, 0}.
• Embedding features in a window of 2 around the current word (when applicable).

Since the CoNLL '00 chunking data does not have a development set, we randomly sampled 1000 sentences from the training data (8936 sentences) for development. So we trained our chunking models on 7936 training sentences, evaluated their F1 score on the 1000 development sentences, and used a CRF (http://www.chokkan.org/software/crfsuite/) as the supervised classifier. We tuned the size of the embedding and the magnitude of the ℓ2 regularization penalty in the CRF on the development set, and took the log (or −log of the magnitude) of the value of the embedding features (our embeddings are learnt using a linear model whereas a CRF is a log-linear model, so this normalization keeps things on the same scale). The regularization penalty that gave the best performance on the development set was 2, and here again the best size of LR-MVL embeddings (state-space) was k = 50. Finally, we trained the CRF on the entire ("original") training data, i.e. 8936 sentences.

4.1.3 Unlabeled Data and Induction of Embeddings

For inducing the embeddings we used the RCV1 corpus, containing Reuters newswire from Aug '96 to Aug '97 and about 63 million tokens in 3.3 million sentences. (We chose this particular dataset to make a fair comparison with [1, 16], who report results using RCV1 as unlabeled data.) Case was left intact and we did not do the "cleaning" done by [18, 16], i.e. removing all sentences which are less than 90% lowercase a-z, as our multi-view learning approach is robust to such noisy data, like news byline text (mostly all caps) which does not correlate strongly with the text of the article.

We induced our LR-MVL embeddings over a period of 3 days (70 core hours on a 3.0 GHz CPU) on the entire RCV1 data by performing 4 iterations with a vocabulary size of 300k, using a variety of smoothing rates (α in Algorithm 1) to capture correlations between shorter and longer contexts, α = [0.005, 0.01, 0.05, 0.1, 0.5, 0.9]; theoretically we could tune the smoothing parameters on the development set, but we found this mixture of long and short term dependencies to work well in practice. As far as the other embeddings are concerned, i.e. C&W, HLBL and Brown Clusters, we downloaded them from http://metaoptimize.com/projects/wordreprs.
The details about their induction and parameter tuning can be found in [16]; we report their best numbers here. It is also worth noting that the unsupervised training of LR-MVL was more than 1.5 times faster than that of the other embeddings (and some of those were trained on GPGPUs, which makes our method even faster by comparison).

4.2 Results

The results for NER and Chunking are shown in Tables 1 and 2, respectively, which show that LR-MVL performs significantly better than state-of-the-art competing methods on both the NER and Chunking tasks.

Table 1: NER Results (F1-score on the CoNLL '03 development and test sets). Notes: 1) LR-MVL (CO) are the context-oblivious embeddings obtained from the eigenfeature dictionary A in Algorithm 1. 2) F1-score is the harmonic mean of precision and recall. 3) The current state-of-the-art for this NER task is 90.90 (test set), but it uses 700 billion tokens of unlabeled data [19].

Embedding/Model              Dev. Set    Test Set
-- No Gazetteers --
Baseline                     90.03       84.39
C&W, 200-dim                 92.46       87.46
HLBL, 100-dim                92.00       88.13
Brown, 1000 clusters         92.32       88.52
Ando & Zhang '05             93.15       89.31
Suzuki & Isozaki '08         93.66       89.36
LR-MVL (CO), 50×3-dim        93.11       89.55
LR-MVL, 50×3-dim             93.61       89.91
-- With Gazetteers --
HLBL, 100-dim                92.91       89.35
C&W, 200-dim                 92.98       88.88
Brown, 1000 clusters         93.25       89.41
LR-MVL (CO), 50×3-dim        93.91       89.89
LR-MVL, 50×3-dim             94.41       90.06

Table 2: Chunking Results (F1-score on the CoNLL '00 test set).

Embedding/Model              Test Set F1-Score
Baseline                     93.79
HLBL, 50-dim                 94.00
C&W, 50-dim                  94.10
Brown, 3200 clusters         94.11
Ando & Zhang '05             94.39
Suzuki & Isozaki '08         94.67
LR-MVL (CO), 50×3-dim        95.02
LR-MVL, 50×3-dim             95.44

It is important to note that in problems like NER the final accuracy depends on performance on rare words, and since LR-MVL is robustly able to correlate the past with future views, it learns better representations for rare words, resulting in better overall accuracy. On rare words (occurring < 10 times in the corpus), we got 11.7%, 10.7% and 9.6% relative reduction in error over C&W, HLBL and Brown respectively for NER; on chunking the corresponding numbers were 6.7%, 7.1% and 8.7%.

It is also worth mentioning that modeling the context in the embeddings gives decent improvements in accuracy on both the NER and Chunking problems. For NER, the polysemous words were mostly ones like Chicago, Wales, Oakland etc., which could be either a location or an organization (sports teams, banks etc.), so when we do not use the gazetteer features (which are known lists of cities, persons, organizations etc.) we see a larger increase in F-score from modeling context; when gazetteer features are already present they capture most of the information about polysemous words for the NER dataset, and modeling the context does not help as much. The polysemous words for the Chunking dataset were ones like spot (VP/NP), never (VP/ADVP), more (NP/VP/ADVP/ADJP) etc., and in this case embeddings with context helped significantly, giving a 3.1-6.5% relative improvement in accuracy over context-oblivious embeddings.

5 Summary and Conclusion

In this paper, we presented a novel CCA-based multi-view learning method, LR-MVL, for large scale sequence learning problems such as arise in NLP.
LR-MVL is a spectral method that works in a low dimensional state-space, so it is computationally efficient and can be trained using large amounts of unlabeled data; moreover, it does not get stuck in local optima like an EM-trained HMM. The embeddings learnt using LR-MVL can be used as features with any supervised learner. LR-MVL has strong theoretical grounding, is much simpler and faster than competing methods, and achieves state-of-the-art accuracies on the NER and Chunking problems.

Acknowledgements: The authors would like to thank Alexander Yates, Ted Sandler and the three anonymous reviewers for providing valuable feedback. We would also like to thank Lev Ratinov and Joseph Turian for answering our questions regarding their paper [16].

References

[1] Ando, R., Zhang, T.: A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research 6 (2005) 1817-1853
[2] Suzuki, J., Isozaki, H.: Semi-supervised sequential labeling and segmentation using giga-word scale unlabeled data. In: ACL (2008)
[3] Brown, P., deSouza, P., Mercer, R., Pietra, V.D., Lai, J.: Class-based n-gram models of natural language. Computational Linguistics 18 (December 1992) 467-479
[4] Pereira, F., Tishby, N., Lee, L.: Distributional clustering of English words. In: 31st Annual Meeting of the ACL (1993) 183-190
[5] Huang, F., Yates, A.: Distributional representations for handling sparsity in supervised sequence-labeling. ACL '09, Stroudsburg, PA, USA, Association for Computational Linguistics (2009) 495-503
[6] Collobert, R., Weston, J.: A unified architecture for natural language processing: deep neural networks with multitask learning. ICML '08, New York, NY, USA, ACM (2008) 160-167
[7] Mnih, A., Hinton, G.: Three new graphical models for statistical language modelling. ICML '07, New York, NY, USA, ACM (2007) 641-648
[8] Dumais, S., Furnas, G., Landauer, T., Deerwester, S., Harshman, R.: Using latent semantic analysis to improve access to textual information. In: SIGCHI Conference on Human Factors in Computing Systems, ACM (1988) 281-285
[9] Hsu, D., Kakade, S., Zhang, T.: A spectral algorithm for learning hidden Markov models. In: COLT (2009)
[10] Siddiqi, S., Boots, B., Gordon, G.J.: Reduced-rank hidden Markov models. In: AISTATS (2010)
[11] Song, L., Boots, B., Siddiqi, S.M., Gordon, G.J., Smola, A.J.: Hilbert space embeddings of hidden Markov models. In: ICML (2010)
[12] Hotelling, H.: Canonical correlation analysis (CCA). Journal of Educational Psychology (1935)
[13] Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: COLT '98 (1998) 92-100
[14] Halko, N., Martinsson, P.G., Tropp, J.: Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. (Dec 2010)
[15] Zhang, T., Johnson, D.: A robust risk minimization based named entity recognition system. CoNLL '03 (2003) 204-207
[16] Turian, J., Ratinov, L., Bengio, Y.: Word representations: a simple and general method for semi-supervised learning. ACL '10, Stroudsburg, PA, USA, Association for Computational Linguistics (2010) 384-394
[17] Ratinov, L., Roth, D.: Design challenges and misconceptions in named entity recognition. In: CoNLL (2009) 147-155
[18] Liang, P.: Semi-supervised learning for natural language. Master's thesis, Massachusetts Institute of Technology (2005)
[19] Lin, D., Wu, X.: Phrase clustering for discriminative learning. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, Volume 2. ACL '09, Stroudsburg, PA, USA, Association for Computational Linguistics (2009) 1030-1038

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract Recently, there has been substantial interest in using large amounts of unlabeled data to learn word representations which can then be used as features in supervised classifiers for NLP tasks. [sent-8, score-0.554]

2 In this paper, we present a new learning method, Low Rank Multi-View Learning (LR-MVL) which uses a fast spectral method to estimate low dimensional context-specific word representations from unlabeled data. [sent-10, score-0.552]

3 LR-MVL is extremely fast, gives guaranteed convergence to a global optimum, is theoretically elegant, and achieves state-ofthe-art performance on named entity recognition (NER) and chunking problems. [sent-12, score-0.38]

4 An increasingly popular alternative is to learn representational embeddings for words from a large collection of unlabeled data (typically using a generative model), and to use these embeddings to augment the feature set of a supervised learner. [sent-16, score-0.934]

5 Embedding methods produce features in low dimensional spaces or over a small vocabulary size, unlike the traditional approach of working in the original high dimensional vocabulary space with only one dimension “on” at a given time. [sent-17, score-0.376]

6 Clustering based word representations: Clustering methods, often hierarchical, are used to group distributionally similar words based on their contexts. [sent-19, score-0.362]

7 Dense representations: These representations are dense, low dimensional and real-valued. [sent-23, score-0.142]

8 Each dimension of these representations captures latent information about a combination of syntactic and semantic word properties. [sent-24, score-0.316]

9 They can either be induced using neural networks like C&W; embeddings [6] and Hierarchical log-linear (HLBL) embeddings [7] or by eigen-decomposition of the word co-occurrence matrix, e. [sent-25, score-0.945]

10 sensitive to the scaling of the embeddings (especially 2 based approaches like LSA/PCA), 3). [sent-30, score-0.358]

11 learn a single embedding for a given word type; i. [sent-32, score-0.364]

12 all the occurrences 1 of the word “bank” will have the same embedding, irrespective of whether the context of the word suggests it means “a financial institution” or “a river bank”. [sent-34, score-0.637]

13 In this paper, we propose a novel context-specific word embedding method called Low Rank MultiView Learning, LR-MVL, which is fast to train and is guaranteed to converge to the optimal solution. [sent-35, score-0.364]

14 As presented here, our LR-MVL embeddings are context-specific, but context oblivious embeddings (like the ones used by [6, 7]) can be trivially gotten from our model. [sent-36, score-0.835]

15 Particularly, we show that LR-MVL estimates low dimensional context-specific word embeddings which preserve all the information in the data if the data were generated by an HMM. [sent-38, score-0.673]

16 LR-MVL falls into category (2) mentioned above; it learns real-valued context-specific word embeddings by performing Canonical Correlation Analysis (CCA) [12] between the past and future views of low rank approximations of the data. [sent-40, score-0.784]

17 However, LR-MVL is more general than those methods, which work on bigram or trigram co-occurrence matrices, in that it uses longer word sequence information to estimate context-specific embeddings and also for the reasons mentioned in the last paragraph. [sent-41, score-0.587]

18 This invariance of CCA to linear data transformations allows proofs that keeping the dominant singular vectors (those with largest singular values) will faithfully capture any state information. [sent-50, score-0.309]

19 , (ln , rn )}–in our case the two matrices are the left (L) and right (R) context matrices of a word–we would like to simultaneously find the directions Φl and Φr that maximize the correlation of the projections of L onto Φl with the projections of R onto Φr . [sent-54, score-0.328]

20 These induced representations of the tokens can then be used as features in a supervised classifier (typically discriminative). [sent-63, score-0.361]

21 The context around a word, consisting of the h words to the right and left of it, sits in a high dimensional space, since for a vocabulary of size v, each of the h words in the context requires an indicator function of dimension v. [sent-64, score-0.621]

22 The key move in LR-MVL is to project the v-dimensional word 2 space down to a k dimensional state space. [sent-65, score-0.389]

23 Since a typical vocabulary contains at least 50, 000 words, and we use state spaces of order k ≈ 50 dimensions, this gives a 1,000-fold reduction in the size of calculations that are needed. [sent-67, score-0.163]

24 The core of our LR-MVL algorithm is a fast spectral method for learning a v × k matrix A which maps each of the v words in the vocabulary to a k-dimensional state vector. [sent-68, score-0.365]

25 1 The LR-MVL method Given an unlabeled token sequence w={w0 , w1 , . [sent-73, score-0.303]

26 , wn } we want to learn a low (k)- dimensional state vector {z0 , z1 , . [sent-76, score-0.161]

27 The key is to find a v ×k matrix A (Algorithm 1) that maps each of the v words in the vocabulary to a reduced rank k-dimensional state vector, which is later used to induce context specific embeddings for the tokens (Algorithm 2). [sent-80, score-0.994]

28 For supervised learning, these context specific embeddings are supplemented with other information about each token wt , such as its identity, orthographic features such as prefixes and suffixes or membership in domain-specific lexicons, and used as features in a classifier. [sent-81, score-0.878]

29 4 gives the algorithm more formally, but the key steps in the algorithm are, in general terms: • Take the h words to the left and to the right of each target word wt (the “Left” and “Right” contexts), and project them each down to k dimensions using A. [sent-83, score-0.544]

30 • Take the CCA between the reduced rank left and right contexts, and use the resulting model to estimate a k dimensional state vector (the “hidden state”) for each token. [sent-84, score-0.348]

31 • Take the CCA between the hidden states and the tokens wt . [sent-85, score-0.301]

32 The singular vectors associated with wt form a new estimate of the eigenfeature dictionary. [sent-86, score-0.307]

33 LR-MVL can be viewed as a type of co-training [13]: The state of each token wt is similar to that of the tokens both before and after it, and it is also similar to the states of the other occurrences of the same word elsewhere in the document (used in the outer iteration). [sent-87, score-0.808]

34 LR-MVL takes advantage of these two different types of similarity by alternately estimating word state using CCA on the smooths of the states of the words before and after each target token and using the average over the states associated with all other occurrences of that word. [sent-88, score-0.806]

35 Let L be an n × hv matrix giving the words in the left context of each of the n tokens, where the context is of length h, R be the corresponding n × hv matrix for the right context, and W be an n × v matrix of indicator functions for the words themselves. [sent-91, score-0.651]

36 it has a rank k observation matrix and rank k transition matrix both of which have the same domain. [sent-95, score-0.276]

37 For example, if the dimension of the hidden state is k and the vocabulary size is v then the observation matrix, which is k × v, has rank k. [sent-96, score-0.316]

38 ρ(L, W), ρ(L, R) and ρ(W, R) all have rank k, where ρ(X1 , X2 ) is the expected correlation between X1 and X2 . [sent-102, score-0.137]

39 Assumption 3 just makes the proof a little cleaner, since if there are repeated singular values, then the singular vectors are not unique. [sent-106, score-0.234]

40 We also need to define the CCA function that computes the left and right singular vectors for a pair of matrices: Definition 1 (CCA). [sent-108, score-0.249]

41 Let ΦX1 be a matrix containing the d largest singular vectors for X1 (sorted from the largest on down). [sent-110, score-0.168]

42 When we want just one of these Φ’s, we will use CCAd (X1 , X2 )left = ΦX1 for the left singular vectors and CCAd (X1 , X2 )right = ΦX2 for the right singular vectors. [sent-113, score-0.351]

43 Note that the resulting singular vectors, [ΦX1 , ΦX2 ] can be used to give two redundant estimates, X1 ΦX1 and X2 ΦX2 of the “hidden” state relating X1 and X2 , if such a hidden state exists. [sent-114, score-0.303]

44 Define A by the following limit of the right singular vectors: CCAk ([L, R], W)right ≈ A. [sent-118, score-0.163]

45 Lemma 1 shows that instead of finding the CCA between the full context and the words, we can take the CCA between the Left and Right contexts, estimate a k dimensional state from them, and take the CCA of that state with the words and get the same result. [sent-120, score-0.385]

46 Right multiplying ˜ L or R by Ah projects each of the words in that context into the k-dimensional reduced rank space. [sent-123, score-0.283]

47 1 4 Under the above assumptions, there is asymptotically (in the limit of infinite data) no benefit to first estimating state by finding the CCA between the left and right contexts and then finding the CCA between the estimated state and the words. [sent-131, score-0.376]

48 One could instead just directly find the CCA between the combined left and rights contexts and the words. [sent-132, score-0.165]

49 In this case, CCA between the rare words and context will not be informative, whereas finding the CCA between the left and right contexts gives a good state vector estimate even for unique words. [sent-134, score-0.512]

50 One can then fruitfully find the CCA between the contexts and the estimated state vector for their associated words. [sent-135, score-0.184]

51 3 Using Exponential Smooths In practice, we replace the projected left and right contexts with exponential smooths (weighted average of the previous (or next) token’s state i. [sent-137, score-0.447]

52 ), of them at a few different time scales, thus giving a further dimension reduction by a factor of context length h (say 100 words) divided by the number of smooths (often 5-7). [sent-142, score-0.225]

53 Since exponential smooths are linear, we preserve the linearity of our method. [sent-144, score-0.146]

54 Algorithm 1 LR-MVL Algorithm - Learning from Large amounts of Unlabeled Data 1: Input: Token sequence Wn×v , state space size k, smoothing rates αj 2: Initialize the eigenfeature dictionary A to random values N (0, 1). [sent-147, score-0.257]

55 3: repeat 4: Set the state Zt (1 < t ≤ n) of each token wt to the eigenfeature vector of the corresponding word. [sent-148, score-0.435]

56 Zt = (Aw : w = wt ) 5: Smooth the state estimates before and after each token to get a pair of views for each smoothing rate αj . [sent-149, score-0.427]

57 (l,j) (r,j) th where the t rows of L and R are, respectively, concatenations of the smooths St and St for (j) each of the α s. [sent-151, score-0.146]

58 6: Find the left and right canonical correlates, which are the eigenvectors Φl and Φr of (L L)−1 L R(R R)−1 R LΦl = λΦl . [sent-152, score-0.164]

59 7: Project the left and right views on to the space spanned by the top k/2 left and right CCAs respectively (k/2) (k/2) Xl = LΦl and Xr = RΦr (k/2) (k/2) where Φl , Φr are matrices composed of the singular vectors of Φl , Φr with the k/2 largest magnitude singular values. [sent-154, score-0.564]

60 Estimate the state for each word wt as the union of the left and right estimates: Z = [Xl , Xr ] 8: Estimate the eigenfeatures of each word type, w, as the average of the states estimated for that word. [sent-155, score-0.777]

61 ) As [14] show for PCA, one can start with a random matrix that is only slightly larger than the true rank k of the correlation matrix, and with extremely high likelihood converge in a few iterations to within a small distance of the true principal components. [sent-159, score-0.173]

62 As mentioned earlier, we get further dimensionality reduction in Step 5, by replacing the Left and Right context matrices with a set of exponentially smoothed values of the reduced rank projections of the context words. [sent-161, score-0.293]

63 Step 7 estimates 5 the state by combining the estimates from the left and right contexts, since we don’t know which will best estimate the state. [sent-163, score-0.192]

64 Step 8 takes the CCA between the estimated state Z and the matrix of words W. [sent-164, score-0.213]

65 Once we have estimated the CCA model, it is used to generate context specific embeddings for the tokens from training, development and test sets (as described in Algorithm 2). [sent-166, score-0.648]

66 These embeddings are further supplemented with other baseline features and used in a supervised learner to predict the label of the token. [sent-167, score-0.524]

67 These embeddings are augmented with baseline set of features mentioned in Sections 4. [sent-169, score-0.422]

68 Note that we can get context “oblivious” embeddings i. [sent-174, score-0.406]

69 one embedding per word type, just by using the eigenfeature dictionary (Av×k ) output by Algorithm 1. [sent-176, score-0.507]

70 We compare LR-MVL to state-of-the-art semi-supervised approaches like [1] (Alternating Structures Optimization (ASO)) and [2] (Semi-supervised extension of CRFs) as well as embeddings like C&W;, HLBL and Brown Clustering. [sent-178, score-0.389]

71 The CoNLL ’03 and the CoNLL ’00 datasets had ∼ 204K/51K/46K and ∼ 212K/ − /47K tokens respectively for Train/Dev. [sent-181, score-0.185]

72 The detailed list of features is as below: • Current Word wi ; Its type information: all-capitalized, is-capitalized, all-digits and so on; Prefixes and suffixes of wi • Word tokens in window of 2 around the current word i. [sent-186, score-0.756]

73 ) in a window of 2 around the current word (if applicable). [sent-190, score-0.299]

74 The size of LR-MVL embeddings (state-space) that gave the best performance on the development set was k = 50 (50 each for Xl , Xw , Xr in Algorithm 2) i. [sent-203, score-0.384]

75 the total size of embeddings was 50×3, and the best normalization constant was 0. [sent-205, score-0.327]

76 2 Chunking For our chunking experiments we use a similar base set of features as above: • Current Word wi and word tokens in window of 2 around the current word i. [sent-210, score-1.146]

77 d = (wi−2 , wi−1 , wi , wi+1 , wi+2 ); • POS tags ti in a window of 2 around the current word. [sent-212, score-0.193]

78 • Word conjunction features wi ∩ wi+1 , i ∈ {−1, 0} and Tag conjunction features ti ∩ ti+1 , i ∈ {−2, −1, 0, 1} and ti ∩ ti+1 ∩ ti+2 , i ∈ {−2, −1, 0}. [sent-213, score-0.42]

79 • Embedding features in a window of 2 around the current word (when applicable). [sent-214, score-0.359]

80 Since CoNLL 00 chunking data does not have a development set, we randomly sampled 1000 sentences from the training data (8936 sentences) for development. [sent-215, score-0.372]

81 So, we trained our chunking models on 7936 training sentences and evaluated their F1 score on the 1000 development sentences and used a CRF 3 as the supervised classifier. [sent-216, score-0.511]

82 We tuned the size of embedding and the magnitude of 2 regularization penalty in CRF on the development set and took log (or -log of the magnitude) of the value of the features4 . [sent-217, score-0.161]

83 The regularization penalty that gave best performance on development set was 2 and here again the best size of LR-MVL embeddings (state-space) was k = 50. [sent-218, score-0.384]

84 3 Unlabeled Data and Induction of embeddings For inducing the embeddings we used the RCV1 corpus containing Reuters newswire from Aug ’96 to Aug ’97 and containing about 63 million tokens in 3. [sent-224, score-0.871]

85 We induced our LR-MVL embeddings over a period of 3 days (70 core hours on 3. [sent-229, score-0.359]

86 0 GHz CPU) on the entire RCV1 data by performing 4 iterations, a vocabulary size of 300k and using a variety of smoothing rates (α in Algorithm 1) to capture correlations between shorter and longer contexts α = [0. [sent-230, score-0.236]

87 org/software/crfsuite/ Our embeddings are learnt using a linear model whereas CRF is a log-linear model, so to keep things on same scale we did this normalization. [sent-248, score-0.327]

88 6 As some of these embeddings were trained on GPGPU which makes our method even faster comparatively. [sent-250, score-0.327]

89 LR-MVL (CO) are Context Oblivious embeddings which are gotten from (A) in Algorithm 1. [sent-280, score-0.369]

90 90 (Test Set) but using 700 billion tokens of unlabeled data [19]. [sent-285, score-0.303]

91 It is important to note that in problems like NER, the final accuracy depends on performance on rare-words and since LR-MVL is robustly able to correlate past with future views, it is able to learn better representations for rare words resulting in overall better accuracy. [sent-295, score-0.219]

92 6% relative reduction in error over C&W;, HLBL and Brown respectively for NER; on chunking the corresponding numbers were 6. [sent-299, score-0.236]

93 Also, it is worth mentioning that modeling the context in embeddings gives decent improvements in accuracies on both NER and Chunking problems. [sent-303, score-0.451]

94 For the case of NER, the polysemous words were mostly like Chicago, Wales, Oakland etc. [sent-304, score-0.195]

95 ) we got higher increase in F-score by modeling context, compared to the case when we already had gazetteer features which captured most of the information about polysemous words for NER dataset and modeling the context didn’t help as much. [sent-307, score-0.379]

96 The polysemous words for Chunking dataset were like spot (VP/NP), never (VP/ADVP), more (NP/VP/ADVP/ADJP) etc. [sent-308, score-0.195]

97 and in this case embeddings with context helped significantly, giving 3. [sent-309, score-0.406]

98 LR-MVL is a spectral method that works in low dimensional state-space so it is computationally efficient, and can be used to train using large amounts of unlabeled data; moreover it does not get stuck in local optima like an EM trained HMM. [sent-313, score-0.304]

99 The embeddings learnt using LR-MVL can be used as features with any supervised learner. [sent-314, score-0.447]

100 : A robust risk minimization based named entity recognition system. [sent-388, score-0.144]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('cca', 0.434), ('embeddings', 0.327), ('word', 0.26), ('chunking', 0.236), ('ner', 0.2), ('token', 0.185), ('tokens', 0.185), ('hlbl', 0.166), ('smooths', 0.146), ('conll', 0.125), ('unlabeled', 0.118), ('eigenfeature', 0.11), ('contexts', 0.109), ('wi', 0.106), ('ccak', 0.104), ('embedding', 0.104), ('rank', 0.102), ('singular', 0.102), ('words', 0.102), ('vocabulary', 0.088), ('brown', 0.088), ('nlp', 0.086), ('context', 0.079), ('sentences', 0.079), ('state', 0.075), ('entity', 0.073), ('zt', 0.071), ('named', 0.071), ('wt', 0.065), ('embed', 0.064), ('hi', 0.063), ('views', 0.063), ('ccad', 0.062), ('ccas', 0.062), ('cll', 0.062), ('clr', 0.062), ('gazetteers', 0.062), ('isozaki', 0.062), ('polysemous', 0.062), ('xtrain', 0.062), ('xr', 0.062), ('right', 0.061), ('supervised', 0.06), ('xes', 0.06), ('oblivious', 0.06), ('features', 0.06), ('development', 0.057), ('representations', 0.056), ('left', 0.056), ('ratinov', 0.055), ('stroudsburg', 0.055), ('dimensional', 0.054), ('xl', 0.053), ('hidden', 0.051), ('acl', 0.05), ('st', 0.049), ('conjunction', 0.049), ('co', 0.048), ('ti', 0.048), ('canonical', 0.047), ('ah', 0.045), ('accuracies', 0.045), ('suzuki', 0.043), ('ando', 0.043), ('crf', 0.042), ('crl', 0.042), ('crr', 0.042), ('gazetteer', 0.042), ('gotten', 0.042), ('lah', 0.042), ('supplemented', 0.042), ('window', 0.039), ('smoothing', 0.039), ('occurrences', 0.038), ('stuck', 0.037), ('yates', 0.037), ('rah', 0.037), ('ungar', 0.037), ('matrix', 0.036), ('linguistics', 0.036), ('xw', 0.036), ('pa', 0.036), ('baseline', 0.035), ('correlation', 0.035), ('clusters', 0.034), ('hmm', 0.034), ('foster', 0.034), ('got', 0.034), ('matrices', 0.033), ('dictionary', 0.033), ('spectral', 0.032), ('corpus', 0.032), ('low', 0.032), ('hv', 0.032), ('turian', 0.032), ('phrase', 0.032), ('core', 0.032), ('like', 0.031), ('rare', 0.03), ('vectors', 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 176 nips-2011-Multi-View Learning of Word Embeddings via CCA

Author: Paramveer Dhillon, Dean P. Foster, Lyle H. Ungar

Abstract: Recently, there has been substantial interest in using large amounts of unlabeled data to learn word representations which can then be used as features in supervised classifiers for NLP tasks. However, most current approaches are slow to train, do not model the context of the word, and lack theoretical grounding. In this paper, we present a new learning method, Low Rank Multi-View Learning (LR-MVL) which uses a fast spectral method to estimate low dimensional context-specific word representations from unlabeled data. These representation features can then be used with any supervised learner. LR-MVL is extremely fast, gives guaranteed convergence to a global optimum, is theoretically elegant, and achieves state-ofthe-art performance on named entity recognition (NER) and chunking problems. 1 Introduction and Related Work Over the past decade there has been increased interest in using unlabeled data to supplement the labeled data in semi-supervised learning settings to overcome the inherent data sparsity and get improved generalization accuracies in high dimensional domains like NLP. Approaches like [1, 2] have been empirically very successful and have achieved excellent accuracies on a variety of NLP tasks. However, it is often difficult to adapt these approaches to use in conjunction with an existing supervised NLP system as these approaches enforce a particular choice of model. An increasingly popular alternative is to learn representational embeddings for words from a large collection of unlabeled data (typically using a generative model), and to use these embeddings to augment the feature set of a supervised learner. Embedding methods produce features in low dimensional spaces or over a small vocabulary size, unlike the traditional approach of working in the original high dimensional vocabulary space with only one dimension “on” at a given time. Broadly, these embedding methods fall into two categories: 1. Clustering based word representations: Clustering methods, often hierarchical, are used to group distributionally similar words based on their contexts. The two dominant approaches are Brown Clustering [3] and [4]. As recently shown, HMMs can also be used to induce a multinomial distribution over possible clusters [5]. 2. Dense representations: These representations are dense, low dimensional and real-valued. Each dimension of these representations captures latent information about a combination of syntactic and semantic word properties. They can either be induced using neural networks like C&W; embeddings [6] and Hierarchical log-linear (HLBL) embeddings [7] or by eigen-decomposition of the word co-occurrence matrix, e.g. Latent Semantic Analysis/Latent Semantic Indexing (LSA/LSI) [8]. Unfortunately, most of these representations are 1). slow to train, 2). sensitive to the scaling of the embeddings (especially 2 based approaches like LSA/PCA), 3). can get stuck in local optima (like EM trained HMM) and 4). learn a single embedding for a given word type; i.e. all the occurrences 1 of the word “bank” will have the same embedding, irrespective of whether the context of the word suggests it means “a financial institution” or “a river bank”. In this paper, we propose a novel context-specific word embedding method called Low Rank MultiView Learning, LR-MVL, which is fast to train and is guaranteed to converge to the optimal solution. 
As presented here, our LR-MVL embeddings are context-specific, but context oblivious embeddings (like the ones used by [6, 7]) can be trivially gotten from our model. Furthermore, building on recent advances in spectral learning for sequence models like HMMs [9, 10, 11] we show that LR-MVL has strong theoretical grounding. Particularly, we show that LR-MVL estimates low dimensional context-specific word embeddings which preserve all the information in the data if the data were generated by an HMM. Moreover, LR-MVL being linear does not face the danger of getting stuck in local optima as is the case for an EM trained HMM. LR-MVL falls into category (2) mentioned above; it learns real-valued context-specific word embeddings by performing Canonical Correlation Analysis (CCA) [12] between the past and future views of low rank approximations of the data. However, LR-MVL is more general than those methods, which work on bigram or trigram co-occurrence matrices, in that it uses longer word sequence information to estimate context-specific embeddings and also for the reasons mentioned in the last paragraph. The remainder of the paper is organized as follows. In the next section we give a brief overview of CCA, which forms the core of our method. Section 3 describes our proposed LR-MVL algorithm in detail and gives theory supporting its performance. Section 4 demonstrates the effectiveness of LR-MVL on the NLP tasks of Named Entity Recognition and Chunking. We conclude with a brief summary in Section 5. 2 Brief Review: Canonical Correlation Analysis (CCA) CCA [12] is the analog to Principal Component Analysis (PCA) for pairs of matrices. PCA computes the directions of maximum covariance between elements in a single matrix, whereas CCA computes the directions of maximal correlation between a pair of matrices. Unlike PCA, CCA does not depend on how the observations are scaled. This invariance of CCA to linear data transformations allows proofs that keeping the dominant singular vectors (those with largest singular values) will faithfully capture any state information. More specifically, given a set of n paired observation vectors {(l1 , r1 ), ..., (ln , rn )}–in our case the two matrices are the left (L) and right (R) context matrices of a word–we would like to simultaneously find the directions Φl and Φr that maximize the correlation of the projections of L onto Φl with the projections of R onto Φr . This is expressed as max Φl ,Φr E[ L, Φl R, Φr ] E[ L, Φl 2 ]E[ R, Φr 2 ] (1) where E denotes the empirical expectation. We use the notation Clr (Cll ) to denote the cross (auto) covariance matrices between L and R (i.e. L’R and L’L respectively.). The left and right canonical correlates are the solutions Φl , Φr of the following equations: Cll −1 Clr Crr −1 Crl Φl = λΦl Crr −1 Crl Cll −1 Clr Φr = λΦr 3 (2) Low Rank Multi-View Learning (LR-MVL) In LR-MVL, we compute the CCA between the past and future views of the data on a large unlabeled corpus to find the common latent structure, i.e., the hidden state associated with each token. These induced representations of the tokens can then be used as features in a supervised classifier (typically discriminative). The context around a word, consisting of the h words to the right and left of it, sits in a high dimensional space, since for a vocabulary of size v, each of the h words in the context requires an indicator function of dimension v. The key move in LR-MVL is to project the v-dimensional word 2 space down to a k dimensional state space. 
With this projection, all eigenvector computations are done in a space that is v/k times smaller than the original space. Since a typical vocabulary contains at least 50,000 words and we use state spaces of order k ≈ 50 dimensions, this gives a 1,000-fold reduction in the size of the calculations that are needed.

The core of our LR-MVL algorithm is a fast spectral method for learning a v × k matrix A which maps each of the v words in the vocabulary to a k-dimensional state vector. We call this matrix the "eigenfeature dictionary". We now describe the LR-MVL method, give a theorem that provides intuition into how it works, and formally present the LR-MVL algorithm. The Experiments section then shows that this low rank approximation allows us to achieve state-of-the-art performance on NLP tasks.

3.1 The LR-MVL method

Given an unlabeled token sequence w = {w_0, w_1, ..., w_n} we want to learn a low, k-dimensional state vector {z_0, z_1, ..., z_n} for each observed token. The key is to find a v × k matrix A (Algorithm 1) that maps each of the v words in the vocabulary to a reduced rank k-dimensional state vector, which is later used to induce context-specific embeddings for the tokens (Algorithm 2). For supervised learning, these context-specific embeddings are supplemented with other information about each token w_t, such as its identity, orthographic features such as prefixes and suffixes, or membership in domain-specific lexicons, and used as features in a classifier.

Section 3.4 gives the algorithm more formally, but the key steps in the algorithm are, in general terms:
• Take the h words to the left and to the right of each target word w_t (the "Left" and "Right" contexts), and project them each down to k dimensions using A.
• Take the CCA between the reduced rank left and right contexts, and use the resulting model to estimate a k-dimensional state vector (the "hidden state") for each token.
• Take the CCA between the hidden states and the tokens w_t. The singular vectors associated with w_t form a new estimate of the eigenfeature dictionary.

LR-MVL can be viewed as a type of co-training [13]: the state of each token w_t is similar to that of the tokens both before and after it, and it is also similar to the states of the other occurrences of the same word elsewhere in the document (used in the outer iteration). LR-MVL takes advantage of these two different types of similarity by alternately estimating word state using CCA on the smooths of the states of the words before and after each target token, and using the average over the states associated with all other occurrences of that word.

3.2 Theoretical Properties of LR-MVL

We now present the theory behind the LR-MVL algorithm; in particular, we show that the reduced rank matrix A allows a significant data reduction while preserving the information in our data, and that the estimated state does the best possible job of capturing any label information that can be inferred by a linear model.

Let L be an n × hv matrix giving the words in the left context of each of the n tokens, where the context is of length h, R be the corresponding n × hv matrix for the right context, and W be an n × v matrix of indicator functions for the words themselves (a small sketch of how these matrices can be built from a token sequence is given after Assumption 1 below). We will use the following assumptions at various points in our proof:

Assumption 1. L, W, and R come from a rank k HMM, i.e., it has a rank k observation matrix and a rank k transition matrix, both of which have the same domain.
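To make the shapes above concrete, here is a small sketch (ours, not the paper's code) that builds L, W and R as sparse indicator matrices from an integer-coded token sequence; the treatment of positions that fall off the ends of the sequence is an assumption of the example.

```python
import numpy as np
from scipy.sparse import lil_matrix

def context_matrices(tokens, v, h):
    """Build the n x hv left (L) and right (R) context indicator matrices and the
    n x v word indicator matrix W from an integer-coded token sequence."""
    n = len(tokens)
    L, R, W = lil_matrix((n, h * v)), lil_matrix((n, h * v)), lil_matrix((n, v))
    for t, w in enumerate(tokens):
        W[t, w] = 1.0
        for j in range(1, h + 1):              # j-th word to the left / right of position t
            if t - j >= 0:
                L[t, (j - 1) * v + tokens[t - j]] = 1.0
            if t + j < n:
                R[t, (j - 1) * v + tokens[t + j]] = 1.0
    return L.tocsr(), W.tocsr(), R.tocsr()

# Toy usage: a 7-token sequence over a vocabulary of 5 word types, context length h = 2.
L, W, R = context_matrices([0, 3, 1, 4, 2, 1, 0], v=5, h=2)
print(L.shape, W.shape, R.shape)               # (7, 10) (7, 5) (7, 10)
```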
To illustrate Assumption 1: if the dimension of the hidden state is k and the vocabulary size is v, then the observation matrix, which is k × v, has rank k. This rank condition is similar to the one used by [10].

Assumption 1A. For the three views L, W and R, assume that there exists a "hidden state" H of dimension n × k, where each row H_i has the same non-singular variance-covariance matrix, and such that E(L_i | H_i) = H_i β_L^T, E(R_i | H_i) = H_i β_R^T and E(W_i | H_i) = H_i β_W^T, where all β's are of rank k and L_i, R_i and W_i are the rows of L, R and W respectively. Assumption 1A follows from Assumption 1.

Assumption 2. ρ(L, W), ρ(L, R) and ρ(W, R) all have rank k, where ρ(X_1, X_2) is the expected correlation between X_1 and X_2. Assumption 2 is a rank condition similar to that in [9].

Assumption 3. ρ([L, R], W) has k distinct singular values. Assumption 3 just makes the proof a little cleaner, since if there are repeated singular values, then the singular vectors are not unique. Without it, we would have to phrase results in terms of subspaces with identical singular values.

We also need to define the CCA function that computes the left and right singular vectors for a pair of matrices:

Definition 1 (CCA). Compute the CCA between two matrices X_1 and X_2. Let Φ_{X_1} be a matrix containing the d largest singular vectors for X_1 (sorted from the largest on down); likewise for Φ_{X_2}. Define the function CCA_d(X_1, X_2) = [Φ_{X_1}, Φ_{X_2}]. When we want just one of these Φ's, we will use CCA_d(X_1, X_2)_left = Φ_{X_1} for the left singular vectors and CCA_d(X_1, X_2)_right = Φ_{X_2} for the right singular vectors.

Note that the resulting singular vectors [Φ_{X_1}, Φ_{X_2}] can be used to give two redundant estimates, X_1 Φ_{X_1} and X_2 Φ_{X_2}, of the "hidden" state relating X_1 and X_2, if such a hidden state exists (a tiny numerical illustration of this redundancy is given at the end of this subsection).

Definition 2. Define the symbol "≈" to mean X_1 ≈ X_2 ⟺ lim_{n→∞} X_1 = lim_{n→∞} X_2, where n is the sample size.

Lemma 1. Define A by the following limit of the right singular vectors: CCA_k([L, R], W)_right ≈ A. Then, under assumptions 1A, 2 and 3, if CCA_k(L, R) ≡ [Φ_L, Φ_R] then CCA_k([LΦ_L, RΦ_R], W)_right ≈ A.

Lemma 1 shows that instead of finding the CCA between the full context and the words, we can take the CCA between the left and right contexts, estimate a k-dimensional state from them, and take the CCA of that state with the words and get the same result. See the supplementary material for the proof.

Let Ã_h denote a matrix formed by stacking h copies of A on top of each other. Right multiplying L or R by Ã_h projects each of the words in that context into the k-dimensional reduced rank space. The following theorem addresses the core of the LR-MVL algorithm, showing that there is an A which gives the desired dimensionality reduction. Specifically, it shows that the previous lemma also holds in the reduced rank space.

Theorem 1. Under assumptions 1, 2 and 3 there exists a unique matrix A such that if CCA_k(LÃ_h, RÃ_h) ≡ [Φ̃_L, Φ̃_R] then CCA_k([LÃ_h Φ̃_L, RÃ_h Φ̃_R], W)_right ≈ A, where Ã_h is the stacked form of A. See the supplementary material for the proof.

(Footnote 1: It is worth noting that our matrix A corresponds to the matrix Û used by [9, 10]. They showed that Û is sufficient to compute the probability of a sequence of words generated by an HMM; although we do not show it here due to limited space, our A provides a more statistically efficient estimate of U than their Û, and hence can also be used to estimate the sequence probabilities.)
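The following toy check, which uses scikit-learn's CCA purely for convenience, illustrates the remark after Definition 1: when two views share a low dimensional hidden state, the two projections X_1 Φ_{X_1} and X_2 Φ_{X_2} recover essentially the same signal. The dimensions, noise level and choice of library are our own assumptions for the illustration, not the paper's setup.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
H = rng.standard_normal((1000, 3))                    # shared k = 3 dimensional "hidden state"
X1 = H @ rng.standard_normal((3, 20)) + 0.1 * rng.standard_normal((1000, 20))
X2 = H @ rng.standard_normal((3, 30)) + 0.1 * rng.standard_normal((1000, 30))

S1, S2 = CCA(n_components=3).fit(X1, X2).transform(X1, X2)
# Each canonical variate of X1 is almost perfectly correlated with its partner in X2,
# so the two projected views are indeed redundant estimates of the hidden state.
print([round(np.corrcoef(S1[:, i], S2[:, i])[0, 1], 3) for i in range(3)])
```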
Under the above assumptions, there is asymptotically (in the limit of infinite data) no benefit to first estimating state by finding the CCA between the left and right contexts and then finding the CCA between the estimated state and the words. One could instead just directly find the CCA between the combined left and right contexts and the words. However, because of the Zipfian distribution of words, many words are rare or even unique, and hence one is not in the asymptotic limit. In this case, CCA between the rare words and their context will not be informative, whereas finding the CCA between the left and right contexts gives a good state vector estimate even for unique words. One can then fruitfully find the CCA between the contexts and the estimated state vector for their associated words.

3.3 Using Exponential Smooths

In practice, we replace the projected left and right contexts with exponential smooths of them at a few different time scales (weighted averages of the previous (or next) token's state, i.e. Z_{t-1} (or Z_{t+1}), and the previous (or next) token's smoothed state, i.e. S_{t-1} (or S_{t+1})). This gives a further dimension reduction by a factor of the context length h (say 100 words) divided by the number of smooths (often 5-7). We use a mixture of both very short and very long contexts, which capture the short and long range dependencies required by NLP problems such as NER, chunking, WSD etc. Since exponential smooths are linear, we preserve the linearity of our method.

3.4 The LR-MVL Algorithm

The LR-MVL algorithm (using exponential smooths) is given in Algorithm 1; it computes the pair of CCAs described above in Theorem 1.

Algorithm 1 LR-MVL Algorithm - Learning from Large Amounts of Unlabeled Data
1: Input: token sequence W_{n×v}, state space size k, smoothing rates α_j
2: Initialize the eigenfeature dictionary A to random values N(0, 1).
3: repeat
4:   Set the state Z_t (1 < t ≤ n) of each token w_t to the eigenfeature vector of the corresponding word: Z_t = (A_w : w = w_t)
5:   Smooth the state estimates before and after each token to get a pair of views for each smoothing rate α_j:
       S_t^{(l,j)} = (1 - α_j) S_{t-1}^{(l,j)} + α_j Z_{t-1}   // left view L
       S_t^{(r,j)} = (1 - α_j) S_{t+1}^{(r,j)} + α_j Z_{t+1}   // right view R
     where the t-th rows of L and R are, respectively, concatenations of the smooths S_t^{(l,j)} and S_t^{(r,j)} for each of the α_j's.
6:   Find the left and right canonical correlates, which are the eigenvectors Φ_l and Φ_r of
       (L^T L)^{-1} L^T R (R^T R)^{-1} R^T L Φ_l = λ Φ_l
       (R^T R)^{-1} R^T L (L^T L)^{-1} L^T R Φ_r = λ Φ_r
7:   Project the left and right views onto the space spanned by the top k/2 left and right CCAs respectively: X_l = L Φ_l^{(k/2)} and X_r = R Φ_r^{(k/2)}, where Φ_l^{(k/2)}, Φ_r^{(k/2)} are matrices composed of the singular vectors of Φ_l, Φ_r with the k/2 largest magnitude singular values. Estimate the state for each word w_t as the union of the left and right estimates: Z = [X_l, X_r]
8:   Estimate the eigenfeatures of each word type w as the average of the states estimated for that word: A_w = avg(Z_t : w_t = w)
9:   Compute the change in A from the previous iteration.
10: until |ΔA| < ε
11: Output: Φ_l^k, Φ_r^k, A

A few iterations (∼5) of the above algorithm are sufficient to converge to the solution (a compact sketch of one such iteration is given at the end of this subsection). Since the problem is convex, there is a single solution, so there is no issue of local minima. As [14] show for PCA, one can start with a random matrix that is only slightly larger than the true rank k of the correlation matrix and, with extremely high likelihood, converge in a few iterations to within a small distance of the true principal components.
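Here is a compact sketch of one outer iteration of Algorithm 1 on a toy integer-coded corpus. It is our simplified reading of Steps 4-8, not the authors' implementation: the vocabulary size, state size, smoothing rates, ridge term and the decision to keep the previous eigenfeatures of unseen words are all assumptions of the example.

```python
import numpy as np

def cca_directions(L, R, d, reg=1e-5):
    """Top-d canonical directions of two paired views (rows = tokens)."""
    def inv_sqrt(C):
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    n = L.shape[0]
    Wl = inv_sqrt(L.T @ L / n + reg * np.eye(L.shape[1]))
    Wr = inv_sqrt(R.T @ R / n + reg * np.eye(R.shape[1]))
    U, s, Vt = np.linalg.svd(Wl @ (L.T @ R / n) @ Wr)
    return Wl @ U[:, :d], Wr @ Vt.T[:, :d]

def lr_mvl_iteration(tokens, A, alphas):
    """One outer pass of Algorithm 1: smooth states, CCA left vs. right, re-estimate A."""
    n, (v, k) = len(tokens), A.shape
    Z = A[tokens]                                      # Step 4: current state of every token
    L = np.zeros((n, k * len(alphas)))                 # Step 5: smoothed left views
    R = np.zeros((n, k * len(alphas)))                 # Step 5: smoothed right views
    for j, a in enumerate(alphas):
        Sl, Sr = np.zeros(k), np.zeros(k)
        for t in range(1, n):                          # left smooth runs forward in time
            Sl = (1 - a) * Sl + a * Z[t - 1]
            L[t, j * k:(j + 1) * k] = Sl
        for t in range(n - 2, -1, -1):                 # right smooth runs backward in time
            Sr = (1 - a) * Sr + a * Z[t + 1]
            R[t, j * k:(j + 1) * k] = Sr
    Phi_l, Phi_r = cca_directions(L, R, k // 2)        # Steps 6-7: CCA between the two views
    Z_new = np.hstack([L @ Phi_l, R @ Phi_r])          # union of left and right state estimates
    A_new = A.copy()                                   # unseen words keep their old eigenfeatures
    for w in range(v):
        idx = np.flatnonzero(tokens == w)
        if idx.size:
            A_new[w] = Z_new[idx].mean(axis=0)         # Step 8: average state per word type
    return A_new

rng = np.random.default_rng(0)
tokens = rng.integers(0, 20, size=2000)                # toy corpus over a 20-word vocabulary
A = rng.standard_normal((20, 6))                       # Step 2: k = 6, initialised N(0, 1)
for _ in range(5):                                     # a few iterations typically suffice
    A = lr_mvl_iteration(tokens, A, alphas=[0.1, 0.5])
```

Each pass costs one CCA in a space whose dimension is the number of smooths times k, plus a single sweep over the corpus, which is what keeps the method fast.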
In our case, if the assumptions detailed above (1, 1A, 2 and 3) are satisfied, our method converges just as rapidly to the true canonical variates.

As mentioned earlier, we get further dimensionality reduction in Step 5 by replacing the left and right context matrices with a set of exponentially smoothed values of the reduced rank projections of the context words. Step 6 finds the CCA between the left and right contexts. Step 7 estimates the state by combining the estimates from the left and right contexts, since we do not know which will best estimate the state. Step 8 takes the CCA between the estimated state Z and the matrix of words W. Because W is a matrix of indicator functions, this CCA takes the trivial form of a set of averages.

Once we have estimated the CCA model, it is used to generate context-specific embeddings for the tokens from the training, development and test sets (as described in Algorithm 2). These embeddings are further supplemented with other baseline features and used in a supervised learner to predict the label of the token.

Algorithm 2 LR-MVL Algorithm - Inducing Context-Specific Embeddings for Train/Dev/Test Data
1: Input: model (Φ_l^k, Φ_r^k, A) output from the above algorithm, and token sequences W_train, (W_dev, W_test)
2: Project the left and right views L and R, after smoothing, onto the space spanned by the top k left and right CCAs respectively: X_l = L Φ_l^k and X_r = R Φ_r^k, and the words onto the eigenfeature dictionary: X_w = W_train A
3: Form the final embedding matrix X_train:embed by concatenating these three estimates of state: X_train:embed = [X_l, X_w, X_r]
4: Output: the embedding matrices X_train:embed, (X_dev:embed, X_test:embed) with context-specific representations for the tokens.

These embeddings are augmented with the baseline sets of features mentioned in Sections 4.1.1 and 4.1.2 before learning the final classifier. Note that we can get context "oblivious" embeddings, i.e. one embedding per word type, just by using the eigenfeature dictionary (A_{v×k}) output by Algorithm 1.

4 Experimental Results

In this section we present the experimental results of LR-MVL on Named Entity Recognition (NER) and syntactic chunking tasks. We compare LR-MVL to state-of-the-art semi-supervised approaches like [1] (Alternating Structure Optimization (ASO)) and [2] (a semi-supervised extension of CRFs), as well as to embeddings like C&W, HLBL and Brown clustering.

4.1 Datasets and Experimental Setup

For the NER experiments we used the data from the CoNLL 2003 shared task, and for the chunking experiments we used the CoNLL 2000 shared task data, with the standard training, development and testing set splits (more details about the data and the competitions are available at http://www.cnts.ua.ac.be/conll2003/ner/ and http://www.cnts.ua.ac.be/conll2000/chunking/). The CoNLL '03 and CoNLL '00 datasets had ∼204K/51K/46K and ∼212K/-/47K tokens, respectively, for the Train/Dev./Test sets.

4.1.1 Named Entity Recognition (NER)

We use the same set of baseline features as used by [15, 16] in their experiments. The detailed list of features is as below:
• Current word w_i; its type information: all-capitalized, is-capitalized, all-digits and so on; prefixes and suffixes of w_i.
• Word tokens in a window of 2 around the current word, i.e. d = (w_{i-2}, w_{i-1}, w_i, w_{i+1}, w_{i+2}); and the capitalization pattern in the window.
• Previous two predictions y_{i-1} and y_{i-2}, and the conjunction of d and y_{i-1}.
• Embedding features (LR-MVL, C&W, HLBL, Brown etc.) in a window of 2 around the current word (if applicable); a small sketch of this windowing is given after this list.

Following [17] we use a regularized averaged perceptron model with the above set of baseline features for the NER task.
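As a rough illustration of the last feature above, the per-token embedding features might be assembled as follows; the zero-padding at sentence boundaries and the helper name are our assumptions, not details from the paper.

```python
import numpy as np

def window_embedding_features(embeddings, window=2):
    """Concatenate the embeddings of w_{i-2}, ..., w_{i+2} for every position i."""
    n, d = embeddings.shape
    padded = np.vstack([np.zeros((window, d)), embeddings, np.zeros((window, d))])
    return np.hstack([padded[i:i + n] for i in range(2 * window + 1)])

# Toy usage: a 7-token sentence with 150-dimensional LR-MVL embeddings (50 x 3).
sentence_embeddings = np.random.default_rng(0).standard_normal((7, 150))
features = window_embedding_features(sentence_embeddings)   # shape (7, 750)
```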
We also used the BILOU text chunk representation and fast greedy inference of [17], as these were shown to give superior performance. We further augment the above set of baseline features with gazetteers, as is standard practice in NER experiments.

We tuned our free parameter, namely the size of the LR-MVL embedding, on the development set, and scaled our embedding features to have an ℓ2 norm of 1 for each token, further multiplying them by a normalization constant (also chosen by cross validation), so that when they are used in conjunction with other categorical features in a linear classifier they do not exert extra influence. The size of the LR-MVL embeddings (state-space) that gave the best performance on the development set was k = 50 (50 each for X_l, X_w, X_r in Algorithm 2), i.e. the total size of the embeddings was 50 × 3, and the best normalization constant was 0.5. We omit validation plots due to paucity of space.

4.1.2 Chunking

For our chunking experiments we use a similar base set of features as above:
• Current word w_i and word tokens in a window of 2 around the current word, i.e. d = (w_{i-2}, w_{i-1}, w_i, w_{i+1}, w_{i+2});
• POS tags t_i in a window of 2 around the current word;
• Word conjunction features w_i ∩ w_{i+1}, i ∈ {-1, 0}, and tag conjunction features t_i ∩ t_{i+1}, i ∈ {-2, -1, 0, 1} and t_i ∩ t_{i+1} ∩ t_{i+2}, i ∈ {-2, -1, 0};
• Embedding features in a window of 2 around the current word (when applicable).

Since the CoNLL '00 chunking data does not have a development set, we randomly sampled 1000 sentences from the training data (8936 sentences) for development. So we trained our chunking models on 7936 training sentences, evaluated their F1 score on the 1000 development sentences, and used a CRF (CRFsuite, http://www.chokkan.org/software/crfsuite/) as the supervised classifier. We tuned the size of the embedding and the magnitude of the ℓ2 regularization penalty of the CRF on the development set, and took the log (or -log of the magnitude) of the value of the features; our embeddings are learnt using a linear model whereas the CRF is a log-linear model, so this normalization keeps the two on the same scale. The regularization penalty that gave the best performance on the development set was 2, and here again the best size of the LR-MVL embeddings (state-space) was k = 50. Finally, we trained the CRF on the entire ("original") training data, i.e. 8936 sentences.

4.1.3 Unlabeled Data and Induction of Embeddings

For inducing the embeddings we used the RCV1 corpus, containing Reuters newswire from Aug '96 to Aug '97 and about 63 million tokens in 3.3 million sentences; we chose this particular dataset to make a fair comparison with [1, 16], who report results using RCV1 as unlabeled data. Case was left intact and we did not do the "cleaning" done by [18, 16], i.e. removing all sentences which are less than 90% lowercase a-z, as our multi-view learning approach is robust to such noisy data, like news byline text (mostly all caps) which does not correlate strongly with the text of the article.

We induced our LR-MVL embeddings over a period of 3 days (70 core hours on a 3.0 GHz CPU) on the entire RCV1 data by performing 4 iterations, using a vocabulary size of 300k and a variety of smoothing rates (α in Algorithm 1) to capture correlations between shorter and longer contexts: α = [0.005, 0.01, 0.05, 0.1, 0.5, 0.9]. In principle we could tune the smoothing parameters on the development set, but we found this mixture of long and short term dependencies to work well in practice. As for the other embeddings, i.e. C&W, HLBL and Brown clusters, we downloaded them from http://metaoptimize.com/projects/wordreprs.
The details about their induction and parameter tuning can be found in [16]; we report their best numbers here. It is also worth noting that the unsupervised training of LR-MVL was (> 1.5 times) faster than that of the other embeddings, and some of those embeddings were trained on GPGPUs, which makes our method even faster by comparison.

4.2 Results

The results for NER and chunking are shown in Tables 1 and 2, respectively, which show that LR-MVL performs significantly better than state-of-the-art competing methods on both the NER and chunking tasks.

Table 1: NER results (F1-score).

  Embedding/Model              Dev. Set   Test Set
  No Gazetteers:
    Baseline                   90.03      84.39
    C&W, 200-dim               92.46      87.46
    HLBL, 100-dim              92.00      88.13
    Brown, 1000 clusters       92.32      88.52
    Ando & Zhang '05           93.15      89.31
    Suzuki & Isozaki '08       93.66      89.36
    LR-MVL (CO), 50 × 3-dim    93.11      89.55
    LR-MVL, 50 × 3-dim         93.61      89.91
  With Gazetteers:
    HLBL, 100-dim              92.91      89.35
    C&W, 200-dim               92.98      88.88
    Brown, 1000 clusters       93.25      89.41
    LR-MVL (CO), 50 × 3-dim    93.91      89.89
    LR-MVL, 50 × 3-dim         94.41      90.06

Notes: 1) LR-MVL (CO) are context-oblivious embeddings, obtained from A in Algorithm 1. 2) The F1-score is the harmonic mean of precision and recall. 3) The current state of the art for this NER task is 90.90 (Test Set), but it uses 700 billion tokens of unlabeled data [19].

Table 2: Chunking results.

  Embedding/Model              Test Set F1-Score
    Baseline                   93.79
    HLBL, 50-dim               94.00
    C&W, 50-dim                94.10
    Brown, 3200 clusters       94.11
    Ando & Zhang '05           94.39
    Suzuki & Isozaki '08       94.67
    LR-MVL (CO), 50 × 3-dim    95.02
    LR-MVL, 50 × 3-dim         95.44

It is important to note that in problems like NER the final accuracy depends on performance on rare words, and since LR-MVL is robustly able to correlate past with future views, it learns better representations for rare words, resulting in better overall accuracy. On rare words (occurring < 10 times in the corpus), we got 11.7%, 10.7% and 9.6% relative reductions in error over C&W, HLBL and Brown respectively for NER; on chunking the corresponding numbers were 6.7%, 7.1% and 8.7%.

It is also worth mentioning that modeling the context in embeddings gives decent improvements in accuracy on both the NER and chunking problems. For NER, the polysemous words were mostly words like Chicago, Wales and Oakland, which could be either a location or an organization (sports teams, banks etc.). When we do not use the gazetteer features (which are known lists of cities, persons, organizations etc.), we see a larger increase in F-score from modeling context; when gazetteer features are present, they already capture most of the information about polysemous words in the NER dataset, and modeling the context does not help as much. The polysemous words for the chunking dataset were words like spot (VP/NP), never (VP/ADVP) and more (NP/VP/ADVP/ADJP), and in this case embeddings with context helped significantly, giving a 3.1-6.5% relative improvement in accuracy over context-oblivious embeddings.

5 Summary and Conclusion

In this paper, we presented a novel CCA-based multi-view learning method, LR-MVL, for large scale sequence learning problems such as arise in NLP.
LR-MVL is a spectral method that works in a low dimensional state-space, so it is computationally efficient and can be trained using large amounts of unlabeled data; moreover, it does not get stuck in local optima like an EM-trained HMM. The embeddings learnt using LR-MVL can be used as features with any supervised learner. LR-MVL has strong theoretical grounding, is much simpler and faster than competing methods, and achieves state-of-the-art accuracies on NER and chunking problems.

Acknowledgements: The authors would like to thank Alexander Yates, Ted Sandler and the three anonymous reviewers for providing valuable feedback. We would also like to thank Lev Ratinov and Joseph Turian for answering our questions regarding their paper [16].

References
[1] Ando, R., Zhang, T.: A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research 6 (2005) 1817-1853
[2] Suzuki, J., Isozaki, H.: Semi-supervised sequential labeling and segmentation using giga-word scale unlabeled data. In: ACL (2008)
[3] Brown, P., deSouza, P., Mercer, R., Pietra, V.D., Lai, J.: Class-based n-gram models of natural language. Computational Linguistics 18 (December 1992) 467-479
[4] Pereira, F., Tishby, N., Lee, L.: Distributional clustering of English words. In: 31st Annual Meeting of the ACL (1993) 183-190
[5] Huang, F., Yates, A.: Distributional representations for handling sparsity in supervised sequence-labeling. In: ACL '09, Stroudsburg, PA, USA, Association for Computational Linguistics (2009) 495-503
[6] Collobert, R., Weston, J.: A unified architecture for natural language processing: deep neural networks with multitask learning. In: ICML '08, New York, NY, USA, ACM (2008) 160-167
[7] Mnih, A., Hinton, G.: Three new graphical models for statistical language modelling. In: ICML '07, New York, NY, USA, ACM (2007) 641-648
[8] Dumais, S., Furnas, G., Landauer, T., Deerwester, S., Harshman, R.: Using latent semantic analysis to improve access to textual information. In: SIGCHI Conference on Human Factors in Computing Systems, ACM (1988) 281-285
[9] Hsu, D., Kakade, S., Zhang, T.: A spectral algorithm for learning hidden Markov models. In: COLT (2009)
[10] Siddiqi, S., Boots, B., Gordon, G.J.: Reduced-rank hidden Markov models. In: AISTATS (2010)
[11] Song, L., Boots, B., Siddiqi, S.M., Gordon, G.J., Smola, A.J.: Hilbert space embeddings of hidden Markov models. In: ICML (2010)
[12] Hotelling, H.: Canonical correlation analysis (CCA). Journal of Educational Psychology (1935)
[13] Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: COLT '98 (1998) 92-100
[14] Halko, N., Martinsson, P.G., Tropp, J.: Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. (Dec 2010)
[15] Zhang, T., Johnson, D.: A robust risk minimization based named entity recognition system. In: CoNLL '03 (2003) 204-207
[16] Turian, J., Ratinov, L., Bengio, Y.: Word representations: a simple and general method for semi-supervised learning. In: ACL '10, Stroudsburg, PA, USA, Association for Computational Linguistics (2010) 384-394
[17] Ratinov, L., Roth, D.: Design challenges and misconceptions in named entity recognition. In: CoNLL (2009) 147-155
[18] Liang, P.: Semi-supervised learning for natural language. Master's thesis, Massachusetts Institute of Technology (2005)
[19] Lin, D., Wu, X.: Phrase clustering for discriminative learning.
In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2. ACL '09, Stroudsburg, PA, USA, Association for Computational Linguistics (2009) 1030-1038

2 0.16249335 54 nips-2011-Co-regularized Multi-view Spectral Clustering

Author: Abhishek Kumar, Piyush Rai, Hal Daume

Abstract: In many clustering problems, we have access to multiple views of the data each of which could be individually used for clustering. Exploiting information from multiple views, one can hope to find a clustering that is more accurate than the ones obtained using the individual views. Often these different views admit same underlying clustering of the data, so we can approach this problem by looking for clusterings that are consistent across the views, i.e., corresponding data points in each view should have same cluster membership. We propose a spectral clustering framework that achieves this goal by co-regularizing the clustering hypotheses, and propose two co-regularization schemes to accomplish this. Experimental comparisons with a number of baselines on two synthetic and three real-world datasets establish the efficacy of our proposed approaches.

3 0.15345857 167 nips-2011-Maximum Covariance Unfolding : Manifold Learning for Bimodal Data

Author: Vijay Mahadevan, Chi W. Wong, Jose C. Pereira, Tom Liu, Nuno Vasconcelos, Lawrence K. Saul

Abstract: We propose maximum covariance unfolding (MCU), a manifold learning algorithm for simultaneous dimensionality reduction of data from different input modalities. Given high dimensional inputs from two different but naturally aligned sources, MCU computes a common low dimensional embedding that maximizes the cross-modal (inter-source) correlations while preserving the local (intra-source) distances. In this paper, we explore two applications of MCU. First we use MCU to analyze EEG-fMRI data, where an important goal is to visualize the fMRI voxels that are most strongly correlated with changes in EEG traces. To perform this visualization, we augment MCU with an additional step for metric learning in the high dimensional voxel space. Second, we use MCU to perform cross-modal retrieval of matched image and text samples from Wikipedia. To manage large applications of MCU, we develop a fast implementation based on ideas from spectral graph theory. These ideas transform the original problem for MCU, one of semidefinite programming, into a simpler problem in semidefinite quadratic linear programming. 1

4 0.10473345 74 nips-2011-Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection

Author: Richard Socher, Eric H. Huang, Jeffrey Pennin, Christopher D. Manning, Andrew Y. Ng

Abstract: Paraphrase detection is the task of examining two sentences and determining whether they have the same meaning. In order to obtain high accuracy on this task, thorough syntactic and semantic analysis of the two statements is needed. We introduce a method for paraphrase detection based on recursive autoencoders (RAE). Our unsupervised RAEs are based on a novel unfolding objective and learn feature vectors for phrases in syntactic trees. These features are used to measure the word- and phrase-wise similarity between two sentences. Since sentences may be of arbitrary length, the resulting matrix of similarity measures is of variable size. We introduce a novel dynamic pooling layer which computes a fixed-sized representation from the variable-sized matrices. The pooled representation is then used as input to a classifier. Our method outperforms other state-of-the-art approaches on the challenging MSRP paraphrase corpus. 1

5 0.090156771 140 nips-2011-Kernel Embeddings of Latent Tree Graphical Models

Author: Le Song, Eric P. Xing, Ankur P. Parikh

Abstract: Latent tree graphical models are natural tools for expressing long range and hierarchical dependencies among many variables which are common in computer vision, bioinformatics and natural language processing problems. However, existing models are largely restricted to discrete and Gaussian variables due to computational constraints; furthermore, algorithms for estimating the latent tree structure and learning the model parameters are largely restricted to heuristic local search. We present a method based on kernel embeddings of distributions for latent tree graphical models with continuous and non-Gaussian variables. Our method can recover the latent tree structures with provable guarantees and perform local-minimum free parameter learning and efficient inference. Experiments on simulated and real data show the advantage of our proposed approach. 1

6 0.088195533 58 nips-2011-Complexity of Inference in Latent Dirichlet Allocation

7 0.077682093 216 nips-2011-Portmanteau Vocabularies for Multi-Cue Image Representation

8 0.075319186 211 nips-2011-Penalty Decomposition Methods for Rank Minimization

9 0.069302157 142 nips-2011-Large-Scale Sparse Principal Component Analysis with Application to Text Data

10 0.067852519 4 nips-2011-A Convergence Analysis of Log-Linear Training

11 0.066102378 152 nips-2011-Learning in Hilbert vs. Banach Spaces: A Measure Embedding Viewpoint

12 0.065912776 246 nips-2011-Selective Prediction of Financial Trends with Hidden Markov Models

13 0.064851932 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model

14 0.064486958 258 nips-2011-Sparse Bayesian Multi-Task Learning

15 0.064287759 260 nips-2011-Sparse Features for PCA-Like Linear Regression

16 0.06116062 263 nips-2011-Sparse Manifold Clustering and Embedding

17 0.058649089 61 nips-2011-Contextual Gaussian Process Bandit Optimization

18 0.058223046 186 nips-2011-Noise Thresholds for Spectral Clustering

19 0.058010865 46 nips-2011-Better Mini-Batch Algorithms via Accelerated Gradient Methods

20 0.054878961 259 nips-2011-Sparse Estimation with Structured Dictionaries


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.182), (1, 0.023), (2, -0.05), (3, -0.026), (4, -0.032), (5, -0.05), (6, 0.042), (7, 0.017), (8, 0.027), (9, -0.046), (10, 0.033), (11, 0.114), (12, 0.059), (13, -0.059), (14, 0.011), (15, -0.064), (16, -0.143), (17, 0.077), (18, 0.031), (19, -0.038), (20, -0.071), (21, 0.012), (22, -0.014), (23, 0.02), (24, 0.049), (25, 0.118), (26, 0.022), (27, -0.01), (28, 0.009), (29, -0.067), (30, -0.103), (31, -0.056), (32, 0.016), (33, -0.116), (34, -0.006), (35, 0.051), (36, -0.07), (37, -0.136), (38, -0.145), (39, 0.052), (40, 0.039), (41, 0.047), (42, 0.044), (43, -0.0), (44, -0.16), (45, 0.181), (46, 0.111), (47, 0.014), (48, -0.082), (49, -0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94116563 176 nips-2011-Multi-View Learning of Word Embeddings via CCA

Author: Paramveer Dhillon, Dean P. Foster, Lyle H. Ungar

Abstract: Recently, there has been substantial interest in using large amounts of unlabeled data to learn word representations which can then be used as features in supervised classifiers for NLP tasks. However, most current approaches are slow to train, do not model the context of the word, and lack theoretical grounding. In this paper, we present a new learning method, Low Rank Multi-View Learning (LR-MVL) which uses a fast spectral method to estimate low dimensional context-specific word representations from unlabeled data. These representation features can then be used with any supervised learner. LR-MVL is extremely fast, gives guaranteed convergence to a global optimum, is theoretically elegant, and achieves state-ofthe-art performance on named entity recognition (NER) and chunking problems. 1 Introduction and Related Work Over the past decade there has been increased interest in using unlabeled data to supplement the labeled data in semi-supervised learning settings to overcome the inherent data sparsity and get improved generalization accuracies in high dimensional domains like NLP. Approaches like [1, 2] have been empirically very successful and have achieved excellent accuracies on a variety of NLP tasks. However, it is often difficult to adapt these approaches to use in conjunction with an existing supervised NLP system as these approaches enforce a particular choice of model. An increasingly popular alternative is to learn representational embeddings for words from a large collection of unlabeled data (typically using a generative model), and to use these embeddings to augment the feature set of a supervised learner. Embedding methods produce features in low dimensional spaces or over a small vocabulary size, unlike the traditional approach of working in the original high dimensional vocabulary space with only one dimension “on” at a given time. Broadly, these embedding methods fall into two categories: 1. Clustering based word representations: Clustering methods, often hierarchical, are used to group distributionally similar words based on their contexts. The two dominant approaches are Brown Clustering [3] and [4]. As recently shown, HMMs can also be used to induce a multinomial distribution over possible clusters [5]. 2. Dense representations: These representations are dense, low dimensional and real-valued. Each dimension of these representations captures latent information about a combination of syntactic and semantic word properties. They can either be induced using neural networks like C&W; embeddings [6] and Hierarchical log-linear (HLBL) embeddings [7] or by eigen-decomposition of the word co-occurrence matrix, e.g. Latent Semantic Analysis/Latent Semantic Indexing (LSA/LSI) [8]. Unfortunately, most of these representations are 1). slow to train, 2). sensitive to the scaling of the embeddings (especially 2 based approaches like LSA/PCA), 3). can get stuck in local optima (like EM trained HMM) and 4). learn a single embedding for a given word type; i.e. all the occurrences 1 of the word “bank” will have the same embedding, irrespective of whether the context of the word suggests it means “a financial institution” or “a river bank”. In this paper, we propose a novel context-specific word embedding method called Low Rank MultiView Learning, LR-MVL, which is fast to train and is guaranteed to converge to the optimal solution. 
As presented here, our LR-MVL embeddings are context-specific, but context oblivious embeddings (like the ones used by [6, 7]) can be trivially gotten from our model. Furthermore, building on recent advances in spectral learning for sequence models like HMMs [9, 10, 11] we show that LR-MVL has strong theoretical grounding. Particularly, we show that LR-MVL estimates low dimensional context-specific word embeddings which preserve all the information in the data if the data were generated by an HMM. Moreover, LR-MVL being linear does not face the danger of getting stuck in local optima as is the case for an EM trained HMM. LR-MVL falls into category (2) mentioned above; it learns real-valued context-specific word embeddings by performing Canonical Correlation Analysis (CCA) [12] between the past and future views of low rank approximations of the data. However, LR-MVL is more general than those methods, which work on bigram or trigram co-occurrence matrices, in that it uses longer word sequence information to estimate context-specific embeddings and also for the reasons mentioned in the last paragraph. The remainder of the paper is organized as follows. In the next section we give a brief overview of CCA, which forms the core of our method. Section 3 describes our proposed LR-MVL algorithm in detail and gives theory supporting its performance. Section 4 demonstrates the effectiveness of LR-MVL on the NLP tasks of Named Entity Recognition and Chunking. We conclude with a brief summary in Section 5. 2 Brief Review: Canonical Correlation Analysis (CCA) CCA [12] is the analog to Principal Component Analysis (PCA) for pairs of matrices. PCA computes the directions of maximum covariance between elements in a single matrix, whereas CCA computes the directions of maximal correlation between a pair of matrices. Unlike PCA, CCA does not depend on how the observations are scaled. This invariance of CCA to linear data transformations allows proofs that keeping the dominant singular vectors (those with largest singular values) will faithfully capture any state information. More specifically, given a set of n paired observation vectors {(l1 , r1 ), ..., (ln , rn )}–in our case the two matrices are the left (L) and right (R) context matrices of a word–we would like to simultaneously find the directions Φl and Φr that maximize the correlation of the projections of L onto Φl with the projections of R onto Φr . This is expressed as max Φl ,Φr E[ L, Φl R, Φr ] E[ L, Φl 2 ]E[ R, Φr 2 ] (1) where E denotes the empirical expectation. We use the notation Clr (Cll ) to denote the cross (auto) covariance matrices between L and R (i.e. L’R and L’L respectively.). The left and right canonical correlates are the solutions Φl , Φr of the following equations: Cll −1 Clr Crr −1 Crl Φl = λΦl Crr −1 Crl Cll −1 Clr Φr = λΦr 3 (2) Low Rank Multi-View Learning (LR-MVL) In LR-MVL, we compute the CCA between the past and future views of the data on a large unlabeled corpus to find the common latent structure, i.e., the hidden state associated with each token. These induced representations of the tokens can then be used as features in a supervised classifier (typically discriminative). The context around a word, consisting of the h words to the right and left of it, sits in a high dimensional space, since for a vocabulary of size v, each of the h words in the context requires an indicator function of dimension v. The key move in LR-MVL is to project the v-dimensional word 2 space down to a k dimensional state space. 
Thus, all eigenvector computations are done in a space that is v/k times smaller than the original space. Since a typical vocabulary contains at least 50, 000 words, and we use state spaces of order k ≈ 50 dimensions, this gives a 1,000-fold reduction in the size of calculations that are needed. The core of our LR-MVL algorithm is a fast spectral method for learning a v × k matrix A which maps each of the v words in the vocabulary to a k-dimensional state vector. We call this matrix the “eigenfeature dictionary”. We now describe the LR-MVL method, give a theorem that provides intuition into how it works, and formally present the LR-MVL algorithm. The Experiments section then shows that this low rank approximation allows us to achieve state-of-the-art performance on NLP tasks. 3.1 The LR-MVL method Given an unlabeled token sequence w={w0 , w1 , . . ., wn } we want to learn a low (k)- dimensional state vector {z0 , z1 , . . . , zn } for each observed token. The key is to find a v ×k matrix A (Algorithm 1) that maps each of the v words in the vocabulary to a reduced rank k-dimensional state vector, which is later used to induce context specific embeddings for the tokens (Algorithm 2). For supervised learning, these context specific embeddings are supplemented with other information about each token wt , such as its identity, orthographic features such as prefixes and suffixes or membership in domain-specific lexicons, and used as features in a classifier. Section 3.4 gives the algorithm more formally, but the key steps in the algorithm are, in general terms: • Take the h words to the left and to the right of each target word wt (the “Left” and “Right” contexts), and project them each down to k dimensions using A. • Take the CCA between the reduced rank left and right contexts, and use the resulting model to estimate a k dimensional state vector (the “hidden state”) for each token. • Take the CCA between the hidden states and the tokens wt . The singular vectors associated with wt form a new estimate of the eigenfeature dictionary. LR-MVL can be viewed as a type of co-training [13]: The state of each token wt is similar to that of the tokens both before and after it, and it is also similar to the states of the other occurrences of the same word elsewhere in the document (used in the outer iteration). LR-MVL takes advantage of these two different types of similarity by alternately estimating word state using CCA on the smooths of the states of the words before and after each target token and using the average over the states associated with all other occurrences of that word. 3.2 Theoretical Properties of LR-MVL We now present the theory behind the LR-MVL algorithm; particularly we show that the reduced rank matrix A allows a significant data reduction while preserving the information in our data and the estimated state does the best possible job of capturing any label information that can be inferred by a linear model. Let L be an n × hv matrix giving the words in the left context of each of the n tokens, where the context is of length h, R be the corresponding n × hv matrix for the right context, and W be an n × v matrix of indicator functions for the words themselves. We will use the following assumptions at various points in our proof: Assumption 1. L, W, and R come from a rank k HMM i.e. it has a rank k observation matrix and rank k transition matrix both of which have the same domain. 
For example, if the dimension of the hidden state is k and the vocabulary size is v then the observation matrix, which is k × v, has rank k. This rank condition is similar to the one used by [10]. Assumption 1A. For the three views, L, W and R assume that there exists a “hidden state H” of dimension n × k, where each row Hi has the same non-singular variance-covariance matrix and 3 such that E(Li |Hi ) = Hi β T and E(Ri |Hi ) = Hi β T and E(Wi |Hi ) = Hi β T where all β’s are of L R W rank k, where Li , Ri and Wi are the rows of L, R and W respectively. Assumption 1A follows from Assumption 1. Assumption 2. ρ(L, W), ρ(L, R) and ρ(W, R) all have rank k, where ρ(X1 , X2 ) is the expected correlation between X1 and X2 . Assumption 2 is a rank condition similar to that in [9]. Assumption 3. ρ([L, R], W) has k distinct singular values. Assumption 3 just makes the proof a little cleaner, since if there are repeated singular values, then the singular vectors are not unique. Without it, we would have to phrase results in terms of subspaces with identical singular values. We also need to define the CCA function that computes the left and right singular vectors for a pair of matrices: Definition 1 (CCA). Compute the CCA between two matrices X1 and X2 . Let ΦX1 be a matrix containing the d largest singular vectors for X1 (sorted from the largest on down). Likewise for ΦX2 . Define the function CCAd (X1 , X2 ) = [ΦX1 , ΦX2 ]. When we want just one of these Φ’s, we will use CCAd (X1 , X2 )left = ΦX1 for the left singular vectors and CCAd (X1 , X2 )right = ΦX2 for the right singular vectors. Note that the resulting singular vectors, [ΦX1 , ΦX2 ] can be used to give two redundant estimates, X1 ΦX1 and X2 ΦX2 of the “hidden” state relating X1 and X2 , if such a hidden state exists. Definition 2. Define the symbol “≈” to mean X1 ≈ X2 ⇐⇒ lim X1 = lim X2 n→∞ n→∞ where n is the sample size. Lemma 1. Define A by the following limit of the right singular vectors: CCAk ([L, R], W)right ≈ A. Under assumptions 2, 3 and 1A, such that if CCAk (L, R) ≡ [ΦL , ΦR ] then CCAk ([LΦL , RΦR ], W)right ≈ A. Lemma 1 shows that instead of finding the CCA between the full context and the words, we can take the CCA between the Left and Right contexts, estimate a k dimensional state from them, and take the CCA of that state with the words and get the same result. See the supplementary material for the Proof. ˜ Let Ah denote a matrix formed by stacking h copies of A on top of each other. Right multiplying ˜ L or R by Ah projects each of the words in that context into the k-dimensional reduced rank space. The following theorem addresses the core of the LR-MVL algorithm, showing that there is an A which gives the desired dimensionality reduction. Specifically, it shows that the previous lemma also holds in the reduced rank space. Theorem 1. Under assumptions 1, 2 and 3 there exists a unique matrix A such that if ˜ ˜ ˜ ˜ CCAk (LAh , RAh ) ≡ [ΦL , ΦR ] then ˜ ˜ ˜ ˜ CCAk ([LAh ΦL , RAh ΦR ], W)right ≈ A ˜ where Ah is the stacked form of A. See the supplementary material for the Proof 1 . ˆ It is worth noting that our matrix A corresponds to the matrix U used by [9, 10]. They showed that U is sufficient to compute the probability of a sequence of words generated by an HMM; although we do not show ˆ it here (due to limited space), our A provides a more statistically efficient estimate of U than their U , and hence can also be used to estimate the sequence probabilities. 
1 4 Under the above assumptions, there is asymptotically (in the limit of infinite data) no benefit to first estimating state by finding the CCA between the left and right contexts and then finding the CCA between the estimated state and the words. One could instead just directly find the CCA between the combined left and rights contexts and the words. However, because of the Zipfian distribution of words, many words are rare or even unique, and hence one is not in the asymptotic limit. In this case, CCA between the rare words and context will not be informative, whereas finding the CCA between the left and right contexts gives a good state vector estimate even for unique words. One can then fruitfully find the CCA between the contexts and the estimated state vector for their associated words. 3.3 Using Exponential Smooths In practice, we replace the projected left and right contexts with exponential smooths (weighted average of the previous (or next) token’s state i.e. Zt−1 (or Zt+1 ) and previous (or next) token’s smoothed state i.e. St−1 (or St+1 ).), of them at a few different time scales, thus giving a further dimension reduction by a factor of context length h (say 100 words) divided by the number of smooths (often 5-7). We use a mixture of both very short and very long contexts which capture short and long range dependencies as required by NLP problems as NER, Chunking, WSD etc. Since exponential smooths are linear, we preserve the linearity of our method. 3.4 The LR-MVL Algorithm The LR-MVL algorithm (using exponential smooths) is given in Algorithm 1; it computes the pair of CCAs described above in Theorem 1. Algorithm 1 LR-MVL Algorithm - Learning from Large amounts of Unlabeled Data 1: Input: Token sequence Wn×v , state space size k, smoothing rates αj 2: Initialize the eigenfeature dictionary A to random values N (0, 1). 3: repeat 4: Set the state Zt (1 < t ≤ n) of each token wt to the eigenfeature vector of the corresponding word. Zt = (Aw : w = wt ) 5: Smooth the state estimates before and after each token to get a pair of views for each smoothing rate αj . (l,j) (l,j) = (1 − αj )St−1 + αj Zt−1 // left view L St (r,j) (r,j) j St = (1 − α )St+1 + αj Zt+1 // right view R. (l,j) (r,j) th where the t rows of L and R are, respectively, concatenations of the smooths St and St for (j) each of the α s. 6: Find the left and right canonical correlates, which are the eigenvectors Φl and Φr of (L L)−1 L R(R R)−1 R LΦl = λΦl . (R R)−1 R L(L L)−1 L RΦr = λΦr . 7: Project the left and right views on to the space spanned by the top k/2 left and right CCAs respectively (k/2) (k/2) Xl = LΦl and Xr = RΦr (k/2) (k/2) where Φl , Φr are matrices composed of the singular vectors of Φl , Φr with the k/2 largest magnitude singular values. Estimate the state for each word wt as the union of the left and right estimates: Z = [Xl , Xr ] 8: Estimate the eigenfeatures of each word type, w, as the average of the states estimated for that word. Aw = avg(Zt : wt = w) 9: Compute the change in A from the previous iteration 10: until |∆A| < 11: Output: Φk , Φk , A . r l A few iterations (∼ 5) of the above algorithm are sufficient to converge to the solution. (Since the problem is convex, there is a single solution, so there is no issue of local minima.) As [14] show for PCA, one can start with a random matrix that is only slightly larger than the true rank k of the correlation matrix, and with extremely high likelihood converge in a few iterations to within a small distance of the true principal components. 
In our case, if the assumptions detailed above (1, 1A, 2 and 3) are satisfied, our method converges equally rapidly to the true canonical variates. As mentioned earlier, we get further dimensionality reduction in Step 5, by replacing the Left and Right context matrices with a set of exponentially smoothed values of the reduced rank projections of the context words. Step 6 finds the CCA between the Left and Right contexts. Step 7 estimates 5 the state by combining the estimates from the left and right contexts, since we don’t know which will best estimate the state. Step 8 takes the CCA between the estimated state Z and the matrix of words W. Because W is a vector of indicator functions, this CCA takes the trivial form of a set of averages. Once we have estimated the CCA model, it is used to generate context specific embeddings for the tokens from training, development and test sets (as described in Algorithm 2). These embeddings are further supplemented with other baseline features and used in a supervised learner to predict the label of the token. Algorithm 2 LR-MVL Algorithm -Inducing Context Specific Embeddings for Train/Dev/Test Data 1: Input: Model (Φk , Φk , A) output from above algorithm and Token sequences Wtrain , (Wdev , Wtest ) r l 2: Project the left and right views L and R after smoothing onto the space spanned by the top k left and right CCAs respectively Xl = LΦk and Xr = RΦk r l and the words onto the eigenfeature dictionary Xw = W train A 3: Form the final embedding matrix Xtrain:embed by concatenating these three estimates of state Xtrain:embed = [Xl , Xw , Xr ] 4: Output: The embedding matrices Xtrain:embed , (Xdev:embed , Xtest:embed ) with context-specific representations for the tokens. These embeddings are augmented with baseline set of features mentioned in Sections 4.1.1 and 4.1.2 before learning the final classifier. Note that we can get context “oblivious” embeddings i.e. one embedding per word type, just by using the eigenfeature dictionary (Av×k ) output by Algorithm 1. 4 Experimental Results In this section we present the experimental results of LR-MVL on Named Entity Recognition (NER) and Syntactic Chunking tasks. We compare LR-MVL to state-of-the-art semi-supervised approaches like [1] (Alternating Structures Optimization (ASO)) and [2] (Semi-supervised extension of CRFs) as well as embeddings like C&W;, HLBL and Brown Clustering. 4.1 Datasets and Experimental Setup For the NER experiments we used the data from CoNLL 2003 shared task and for Chunking experiments we used the CoNLL 2000 shared task data2 with standard training, development and testing set splits. The CoNLL ’03 and the CoNLL ’00 datasets had ∼ 204K/51K/46K and ∼ 212K/ − /47K tokens respectively for Train/Dev./Test sets. 4.1.1 Named Entity Recognition (NER) We use the same set of baseline features as used by [15, 16] in their experiments. The detailed list of features is as below: • Current Word wi ; Its type information: all-capitalized, is-capitalized, all-digits and so on; Prefixes and suffixes of wi • Word tokens in window of 2 around the current word i.e. (wi−2 , wi−1 , wi , wi+1 , wi+2 ); and capitalization pattern in the window. d = • Previous two predictions yi−1 and yi−2 and conjunction of d and yi−1 • Embedding features (LR-MVL, C&W;, HLBL, Brown etc.) in a window of 2 around the current word (if applicable). Following [17] we use regularized averaged perceptron model with above set of baseline features for the NER task. 
We also used their BILOU text chunk representation and fast greedy inference as it was shown to give superior performance. 2 More details about the data and competition are available at http://www.cnts.ua.ac.be/ conll2003/ner/ and http://www.cnts.ua.ac.be/conll2000/chunking/ 6 We also augment the above set of baseline features with gazetteers, as is standard practice in NER experiments. We tuned our free parameter namely the size of LR-MVL embedding on the development and scaled our embedding features to have a 2 norm of 1 for each token and further multiplied them by a normalization constant (also chosen by cross validation), so that when they are used in conjunction with other categorical features in a linear classifier, they do not exert extra influence. The size of LR-MVL embeddings (state-space) that gave the best performance on the development set was k = 50 (50 each for Xl , Xw , Xr in Algorithm 2) i.e. the total size of embeddings was 50×3, and the best normalization constant was 0.5. We omit validation plots due to paucity of space. 4.1.2 Chunking For our chunking experiments we use a similar base set of features as above: • Current Word wi and word tokens in window of 2 around the current word i.e. d = (wi−2 , wi−1 , wi , wi+1 , wi+2 ); • POS tags ti in a window of 2 around the current word. • Word conjunction features wi ∩ wi+1 , i ∈ {−1, 0} and Tag conjunction features ti ∩ ti+1 , i ∈ {−2, −1, 0, 1} and ti ∩ ti+1 ∩ ti+2 , i ∈ {−2, −1, 0}. • Embedding features in a window of 2 around the current word (when applicable). Since CoNLL 00 chunking data does not have a development set, we randomly sampled 1000 sentences from the training data (8936 sentences) for development. So, we trained our chunking models on 7936 training sentences and evaluated their F1 score on the 1000 development sentences and used a CRF 3 as the supervised classifier. We tuned the size of embedding and the magnitude of 2 regularization penalty in CRF on the development set and took log (or -log of the magnitude) of the value of the features4 . The regularization penalty that gave best performance on development set was 2 and here again the best size of LR-MVL embeddings (state-space) was k = 50. Finally, we trained the CRF on the entire (“original”) training data i.e. 8936 sentences. 4.1.3 Unlabeled Data and Induction of embeddings For inducing the embeddings we used the RCV1 corpus containing Reuters newswire from Aug ’96 to Aug ’97 and containing about 63 million tokens in 3.3 million sentences5 . Case was left intact and we did not do the “cleaning” as done by [18, 16] i.e. remove all sentences which are less than 90% lowercase a-z, as our multi-view learning approach is robust to such noisy data, like news byline text (mostly all caps) which does not correlate strongly with the text of the article. We induced our LR-MVL embeddings over a period of 3 days (70 core hours on 3.0 GHz CPU) on the entire RCV1 data by performing 4 iterations, a vocabulary size of 300k and using a variety of smoothing rates (α in Algorithm 1) to capture correlations between shorter and longer contexts α = [0.005, 0.01, 0.05, 0.1, 0.5, 0.9]; theoretically we could tune the smoothing parameters on the development set but we found this mixture of long and short term dependencies to work well in practice. As far as the other embeddings are concerned i.e. C&W;, HLBL and Brown Clusters, we downloaded them from http://metaoptimize.com/projects/wordreprs. 
The details about their induction and parameter tuning can be found in [16]; we report their best numbers here. It is also worth noting that the unsupervised training of LR-MVL was (> 1.5 times)6 faster than other embeddings. 4.2 Results The results for NER and Chunking are shown in Tables 1 and 2, respectively, which show that LR-MVL performs significantly better than state-of-the-art competing methods on both NER and Chunking tasks. 3 http://www.chokkan.org/software/crfsuite/ Our embeddings are learnt using a linear model whereas CRF is a log-linear model, so to keep things on same scale we did this normalization. 5 We chose this particular dataset to make a fair comparison with [1, 16], who report results using RCV1 as unlabeled data. 6 As some of these embeddings were trained on GPGPU which makes our method even faster comparatively. 4 7 Embedding/Model Baseline C&W;, 200-dim HLBL, 100-dim Brown 1000 clusters Ando & Zhang ’05 Suzuki & Isozaki ’08 LR-MVL (CO) 50 × 3-dim LR-MVL 50 × 3-dim HLBL, 100-dim C&W;, 200-dim Brown, 1000 clusters LR-MVL (CO) 50 × 3-dim LR-MVL 50 × 3-dim No Gazetteers With Gazetteers F1-Score Dev. Set Test Set 90.03 84.39 92.46 87.46 92.00 88.13 92.32 88.52 93.15 89.31 93.66 89.36 93.11 89.55 93.61 89.91 92.91 89.35 92.98 88.88 93.25 89.41 93.91 89.89 94.41 90.06 Table 1: NER Results. Note: 1). LR-MVL (CO) are Context Oblivious embeddings which are gotten from (A) in Algorithm 1. 2). F1-score= Harmonic Mean of Precision and Recall. 3). The current state-of-the-art for this NER task is 90.90 (Test Set) but using 700 billion tokens of unlabeled data [19]. Embedding/Model Baseline HLBL, 50-dim C&W;, 50-dim Brown 3200 Clusters Ando & Zhang ’05 Suzuki & Isozaki ’08 LR-MVL (CO) 50 × 3-dim LR-MVL 50 × 3-dim Test Set F1-Score 93.79 94.00 94.10 94.11 94.39 94.67 95.02 95.44 Table 2: Chunking Results. It is important to note that in problems like NER, the final accuracy depends on performance on rare-words and since LR-MVL is robustly able to correlate past with future views, it is able to learn better representations for rare words resulting in overall better accuracy. On rare-words (occurring < 10 times in corpus), we got 11.7%, 10.7% and 9.6% relative reduction in error over C&W;, HLBL and Brown respectively for NER; on chunking the corresponding numbers were 6.7%, 7.1% and 8.7%. Also, it is worth mentioning that modeling the context in embeddings gives decent improvements in accuracies on both NER and Chunking problems. For the case of NER, the polysemous words were mostly like Chicago, Wales, Oakland etc., which could either be a location or organization (Sports teams, Banks etc.), so when we don’t use the gazetteer features, (which are known lists of cities, persons, organizations etc.) we got higher increase in F-score by modeling context, compared to the case when we already had gazetteer features which captured most of the information about polysemous words for NER dataset and modeling the context didn’t help as much. The polysemous words for Chunking dataset were like spot (VP/NP), never (VP/ADVP), more (NP/VP/ADVP/ADJP) etc. and in this case embeddings with context helped significantly, giving 3.1 − 6.5% relative improvement in accuracy over context oblivious embeddings. 5 Summary and Conclusion In this paper, we presented a novel CCA-based multi-view learning method, LR-MVL, for large scale sequence learning problems such as arise in NLP. 
5 Summary and Conclusion

In this paper we presented a novel CCA-based multi-view learning method, LR-MVL, for large scale sequence learning problems such as arise in NLP. LR-MVL is a spectral method that works in a low dimensional state-space, so it is computationally efficient and can be trained on large amounts of unlabeled data; moreover, it does not get stuck in local optima like an EM-trained HMM. The embeddings learnt using LR-MVL can be used as features with any supervised learner. LR-MVL has strong theoretical grounding, is much simpler and faster than competing methods, and achieves state-of-the-art accuracies on NER and Chunking problems.

Acknowledgements: The authors would like to thank Alexander Yates, Ted Sandler and the three anonymous reviewers for providing valuable feedback. We would also like to thank Lev Ratinov and Joseph Turian for answering our questions regarding their paper [16].

References
[1] Ando, R., Zhang, T.: A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research 6 (2005) 1817–1853
[2] Suzuki, J., Isozaki, H.: Semi-supervised sequential labeling and segmentation using giga-word scale unlabeled data. In: ACL. (2008)
[3] Brown, P., deSouza, P., Mercer, R., Pietra, V.D., Lai, J.: Class-based n-gram models of natural language. Comput. Linguist. 18 (December 1992) 467–479
[4] Pereira, F., Tishby, N., Lee, L.: Distributional clustering of English words. In: 31st Annual Meeting of the ACL. (1993) 183–190
[5] Huang, F., Yates, A.: Distributional representations for handling sparsity in supervised sequence-labeling. ACL '09, Stroudsburg, PA, USA, Association for Computational Linguistics (2009) 495–503
[6] Collobert, R., Weston, J.: A unified architecture for natural language processing: deep neural networks with multitask learning. ICML '08, New York, NY, USA, ACM (2008) 160–167
[7] Mnih, A., Hinton, G.: Three new graphical models for statistical language modelling. ICML '07, New York, NY, USA, ACM (2007) 641–648
[8] Dumais, S., Furnas, G., Landauer, T., Deerwester, S., Harshman, R.: Using latent semantic analysis to improve access to textual information. In: SIGCHI Conference on Human Factors in Computing Systems, ACM (1988) 281–285
[9] Hsu, D., Kakade, S., Zhang, T.: A spectral algorithm for learning hidden Markov models. In: COLT. (2009)
[10] Siddiqi, S., Boots, B., Gordon, G.J.: Reduced-rank hidden Markov models. In: AISTATS 2010. (2010)
[11] Song, L., Boots, B., Siddiqi, S.M., Gordon, G.J., Smola, A.J.: Hilbert space embeddings of hidden Markov models. In: ICML. (2010)
[12] Hotelling, H.: Canonical correlation analysis (CCA). Journal of Educational Psychology (1935)
[13] Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: COLT '98. (1998) 92–100
[14] Halko, N., Martinsson, P.G., Tropp, J.: Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. (Dec 2010)
[15] Zhang, T., Johnson, D.: A robust risk minimization based named entity recognition system. CONLL '03 (2003) 204–207
[16] Turian, J., Ratinov, L., Bengio, Y.: Word representations: a simple and general method for semi-supervised learning. ACL '10, Stroudsburg, PA, USA, Association for Computational Linguistics (2010) 384–394
[17] Ratinov, L., Roth, D.: Design challenges and misconceptions in named entity recognition. In: CONLL. (2009) 147–155
[18] Liang, P.: Semi-supervised learning for natural language. Master's thesis, Massachusetts Institute of Technology (2005)
[19] Lin, D., Wu, X.: Phrase clustering for discriminative learning. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. ACL '09, Stroudsburg, PA, USA, Association for Computational Linguistics (2009) 1030–1038

2 0.73485267 167 nips-2011-Maximum Covariance Unfolding : Manifold Learning for Bimodal Data

Author: Vijay Mahadevan, Chi W. Wong, Jose C. Pereira, Tom Liu, Nuno Vasconcelos, Lawrence K. Saul

Abstract: We propose maximum covariance unfolding (MCU), a manifold learning algorithm for simultaneous dimensionality reduction of data from different input modalities. Given high dimensional inputs from two different but naturally aligned sources, MCU computes a common low dimensional embedding that maximizes the cross-modal (inter-source) correlations while preserving the local (intra-source) distances. In this paper, we explore two applications of MCU. First we use MCU to analyze EEG-fMRI data, where an important goal is to visualize the fMRI voxels that are most strongly correlated with changes in EEG traces. To perform this visualization, we augment MCU with an additional step for metric learning in the high dimensional voxel space. Second, we use MCU to perform cross-modal retrieval of matched image and text samples from Wikipedia. To manage large applications of MCU, we develop a fast implementation based on ideas from spectral graph theory. These ideas transform the original problem for MCU, one of semidefinite programming, into a simpler problem in semidefinite quadratic linear programming. 1

3 0.55308884 74 nips-2011-Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection

Author: Richard Socher, Eric H. Huang, Jeffrey Pennington, Christopher D. Manning, Andrew Y. Ng

Abstract: Paraphrase detection is the task of examining two sentences and determining whether they have the same meaning. In order to obtain high accuracy on this task, thorough syntactic and semantic analysis of the two statements is needed. We introduce a method for paraphrase detection based on recursive autoencoders (RAE). Our unsupervised RAEs are based on a novel unfolding objective and learn feature vectors for phrases in syntactic trees. These features are used to measure the word- and phrase-wise similarity between two sentences. Since sentences may be of arbitrary length, the resulting matrix of similarity measures is of variable size. We introduce a novel dynamic pooling layer which computes a fixed-sized representation from the variable-sized matrices. The pooled representation is then used as input to a classifier. Our method outperforms other state-of-the-art approaches on the challenging MSRP paraphrase corpus. 1
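The dynamic pooling layer described in this abstract maps a variable-sized similarity matrix to a fixed-size grid. The sketch below is a generic illustration of that idea rather than the authors' implementation: the grid size, the min-pooling choice and the row/column repetition for undersized inputs are assumptions made here for the example.

```python
import numpy as np

def dynamic_pool(sim, n_pool=15, pool_fn=np.min):
    """Map a variable-sized similarity matrix to a fixed n_pool x n_pool grid by
    splitting rows and columns into roughly equal bins and pooling each block."""
    # If a dimension is smaller than n_pool, repeat it so every bin is non-empty
    # (one simple way to handle short sentences; the paper may do this differently).
    if sim.shape[0] < n_pool:
        sim = np.repeat(sim, int(np.ceil(n_pool / sim.shape[0])), axis=0)
    if sim.shape[1] < n_pool:
        sim = np.repeat(sim, int(np.ceil(n_pool / sim.shape[1])), axis=1)
    rows = np.array_split(np.arange(sim.shape[0]), n_pool)
    cols = np.array_split(np.arange(sim.shape[1]), n_pool)
    pooled = np.empty((n_pool, n_pool))
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            pooled[i, j] = pool_fn(sim[np.ix_(r, c)])
    return pooled

# Two sentences of different lengths give a 7 x 11 similarity matrix;
# pooling always yields a fixed 15 x 15 input for the classifier.
S = np.random.rand(7, 11)
print(dynamic_pool(S).shape)   # (15, 15)
```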

4 0.53078651 125 nips-2011-Identifying Alzheimer's Disease-Related Brain Regions from Multi-Modality Neuroimaging Data using Sparse Composite Linear Discrimination Analysis

Author: Shuai Huang, Jing Li, Jieping Ye, Teresa Wu, Kewei Chen, Adam Fleisher, Eric Reiman

Abstract: Diagnosis of Alzheimer’s disease (AD) at the early stage of the disease development is of great clinical importance. Current clinical assessment that relies primarily on cognitive measures proves low sensitivity and specificity. The fast growing neuroimaging techniques hold great promise. Research so far has focused on single neuroimaging modality. However, as different modalities provide complementary measures for the same disease pathology, fusion of multi-modality data may increase the statistical power in identification of disease-related brain regions. This is especially true for early AD, at which stage the disease-related regions are most likely to be weakeffect regions that are difficult to be detected from a single modality alone. We propose a sparse composite linear discriminant analysis model (SCLDA) for identification of disease-related brain regions of early AD from multi-modality data. SCLDA uses a novel formulation that decomposes each LDA parameter into a product of a common parameter shared by all the modalities and a parameter specific to each modality, which enables joint analysis of all the modalities and borrowing strength from one another. We prove that this formulation is equivalent to a penalized likelihood with non-convex regularization, which can be solved by the DC (difference of convex functions) programming. We show that in using the DC programming, the property of the nonconvex regularization in terms of preserving weak-effect features can be nicely revealed. We perform extensive simulations to show that SCLDA outperforms existing competing algorithms on feature selection, especially on the ability for identifying weak-effect features. We apply SCLDA to the Magnetic Resonance Imaging (MRI) and Positron Emission Tomography (PET) images of 49 AD patients and 67 normal controls (NC). Our study identifies disease-related brain regions consistent with findings in the AD literature. 1 In tro du cti on Alzheimer’s disease (AD) is a fatal, neurodegenerative disorder that currently affects over five million people in the U.S. It leads to substantial, progressive neuron damage that is irreversible, which eventually causes death. Early diagnosis of AD is of great clinical importance, because disease-modifying therapies given to patients at the early stage of their disease development will have a much better effect in slowing down the disease progression and helping preserve some cognitive functions of the brain. However, current clinical assessment that majorly relies on cognitive measures proves low sensitivity and specificity in early diagnosis of AD. This is because these cognitive measures are vulnerable to the confounding effect from some non-AD related factors such as patients’ mood, and presence of other illnesses or major life events [1]. The confounding effect is especially severe in the diagnosis of early AD, at which time cognitive 1 impairment is not yet apparent. On the other hand, fast growing neuroimaging techniques, such as Magnetic Resonance Imaging (MRI) and Positron Emission Tomography (PET), provide great opportunities for improving early diagnosis of AD, due to their ability for overcoming the limitations of conventional cognitive measures. There are two major categories of neuroimaging techniques, i.e., functional and structure neuroimaging. MRI is a typical structural neuroimaging technique, which allows for visualization of brain anatomy. 
PET is a typical functional neuroimaging technique, which measures the cerebral metabolic rate for glucose. Both techniques have been extensively applied to AD studies. For example, studies based on MRI have consistently revealed brain atrophy that involves the hippocampus and entorhinal cortex [2-6]; studies based on PET have revealed functional abnormality that involves the posterior temporal and parietal association cortices [8-10], posterior cingulate, precuneus, and medial temporal cortices [11-14]. There is overlap between the disease-related brain regions detected by MRI and those by PET, such as regions in the hippocampus area and the mesia temporal lobe [15-17]. This is not surprising since MRI and PET are two complementary measures for the same disease pathology, i.e., it starts mainly in the hippocampus and entorhinal cortex, and subsequently spreads throughout temporal and orbiogrontal cortext, poseterior cingulated, and association cortex [7]. However, most existing studies only exploited structural and functional alterations in separation, which ignore the potential interaction between them. The fusion of MRI and PET imaging modalities will increase the statistical power in identification of disease-related brain regions, especially for early AD, at which stage the disease-related regions are most likely to be weakeffect regions that are difficult to be detected from MRI or PET alone. Once a good set of diseaserelated brain regions is identified, they can be further used to build an effective classifier (i.e., a biomarker from the clinical perspective) to enable AD diagnose with high sensitivity and specificity. The idea of multi-modality data fusion in the research of neurodegenerative disorders has been exploited before. For example, a number of models have been proposed to combine electroencephalography (EEG) and functional MRI (fMRI), including parallel EEG-fMRI independent component analysis [18]-[19], EEG-informed fMRI analysis [18] [20], and variational Bayesian methods [18] [21]. The purpose of these studies is different from ours, i.e., they aim to combine EEG, which has high temporal resolution but low spatial resolution, and fMRI, which has low temporal resolution but high spatial resolution, so as to obtain an accurate picture for the whole brain with both high spatial and high temporal resolutions [18]-[21]. Also, there have been some studies that include both MRI and PET data for classification [15], [22][25]. However, these studies do not make use of the fact that MRI and PET measure the same underlying disease pathology from two complementary perspectives (i.e., structural and functional perspectives), so that the analysis of one imaging modality can borrow strength from the other. In this paper, we focus on the problem of identifying disease-related brain regions from multimodality data. This is actually a variable selection problem. Because MRI and PET data are highdimensional, regularization techniques are needed for effective variable selection, such as the L1regularization technique [25]-[30] and the L2/L1-regularization technique [31]. In particular, L2/L1-regularization has been used for variable selection jointly on multiple related datasets, also known as multitask feature selection [31], which has a similar nature to our problem. Note that both L1- and L2/L1-regularizations are convex regularizations, which have gained them popularity in the literature. 
On the other hand, there is increasing evidence that these convex regularizations tend to produce too severely shrunken parameter estimates. Therefore, these convex regularizations could lead to miss-identification of the weak-effect disease-related brain regions, which unfortunately make up a large portion of the disease-related brain regions especially in early AD. Also, convex regularizations tend to select many irrelevant variables to compensate for the overly severe shrinkage in the parameters of the relevant variables. Considering these limitations of convex regularizations, we study non-convex regularizations [33]-[35] [39], which have the advantage of producing mildly or slightly shrunken parameter estimates so as to be able to preserve weak-effect disease-related brain regions and the advantage of avoiding selecting many disease-irrelevant regions. Specifically in this paper, we propose a sparse composite linear discriminant analysis model, called SCLDA, for identification of disease-related brain regions from multi-modality data. The contributions of our paper include: 2 • • • 2 Formulation: We propose a novel formulation that decomposes each LDA parameter into a product of a common parameter shared by all the data sources and a parameter specific to each data source, which enables joint analysis of all the data sources and borrowing strength from one another. We further prove that this formulation is equivalent to a penalized likelihood with non-convex regularization. Algorithm: We show that the proposed non-convex optimization can be solved by the DC (difference of convex functions) programming [39]. More importantly, we show that in using the DC programming, the property of the non-convex regularization in terms of preserving weak-effect features can be nicely revealed. Application: We apply the proposed SCLDA to the PET and MRI data of early AD patients and normal controls (NC). Our study identifies disease-related brain regions that are consistent with the findings in the AD literature. AD vs. NC classification based on these identified regions achieves high accuracy, which makes the proposed method a useful tool for clinical diagnosis of early AD. In contrast, the convex-regularization based multitask feature selection method [31] identifies more irrelevant brain regions and yields a lower classification accuracy. Rev iew o f L D A a nd it s v ari an ts ! Denote đ?’ = đ?‘?!, đ?‘?! , ‌ , đ?‘?! as the variables and assume there are đ??˝ classes. Denote đ?‘ ! as the ! sample size of class đ?‘— and đ?‘ = !!! đ?‘ ! is the total sample size. Let đ??ł = đ?’›! , đ?’›! , ‌ , đ?’›! ! be the đ?‘ Ă—đ?‘? sample matrix, where đ?’›! is the đ?‘– !! sample and đ?‘” đ?‘– is its associated class index. Let ! ! ! ! đ?›?! = !!! đ?’›! be the overall sample mean, !!!,! ! !! đ?’›! be the sample mean of class đ?‘—, đ?›? = !! ! ! đ?’› − đ?›? ! !!! ! ! ! đ??–! = !! !!!,! ! !! ! ! đ?‘ đ??– be the ! !!! ! ! đ??“= ! đ?’›! − đ?›? đ?’›! − đ?›?! ! be the total normalized sum of squares and products (SSQP), đ?’›! − đ?›?! ! be the normalized class SSQP of class đ?‘—, and đ??–= overall normalized class SSQP. The objective of LDA is to seek for a đ?‘?Ă—đ?‘ž linear transformation matrix, đ?›‰! , with which đ?›‰! đ?‘? ! retains the maximum amount of class discrimination information in đ?‘?. To achieve this objective, one approach is to seek for the đ?›‰! that maximizes the between-class variance of đ?›‰! đ?‘?, which can ! be measured by tr(đ?›‰! đ??“đ?›‰! ), while minimizing the within-class variance of đ?›‰! 
đ?‘?, which can be ! ! measured by tr(đ?›‰! đ??–đ?›‰! ). Here tr() is the matrix trace operator. This is equivalent to solving the ! following optimization problem:  đ?›‰! = argmax đ?›‰! đ??­đ??Ť(đ?›‰! đ??“đ?›‰! ) ! đ??­đ??Ť(đ?›‰! đ??–đ?›‰! ) ! . (1) Note that đ?›‰! corresponds to the right eigenvector of đ??– !! đ??“ and đ?‘ž = đ??˝ − 1. Another approach used for finding the đ?›‰! is to use the maximum likelihood estimation for Gaussian populations that have different means and a common covariance matrix. Specifically, as in [36], this approach is developed by assuming the class distributions are Gaussian with a common covariance matrix, and their mean differences lie in a đ?‘ž-dimensional subspace of the đ?‘?dimensional original variable space. Hastie [37] further generalized this approach by assuming that class distributions are a mixture of Gaussians, which has more flexibility than LDA. However, both approaches assume a common covariance matrix for all the classes, which is too strict in many practical applications, especially in high-dimensional problems where the covariance matrices of different classes tend to be different. Consequently, the linear transformation explored by LDA may not be effective. In [38], a heterogeneous LDA (HLDA) is developed to relax this assumption. The HLDA seeks for a đ?‘?Ă—đ?‘? linear transformation matrix, đ?›‰, in which only the first đ?‘ž columns (đ?›‰! ) contain discrimination information and the remaining đ?‘? − đ?‘ž columns (đ?›‰!!! ) contain no discrimination information. For Gaussian models, assuming lack of discrimination information is equivalent to assuming that the means and the covariance matrices of the class distributions are the same for all 3 classes, in the đ?‘? − đ?‘ž dimensional subspace. Following this, the log-likelihood function of đ?›‰ can be written as below [38]: ! đ?‘™ đ?›‰|đ??™ = − log đ?›‰! đ??“đ?›‰!!! − !!! ! !! ! !!! ! log đ?›‰! đ??–! đ?›‰! + đ?‘ log đ?›‰ , ! (2) Here đ??€ denotes the determinant of matrix đ??€. There is no closed-form solution for đ?›‰. As a result, numeric methods are needed to derive the maximum likelihood estimate for đ?›‰. It is worth mentioning that the LDA in the form of (1) is a special case of the HLDA [38]. 3 T he p ro po sed SC L DA Suppose that there are multiple data sources, đ??™ ! , đ??™ ! , ‌ , đ??™ ! , with each data source capturing one aspect of the same set of physical variables, e.g., the MRI and PET capture the structural and functional aspects of the same brain regions. For each data source, đ??™ ! , there is a linear transformation matrix đ?›‰ ! , which retains the maximum amount of class discrimination information in đ??™ ! . A naive way for estimating đ?šŻ = đ?›‰ ! , đ?›‰ ! , ‌ , đ?›‰ ! is to separately estimate each đ?›‰ ! based on đ??™ ! . Apparently, this approach does not take advantage of the fact that all the data sources measure the same physical process. Also, when the sample size of each data source is small, this approach may lead to unreliable estimates for the đ?›‰ ! ’s. To tackle these problems, we propose a composite parameterization following the line as [40]. ! Specifically, let đ?œƒ!,! be the element at the k-th row and l-th column of đ?›‰ ! . We treat ! ! ! ! ! ! đ?œƒ!,! , đ?œƒ!,! , ‌ , đ?œƒ!,! as an interrelated group and parameterize each đ?œƒ!,! as đ?œƒ!,! = đ?›ż! đ?›ž!,! , for 1 ≤ đ?‘˜ ≤ đ?‘?, 1 ≤ đ?‘™ ≤ đ?‘? and 1 ≤ đ?‘š ≤ đ?‘€. In order to assure identifiability, we restrict each đ?›ż! ≼ 0. Here, đ?›ż! 
represents the common information shared by all the data sources about variable đ?‘˜, while ! đ?›ž!,! represents the specific information only captured by the đ?‘š !! data source. For example, for disease-related brain region identification, if đ?›ż! = 0, it means that all the data sources indicate variable đ?‘˜ is not a disease-related brain region; otherwise, variable đ?‘˜ is a disease-related brain ! region. đ?›ž!,! ≠ 0 means that the đ?‘š !! data source supports this assertion. The log-likelihood function of đ?šŻ is: đ?‘™! đ?šŻ| đ??™ ! , đ??™ ! , ‌ , đ??™ ! = ! !!! − ! ! ! đ?‘ ! ! log đ?›‰!!! đ??“ ! log đ?›‰ ! ! ! ! đ?›‰!!! − !! ! !!! ! ! log đ?›‰! đ??–! ! ! đ?›‰! +  , which follows the same line as (2). However, our formulation includes the following constraints on đ?šŻ: ! ! đ?œƒ!,! = đ?›ż! đ?›ž!,! , đ?›ż! ≼ 0, 1 ≤ đ?‘˜, đ?‘™ ≤ đ?‘?, 1 ≤ đ?‘š ≤ đ?‘€. Let ! đ?šŞ = đ?›ž!,! , 1 ≤ đ?‘˜ ≤ đ?‘?, 1 ≤ đ?‘™ ≤ đ?‘?, 1 ≤ đ?‘š ≤ đ?‘€ and (3) đ?šż = đ?›ż! , 1 ≤ đ?‘˜ ≤ đ?‘? . An intuitive choice for estimation of đ?šŞ and đ?šż is to maximize the đ?‘™! đ?šŻ| đ??™ ! , đ??™ ! , ‌ , đ??™ !   subject to the constraints in (3). However, it can be anticipated that no element in the estimated đ?šŞ and đ?šż will be exactly zero, resulting in a model which is not interpretable, i.e., poor identification of diseaserelated regions. Thus, we encourage the estimation of đ?šż and the first đ?‘ž columns of đ?šŞ (i.e., the columns containing discrimination information) to be sparse, by imposing the L1-penalty on đ?šŞ and đ?šż. By doing so, we obtain the following optimization problem for the proposed SCLDA: đ?šŻ = argmin đ?šŻ đ?‘™! đ?šŻ| đ??™ ! , đ??™ ! , ‌ , đ??™ ! ! đ?œƒ!,! = = argmin đ?šŻ −đ?‘™! đ?šŻ| đ??™ ! , đ??™ ! , ‌ , đ??™ ! ! đ?œ†! !,!,! đ?›ž!,!   , subject to ! đ?›ż! đ?›ž!,! , đ?›ż! ≼ 0, 1 ≤ đ?‘˜, đ?‘™ ≤ +   đ?œ†! ! đ?›ż! + đ?‘?, 1 ≤ đ?‘š ≤ đ?‘€.                                (4)  Here, đ?œ†! and đ?œ†! control the degrees of sparsity of đ?šż and đ?šŞ, respectively. Tuning of two regularization parameters is difficult. Fortunately, we prove the following Theorem which indicates that formulation (4) is equivalent to a simpler optimization problem involving only one regularization parameter. 4 Theorem 1: The optimization problem (4) is equivalent to the following optimization problem: đ?šŻ = argmin đ?šŻ đ?‘™! đ?šŻ| đ??™ = argmin đ?šŻ −đ?‘™! đ?šŻ| đ??™ with đ?œ† = 2 ! ! , đ??™ ! ! , đ??™ ,‌, đ??™ ! ! ,‌, đ??™ +  đ?œ† ! ! ! !!! ! !!! ! đ?œƒ!,!  , (5) ! đ?œ†! đ?œ†! , i.e., đ?œƒ!,! = đ?œƒ!,! . The proof can be found in the supplementary document. It can also be found in the supplementary material how this formulation will serve the purpose of the composite parameterization, i.e., common information and specific information can be estimated separately and simultaneously. The optimization problem (5) is a non-convex optimization problem that is difficult to solve. We address this problem by using an iterative two-stage procedure known as Difference of Convex functions (DC) programming [39]. A full description of the algorithm can be found in the supplemental material. 4 S im ula tion s tu d ies In this section, we conduct experiments to compare the performance of the proposed SCLDA with sparse LDA (SLDA) [42] and multitask feature selection [31]. Specifically, as we focus on LDA, we use the multitask feature selection method developed in [31] on LDA, denoted as MSLDA. Both SLDA and MSLDA adopt convex regularizations. 
Specifically, SLDA selects features from one single data source with L1-regularization; MSLDA selects features from multiple data sources with L2/L1 regularization. We evaluate the performances of these three methods across various parameters settings, including the number of variables, đ?‘?, the number of features, đ?‘™, the number of data sources, M, sample size, đ?‘›, and the degree of overlapping of the features across different data sources, s% (the larger the đ?‘ %, the more shared features among the datasets). Definition of đ?‘ % can be found in the simulation procedure that is included in the supplemental material. For each specification of the parameters settings, đ?‘€ datasets can be generated following the simulation procedure. We apply the proposed SCLDA to the đ?‘€ datasets, and identify one feature vector đ?›‰(!) for each dataset, with đ?œ† and đ?‘ž chosen by the method described in section 3.3. The result can be described by the number of true positives (TPs) as well as the number of false positives (FPs). Here, true positives are the non-zero elements in the learned feature vector đ?›‰(!) which are also non-zero in the đ?›ƒ! ; false positives are the non-zero elements in đ?›‰(!) , which are actually zero in đ?›ƒ! . As there are đ?‘š pairs of the TPs and FPs for the đ?‘€ datasets, the average TP over the M datasets and the average FP over the M datasets are used as the performance measures. This procedure (i.e., from data simulation, to SCLDA, to TPs and FPs generation) can be repeated for đ??ľ times, and đ??ľ pairs of average TP and average FP are collected for SCLDA. In a similar way, we can obtain đ??ľ pairs of average TP and average FP for both SLDA and MSLDA. Figures 1 (a) and (b) show comparison between SCLDA, SLDA and MSLDA by scattering the average TP against the average FP for each method. Each point corresponds to one of the N repetitions. The comparison is across various parameters settings, including the number of variables (đ?‘? = 100,200,500), the number of data sources (đ?‘š = 2,5,10), and the degree of overlapping of the features across different data sources (đ?‘ % = 90%, 70%). Additionally, đ?‘› đ?‘? is kept constant, đ?‘› đ?‘? = 1. A general observation is that SCLDA is better than SLDA and MSLDA across all the parameter settings. Some specific trends can be summarized as follows: (i) Both SCLDA and MSLDA outperform SLDA in terms of TPs; SCLDA further outperforms MSLDA in terms of FPs. (ii) In Figure 2 (a), rows correspond to different numbers of data sources, i.e., đ?‘š = 2,5,10 respectively. It is clear that the advantage of SCLDA over both SLDA and MSLDA is more significant when there are more data sources. Also, MSLDA performs consistently better than SLDA. Similar phenomena are shown in Figure 2 (b). This demonstrates that in analyzing each data source, both SCLDA and MSLDA are able to make use of the information contained in other data sources. SCLDA can use this information more efficiently, as SCLDA can produce less shrunken parameter estimates than MSLDA and thus it is able to preserve weak-effect features. (iii) Comparing Figures 2 (a) and (b), it can be seen that the advantage of SCLDA or MSLDA over SLDA is more significant as the data sources have more degree of overlapping in their 5 features. Finally, although not presented here, our simulation shows that the three methods perform similarly when đ?‘ % = 40 or less. (a) (b) Figure 1: Average numbers of TPs vs FPs for SCLDA (green symbols “+â€?), SLDA (blue symbols “*â€?) 
and MSLDA (red symbols “oâ€?) (a) đ?‘ % = 90%, đ?‘› đ?‘? = 1; (b) đ?‘ % = 70%, đ?‘› đ?‘? = 1 5 C ase st ud y 5.1 Data preprocessing Our study includes 49 AD patient and 67 age-matched normal controls (NC), with each subject of AD or NC being scanned both by PET and MRI. The PET and MRI images can be downloaded from the database by the Alzheimer’s Disease Neuroimaging Initiative. In what follows, we outline the data preprocessing steps. Each image is spatially normalized to the Montreal Neurological Institute (MNI) template, using the affine transformation and subsequent non-linear wraping algorithm [43] implemented in the SPM MATLAB toolbox. This is to ensure that each voxel is located in the same anatomical region for all subjects, so that spatial locations can be reported and interpreted in a consistent manner. Once all the images in the MNI template, we further apply the Automated Anatomical Labeling (AAL) technique [43] to segment the whole brain of each subject into 116 brain regions. The 90 regions that belong to the cerebral cortex are selected for the later analysis, as the other regions are not included in the cerebral cortex are rarely considered related with AD in the literature. The measurement of each region in the PET data is regional cerebral blood flow (rCBF); the measurement of each region in the MRI data is the structural volume of the region. 5.2 Disease-related brain regions SCLDA is applied to the preprocessed PET and MRI data of AD and NC with the penalty parameter selected by the AIC method mentioned in section 3. 26 disease-related brain regions are identified from PET and 21 from MRI (see Table 1 for their names). The maps of the diseaserelated brain regions identified from PET and MRI are highlighted in Figure 2 (a) and (b), respectively, with different colors given to neighboring regions in order to distinguish them. Each figure is a set of horizontal cut away slices of the brain as seen from the top, which aims to provide a full view of locations of the regions. One major observation is that the identified disease-related brain regions from MRI are in the hippocampus, parahippocampus, temporal lobe, frontal lobe, and precuneus, which is consistent with the existing literature that reports structural atrophy in these brain areas. [3-6,12-14]. The identified disease-related brain regions from PET are in the temporal, frontal and parietal lobes, which is consistent with many functional neuroimaging studies that report reduced rCBF or 6 reduced cortical glucose metabolism in these areas [8-10, 12-14]. Many of these identified disease-related regions can be explained in terms of the AD pathology. For example, hippocampus is a region affected by AD the earliest and severely [6] Also, as regions in the temporal lobe are essential for memory, damage on these regions by AD can explain the memory loss which is a major clinic symptom of AD. The consistency of our findings with the AD literature supports effectiveness of the proposed SCLDA. Another finding is that there is a large overlap between the identified disease-related regions from PET and those from MRI, which implies strong interaction between functional and structural alterations in these regions. Although well-accepted biological mechanisms underlying this interaction are still not very clear, there are several explanations existing in the literature. 
The first explanation is that both functional and structural alterations could be the consequence of dendritic arborizations, which results from intracellular accumulation of PHFtau and further leads to neuron death and grey matter loss [14]. The second explanation is that the AD pathology may include a vascular component, which may result in reduced rCBF due to limited blood supply and may ultimately result in structural alteration such as brain atrophy [45]. (a) (b) Figure 2: locations of disease-related brain regions identified from (a) MRI; (b) PET 5.3 Classification accuracy As one of our primary goals is to distinguish AD from NC, the identified disease-related brain regions through SCLDA are further utilized for establishing a classification model. Specifically, for each subject, the rCBF values of the 26 disease-related brain regions identified from PET and the structural volumes of the 21 disease-related brain regions identified from MRI are used, as a joint spatial pattern of both brain physiology and structure. As a result, each subject is associated with a vector with 47 features/variables. Linear SVM (Support Vector Machine) is employed as the classifier. The classification accuracy based on 10-fold cross-validation is 94.3%. For comparison purposes, MSLDA is also applied, which identifies 45 and 38 disease-related brain regions for PET and MRI, respectively. Linear SVM applied to the 45+38 features gives a classification accuracy of only 85.8%. Note that MSLDA identifies a much larger number of disease-related brain regions than SCLDA, but some of the identified regions by MSLDA may indeed be disease-irrelevant, so including them deteriorates the classification. 5.4 Relationship between structural atrophy and abnormal rCBF, and severity of cognitive impairment in AD In addition to classification, it is also of interest to further verify relevance of the identified disease-related regions with AD in an alternative way. One approach is to investigate the degree to which those disease-related regions are relevant to cognitive impairment that can be measured by the Alzheimer’s disease assessment scale – cognitive subscale (ADAS-cog). ADAS measures severity of the most important symptoms of AD, while its subscale, ADAS-cog, is the most 7 popular cognitive testing instrument used in clinic trails. The ADAS-cog consists of 11 items measuring disturbances of memory, language, praxis, attention and other cognitive abilities that are often affected by AD. As the total score of these 11 items provides an overall assessment of cognitive impairment, we regress this ADAS-cog total score (the response) against the rCBF or structure volume measurement (the predictor) of each identified brain region, using a simple regression. The regression results are listed in Table 1. It is not surprising to find that some regions in the hippocampus area and temporal lobes are among the best predictors, as these regions are extensively reported in the literature as the most severely affected by AD [3-6]. Also, it is found that most of these brain regions are weak-effect predictors, as most of them can only explain a small portion of the variability in the ADAS-cog total score, i.e., many R-square values in Table 1 are less than 10%. However, although the effects are weak, most of them are significant, i.e., most of the p-values in Table 1 are smaller than 0.05. 
Furthermore, it is worth noting that 70.22% variability in ADAS-cog can be explained by taking all the 26 brain regions identified from PET as predictors in a multiple regression model; 49.72% variability can be explained by taking all the 21 brain regions from MRI as predictors in a multiple regression model. All this findings imply that the disease-related brain regions are indeed weakeffect features if considered individually, but jointly they can play a strong role for characterizing AD. This verifies the suitability of the proposed SCLDA for AD studies, as SCLDA can preserve weak-effect features. Table 1: Explanatory power of regional rCBF and structural volume for variability in ADAS-cog (“~â€? means this region is not identified from PET (or MRI) as a disease-related region by SCLDA) PET Brain regions Precentral_L Precentral_R Frontal_Sup_L Frontal_Sup_R Frontal_Mid_R Frontal_M_O_L Frontal_M_O_R Insula_L Insula_R Cingulum_A_R Cingulum_Mid_L Cingulum_Post_L Hippocampus_L Hippocampus_R ParaHippocamp_L 6 R 2 0.003 0.044 0.051 0.044 0.056 0.036 0.019 0.016 ~ 0.004 0.001 0.184 0.158 ~ 0.206 pvalue 0.503 0.022 0.013 0.023 0.010 0.040 0.138 0.171 ~ 0.497 0.733 <10-4 <10-4 ~ <10-4 MRI R 2 0.027 ~ 0.047 ~ 0.072 0.086 0.126 0.163 0.125 0.082 0.040 ~ ~ 0.242 ~ PET Brain regions pvalue 0.077 ~ 0.018 ~ 0.003 0.001 0.000 <10-4 0.000 0.001 0.030 ~ ~ <10-4 ~ Amygdala_L Calcarine_L Lingual_L Postcentral_L Parietal_Sup_R Angular_R Precuneus_R Paracentr_Lobu_L Pallidum_L Pallidum_R Heschl_L Heschl_R Temporal_P_S_R Temporal_Inf_R All regions R 2 0.090 0.038 0.066 0.038 0.001 0.173 0.063 0.035 0.082 ~ 0.001 0.000 0.008 0.187 0.702 pvalue 0.001 0.034 0.005 0.035 0.677 <10-4 0.006 0.043 0.001 ~ 0.640 0.744 0.336 <10-4 <10-4 MRI R 2 0.313 0.028 0.044 0.026 ~ 0.063 0.025 0.000 ~ 0.020 ~ 0.111 0.071 0.147 0.497 pvalue <10-4 0.070 0.023 0.081 ~ 0.006 0.084 0.769 ~ 0.122 ~ 0.000 0.003 <10-4 <10-4 C on clu sio n In the paper, we proposed a SCLDA model for identification of disease-related brain regions of AD from multi-modality data, which is capable to preserve weak-effect disease-related brain regions due to its less shrinkage imposed on its parameters. We applied SCLDA to the PET and MRI data of early AD patients and normal controls. As MRI and PET measure two complementary aspects (structural and functional aspects, respectively) of the same AD pathology, fusion of these two image modalities can make effective use of their interaction and thus improve the statistical power in identification of disease-related brain regions. Our findings were consistent with the literature and also showed some new aspects that may suggest further investigation in neuroimaging research in the future. 8 References [1] deToledo-Morrell, L., Stoub, T.R., Bulgakova, M. 2004. MRI-derived entorhinal volume is a good predictor of conversion from MCI to AD. Neurobiol. Aging 25, 1197–1203. [2] Morra, J.H., Tu, Z. Validation of automated hippocampal segmentation method. NeuroImage 43, 59–68, 2008. [3] Morra, J.H., Tu, Z. 2009a. Automated 3D mapping of hippocampal atrophy. Hum. Brain Map. 30, 2766–2788. [4] Morra, J.H., Tu, Z. 2009b. Automated mapping of hippocampal atrophy in 1-year repeat MRI data. NeuroImage 45, 213-221. [5] Schroeter, M.L., Stein, T. 2009. Neural correlates of AD and MCI. NeuroImage 47, 1196–1206. [6] Braak, H., Braak, E. 1991. Neuropathological stageing of Alzheimer-related changes. Acta Neuro. 82, 239–259. [7] Bradley, K.M., O'Sullivan. 2002. Cerebral perfusion SPET correlated with Braak pathological stage in AD. 
Brain 125, 1772–1781. [8] Keilp, J.G., Alexander, G.E. 1996. Inferior parietal perfusion, lateralization, and neuropsychological dysfunction in AD. Brain Cogn. 32, 365–383. [9] Schroeter, M.L., Stein, T. 2009. Neural correlates of AD and MCI. NeuroImage 47, 1196–1206. [10] Asllani, I., Habeck, C. 2008. Multivariate and univariate analysis of continuous arterial spin labeling perfusion MRI [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21] [22] [23] [24] [25] [26] [27] [28] [29] [30] [31] [32] [33] [34] [35] [36] [37] [38] [39] [40] [41] [42] [43] [44] [45] in AD. J. Cereb. Blood Flow Metab. 28, 725–736. Du,A.T., Jahng, G.H. 2006. Hypoperfusion in frontotemporal dementia and AD. Neurology 67, 1215–1220. Ishii, K., Kitagaki, H. 1996. Decreased medial temporal oxygen metabolism in AD. J. Nucl. Med. 37, 1159–1165. Johnson, N.A., Jahng, G.H. 2005. Pattern of cerebral hypoperfusion in AD. Radiology 234, 851–859. Wolf, H., Jelic, V. 2003. A critical discussion of the role of neuroimaging in MCI. Acta Neuroal: 107 (4), 52-76. Tosun, D., Mojabi, P. 2010. Joint analysis of structural and perfusion MRI for cognitive assessment and classification of AD and normal aging. NeuroImage 52, 186-197. Alsop, D., Casement, M. 2008. Hippocampal hyperperfusion in Alzheimer's disease. NeuroImage 42, 1267–1274. Mosconi, L., Tsui, W.-H. 2005. Reduced hippocampal metabolism in MCI and AD. Neurology 64, 1860–1867. Mulert, C., Lemieux, L. 2010. EEG-fMRI: physiological basis, technique and applications. Springer. Xu, L., Qiu, C., Xu, P. and Yao, D. 2010. A parallel framework for simultaneous EEG/fMRI analysis: methodology and simulation. NeuroImage, 52(3), 1123-1134. Philiastides, M. and Sajda, P. 2007. EEG-informed fMRI reveals spatiotemporal characteristics of perceptual decision making. Journal of Neuroscience, 27(48), 13082-13091. Daunizeau, J., Grova, C. 2007. Symmetrical event-related EEG/fMRI information fusion. NeuroImage 36, 69-87. Jagust, W. 2006. PET and MRI in the diagnosis and prediction of dementia. Alzheimer’s Dement 2, 36-42. Kawachi, T., Ishii, K. and Sakamoto, S. 2006. Comparison of the diagnostic performance of FDG-PET and VBM. Eur.J.Nucl.Med.Mol.Imaging 33, 801-809. Matsunari, I., Samuraki, M. 2007. Comparison of 18F-FDG PET and optimized voxel-based morphometry for detection of AD. J.Nucl.Med 48, 1961-1970. Schmidt, M., Fung, G. and Rosales, R. 2007. Fast optimization methods for L1-regularization: a comparative study and 2 new approaches. ECML 2007. Liu, J., Ji, S. and Ye, J. 2009. SLEP: sparse learning with efficient projections, Arizona state university. Tibshirani, R. 1996. Regression Shrinkage and Selection via the Lasso, JRSS, Series B, 58(1):267–288. Friedman, J., Hastie, T. and Tibshirani, R. 2007. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 8(1):1–10. Zou, H., Hastie, T. and Tibshirani, R. 2006. Sparse PCA, J. of Comp. and Graphical Statistics, 15(2), 262-286. Qiao, Z., Zhou, L and Huang, J. 2006. Sparse LDA with applications to high dimensional low sample size data. IAENG applied mathematics, 39(1). Argyriou, A., Evgeniou, T. and Pontil, M. 2008. Convex multi-task feature learning. Machine Learning 73(3): 243– 272. Huang, S., Li, J., et al. 2010. Learning Brain Connectivity of AD by Sparse Inverse Covariance Estimation, NeuroImage, 50, 935-949. Candes, E., Wakin, M. and Boyd, S. 2008. Enhancing sparsity by reweighted L1 minimization. Journal of Fourier analysis and applications, 14(5), 877-905. Mazumder, R.; Friedman, J. 2009. 
SparseNet: Coordinate Descent with Non-Convex Penalties. Manuscript. Zhang, T. 2008. Multi-stage Convex Relaxation for Learning with Sparse Regularization. NIPS 2008. Campbell, N. 1984. Canonical variate analysis ageneral formulation. Australian Jour of Stat 26, 86–96. Hastie, T. and Tibshirani, R. 1994. Discriminant analysis by gaussian mixtures. Technical report. AT&T; Bell Lab. Kumar, N. and Andreou, G. 1998. Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition. Speech Communication, 26 (4), 283-297. Gasso, G., Rakotomamonjy, A. and Canu, S. 2009. Recovering sparse signals with non-convex penalties and DC programming. IEEE Trans. Signal Processing 57( 12), 4686-4698. Guo, J., Levina, E., Michailidis, G. and Zhu, J. 2011. Joint estimation of multiple graphical models. Biometrika 98(1) 1-15. Bertsekas, D. 1982. Projected newton methods for optimization problems with simple constraints. SIAM J. Control Optim 20, 221-246. Clemmensen, L., Hastie, T., Witten, D. and Ersboll:, B. 2011. Sparse Discriminant Analysis. Technometrics (in press) Friston, K.J., Ashburner, J. 1995. Spatial registration and normalization of images. HBM 2, 89–165. Tzourio-Mazoyer, N., et al., 2002. Automated anatomical labelling of activations in SPM. NeuroImage 15, 273–289. Bidzan, L. 2005. Vascular factors in dementia. Psychiatr. Pol. 39, 977-986. 9

5 0.50961423 54 nips-2011-Co-regularized Multi-view Spectral Clustering

Author: Abhishek Kumar, Piyush Rai, Hal Daume

Abstract: In many clustering problems, we have access to multiple views of the data each of which could be individually used for clustering. Exploiting information from multiple views, one can hope to find a clustering that is more accurate than the ones obtained using the individual views. Often these different views admit same underlying clustering of the data, so we can approach this problem by looking for clusterings that are consistent across the views, i.e., corresponding data points in each view should have same cluster membership. We propose a spectral clustering framework that achieves this goal by co-regularizing the clustering hypotheses, and propose two co-regularization schemes to accomplish this. Experimental comparisons with a number of baselines on two synthetic and three real-world datasets establish the efficacy of our proposed approaches.
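One way to read the pairwise co-regularization idea sketched in this abstract is as alternating spectral steps in which each view's embedding is nudged toward agreement with the other view's current embedding. The code below is a rough illustration of that reading, not the paper's algorithm: the Gaussian affinity, its normalization, the value of lam and the final k-means step are all assumptions chosen for this example.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def normalized_affinity(X, sigma=1.0):
    """Symmetrically normalized Gaussian affinity matrix D^-1/2 W D^-1/2."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(W.sum(1), 1e-12))
    return W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def coreg_spectral(X1, X2, k, lam=0.05, iters=10, sigma=1.0):
    K1, K2 = normalized_affinity(X1, sigma), normalized_affinity(X2, sigma)
    def top_k(M):
        vals, vecs = eigh(M)
        return vecs[:, -k:]                 # eigenvectors of the k largest eigenvalues
    U1, U2 = top_k(K1), top_k(K2)
    for _ in range(iters):                  # alternate, each view regularized toward the other
        U1 = top_k(K1 + lam * (U2 @ U2.T))
        U2 = top_k(K2 + lam * (U1 @ U1.T))
    rows = U1 / np.maximum(np.linalg.norm(U1, axis=1, keepdims=True), 1e-12)
    return KMeans(n_clusters=k, n_init=10).fit_predict(rows)

rng = np.random.default_rng(0)
centers = rng.normal(scale=4.0, size=(3, 2))
labels = np.repeat(np.arange(3), 30)
X1 = centers[labels] + rng.normal(size=(90, 2))                       # view 1
X2 = centers[labels] @ rng.normal(size=(2, 2)) + rng.normal(size=(90, 2))  # view 2
print(coreg_spectral(X1, X2, k=3)[:10])
```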

6 0.47176367 68 nips-2011-Demixed Principal Component Analysis

7 0.45255193 51 nips-2011-Clustered Multi-Task Learning Via Alternating Structure Optimization

8 0.43348995 4 nips-2011-A Convergence Analysis of Log-Linear Training

9 0.43276203 142 nips-2011-Large-Scale Sparse Principal Component Analysis with Application to Text Data

10 0.41679344 152 nips-2011-Learning in Hilbert vs. Banach Spaces: A Measure Embedding Viewpoint

11 0.39512584 263 nips-2011-Sparse Manifold Clustering and Embedding

12 0.39421189 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model

13 0.39314073 211 nips-2011-Penalty Decomposition Methods for Rank Minimization

14 0.39102554 164 nips-2011-Manifold Precis: An Annealing Technique for Diverse Sampling of Manifolds

15 0.37990478 33 nips-2011-An Exact Algorithm for F-Measure Maximization

16 0.37892047 38 nips-2011-Anatomically Constrained Decoding of Finger Flexion from Electrocorticographic Signals

17 0.37471479 7 nips-2011-A Machine Learning Approach to Predict Chemical Reactions

18 0.36914307 136 nips-2011-Inverting Grice's Maxims to Learn Rules from Natural Language Extractions

19 0.3607007 165 nips-2011-Matrix Completion for Multi-label Image Classification

20 0.36050981 254 nips-2011-Similarity-based Learning via Data Driven Embeddings


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.044), (4, 0.026), (20, 0.038), (26, 0.011), (31, 0.07), (33, 0.04), (38, 0.293), (43, 0.044), (45, 0.142), (57, 0.036), (65, 0.011), (70, 0.017), (74, 0.05), (83, 0.028), (99, 0.077)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.75625491 176 nips-2011-Multi-View Learning of Word Embeddings via CCA

Author: Paramveer Dhillon, Dean P. Foster, Lyle H. Ungar

Abstract: Recently, there has been substantial interest in using large amounts of unlabeled data to learn word representations which can then be used as features in supervised classifiers for NLP tasks. However, most current approaches are slow to train, do not model the context of the word, and lack theoretical grounding. In this paper, we present a new learning method, Low Rank Multi-View Learning (LR-MVL) which uses a fast spectral method to estimate low dimensional context-specific word representations from unlabeled data. These representation features can then be used with any supervised learner. LR-MVL is extremely fast, gives guaranteed convergence to a global optimum, is theoretically elegant, and achieves state-ofthe-art performance on named entity recognition (NER) and chunking problems. 1 Introduction and Related Work Over the past decade there has been increased interest in using unlabeled data to supplement the labeled data in semi-supervised learning settings to overcome the inherent data sparsity and get improved generalization accuracies in high dimensional domains like NLP. Approaches like [1, 2] have been empirically very successful and have achieved excellent accuracies on a variety of NLP tasks. However, it is often difficult to adapt these approaches to use in conjunction with an existing supervised NLP system as these approaches enforce a particular choice of model. An increasingly popular alternative is to learn representational embeddings for words from a large collection of unlabeled data (typically using a generative model), and to use these embeddings to augment the feature set of a supervised learner. Embedding methods produce features in low dimensional spaces or over a small vocabulary size, unlike the traditional approach of working in the original high dimensional vocabulary space with only one dimension “on” at a given time. Broadly, these embedding methods fall into two categories: 1. Clustering based word representations: Clustering methods, often hierarchical, are used to group distributionally similar words based on their contexts. The two dominant approaches are Brown Clustering [3] and [4]. As recently shown, HMMs can also be used to induce a multinomial distribution over possible clusters [5]. 2. Dense representations: These representations are dense, low dimensional and real-valued. Each dimension of these representations captures latent information about a combination of syntactic and semantic word properties. They can either be induced using neural networks like C&W; embeddings [6] and Hierarchical log-linear (HLBL) embeddings [7] or by eigen-decomposition of the word co-occurrence matrix, e.g. Latent Semantic Analysis/Latent Semantic Indexing (LSA/LSI) [8]. Unfortunately, most of these representations are 1). slow to train, 2). sensitive to the scaling of the embeddings (especially 2 based approaches like LSA/PCA), 3). can get stuck in local optima (like EM trained HMM) and 4). learn a single embedding for a given word type; i.e. all the occurrences 1 of the word “bank” will have the same embedding, irrespective of whether the context of the word suggests it means “a financial institution” or “a river bank”. In this paper, we propose a novel context-specific word embedding method called Low Rank MultiView Learning, LR-MVL, which is fast to train and is guaranteed to converge to the optimal solution. 
As presented here, our LR-MVL embeddings are context-specific, but context-oblivious embeddings (like the ones used by [6, 7]) can be obtained trivially from our model. Furthermore, building on recent advances in spectral learning for sequence models like HMMs [9, 10, 11], we show that LR-MVL has strong theoretical grounding. In particular, we show that LR-MVL estimates low dimensional context-specific word embeddings which preserve all the information in the data if the data were generated by an HMM. Moreover, LR-MVL, being linear, does not face the danger of getting stuck in local optima, as is the case for an EM-trained HMM. LR-MVL falls into category (2) mentioned above; it learns real-valued context-specific word embeddings by performing Canonical Correlation Analysis (CCA) [12] between the past and future views of low rank approximations of the data. However, LR-MVL is more general than those methods, which work on bigram or trigram co-occurrence matrices, in that it uses longer word sequence information to estimate context-specific embeddings, and also for the reasons mentioned in the last paragraph.

The remainder of the paper is organized as follows. In the next section we give a brief overview of CCA, which forms the core of our method. Section 3 describes our proposed LR-MVL algorithm in detail and gives theory supporting its performance. Section 4 demonstrates the effectiveness of LR-MVL on the NLP tasks of Named Entity Recognition and Chunking. We conclude with a brief summary in Section 5.

2 Brief Review: Canonical Correlation Analysis (CCA)

CCA [12] is the analog to Principal Component Analysis (PCA) for pairs of matrices. PCA computes the directions of maximum covariance between elements in a single matrix, whereas CCA computes the directions of maximal correlation between a pair of matrices. Unlike PCA, CCA does not depend on how the observations are scaled. This invariance of CCA to linear data transformations allows proofs that keeping the dominant singular vectors (those with largest singular values) will faithfully capture any state information. More specifically, given a set of n paired observation vectors {(l1, r1), ..., (ln, rn)}, where in our case the two matrices are the left (L) and right (R) context matrices of a word, we would like to simultaneously find the directions Φl and Φr that maximize the correlation of the projections of L onto Φl with the projections of R onto Φr. This is expressed as

    max_{Φl, Φr}  E[⟨L, Φl⟩ ⟨R, Φr⟩] / sqrt( E[⟨L, Φl⟩^2] E[⟨R, Φr⟩^2] )        (1)

where E denotes the empirical expectation. We use the notation Clr (Cll) to denote the cross (auto) covariance matrices between L and R (i.e. L′R and L′L respectively). The left and right canonical correlates are the solutions Φl, Φr of the following equations:

    Cll^{-1} Clr Crr^{-1} Crl Φl = λ Φl
    Crr^{-1} Crl Cll^{-1} Clr Φr = λ Φr        (2)

3 Low Rank Multi-View Learning (LR-MVL)

In LR-MVL, we compute the CCA between the past and future views of the data on a large unlabeled corpus to find the common latent structure, i.e., the hidden state associated with each token. These induced representations of the tokens can then be used as features in a supervised classifier (typically discriminative).

The context around a word, consisting of the h words to the right and left of it, sits in a high dimensional space, since for a vocabulary of size v, each of the h words in the context requires an indicator function of dimension v. The key move in LR-MVL is to project the v-dimensional word space down to a k-dimensional state space. Thus, all eigenvector computations are done in a space that is v/k times smaller than the original space. Since a typical vocabulary contains at least 50,000 words, and we use state spaces of order k ≈ 50 dimensions, this gives a 1,000-fold reduction in the size of the calculations that are needed.
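As a small numerical illustration of equation (2) and of the scale-invariance claim above, the snippet below computes canonical correlations as the square roots of the eigenvalues of Cll^{-1} Clr Crr^{-1} Crl on synthetic data with a shared low-dimensional state, and checks that applying invertible linear transformations to either view leaves them unchanged. The tiny ridge term is added only for numerical stability and is not part of the paper.

```python
import numpy as np
from scipy.linalg import eig

def canonical_correlations(L, R, ridge=1e-8):
    """Square roots of the eigenvalues of Cll^-1 Clr Crr^-1 Crl (equation (2));
    the ridge keeps the inverses well conditioned and is not in the paper."""
    L = L - L.mean(0); R = R - R.mean(0)
    Cll = L.T @ L + ridge * np.eye(L.shape[1])
    Crr = R.T @ R + ridge * np.eye(R.shape[1])
    Clr = L.T @ R
    M = np.linalg.solve(Cll, Clr) @ np.linalg.solve(Crr, Clr.T)
    lam = np.sort(np.real(eig(M)[0]))[::-1]
    return np.sqrt(np.clip(lam, 0.0, 1.0))

rng = np.random.default_rng(0)
H = rng.normal(size=(5000, 2))                                 # shared low-dimensional "state"
L = H @ rng.normal(size=(2, 6)) + 0.5 * rng.normal(size=(5000, 6))
R = H @ rng.normal(size=(2, 4)) + 0.5 * rng.normal(size=(5000, 4))
print(np.round(canonical_correlations(L, R), 3))
# Unlike PCA, the result does not change under invertible linear transforms of each view:
TL, TR = rng.normal(size=(6, 6)), rng.normal(size=(4, 4))
print(np.round(canonical_correlations(L @ TL, R @ TR), 3))     # same correlations as above
```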
The core of our LR-MVL algorithm is a fast spectral method for learning a v × k matrix A which maps each of the v words in the vocabulary to a k-dimensional state vector. We call this matrix the "eigenfeature dictionary". We now describe the LR-MVL method, give a theorem that provides intuition into how it works, and formally present the LR-MVL algorithm. The Experiments section then shows that this low rank approximation allows us to achieve state-of-the-art performance on NLP tasks.

3.1 The LR-MVL method

Given an unlabeled token sequence w = {w0, w1, ..., wn} we want to learn a low (k-)dimensional state vector {z0, z1, ..., zn} for each observed token. The key is to find a v × k matrix A (Algorithm 1) that maps each of the v words in the vocabulary to a reduced rank k-dimensional state vector, which is later used to induce context-specific embeddings for the tokens (Algorithm 2). For supervised learning, these context-specific embeddings are supplemented with other information about each token wt, such as its identity, orthographic features such as prefixes and suffixes, or membership in domain-specific lexicons, and used as features in a classifier. Section 3.4 gives the algorithm more formally, but the key steps in the algorithm are, in general terms:

• Take the h words to the left and to the right of each target word wt (the "Left" and "Right" contexts), and project them each down to k dimensions using A.
• Take the CCA between the reduced rank left and right contexts, and use the resulting model to estimate a k-dimensional state vector (the "hidden state") for each token.
• Take the CCA between the hidden states and the tokens wt. The singular vectors associated with wt form a new estimate of the eigenfeature dictionary.

LR-MVL can be viewed as a type of co-training [13]: the state of each token wt is similar to that of the tokens both before and after it, and it is also similar to the states of the other occurrences of the same word elsewhere in the document (used in the outer iteration). LR-MVL takes advantage of these two different types of similarity by alternately estimating word state using CCA on the smooths of the states of the words before and after each target token, and using the average over the states associated with all other occurrences of that word.

3.2 Theoretical Properties of LR-MVL

We now present the theory behind the LR-MVL algorithm; in particular, we show that the reduced rank matrix A allows a significant data reduction while preserving the information in our data, and that the estimated state does the best possible job of capturing any label information that can be inferred by a linear model.

Let L be an n × hv matrix giving the words in the left context of each of the n tokens, where the context is of length h, let R be the corresponding n × hv matrix for the right context, and let W be an n × v matrix of indicator functions for the words themselves.
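To make these view matrices concrete, here is a small sketch of how L, R and W could be materialized as sparse indicator matrices for a toy sentence, and how right-multiplying a context matrix by the stacked dictionary (the Ãh used in Lemma 1 and Theorem 1 below) maps the hv-dimensional contexts down to k dimensions. The helper names, the zero rows at sequence boundaries and the random A are illustrative assumptions, not the paper's code.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

def context_views(tokens, vocab, h=2):
    """Build the n x hv left/right context indicator matrices L, R and the
    n x v word indicator matrix W; out-of-range positions stay all-zero."""
    n, v = len(tokens), len(vocab)
    ids = [vocab[w] for w in tokens]
    def one_hot(positions):
        rows = [i for i, p in enumerate(positions) if 0 <= p < n]
        cols = [ids[p] for p in positions if 0 <= p < n]
        return csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, v))
    L = hstack([one_hot([t - d for t in range(n)]) for d in range(1, h + 1)]).tocsr()
    R = hstack([one_hot([t + d for t in range(n)]) for d in range(1, h + 1)]).tocsr()
    W = one_hot(list(range(n)))
    return L, R, W

vocab = {"the": 0, "bank": 1, "of": 2, "river": 3}
tokens = "the bank of the river".split()
L, R, W = context_views(tokens, vocab, h=2)
A = np.random.default_rng(0).normal(size=(len(vocab), 3))   # a v x k dictionary, k = 3
A_h = np.vstack([A, A])                                     # h stacked copies of A
print(L.shape, (L @ A_h).shape, W.shape)                    # (5, 8) (5, 3) (5, 4)
```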
We will use the following assumptions at various points in our proof:

Assumption 1. L, W, and R come from a rank k HMM, i.e. it has a rank k observation matrix and a rank k transition matrix, both of which have the same domain. For example, if the dimension of the hidden state is k and the vocabulary size is v then the observation matrix, which is k × v, has rank k. This rank condition is similar to the one used by [10].

Assumption 1A. For the three views L, W and R, assume that there exists a "hidden state H" of dimension n × k, where each row Hi has the same non-singular variance-covariance matrix, and such that E(Li | Hi) = Hi βL^T, E(Ri | Hi) = Hi βR^T and E(Wi | Hi) = Hi βW^T, where all the β's are of rank k, and where Li, Ri and Wi are the rows of L, R and W respectively. Assumption 1A follows from Assumption 1.

Assumption 2. ρ(L, W), ρ(L, R) and ρ(W, R) all have rank k, where ρ(X1, X2) is the expected correlation between X1 and X2. Assumption 2 is a rank condition similar to that in [9].

Assumption 3. ρ([L, R], W) has k distinct singular values. Assumption 3 just makes the proof a little cleaner, since if there are repeated singular values, then the singular vectors are not unique. Without it, we would have to phrase results in terms of subspaces with identical singular values.

We also need to define the CCA function that computes the left and right singular vectors for a pair of matrices:

Definition 1 (CCA). Compute the CCA between two matrices X1 and X2. Let ΦX1 be a matrix containing the d largest singular vectors for X1 (sorted from the largest on down), and likewise for ΦX2. Define the function CCAd(X1, X2) = [ΦX1, ΦX2]. When we want just one of these Φ's, we will use CCAd(X1, X2)left = ΦX1 for the left singular vectors and CCAd(X1, X2)right = ΦX2 for the right singular vectors.

Note that the resulting singular vectors [ΦX1, ΦX2] can be used to give two redundant estimates, X1 ΦX1 and X2 ΦX2, of the "hidden" state relating X1 and X2, if such a hidden state exists.

Definition 2. Define the symbol "≈" to mean

    X1 ≈ X2  ⟺  lim_{n→∞} X1 = lim_{n→∞} X2

where n is the sample size.

Lemma 1. Define A by the following limit of the right singular vectors: CCAk([L, R], W)right ≈ A. Under assumptions 2, 3 and 1A, if CCAk(L, R) ≡ [ΦL, ΦR] then CCAk([LΦL, RΦR], W)right ≈ A.

Lemma 1 shows that instead of finding the CCA between the full context and the words, we can take the CCA between the Left and Right contexts, estimate a k-dimensional state from them, and take the CCA of that state with the words and get the same result. See the supplementary material for the proof.

Let Ãh denote the matrix formed by stacking h copies of A on top of each other. Right multiplying L or R by Ãh projects each of the words in that context into the k-dimensional reduced rank space. The following theorem addresses the core of the LR-MVL algorithm, showing that there is an A which gives the desired dimensionality reduction. Specifically, it shows that the previous lemma also holds in the reduced rank space.

Theorem 1. Under assumptions 1, 2 and 3 there exists a unique matrix A such that if CCAk(LÃh, RÃh) ≡ [Φ̃L, Φ̃R] then

    CCAk([LÃh Φ̃L, RÃh Φ̃R], W)right ≈ A

where Ãh is the stacked form of A. See the supplementary material for the proof.

It is worth noting that our matrix A corresponds to the matrix Û used by [9, 10]. They showed that U is sufficient to compute the probability of a sequence of words generated by an HMM; although we do not show it here (due to limited space), our A provides a more statistically efficient estimate of U than their Û, and hence can also be used to estimate the sequence probabilities.
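Definition 1's CCAd can be computed, for example, from the SVD of the whitened cross-covariance; the sketch below does this and then uses the resulting Φ's to form the two redundant state estimates mentioned after the definition. The whitening route and the ridge term are implementation choices made here (equivalent in spirit to equation (2)), not prescriptions from the paper.

```python
import numpy as np

def cca_d(X1, X2, d, ridge=1e-8):
    """Return (Phi_X1, Phi_X2, correlations) for the d largest canonical
    correlations, via the SVD of the whitened cross-covariance."""
    X1 = X1 - X1.mean(0); X2 = X2 - X2.mean(0)
    def inv_sqrt(C):
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(np.maximum(w, ridge))) @ V.T
    W1, W2 = inv_sqrt(X1.T @ X1), inv_sqrt(X2.T @ X2)
    U, s, Vt = np.linalg.svd(W1 @ (X1.T @ X2) @ W2)
    return W1 @ U[:, :d], W2 @ Vt[:d].T, s[:d]

# Two noisy views of a shared 2-dimensional hidden state:
rng = np.random.default_rng(1)
H = rng.normal(size=(2000, 2))
X1 = H @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(2000, 5))
X2 = H @ rng.normal(size=(2, 7)) + 0.1 * rng.normal(size=(2000, 7))
Phi1, Phi2, corr = cca_d(X1, X2, d=2)
print(np.round(corr, 3))   # the top-2 canonical correlations come out close to 1
# X1 @ Phi1 and X2 @ Phi2 are the two (nearly interchangeable) estimates of the hidden state.
```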
Under the above assumptions, there is asymptotically (in the limit of infinite data) no benefit to first estimating state by finding the CCA between the left and right contexts and then finding the CCA between the estimated state and the words; one could instead just directly find the CCA between the combined left and right contexts and the words. However, because of the Zipfian distribution of words, many words are rare or even unique, and hence one is not in the asymptotic limit. In this case, the CCA between the rare words and their contexts will not be informative, whereas finding the CCA between the left and right contexts gives a good state vector estimate even for unique words. One can then fruitfully find the CCA between the contexts and the estimated state vector for their associated words.

3.3 Using Exponential Smooths

In practice, we replace the projected left and right contexts with exponential smooths of them at a few different time scales (a weighted average of the previous (or next) token's state, Z_{t−1} (or Z_{t+1}), and the previous (or next) token's smoothed state, S_{t−1} (or S_{t+1})), thus giving a further dimension reduction by a factor of the context length h (say 100 words) divided by the number of smooths (often 5–7). We use a mixture of both very short and very long contexts, which capture the short and long range dependencies required by NLP problems such as NER, chunking and WSD. Since exponential smooths are linear, we preserve the linearity of our method.

3.4 The LR-MVL Algorithm

The LR-MVL algorithm (using exponential smooths) is given in Algorithm 1; it computes the pair of CCAs described above in Theorem 1.

Algorithm 1 LR-MVL Algorithm - Learning from Large Amounts of Unlabeled Data
1: Input: Token sequence W_{n×v}, state space size k, smoothing rates α_j
2: Initialize the eigenfeature dictionary A to random values N(0, 1).
3: repeat
4:   Set the state Z_t (1 < t ≤ n) of each token w_t to the eigenfeature vector of the corresponding word: Z_t = (A_w : w = w_t)
5:   Smooth the state estimates before and after each token to get a pair of views for each smoothing rate α_j:
       S_t^(l,j) = (1 − α_j) S_{t−1}^(l,j) + α_j Z_{t−1}   // left view L
       S_t^(r,j) = (1 − α_j) S_{t+1}^(r,j) + α_j Z_{t+1}   // right view R
     where the t-th rows of L and R are, respectively, concatenations of the smooths S_t^(l,j) and S_t^(r,j) for each of the α_j's.
6:   Find the left and right canonical correlates, which are the eigenvectors Φ_l and Φ_r of
       (L′L)^{−1} L′R (R′R)^{−1} R′L Φ_l = λ Φ_l
       (R′R)^{−1} R′L (L′L)^{−1} L′R Φ_r = λ Φ_r
7:   Project the left and right views onto the space spanned by the top k/2 left and right CCAs respectively,
       X_l = L Φ_l^(k/2) and X_r = R Φ_r^(k/2),
     where Φ_l^(k/2), Φ_r^(k/2) are matrices composed of the singular vectors of Φ_l, Φ_r with the k/2 largest magnitude singular values. Estimate the state for each token w_t as the union of the left and right estimates: Z = [X_l, X_r]
8:   Estimate the eigenfeatures of each word type w as the average of the states estimated for that word: A_w = avg(Z_t : w_t = w)
9:   Compute the change in A from the previous iteration.
10: until |ΔA| < ε
11: Output: Φ_l^k, Φ_r^k, A.

A few iterations (∼5) of the above algorithm are sufficient to converge to the solution. (Since the problem is convex, there is a single solution, so there is no issue of local minima.) As [14] show for PCA, one can start with a random matrix that is only slightly larger than the true rank k of the correlation matrix, and with extremely high likelihood converge in a few iterations to within a small distance of the true principal components. In our case, if the assumptions detailed above (1, 1A, 2 and 3) are satisfied, our method converges equally rapidly to the true canonical variates.
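For intuition, here is a compact numpy sketch of one outer iteration of Algorithm 1 (steps 4-8), written by us as an illustration rather than as the authors' implementation: it uses a single smoothing rate instead of a set of rates, reuses the cca_d helper sketched after Definition 1, and keeps k/2 canonical directions per side so that the combined state has dimension k.

```python
import numpy as np

def lr_mvl_iteration(word_ids, A, alpha=0.5):
    """One outer iteration of Algorithm 1 (steps 4-8), simplified to a single
    smoothing rate; assumes the cca_d helper sketched after Definition 1 is in scope."""
    word_ids = np.asarray(word_ids)
    n = len(word_ids)
    v, k = A.shape
    Z = A[word_ids]                              # step 4: per-token states, n x k
    L = np.zeros((n, k)); R = np.zeros((n, k))
    for t in range(1, n):                        # step 5: smooth over the past (left view)
        L[t] = (1 - alpha) * L[t - 1] + alpha * Z[t - 1]
    for t in range(n - 2, -1, -1):               #         and over the future (right view)
        R[t] = (1 - alpha) * R[t + 1] + alpha * Z[t + 1]
    L = L - L.mean(axis=0); R = R - R.mean(axis=0)
    Phi_l, Phi_r = cca_d(L, R, k // 2)           # step 6: CCA between the two views
    Z_new = np.hstack([L @ Phi_l, R @ Phi_r])    # step 7: combined k-dimensional state
    A_new = np.zeros_like(A)                     # step 8: average the states per word type
    for w in range(v):
        mask = word_ids == w
        if mask.any():
            A_new[w] = Z_new[mask].mean(axis=0)
    return A_new

# Usage: initialize A randomly and iterate a handful of times (steps 2-10 of Algorithm 1).
rng = np.random.default_rng(0)
v, k = 1000, 50
word_ids = rng.integers(0, v, size=20000)
A = rng.normal(size=(v, k))
for _ in range(5):
    A = lr_mvl_iteration(word_ids, A)
```

In the full algorithm, Step 5 would use several smoothing rates rather than one (see Section 3.3), and convergence would be monitored through the change in A.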
As mentioned earlier, we get a further dimensionality reduction in Step 5 by replacing the Left and Right context matrices with a set of exponentially smoothed values of the reduced rank projections of the context words. Step 6 finds the CCA between the Left and Right contexts. Step 7 estimates the state by combining the estimates from the left and right contexts, since we don't know which will best estimate the state. Step 8 takes the CCA between the estimated state Z and the matrix of words W; because W is a matrix of indicator functions, this CCA takes the trivial form of a set of averages.

Once we have estimated the CCA model, it is used to generate context-specific embeddings for the tokens from the training, development and test sets (as described in Algorithm 2). These embeddings are further supplemented with other baseline features and used in a supervised learner to predict the label of each token.

Algorithm 2 LR-MVL Algorithm - Inducing Context-Specific Embeddings for Train/Dev/Test Data
1: Input: Model (Φ_l^k, Φ_r^k, A) output by the above algorithm, and token sequences W_train, (W_dev, W_test)
2: Project the left and right views L and R, after smoothing, onto the space spanned by the top k left and right CCAs respectively,
     X_l = L Φ_l^k and X_r = R Φ_r^k,
   and the words onto the eigenfeature dictionary, X_w = W_train A
3: Form the final embedding matrix X_train:embed by concatenating these three estimates of state: X_train:embed = [X_l, X_w, X_r]
4: Output: The embedding matrices X_train:embed, (X_dev:embed, X_test:embed) with context-specific representations for the tokens.

These embeddings are augmented with the baseline set of features mentioned in Sections 4.1.1 and 4.1.2 before learning the final classifier. Note that we can also get context "oblivious" embeddings, i.e. one embedding per word type, just by using the eigenfeature dictionary (A_{v×k}) output by Algorithm 1.

4 Experimental Results

In this section we present the experimental results of LR-MVL on Named Entity Recognition (NER) and syntactic chunking tasks. We compare LR-MVL to state-of-the-art semi-supervised approaches such as [1] (Alternating Structure Optimization, ASO) and [2] (a semi-supervised extension of CRFs), as well as to embeddings such as C&W, HLBL and Brown clustering.

4.1 Datasets and Experimental Setup

For the NER experiments we used the data from the CoNLL 2003 shared task, and for the chunking experiments we used the CoNLL 2000 shared task data, with the standard training, development and testing set splits (more details about the data and competitions are available at http://www.cnts.ua.ac.be/conll2003/ner/ and http://www.cnts.ua.ac.be/conll2000/chunking/). The CoNLL '03 and CoNLL '00 datasets had ∼204K/51K/46K and ∼212K/–/47K tokens respectively for the Train/Dev./Test sets.

4.1.1 Named Entity Recognition (NER)

We use the same set of baseline features as used by [15, 16] in their experiments. The detailed list of features is as below:

• Current word wi; its type information: all-capitalized, is-capitalized, all-digits and so on; prefixes and suffixes of wi.
• Word tokens in a window of 2 around the current word, i.e. d = (wi−2, wi−1, wi, wi+1, wi+2), and the capitalization pattern in the window.
• Previous two predictions yi−1 and yi−2, and the conjunction of d and yi−1.
• Embedding features (LR-MVL, C&W, HLBL, Brown etc.) in a window of 2 around the current word (if applicable).

Following [17] we use a regularized averaged perceptron model with the above set of baseline features for the NER task. We also used their BILOU text chunk representation and fast greedy inference, as it was shown to give superior performance.
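To illustrate how these features are assembled for the classifier, the sketch below (a hypothetical helper of our own; the window size and the orthographic flags come from the list above, while the function name, shapes and toy inputs are assumptions) concatenates the context-specific embeddings of the tokens in a window of 2 with a few simple indicator features:

```python
import numpy as np

def token_features(t, tokens, X_embed, window=2):
    """Feature vector for token t: embeddings of w_{t-2..t+2} plus simple orthographic flags.
    X_embed is the n x (3k) matrix [X_l, X_w, X_r] from Algorithm 2 (rows aligned with tokens)."""
    n, dim = X_embed.shape
    parts = []
    for off in range(-window, window + 1):            # embedding features in a window of 2
        j = t + off
        parts.append(X_embed[j] if 0 <= j < n else np.zeros(dim))
    w = tokens[t]
    ortho = np.array([w.isupper(),                    # all-capitalized
                      w[:1].isupper(),                # is-capitalized
                      w.isdigit()], dtype=float)      # all-digits
    return np.concatenate(parts + [ortho])

# Usage with toy inputs (a real X_embed would come from Algorithm 2):
tokens = ["EU", "rejects", "German", "call"]
X_embed = np.random.default_rng(0).normal(size=(len(tokens), 150))   # 3 x k with k = 50
phi = token_features(2, tokens, X_embed)
```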
We also augment the above set of baseline features with gazetteers, as is standard practice in NER experiments. We tuned our free parameter, namely the size of the LR-MVL embedding, on the development set. We scaled our embedding features to have an ℓ2 norm of 1 for each token and further multiplied them by a normalization constant (also chosen by cross validation), so that when they are used in conjunction with other categorical features in a linear classifier they do not exert extra influence. The size of LR-MVL embeddings (state-space) that gave the best performance on the development set was k = 50 (50 each for X_l, X_w, X_r in Algorithm 2), i.e. the total size of the embeddings was 50 × 3, and the best normalization constant was 0.5. We omit validation plots due to paucity of space.

4.1.2 Chunking

For our chunking experiments we use a similar base set of features as above:

• Current word wi and word tokens in a window of 2 around the current word, i.e. d = (wi−2, wi−1, wi, wi+1, wi+2).
• POS tags ti in a window of 2 around the current word.
• Word conjunction features wi ∩ wi+1, i ∈ {−1, 0}, and tag conjunction features ti ∩ ti+1, i ∈ {−2, −1, 0, 1} and ti ∩ ti+1 ∩ ti+2, i ∈ {−2, −1, 0}.
• Embedding features in a window of 2 around the current word (when applicable).

Since the CoNLL '00 chunking data does not have a development set, we randomly sampled 1000 sentences from the training data (8936 sentences) for development. So we trained our chunking models on 7936 training sentences, evaluated their F1 score on the 1000 development sentences, and used a CRF (CRFsuite, http://www.chokkan.org/software/crfsuite/) as the supervised classifier. We tuned the size of the embedding and the magnitude of the ℓ2 regularization penalty in the CRF on the development set, and took the log (or −log of the magnitude) of the value of the features; our embeddings are learnt using a linear model whereas the CRF is a log-linear model, so this normalization keeps the two on the same scale. The regularization penalty that gave the best performance on the development set was 2, and here again the best size of LR-MVL embeddings (state-space) was k = 50. Finally, we trained the CRF on the entire ("original") training data, i.e. 8936 sentences.

4.1.3 Unlabeled Data and Induction of Embeddings

For inducing the embeddings we used the RCV1 corpus, containing Reuters newswire from Aug '96 to Aug '97, with about 63 million tokens in 3.3 million sentences; we chose this particular dataset to make a fair comparison with [1, 16], who report results using RCV1 as unlabeled data. Case was left intact and we did not do the "cleaning" done by [18, 16], i.e. removing all sentences which are less than 90% lowercase a–z, as our multi-view learning approach is robust to such noisy data (like news byline text, mostly all caps, which does not correlate strongly with the text of the article). We induced our LR-MVL embeddings over a period of 3 days (70 core hours on a 3.0 GHz CPU) on the entire RCV1 data by performing 4 iterations with a vocabulary size of 300k and a variety of smoothing rates (α in Algorithm 1) to capture correlations between shorter and longer contexts, α = [0.005, 0.01, 0.05, 0.1, 0.5, 0.9]; in principle we could tune the smoothing parameters on the development set, but we found this mixture of long and short term dependencies to work well in practice. As for the other embeddings, i.e. C&W, HLBL and Brown clusters, we downloaded them from http://metaoptimize.com/projects/wordreprs. The details about their induction and parameter tuning can be found in [16]; we report their best numbers here.
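The multi-scale smoothing mentioned above amounts to running Step 5 of Algorithm 1 once per rate and concatenating the results; the fragment below is our own sketch of that construction, using the rates quoted in this section (all names and shapes are assumptions):

```python
import numpy as np

ALPHAS = [0.005, 0.01, 0.05, 0.1, 0.5, 0.9]   # smoothing rates quoted in Section 4.1.3

def multiscale_views(Z, alphas=ALPHAS):
    """Concatenate left/right exponential smooths of the per-token states Z (n x k)
    at several time scales, giving L and R of shape n x (k * len(alphas))."""
    n, k = Z.shape
    L_parts, R_parts = [], []
    for a in alphas:
        S_l = np.zeros((n, k))
        S_r = np.zeros((n, k))
        for t in range(1, n):                  # smooth over the past (left view)
            S_l[t] = (1 - a) * S_l[t - 1] + a * Z[t - 1]
        for t in range(n - 2, -1, -1):         # smooth over the future (right view)
            S_r[t] = (1 - a) * S_r[t + 1] + a * Z[t + 1]
        L_parts.append(S_l)
        R_parts.append(S_r)
    return np.hstack(L_parts), np.hstack(R_parts)

# Small alphas average over long histories; large alphas track the immediate neighbours.
Z = np.random.default_rng(0).normal(size=(1000, 50))
L, R = multiscale_views(Z)    # each of shape 1000 x 300
```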
It is also worth noting that the unsupervised training of LR-MVL was more than 1.5 times faster than that of the other embeddings (and some of those embeddings were trained on GPGPUs, which makes our method even faster in comparison).

4.2 Results

The results for NER and chunking are shown in Tables 1 and 2, respectively, which show that LR-MVL performs significantly better than state-of-the-art competing methods on both the NER and chunking tasks.

Table 1: NER Results.

                                          F1-Score
Embedding/Model                           Dev. Set   Test Set
No Gazetteers
  Baseline                                90.03      84.39
  C&W, 200-dim                            92.46      87.46
  HLBL, 100-dim                           92.00      88.13
  Brown, 1000 clusters                    92.32      88.52
  Ando & Zhang '05                        93.15      89.31
  Suzuki & Isozaki '08                    93.66      89.36
  LR-MVL (CO), 50 × 3-dim                 93.11      89.55
  LR-MVL, 50 × 3-dim                      93.61      89.91
With Gazetteers
  HLBL, 100-dim                           92.91      89.35
  C&W, 200-dim                            92.98      88.88
  Brown, 1000 clusters                    93.25      89.41
  LR-MVL (CO), 50 × 3-dim                 93.91      89.89
  LR-MVL, 50 × 3-dim                      94.41      90.06

Notes: 1) LR-MVL (CO) are Context-Oblivious embeddings, which are obtained from A in Algorithm 1. 2) F1-score = harmonic mean of precision and recall. 3) The current state of the art for this NER task is 90.90 (Test Set), but using 700 billion tokens of unlabeled data [19].

Table 2: Chunking Results.

Embedding/Model                           Test Set F1-Score
Baseline                                  93.79
HLBL, 50-dim                              94.00
C&W, 50-dim                               94.10
Brown, 3200 clusters                      94.11
Ando & Zhang '05                          94.39
Suzuki & Isozaki '08                      94.67
LR-MVL (CO), 50 × 3-dim                   95.02
LR-MVL, 50 × 3-dim                        95.44

It is important to note that in problems like NER the final accuracy depends on performance on rare words, and since LR-MVL is robustly able to correlate past with future views, it learns better representations for rare words, resulting in better overall accuracy. On rare words (occurring < 10 times in the corpus), we got 11.7%, 10.7% and 9.6% relative reductions in error over C&W, HLBL and Brown respectively for NER; on chunking the corresponding numbers were 6.7%, 7.1% and 8.7%. It is also worth mentioning that modeling the context in the embeddings gives decent improvements in accuracy on both the NER and chunking problems. For NER, the polysemous words were mostly ones like Chicago, Wales and Oakland, which could be either a location or an organization (sports teams, banks etc.). When we do not use the gazetteer features (which are known lists of cities, persons, organizations etc.), modeling the context gives a larger increase in F-score; when gazetteer features are already present, they capture most of the information about polysemous words on the NER dataset, and modeling the context does not help as much. The polysemous words for the chunking dataset were ones like spot (VP/NP), never (VP/ADVP) and more (NP/VP/ADVP/ADJP), and in this case embeddings with context helped significantly, giving a 3.1–6.5% relative improvement in accuracy over context oblivious embeddings.

5 Summary and Conclusion

In this paper, we presented a novel CCA-based multi-view learning method, LR-MVL, for large scale sequence learning problems such as arise in NLP.
LR-MVL is a spectral method that works in a low dimensional state-space, so it is computationally efficient and can be trained using large amounts of unlabeled data; moreover, it does not get stuck in local optima the way an EM-trained HMM can. The embeddings learnt using LR-MVL can be used as features with any supervised learner. LR-MVL has strong theoretical grounding, is much simpler and faster than competing methods, and achieves state-of-the-art accuracies on NER and chunking problems.

Acknowledgements: The authors would like to thank Alexander Yates, Ted Sandler and the three anonymous reviewers for providing valuable feedback. We would also like to thank Lev Ratinov and Joseph Turian for answering our questions regarding their paper [16].

References
[1] Ando, R., Zhang, T.: A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research 6 (2005) 1817–1853
[2] Suzuki, J., Isozaki, H.: Semi-supervised sequential labeling and segmentation using giga-word scale unlabeled data. In: ACL. (2008)
[3] Brown, P., deSouza, P., Mercer, R., Pietra, V.D., Lai, J.: Class-based n-gram models of natural language. Comput. Linguist. 18 (December 1992) 467–479
[4] Pereira, F., Tishby, N., Lee, L.: Distributional clustering of English words. In: 31st Annual Meeting of the ACL. (1993) 183–190
[5] Huang, F., Yates, A.: Distributional representations for handling sparsity in supervised sequence-labeling. ACL '09, Stroudsburg, PA, USA, Association for Computational Linguistics (2009) 495–503
[6] Collobert, R., Weston, J.: A unified architecture for natural language processing: deep neural networks with multitask learning. ICML '08, New York, NY, USA, ACM (2008) 160–167
[7] Mnih, A., Hinton, G.: Three new graphical models for statistical language modelling. ICML '07, New York, NY, USA, ACM (2007) 641–648
[8] Dumais, S., Furnas, G., Landauer, T., Deerwester, S., Harshman, R.: Using latent semantic analysis to improve access to textual information. In: SIGCHI Conference on Human Factors in Computing Systems, ACM (1988) 281–285
[9] Hsu, D., Kakade, S., Zhang, T.: A spectral algorithm for learning hidden Markov models. In: COLT. (2009)
[10] Siddiqi, S., Boots, B., Gordon, G.J.: Reduced-rank hidden Markov models. In: AISTATS. (2010)
[11] Song, L., Boots, B., Siddiqi, S.M., Gordon, G.J., Smola, A.J.: Hilbert space embeddings of hidden Markov models. In: ICML. (2010)
[12] Hotelling, H.: Canonical correlation analysis (CCA). Journal of Educational Psychology (1935)
[13] Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: COLT '98. (1998) 92–100
[14] Halko, N., Martinsson, P.G., Tropp, J.: Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. (Dec 2010)
[15] Zhang, T., Johnson, D.: A robust risk minimization based named entity recognition system. CoNLL '03 (2003) 204–207
[16] Turian, J., Ratinov, L., Bengio, Y.: Word representations: a simple and general method for semi-supervised learning. ACL '10, Stroudsburg, PA, USA, Association for Computational Linguistics (2010) 384–394
[17] Ratinov, L., Roth, D.: Design challenges and misconceptions in named entity recognition. In: CoNLL. (2009) 147–155
[18] Liang, P.: Semi-supervised learning for natural language. Master's thesis, Massachusetts Institute of Technology (2005)
[19] Lin, D., Wu, X.: Phrase clustering for discriminative learning.
In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2. ACL '09, Stroudsburg, PA, USA, Association for Computational Linguistics (2009) 1030–1038

2 0.71329063 194 nips-2011-On Causal Discovery with Cyclic Additive Noise Models

Author: Joris M. Mooij, Dominik Janzing, Tom Heskes, Bernhard Schölkopf

Abstract: We study a particular class of cyclic causal models, where each variable is a (possibly nonlinear) function of its parents and additive noise. We prove that the causal graph of such models is generically identifiable in the bivariate, Gaussian-noise case. We also propose a method to learn such models from observational data. In the acyclic case, the method reduces to ordinary regression, but in the more challenging cyclic case, an additional term arises in the loss function, which makes it a special case of nonlinear independent component analysis. We illustrate the proposed method on synthetic data. 1

3 0.64394855 92 nips-2011-Expressive Power and Approximation Errors of Restricted Boltzmann Machines

Author: Guido F. Montufar, Johannes Rauh, Nihat Ay

Abstract: We present explicit classes of probability distributions that can be learned by Restricted Boltzmann Machines (RBMs) depending on the number of units that they contain, and which are representative for the expressive power of the model. We use this to show that the maximal Kullback-Leibler divergence to the RBM model with n visible and m hidden units is bounded from above by (n−1)−log(m+1). In this way we can specify the number of hidden units that guarantees a sufficiently rich model containing different classes of distributions and respecting a given error tolerance. 1

4 0.56216925 186 nips-2011-Noise Thresholds for Spectral Clustering

Author: Sivaraman Balakrishnan, Min Xu, Akshay Krishnamurthy, Aarti Singh

Abstract: Although spectral clustering has enjoyed considerable empirical success in machine learning, its theoretical properties are not yet fully developed. We analyze the performance of a spectral algorithm for hierarchical clustering and show that on a class of hierarchically structured similarity matrices, this algorithm can tolerate noise that grows with the number of data points while still perfectly recovering the hierarchical clusters with high probability. We additionally improve upon previous results for k-way spectral clustering to derive conditions under which spectral clustering makes no mistakes. Further, using minimax analysis, we derive tight upper and lower bounds for the clustering problem and compare the performance of spectral clustering to these information theoretic limits. We also present experiments on simulated and real world data illustrating our results. 1

5 0.55812889 263 nips-2011-Sparse Manifold Clustering and Embedding

Author: Ehsan Elhamifar, René Vidal

Abstract: We propose an algorithm called Sparse Manifold Clustering and Embedding (SMCE) for simultaneous clustering and dimensionality reduction of data lying in multiple nonlinear manifolds. Similar to most dimensionality reduction methods, SMCE finds a small neighborhood around each data point and connects each point to its neighbors with appropriate weights. The key difference is that SMCE finds both the neighbors and the weights automatically. This is done by solving a sparse optimization problem, which encourages selecting nearby points that lie in the same manifold and approximately span a low-dimensional affine subspace. The optimal solution encodes information that can be used for clustering and dimensionality reduction using spectral clustering and embedding. Moreover, the size of the optimal neighborhood of a data point, which can be different for different points, provides an estimate of the dimension of the manifold to which the point belongs. Experiments demonstrate that our method can effectively handle multiple manifolds that are very close to each other, manifolds with non-uniform sampling and holes, as well as estimate the intrinsic dimensions of the manifolds. 1 1.1

6 0.55436933 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration

7 0.5537082 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries

8 0.5505107 271 nips-2011-Statistical Tests for Optimization Efficiency

9 0.54969859 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition

10 0.54940581 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions

11 0.54888767 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model

12 0.54818869 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time

13 0.54789555 244 nips-2011-Selecting Receptive Fields in Deep Networks

14 0.547782 169 nips-2011-Maximum Margin Multi-Label Structured Prediction

15 0.54649776 105 nips-2011-Generalized Lasso based Approximation of Sparse Coding for Visual Recognition

16 0.54642218 157 nips-2011-Learning to Search Efficiently in High Dimensions

17 0.54572237 172 nips-2011-Minimax Localization of Structural Information in Large Noisy Matrices

18 0.54543883 32 nips-2011-An Empirical Evaluation of Thompson Sampling

19 0.54527086 121 nips-2011-Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent

20 0.54499811 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding