nips nips2011 nips2011-167 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Vijay Mahadevan, Chi W. Wong, Jose C. Pereira, Tom Liu, Nuno Vasconcelos, Lawrence K. Saul
Abstract: We propose maximum covariance unfolding (MCU), a manifold learning algorithm for simultaneous dimensionality reduction of data from different input modalities. Given high dimensional inputs from two different but naturally aligned sources, MCU computes a common low dimensional embedding that maximizes the cross-modal (inter-source) correlations while preserving the local (intra-source) distances. In this paper, we explore two applications of MCU. First we use MCU to analyze EEG-fMRI data, where an important goal is to visualize the fMRI voxels that are most strongly correlated with changes in EEG traces. To perform this visualization, we augment MCU with an additional step for metric learning in the high dimensional voxel space. Second, we use MCU to perform cross-modal retrieval of matched image and text samples from Wikipedia. To manage large applications of MCU, we develop a fast implementation based on ideas from spectral graph theory. These ideas transform the original problem for MCU, one of semidefinite programming, into a simpler problem in semidefinite quadratic linear programming. 1
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract We propose maximum covariance unfolding (MCU), a manifold learning algorithm for simultaneous dimensionality reduction of data from different input modalities. [sent-10, score-0.242]
2 Given high dimensional inputs from two different but naturally aligned sources, MCU computes a common low dimensional embedding that maximizes the cross-modal (inter-source) correlations while preserving the local (intra-source) distances. [sent-11, score-0.409]
3 First we use MCU to analyze EEG-fMRI data, where an important goal is to visualize the fMRI voxels that are most strongly correlated with changes in EEG traces. [sent-13, score-0.104]
4 To perform this visualization, we augment MCU with an additional step for metric learning in the high dimensional voxel space. [sent-14, score-0.261]
5 Second, we use MCU to perform cross-modal retrieval of matched image and text samples from Wikipedia. [sent-15, score-0.122]
6 1 Introduction Recent advances in manifold learning and nonlinear dimensionality reduction have led to powerful, new methods for the analysis and visualization of high dimensional data [14, 1, 20, 24, 16]. [sent-18, score-0.254]
7 1 All these methods solve variants of the same basic underlying problem: given high dimensional inputs, {x1 , x2 , . [sent-21, score-0.119]
8 , xn }, compute low dimensional outputs {y1 , y2 , . [sent-24, score-0.169]
9 For instance, in certain applications, aligned data is acquired from two different modalities — we refer to such data as bimodal — and the goal is to find low dimensional representations that capture their interdependencies. [sent-32, score-0.354]
10 In this paper, we investigate the use of maximum variance unfolding (MVU) [24] for the simultaneous dimensionality reduction of data from different input modalities. [sent-33, score-0.14]
11 In its original formulation, MVU computes a low dimensional embedding that maximizes the variance of its outputs, subject to constraints that preserve local distances. [sent-35, score-0.262]
12 We explore a modification of MVU that computes a joint embedding of high dimensional inputs from different data sources. [sent-36, score-0.183]
13 In this joint embedding, our goal is to discover a common low dimensional representation of just those degrees of variability that are correlated across different modalities. [sent-37, score-0.213]
14 To achieve this goal, we design the embedding to maximize the inter-source correlation between aligned outputs while preserving the local, intra-source distances. [sent-38, score-0.253]
15 By analogy to MVU, we call our approach maximum covariance unfolding (MCU). [sent-39, score-0.101]
16 In addition, for one of our applications—the analysis of EEG-fMRI data— we show how to extend the basic optimization of MCU to visualize the high dimensional correlations between different input modalities. [sent-44, score-0.119]
17 This is done by adding extra variables to the original SDP; these variables can be viewed as performing a type of metric learning in the high dimensional voxel space. [sent-45, score-0.261]
18 In particular, they indicate which fMRI voxels (in the high dimensional space of fMRI images) correlate most strongly with observed changes in the EEG recordings. [sent-46, score-0.211]
19 Bowling et al [4, 5] developed a related approach known as action-respecting embedding for problems in robot localization. [sent-48, score-0.064]
20 Song et al [18] reinterpreted the optimization criterion of MVU, then proposed an extension of the original algorithm that computes low dimensional embeddings subject to class labels or other side information. [sent-49, score-0.277]
21 2 Maximum Covariance Unfolding We propose a novel adaptation of MVU, termed maximum covariance unfolding or MCU to perform non-linear correlation between two aligned datasets whose points have a one-to-one correspondence. [sent-52, score-0.34]
22 MCU embeds the two datasets, of different dimensions, into a single low dimensional manifold such that the two resulting embeddings are maximally correlated. [sent-53, score-0.349]
23 As in MVU, the embeddings are such that local distances are preserved. [sent-54, score-0.107]
24 1 Formulation Let {x1i }n , x1i ∈ Rp1 and {x2i }n , x2i ∈ Rp2 be two aligned datasets belonging to two difi=1 i=1 ferent input spaces, and {y1i }n , y1i ∈ Rd and {y2i }n , y2i ∈ Rd be the corresponding low i=1 i=1 dimensional representations (in the output space), with d p1 and d p2 . [sent-57, score-0.256]
25 As in MVU [21], we need to find a low dimensional mapping such that the Euclidean distance between pairs of points in a local neighborhood are preserved. [sent-58, score-0.279]
26 For each dataset s ∈ {1, 2}, if points xsj and xsk are neighbors or are common neighbors of another point, we denote an indicator variable ηsij = 1. [sent-59, score-0.144]
27 This allows us to formulate the MCU very similarly to the MVU formulation of [21], and so we omit the details for the sake of brevity. [sent-62, score-0.055]
28 The equivalent matrix constraints are, Kij = 0, ∀i, j ≤ n Kij = 0, ∀i, j > n ij (4) ij The objective function is to maximize the covariance between the low dimensional representations of the two datasets. [sent-64, score-0.337]
29 One shortcoming of the MCU formulation is that it provides no means to visualize the results. [sent-70, score-0.055]
30 While the low-dimensional embeddings of the two datasets may be well correlated, there is no way to identify which dimensions or covariates of the data points in one modality contribute to high correlation with the points in the other modality. [sent-71, score-0.397]
31 2 Metric Learning for Visualization For each dimension in one dataset, we need to compute a measure of how much it contributes to the correlation between the datasets. [sent-74, score-0.103]
32 This can be done using a metric learning type step applied to data of one or both modalities within the MCU formulation. [sent-75, score-0.123]
33 The MCU formulation of Section 2 assumes that the distances between the points is Euclidean. [sent-77, score-0.132]
34 However, inspired by the recently proposed ideas in metric learning [22], we use a more general distance metric by applying a linear transformation T1 of size p1 × p1 in the space, and then perform MCU using the transformed points, T1 xi . [sent-79, score-0.126]
35 This allows some distances to shrink/expand if that would help in increasing the correlation with {x2i }. [sent-80, score-0.131]
36 i=1 In order to find the weight vector that produces the maximal correlation between the two datasets, these p1 new variables can be learned within the MCU framework by adding them to the optimization 3 problem. [sent-83, score-0.133]
37 As each dimension has a corresponding weight, the optimal weight vector returned would be a map over the dimensions indicating how strongly each is correlated to {x2i }. [sent-84, score-0.173]
38 To modify the MCU formulation to include these new variables, we replace all Euclidean distance measurements for the data points in the first dataset in (2) with the weighted distance D1ij = m σm (xim − xjm )2 . [sent-85, score-0.197]
39 This adds a linear function of the new weight variables to the existing distance constraints of (2). [sent-86, score-0.06]
40 However, if we had to define the neighborhood of a data point itself using this weighted distance, the formulation would become non-convex. [sent-87, score-0.086]
41 So we assume that the neighborhood is composed of points that are closest in time . [sent-88, score-0.08]
42 The objective function of (6) does not change, but we need to maximize the objective over the p1 weight variables also. [sent-91, score-0.059]
43 The new formulation, denoted MCU-ML, is written as: Maximize: Wij Kij , with W = ij subject to: σk ≥ 0, ∀k ∈ {1 . [sent-93, score-0.084]
44 3 Resting-state EEG-fMRI Data In the absence of an explicit task, temporal synchrony of the blood oxygenation level dependent (BOLD) signal is maintained across distinct brain regions. [sent-98, score-0.102]
45 fMRI datasets have high resolution of the order of a few millimeters, but offer poor temporal resolution as it measures the delayed haemodynamic response to neural activity. [sent-100, score-0.11]
46 In addition, changes in resting-state BOLD connectivity measures are typically interpreted as changes in coherent neural activity across respective brain regions. [sent-101, score-0.111]
47 However, this interpretation may be misleading because the BOLD signal is a complex function of neural activity, oxygen metabolism, cerebral blood flow (CBF), and cerebral blood volume (CBV) [3]. [sent-102, score-0.062]
48 To address these shortcomings, simultaneous acquisition of electroencephalographic data (EEG) during functional magnetic resonance imaging (fMRI) is becoming more popular in brain imaging [13]. [sent-103, score-0.197]
49 The EEG recording provides high temporal resolution of neural activity (5kHz), but poor spatial resolution due to electric signal distortion by the skull and scalp and the limitations on the number of electrodes that can be placed on the scalp. [sent-104, score-0.107]
50 Therefore the goal of simultaneous acquisition of EEG and fMRI is to exploit the complementary nature of the two imaging modalities to obtain spatiotemporally resolved neural signal and metabolic state information [13]. [sent-105, score-0.169]
51 Specifically, using high temporal resolution EEG data, we are able to examine dynamic changes and non-stationary properties of neural activity at different frequency bands. [sent-106, score-0.067]
52 By correlating with the EEG data with the high resolution BOLD data, we are able to examine the corresponding spatial regions in which neural activity occurs. [sent-107, score-0.067]
53 Most often, a simple voxel-wise correlation of the fMRI data with the EEG power time series in a specific frequency band is performed [13]. [sent-109, score-0.103]
54 To address this issue, more sophisticated linear methods such as canonical correlation analysis (CCA) [7], and the partial least squares method [11] have been proposed. [sent-111, score-0.134]
55 Therefore, lin4 ear approaches to correlate the fMRI data with the EEG data may not capture any low dimensional manifold structure. [sent-113, score-0.274]
56 To address these limitations we propose the use of MCU to learn low dimensional manifolds for both the fMRI and EEG data such that the output embeddings are maximally correlated. [sent-114, score-0.302]
57 In addition, we learn a metric in the fMRI input space to identify which voxels of the fMRI correlate most strongly with observed changes in the EEG recordings. [sent-115, score-0.14]
58 fMRI data were acquired with the following parameters: echo planar imaging with 150 volumes, 30 slices, 3. [sent-126, score-0.058]
59 438 × 5mm3 voxel size, 64 × 64 matrix size, TR=2s, TE=30ms. [sent-128, score-0.094]
60 The 5 frequency channels of the EEG data were averaged to produce a 63 dimensional time series of 145 time points. [sent-130, score-0.119]
61 The fMRI data consisted of a 122880 (64 × 64 × 30) dimensional time series with 145 time points. [sent-131, score-0.119]
62 Therefore, attempts to embed these points, especially the fMRI data, into a low dimensional manifold have been made using non-linear dimensionality reduction techniques such as Laplacian eigenmaps [17]. [sent-135, score-0.276]
63 While such techniques may be used to find manifold embeddings for fMRI and EEG data separately, they are not useful for finding patterns of correlation between the two. [sent-136, score-0.255]
64 We applied the MCU-ML approach to learn a visualization map and a joint low dimensional embedding for the EEG-fMRI dataset. [sent-141, score-0.313]
65 For CCA, the average of the canonical directions (weighted using the canonical correlations) was used as the weight vector. [sent-144, score-0.092]
66 In both cases, the 145 dimensional weight vector was projected back to the fMRI voxel space using the principal components of the PCA step. [sent-145, score-0.243]
67 Two types of voxel wise correlations maps were computed to assess the performance of MCU-ML. [sent-146, score-0.094]
68 First, a naive correlation map was generated where each voxel was separately correlated with the average EEG power time course from the alpha aband (8-12Hz) (which is known to be correlated with the fMRI resting-state network [13]) from all the 63 electrodes. [sent-147, score-0.337]
69 The averaged fMRI signal from the ROI was then correlated with the whole brain to obtain a voxel-wise correlation map. [sent-150, score-0.191]
70 Therefore, voxels in the PCC region should have high correlation with the EEG data. [sent-151, score-0.163]
71 This information provides a “sanity-check” version of the fMRI correlation map. [sent-152, score-0.103]
72 The functional connectivity map is shown in 2(a), and the correlation map obtained using MCU-ML, overlaid with the relevant anatomical regions appears in 2(b). [sent-154, score-0.364]
73 (a) naive correlation map (b) using only PCA (c) using CCA (d) using MCU-ML match, showing well localized regions of positive correlation in the DMN, and regions of negative correlation in the TPN. [sent-189, score-0.361]
74 The correlation maps for 12 slices overlaid with over a high-resolution T1weighted image for the proposed MCU-ML approach are shown in Figure 3(b). [sent-190, score-0.23]
75 8 60 10 20 30 40 50 60 (a) (b) Figure 2: (a) the functional connectivity map, and (b) the MCU-ML correlation map overlaid with information about the anatomical regions relevant during rest state. [sent-199, score-0.312]
76 (b) A montage showing the recovered weights for each voxel in the 12 anatomically significant slices, with the MCU overlaid on a high-resolution T1-weighted image. [sent-205, score-0.19]
77 It is also interesting to see that the weights that produce maximal correlation with the EEG dataset are very different from the eigenvalues of PCA themselves, indicating that the dimensions that are important for correlation are not necessarily the ones with maximum variance. [sent-209, score-0.253]
78 [23] modified the original formulation using graph laplacian regularization to reduce the size of the SDP. [sent-213, score-0.126]
79 The graph laplacian depends only on nearest neighbor relations and in MCU these are assumed to be unchanged as the points are embedded from the original space to the low dimensional manifold. [sent-217, score-0.289]
80 From the solution of the SQLP, the vectors{u1i }m and {u2i }m , can be obtained using the spectral decomposition method i=1 i=1 described in [21], followed by the low dimensional coordinates {y1i }n and {y2i }n , using (8). [sent-232, score-0.195]
81 Figure 4 shows the embeddings of this data generated by CCA and 7 by Fast-MCU. [sent-235, score-0.079]
82 While CCA discovers two significant dimensions, the Fast-MCU accurately extracts the low dimensional manifold where the embeddings lie in a narrow strip. [sent-236, score-0.321]
83 (c) low dimensional manifolds obtained using Fast-MCU, before and after the gradient based improvement step. [sent-239, score-0.195]
84 (best viewed in color) To further test the proposed Fast-MCU on real data, we use the recently proposed Wikipedia dataset composed of text and image pairs [12]. [sent-240, score-0.077]
85 The dataset consists of 2866 text - image pairs, each belonging to one of 10 semantic categories. [sent-241, score-0.077]
86 We first represented the text using an LDA model [2] with 20 topics, and the image using a histogram over a SIFT [10] codebook of 4096 codewords. [sent-249, score-0.077]
87 The common low dimensional manifold was learned from the text-image pairs of the training set using the SQLP based formulation of (13), with m = 20, followed by a gradient ascent step as described in the previous section. [sent-250, score-0.297]
88 To compare the performance of Fast-MCU, we also used CCA and kernel CCA (kCCA) to learn the maximally correlated joint spaces from the training set. [sent-251, score-0.072]
89 For CCA, this involves a linear transformation to the low dimensional subspace, while for kCCA this is achieved by evaluating a linear combination of the kernel functions of the training points [9]. [sent-254, score-0.218]
90 The same mapping is then applied to the projection of the neighbors in the learned low dimensional joint manifold to compute the projection of the test point. [sent-256, score-0.273]
91 To perform retrieval, all the test points of both modalities, image and text, are projected to the joint space learned using the training set. [sent-257, score-0.082]
92 For a given test point of one modality, its distance to all the projected test points of the other modality are computed, and these are then ranked. [sent-258, score-0.119]
93 In this work, we used the normalized correlation distance, which was shown to be the best performing distance metric in [12]. [sent-259, score-0.181]
94 In addition, Fast-MCU produced significantly lower number of dimensions for the embeddings - CCA produced 19 signficant dimensions compared to just 3 for Fast-MCU. [sent-264, score-0.173]
95 198 Conclusions In this paper, we describe an adaptation of MVU to analyze correlation of high-dimensional aligned data such as EEG-fMRI data and image-text corpora. [sent-273, score-0.16]
96 Our results on EEG-fMRI data show that 8 the proposed approach is able to make anatomically significant predictions about which voxels of the fMRI are most correlated with changes in EEG signals. [sent-274, score-0.142]
97 This ability of MCU makes it much more broadly applicable because in general we expect inputs from truly different modalities to have many independent degrees of freedom: e. [sent-277, score-0.075]
98 Multi-set canonical correlation analysis for the fusion of concurrent single trial ERP and functional MRI. [sent-323, score-0.161]
99 Canonical correlation analysis: An overview with application to learning methods. [sent-336, score-0.103]
100 Fast graph laplacian regularized kernel learning via semidefinite– quadratic–linear programming. [sent-437, score-0.071]
wordName wordTfidf (topN-words)
[('mcu', 0.63), ('fmri', 0.319), ('eeg', 0.274), ('mvu', 0.232), ('cca', 0.203), ('kij', 0.137), ('kii', 0.133), ('kjj', 0.133), ('dimensional', 0.119), ('correlation', 0.103), ('sqlp', 0.099), ('voxel', 0.094), ('kcca', 0.083), ('jolla', 0.08), ('embeddings', 0.079), ('sdp', 0.077), ('modalities', 0.075), ('manifold', 0.073), ('unfolding', 0.072), ('dmn', 0.066), ('embedding', 0.064), ('qt', 0.062), ('diego', 0.061), ('voxels', 0.06), ('overlaid', 0.058), ('aligned', 0.057), ('semide', 0.056), ('ij', 0.055), ('formulation', 0.055), ('map', 0.052), ('la', 0.051), ('low', 0.05), ('tpn', 0.05), ('ukk', 0.05), ('weinberger', 0.05), ('points', 0.049), ('metric', 0.048), ('san', 0.048), ('sdps', 0.048), ('dimensions', 0.047), ('retrieval', 0.045), ('text', 0.044), ('correlated', 0.044), ('brain', 0.044), ('laplacian', 0.044), ('neuroimage', 0.042), ('resolution', 0.04), ('modality', 0.04), ('connectivity', 0.04), ('anatomically', 0.038), ('shaw', 0.038), ('slices', 0.036), ('bold', 0.036), ('bowling', 0.034), ('simultaneous', 0.034), ('dimensionality', 0.034), ('california', 0.033), ('costa', 0.033), ('ghodsi', 0.033), ('sedumi', 0.033), ('xim', 0.033), ('xjm', 0.033), ('xsj', 0.033), ('ysi', 0.033), ('image', 0.033), ('pca', 0.033), ('ece', 0.033), ('anatomical', 0.032), ('correlate', 0.032), ('imaging', 0.032), ('neighbors', 0.031), ('canonical', 0.031), ('neighborhood', 0.031), ('ca', 0.031), ('blood', 0.031), ('distance', 0.03), ('weight', 0.03), ('datasets', 0.03), ('covariance', 0.029), ('maximize', 0.029), ('subject', 0.029), ('distances', 0.028), ('acquisition', 0.028), ('maximally', 0.028), ('visualization', 0.028), ('zi', 0.028), ('activity', 0.027), ('graph', 0.027), ('wikipedia', 0.027), ('bimodal', 0.027), ('synchrony', 0.027), ('roi', 0.027), ('functional', 0.027), ('manifolds', 0.026), ('acquired', 0.026), ('spectral', 0.026), ('qk', 0.026), ('pcc', 0.025), ('rolls', 0.025), ('radiology', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999997 167 nips-2011-Maximum Covariance Unfolding : Manifold Learning for Bimodal Data
Author: Vijay Mahadevan, Chi W. Wong, Jose C. Pereira, Tom Liu, Nuno Vasconcelos, Lawrence K. Saul
Abstract: We propose maximum covariance unfolding (MCU), a manifold learning algorithm for simultaneous dimensionality reduction of data from different input modalities. Given high dimensional inputs from two different but naturally aligned sources, MCU computes a common low dimensional embedding that maximizes the cross-modal (inter-source) correlations while preserving the local (intra-source) distances. In this paper, we explore two applications of MCU. First we use MCU to analyze EEG-fMRI data, where an important goal is to visualize the fMRI voxels that are most strongly correlated with changes in EEG traces. To perform this visualization, we augment MCU with an additional step for metric learning in the high dimensional voxel space. Second, we use MCU to perform cross-modal retrieval of matched image and text samples from Wikipedia. To manage large applications of MCU, we develop a fast implementation based on ideas from spectral graph theory. These ideas transform the original problem for MCU, one of semidefinite programming, into a simpler problem in semidefinite quadratic linear programming. 1
2 0.15345857 176 nips-2011-Multi-View Learning of Word Embeddings via CCA
Author: Paramveer Dhillon, Dean P. Foster, Lyle H. Ungar
Abstract: Recently, there has been substantial interest in using large amounts of unlabeled data to learn word representations which can then be used as features in supervised classifiers for NLP tasks. However, most current approaches are slow to train, do not model the context of the word, and lack theoretical grounding. In this paper, we present a new learning method, Low Rank Multi-View Learning (LR-MVL) which uses a fast spectral method to estimate low dimensional context-specific word representations from unlabeled data. These representation features can then be used with any supervised learner. LR-MVL is extremely fast, gives guaranteed convergence to a global optimum, is theoretically elegant, and achieves state-ofthe-art performance on named entity recognition (NER) and chunking problems. 1 Introduction and Related Work Over the past decade there has been increased interest in using unlabeled data to supplement the labeled data in semi-supervised learning settings to overcome the inherent data sparsity and get improved generalization accuracies in high dimensional domains like NLP. Approaches like [1, 2] have been empirically very successful and have achieved excellent accuracies on a variety of NLP tasks. However, it is often difficult to adapt these approaches to use in conjunction with an existing supervised NLP system as these approaches enforce a particular choice of model. An increasingly popular alternative is to learn representational embeddings for words from a large collection of unlabeled data (typically using a generative model), and to use these embeddings to augment the feature set of a supervised learner. Embedding methods produce features in low dimensional spaces or over a small vocabulary size, unlike the traditional approach of working in the original high dimensional vocabulary space with only one dimension “on” at a given time. Broadly, these embedding methods fall into two categories: 1. Clustering based word representations: Clustering methods, often hierarchical, are used to group distributionally similar words based on their contexts. The two dominant approaches are Brown Clustering [3] and [4]. As recently shown, HMMs can also be used to induce a multinomial distribution over possible clusters [5]. 2. Dense representations: These representations are dense, low dimensional and real-valued. Each dimension of these representations captures latent information about a combination of syntactic and semantic word properties. They can either be induced using neural networks like C&W; embeddings [6] and Hierarchical log-linear (HLBL) embeddings [7] or by eigen-decomposition of the word co-occurrence matrix, e.g. Latent Semantic Analysis/Latent Semantic Indexing (LSA/LSI) [8]. Unfortunately, most of these representations are 1). slow to train, 2). sensitive to the scaling of the embeddings (especially 2 based approaches like LSA/PCA), 3). can get stuck in local optima (like EM trained HMM) and 4). learn a single embedding for a given word type; i.e. all the occurrences 1 of the word “bank” will have the same embedding, irrespective of whether the context of the word suggests it means “a financial institution” or “a river bank”. In this paper, we propose a novel context-specific word embedding method called Low Rank MultiView Learning, LR-MVL, which is fast to train and is guaranteed to converge to the optimal solution. As presented here, our LR-MVL embeddings are context-specific, but context oblivious embeddings (like the ones used by [6, 7]) can be trivially gotten from our model. Furthermore, building on recent advances in spectral learning for sequence models like HMMs [9, 10, 11] we show that LR-MVL has strong theoretical grounding. Particularly, we show that LR-MVL estimates low dimensional context-specific word embeddings which preserve all the information in the data if the data were generated by an HMM. Moreover, LR-MVL being linear does not face the danger of getting stuck in local optima as is the case for an EM trained HMM. LR-MVL falls into category (2) mentioned above; it learns real-valued context-specific word embeddings by performing Canonical Correlation Analysis (CCA) [12] between the past and future views of low rank approximations of the data. However, LR-MVL is more general than those methods, which work on bigram or trigram co-occurrence matrices, in that it uses longer word sequence information to estimate context-specific embeddings and also for the reasons mentioned in the last paragraph. The remainder of the paper is organized as follows. In the next section we give a brief overview of CCA, which forms the core of our method. Section 3 describes our proposed LR-MVL algorithm in detail and gives theory supporting its performance. Section 4 demonstrates the effectiveness of LR-MVL on the NLP tasks of Named Entity Recognition and Chunking. We conclude with a brief summary in Section 5. 2 Brief Review: Canonical Correlation Analysis (CCA) CCA [12] is the analog to Principal Component Analysis (PCA) for pairs of matrices. PCA computes the directions of maximum covariance between elements in a single matrix, whereas CCA computes the directions of maximal correlation between a pair of matrices. Unlike PCA, CCA does not depend on how the observations are scaled. This invariance of CCA to linear data transformations allows proofs that keeping the dominant singular vectors (those with largest singular values) will faithfully capture any state information. More specifically, given a set of n paired observation vectors {(l1 , r1 ), ..., (ln , rn )}–in our case the two matrices are the left (L) and right (R) context matrices of a word–we would like to simultaneously find the directions Φl and Φr that maximize the correlation of the projections of L onto Φl with the projections of R onto Φr . This is expressed as max Φl ,Φr E[ L, Φl R, Φr ] E[ L, Φl 2 ]E[ R, Φr 2 ] (1) where E denotes the empirical expectation. We use the notation Clr (Cll ) to denote the cross (auto) covariance matrices between L and R (i.e. L’R and L’L respectively.). The left and right canonical correlates are the solutions Φl , Φr of the following equations: Cll −1 Clr Crr −1 Crl Φl = λΦl Crr −1 Crl Cll −1 Clr Φr = λΦr 3 (2) Low Rank Multi-View Learning (LR-MVL) In LR-MVL, we compute the CCA between the past and future views of the data on a large unlabeled corpus to find the common latent structure, i.e., the hidden state associated with each token. These induced representations of the tokens can then be used as features in a supervised classifier (typically discriminative). The context around a word, consisting of the h words to the right and left of it, sits in a high dimensional space, since for a vocabulary of size v, each of the h words in the context requires an indicator function of dimension v. The key move in LR-MVL is to project the v-dimensional word 2 space down to a k dimensional state space. Thus, all eigenvector computations are done in a space that is v/k times smaller than the original space. Since a typical vocabulary contains at least 50, 000 words, and we use state spaces of order k ≈ 50 dimensions, this gives a 1,000-fold reduction in the size of calculations that are needed. The core of our LR-MVL algorithm is a fast spectral method for learning a v × k matrix A which maps each of the v words in the vocabulary to a k-dimensional state vector. We call this matrix the “eigenfeature dictionary”. We now describe the LR-MVL method, give a theorem that provides intuition into how it works, and formally present the LR-MVL algorithm. The Experiments section then shows that this low rank approximation allows us to achieve state-of-the-art performance on NLP tasks. 3.1 The LR-MVL method Given an unlabeled token sequence w={w0 , w1 , . . ., wn } we want to learn a low (k)- dimensional state vector {z0 , z1 , . . . , zn } for each observed token. The key is to find a v ×k matrix A (Algorithm 1) that maps each of the v words in the vocabulary to a reduced rank k-dimensional state vector, which is later used to induce context specific embeddings for the tokens (Algorithm 2). For supervised learning, these context specific embeddings are supplemented with other information about each token wt , such as its identity, orthographic features such as prefixes and suffixes or membership in domain-specific lexicons, and used as features in a classifier. Section 3.4 gives the algorithm more formally, but the key steps in the algorithm are, in general terms: • Take the h words to the left and to the right of each target word wt (the “Left” and “Right” contexts), and project them each down to k dimensions using A. • Take the CCA between the reduced rank left and right contexts, and use the resulting model to estimate a k dimensional state vector (the “hidden state”) for each token. • Take the CCA between the hidden states and the tokens wt . The singular vectors associated with wt form a new estimate of the eigenfeature dictionary. LR-MVL can be viewed as a type of co-training [13]: The state of each token wt is similar to that of the tokens both before and after it, and it is also similar to the states of the other occurrences of the same word elsewhere in the document (used in the outer iteration). LR-MVL takes advantage of these two different types of similarity by alternately estimating word state using CCA on the smooths of the states of the words before and after each target token and using the average over the states associated with all other occurrences of that word. 3.2 Theoretical Properties of LR-MVL We now present the theory behind the LR-MVL algorithm; particularly we show that the reduced rank matrix A allows a significant data reduction while preserving the information in our data and the estimated state does the best possible job of capturing any label information that can be inferred by a linear model. Let L be an n × hv matrix giving the words in the left context of each of the n tokens, where the context is of length h, R be the corresponding n × hv matrix for the right context, and W be an n × v matrix of indicator functions for the words themselves. We will use the following assumptions at various points in our proof: Assumption 1. L, W, and R come from a rank k HMM i.e. it has a rank k observation matrix and rank k transition matrix both of which have the same domain. For example, if the dimension of the hidden state is k and the vocabulary size is v then the observation matrix, which is k × v, has rank k. This rank condition is similar to the one used by [10]. Assumption 1A. For the three views, L, W and R assume that there exists a “hidden state H” of dimension n × k, where each row Hi has the same non-singular variance-covariance matrix and 3 such that E(Li |Hi ) = Hi β T and E(Ri |Hi ) = Hi β T and E(Wi |Hi ) = Hi β T where all β’s are of L R W rank k, where Li , Ri and Wi are the rows of L, R and W respectively. Assumption 1A follows from Assumption 1. Assumption 2. ρ(L, W), ρ(L, R) and ρ(W, R) all have rank k, where ρ(X1 , X2 ) is the expected correlation between X1 and X2 . Assumption 2 is a rank condition similar to that in [9]. Assumption 3. ρ([L, R], W) has k distinct singular values. Assumption 3 just makes the proof a little cleaner, since if there are repeated singular values, then the singular vectors are not unique. Without it, we would have to phrase results in terms of subspaces with identical singular values. We also need to define the CCA function that computes the left and right singular vectors for a pair of matrices: Definition 1 (CCA). Compute the CCA between two matrices X1 and X2 . Let ΦX1 be a matrix containing the d largest singular vectors for X1 (sorted from the largest on down). Likewise for ΦX2 . Define the function CCAd (X1 , X2 ) = [ΦX1 , ΦX2 ]. When we want just one of these Φ’s, we will use CCAd (X1 , X2 )left = ΦX1 for the left singular vectors and CCAd (X1 , X2 )right = ΦX2 for the right singular vectors. Note that the resulting singular vectors, [ΦX1 , ΦX2 ] can be used to give two redundant estimates, X1 ΦX1 and X2 ΦX2 of the “hidden” state relating X1 and X2 , if such a hidden state exists. Definition 2. Define the symbol “≈” to mean X1 ≈ X2 ⇐⇒ lim X1 = lim X2 n→∞ n→∞ where n is the sample size. Lemma 1. Define A by the following limit of the right singular vectors: CCAk ([L, R], W)right ≈ A. Under assumptions 2, 3 and 1A, such that if CCAk (L, R) ≡ [ΦL , ΦR ] then CCAk ([LΦL , RΦR ], W)right ≈ A. Lemma 1 shows that instead of finding the CCA between the full context and the words, we can take the CCA between the Left and Right contexts, estimate a k dimensional state from them, and take the CCA of that state with the words and get the same result. See the supplementary material for the Proof. ˜ Let Ah denote a matrix formed by stacking h copies of A on top of each other. Right multiplying ˜ L or R by Ah projects each of the words in that context into the k-dimensional reduced rank space. The following theorem addresses the core of the LR-MVL algorithm, showing that there is an A which gives the desired dimensionality reduction. Specifically, it shows that the previous lemma also holds in the reduced rank space. Theorem 1. Under assumptions 1, 2 and 3 there exists a unique matrix A such that if ˜ ˜ ˜ ˜ CCAk (LAh , RAh ) ≡ [ΦL , ΦR ] then ˜ ˜ ˜ ˜ CCAk ([LAh ΦL , RAh ΦR ], W)right ≈ A ˜ where Ah is the stacked form of A. See the supplementary material for the Proof 1 . ˆ It is worth noting that our matrix A corresponds to the matrix U used by [9, 10]. They showed that U is sufficient to compute the probability of a sequence of words generated by an HMM; although we do not show ˆ it here (due to limited space), our A provides a more statistically efficient estimate of U than their U , and hence can also be used to estimate the sequence probabilities. 1 4 Under the above assumptions, there is asymptotically (in the limit of infinite data) no benefit to first estimating state by finding the CCA between the left and right contexts and then finding the CCA between the estimated state and the words. One could instead just directly find the CCA between the combined left and rights contexts and the words. However, because of the Zipfian distribution of words, many words are rare or even unique, and hence one is not in the asymptotic limit. In this case, CCA between the rare words and context will not be informative, whereas finding the CCA between the left and right contexts gives a good state vector estimate even for unique words. One can then fruitfully find the CCA between the contexts and the estimated state vector for their associated words. 3.3 Using Exponential Smooths In practice, we replace the projected left and right contexts with exponential smooths (weighted average of the previous (or next) token’s state i.e. Zt−1 (or Zt+1 ) and previous (or next) token’s smoothed state i.e. St−1 (or St+1 ).), of them at a few different time scales, thus giving a further dimension reduction by a factor of context length h (say 100 words) divided by the number of smooths (often 5-7). We use a mixture of both very short and very long contexts which capture short and long range dependencies as required by NLP problems as NER, Chunking, WSD etc. Since exponential smooths are linear, we preserve the linearity of our method. 3.4 The LR-MVL Algorithm The LR-MVL algorithm (using exponential smooths) is given in Algorithm 1; it computes the pair of CCAs described above in Theorem 1. Algorithm 1 LR-MVL Algorithm - Learning from Large amounts of Unlabeled Data 1: Input: Token sequence Wn×v , state space size k, smoothing rates αj 2: Initialize the eigenfeature dictionary A to random values N (0, 1). 3: repeat 4: Set the state Zt (1 < t ≤ n) of each token wt to the eigenfeature vector of the corresponding word. Zt = (Aw : w = wt ) 5: Smooth the state estimates before and after each token to get a pair of views for each smoothing rate αj . (l,j) (l,j) = (1 − αj )St−1 + αj Zt−1 // left view L St (r,j) (r,j) j St = (1 − α )St+1 + αj Zt+1 // right view R. (l,j) (r,j) th where the t rows of L and R are, respectively, concatenations of the smooths St and St for (j) each of the α s. 6: Find the left and right canonical correlates, which are the eigenvectors Φl and Φr of (L L)−1 L R(R R)−1 R LΦl = λΦl . (R R)−1 R L(L L)−1 L RΦr = λΦr . 7: Project the left and right views on to the space spanned by the top k/2 left and right CCAs respectively (k/2) (k/2) Xl = LΦl and Xr = RΦr (k/2) (k/2) where Φl , Φr are matrices composed of the singular vectors of Φl , Φr with the k/2 largest magnitude singular values. Estimate the state for each word wt as the union of the left and right estimates: Z = [Xl , Xr ] 8: Estimate the eigenfeatures of each word type, w, as the average of the states estimated for that word. Aw = avg(Zt : wt = w) 9: Compute the change in A from the previous iteration 10: until |∆A| < 11: Output: Φk , Φk , A . r l A few iterations (∼ 5) of the above algorithm are sufficient to converge to the solution. (Since the problem is convex, there is a single solution, so there is no issue of local minima.) As [14] show for PCA, one can start with a random matrix that is only slightly larger than the true rank k of the correlation matrix, and with extremely high likelihood converge in a few iterations to within a small distance of the true principal components. In our case, if the assumptions detailed above (1, 1A, 2 and 3) are satisfied, our method converges equally rapidly to the true canonical variates. As mentioned earlier, we get further dimensionality reduction in Step 5, by replacing the Left and Right context matrices with a set of exponentially smoothed values of the reduced rank projections of the context words. Step 6 finds the CCA between the Left and Right contexts. Step 7 estimates 5 the state by combining the estimates from the left and right contexts, since we don’t know which will best estimate the state. Step 8 takes the CCA between the estimated state Z and the matrix of words W. Because W is a vector of indicator functions, this CCA takes the trivial form of a set of averages. Once we have estimated the CCA model, it is used to generate context specific embeddings for the tokens from training, development and test sets (as described in Algorithm 2). These embeddings are further supplemented with other baseline features and used in a supervised learner to predict the label of the token. Algorithm 2 LR-MVL Algorithm -Inducing Context Specific Embeddings for Train/Dev/Test Data 1: Input: Model (Φk , Φk , A) output from above algorithm and Token sequences Wtrain , (Wdev , Wtest ) r l 2: Project the left and right views L and R after smoothing onto the space spanned by the top k left and right CCAs respectively Xl = LΦk and Xr = RΦk r l and the words onto the eigenfeature dictionary Xw = W train A 3: Form the final embedding matrix Xtrain:embed by concatenating these three estimates of state Xtrain:embed = [Xl , Xw , Xr ] 4: Output: The embedding matrices Xtrain:embed , (Xdev:embed , Xtest:embed ) with context-specific representations for the tokens. These embeddings are augmented with baseline set of features mentioned in Sections 4.1.1 and 4.1.2 before learning the final classifier. Note that we can get context “oblivious” embeddings i.e. one embedding per word type, just by using the eigenfeature dictionary (Av×k ) output by Algorithm 1. 4 Experimental Results In this section we present the experimental results of LR-MVL on Named Entity Recognition (NER) and Syntactic Chunking tasks. We compare LR-MVL to state-of-the-art semi-supervised approaches like [1] (Alternating Structures Optimization (ASO)) and [2] (Semi-supervised extension of CRFs) as well as embeddings like C&W;, HLBL and Brown Clustering. 4.1 Datasets and Experimental Setup For the NER experiments we used the data from CoNLL 2003 shared task and for Chunking experiments we used the CoNLL 2000 shared task data2 with standard training, development and testing set splits. The CoNLL ’03 and the CoNLL ’00 datasets had ∼ 204K/51K/46K and ∼ 212K/ − /47K tokens respectively for Train/Dev./Test sets. 4.1.1 Named Entity Recognition (NER) We use the same set of baseline features as used by [15, 16] in their experiments. The detailed list of features is as below: • Current Word wi ; Its type information: all-capitalized, is-capitalized, all-digits and so on; Prefixes and suffixes of wi • Word tokens in window of 2 around the current word i.e. (wi−2 , wi−1 , wi , wi+1 , wi+2 ); and capitalization pattern in the window. d = • Previous two predictions yi−1 and yi−2 and conjunction of d and yi−1 • Embedding features (LR-MVL, C&W;, HLBL, Brown etc.) in a window of 2 around the current word (if applicable). Following [17] we use regularized averaged perceptron model with above set of baseline features for the NER task. We also used their BILOU text chunk representation and fast greedy inference as it was shown to give superior performance. 2 More details about the data and competition are available at http://www.cnts.ua.ac.be/ conll2003/ner/ and http://www.cnts.ua.ac.be/conll2000/chunking/ 6 We also augment the above set of baseline features with gazetteers, as is standard practice in NER experiments. We tuned our free parameter namely the size of LR-MVL embedding on the development and scaled our embedding features to have a 2 norm of 1 for each token and further multiplied them by a normalization constant (also chosen by cross validation), so that when they are used in conjunction with other categorical features in a linear classifier, they do not exert extra influence. The size of LR-MVL embeddings (state-space) that gave the best performance on the development set was k = 50 (50 each for Xl , Xw , Xr in Algorithm 2) i.e. the total size of embeddings was 50×3, and the best normalization constant was 0.5. We omit validation plots due to paucity of space. 4.1.2 Chunking For our chunking experiments we use a similar base set of features as above: • Current Word wi and word tokens in window of 2 around the current word i.e. d = (wi−2 , wi−1 , wi , wi+1 , wi+2 ); • POS tags ti in a window of 2 around the current word. • Word conjunction features wi ∩ wi+1 , i ∈ {−1, 0} and Tag conjunction features ti ∩ ti+1 , i ∈ {−2, −1, 0, 1} and ti ∩ ti+1 ∩ ti+2 , i ∈ {−2, −1, 0}. • Embedding features in a window of 2 around the current word (when applicable). Since CoNLL 00 chunking data does not have a development set, we randomly sampled 1000 sentences from the training data (8936 sentences) for development. So, we trained our chunking models on 7936 training sentences and evaluated their F1 score on the 1000 development sentences and used a CRF 3 as the supervised classifier. We tuned the size of embedding and the magnitude of 2 regularization penalty in CRF on the development set and took log (or -log of the magnitude) of the value of the features4 . The regularization penalty that gave best performance on development set was 2 and here again the best size of LR-MVL embeddings (state-space) was k = 50. Finally, we trained the CRF on the entire (“original”) training data i.e. 8936 sentences. 4.1.3 Unlabeled Data and Induction of embeddings For inducing the embeddings we used the RCV1 corpus containing Reuters newswire from Aug ’96 to Aug ’97 and containing about 63 million tokens in 3.3 million sentences5 . Case was left intact and we did not do the “cleaning” as done by [18, 16] i.e. remove all sentences which are less than 90% lowercase a-z, as our multi-view learning approach is robust to such noisy data, like news byline text (mostly all caps) which does not correlate strongly with the text of the article. We induced our LR-MVL embeddings over a period of 3 days (70 core hours on 3.0 GHz CPU) on the entire RCV1 data by performing 4 iterations, a vocabulary size of 300k and using a variety of smoothing rates (α in Algorithm 1) to capture correlations between shorter and longer contexts α = [0.005, 0.01, 0.05, 0.1, 0.5, 0.9]; theoretically we could tune the smoothing parameters on the development set but we found this mixture of long and short term dependencies to work well in practice. As far as the other embeddings are concerned i.e. C&W;, HLBL and Brown Clusters, we downloaded them from http://metaoptimize.com/projects/wordreprs. The details about their induction and parameter tuning can be found in [16]; we report their best numbers here. It is also worth noting that the unsupervised training of LR-MVL was (> 1.5 times)6 faster than other embeddings. 4.2 Results The results for NER and Chunking are shown in Tables 1 and 2, respectively, which show that LR-MVL performs significantly better than state-of-the-art competing methods on both NER and Chunking tasks. 3 http://www.chokkan.org/software/crfsuite/ Our embeddings are learnt using a linear model whereas CRF is a log-linear model, so to keep things on same scale we did this normalization. 5 We chose this particular dataset to make a fair comparison with [1, 16], who report results using RCV1 as unlabeled data. 6 As some of these embeddings were trained on GPGPU which makes our method even faster comparatively. 4 7 Embedding/Model Baseline C&W;, 200-dim HLBL, 100-dim Brown 1000 clusters Ando & Zhang ’05 Suzuki & Isozaki ’08 LR-MVL (CO) 50 × 3-dim LR-MVL 50 × 3-dim HLBL, 100-dim C&W;, 200-dim Brown, 1000 clusters LR-MVL (CO) 50 × 3-dim LR-MVL 50 × 3-dim No Gazetteers With Gazetteers F1-Score Dev. Set Test Set 90.03 84.39 92.46 87.46 92.00 88.13 92.32 88.52 93.15 89.31 93.66 89.36 93.11 89.55 93.61 89.91 92.91 89.35 92.98 88.88 93.25 89.41 93.91 89.89 94.41 90.06 Table 1: NER Results. Note: 1). LR-MVL (CO) are Context Oblivious embeddings which are gotten from (A) in Algorithm 1. 2). F1-score= Harmonic Mean of Precision and Recall. 3). The current state-of-the-art for this NER task is 90.90 (Test Set) but using 700 billion tokens of unlabeled data [19]. Embedding/Model Baseline HLBL, 50-dim C&W;, 50-dim Brown 3200 Clusters Ando & Zhang ’05 Suzuki & Isozaki ’08 LR-MVL (CO) 50 × 3-dim LR-MVL 50 × 3-dim Test Set F1-Score 93.79 94.00 94.10 94.11 94.39 94.67 95.02 95.44 Table 2: Chunking Results. It is important to note that in problems like NER, the final accuracy depends on performance on rare-words and since LR-MVL is robustly able to correlate past with future views, it is able to learn better representations for rare words resulting in overall better accuracy. On rare-words (occurring < 10 times in corpus), we got 11.7%, 10.7% and 9.6% relative reduction in error over C&W;, HLBL and Brown respectively for NER; on chunking the corresponding numbers were 6.7%, 7.1% and 8.7%. Also, it is worth mentioning that modeling the context in embeddings gives decent improvements in accuracies on both NER and Chunking problems. For the case of NER, the polysemous words were mostly like Chicago, Wales, Oakland etc., which could either be a location or organization (Sports teams, Banks etc.), so when we don’t use the gazetteer features, (which are known lists of cities, persons, organizations etc.) we got higher increase in F-score by modeling context, compared to the case when we already had gazetteer features which captured most of the information about polysemous words for NER dataset and modeling the context didn’t help as much. The polysemous words for Chunking dataset were like spot (VP/NP), never (VP/ADVP), more (NP/VP/ADVP/ADJP) etc. and in this case embeddings with context helped significantly, giving 3.1 − 6.5% relative improvement in accuracy over context oblivious embeddings. 5 Summary and Conclusion In this paper, we presented a novel CCA-based multi-view learning method, LR-MVL, for large scale sequence learning problems such as arise in NLP. LR-MVL is a spectral method that works in low dimensional state-space so it is computationally efficient, and can be used to train using large amounts of unlabeled data; moreover it does not get stuck in local optima like an EM trained HMM. The embeddings learnt using LR-MVL can be used as features with any supervised learner. LR-MVL has strong theoretical grounding; is much simpler and faster than competing methods and achieves state-of-the-art accuracies on NER and Chunking problems. Acknowledgements: The authors would like to thank Alexander Yates, Ted Sandler and the three anonymous reviews for providing valuable feedback. We would also like to thank Lev Ratinov and Joseph Turian for answering our questions regarding their paper [16]. 8 References [1] Ando, R., Zhang, T.: A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research 6 (2005) 1817–1853 [2] Suzuki, J., Isozaki, H.: Semi-supervised sequential labeling and segmentation using giga-word scale unlabeled data. In: In ACL. (2008) [3] Brown, P., deSouza, P., Mercer, R., Pietra, V.D., Lai, J.: Class-based n-gram models of natural language. Comput. Linguist. 18 (December 1992) 467–479 [4] Pereira, F., Tishby, N., Lee, L.: Distributional clustering of English words. In: 31st Annual Meeting of the ACL. (1993) 183–190 [5] Huang, F., Yates, A.: Distributional representations for handling sparsity in supervised sequence-labeling. ACL ’09, Stroudsburg, PA, USA, Association for Computational Linguistics (2009) 495–503 [6] Collobert, R., Weston, J.: A unified architecture for natural language processing: deep neural networks with multitask learning. ICML ’08, New York, NY, USA, ACM (2008) 160–167 [7] Mnih, A., Hinton, G.: Three new graphical models for statistical language modelling. ICML ’07, New York, NY, USA, ACM (2007) 641–648 [8] Dumais, S., Furnas, G., Landauer, T., Deerwester, S., Harshman, R.: Using latent semantic analysis to improve access to textual information. In: SIGCHI Conference on human factors in computing systems, ACM (1988) 281–285 [9] Hsu, D., Kakade, S., Zhang, T.: A spectral algorithm for learning hidden markov models. In: COLT. (2009) [10] Siddiqi, S., Boots, B., Gordon, G.J.: Reduced-rank hidden Markov models. In: AISTATS2010. (2010) [11] Song, L., Boots, B., Siddiqi, S.M., Gordon, G.J., Smola, A.J.: Hilbert space embeddings of hidden Markov models. In: ICML. (2010) [12] Hotelling, H.: Canonical correlation analysis (cca). Journal of Educational Psychology (1935) [13] Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: COLT’ 98. (1998) 92–100 [14] Halko, N., Martinsson, P.G., Tropp, J.: Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. (Dec 2010) [15] Zhang, T., Johnson, D.: A robust risk minimization based named entity recognition system. CONLL ’03 (2003) 204–207 [16] Turian, J., Ratinov, L., Bengio, Y.: Word representations: a simple and general method for semi-supervised learning. ACL ’10, Stroudsburg, PA, USA, Association for Computational Linguistics (2010) 384–394 [17] Ratinov, L., Roth, D.: Design challenges and misconceptions in named entity recognition. In: CONLL. (2009) 147–155 [18] Liang, P.: Semi-supervised learning for natural language. Master’s thesis, Massachusetts Institute of Technology (2005) [19] Lin, D., Wu, X.: Phrase clustering for discriminative learning. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - Volume 2. ACL ’09, Stroudsburg, PA, USA, Association for Computational Linguistics (2009) 1030–1038 9
3 0.092950821 263 nips-2011-Sparse Manifold Clustering and Embedding
Author: Ehsan Elhamifar, René Vidal
Abstract: We propose an algorithm called Sparse Manifold Clustering and Embedding (SMCE) for simultaneous clustering and dimensionality reduction of data lying in multiple nonlinear manifolds. Similar to most dimensionality reduction methods, SMCE finds a small neighborhood around each data point and connects each point to its neighbors with appropriate weights. The key difference is that SMCE finds both the neighbors and the weights automatically. This is done by solving a sparse optimization problem, which encourages selecting nearby points that lie in the same manifold and approximately span a low-dimensional affine subspace. The optimal solution encodes information that can be used for clustering and dimensionality reduction using spectral clustering and embedding. Moreover, the size of the optimal neighborhood of a data point, which can be different for different points, provides an estimate of the dimension of the manifold to which the point belongs. Experiments demonstrate that our method can effectively handle multiple manifolds that are very close to each other, manifolds with non-uniform sampling and holes, as well as estimate the intrinsic dimensions of the manifolds. 1 1.1
4 0.073685631 54 nips-2011-Co-regularized Multi-view Spectral Clustering
Author: Abhishek Kumar, Piyush Rai, Hal Daume
Abstract: In many clustering problems, we have access to multiple views of the data each of which could be individually used for clustering. Exploiting information from multiple views, one can hope to find a clustering that is more accurate than the ones obtained using the individual views. Often these different views admit same underlying clustering of the data, so we can approach this problem by looking for clusterings that are consistent across the views, i.e., corresponding data points in each view should have same cluster membership. We propose a spectral clustering framework that achieves this goal by co-regularizing the clustering hypotheses, and propose two co-regularization schemes to accomplish this. Experimental comparisons with a number of baselines on two synthetic and three real-world datasets establish the efficacy of our proposed approaches.
5 0.071160771 179 nips-2011-Multilinear Subspace Regression: An Orthogonal Tensor Decomposition Approach
Author: Qibin Zhao, Cesar F. Caiafa, Danilo P. Mandic, Liqing Zhang, Tonio Ball, Andreas Schulze-bonhage, Andrzej S. Cichocki
Abstract: A multilinear subspace regression model based on so called latent variable decomposition is introduced. Unlike standard regression methods which typically employ matrix (2D) data representations followed by vector subspace transformations, the proposed approach uses tensor subspace transformations to model common latent variables across both the independent and dependent data. The proposed approach aims to maximize the correlation between the so derived latent variables and is shown to be suitable for the prediction of multidimensional dependent data from multidimensional independent data, where for the estimation of the latent variables we introduce an algorithm based on Multilinear Singular Value Decomposition (MSVD) on a specially defined cross-covariance tensor. It is next shown that in this way we are also able to unify the existing Partial Least Squares (PLS) and N-way PLS regression algorithms within the same framework. Simulations on benchmark synthetic data confirm the advantages of the proposed approach, in terms of its predictive ability and robustness, especially for small sample sizes. The potential of the proposed technique is further illustrated on a real world task of the decoding of human intracranial electrocorticogram (ECoG) from a simultaneously recorded scalp electroencephalograph (EEG). 1
7 0.066280782 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
8 0.063924357 248 nips-2011-Semi-supervised Regression via Parallel Field Regularization
9 0.061555188 150 nips-2011-Learning a Distance Metric from a Network
10 0.056200504 39 nips-2011-Approximating Semidefinite Programs in Sublinear Time
11 0.051762193 71 nips-2011-Directed Graph Embedding: an Algorithm based on Continuous Limits of Laplacian-type Operators
12 0.050496139 142 nips-2011-Large-Scale Sparse Principal Component Analysis with Application to Text Data
13 0.049458504 68 nips-2011-Demixed Principal Component Analysis
14 0.049433831 157 nips-2011-Learning to Search Efficiently in High Dimensions
15 0.048031773 151 nips-2011-Learning a Tree of Metrics with Disjoint Visual Features
16 0.047471628 119 nips-2011-Higher-Order Correlation Clustering for Image Segmentation
17 0.046399947 211 nips-2011-Penalty Decomposition Methods for Rank Minimization
18 0.045436602 171 nips-2011-Metric Learning with Multiple Kernels
19 0.045378238 168 nips-2011-Maximum Margin Multi-Instance Learning
20 0.044929493 301 nips-2011-Variational Gaussian Process Dynamical Systems
topicId topicWeight
[(0, 0.143), (1, 0.052), (2, -0.024), (3, -0.026), (4, -0.029), (5, 0.001), (6, 0.012), (7, 0.001), (8, 0.076), (9, -0.043), (10, 0.018), (11, 0.099), (12, 0.034), (13, -0.04), (14, 0.027), (15, -0.024), (16, -0.117), (17, 0.095), (18, -0.039), (19, -0.064), (20, -0.076), (21, 0.041), (22, 0.041), (23, 0.069), (24, 0.073), (25, 0.018), (26, -0.018), (27, -0.072), (28, -0.029), (29, -0.016), (30, -0.059), (31, 0.005), (32, -0.019), (33, -0.052), (34, -0.032), (35, -0.02), (36, 0.013), (37, -0.025), (38, -0.042), (39, 0.11), (40, 0.062), (41, 0.015), (42, 0.031), (43, 0.026), (44, -0.144), (45, 0.119), (46, 0.131), (47, 0.038), (48, -0.121), (49, -0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.90798837 167 nips-2011-Maximum Covariance Unfolding : Manifold Learning for Bimodal Data
Author: Vijay Mahadevan, Chi W. Wong, Jose C. Pereira, Tom Liu, Nuno Vasconcelos, Lawrence K. Saul
Abstract: We propose maximum covariance unfolding (MCU), a manifold learning algorithm for simultaneous dimensionality reduction of data from different input modalities. Given high dimensional inputs from two different but naturally aligned sources, MCU computes a common low dimensional embedding that maximizes the cross-modal (inter-source) correlations while preserving the local (intra-source) distances. In this paper, we explore two applications of MCU. First we use MCU to analyze EEG-fMRI data, where an important goal is to visualize the fMRI voxels that are most strongly correlated with changes in EEG traces. To perform this visualization, we augment MCU with an additional step for metric learning in the high dimensional voxel space. Second, we use MCU to perform cross-modal retrieval of matched image and text samples from Wikipedia. To manage large applications of MCU, we develop a fast implementation based on ideas from spectral graph theory. These ideas transform the original problem for MCU, one of semidefinite programming, into a simpler problem in semidefinite quadratic linear programming. 1
2 0.75998569 176 nips-2011-Multi-View Learning of Word Embeddings via CCA
Author: Paramveer Dhillon, Dean P. Foster, Lyle H. Ungar
Abstract: Recently, there has been substantial interest in using large amounts of unlabeled data to learn word representations which can then be used as features in supervised classifiers for NLP tasks. However, most current approaches are slow to train, do not model the context of the word, and lack theoretical grounding. In this paper, we present a new learning method, Low Rank Multi-View Learning (LR-MVL) which uses a fast spectral method to estimate low dimensional context-specific word representations from unlabeled data. These representation features can then be used with any supervised learner. LR-MVL is extremely fast, gives guaranteed convergence to a global optimum, is theoretically elegant, and achieves state-ofthe-art performance on named entity recognition (NER) and chunking problems. 1 Introduction and Related Work Over the past decade there has been increased interest in using unlabeled data to supplement the labeled data in semi-supervised learning settings to overcome the inherent data sparsity and get improved generalization accuracies in high dimensional domains like NLP. Approaches like [1, 2] have been empirically very successful and have achieved excellent accuracies on a variety of NLP tasks. However, it is often difficult to adapt these approaches to use in conjunction with an existing supervised NLP system as these approaches enforce a particular choice of model. An increasingly popular alternative is to learn representational embeddings for words from a large collection of unlabeled data (typically using a generative model), and to use these embeddings to augment the feature set of a supervised learner. Embedding methods produce features in low dimensional spaces or over a small vocabulary size, unlike the traditional approach of working in the original high dimensional vocabulary space with only one dimension “on” at a given time. Broadly, these embedding methods fall into two categories: 1. Clustering based word representations: Clustering methods, often hierarchical, are used to group distributionally similar words based on their contexts. The two dominant approaches are Brown Clustering [3] and [4]. As recently shown, HMMs can also be used to induce a multinomial distribution over possible clusters [5]. 2. Dense representations: These representations are dense, low dimensional and real-valued. Each dimension of these representations captures latent information about a combination of syntactic and semantic word properties. They can either be induced using neural networks like C&W; embeddings [6] and Hierarchical log-linear (HLBL) embeddings [7] or by eigen-decomposition of the word co-occurrence matrix, e.g. Latent Semantic Analysis/Latent Semantic Indexing (LSA/LSI) [8]. Unfortunately, most of these representations are 1). slow to train, 2). sensitive to the scaling of the embeddings (especially 2 based approaches like LSA/PCA), 3). can get stuck in local optima (like EM trained HMM) and 4). learn a single embedding for a given word type; i.e. all the occurrences 1 of the word “bank” will have the same embedding, irrespective of whether the context of the word suggests it means “a financial institution” or “a river bank”. In this paper, we propose a novel context-specific word embedding method called Low Rank MultiView Learning, LR-MVL, which is fast to train and is guaranteed to converge to the optimal solution. As presented here, our LR-MVL embeddings are context-specific, but context oblivious embeddings (like the ones used by [6, 7]) can be trivially gotten from our model. Furthermore, building on recent advances in spectral learning for sequence models like HMMs [9, 10, 11] we show that LR-MVL has strong theoretical grounding. Particularly, we show that LR-MVL estimates low dimensional context-specific word embeddings which preserve all the information in the data if the data were generated by an HMM. Moreover, LR-MVL being linear does not face the danger of getting stuck in local optima as is the case for an EM trained HMM. LR-MVL falls into category (2) mentioned above; it learns real-valued context-specific word embeddings by performing Canonical Correlation Analysis (CCA) [12] between the past and future views of low rank approximations of the data. However, LR-MVL is more general than those methods, which work on bigram or trigram co-occurrence matrices, in that it uses longer word sequence information to estimate context-specific embeddings and also for the reasons mentioned in the last paragraph. The remainder of the paper is organized as follows. In the next section we give a brief overview of CCA, which forms the core of our method. Section 3 describes our proposed LR-MVL algorithm in detail and gives theory supporting its performance. Section 4 demonstrates the effectiveness of LR-MVL on the NLP tasks of Named Entity Recognition and Chunking. We conclude with a brief summary in Section 5. 2 Brief Review: Canonical Correlation Analysis (CCA) CCA [12] is the analog to Principal Component Analysis (PCA) for pairs of matrices. PCA computes the directions of maximum covariance between elements in a single matrix, whereas CCA computes the directions of maximal correlation between a pair of matrices. Unlike PCA, CCA does not depend on how the observations are scaled. This invariance of CCA to linear data transformations allows proofs that keeping the dominant singular vectors (those with largest singular values) will faithfully capture any state information. More specifically, given a set of n paired observation vectors {(l1 , r1 ), ..., (ln , rn )}–in our case the two matrices are the left (L) and right (R) context matrices of a word–we would like to simultaneously find the directions Φl and Φr that maximize the correlation of the projections of L onto Φl with the projections of R onto Φr . This is expressed as max Φl ,Φr E[ L, Φl R, Φr ] E[ L, Φl 2 ]E[ R, Φr 2 ] (1) where E denotes the empirical expectation. We use the notation Clr (Cll ) to denote the cross (auto) covariance matrices between L and R (i.e. L’R and L’L respectively.). The left and right canonical correlates are the solutions Φl , Φr of the following equations: Cll −1 Clr Crr −1 Crl Φl = λΦl Crr −1 Crl Cll −1 Clr Φr = λΦr 3 (2) Low Rank Multi-View Learning (LR-MVL) In LR-MVL, we compute the CCA between the past and future views of the data on a large unlabeled corpus to find the common latent structure, i.e., the hidden state associated with each token. These induced representations of the tokens can then be used as features in a supervised classifier (typically discriminative). The context around a word, consisting of the h words to the right and left of it, sits in a high dimensional space, since for a vocabulary of size v, each of the h words in the context requires an indicator function of dimension v. The key move in LR-MVL is to project the v-dimensional word 2 space down to a k dimensional state space. Thus, all eigenvector computations are done in a space that is v/k times smaller than the original space. Since a typical vocabulary contains at least 50, 000 words, and we use state spaces of order k ≈ 50 dimensions, this gives a 1,000-fold reduction in the size of calculations that are needed. The core of our LR-MVL algorithm is a fast spectral method for learning a v × k matrix A which maps each of the v words in the vocabulary to a k-dimensional state vector. We call this matrix the “eigenfeature dictionary”. We now describe the LR-MVL method, give a theorem that provides intuition into how it works, and formally present the LR-MVL algorithm. The Experiments section then shows that this low rank approximation allows us to achieve state-of-the-art performance on NLP tasks. 3.1 The LR-MVL method Given an unlabeled token sequence w={w0 , w1 , . . ., wn } we want to learn a low (k)- dimensional state vector {z0 , z1 , . . . , zn } for each observed token. The key is to find a v ×k matrix A (Algorithm 1) that maps each of the v words in the vocabulary to a reduced rank k-dimensional state vector, which is later used to induce context specific embeddings for the tokens (Algorithm 2). For supervised learning, these context specific embeddings are supplemented with other information about each token wt , such as its identity, orthographic features such as prefixes and suffixes or membership in domain-specific lexicons, and used as features in a classifier. Section 3.4 gives the algorithm more formally, but the key steps in the algorithm are, in general terms: • Take the h words to the left and to the right of each target word wt (the “Left” and “Right” contexts), and project them each down to k dimensions using A. • Take the CCA between the reduced rank left and right contexts, and use the resulting model to estimate a k dimensional state vector (the “hidden state”) for each token. • Take the CCA between the hidden states and the tokens wt . The singular vectors associated with wt form a new estimate of the eigenfeature dictionary. LR-MVL can be viewed as a type of co-training [13]: The state of each token wt is similar to that of the tokens both before and after it, and it is also similar to the states of the other occurrences of the same word elsewhere in the document (used in the outer iteration). LR-MVL takes advantage of these two different types of similarity by alternately estimating word state using CCA on the smooths of the states of the words before and after each target token and using the average over the states associated with all other occurrences of that word. 3.2 Theoretical Properties of LR-MVL We now present the theory behind the LR-MVL algorithm; particularly we show that the reduced rank matrix A allows a significant data reduction while preserving the information in our data and the estimated state does the best possible job of capturing any label information that can be inferred by a linear model. Let L be an n × hv matrix giving the words in the left context of each of the n tokens, where the context is of length h, R be the corresponding n × hv matrix for the right context, and W be an n × v matrix of indicator functions for the words themselves. We will use the following assumptions at various points in our proof: Assumption 1. L, W, and R come from a rank k HMM i.e. it has a rank k observation matrix and rank k transition matrix both of which have the same domain. For example, if the dimension of the hidden state is k and the vocabulary size is v then the observation matrix, which is k × v, has rank k. This rank condition is similar to the one used by [10]. Assumption 1A. For the three views, L, W and R assume that there exists a “hidden state H” of dimension n × k, where each row Hi has the same non-singular variance-covariance matrix and 3 such that E(Li |Hi ) = Hi β T and E(Ri |Hi ) = Hi β T and E(Wi |Hi ) = Hi β T where all β’s are of L R W rank k, where Li , Ri and Wi are the rows of L, R and W respectively. Assumption 1A follows from Assumption 1. Assumption 2. ρ(L, W), ρ(L, R) and ρ(W, R) all have rank k, where ρ(X1 , X2 ) is the expected correlation between X1 and X2 . Assumption 2 is a rank condition similar to that in [9]. Assumption 3. ρ([L, R], W) has k distinct singular values. Assumption 3 just makes the proof a little cleaner, since if there are repeated singular values, then the singular vectors are not unique. Without it, we would have to phrase results in terms of subspaces with identical singular values. We also need to define the CCA function that computes the left and right singular vectors for a pair of matrices: Definition 1 (CCA). Compute the CCA between two matrices X1 and X2 . Let ΦX1 be a matrix containing the d largest singular vectors for X1 (sorted from the largest on down). Likewise for ΦX2 . Define the function CCAd (X1 , X2 ) = [ΦX1 , ΦX2 ]. When we want just one of these Φ’s, we will use CCAd (X1 , X2 )left = ΦX1 for the left singular vectors and CCAd (X1 , X2 )right = ΦX2 for the right singular vectors. Note that the resulting singular vectors, [ΦX1 , ΦX2 ] can be used to give two redundant estimates, X1 ΦX1 and X2 ΦX2 of the “hidden” state relating X1 and X2 , if such a hidden state exists. Definition 2. Define the symbol “≈” to mean X1 ≈ X2 ⇐⇒ lim X1 = lim X2 n→∞ n→∞ where n is the sample size. Lemma 1. Define A by the following limit of the right singular vectors: CCAk ([L, R], W)right ≈ A. Under assumptions 2, 3 and 1A, such that if CCAk (L, R) ≡ [ΦL , ΦR ] then CCAk ([LΦL , RΦR ], W)right ≈ A. Lemma 1 shows that instead of finding the CCA between the full context and the words, we can take the CCA between the Left and Right contexts, estimate a k dimensional state from them, and take the CCA of that state with the words and get the same result. See the supplementary material for the Proof. ˜ Let Ah denote a matrix formed by stacking h copies of A on top of each other. Right multiplying ˜ L or R by Ah projects each of the words in that context into the k-dimensional reduced rank space. The following theorem addresses the core of the LR-MVL algorithm, showing that there is an A which gives the desired dimensionality reduction. Specifically, it shows that the previous lemma also holds in the reduced rank space. Theorem 1. Under assumptions 1, 2 and 3 there exists a unique matrix A such that if ˜ ˜ ˜ ˜ CCAk (LAh , RAh ) ≡ [ΦL , ΦR ] then ˜ ˜ ˜ ˜ CCAk ([LAh ΦL , RAh ΦR ], W)right ≈ A ˜ where Ah is the stacked form of A. See the supplementary material for the Proof 1 . ˆ It is worth noting that our matrix A corresponds to the matrix U used by [9, 10]. They showed that U is sufficient to compute the probability of a sequence of words generated by an HMM; although we do not show ˆ it here (due to limited space), our A provides a more statistically efficient estimate of U than their U , and hence can also be used to estimate the sequence probabilities. 1 4 Under the above assumptions, there is asymptotically (in the limit of infinite data) no benefit to first estimating state by finding the CCA between the left and right contexts and then finding the CCA between the estimated state and the words. One could instead just directly find the CCA between the combined left and rights contexts and the words. However, because of the Zipfian distribution of words, many words are rare or even unique, and hence one is not in the asymptotic limit. In this case, CCA between the rare words and context will not be informative, whereas finding the CCA between the left and right contexts gives a good state vector estimate even for unique words. One can then fruitfully find the CCA between the contexts and the estimated state vector for their associated words. 3.3 Using Exponential Smooths In practice, we replace the projected left and right contexts with exponential smooths (weighted average of the previous (or next) token’s state i.e. Zt−1 (or Zt+1 ) and previous (or next) token’s smoothed state i.e. St−1 (or St+1 ).), of them at a few different time scales, thus giving a further dimension reduction by a factor of context length h (say 100 words) divided by the number of smooths (often 5-7). We use a mixture of both very short and very long contexts which capture short and long range dependencies as required by NLP problems as NER, Chunking, WSD etc. Since exponential smooths are linear, we preserve the linearity of our method. 3.4 The LR-MVL Algorithm The LR-MVL algorithm (using exponential smooths) is given in Algorithm 1; it computes the pair of CCAs described above in Theorem 1. Algorithm 1 LR-MVL Algorithm - Learning from Large amounts of Unlabeled Data 1: Input: Token sequence Wn×v , state space size k, smoothing rates αj 2: Initialize the eigenfeature dictionary A to random values N (0, 1). 3: repeat 4: Set the state Zt (1 < t ≤ n) of each token wt to the eigenfeature vector of the corresponding word. Zt = (Aw : w = wt ) 5: Smooth the state estimates before and after each token to get a pair of views for each smoothing rate αj . (l,j) (l,j) = (1 − αj )St−1 + αj Zt−1 // left view L St (r,j) (r,j) j St = (1 − α )St+1 + αj Zt+1 // right view R. (l,j) (r,j) th where the t rows of L and R are, respectively, concatenations of the smooths St and St for (j) each of the α s. 6: Find the left and right canonical correlates, which are the eigenvectors Φl and Φr of (L L)−1 L R(R R)−1 R LΦl = λΦl . (R R)−1 R L(L L)−1 L RΦr = λΦr . 7: Project the left and right views on to the space spanned by the top k/2 left and right CCAs respectively (k/2) (k/2) Xl = LΦl and Xr = RΦr (k/2) (k/2) where Φl , Φr are matrices composed of the singular vectors of Φl , Φr with the k/2 largest magnitude singular values. Estimate the state for each word wt as the union of the left and right estimates: Z = [Xl , Xr ] 8: Estimate the eigenfeatures of each word type, w, as the average of the states estimated for that word. Aw = avg(Zt : wt = w) 9: Compute the change in A from the previous iteration 10: until |∆A| < 11: Output: Φk , Φk , A . r l A few iterations (∼ 5) of the above algorithm are sufficient to converge to the solution. (Since the problem is convex, there is a single solution, so there is no issue of local minima.) As [14] show for PCA, one can start with a random matrix that is only slightly larger than the true rank k of the correlation matrix, and with extremely high likelihood converge in a few iterations to within a small distance of the true principal components. In our case, if the assumptions detailed above (1, 1A, 2 and 3) are satisfied, our method converges equally rapidly to the true canonical variates. As mentioned earlier, we get further dimensionality reduction in Step 5, by replacing the Left and Right context matrices with a set of exponentially smoothed values of the reduced rank projections of the context words. Step 6 finds the CCA between the Left and Right contexts. Step 7 estimates 5 the state by combining the estimates from the left and right contexts, since we don’t know which will best estimate the state. Step 8 takes the CCA between the estimated state Z and the matrix of words W. Because W is a vector of indicator functions, this CCA takes the trivial form of a set of averages. Once we have estimated the CCA model, it is used to generate context specific embeddings for the tokens from training, development and test sets (as described in Algorithm 2). These embeddings are further supplemented with other baseline features and used in a supervised learner to predict the label of the token. Algorithm 2 LR-MVL Algorithm -Inducing Context Specific Embeddings for Train/Dev/Test Data 1: Input: Model (Φk , Φk , A) output from above algorithm and Token sequences Wtrain , (Wdev , Wtest ) r l 2: Project the left and right views L and R after smoothing onto the space spanned by the top k left and right CCAs respectively Xl = LΦk and Xr = RΦk r l and the words onto the eigenfeature dictionary Xw = W train A 3: Form the final embedding matrix Xtrain:embed by concatenating these three estimates of state Xtrain:embed = [Xl , Xw , Xr ] 4: Output: The embedding matrices Xtrain:embed , (Xdev:embed , Xtest:embed ) with context-specific representations for the tokens. These embeddings are augmented with baseline set of features mentioned in Sections 4.1.1 and 4.1.2 before learning the final classifier. Note that we can get context “oblivious” embeddings i.e. one embedding per word type, just by using the eigenfeature dictionary (Av×k ) output by Algorithm 1. 4 Experimental Results In this section we present the experimental results of LR-MVL on Named Entity Recognition (NER) and Syntactic Chunking tasks. We compare LR-MVL to state-of-the-art semi-supervised approaches like [1] (Alternating Structures Optimization (ASO)) and [2] (Semi-supervised extension of CRFs) as well as embeddings like C&W;, HLBL and Brown Clustering. 4.1 Datasets and Experimental Setup For the NER experiments we used the data from CoNLL 2003 shared task and for Chunking experiments we used the CoNLL 2000 shared task data2 with standard training, development and testing set splits. The CoNLL ’03 and the CoNLL ’00 datasets had ∼ 204K/51K/46K and ∼ 212K/ − /47K tokens respectively for Train/Dev./Test sets. 4.1.1 Named Entity Recognition (NER) We use the same set of baseline features as used by [15, 16] in their experiments. The detailed list of features is as below: • Current Word wi ; Its type information: all-capitalized, is-capitalized, all-digits and so on; Prefixes and suffixes of wi • Word tokens in window of 2 around the current word i.e. (wi−2 , wi−1 , wi , wi+1 , wi+2 ); and capitalization pattern in the window. d = • Previous two predictions yi−1 and yi−2 and conjunction of d and yi−1 • Embedding features (LR-MVL, C&W;, HLBL, Brown etc.) in a window of 2 around the current word (if applicable). Following [17] we use regularized averaged perceptron model with above set of baseline features for the NER task. We also used their BILOU text chunk representation and fast greedy inference as it was shown to give superior performance. 2 More details about the data and competition are available at http://www.cnts.ua.ac.be/ conll2003/ner/ and http://www.cnts.ua.ac.be/conll2000/chunking/ 6 We also augment the above set of baseline features with gazetteers, as is standard practice in NER experiments. We tuned our free parameter namely the size of LR-MVL embedding on the development and scaled our embedding features to have a 2 norm of 1 for each token and further multiplied them by a normalization constant (also chosen by cross validation), so that when they are used in conjunction with other categorical features in a linear classifier, they do not exert extra influence. The size of LR-MVL embeddings (state-space) that gave the best performance on the development set was k = 50 (50 each for Xl , Xw , Xr in Algorithm 2) i.e. the total size of embeddings was 50×3, and the best normalization constant was 0.5. We omit validation plots due to paucity of space. 4.1.2 Chunking For our chunking experiments we use a similar base set of features as above: • Current Word wi and word tokens in window of 2 around the current word i.e. d = (wi−2 , wi−1 , wi , wi+1 , wi+2 ); • POS tags ti in a window of 2 around the current word. • Word conjunction features wi ∩ wi+1 , i ∈ {−1, 0} and Tag conjunction features ti ∩ ti+1 , i ∈ {−2, −1, 0, 1} and ti ∩ ti+1 ∩ ti+2 , i ∈ {−2, −1, 0}. • Embedding features in a window of 2 around the current word (when applicable). Since CoNLL 00 chunking data does not have a development set, we randomly sampled 1000 sentences from the training data (8936 sentences) for development. So, we trained our chunking models on 7936 training sentences and evaluated their F1 score on the 1000 development sentences and used a CRF 3 as the supervised classifier. We tuned the size of embedding and the magnitude of 2 regularization penalty in CRF on the development set and took log (or -log of the magnitude) of the value of the features4 . The regularization penalty that gave best performance on development set was 2 and here again the best size of LR-MVL embeddings (state-space) was k = 50. Finally, we trained the CRF on the entire (“original”) training data i.e. 8936 sentences. 4.1.3 Unlabeled Data and Induction of embeddings For inducing the embeddings we used the RCV1 corpus containing Reuters newswire from Aug ’96 to Aug ’97 and containing about 63 million tokens in 3.3 million sentences5 . Case was left intact and we did not do the “cleaning” as done by [18, 16] i.e. remove all sentences which are less than 90% lowercase a-z, as our multi-view learning approach is robust to such noisy data, like news byline text (mostly all caps) which does not correlate strongly with the text of the article. We induced our LR-MVL embeddings over a period of 3 days (70 core hours on 3.0 GHz CPU) on the entire RCV1 data by performing 4 iterations, a vocabulary size of 300k and using a variety of smoothing rates (α in Algorithm 1) to capture correlations between shorter and longer contexts α = [0.005, 0.01, 0.05, 0.1, 0.5, 0.9]; theoretically we could tune the smoothing parameters on the development set but we found this mixture of long and short term dependencies to work well in practice. As far as the other embeddings are concerned i.e. C&W;, HLBL and Brown Clusters, we downloaded them from http://metaoptimize.com/projects/wordreprs. The details about their induction and parameter tuning can be found in [16]; we report their best numbers here. It is also worth noting that the unsupervised training of LR-MVL was (> 1.5 times)6 faster than other embeddings. 4.2 Results The results for NER and Chunking are shown in Tables 1 and 2, respectively, which show that LR-MVL performs significantly better than state-of-the-art competing methods on both NER and Chunking tasks. 3 http://www.chokkan.org/software/crfsuite/ Our embeddings are learnt using a linear model whereas CRF is a log-linear model, so to keep things on same scale we did this normalization. 5 We chose this particular dataset to make a fair comparison with [1, 16], who report results using RCV1 as unlabeled data. 6 As some of these embeddings were trained on GPGPU which makes our method even faster comparatively. 4 7 Embedding/Model Baseline C&W;, 200-dim HLBL, 100-dim Brown 1000 clusters Ando & Zhang ’05 Suzuki & Isozaki ’08 LR-MVL (CO) 50 × 3-dim LR-MVL 50 × 3-dim HLBL, 100-dim C&W;, 200-dim Brown, 1000 clusters LR-MVL (CO) 50 × 3-dim LR-MVL 50 × 3-dim No Gazetteers With Gazetteers F1-Score Dev. Set Test Set 90.03 84.39 92.46 87.46 92.00 88.13 92.32 88.52 93.15 89.31 93.66 89.36 93.11 89.55 93.61 89.91 92.91 89.35 92.98 88.88 93.25 89.41 93.91 89.89 94.41 90.06 Table 1: NER Results. Note: 1). LR-MVL (CO) are Context Oblivious embeddings which are gotten from (A) in Algorithm 1. 2). F1-score= Harmonic Mean of Precision and Recall. 3). The current state-of-the-art for this NER task is 90.90 (Test Set) but using 700 billion tokens of unlabeled data [19]. Embedding/Model Baseline HLBL, 50-dim C&W;, 50-dim Brown 3200 Clusters Ando & Zhang ’05 Suzuki & Isozaki ’08 LR-MVL (CO) 50 × 3-dim LR-MVL 50 × 3-dim Test Set F1-Score 93.79 94.00 94.10 94.11 94.39 94.67 95.02 95.44 Table 2: Chunking Results. It is important to note that in problems like NER, the final accuracy depends on performance on rare-words and since LR-MVL is robustly able to correlate past with future views, it is able to learn better representations for rare words resulting in overall better accuracy. On rare-words (occurring < 10 times in corpus), we got 11.7%, 10.7% and 9.6% relative reduction in error over C&W;, HLBL and Brown respectively for NER; on chunking the corresponding numbers were 6.7%, 7.1% and 8.7%. Also, it is worth mentioning that modeling the context in embeddings gives decent improvements in accuracies on both NER and Chunking problems. For the case of NER, the polysemous words were mostly like Chicago, Wales, Oakland etc., which could either be a location or organization (Sports teams, Banks etc.), so when we don’t use the gazetteer features, (which are known lists of cities, persons, organizations etc.) we got higher increase in F-score by modeling context, compared to the case when we already had gazetteer features which captured most of the information about polysemous words for NER dataset and modeling the context didn’t help as much. The polysemous words for Chunking dataset were like spot (VP/NP), never (VP/ADVP), more (NP/VP/ADVP/ADJP) etc. and in this case embeddings with context helped significantly, giving 3.1 − 6.5% relative improvement in accuracy over context oblivious embeddings. 5 Summary and Conclusion In this paper, we presented a novel CCA-based multi-view learning method, LR-MVL, for large scale sequence learning problems such as arise in NLP. LR-MVL is a spectral method that works in low dimensional state-space so it is computationally efficient, and can be used to train using large amounts of unlabeled data; moreover it does not get stuck in local optima like an EM trained HMM. The embeddings learnt using LR-MVL can be used as features with any supervised learner. LR-MVL has strong theoretical grounding; is much simpler and faster than competing methods and achieves state-of-the-art accuracies on NER and Chunking problems. Acknowledgements: The authors would like to thank Alexander Yates, Ted Sandler and the three anonymous reviews for providing valuable feedback. We would also like to thank Lev Ratinov and Joseph Turian for answering our questions regarding their paper [16]. 8 References [1] Ando, R., Zhang, T.: A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research 6 (2005) 1817–1853 [2] Suzuki, J., Isozaki, H.: Semi-supervised sequential labeling and segmentation using giga-word scale unlabeled data. In: In ACL. (2008) [3] Brown, P., deSouza, P., Mercer, R., Pietra, V.D., Lai, J.: Class-based n-gram models of natural language. Comput. Linguist. 18 (December 1992) 467–479 [4] Pereira, F., Tishby, N., Lee, L.: Distributional clustering of English words. In: 31st Annual Meeting of the ACL. (1993) 183–190 [5] Huang, F., Yates, A.: Distributional representations for handling sparsity in supervised sequence-labeling. ACL ’09, Stroudsburg, PA, USA, Association for Computational Linguistics (2009) 495–503 [6] Collobert, R., Weston, J.: A unified architecture for natural language processing: deep neural networks with multitask learning. ICML ’08, New York, NY, USA, ACM (2008) 160–167 [7] Mnih, A., Hinton, G.: Three new graphical models for statistical language modelling. ICML ’07, New York, NY, USA, ACM (2007) 641–648 [8] Dumais, S., Furnas, G., Landauer, T., Deerwester, S., Harshman, R.: Using latent semantic analysis to improve access to textual information. In: SIGCHI Conference on human factors in computing systems, ACM (1988) 281–285 [9] Hsu, D., Kakade, S., Zhang, T.: A spectral algorithm for learning hidden markov models. In: COLT. (2009) [10] Siddiqi, S., Boots, B., Gordon, G.J.: Reduced-rank hidden Markov models. In: AISTATS2010. (2010) [11] Song, L., Boots, B., Siddiqi, S.M., Gordon, G.J., Smola, A.J.: Hilbert space embeddings of hidden Markov models. In: ICML. (2010) [12] Hotelling, H.: Canonical correlation analysis (cca). Journal of Educational Psychology (1935) [13] Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: COLT’ 98. (1998) 92–100 [14] Halko, N., Martinsson, P.G., Tropp, J.: Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. (Dec 2010) [15] Zhang, T., Johnson, D.: A robust risk minimization based named entity recognition system. CONLL ’03 (2003) 204–207 [16] Turian, J., Ratinov, L., Bengio, Y.: Word representations: a simple and general method for semi-supervised learning. ACL ’10, Stroudsburg, PA, USA, Association for Computational Linguistics (2010) 384–394 [17] Ratinov, L., Roth, D.: Design challenges and misconceptions in named entity recognition. In: CONLL. (2009) 147–155 [18] Liang, P.: Semi-supervised learning for natural language. Master’s thesis, Massachusetts Institute of Technology (2005) [19] Lin, D., Wu, X.: Phrase clustering for discriminative learning. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - Volume 2. ACL ’09, Stroudsburg, PA, USA, Association for Computational Linguistics (2009) 1030–1038 9
3 0.56792247 263 nips-2011-Sparse Manifold Clustering and Embedding
Author: Ehsan Elhamifar, René Vidal
Abstract: We propose an algorithm called Sparse Manifold Clustering and Embedding (SMCE) for simultaneous clustering and dimensionality reduction of data lying in multiple nonlinear manifolds. Similar to most dimensionality reduction methods, SMCE finds a small neighborhood around each data point and connects each point to its neighbors with appropriate weights. The key difference is that SMCE finds both the neighbors and the weights automatically. This is done by solving a sparse optimization problem, which encourages selecting nearby points that lie in the same manifold and approximately span a low-dimensional affine subspace. The optimal solution encodes information that can be used for clustering and dimensionality reduction using spectral clustering and embedding. Moreover, the size of the optimal neighborhood of a data point, which can be different for different points, provides an estimate of the dimension of the manifold to which the point belongs. Experiments demonstrate that our method can effectively handle multiple manifolds that are very close to each other, manifolds with non-uniform sampling and holes, as well as estimate the intrinsic dimensions of the manifolds. 1 1.1
Author: Shuai Huang, Jing Li, Jieping Ye, Teresa Wu, Kewei Chen, Adam Fleisher, Eric Reiman
Abstract: Diagnosis of Alzheimer’s disease (AD) at the early stage of the disease development is of great clinical importance. Current clinical assessment that relies primarily on cognitive measures proves low sensitivity and specificity. The fast growing neuroimaging techniques hold great promise. Research so far has focused on single neuroimaging modality. However, as different modalities provide complementary measures for the same disease pathology, fusion of multi-modality data may increase the statistical power in identification of disease-related brain regions. This is especially true for early AD, at which stage the disease-related regions are most likely to be weakeffect regions that are difficult to be detected from a single modality alone. We propose a sparse composite linear discriminant analysis model (SCLDA) for identification of disease-related brain regions of early AD from multi-modality data. SCLDA uses a novel formulation that decomposes each LDA parameter into a product of a common parameter shared by all the modalities and a parameter specific to each modality, which enables joint analysis of all the modalities and borrowing strength from one another. We prove that this formulation is equivalent to a penalized likelihood with non-convex regularization, which can be solved by the DC (difference of convex functions) programming. We show that in using the DC programming, the property of the nonconvex regularization in terms of preserving weak-effect features can be nicely revealed. We perform extensive simulations to show that SCLDA outperforms existing competing algorithms on feature selection, especially on the ability for identifying weak-effect features. We apply SCLDA to the Magnetic Resonance Imaging (MRI) and Positron Emission Tomography (PET) images of 49 AD patients and 67 normal controls (NC). Our study identifies disease-related brain regions consistent with findings in the AD literature. 1 In tro du cti on Alzheimer’s disease (AD) is a fatal, neurodegenerative disorder that currently affects over five million people in the U.S. It leads to substantial, progressive neuron damage that is irreversible, which eventually causes death. Early diagnosis of AD is of great clinical importance, because disease-modifying therapies given to patients at the early stage of their disease development will have a much better effect in slowing down the disease progression and helping preserve some cognitive functions of the brain. However, current clinical assessment that majorly relies on cognitive measures proves low sensitivity and specificity in early diagnosis of AD. This is because these cognitive measures are vulnerable to the confounding effect from some non-AD related factors such as patients’ mood, and presence of other illnesses or major life events [1]. The confounding effect is especially severe in the diagnosis of early AD, at which time cognitive 1 impairment is not yet apparent. On the other hand, fast growing neuroimaging techniques, such as Magnetic Resonance Imaging (MRI) and Positron Emission Tomography (PET), provide great opportunities for improving early diagnosis of AD, due to their ability for overcoming the limitations of conventional cognitive measures. There are two major categories of neuroimaging techniques, i.e., functional and structure neuroimaging. MRI is a typical structural neuroimaging technique, which allows for visualization of brain anatomy. PET is a typical functional neuroimaging technique, which measures the cerebral metabolic rate for glucose. Both techniques have been extensively applied to AD studies. For example, studies based on MRI have consistently revealed brain atrophy that involves the hippocampus and entorhinal cortex [2-6]; studies based on PET have revealed functional abnormality that involves the posterior temporal and parietal association cortices [8-10], posterior cingulate, precuneus, and medial temporal cortices [11-14]. There is overlap between the disease-related brain regions detected by MRI and those by PET, such as regions in the hippocampus area and the mesia temporal lobe [15-17]. This is not surprising since MRI and PET are two complementary measures for the same disease pathology, i.e., it starts mainly in the hippocampus and entorhinal cortex, and subsequently spreads throughout temporal and orbiogrontal cortext, poseterior cingulated, and association cortex [7]. However, most existing studies only exploited structural and functional alterations in separation, which ignore the potential interaction between them. The fusion of MRI and PET imaging modalities will increase the statistical power in identification of disease-related brain regions, especially for early AD, at which stage the disease-related regions are most likely to be weakeffect regions that are difficult to be detected from MRI or PET alone. Once a good set of diseaserelated brain regions is identified, they can be further used to build an effective classifier (i.e., a biomarker from the clinical perspective) to enable AD diagnose with high sensitivity and specificity. The idea of multi-modality data fusion in the research of neurodegenerative disorders has been exploited before. For example, a number of models have been proposed to combine electroencephalography (EEG) and functional MRI (fMRI), including parallel EEG-fMRI independent component analysis [18]-[19], EEG-informed fMRI analysis [18] [20], and variational Bayesian methods [18] [21]. The purpose of these studies is different from ours, i.e., they aim to combine EEG, which has high temporal resolution but low spatial resolution, and fMRI, which has low temporal resolution but high spatial resolution, so as to obtain an accurate picture for the whole brain with both high spatial and high temporal resolutions [18]-[21]. Also, there have been some studies that include both MRI and PET data for classification [15], [22][25]. However, these studies do not make use of the fact that MRI and PET measure the same underlying disease pathology from two complementary perspectives (i.e., structural and functional perspectives), so that the analysis of one imaging modality can borrow strength from the other. In this paper, we focus on the problem of identifying disease-related brain regions from multimodality data. This is actually a variable selection problem. Because MRI and PET data are highdimensional, regularization techniques are needed for effective variable selection, such as the L1regularization technique [25]-[30] and the L2/L1-regularization technique [31]. In particular, L2/L1-regularization has been used for variable selection jointly on multiple related datasets, also known as multitask feature selection [31], which has a similar nature to our problem. Note that both L1- and L2/L1-regularizations are convex regularizations, which have gained them popularity in the literature. On the other hand, there is increasing evidence that these convex regularizations tend to produce too severely shrunken parameter estimates. Therefore, these convex regularizations could lead to miss-identification of the weak-effect disease-related brain regions, which unfortunately make up a large portion of the disease-related brain regions especially in early AD. Also, convex regularizations tend to select many irrelevant variables to compensate for the overly severe shrinkage in the parameters of the relevant variables. Considering these limitations of convex regularizations, we study non-convex regularizations [33]-[35] [39], which have the advantage of producing mildly or slightly shrunken parameter estimates so as to be able to preserve weak-effect disease-related brain regions and the advantage of avoiding selecting many disease-irrelevant regions. Specifically in this paper, we propose a sparse composite linear discriminant analysis model, called SCLDA, for identification of disease-related brain regions from multi-modality data. The contributions of our paper include: 2 • • • 2 Formulation: We propose a novel formulation that decomposes each LDA parameter into a product of a common parameter shared by all the data sources and a parameter specific to each data source, which enables joint analysis of all the data sources and borrowing strength from one another. We further prove that this formulation is equivalent to a penalized likelihood with non-convex regularization. Algorithm: We show that the proposed non-convex optimization can be solved by the DC (difference of convex functions) programming [39]. More importantly, we show that in using the DC programming, the property of the non-convex regularization in terms of preserving weak-effect features can be nicely revealed. Application: We apply the proposed SCLDA to the PET and MRI data of early AD patients and normal controls (NC). Our study identifies disease-related brain regions that are consistent with the findings in the AD literature. AD vs. NC classification based on these identified regions achieves high accuracy, which makes the proposed method a useful tool for clinical diagnosis of early AD. In contrast, the convex-regularization based multitask feature selection method [31] identifies more irrelevant brain regions and yields a lower classification accuracy. Rev iew o f L D A a nd it s v ari an ts ! Denote đ?’ = đ?‘?!, đ?‘?! , ‌ , đ?‘?! as the variables and assume there are đ??˝ classes. Denote đ?‘ ! as the ! sample size of class đ?‘— and đ?‘ = !!! đ?‘ ! is the total sample size. Let đ??ł = đ?’›! , đ?’›! , ‌ , đ?’›! ! be the đ?‘ Ă—đ?‘? sample matrix, where đ?’›! is the đ?‘– !! sample and đ?‘” đ?‘– is its associated class index. Let ! ! ! ! đ?›?! = !!! đ?’›! be the overall sample mean, !!!,! ! !! đ?’›! be the sample mean of class đ?‘—, đ?›? = !! ! ! đ?’› − đ?›? ! !!! ! ! ! đ??–! = !! !!!,! ! !! ! ! đ?‘ đ??– be the ! !!! ! ! đ??“= ! đ?’›! − đ?›? đ?’›! − đ?›?! ! be the total normalized sum of squares and products (SSQP), đ?’›! − đ?›?! ! be the normalized class SSQP of class đ?‘—, and đ??–= overall normalized class SSQP. The objective of LDA is to seek for a đ?‘?Ă—đ?‘ž linear transformation matrix, đ?›‰! , with which đ?›‰! đ?‘? ! retains the maximum amount of class discrimination information in đ?‘?. To achieve this objective, one approach is to seek for the đ?›‰! that maximizes the between-class variance of đ?›‰! đ?‘?, which can ! be measured by tr(đ?›‰! đ??“đ?›‰! ), while minimizing the within-class variance of đ?›‰! đ?‘?, which can be ! ! measured by tr(đ?›‰! đ??–đ?›‰! ). Here tr() is the matrix trace operator. This is equivalent to solving the ! following optimization problem:  đ?›‰! = argmax đ?›‰! đ??đ??Ť(đ?›‰! đ??“đ?›‰! ) ! đ??đ??Ť(đ?›‰! đ??–đ?›‰! ) ! . (1) Note that đ?›‰! corresponds to the right eigenvector of đ??– !! đ??“ and đ?‘ž = đ??˝ − 1. Another approach used for finding the đ?›‰! is to use the maximum likelihood estimation for Gaussian populations that have different means and a common covariance matrix. Specifically, as in [36], this approach is developed by assuming the class distributions are Gaussian with a common covariance matrix, and their mean differences lie in a đ?‘ž-dimensional subspace of the đ?‘?dimensional original variable space. Hastie [37] further generalized this approach by assuming that class distributions are a mixture of Gaussians, which has more flexibility than LDA. However, both approaches assume a common covariance matrix for all the classes, which is too strict in many practical applications, especially in high-dimensional problems where the covariance matrices of different classes tend to be different. Consequently, the linear transformation explored by LDA may not be effective. In [38], a heterogeneous LDA (HLDA) is developed to relax this assumption. The HLDA seeks for a đ?‘?Ă—đ?‘? linear transformation matrix, đ?›‰, in which only the first đ?‘ž columns (đ?›‰! ) contain discrimination information and the remaining đ?‘? − đ?‘ž columns (đ?›‰!!! ) contain no discrimination information. For Gaussian models, assuming lack of discrimination information is equivalent to assuming that the means and the covariance matrices of the class distributions are the same for all 3 classes, in the đ?‘? − đ?‘ž dimensional subspace. Following this, the log-likelihood function of đ?›‰ can be written as below [38]: ! đ?‘™ đ?›‰|đ??™ = − log đ?›‰! đ??“đ?›‰!!! − !!! ! !! ! !!! ! log đ?›‰! đ??–! đ?›‰! + đ?‘ log đ?›‰ , ! (2) Here đ??€ denotes the determinant of matrix đ??€. There is no closed-form solution for đ?›‰. As a result, numeric methods are needed to derive the maximum likelihood estimate for đ?›‰. It is worth mentioning that the LDA in the form of (1) is a special case of the HLDA [38]. 3 T he p ro po sed SC L DA Suppose that there are multiple data sources, đ??™ ! , đ??™ ! , ‌ , đ??™ ! , with each data source capturing one aspect of the same set of physical variables, e.g., the MRI and PET capture the structural and functional aspects of the same brain regions. For each data source, đ??™ ! , there is a linear transformation matrix đ?›‰ ! , which retains the maximum amount of class discrimination information in đ??™ ! . A naive way for estimating đ?šŻ = đ?›‰ ! , đ?›‰ ! , ‌ , đ?›‰ ! is to separately estimate each đ?›‰ ! based on đ??™ ! . Apparently, this approach does not take advantage of the fact that all the data sources measure the same physical process. Also, when the sample size of each data source is small, this approach may lead to unreliable estimates for the đ?›‰ ! ’s. To tackle these problems, we propose a composite parameterization following the line as [40]. ! Specifically, let đ?œƒ!,! be the element at the k-th row and l-th column of đ?›‰ ! . We treat ! ! ! ! ! ! đ?œƒ!,! , đ?œƒ!,! , ‌ , đ?œƒ!,! as an interrelated group and parameterize each đ?œƒ!,! as đ?œƒ!,! = đ?›ż! đ?›ž!,! , for 1 ≤ đ?‘˜ ≤ đ?‘?, 1 ≤ đ?‘™ ≤ đ?‘? and 1 ≤ đ?‘š ≤ đ?‘€. In order to assure identifiability, we restrict each đ?›ż! ≼ 0. Here, đ?›ż! represents the common information shared by all the data sources about variable đ?‘˜, while ! đ?›ž!,! represents the specific information only captured by the đ?‘š !! data source. For example, for disease-related brain region identification, if đ?›ż! = 0, it means that all the data sources indicate variable đ?‘˜ is not a disease-related brain region; otherwise, variable đ?‘˜ is a disease-related brain ! region. đ?›ž!,! ≠0 means that the đ?‘š !! data source supports this assertion. The log-likelihood function of đ?šŻ is: đ?‘™! đ?šŻ| đ??™ ! , đ??™ ! , ‌ , đ??™ ! = ! !!! − ! ! ! đ?‘ ! ! log đ?›‰!!! đ??“ ! log đ?›‰ ! ! ! ! đ?›‰!!! − !! ! !!! ! ! log đ?›‰! đ??–! ! ! đ?›‰! +  , which follows the same line as (2). However, our formulation includes the following constraints on đ?šŻ: ! ! đ?œƒ!,! = đ?›ż! đ?›ž!,! , đ?›ż! ≼ 0, 1 ≤ đ?‘˜, đ?‘™ ≤ đ?‘?, 1 ≤ đ?‘š ≤ đ?‘€. Let ! đ?šŞ = đ?›ž!,! , 1 ≤ đ?‘˜ ≤ đ?‘?, 1 ≤ đ?‘™ ≤ đ?‘?, 1 ≤ đ?‘š ≤ đ?‘€ and (3) đ?šż = đ?›ż! , 1 ≤ đ?‘˜ ≤ đ?‘? . An intuitive choice for estimation of đ?šŞ and đ?šż is to maximize the đ?‘™! đ?šŻ| đ??™ ! , đ??™ ! , ‌ , đ??™ !   subject to the constraints in (3). However, it can be anticipated that no element in the estimated đ?šŞ and đ?šż will be exactly zero, resulting in a model which is not interpretable, i.e., poor identification of diseaserelated regions. Thus, we encourage the estimation of đ?šż and the first đ?‘ž columns of đ?šŞ (i.e., the columns containing discrimination information) to be sparse, by imposing the L1-penalty on đ?šŞ and đ?šż. By doing so, we obtain the following optimization problem for the proposed SCLDA: đ?šŻ = argmin đ?šŻ đ?‘™! đ?šŻ| đ??™ ! , đ??™ ! , ‌ , đ??™ ! ! đ?œƒ!,! = = argmin đ?šŻ −đ?‘™! đ?šŻ| đ??™ ! , đ??™ ! , ‌ , đ??™ ! ! đ?œ†! !,!,! đ?›ž!,!  , subject to ! đ?›ż! đ?›ž!,! , đ?›ż! ≼ 0, 1 ≤ đ?‘˜, đ?‘™ ≤ +  đ?œ†! ! đ?›ż! + đ?‘?, 1 ≤ đ?‘š ≤ đ?‘€.                               (4) Here, đ?œ†! and đ?œ†! control the degrees of sparsity of đ?šż and đ?šŞ, respectively. Tuning of two regularization parameters is difficult. Fortunately, we prove the following Theorem which indicates that formulation (4) is equivalent to a simpler optimization problem involving only one regularization parameter. 4 Theorem 1: The optimization problem (4) is equivalent to the following optimization problem: đ?šŻ = argmin đ?šŻ đ?‘™! đ?šŻ| đ??™ = argmin đ?šŻ −đ?‘™! đ?šŻ| đ??™ with đ?œ† = 2 ! ! , đ??™ ! ! , đ??™ ,‌, đ??™ ! ! ,‌, đ??™ +  đ?œ† ! ! ! !!! ! !!! ! đ?œƒ!,!  , (5) ! đ?œ†! đ?œ†! , i.e., đ?œƒ!,! = đ?œƒ!,! . The proof can be found in the supplementary document. It can also be found in the supplementary material how this formulation will serve the purpose of the composite parameterization, i.e., common information and specific information can be estimated separately and simultaneously. The optimization problem (5) is a non-convex optimization problem that is difficult to solve. We address this problem by using an iterative two-stage procedure known as Difference of Convex functions (DC) programming [39]. A full description of the algorithm can be found in the supplemental material. 4 S im ula tion s tu d ies In this section, we conduct experiments to compare the performance of the proposed SCLDA with sparse LDA (SLDA) [42] and multitask feature selection [31]. Specifically, as we focus on LDA, we use the multitask feature selection method developed in [31] on LDA, denoted as MSLDA. Both SLDA and MSLDA adopt convex regularizations. Specifically, SLDA selects features from one single data source with L1-regularization; MSLDA selects features from multiple data sources with L2/L1 regularization. We evaluate the performances of these three methods across various parameters settings, including the number of variables, đ?‘?, the number of features, đ?‘™, the number of data sources, M, sample size, đ?‘›, and the degree of overlapping of the features across different data sources, s% (the larger the đ?‘ %, the more shared features among the datasets). Definition of đ?‘ % can be found in the simulation procedure that is included in the supplemental material. For each specification of the parameters settings, đ?‘€ datasets can be generated following the simulation procedure. We apply the proposed SCLDA to the đ?‘€ datasets, and identify one feature vector đ?›‰(!) for each dataset, with đ?œ† and đ?‘ž chosen by the method described in section 3.3. The result can be described by the number of true positives (TPs) as well as the number of false positives (FPs). Here, true positives are the non-zero elements in the learned feature vector đ?›‰(!) which are also non-zero in the đ?›ƒ! ; false positives are the non-zero elements in đ?›‰(!) , which are actually zero in đ?›ƒ! . As there are đ?‘š pairs of the TPs and FPs for the đ?‘€ datasets, the average TP over the M datasets and the average FP over the M datasets are used as the performance measures. This procedure (i.e., from data simulation, to SCLDA, to TPs and FPs generation) can be repeated for đ??ľ times, and đ??ľ pairs of average TP and average FP are collected for SCLDA. In a similar way, we can obtain đ??ľ pairs of average TP and average FP for both SLDA and MSLDA. Figures 1 (a) and (b) show comparison between SCLDA, SLDA and MSLDA by scattering the average TP against the average FP for each method. Each point corresponds to one of the N repetitions. The comparison is across various parameters settings, including the number of variables (đ?‘? = 100,200,500), the number of data sources (đ?‘š = 2,5,10), and the degree of overlapping of the features across different data sources (đ?‘ % = 90%, 70%). Additionally, đ?‘› đ?‘? is kept constant, đ?‘› đ?‘? = 1. A general observation is that SCLDA is better than SLDA and MSLDA across all the parameter settings. Some specific trends can be summarized as follows: (i) Both SCLDA and MSLDA outperform SLDA in terms of TPs; SCLDA further outperforms MSLDA in terms of FPs. (ii) In Figure 2 (a), rows correspond to different numbers of data sources, i.e., đ?‘š = 2,5,10 respectively. It is clear that the advantage of SCLDA over both SLDA and MSLDA is more significant when there are more data sources. Also, MSLDA performs consistently better than SLDA. Similar phenomena are shown in Figure 2 (b). This demonstrates that in analyzing each data source, both SCLDA and MSLDA are able to make use of the information contained in other data sources. SCLDA can use this information more efficiently, as SCLDA can produce less shrunken parameter estimates than MSLDA and thus it is able to preserve weak-effect features. (iii) Comparing Figures 2 (a) and (b), it can be seen that the advantage of SCLDA or MSLDA over SLDA is more significant as the data sources have more degree of overlapping in their 5 features. Finally, although not presented here, our simulation shows that the three methods perform similarly when đ?‘ % = 40 or less. (a) (b) Figure 1: Average numbers of TPs vs FPs for SCLDA (green symbols “+â€?), SLDA (blue symbols “*â€?) and MSLDA (red symbols “oâ€?) (a) đ?‘ % = 90%, đ?‘› đ?‘? = 1; (b) đ?‘ % = 70%, đ?‘› đ?‘? = 1 5 C ase st ud y 5.1 Data preprocessing Our study includes 49 AD patient and 67 age-matched normal controls (NC), with each subject of AD or NC being scanned both by PET and MRI. The PET and MRI images can be downloaded from the database by the Alzheimer’s Disease Neuroimaging Initiative. In what follows, we outline the data preprocessing steps. Each image is spatially normalized to the Montreal Neurological Institute (MNI) template, using the affine transformation and subsequent non-linear wraping algorithm [43] implemented in the SPM MATLAB toolbox. This is to ensure that each voxel is located in the same anatomical region for all subjects, so that spatial locations can be reported and interpreted in a consistent manner. Once all the images in the MNI template, we further apply the Automated Anatomical Labeling (AAL) technique [43] to segment the whole brain of each subject into 116 brain regions. The 90 regions that belong to the cerebral cortex are selected for the later analysis, as the other regions are not included in the cerebral cortex are rarely considered related with AD in the literature. The measurement of each region in the PET data is regional cerebral blood flow (rCBF); the measurement of each region in the MRI data is the structural volume of the region. 5.2 Disease-related brain regions SCLDA is applied to the preprocessed PET and MRI data of AD and NC with the penalty parameter selected by the AIC method mentioned in section 3. 26 disease-related brain regions are identified from PET and 21 from MRI (see Table 1 for their names). The maps of the diseaserelated brain regions identified from PET and MRI are highlighted in Figure 2 (a) and (b), respectively, with different colors given to neighboring regions in order to distinguish them. Each figure is a set of horizontal cut away slices of the brain as seen from the top, which aims to provide a full view of locations of the regions. One major observation is that the identified disease-related brain regions from MRI are in the hippocampus, parahippocampus, temporal lobe, frontal lobe, and precuneus, which is consistent with the existing literature that reports structural atrophy in these brain areas. [3-6,12-14]. The identified disease-related brain regions from PET are in the temporal, frontal and parietal lobes, which is consistent with many functional neuroimaging studies that report reduced rCBF or 6 reduced cortical glucose metabolism in these areas [8-10, 12-14]. Many of these identified disease-related regions can be explained in terms of the AD pathology. For example, hippocampus is a region affected by AD the earliest and severely [6] Also, as regions in the temporal lobe are essential for memory, damage on these regions by AD can explain the memory loss which is a major clinic symptom of AD. The consistency of our findings with the AD literature supports effectiveness of the proposed SCLDA. Another finding is that there is a large overlap between the identified disease-related regions from PET and those from MRI, which implies strong interaction between functional and structural alterations in these regions. Although well-accepted biological mechanisms underlying this interaction are still not very clear, there are several explanations existing in the literature. The first explanation is that both functional and structural alterations could be the consequence of dendritic arborizations, which results from intracellular accumulation of PHFtau and further leads to neuron death and grey matter loss [14]. The second explanation is that the AD pathology may include a vascular component, which may result in reduced rCBF due to limited blood supply and may ultimately result in structural alteration such as brain atrophy [45]. (a) (b) Figure 2: locations of disease-related brain regions identified from (a) MRI; (b) PET 5.3 Classification accuracy As one of our primary goals is to distinguish AD from NC, the identified disease-related brain regions through SCLDA are further utilized for establishing a classification model. Specifically, for each subject, the rCBF values of the 26 disease-related brain regions identified from PET and the structural volumes of the 21 disease-related brain regions identified from MRI are used, as a joint spatial pattern of both brain physiology and structure. As a result, each subject is associated with a vector with 47 features/variables. Linear SVM (Support Vector Machine) is employed as the classifier. The classification accuracy based on 10-fold cross-validation is 94.3%. For comparison purposes, MSLDA is also applied, which identifies 45 and 38 disease-related brain regions for PET and MRI, respectively. Linear SVM applied to the 45+38 features gives a classification accuracy of only 85.8%. Note that MSLDA identifies a much larger number of disease-related brain regions than SCLDA, but some of the identified regions by MSLDA may indeed be disease-irrelevant, so including them deteriorates the classification. 5.4 Relationship between structural atrophy and abnormal rCBF, and severity of cognitive impairment in AD In addition to classification, it is also of interest to further verify relevance of the identified disease-related regions with AD in an alternative way. One approach is to investigate the degree to which those disease-related regions are relevant to cognitive impairment that can be measured by the Alzheimer’s disease assessment scale – cognitive subscale (ADAS-cog). ADAS measures severity of the most important symptoms of AD, while its subscale, ADAS-cog, is the most 7 popular cognitive testing instrument used in clinic trails. The ADAS-cog consists of 11 items measuring disturbances of memory, language, praxis, attention and other cognitive abilities that are often affected by AD. As the total score of these 11 items provides an overall assessment of cognitive impairment, we regress this ADAS-cog total score (the response) against the rCBF or structure volume measurement (the predictor) of each identified brain region, using a simple regression. The regression results are listed in Table 1. It is not surprising to find that some regions in the hippocampus area and temporal lobes are among the best predictors, as these regions are extensively reported in the literature as the most severely affected by AD [3-6]. Also, it is found that most of these brain regions are weak-effect predictors, as most of them can only explain a small portion of the variability in the ADAS-cog total score, i.e., many R-square values in Table 1 are less than 10%. However, although the effects are weak, most of them are significant, i.e., most of the p-values in Table 1 are smaller than 0.05. Furthermore, it is worth noting that 70.22% variability in ADAS-cog can be explained by taking all the 26 brain regions identified from PET as predictors in a multiple regression model; 49.72% variability can be explained by taking all the 21 brain regions from MRI as predictors in a multiple regression model. All this findings imply that the disease-related brain regions are indeed weakeffect features if considered individually, but jointly they can play a strong role for characterizing AD. This verifies the suitability of the proposed SCLDA for AD studies, as SCLDA can preserve weak-effect features. Table 1: Explanatory power of regional rCBF and structural volume for variability in ADAS-cog (“~â€? means this region is not identified from PET (or MRI) as a disease-related region by SCLDA) PET Brain regions Precentral_L Precentral_R Frontal_Sup_L Frontal_Sup_R Frontal_Mid_R Frontal_M_O_L Frontal_M_O_R Insula_L Insula_R Cingulum_A_R Cingulum_Mid_L Cingulum_Post_L Hippocampus_L Hippocampus_R ParaHippocamp_L 6 R 2 0.003 0.044 0.051 0.044 0.056 0.036 0.019 0.016 ~ 0.004 0.001 0.184 0.158 ~ 0.206 pvalue 0.503 0.022 0.013 0.023 0.010 0.040 0.138 0.171 ~ 0.497 0.733 <10-4 <10-4 ~ <10-4 MRI R 2 0.027 ~ 0.047 ~ 0.072 0.086 0.126 0.163 0.125 0.082 0.040 ~ ~ 0.242 ~ PET Brain regions pvalue 0.077 ~ 0.018 ~ 0.003 0.001 0.000 <10-4 0.000 0.001 0.030 ~ ~ <10-4 ~ Amygdala_L Calcarine_L Lingual_L Postcentral_L Parietal_Sup_R Angular_R Precuneus_R Paracentr_Lobu_L Pallidum_L Pallidum_R Heschl_L Heschl_R Temporal_P_S_R Temporal_Inf_R All regions R 2 0.090 0.038 0.066 0.038 0.001 0.173 0.063 0.035 0.082 ~ 0.001 0.000 0.008 0.187 0.702 pvalue 0.001 0.034 0.005 0.035 0.677 <10-4 0.006 0.043 0.001 ~ 0.640 0.744 0.336 <10-4 <10-4 MRI R 2 0.313 0.028 0.044 0.026 ~ 0.063 0.025 0.000 ~ 0.020 ~ 0.111 0.071 0.147 0.497 pvalue <10-4 0.070 0.023 0.081 ~ 0.006 0.084 0.769 ~ 0.122 ~ 0.000 0.003 <10-4 <10-4 C on clu sio n In the paper, we proposed a SCLDA model for identification of disease-related brain regions of AD from multi-modality data, which is capable to preserve weak-effect disease-related brain regions due to its less shrinkage imposed on its parameters. We applied SCLDA to the PET and MRI data of early AD patients and normal controls. As MRI and PET measure two complementary aspects (structural and functional aspects, respectively) of the same AD pathology, fusion of these two image modalities can make effective use of their interaction and thus improve the statistical power in identification of disease-related brain regions. Our findings were consistent with the literature and also showed some new aspects that may suggest further investigation in neuroimaging research in the future. 8 References [1] deToledo-Morrell, L., Stoub, T.R., Bulgakova, M. 2004. MRI-derived entorhinal volume is a good predictor of conversion from MCI to AD. Neurobiol. Aging 25, 1197–1203. [2] Morra, J.H., Tu, Z. Validation of automated hippocampal segmentation method. NeuroImage 43, 59–68, 2008. [3] Morra, J.H., Tu, Z. 2009a. Automated 3D mapping of hippocampal atrophy. Hum. Brain Map. 30, 2766–2788. [4] Morra, J.H., Tu, Z. 2009b. Automated mapping of hippocampal atrophy in 1-year repeat MRI data. NeuroImage 45, 213-221. [5] Schroeter, M.L., Stein, T. 2009. Neural correlates of AD and MCI. NeuroImage 47, 1196–1206. [6] Braak, H., Braak, E. 1991. Neuropathological stageing of Alzheimer-related changes. Acta Neuro. 82, 239–259. [7] Bradley, K.M., O'Sullivan. 2002. Cerebral perfusion SPET correlated with Braak pathological stage in AD. Brain 125, 1772–1781. [8] Keilp, J.G., Alexander, G.E. 1996. Inferior parietal perfusion, lateralization, and neuropsychological dysfunction in AD. Brain Cogn. 32, 365–383. [9] Schroeter, M.L., Stein, T. 2009. Neural correlates of AD and MCI. NeuroImage 47, 1196–1206. [10] Asllani, I., Habeck, C. 2008. Multivariate and univariate analysis of continuous arterial spin labeling perfusion MRI [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21] [22] [23] [24] [25] [26] [27] [28] [29] [30] [31] [32] [33] [34] [35] [36] [37] [38] [39] [40] [41] [42] [43] [44] [45] in AD. J. Cereb. Blood Flow Metab. 28, 725–736. Du,A.T., Jahng, G.H. 2006. Hypoperfusion in frontotemporal dementia and AD. Neurology 67, 1215–1220. Ishii, K., Kitagaki, H. 1996. Decreased medial temporal oxygen metabolism in AD. J. Nucl. Med. 37, 1159–1165. Johnson, N.A., Jahng, G.H. 2005. Pattern of cerebral hypoperfusion in AD. Radiology 234, 851–859. Wolf, H., Jelic, V. 2003. A critical discussion of the role of neuroimaging in MCI. Acta Neuroal: 107 (4), 52-76. Tosun, D., Mojabi, P. 2010. Joint analysis of structural and perfusion MRI for cognitive assessment and classification of AD and normal aging. NeuroImage 52, 186-197. Alsop, D., Casement, M. 2008. Hippocampal hyperperfusion in Alzheimer's disease. NeuroImage 42, 1267–1274. Mosconi, L., Tsui, W.-H. 2005. Reduced hippocampal metabolism in MCI and AD. Neurology 64, 1860–1867. Mulert, C., Lemieux, L. 2010. EEG-fMRI: physiological basis, technique and applications. Springer. Xu, L., Qiu, C., Xu, P. and Yao, D. 2010. A parallel framework for simultaneous EEG/fMRI analysis: methodology and simulation. NeuroImage, 52(3), 1123-1134. Philiastides, M. and Sajda, P. 2007. EEG-informed fMRI reveals spatiotemporal characteristics of perceptual decision making. Journal of Neuroscience, 27(48), 13082-13091. Daunizeau, J., Grova, C. 2007. Symmetrical event-related EEG/fMRI information fusion. NeuroImage 36, 69-87. Jagust, W. 2006. PET and MRI in the diagnosis and prediction of dementia. Alzheimer’s Dement 2, 36-42. Kawachi, T., Ishii, K. and Sakamoto, S. 2006. Comparison of the diagnostic performance of FDG-PET and VBM. Eur.J.Nucl.Med.Mol.Imaging 33, 801-809. Matsunari, I., Samuraki, M. 2007. Comparison of 18F-FDG PET and optimized voxel-based morphometry for detection of AD. J.Nucl.Med 48, 1961-1970. Schmidt, M., Fung, G. and Rosales, R. 2007. Fast optimization methods for L1-regularization: a comparative study and 2 new approaches. ECML 2007. Liu, J., Ji, S. and Ye, J. 2009. SLEP: sparse learning with efficient projections, Arizona state university. Tibshirani, R. 1996. Regression Shrinkage and Selection via the Lasso, JRSS, Series B, 58(1):267–288. Friedman, J., Hastie, T. and Tibshirani, R. 2007. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 8(1):1–10. Zou, H., Hastie, T. and Tibshirani, R. 2006. Sparse PCA, J. of Comp. and Graphical Statistics, 15(2), 262-286. Qiao, Z., Zhou, L and Huang, J. 2006. Sparse LDA with applications to high dimensional low sample size data. IAENG applied mathematics, 39(1). Argyriou, A., Evgeniou, T. and Pontil, M. 2008. Convex multi-task feature learning. Machine Learning 73(3): 243– 272. Huang, S., Li, J., et al. 2010. Learning Brain Connectivity of AD by Sparse Inverse Covariance Estimation, NeuroImage, 50, 935-949. Candes, E., Wakin, M. and Boyd, S. 2008. Enhancing sparsity by reweighted L1 minimization. Journal of Fourier analysis and applications, 14(5), 877-905. Mazumder, R.; Friedman, J. 2009. SparseNet: Coordinate Descent with Non-Convex Penalties. Manuscript. Zhang, T. 2008. Multi-stage Convex Relaxation for Learning with Sparse Regularization. NIPS 2008. Campbell, N. 1984. Canonical variate analysis ageneral formulation. Australian Jour of Stat 26, 86–96. Hastie, T. and Tibshirani, R. 1994. Discriminant analysis by gaussian mixtures. Technical report. AT&T; Bell Lab. Kumar, N. and Andreou, G. 1998. Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition. Speech Communication, 26 (4), 283-297. Gasso, G., Rakotomamonjy, A. and Canu, S. 2009. Recovering sparse signals with non-convex penalties and DC programming. IEEE Trans. Signal Processing 57( 12), 4686-4698. Guo, J., Levina, E., Michailidis, G. and Zhu, J. 2011. Joint estimation of multiple graphical models. Biometrika 98(1) 1-15. Bertsekas, D. 1982. Projected newton methods for optimization problems with simple constraints. SIAM J. Control Optim 20, 221-246. Clemmensen, L., Hastie, T., Witten, D. and Ersboll:, B. 2011. Sparse Discriminant Analysis. Technometrics (in press) Friston, K.J., Ashburner, J. 1995. Spatial registration and normalization of images. HBM 2, 89–165. Tzourio-Mazoyer, N., et al., 2002. Automated anatomical labelling of activations in SPM. NeuroImage 15, 273–289. Bidzan, L. 2005. Vascular factors in dementia. Psychiatr. Pol. 39, 977-986. 9
5 0.54070514 71 nips-2011-Directed Graph Embedding: an Algorithm based on Continuous Limits of Laplacian-type Operators
Author: Dominique C. Perrault-joncas, Marina Meila
Abstract: This paper considers the problem of embedding directed graphs in Euclidean space while retaining directional information. We model the observed graph as a sample from a manifold endowed with a vector field, and we design an algorithm that separates and recovers the features of this process: the geometry of the manifold, the data density and the vector field. The algorithm is motivated by our analysis of Laplacian-type operators and their continuous limit as generators of diffusions on a manifold. We illustrate the recovery algorithm on both artificially constructed and real data. 1 Motivation Recent advances in graph embedding and visualization have focused on undirected graphs, for which the graph Laplacian properties make the analysis particularly elegant [1, 2]. However, there is an important number of graph data, such as social networks, alignment scores between biological sequences, and citation data, which are naturally asymmetric. A commonly used approach for this type of data is to disregard the asymmetry by studying the spectral properties of W +W T or W T W , where W is the affinity matrix of the graph. Some approaches have been offered to preserve the asymmetry information contained in data: [3], [4], [5] or to define directed Laplacian operators [6]. Although quite successful, these works adopt a purely graph-theoretical point of view. Thus, they are not concerned with the generative process that produces the graph, nor with the interpretability and statistical properties of their algorithms. In contrast, we view the nodes of a directed graph as a finite sample from a manifold in Euclidean space, and the edges as macroscopic observations of a diffusion kernel between neighboring points on the manifold. We explore how this diffusion kernel determines the overall connectivity and asymmetry of the resulting graph and demonstrate how Laplacian-type operators of this graph can offer insights into the underlying generative process. Based on the analysis of the Laplacian-type operators, we derive an algorithm that, in the limit of infinite sample and vanishing bandwidth, recovers the key features of the sampling process: manifold geometry, sampling distribution, and local directionality, up to their intrinsic indeterminacies. 2 Model The first premise here is that we observe a directed graph G, with n nodes, having weights W = [Wij ] for the edge from node i to node j. In following with common Laplacian-based embedding approaches, we assume that G is a geometric random graph constructed from n points sampled according to distribution p = e−U on an unobserved compact smooth manifold M ⊆ Rl of known intrinsic dimension d ≤ l. The edge weight Wij is then determined by a directed similarity kernel k (xi , xj ) with bandwidth . The directional component of k (xi , xj ) will be taken to be derived 1 from a vector field r on M, which assigns a preferred direction between weights Wij and Wji . The choice of a vector field r to characterize the directional component of G might seem restrictive at first. In the asymptotic limit of → 0 and n → ∞ however, kernels are characterized by their diffusion, drift, and source components [7]. As such, r is sufficient to characterize any directionality associated with a drift component and as it turns out, the component of r normal M in Rl can also be use to characterize any source component. As for the diffusion component, it is not possible to uniquely identify it from G alone [8]. Some absolute knownledge of M is needed to say anything about it. Hence, without loss of generality, we will construct k (xi , xj ) so that the diffusion component ends being isotropic and constant, i.e. equal to Laplace-Beltrami operator ∆ on M. The schematic of this generative process is shown in the top left of Figure 1 below. From left to right: the graph generative process mapping the sample on M to geometric random graph G via the kernel k (x, y), then the subsequent embedding (α) Ψn of G by operators Haa,n , (α) Hss,n (defined in section 3.1). As these operators converge to (α) their respective limits, Haa and (α) Hss , so will Ψn → Ψ, pn → p, and rn → r. We design an algorithm that, given G, produces the top right embedding (Ψn , pn , and rn ). Figure 1: Schematic of our framework. The question is then as follows: can the generative process’ geometry M, distribution p = e−U , and directionality r, be recovered from G? In other words, is there an embedding of G in Rm , m ≥ d that approximates all three components of the process and that is also consistent as sample size increases and the bandwidth vanishes? In the case of undirected graphs, the theory of Laplacian eigenmaps [1] and Diffusion maps [9] answers this question in the affirmative, in that the geometry of M and p = e−U can be inferred using spectral graph theory. The aim here is to build on the undirected problem and recover all three components of the generative process from a directed graph G. The spectral approach to undirected graph embedding relies on the fact that eigenfunctions of the Laplace-Beltrami operator are known to preserve the local geometry of M [1]. With a consistent empirical Laplace-Beltrami operator based on G, its eigenvectors also recover the geometry of M and converge to the corresponding eigenfunctions on M. For a directed graph G, an additional operator is needed to recover the local directional component r, but the principle remains the same. (α) The schematic for this is shown in Figure 1 where two operators - Hss,n , introduced in [9] for (α) undirected embeddings, and Haa,n , a new operator defined in section 3.1 - are used to obtain the (α) (α) (α) embedding Ψn , distribution pn , and vector field rn . As Haa,n and Hss,n converge to Haa and (α) Hss , Ψn , pn , and rn also converge to Ψ, p, and r, where Ψ is the local geometry preserving the embedding of M into Rm . (α) The algorithm we propose in Section 4 will calculate the matrices corresponding to H·,n from the graph G, and with their eigenvectors, will find estimates for the node coordinates Ψ, the directional component r, and the sampling distribution p. In the next section we briefly describe the mathematical models of the diffusion processes that our model relies on. 2 2.1 Problem Setting The similarity kernel k (x, y) can be used to define transport operators on M. The natural transport operator is defined by normalizing k (x, y) as T [f ](x) = M k (x, y) f (y)p(y)dy , where p (x) = p (x) k (x, y)p(y)dy . (1) M T [f ](x) represents the diffusion of a distribution f (y) by the transition density k (x, y)p(y)/ k (x, y )p(y )dy . The eigenfunctions of this infinitesimal operator are the continuous limit of the eigenvectors of the transition probability matrix P = D−1 W given by normalizing the affinity matrix W of G by D = diag(W 1) [10]. Meanwhile, the infinitesimal transition ∂f (T − I)f = lim (2) →0 ∂t defines the backward equation for this diffusion process over M based on kernel k . Obtaining the explicit expression for transport operators like (2) is then the main technical challenge. 2.2 Choice of Kernel In order for T [f ] to have the correct asymptotic form, some hypotheses about the similarity kernel k (x, y) are required. The hypotheses are best presented by considering the decomposition of k (x, y) into symmetric h (x, y) = h (y, x) and anti-symmetric a (x, y) = −a (y, x) components: k (x, y) = h (x, y) + a (x, y) . (3) The symmetric component h (x, y) is assumed to satisfy the following properties: 1. h (||y − 2 x||2 ) = h(||y−x|| / ) , and 2. h ≥ 0 and h is exponentially decreasing as ||y − x|| → ∞. This form d/2 of symmetric kernel was used in [9] to analyze the diffusion map. For the asymmetric part of the similarity kernel, we assume the form a (x, y) = r(x, y) h(||y − x||2 / ) · (y − x) , d/2 2 (4) with r(x, y) = r(y, x) so that a (x, y) = −a (y, x). Here r(x, y) is a smooth vector field on the manifold that gives an orientation to the asymmetry of the kernel k (x, y). It is worth noting that the dependence of r(x, y) on both x and y implies that r : M × M → Rl with Rl the ambient space of M; however in the asymptotic limit, the dependence in y is only important “locally” (x = y), and as such it is appropriate to think of r(x, x) being a vector field on M. As a side note, it is worth pointing out that even though the form of a (x, y) might seem restrictive at first, it is sufficiently rich to describe any vector field . This can be seen by taking r(x, y) = (w(x) + w(y))/2 so that at x = y the resulting vector field is given by r(x, x) = w(x) for an arbitrary vector field w(x). 3 Continuous Limit of Laplacian Type Operators We are now ready to state the main asymptotic result. Proposition 3.1 Let M be a compact, closed, smooth manifold of dimension d and k (x, y) an asymmetric similarity kernel satisfying the conditions of section 2.2, then for any function f ∈ C 2 (M), the integral operator based on k has the asymptotic expansion k (x, y)f (y)dy = m0 f (x) + g(f (x), x) + o( ) , (5) M where g(f (x), x) = and m0 = Rd m2 (ω(x)f (x) + ∆f (x) + r · 2 h(||u||2 )du, m2 = Rd u2 h(||u||2 )du. i 3 f (x) + f (x) · r + c(x)f (x)) (6) The proof can be found in [8] along with the definition of ω(x) and c(x) in (6). For now, it suffices to say that ω(x) corresponds to an interaction between the symmetric kernel h and the curvature of M and was first derived in [9]. Meanwhile, c(x) is a new term that originates from the interaction between h and the component of r that is normal to M in the ambient space Rl . Proposition 3.1 foreshadows a general fact about spectral embedding algorithms: in most cases, Laplacian operators confound the effects of spatial proximity, sampling density and directional flow due to the presence of the various terms above. 3.1 Anisotropic Limit Operators Proposition 3.1 above can be used to derive the limits of a variety of Laplacian type operators associated with spectral embedding algorithms like [5, 6, 3]. Although we will focus primarily on a few operators that give the most insight into the generative process and enable us to recover the model defined in Figure 1, we first present four distinct families of operators for completeness. These operator families are inspired by the anisotropic family of operators that [9] introduced for undirected graphs, which make use of anisotropic kernels of the form: k (α) (x, y) = k (x, y) pα (x)pα (y) , (7) with α ∈ [0, 1] where α = 0 is the isotropic limit. To normalize the anisotropic kernels, we need (α) (α) (α) as p (x) = M k (x, y)p(y)dy. From (7), four to redefine the outdegrees distribution of k families of diffusion processes of the form ft = H (α) [f ](x) can be derived depending on which kernel is normalized and which outdegree distribution is used for the normalization. Specifically, (α) (α) we define transport operators by normalizing the asymmetric k or symmetric h kernels with the 1 asymmetric p or symmetric q = M h (x, y)p(y)dy outdegree distribution . To keep track of all options, we introduce the following notation: the operators will be indexed by the type of kernel and outdegree distribution they correspond to (symmetric or asymmetric), with the first index identifying the kernel and the second index identifying the outdegree distribution. For example, the family of anisotropic limit operators introduced by [9] is defined by normalizing the symmetric kernel by (α) the symmetric outdegree distribution, hence they will be denoted as Hss , with the superscript corresponding to the anisotropic power α. Proposition 3.2 With the above notation, (α) Haa [f ] = ∆f − 2 (1 − α) U · f + r· f (α) Has [f ] (α) Hsa [f ] (α) Hss [f ] = ∆f − 2 (1 − α) U · = ∆f − 2 (1 − α) U · = ∆f − 2(1 − α) U · (8) f − cf + (α − 1)(r · U )f − ( · r + (α − 1)r · f + (c + f. U )f · r)f + r · f (9) (10) (11) The proof of this proposition, which can be found in [8], follows from repeated application of Proposition 3.1 to p(y) or q(y) and then to k α (x, y) or hα (x, y), as well as the fact that p1 = α 1 p−α [1 − α (ω + ∆p p + 2r · p p +2 · r + c)] + o( ). (α) Thus, if we use the asymmetric k and p , we get Haa , defined by the advected diffusion equa(α) tion (8). In general, Haa is not hermitian, so it commonly has complex eigenvectors. This makes (1) embedding directed graphs with this operator problematic. Nevertheless, Haa will play an important role in extracting the directionality of the sampling process. If we use the symmetric kernel h but the asymmetric outdegree distribution p , we get the family (α) of operators Hsa , of which the WCut of [3] is a special case (α = 0). If we reverse the above, i.e. (α) (α) (α) use k and q , we obtain Has . This turns out to be merely a combination of Haa and Hsa . 1 The reader may notice that there are in fact eight possible combinations of kernel and degree distribution, since the anisotripic kernel (7) could also be defined using a symmetric or asymmetric outdegree distribution. However, there are only four distinct asymptotic results and they are all covered by using one kernel (symmetric or asymmetric) and one degree distribution (symmetric or asymmetric) throughout. 4 Algorithm 1 Directed Embedding Input: Affinity matrix Wi,j and embedding dimension m, (m ≥ d) 1. S ← (W + W T )/2 (Steps 1–6 estimate the coordinates as in [11]) n 2. qi ← j=1 Si,j , Q = diag(q) 3. V ← Q−1 SQ−1 (1) n 4. qi ← j=1 Vi,j , Q(1) = diag(q (1) ) (1) −1 5. Hss,n ← Q(1) V 6. Compute the Ψ the n × (m + 1) matrix with orthonormal columns containing the m + 1 largest (1) right eigenvector (by eigenvalue) of Hss,n as well as the Λ the (m + 1) × (m + 1) diagonal matrix of eigenvalues. Eigenvectors 2 to m + 1 from Ψ are the m coordinates of the embedding. (1) 7. Compute π the left eigenvector of Hss,n with eigenvalue 1. (Steps 7–8 estimate the density) n 8. π ← π/ i=1 πi is the density distribution over the embedding. n 9. pi ← j=1 Wi,j , P = diag(p) (Steps 9–13 estimate the vector field r) 10. T ← P −1 W P −1 (1) n 11. pi ← j=1 Ti,j , P (1) = diag(p(1) ) (1) −1 12. Haa,n ← P (1) T (1) (1) 13. R ← (Haa,n − Hss,n )Ψ/2. Columns 2 to m + 1 of R are the vector field components in the direction of the corresponding coordinates of the embedding. (α) Finally, if we only consider the symmetric kernel h and degree distribution q , we recover Hss , the anisotropic kernels of [9] for symmetric graphs. This operator for α = 1 is shown to separate the manifold from the probability distribution [11] and will be used as part of our recovery algorithm. Isolating the Vector Field r 4 Our aim is to esimate the manifold M, the density distribution p = e−U , and the vector field r. The (1) first two components of the data can be recovered from Hss as shown in [11] and summarized in Algorithm 1. At this juncture, one feature of generative process is missing: the vector field r. The natural approach (α) (α) for recovering r is to isolate the linear operator r · from Haa by substracting Hss : (α) (α) Haa − Hss = r · . (12) The advantage of recovering r in operator form as in (12) is that r · is coordinate free. In other words, as long as the chosen embedding of M is diffeomorphic to M2 , (12) can be used to express the component of r that lies in the tangent space T M, which we denote by r|| . Specifically, let Ψ be a diffeomorphic embedding of M ; the component of r along coordinate ψk is then given by r · ψk = rk , and so, in general, r|| = r · Ψ. (13) The subtle point that only r|| is recovered from (13) follows from the fact that the operator r · only defined along M and hence any directional derivative is necessarily along T M. is Equation (13) and the previous observations are the basis for Algorithm 1, which recovers the three important features of the generative process for an asymmetric graph with affinity matrix W . A similar approach can be employed to recover c + · r, or simply · r if r has no component perpendicular to the tangent space T M (meaning that c ≡ 0). Recovering c + · r is achieved by taking advantage of the fact that (1) (1) (Hsa − Hss ) = (c + 2 · r) , (14) (1) A diffeomorphic embedding is guaranteed by using the eigendecomposition of Hss . 5 (1) (1) which is a diagonal operator. Taking into account that for finite n (Hsa,n − Hss,n ) is not perfectly (1) (1) diagonal, using ψn ≡ 1n (vector of ones), i.e. (Hsa,n − Hss,n )[1n ] = (cn + · rn ), has been found (1) (1) empirically to be more stable than simply extracting the diagonal of (Hsa,n − Hss,n ). 5 Experiments Artificial Data For illustrative purposes, we begin by applying our method to an artificial example. We use the planet Earth as a manifold with a topographic density distribution, where sampling probability is proportional to elevation. We also consider two vector fields: the first is parallel to the line of constant latitude and purely tangential to the sphere, while the second is parallel to the line of constant longitude with a component of the vector field perpendicular to the manifold. The true model with constant latitude vector field is shown in Figure 2, along with the estimated density and vector field projected on the true manifold (sphere). Model Recovered Latitudinal (a) Longitudinal (b) Figure 2: (a): Sphere with latitudinal vector field, i.e East-West asymmetry, with Wew > Wwe if node w lies to the West of node e. The graph nodes are sampled non-uniformly, with the topographic map of the world as sampling density. We sample n = 5000 nodes, and observe only the resulting W matrix, but not the node locations. From W , our algorithm estimates the sample locations (geometry), the vector field (black arrows) generating the observed asymmetries, and the sampling distribution at each data point (colormap). (b) Vector fields on a spherical region (blue), and their estimates (red): latitudinal vector field tangent to the manifold (left) and longitudinal vector field with component perpendicular to manifold tangent plane (right). Both the estimated density and vector field agree with the true model, demonstrating that for artificial data, the recovery algorithm 1 performs quite well. We note that the estimated density does not recover all the details of the original density, even for large sample size (here n = 5000 with = 0.07). Meanwhile, the estimated vector field performs quite well even when the sampling is reduced to n = 500 with = 0.1. This can be seen in Figure 2, b, where the true and estimated vector fields are superimposed. Figure 2 also demonstrates how r · only recovers the tangential component of r. The estimated geometry is not shown on any of these figures, since the success of the diffusion map in recovering the geometry for such a simple manifold is already well established [2, 9]. Real DataThe National Longitudinal Survey of Youth (NLSY) 1979 Cohort is a representative sample of young men and women in the United States who were followed from 1979 to 2000 [12, 13]. The aim here is to use this survey to obtain a representation of the job market as a diffusion process over a manifold. The data set consists of a sample of 7,816 individual career sequences of length 64, listing the jobs a particular individual held every quarter between the ages of 20 and 36. Each token in the sequence identifies a job. Each job corresponds to an industry × occupation pair. There are 25 unique industry and 20 unique occupation indices. Out of the 500 possible pairings, approximately 450 occur in the data, with only 213 occurring with sufficient frequency to be included here. Thus, our graph G has 213 nodes - the jobs - and our observations consist of 7,816 walks between the graph nodes. We convert these walks to a directed graph with affinity matrix W . Specifically, Wij represents the number of times a transition from job i to job j was observed (Note that this matrix is asymmetric, 6 i.e Wij = Wji ). Normalizing each row i of W by its outdegree di gives P = diag(di )−1 W , the non-parametric maximum likelihood estimator for the Markov chain over G for the progression (0) of career sequences. This Markov chain has as limit operator Haa , as the granularity of the job market increases along with the number of observations. Thus, in trying to recover the geometry, distribution and vector field, we are actually interested in estimating the full advective effect of the (0) diffusion process generated by Haa ; that is, we want to estimate r · − 2 U · where we can use (0) (1) −2 U · = Hss − Hss to complement Algorithm 1. 0.25 1600 0.05 0.1 1400 0.1 3 0.7 1800 0.15 ! 0.8 0.15 0.2 0.9 0.2 2000 0.25 !30.05 0.6 0.5 0 0 0.4 1200 −0.05 −0.05 −0.1 0.3 −0.1 1000 0.2 −0.15 −0.15 800 −0.2 −0.25 0.1 −0.2 −0.25 −0.1 −0.05 0 !2 0.05 0.1 0.15 0.2 (a) −0.1 −0.05 0 !2 0.05 0.1 0.15 0.2 (b) Figure 3: Embedding the job market along with field r − 2 U over the first two non-constant eigenvectors. The color map corresponds to the mean monthly wage in dollars (a) and to the female proportion (b) for each job. We obtain an embedding of the job market that describes the relative position of jobs, their distribution, and the natural time progression from each job. Of these, the relative position and natural time progression are the most interesting. Together, they summarize the job market dynamics by describing which jobs are naturally “close” as well as where they can lead in the future. From a public policy perspective, this can potentially improve focus on certain jobs for helping individuals attain better upward mobility. The job market was found to be a high dimensional manifold. We present only the first two dimen(0) sions, that is, the second and third eigenvectors of Hss , since the first eigenvector is uninformative (constant) by construction. The eigenvectors showed correlation with important demographic data, such as wages and gender. Figure 3 displays this two-dimensional sub-embedding along with the directional information r − 2 U for each dimension. The plot shows very little net progression toward regions of increasing mean salary3 . This is somewhat surprising, but it is easy to overstate this observation: diffusion alone would be enough to move the individuals towards higher salary. What Figure 3 (a) suggests is that there appear to be no “external forces” advecting individuals towards higher salary. Nevertheless, there appear to be other external forces at play in the job market: Figure 3 (b), which is analogous to Figure 3 (a), but with gender replacing the salary color scheme, suggests that these forces push individuals towards greater gender differentiation. This is especially true amongst male-dominated jobs which appear to be advected toward the left edge of the embedding. Hence, this simple analysis of the job market can be seen as an indication that males and females tend to move away from each other over time, while neither seems to have a monopoly on high- or low- paying jobs. 6 Discussion This paper makes three contributions: (1) it introduces a manifold-based generative model for directed graphs with weighted edges, (2) it obtains asymptotic results for operators constructed from the directed graphs, and (3) these asymptotic results lead to a natural algorithm for estimating the model. 3 It is worth noting that in the NLSY data set, high paying jobs are teacher, nurse and mechanic. This is due to the fact that the career paths observed stop at at age 36, which is relatively early in an individual’s career. 7 Generative Models that assume that data are sampled from a manifold are standard for undirected graphs, but to our knowledge, none have yet been proposed for directed graphs. When W is symmetric, it is natural to assume that it depends on the points’ proximity. For asymmetric affinities W , one must include an additional component to explain the asymmetry. In the asymptotic limit, this is tantamount to defining a vector field on the manifold. Algorithm We have used from [9] the idea of defining anisotropic kernels (indexed by α) in order to separate the density p and the manifold geometry M. Also, we adopted their general assumptions about the symmetric part of the kernel. As a consequence, the recovery algorithm for p and M is identical to theirs. However, insofar as the asymmetric part of the kernel is concerned, everything, starting from the definition and the introduction of the vector field r as a way to model the asymmetry, through the derivation of the asymptotic expression for the symmetric plus asymmetric kernel, is new. We go significantly beyond the elegant idea of [9] regarding the use of anisotropic kernels by analyzing the four distinct renormalizations possible for a given α, each of them combining different aspects of M, p and r. Only the successful (and novel) combination of two different anisotropic operators is able to recover the directional flow r. Algorithm 1 is natural, but we do not claim it is the only possible one in the context of our model. (α) For instance, we can also use Hsa to recover the operator · r (which empirically seems to have worse numerical properties than r · ). In the National Longitudinal Survery of Youth study, we were interested in the whole advective term, so we estimated it from a different combination of operators. Depending on the specific question, other features of the model could be obtained Limit Results Proposition 3.1 is a general result on the asymptotics of asymmetric kernels. Recovering the manifold and r is just one, albeit the most useful, of the many ways of exploiting these (0) results. For instance, Hsa is the limit operator of the operators used in [3] and [5]. The limit analysis could be extended to other digraph embedding algorithms such as [4, 6]. How general is our model? Any kernel can be decomposed into a symmetric and an asymmetric part, as we have done. The assumptions on the symmetric part h are standard. The paper of [7] goes one step further from these assumptions; we will discuss it in relationship with our work shortly. The more interesting question is how limiting are our assumptions regarding the choice of kernel, especially the asymmetric part, which we parameterized as a (x, y) = r/2 · (y − x)h (x, y) in (4). In the asymptotic limit, this choice turns out to be fully general, at least up to the identifiable aspects of the model. For a more detailed discussion of this issue, see [8]. In [7], Ting, Huang and Jordan presented asymptotic results for a general family of kernels that includes asymmetric and random kernels. Our k can be expressed in the notation of [7] by taking wx (y) ← 1+r(x, y)·(y−x), rx (y) ← 1, K0 ← h, h ← . Their assumptions are more general than the assumptions we make here, yet our model is general up to what can be identified from G alone. The distinction arises because [7] focuses on the graph construction methods from an observed sample of M, while we focus on explaining an observed directed graph G through a manifold generative process. Moreover, while the [7] results can be used to analyze data from directed graphs, they differ from our Proposition 3.1. Specifically, with respect to the limit in Theorem 3 from [7], we obtain the additional source terms f (x) · r and c(x)f (x) that follow from not enforcing (α) (α) conservation of mass while defining operators Hsa and Has . We applied our theory of directed graph embedding to the analysis of the career sequences in Section 5, but asymmetric affinity data abound in other social contexts, and in the physical and life sciences. Indeed, any “similarity” score that is obtained from a likelihood of the form Wvu =likelihood(u|v) is generally asymmetric. Hence our methods can be applied to study not only social networks, but also patterns of human movement, road traffic, and trade relations, as well as alignment scores in molecular biology. Finally, the physical interpretation of our model also makes it naturally applicable to physical models of flows. Acknowledgments This research was partially supported by NSW awards IIS-0313339 and IIS-0535100. 8 References [1] Belkin and Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15:1373–1396, 2002. [2] Nadler, Lafon, and Coifman. Diffusion maps, spectral clustering and eigenfunctions of fokker-planck operators. In Neural Information Processing Systems Conference, 2006. [3] Meila and Pentney. Clustering by weighted cuts in directed graphs. In SIAM Data Mining Conference, 2007. [4] Zhou, Huang, and Scholkopf. Learning from labeled and unlabeled data on a directed graph. In International Conference on Machine Learning, pages 1041–1048, 2005. [5] Zhou, Schlkopf, and Hofmann. Semi-supervised learning on directed graphs. In Advances in Neural Information Processing Systems, volume 17, pages 1633–1640, 2005. [6] Fan R. K. Chung. The diameter and laplacian eigenvalues of directed graphs. Electr. J. Comb., 13, 2006. [7] Ting, Huang, and Jordan. An analysis of the convergence of graph Laplacians. In International Conference on Machine Learning, 2010. [8] Dominique Perrault-Joncas and Marina Meil˘ . Directed graph embedding: an algorithm based on contina uous limits of laplacian-type operators. Technical Report TR 587, University of Washington - Department of Statistics, November 2011. [9] Coifman and Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21:6–30, 2006. [10] Mikhail Belkin and Partha Niyogi. Convergence of laplacian eigenmaps. preprint, short version NIPS 2008, 2008. [11] Coifman, Lafon, Lee, Maggioni, Warner, and Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. In Proceedings of the National Academy of Sciences, pages 7426–7431, 2005. [12] United States Department of Labor. National longitudinal survey of youth 1979 cohort. http://www.bls.gov/nls/, retrived October 2011. [13] Marc A. Scott. Affinity models for career sequences. Journal of the Royal Statistical Society: Series C (Applied Statistics), 60(3):417–436, 2011. 9
6 0.50059843 74 nips-2011-Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection
7 0.46527007 164 nips-2011-Manifold Precis: An Annealing Technique for Diverse Sampling of Manifolds
8 0.43344447 152 nips-2011-Learning in Hilbert vs. Banach Spaces: A Measure Embedding Viewpoint
9 0.43273187 38 nips-2011-Anatomically Constrained Decoding of Finger Flexion from Electrocorticographic Signals
10 0.43219802 225 nips-2011-Probabilistic amplitude and frequency demodulation
11 0.43151361 67 nips-2011-Data Skeletonization via Reeb Graphs
12 0.43083337 248 nips-2011-Semi-supervised Regression via Parallel Field Regularization
13 0.41552758 150 nips-2011-Learning a Distance Metric from a Network
14 0.4128952 54 nips-2011-Co-regularized Multi-view Spectral Clustering
15 0.4079465 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
16 0.39909288 108 nips-2011-Greedy Algorithms for Structurally Constrained High Dimensional Problems
17 0.39857203 253 nips-2011-Signal Estimation Under Random Time-Warpings and Nonlinear Signal Alignment
18 0.36700359 5 nips-2011-A Denoising View of Matrix Completion
19 0.36502951 179 nips-2011-Multilinear Subspace Regression: An Orthogonal Tensor Decomposition Approach
20 0.36128145 7 nips-2011-A Machine Learning Approach to Predict Chemical Reactions
topicId topicWeight
[(0, 0.035), (4, 0.061), (20, 0.034), (26, 0.02), (31, 0.068), (33, 0.033), (43, 0.04), (45, 0.095), (57, 0.041), (65, 0.015), (74, 0.06), (83, 0.047), (84, 0.011), (94, 0.275), (99, 0.067)]
simIndex simValue paperId paperTitle
same-paper 1 0.74097151 167 nips-2011-Maximum Covariance Unfolding : Manifold Learning for Bimodal Data
Author: Vijay Mahadevan, Chi W. Wong, Jose C. Pereira, Tom Liu, Nuno Vasconcelos, Lawrence K. Saul
Abstract: We propose maximum covariance unfolding (MCU), a manifold learning algorithm for simultaneous dimensionality reduction of data from different input modalities. Given high dimensional inputs from two different but naturally aligned sources, MCU computes a common low dimensional embedding that maximizes the cross-modal (inter-source) correlations while preserving the local (intra-source) distances. In this paper, we explore two applications of MCU. First we use MCU to analyze EEG-fMRI data, where an important goal is to visualize the fMRI voxels that are most strongly correlated with changes in EEG traces. To perform this visualization, we augment MCU with an additional step for metric learning in the high dimensional voxel space. Second, we use MCU to perform cross-modal retrieval of matched image and text samples from Wikipedia. To manage large applications of MCU, we develop a fast implementation based on ideas from spectral graph theory. These ideas transform the original problem for MCU, one of semidefinite programming, into a simpler problem in semidefinite quadratic linear programming. 1
2 0.53007603 186 nips-2011-Noise Thresholds for Spectral Clustering
Author: Sivaraman Balakrishnan, Min Xu, Akshay Krishnamurthy, Aarti Singh
Abstract: Although spectral clustering has enjoyed considerable empirical success in machine learning, its theoretical properties are not yet fully developed. We analyze the performance of a spectral algorithm for hierarchical clustering and show that on a class of hierarchically structured similarity matrices, this algorithm can tolerate noise that grows with the number of data points while still perfectly recovering the hierarchical clusters with high probability. We additionally improve upon previous results for k-way spectral clustering to derive conditions under which spectral clustering makes no mistakes. Further, using minimax analysis, we derive tight upper and lower bounds for the clustering problem and compare the performance of spectral clustering to these information theoretic limits. We also present experiments on simulated and real world data illustrating our results. 1
3 0.52162337 263 nips-2011-Sparse Manifold Clustering and Embedding
Author: Ehsan Elhamifar, René Vidal
Abstract: We propose an algorithm called Sparse Manifold Clustering and Embedding (SMCE) for simultaneous clustering and dimensionality reduction of data lying in multiple nonlinear manifolds. Similar to most dimensionality reduction methods, SMCE finds a small neighborhood around each data point and connects each point to its neighbors with appropriate weights. The key difference is that SMCE finds both the neighbors and the weights automatically. This is done by solving a sparse optimization problem, which encourages selecting nearby points that lie in the same manifold and approximately span a low-dimensional affine subspace. The optimal solution encodes information that can be used for clustering and dimensionality reduction using spectral clustering and embedding. Moreover, the size of the optimal neighborhood of a data point, which can be different for different points, provides an estimate of the dimension of the manifold to which the point belongs. Experiments demonstrate that our method can effectively handle multiple manifolds that are very close to each other, manifolds with non-uniform sampling and holes, as well as estimate the intrinsic dimensions of the manifolds. 1 1.1
4 0.52121627 303 nips-2011-Video Annotation and Tracking with Active Learning
Author: Carl Vondrick, Deva Ramanan
Abstract: We introduce a novel active learning framework for video annotation. By judiciously choosing which frames a user should annotate, we can obtain highly accurate tracks with minimal user effort. We cast this problem as one of active learning, and show that we can obtain excellent performance by querying frames that, if annotated, would produce a large expected change in the estimated object track. We implement a constrained tracker and compute the expected change for putative annotations with efficient dynamic programming algorithms. We demonstrate our framework on four datasets, including two benchmark datasets constructed with key frame annotations obtained by Amazon Mechanical Turk. Our results indicate that we could obtain equivalent labels for a small fraction of the original cost. 1
5 0.51885158 102 nips-2011-Generalised Coupled Tensor Factorisation
Author: Kenan Y. Yılmaz, Ali T. Cemgil, Umut Simsekli
Abstract: We derive algorithms for generalised tensor factorisation (GTF) by building upon the well-established theory of Generalised Linear Models. Our algorithms are general in the sense that we can compute arbitrary factorisations in a message passing framework, derived for a broad class of exponential family distributions including special cases such as Tweedie’s distributions corresponding to βdivergences. By bounding the step size of the Fisher Scoring iteration of the GLM, we obtain general updates for real data and multiplicative updates for non-negative data. The GTF framework is, then extended easily to address the problems when multiple observed tensors are factorised simultaneously. We illustrate our coupled factorisation approach on synthetic data as well as on a musical audio restoration problem. 1
6 0.5151692 180 nips-2011-Multiple Instance Filtering
7 0.51513225 127 nips-2011-Image Parsing with Stochastic Scene Grammar
8 0.51493096 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes
9 0.51450855 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
10 0.51419365 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries
11 0.51364386 43 nips-2011-Bayesian Partitioning of Large-Scale Distance Data
12 0.51340187 276 nips-2011-Structured sparse coding via lateral inhibition
13 0.51337659 66 nips-2011-Crowdclustering
14 0.51299304 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
15 0.51261067 150 nips-2011-Learning a Distance Metric from a Network
16 0.51225334 227 nips-2011-Pylon Model for Semantic Segmentation
17 0.51207262 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
18 0.51199973 144 nips-2011-Learning Auto-regressive Models from Sequence and Non-sequence Data
19 0.51099563 258 nips-2011-Sparse Bayesian Multi-Task Learning
20 0.51053035 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration