nips nips2011 nips2011-54 knowledge-graph by maker-knowledge-mining

54 nips-2011-Co-regularized Multi-view Spectral Clustering


Source: pdf

Author: Abhishek Kumar, Piyush Rai, Hal Daume

Abstract: In many clustering problems, we have access to multiple views of the data, each of which could be individually used for clustering. Exploiting information from multiple views, one can hope to find a clustering that is more accurate than the ones obtained using the individual views. Often these different views admit the same underlying clustering of the data, so we can approach this problem by looking for clusterings that are consistent across the views, i.e., corresponding data points in each view should have the same cluster membership. We propose a spectral clustering framework that achieves this goal by co-regularizing the clustering hypotheses, and propose two co-regularization schemes to accomplish this. Experimental comparisons with a number of baselines on two synthetic and three real-world datasets establish the efficacy of our proposed approaches.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 In many clustering problems, we have access to multiple views of the data, each of which could be individually used for clustering. [sent-10, score-0.97]

2 Exploiting information from multiple views, one can hope to find a clustering that is more accurate than the ones obtained using the individual views. [sent-11, score-0.486]

3 Often these different views admit the same underlying clustering of the data, so we can approach this problem by looking for clusterings that are consistent across the views, i.e., corresponding data points in each view should have the same cluster membership. [sent-12, score-1.061]

4 We propose a spectral clustering framework that achieves this goal by co-regularizing the clustering hypotheses, and propose two co-regularization schemes to accomplish this. [sent-15, score-1.253]

5 1 Introduction Many real-world datasets have representations in the form of multiple views [1, 2]. [sent-17, score-0.583]

6 Although these individual views might be sufficient on their own for a given learning task, they can often provide complementary information to each other which can lead to improved performance on the learning task at hand. [sent-19, score-0.536]

7 Out of the numerous clustering algorithms, spectral clustering has gained considerable attention in the recent past due to its strong performance on arbitrarily shaped clusters and its well-defined mathematical framework [3]. [sent-21, score-0.432]

8 Spectral clustering is accomplished by constructing a graph from the data points with edges between them representing the similarities, and solving a relaxation of the normalized min-cut problem on this graph [4]. [sent-22, score-0.568]

9 For the multi-view clustering problem, we work with the assumption that the true underlying clustering would assign corresponding points in each view to the same cluster. [sent-23, score-0.997]

10 Given this assumption, we can approach the multi-view clustering problem by limiting our search to clusterings that are compatible across the graphs defined over each of the views: corresponding nodes in each graph should have the same cluster membership. [sent-24, score-0.749]

11 In this paper, we propose two spectral clustering algorithms that achieve this goal by co-regularizing the clustering hypotheses across views. [sent-25, score-1.312]

12 We propose novel spectral clustering objective functions that implicitly combine graphs from multiple views of the data to achieve a better clustering. [sent-27, score-1.473]

13 Our proposed methods give us a way to combine multiple kernels (or similarity matrices) for the clustering problem. [sent-28, score-0.575]

14 Let {x1(v), . . . , xn(v)} denote the examples in view v and K(v) denote the similarity or kernel matrix of these examples in this view. [sent-36, score-0.388]

15 The single-view spectral clustering algorithm of [5] solves the following optimization problem for the normalized graph Laplacian L(v): max_{U(v) ∈ R^(n×k)} tr(U(v)^T L(v) U(v)), s.t. U(v)^T U(v) = I. [sent-38, score-1.194]
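As a point of reference for the objective above, here is a minimal single-view spectral clustering sketch. The RBF similarity and the k-means discretization via scikit-learn are our assumptions (the extract does not specify them); the embedding step takes the top-k eigenvectors of the normalized Laplacian, which is exactly the maximizer of tr(U(v)^T L(v) U(v)) under the orthonormality constraint.

```python
# Minimal sketch of single-view spectral clustering in the max-trace form above.
# Assumptions not in the text: RBF similarity, scikit-learn KMeans for the final step.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

def top_k_eigvecs(M, k):
    """argmax of tr(U^T M U) subject to U^T U = I: top-k eigenvectors of M."""
    _, vecs = np.linalg.eigh(M)          # eigenvalues in ascending order
    return vecs[:, -k:]

def normalized_laplacian(K):
    d_inv_sqrt = 1.0 / np.sqrt(K.sum(axis=1))
    return (K * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]   # D^{-1/2} K D^{-1/2}

def single_view_spectral_clustering(X, k, gamma=1.0):
    L = normalized_laplacian(rbf_kernel(X, gamma=gamma))
    U = top_k_eigvecs(L, k)
    U = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)  # row-normalize (common choice)
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)
```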

16 For a detailed introduction to both theoretical and practical aspects of spectral clustering, the reader is referred to [3]. [sent-42, score-0.412]

17 Our multi-view spectral clustering framework builds on the standard spectral clustering with a single view, by appealing to the co-regularization framework typically used in the semi-supervised learning literature [1]. [sent-43, score-1.688]

18 Co-regularization in semi-supervised learning essentially works by making the hypotheses learned from different views of the data agree with each other on unlabeled data [6]. [sent-44, score-0.638]

19 The framework employs two main assumptions for its success: (a) the true target functions in each view should agree on the labels for the unlabeled data (compatibility), and (b) the views are independent given the class label (conditional independence). [sent-45, score-0.788]

20 In the case of clustering, this would mean that a data point in both views would be assigned to the correct cluster with high probability. [sent-49, score-0.592]

21 Here, we propose two co-regularization based approaches to make the clustering hypotheses on different graphs (i.e., views) agree with each other. [sent-50, score-0.525]

22 The effectiveness of spectral clustering hinges crucially on the construction of the graph Laplacian and the resulting eigenvectors that reflect the cluster structure in the data. [sent-53, score-1.081]

23 Therefore, we construct an objective function that consists of the graph Laplacians from all the views of the data and regularize on the eigenvectors of the Laplacians such that the cluster structures resulting from each Laplacian look consistent across all the views. [sent-54, score-0.899]

24 Our first co-regularization scheme enforces that the eigenvectors U(v) and U(w) of a view pair (v, w) should have high pairwise similarity (using a pairwise co-regularization criterion we will define in Section 2.1). [sent-56, score-0.488]

25 The idea is different from previously proposed consensus clustering approaches [7] that commit to individual clusterings in the first step and then combine them into a consensus in the second step. [sent-60, score-0.825]

26 2.1 Pairwise Co-regularization. In standard spectral clustering, the eigenvector matrix U(v) is the data representation for the subsequent k-means clustering step (with the i-th row mapping to the original i-th sample). [sent-63, score-0.821]

27 This amounts to enforcing the spectral clustering hypotheses (which are based on the U(·) ’s) to be the same across all the views. [sent-65, score-0.903]

28 First, the similarity measure (or kernel) used in the Laplacian for spectral clustering has already taken care of the non-linearities present in the data (if any), and the embedding U(·), being a matrix of real-valued cluster indicators, can be considered to obey linear similarities. [sent-76, score-0.976]

29 Substituting this in Equation 2 and ignoring the constant additive and scaling terms that depend on the number of clusters, we get D(U(v), U(w)) = −tr(U(v) U(v)^T U(w) U(w)^T). We want to minimize the above disagreement between the clusterings of views v and w. [sent-79, score-0.648]
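A small sketch of this disagreement term; the identity −tr(U(v) U(v)^T U(w) U(w)^T) = −‖U(v)^T U(w)‖_F^2 makes it cheap to evaluate.

```python
# Pairwise disagreement between two spectral embeddings, as in the expression above.
import numpy as np

def disagreement(U_v, U_w):
    # -tr(U_v U_v^T U_w U_w^T) == -||U_v^T U_w||_F^2
    return -np.linalg.norm(U_v.T @ U_w, ord="fro") ** 2
```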

30 Combining this with the spectral clustering objectives of the individual views, we get the following joint maximization problem for two graphs (Equation 3): max_{U(v), U(w) ∈ R^(n×k)} tr(U(v)^T L(v) U(v)) + tr(U(w)^T L(w) U(w)) + λ tr(U(v) U(v)^T U(w) U(w)^T), s.t. [sent-80, score-1.241]

31 U(v)^T U(v) = I, U(w)^T U(w) = I. The hyperparameter λ trades off the spectral clustering objectives and the spectral embedding (dis)agreement term. [sent-82, score-1.262]

32 This is a standard spectral clustering objective on view v with the modified graph Laplacian L(v) + λ U(w) U(w)^T (Equation 4). [sent-90, score-1.109]

33 The difference from standard kernel combination (kernel addition, for example) is that the combination is adaptive since U(w) keeps getting updated at each step, as guided by the clustering algorithm. [sent-92, score-0.545]
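A hedged sketch of this alternating maximization for the two-view objective in Equation 3: with one embedding held fixed, the other is recomputed as the top-k eigenvectors of its modified Laplacian. The single-view initialization follows the text; the fixed iteration count is a simplification of ours (in practice one would monitor the objective for convergence).

```python
# Sketch of alternating maximization for the two-view objective (Equation 3).
# Inputs are normalized Laplacians in the max-trace form used above.
import numpy as np

def top_k_eigvecs(M, k):
    _, vecs = np.linalg.eigh(M)          # ascending eigenvalues
    return vecs[:, -k:]

def coregularized_two_views(L_v, L_w, k, lam=0.01, n_iter=20):
    U_v = top_k_eigvecs(L_v, k)          # single-view initializations
    U_w = top_k_eigvecs(L_w, k)
    for _ in range(n_iter):
        U_v = top_k_eigvecs(L_v + lam * (U_w @ U_w.T), k)   # update view v (Equation 4)
        U_w = top_k_eigvecs(L_w + lam * (U_v @ U_v.T), k)   # symmetric update for view w
    return U_v, U_w                      # either one can feed the final k-means step
```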

34 Note that we can use either U(v) or U(w) in the final k-means step of the spectral clustering algorithm. [sent-107, score-0.821]

35 In our experiments, we note a marginal difference in the clustering performance depending on which U(·) is used in the final step of k-means clustering. [sent-108, score-0.409]

36 Extension to Multiple Views. We can extend the co-regularized spectral clustering proposed in the previous section to more than two views. [sent-110, score-0.821]

37 In the multi-view objective (Equation 6), each U(v) is optimized subject to U(v)^T U(v) = I, with the pairwise co-regularization terms summed over the other views 1 ≤ w ≤ m, w ≠ v. We initialize all U(v), 2 ≤ v ≤ m, by solving the spectral clustering problem for the single views. [sent-122, score-0.842]

38 The optimization is then cycled over all views while keeping the previously obtained U(·) ’s fixed. [sent-125, score-0.536]
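The same idea cycled over m views, as described above: before recomputing a view's eigenvectors, its Laplacian is augmented with the projectors of all the other views. The function name, regularization weight, and number of cycles are ours.

```python
# Sketch of the pairwise co-regularization scheme for m views (Equation 6).
import numpy as np

def top_k_eigvecs(M, k):
    _, vecs = np.linalg.eigh(M)
    return vecs[:, -k:]

def coregularized_pairwise(laplacians, k, lam=0.01, n_cycles=10):
    m = len(laplacians)
    U = [top_k_eigvecs(L, k) for L in laplacians]     # single-view initialization
    for _ in range(n_cycles):
        for v in range(m):                            # cycle over views, others held fixed
            reg = sum(U[w] @ U[w].T for w in range(m) if w != v)
            U[v] = top_k_eigvecs(laplacians[v] + lam * reg, k)
    return U
```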

39 In contrast with the pairwise regularization approach, which has m(m−1)/2 pairwise regularization terms (where m is the number of views), the centroid-based regularization scheme has m regularization terms. [sent-128, score-0.469]

40 The constraints are U(v)^T U(v) = I for all 1 ≤ v ≤ V, and U∗^T U∗ = I. This objective tries to balance a trade-off between the individual spectral clustering objectives and the agreement of each of the view-specific eigenvector matrices U(v) with the consensus eigenvectors U∗. [sent-134, score-1.309]

41 It is easy to see that with all other view-specific eigenvectors and the consensus U∗ fixed, optimizing U(v) for view v amounts to solving the following: max_{U(v) ∈ R^(n×k)} tr(U(v)^T L(v) U(v)) + λv tr(U(v) U(v)^T U∗ U∗^T), s.t. [sent-137, score-0.666]

42 U(v)^T U(v) = I (Equation 8), which is equivalent to solving the standard spectral clustering objective for U(v) with a modified Laplacian L(v) + λv U∗ U∗^T. [sent-139, score-0.899]

43 U∗^T U∗ = I (Equation 10), which is equivalent to solving the standard spectral clustering objective for U∗ with a modified Laplacian Σv λv U(v) U(v)^T. [sent-144, score-0.899]
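A sketch of the centroid-based scheme summarized above: each U(v) is refit against the consensus via the modified Laplacian L(v) + λv U∗ U∗^T (Equation 8), and U∗ is the top-k eigenvector matrix of Σv λv U(v) U(v)^T (Equation 10). The default weights and iteration count are placeholders of ours.

```python
# Sketch of centroid-based co-regularization (Equations 8 and 10).
import numpy as np

def top_k_eigvecs(M, k):
    _, vecs = np.linalg.eigh(M)
    return vecs[:, -k:]

def coregularized_centroid(laplacians, k, lams=None, n_iter=10):
    m = len(laplacians)
    lams = lams if lams is not None else [0.01] * m    # per-view weights (placeholders)
    U = [top_k_eigvecs(L, k) for L in laplacians]      # single-view initialization
    U_star = top_k_eigvecs(sum(l * (Uv @ Uv.T) for l, Uv in zip(lams, U)), k)
    for _ in range(n_iter):
        for v in range(m):                             # Equation 8: per-view update
            U[v] = top_k_eigvecs(laplacians[v] + lams[v] * (U_star @ U_star.T), k)
        # Equation 10: consensus update
        U_star = top_k_eigvecs(sum(l * (Uv @ Uv.T) for l, Uv in zip(lams, U)), k)
    return U, U_star
```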

44 One possible downside of the centroid-based co-regularization approach is that noisy views could potentially affect the optimal U∗ as it depends on all the views. [sent-149, score-0.51]

45 If it is a priori known that some views are noisy, then it is advisable to use a small value of λv for such views, so as to prevent them from adversely affecting U∗ . [sent-151, score-0.535]

46 3 Experiments. We compare both of our co-regularization-based multi-view spectral clustering approaches with a number of baselines. [sent-152, score-0.851]

47 • Best Single View: running standard spectral clustering on the most informative view, i.e., one that achieves the best spectral clustering performance using a single view of the data. [sent-155, score-1.0]

48 • Feature Concatenation: Concatenating the features of each view, and then running standard spectral clustering using the graph Laplacian derived from the joint view representation of the data. [sent-156, score-1.052]

49 • Kernel Addition: Combining different kernels by adding them, and then running standard spectral clustering on the corresponding Laplacian. [sent-157, score-0.863]

50 • Kernel Product (element-wise): Multiplying the corresponding entries of kernels and applying standard spectral clustering on the resultant Laplacian. [sent-161, score-0.863]

51 However, in our experiments, we use different width parameters for different views, so the performance of kernel product may not be directly comparable to feature concatenation. [sent-163, score-0.672]
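For reference, the kernel addition and element-wise kernel product baselines reduce to the small sketch below; the combined kernel is then passed to ordinary single-view spectral clustering.

```python
# Sketch of the kernel addition / element-wise product baselines.
import numpy as np

def combine_kernels(kernels, mode="add"):
    K = np.zeros_like(kernels[0]) if mode == "add" else np.ones_like(kernels[0])
    for K_v in kernels:
        K = K + K_v if mode == "add" else K * K_v
    return K   # feed to standard single-view spectral clustering
```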

52 • CCA based Feature Extraction: Applying CCA for feature fusion from multiple views of the data [10], and then running spectral clustering using these extracted features. [sent-164, score-1.408]

53 We apply both standard CCA and kernel CCA for feature extraction and report the clustering results for whichever gives the best performance. [sent-165, score-0.571]
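A sketch of the CCA feature-fusion baseline using scikit-learn's linear CCA; averaging the two projected views is our assumption about the fusion step, and kernel CCA (also mentioned above) is omitted since it is not part of scikit-learn.

```python
# Sketch of the CCA-based feature extraction baseline (linear CCA only).
from sklearn.cross_decomposition import CCA

def cca_fused_features(X1, X2, n_components=10):
    Z1, Z2 = CCA(n_components=n_components).fit_transform(X1, X2)
    return (Z1 + Z2) / 2.0   # simple fusion of the two projections (our assumption)
```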

54 • Minimizing-Disagreement Spectral Clustering: Our last baseline is the minimizing-disagreement approach to spectral clustering [11], and is perhaps the most closely related to our co-regularization-based approach to spectral clustering. [sent-166, score-1.347]

55 For datasets with more than 2 views, we have also explicitly mentioned the number of views in parentheses. [sent-170, score-0.532]

56 • Synthetic data 1: Our first synthetic dataset consists of two views and is generated in a manner akin to [12], which first chooses the cluster ci each sample belongs to, and then generates each of the views xi(1) and xi(2) from a two-component Gaussian mixture model. [sent-173, score-1.233]

57 These views are combined to form the sample (xi(1), xi(2), ci). [sent-174, score-0.51]

58 The cluster means in view 1 are µ1(1) = (1, 1) and µ2(1) = (2, 2), and in view 2 they are µ1(2) = (2, 2) and µ2(2) = (1, 1). [sent-176, score-0.44]

59 The covariances for the two views are given below. [sent-177, score-0.51]
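A hedged generator for Synthetic Data 1: the cluster means below are those stated in the text, but the covariance matrices referred to above are not reproduced in this extract, so isotropic noise stands in as a placeholder.

```python
# Generator for Synthetic Data 1 (two views, two clusters). The covariances are
# placeholders; the actual matrices are not reproduced in this extract.
import numpy as np

def synthetic_data_1(n=500, sigma=0.5, seed=0):
    rng = np.random.default_rng(seed)
    means_v1 = np.array([[1.0, 1.0], [2.0, 2.0]])   # view-1 cluster means
    means_v2 = np.array([[2.0, 2.0], [1.0, 1.0]])   # view-2 cluster means
    c = rng.integers(0, 2, size=n)                  # cluster label per sample
    x1 = means_v1[c] + sigma * rng.standard_normal((n, 2))
    x2 = means_v2[c] + sigma * rng.standard_normal((n, 2))
    return x1, x2, c
```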

60 The cluster means in view 1 are µ1(1) = (1, 1) and µ2(1) = (3, 4); in view 2 they are µ1(2) = (1, 2) and µ2(2) = (2, 2); and in view 3 they are µ1(3) = (1, 1) and µ2(3) = (3, 3). [sent-192, score-0.619]

61 The covariances for the three views are given below. [sent-193, score-0.51]

62 Since spectral clustering essentially works with similarities of the data, we first project the data using Latent Semantic Analysis (LSA) [16] to a 100-dimensional space and compute similarities in this lower dimensional space. [sent-220, score-0.963]
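A sketch of this preprocessing step for the text views, assuming a TF-IDF document-term matrix as input; the choice of cosine similarity in the 100-dimensional LSA space is our assumption (the text only says similarities are computed there).

```python
# LSA projection to 100 dimensions followed by a similarity matrix.
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def lsa_similarity(X_tfidf, dim=100):
    Z = TruncatedSVD(n_components=dim).fit_transform(X_tfidf)
    return cosine_similarity(Z)   # similarities in the lower-dimensional space
```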

63 We report results for our co-regularized spectral clustering in the two-, three-, and four-view cases. [sent-227, score-1.331]

64 We use normalized mutual information (NMI) as the clustering quality evaluation measure, which gives the mutual information between the obtained clustering and the true clustering, normalized by the cluster entropies. [sent-228, score-1.423]
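The NMI measure described above is available in scikit-learn; label permutations do not affect the score. Note that recent scikit-learn versions default to arithmetic-mean normalization, so passing average_method="geometric" is closer to the normalization-by-cluster-entropies convention mentioned in the text (our reading; check your version).

```python
# NMI between a predicted clustering and the ground-truth labels.
from sklearn.metrics import normalized_mutual_info_score

true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [1, 1, 0, 0, 2, 2]            # a label permutation of the truth
print(normalized_mutual_info_score(true_labels, pred_labels,
                                   average_method="geometric"))   # -> 1.0
```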

65 For two-view synthetic data (Synthetic Data 1), both the co-regularized spectral clustering approaches outperform all the baselines by a significant margin, with the pairwise approach doing marginally better than the centroid-based approach. [sent-239, score-1.105]

66 In fact, it reduces the performance when the third view is added, so we report the performance with only two views for feature concatenation. [sent-245, score-0.715]

67 Kernel addition with three views gives a good improvement over the single-view case. [sent-246, score-0.717]

68 As compared to other baselines (with two views), both our co-regularized spectral clustering approaches with two views perform better. [sent-247, score-1.45]

69 For the document clustering results on Reuters multilingual data, English and French languages are used as the two views. [sent-249, score-0.56]

70 The next best performance is attained by minimum-disagreement spectral clustering [11] approach. [sent-251, score-0.821]

71 The numbers (2), (3) and (4) indicate the number of views used in our co-regularized spectral clustering approach. [sent-346, score-1.331]

72 Other multi-view baselines were run with the maximum number of views available (or the maximum number of views they can handle). [sent-347, score-1.109]

73 On the other hand, both of our co-regularized spectral clustering approaches with two views outperform the single view case. [sent-353, score-1.563]

74 For the Caltech-101 dataset, the performance of the pairwise approach consistently improved as we added the third and fourth views. [sent-354, score-0.584]

75 On the other hand, the performance of the centroid-based approach got slightly worse upon adding the third view (possibly because this view is noisy, which affects the learned U∗); however, adding the fourth view brought the performance close to that of the pairwise case. [sent-355, score-0.665]

76 Figure 1: NMI scores of co-regularized spectral clustering as a function of λ for (a) Reuters multilingual data and (b) Caltech-101 data. We also experiment with various values of the co-regularization parameter λ and observe its effect on clustering performance. [sent-384, score-0.519]

77 For the range of λ shown in the plot, the NMI for co-regularized spectral clustering is greater than that of the closest baseline for most of the λ values. [sent-400, score-0.873]

78 4 Related Work. A number of clustering algorithms have been proposed in the past to learn with multiple views of the data. [sent-402, score-0.97]

79 Some of them first extract a set of shared features from the multiple views and then apply any off-the-shelf clustering algorithm such as k-means on these features. [sent-403, score-0.97]

80 Alternatively, some other approaches exploit the multiple views of the data as part of the clustering algorithm itself. [sent-405, score-1.0]

81 For example, [19] proposed a Co-EM based framework for multi-view clustering in mixture models. [sent-406, score-0.432]

82 Multi-view clustering algorithms have also been proposed in the framework of spectral clustering [11, 20, 21]. [sent-410, score-1.253]

83 [11] approaches the problem of two-view clustering by constructing a bipartite graph from nodes of both views. [sent-413, score-0.524]

84 Subsequently, they solve the standard spectral clustering problem on this bipartite graph. [sent-415, score-0.854]

85 In [21], a co-training based framework is proposed where the similarity matrix of one view is constrained by the eigenvectors of the Laplacian in the other view. [sent-416, score-0.401]

86 Consensus clustering approaches can also be applied to the problem of multi-view clustering [7]. [sent-418, score-0.848]

87 5 Discussion. We proposed a multi-view clustering approach in the framework of spectral clustering. [sent-421, score-0.844]

88 The approach uses the philosophy of co-regularization to make the clusterings in different views agree with each other. [sent-422, score-0.673]

89 To the best of our knowledge, this is the first work to apply the idea to the problem of unsupervised learning, in particular to spectral clustering. [sent-424, score-0.437]

90 Co-regularized spectral clustering has a joint optimization function for the spectral embeddings of all the views. [sent-425, score-1.273]

91 An alternating maximization framework reduces the problem to the standard spectral clustering objective, which is efficiently solvable using state-of-the-art eigensolvers. [sent-426, score-0.993]

92 It is possible to extend the proposed framework to the case where some of the views have missing data. [sent-427, score-0.57]

93 One possible approach to estimating the missing entries could be to simply average the similarities from the views in which the data point is available. [sent-430, score-0.618]

94 It is also possible to assign weights to different views in the proposed objective function as done in [20], if we have some a priori knowledge about the informativeness of the views. [sent-433, score-0.592]

95 Our co-regularization based framework can also be applied to other unsupervised problems such as spectral methods for dimensionality reduction. [sent-434, score-0.46]

96 For example, the Kernel PCA algorithm [23] can be extended to work with multiple views by defining, for each view, its own Kernel PCA objective function and adding a regularizer that enforces the embeddings to be similar across all views (e. [sent-435, score-1.434]

97 There has been very little prior work analyzing spectral clustering methods. [sent-439, score-0.821]

98 For instance, there has been some work on consistency analysis of single-view spectral clustering [24], which provides results about the rate of convergence as the sample size increases, using tools from the theory of linear operators and empirical processes. [sent-440, score-1.0]

99 Similar convergence properties could be studied for multi-view spectral clustering. [sent-441, score-0.412]

100 Learning from multiple partially observed views - an application to multilingual text categorization. [sent-494, score-0.671]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('views', 0.51), ('spectral', 0.412), ('clustering', 0.409), ('nmi', 0.183), ('view', 0.179), ('kernel', 0.136), ('cca', 0.134), ('eigenvectors', 0.126), ('consensus', 0.124), ('ku', 0.118), ('clusterings', 0.112), ('multilingual', 0.11), ('tr', 0.108), ('laplacian', 0.1), ('baselines', 0.089), ('concatenation', 0.088), ('reuters', 0.082), ('cluster', 0.082), ('centroid', 0.075), ('pairwise', 0.074), ('similarity', 0.073), ('similarities', 0.071), ('synthetic', 0.068), ('objective', 0.057), ('abhishek', 0.055), ('graph', 0.052), ('hypotheses', 0.052), ('agree', 0.051), ('alternating', 0.051), ('multiple', 0.051), ('co', 0.048), ('handwritten', 0.046), ('documents', 0.044), ('regularization', 0.043), ('rn', 0.042), ('kernels', 0.042), ('coregularization', 0.042), ('minimizingdisagreement', 0.042), ('synth', 0.042), ('hal', 0.041), ('languages', 0.041), ('maximization', 0.041), ('embeddings', 0.04), ('english', 0.038), ('french', 0.038), ('missing', 0.037), ('enforces', 0.036), ('akin', 0.035), ('peak', 0.034), ('graphs', 0.034), ('normalized', 0.034), ('piyush', 0.034), ('maryland', 0.034), ('bipartite', 0.033), ('corpus', 0.032), ('ulrike', 0.031), ('weighing', 0.031), ('digits', 0.031), ('baseline', 0.03), ('across', 0.03), ('laplacians', 0.03), ('mikhail', 0.03), ('compatible', 0.03), ('uci', 0.03), ('approaches', 0.03), ('starts', 0.029), ('objectives', 0.029), ('addition', 0.028), ('dataset', 0.028), ('th', 0.028), ('daum', 0.027), ('translations', 0.027), ('disagreement', 0.026), ('parentheses', 0.026), ('feature', 0.026), ('keeping', 0.026), ('individual', 0.026), ('worse', 0.026), ('semisupervised', 0.026), ('unlabeled', 0.025), ('unsupervised', 0.025), ('priori', 0.025), ('frobenius', 0.024), ('md', 0.024), ('mutual', 0.023), ('outperform', 0.023), ('reaches', 0.023), ('framework', 0.023), ('informative', 0.023), ('datasets', 0.022), ('equation', 0.022), ('von', 0.022), ('compatibility', 0.022), ('noted', 0.022), ('closest', 0.022), ('park', 0.021), ('regularize', 0.021), ('look', 0.021), ('solving', 0.021), ('pca', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999964 54 nips-2011-Co-regularized Multi-view Spectral Clustering

Author: Abhishek Kumar, Piyush Rai, Hal Daume

Abstract: In many clustering problems, we have access to multiple views of the data each of which could be individually used for clustering. Exploiting information from multiple views, one can hope to find a clustering that is more accurate than the ones obtained using the individual views. Often these different views admit same underlying clustering of the data, so we can approach this problem by looking for clusterings that are consistent across the views, i.e., corresponding data points in each view should have same cluster membership. We propose a spectral clustering framework that achieves this goal by co-regularizing the clustering hypotheses, and propose two co-regularization schemes to accomplish this. Experimental comparisons with a number of baselines on two synthetic and three real-world datasets establish the efficacy of our proposed approaches.

2 0.40592444 186 nips-2011-Noise Thresholds for Spectral Clustering

Author: Sivaraman Balakrishnan, Min Xu, Akshay Krishnamurthy, Aarti Singh

Abstract: Although spectral clustering has enjoyed considerable empirical success in machine learning, its theoretical properties are not yet fully developed. We analyze the performance of a spectral algorithm for hierarchical clustering and show that on a class of hierarchically structured similarity matrices, this algorithm can tolerate noise that grows with the number of data points while still perfectly recovering the hierarchical clusters with high probability. We additionally improve upon previous results for k-way spectral clustering to derive conditions under which spectral clustering makes no mistakes. Further, using minimax analysis, we derive tight upper and lower bounds for the clustering problem and compare the performance of spectral clustering to these information theoretic limits. We also present experiments on simulated and real world data illustrating our results. 1

3 0.16416098 198 nips-2011-On U-processes and clustering performance

Author: Stéphan J. Clémençon

Abstract: Many clustering techniques aim at optimizing empirical criteria that are of the form of a U-statistic of degree two. Given a measure of dissimilarity between pairs of observations, the goal is to minimize the within cluster point scatter over a class of partitions of the feature space. It is the purpose of this paper to define a general statistical framework, relying on the theory of U-processes, for studying the performance of such clustering methods. In this setup, under adequate assumptions on the complexity of the subsets forming the partition candidates, the excess of clustering risk is proved to be of the order O_P(1/√n). Based on recent results related to the tail behavior of degenerate U-processes, it is also shown how to establish tighter rate bounds. Model selection issues, related to the number of clusters forming the data partition in particular, are also considered.

4 0.16249335 176 nips-2011-Multi-View Learning of Word Embeddings via CCA

Author: Paramveer Dhillon, Dean P. Foster, Lyle H. Ungar

Abstract: Recently, there has been substantial interest in using large amounts of unlabeled data to learn word representations which can then be used as features in supervised classifiers for NLP tasks. However, most current approaches are slow to train, do not model the context of the word, and lack theoretical grounding. In this paper, we present a new learning method, Low Rank Multi-View Learning (LR-MVL) which uses a fast spectral method to estimate low dimensional context-specific word representations from unlabeled data. These representation features can then be used with any supervised learner. LR-MVL is extremely fast, gives guaranteed convergence to a global optimum, is theoretically elegant, and achieves state-of-the-art performance on named entity recognition (NER) and chunking problems.

5 0.15320548 119 nips-2011-Higher-Order Correlation Clustering for Image Segmentation

Author: Sungwoong Kim, Sebastian Nowozin, Pushmeet Kohli, Chang D. Yoo

Abstract: For many state-of-the-art computer vision algorithms, image segmentation is an important preprocessing step. Several image segmentation algorithms have been proposed, but they are often used with some reservation because of their high computational load and many hand-tuned parameters. Correlation clustering, a graph-partitioning algorithm often used in natural language processing and document clustering, has the potential to perform better than previously proposed image segmentation algorithms. We improve the basic correlation clustering formulation by taking into account higher-order cluster relationships. This improves clustering in the presence of local boundary ambiguities. We first apply pairwise correlation clustering to image segmentation over a pairwise superpixel graph and then develop higher-order correlation clustering over a hypergraph that considers higher-order relations among superpixels. Fast inference is possible via linear programming relaxation, and effective parameter learning is possible within a structured support vector machine framework. Experimental results on various datasets show that the proposed higher-order correlation clustering outperforms other state-of-the-art image segmentation algorithms.
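To make the inference step concrete, here is a minimal sketch (not the authors' code) of the standard LP relaxation of pairwise correlation clustering on a small weighted graph; the function name and the use of SciPy's linprog are illustrative assumptions, and the higher-order hypergraph terms and structured-SVM learning described above are not included.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def correlation_clustering_lp(weights):
    """LP relaxation of pairwise correlation clustering.

    weights[i, j] > 0 means nodes i and j prefer the same cluster,
    weights[i, j] < 0 means they prefer different clusters.  The LP variable
    x_ij in [0, 1] relaxes the indicator "i and j are separated".
    """
    n = weights.shape[0]
    pairs = list(itertools.combinations(range(n), 2))
    index = {p: k for k, p in enumerate(pairs)}

    # Up to an additive constant, the disagreement cost is linear with
    # coefficient weights[i, j] on x_ij.
    c = np.array([weights[i, j] for (i, j) in pairs])

    # Triangle inequalities x_ik <= x_ij + x_jk keep x consistent with a partition.
    rows = []
    for i, j, k in itertools.permutations(range(n), 3):
        if i < k:  # (i, k) and (k, i) give the same inequality
            row = np.zeros(len(pairs))
            row[index[(i, k)]] = 1.0
            row[index[tuple(sorted((i, j)))]] -= 1.0
            row[index[tuple(sorted((j, k)))]] -= 1.0
            rows.append(row)

    res = linprog(c, A_ub=np.array(rows), b_ub=np.zeros(len(rows)),
                  bounds=[(0.0, 1.0)] * len(pairs), method="highs")
    return {p: res.x[k] for k, p in enumerate(pairs)}

# Example: three mutually attractive nodes plus one node they all repel.
W = np.array([[0, 2, 2, -1], [2, 0, 2, -1], [2, 2, 0, -1], [-1, -1, -1, 0]], dtype=float)
print(correlation_clustering_lp(W))
```

In practice the fractional LP solution is rounded to an actual partition; the higher-order variant in the paper additionally scores hyperedges over groups of superpixels.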

6 0.12384599 263 nips-2011-Sparse Manifold Clustering and Embedding

7 0.12306175 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model

8 0.10718893 155 nips-2011-Learning to Agglomerate Superpixel Hierarchies

9 0.10304049 66 nips-2011-Crowdclustering

10 0.090798326 47 nips-2011-Beyond Spectral Clustering - Tight Relaxations of Balanced Graph Cuts

11 0.090280674 71 nips-2011-Directed Graph Embedding: an Algorithm based on Continuous Limits of Laplacian-type Operators

12 0.081992447 157 nips-2011-Learning to Search Efficiently in High Dimensions

13 0.080208145 140 nips-2011-Kernel Embeddings of Latent Tree Graphical Models

14 0.075298384 189 nips-2011-Non-parametric Group Orthogonal Matching Pursuit for Sparse Learning with Multiple Kernels

15 0.074410558 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition

16 0.073685631 167 nips-2011-Maximum Covariance Unfolding : Manifold Learning for Bimodal Data

17 0.073008232 267 nips-2011-Spectral Methods for Learning Multivariate Latent Tree Structure

18 0.070785984 213 nips-2011-Phase transition in the family of p-resistances

19 0.070761457 171 nips-2011-Metric Learning with Multiple Kernels

20 0.068888307 43 nips-2011-Bayesian Partitioning of Large-Scale Distance Data


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.197), (1, 0.058), (2, -0.097), (3, -0.059), (4, -0.065), (5, -0.028), (6, -0.067), (7, -0.012), (8, 0.161), (9, -0.083), (10, -0.022), (11, 0.377), (12, 0.163), (13, -0.241), (14, -0.0), (15, 0.033), (16, -0.12), (17, 0.072), (18, 0.078), (19, -0.063), (20, -0.03), (21, 0.028), (22, -0.151), (23, 0.063), (24, 0.045), (25, -0.038), (26, 0.015), (27, 0.136), (28, 0.098), (29, -0.027), (30, -0.012), (31, -0.027), (32, 0.074), (33, -0.047), (34, -0.04), (35, 0.037), (36, 0.009), (37, -0.061), (38, -0.039), (39, -0.049), (40, 0.009), (41, 0.044), (42, -0.019), (43, -0.071), (44, -0.07), (45, 0.093), (46, 0.014), (47, -0.143), (48, 0.021), (49, 0.085)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98505121 54 nips-2011-Co-regularized Multi-view Spectral Clustering

Author: Abhishek Kumar, Piyush Rai, Hal Daume

Abstract: In many clustering problems, we have access to multiple views of the data each of which could be individually used for clustering. Exploiting information from multiple views, one can hope to find a clustering that is more accurate than the ones obtained using the individual views. Often these different views admit same underlying clustering of the data, so we can approach this problem by looking for clusterings that are consistent across the views, i.e., corresponding data points in each view should have same cluster membership. We propose a spectral clustering framework that achieves this goal by co-regularizing the clustering hypotheses, and propose two co-regularization schemes to accomplish this. Experimental comparisons with a number of baselines on two synthetic and three real-world datasets establish the efficacy of our proposed approaches.
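As a concrete illustration of the co-regularization idea in the abstract above, the following is a minimal sketch of one way a pairwise co-regularization scheme can be implemented: each view's spectral embedding is recomputed with a term that rewards agreement with the other views' embeddings. The function name, the parameter lam, the number of alternations, and the choice to cluster the first view's embedding are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def coregularized_spectral_clustering(similarities, k, lam=0.05, n_iter=10):
    """Sketch of pairwise co-regularized multi-view spectral clustering.

    similarities: list of symmetric (n x n) similarity matrices, one per view.
    Each view's embedding U[v] is the top-k eigenvector matrix of its normalized
    similarity plus lam times the other views' projection matrices U U^T, which
    pulls the per-view clustering hypotheses toward agreement.
    """
    def normalize(W):
        d = np.maximum(W.sum(axis=1), 1e-12)
        D = np.diag(1.0 / np.sqrt(d))
        return D @ W @ D

    norm_sims = [normalize(W) for W in similarities]
    U = [np.linalg.eigh(S)[1][:, -k:] for S in norm_sims]   # independent init

    for _ in range(n_iter):
        for v in range(len(norm_sims)):
            M = norm_sims[v].copy()
            for w in range(len(norm_sims)):
                if w != v:
                    M += lam * (U[w] @ U[w].T)
            U[v] = np.linalg.eigh(M)[1][:, -k:]              # top-k eigenvectors

    # Standard spectral-clustering finish: row-normalize one embedding, run k-means.
    E = U[0] / np.maximum(np.linalg.norm(U[0], axis=1, keepdims=True), 1e-12)
    return KMeans(n_clusters=k, n_init=10).fit_predict(E)
```

The alternation makes each view's eigenproblem depend on the other views' current embeddings, which is what ties the per-view clusterings together.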

2 0.87472445 186 nips-2011-Noise Thresholds for Spectral Clustering

Author: Sivaraman Balakrishnan, Min Xu, Akshay Krishnamurthy, Aarti Singh

Abstract: Although spectral clustering has enjoyed considerable empirical success in machine learning, its theoretical properties are not yet fully developed. We analyze the performance of a spectral algorithm for hierarchical clustering and show that on a class of hierarchically structured similarity matrices, this algorithm can tolerate noise that grows with the number of data points while still perfectly recovering the hierarchical clusters with high probability. We additionally improve upon previous results for k-way spectral clustering to derive conditions under which spectral clustering makes no mistakes. Further, using minimax analysis, we derive tight upper and lower bounds for the clustering problem and compare the performance of spectral clustering to these information theoretic limits. We also present experiments on simulated and real world data illustrating our results.
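For readers unfamiliar with the setup, below is a minimal sketch of recursive spectral bipartitioning on a similarity matrix; it is a generic illustration of hierarchical spectral clustering, not this paper's specific algorithm or noise conditions, and the median split and depth cap are assumptions.

```python
import numpy as np

def hierarchical_spectral_split(W, depth, min_size=2):
    """Recursively split a similarity graph with the Fiedler vector.

    At each level, the eigenvector of the second-smallest eigenvalue of the
    unnormalized Laplacian of the current subgraph is thresholded at its
    median to form two children; recursion yields a cluster hierarchy.
    """
    def split(indices, d):
        if d == 0 or len(indices) < 2 * min_size:
            return list(indices)
        sub = W[np.ix_(indices, indices)]
        L = np.diag(sub.sum(axis=1)) - sub
        _, vecs = np.linalg.eigh(L)
        fiedler = vecs[:, 1]
        med = np.median(fiedler)
        left, right = indices[fiedler <= med], indices[fiedler > med]
        if len(left) == 0 or len(right) == 0:
            return list(indices)
        return [split(left, d - 1), split(right, d - 1)]

    return split(np.arange(W.shape[0]), depth)
```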

3 0.67933249 198 nips-2011-On U-processes and clustering performance

Author: Stéphan J. Clémençon

Abstract: Many clustering techniques aim at optimizing empirical criteria that are of the form of a U-statistic of degree two. Given a measure of dissimilarity between pairs of observations, the goal is to minimize the within-cluster point scatter over a class of partitions of the feature space. It is the purpose of this paper to define a general statistical framework, relying on the theory of U-processes, for studying the performance of such clustering methods. In this setup, under adequate assumptions on the complexity of the subsets forming the partition candidates, the excess of clustering risk is proved to be of the order O_P(1/√n). Based on recent results related to the tail behavior of degenerate U-processes, it is also shown how to establish tighter rate bounds. Model selection issues, related to the number of clusters forming the data partition in particular, are also considered.
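To make the criterion concrete, the within-cluster point scatter of a partition $\mathcal{P} = \{\mathcal{C}_1, \dots, \mathcal{C}_K\}$ can be written as a degree-two U-statistic of the dissimilarity $D$; the notation below is assumed for illustration and may differ from the paper's.

```latex
\widehat{W}_n(\mathcal{P})
  \;=\; \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} D(X_i, X_j)
        \sum_{k=1}^{K} \mathbb{1}\{X_i \in \mathcal{C}_k\}\,\mathbb{1}\{X_j \in \mathcal{C}_k\}
```

A clustering rule of this kind picks $\hat{\mathcal{P}} = \arg\min_{\mathcal{P}} \widehat{W}_n(\mathcal{P})$ over the candidate class of partitions, which is the setting in which the O_P(1/√n) rate above applies.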

4 0.63083673 155 nips-2011-Learning to Agglomerate Superpixel Hierarchies

Author: Viren Jain, Srinivas C. Turaga, K. Briggman, Moritz N. Helmstaedter, Winfried Denk, H. S. Seung

Abstract: An agglomerative clustering algorithm merges the most similar pair of clusters at every iteration. The function that evaluates similarity is traditionally hand-designed, but there has been recent interest in supervised or semi-supervised settings in which ground-truth clustered data is available for training. Here we show how to train a similarity function by regarding it as the action-value function of a reinforcement learning problem. We apply this general method to segment images by clustering superpixels, an application that we call Learning to Agglomerate Superpixel Hierarchies (LASH). When applied to a challenging dataset of brain images from serial electron microscopy, LASH dramatically improved segmentation accuracy when clustering supervoxels generated by state-of-the-art boundary detection algorithms. The naive strategy of directly training only supervoxel similarities and applying single linkage clustering produced less improvement.
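A minimal sketch of the greedy agglomeration loop described in the first sentence is shown below; the similarity callable stands in for the learned action-value function, and the stopping threshold and toy example are assumptions for illustration.

```python
def agglomerate(items, similarity, stop_threshold=0.0):
    """Greedy agglomerative clustering with a pluggable similarity function.

    similarity(a, b) scores a candidate merge of clusters a and b; in a
    LASH-style setup this scorer would be learned, here it is any callable.
    """
    clusters = [[x] for x in items]
    while len(clusters) > 1:
        best_score, best_pair = None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = similarity(clusters[i], clusters[j])
                if best_score is None or s > best_score:
                    best_score, best_pair = s, (i, j)
        if best_score < stop_threshold:
            break
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy similarity: negative gap between 1-D cluster means.
points = [0.0, 0.1, 0.2, 5.0, 5.1]
sim = lambda a, b: -abs(sum(a) / len(a) - sum(b) / len(b))
print(agglomerate(points, sim, stop_threshold=-1.0))
```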

5 0.60242933 263 nips-2011-Sparse Manifold Clustering and Embedding

Author: Ehsan Elhamifar, René Vidal

Abstract: We propose an algorithm called Sparse Manifold Clustering and Embedding (SMCE) for simultaneous clustering and dimensionality reduction of data lying in multiple nonlinear manifolds. Similar to most dimensionality reduction methods, SMCE finds a small neighborhood around each data point and connects each point to its neighbors with appropriate weights. The key difference is that SMCE finds both the neighbors and the weights automatically. This is done by solving a sparse optimization problem, which encourages selecting nearby points that lie in the same manifold and approximately span a low-dimensional affine subspace. The optimal solution encodes information that can be used for clustering and dimensionality reduction using spectral clustering and embedding. Moreover, the size of the optimal neighborhood of a data point, which can be different for different points, provides an estimate of the dimension of the manifold to which the point belongs. Experiments demonstrate that our method can effectively handle multiple manifolds that are very close to each other, manifolds with non-uniform sampling and holes, as well as estimate the intrinsic dimensions of the manifolds.
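The per-point sparse program referred to above can be written roughly as follows; the exact proximity weighting $Q_i$ and trade-off $\lambda$ are assumptions here, included only to make "selecting nearby points that span a low-dimensional affine subspace" concrete.

```latex
\min_{c_i} \;\; \lambda \,\| Q_i\, c_i \|_1
  \;+\; \frac{1}{2} \Big\| \sum_{j \neq i} c_{ij}\, \frac{x_j - x_i}{\|x_j - x_i\|_2} \Big\|_2^2
  \quad \text{s.t.} \quad \mathbf{1}^{\top} c_i = 1
```

Here $Q_i$ is a diagonal matrix of (normalized) distances from $x_i$, so far-away points pay a larger $\ell_1$ penalty; the nonzero entries of the optimal $c_i$ select the neighbors and weights used to build the affinity graph for the subsequent spectral step.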

6 0.55655503 119 nips-2011-Higher-Order Correlation Clustering for Image Segmentation

7 0.52837479 176 nips-2011-Multi-View Learning of Word Embeddings via CCA

8 0.51933795 52 nips-2011-Clustering via Dirichlet Process Mixture Models for Portable Skill Discovery

9 0.46319959 47 nips-2011-Beyond Spectral Clustering - Tight Relaxations of Balanced Graph Cuts

10 0.46112466 167 nips-2011-Maximum Covariance Unfolding : Manifold Learning for Bimodal Data

11 0.44845662 236 nips-2011-Regularized Laplacian Estimation and Fast Eigenvector Approximation

12 0.4482272 51 nips-2011-Clustered Multi-Task Learning Via Alternating Structure Optimization

13 0.44509065 120 nips-2011-History distribution matching method for predicting effectiveness of HIV combination therapies

14 0.43542165 172 nips-2011-Minimax Localization of Structural Information in Large Noisy Matrices

15 0.43466097 71 nips-2011-Directed Graph Embedding: an Algorithm based on Continuous Limits of Laplacian-type Operators

16 0.41431451 43 nips-2011-Bayesian Partitioning of Large-Scale Distance Data

17 0.40823951 164 nips-2011-Manifold Precis: An Annealing Technique for Diverse Sampling of Manifolds

18 0.40301406 66 nips-2011-Crowdclustering

19 0.40202093 254 nips-2011-Similarity-based Learning via Data Driven Embeddings

20 0.39344972 95 nips-2011-Fast and Accurate k-means For Large Datasets


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.025), (4, 0.04), (20, 0.033), (26, 0.022), (31, 0.05), (33, 0.023), (43, 0.052), (45, 0.097), (57, 0.032), (65, 0.02), (74, 0.045), (83, 0.022), (99, 0.451)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.96631902 248 nips-2011-Semi-supervised Regression via Parallel Field Regularization

Author: Binbin Lin, Chiyuan Zhang, Xiaofei He

Abstract: This paper studies the problem of semi-supervised learning from the vector field perspective. Much of the existing work uses the graph Laplacian to ensure the smoothness of the prediction function on the data manifold. However, beyond smoothness, it is suggested by recent theoretical work that we should ensure second order smoothness for achieving faster rates of convergence for semi-supervised regression problems. To achieve this goal, we show that the second order smoothness measures the linearity of the function, and the gradient field of a linear function has to be a parallel vector field. Consequently, we propose to find a function which minimizes the empirical error and whose gradient field is as parallel as possible. We give a continuous objective function on the manifold and discuss how to discretize it by using random points. The discretized optimization problem turns out to be a sparse linear system which can be solved very efficiently. The experimental results have demonstrated the effectiveness of our proposed approach.
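To make the first-order (Laplacian) smoothness idea that this paper moves beyond concrete, here is a minimal sketch of standard graph-Laplacian-regularized least squares on a data graph; it is a baseline illustration under stated assumptions (connected graph, at least one labeled point), not the parallel-field method proposed in the paper.

```python
import numpy as np

def laplacian_regularized_regression(W, labeled_idx, y_labeled, lam=1.0):
    """Closed-form solution of min_f ||f_L - y_L||^2 + lam * f^T L f.

    W: symmetric (n x n) similarity matrix over all (labeled + unlabeled) points.
    labeled_idx, y_labeled: indices and target values of the labeled points.
    Returns predicted values f for all n points.
    """
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W        # unnormalized graph Laplacian
    J = np.zeros((n, n))                  # diagonal selector of labeled points
    J[labeled_idx, labeled_idx] = 1.0
    b = np.zeros(n)
    b[labeled_idx] = y_labeled
    return np.linalg.solve(J + lam * L, b)

# Toy chain graph: two labeled endpoints, three unlabeled points in between.
W = np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)
print(laplacian_regularized_regression(W, [0, 4], [0.0, 1.0], lam=0.1))
```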

2 0.94620866 50 nips-2011-Budgeted Optimization with Concurrent Stochastic-Duration Experiments

Author: Javad Azimi, Alan Fern, Xiaoli Z. Fern

Abstract: Budgeted optimization involves optimizing an unknown function that is costly to evaluate by requesting a limited number of function evaluations at intelligently selected inputs. Typical problem formulations assume that experiments are selected one at a time with a limited total number of experiments, which fail to capture important aspects of many real-world problems. This paper defines a novel problem formulation with the following important extensions: 1) allowing for concurrent experiments; 2) allowing for stochastic experiment durations; and 3) placing constraints on both the total number of experiments and the total experimental time. We develop both offline and online algorithms for selecting concurrent experiments in this new setting and provide experimental results on a number of optimization benchmarks. The results show that our algorithms produce highly effective schedules compared to natural baselines.

same-paper 3 0.94351643 54 nips-2011-Co-regularized Multi-view Spectral Clustering

Author: Abhishek Kumar, Piyush Rai, Hal Daume

Abstract: In many clustering problems, we have access to multiple views of the data each of which could be individually used for clustering. Exploiting information from multiple views, one can hope to find a clustering that is more accurate than the ones obtained using the individual views. Often these different views admit same underlying clustering of the data, so we can approach this problem by looking for clusterings that are consistent across the views, i.e., corresponding data points in each view should have same cluster membership. We propose a spectral clustering framework that achieves this goal by co-regularizing the clustering hypotheses, and propose two co-regularization schemes to accomplish this. Experimental comparisons with a number of baselines on two synthetic and three real-world datasets establish the efficacy of our proposed approaches.

4 0.89485621 8 nips-2011-A Model for Temporal Dependencies in Event Streams

Author: Asela Gunawardana, Christopher Meek, Puyang Xu

Abstract: We introduce the Piecewise-Constant Conditional Intensity Model, a model for learning temporal dependencies in event streams. We describe a closed-form Bayesian approach to learning these models, and describe an importance sampling algorithm for forecasting future events using these models, using a proposal distribution based on Poisson superposition. We then use synthetic data, supercomputer event logs, and web search query logs to illustrate that our learning algorithm can efficiently learn nonlinear temporal dependencies, and that our importance sampling algorithm can effectively forecast future events.

5 0.87390363 270 nips-2011-Statistical Performance of Convex Tensor Decomposition

Author: Ryota Tomioka, Taiji Suzuki, Kohei Hayashi, Hisashi Kashima

Abstract: We analyze the statistical performance of a recently proposed convex tensor decomposition algorithm. Conventionally tensor decomposition has been formulated as non-convex optimization problems, which hindered the analysis of their performance. We show under some conditions that the mean squared error of the convex method scales linearly with the quantity we call the normalized rank of the true tensor. The current analysis naturally extends the analysis of convex low-rank matrix estimation to tensors. Furthermore, we show through numerical experiments that our theory can precisely predict the scaling behaviour in practice.
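For context, one standard convex formulation of this kind penalizes the trace (nuclear) norms of all mode-wise unfoldings; whether this matches the exact estimator analyzed here is an assumption, and the notation is illustrative.

```latex
\widehat{\mathcal{W}} \;\in\; \arg\min_{\mathcal{W}} \;
  \frac{1}{2}\,\|\mathcal{Y} - \mathcal{W}\|_F^2
  \;+\; \lambda \sum_{k=1}^{K} \big\| \mathcal{W}_{(k)} \big\|_{*}
```

Here $\mathcal{W}_{(k)}$ is the mode-$k$ unfolding of the $K$-way tensor and $\|\cdot\|_*$ is the trace norm; because the problem is convex, it avoids the local optima of conventional non-convex decompositions, which is what makes the kind of error analysis described above tractable.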

6 0.84259808 46 nips-2011-Better Mini-Batch Algorithms via Accelerated Gradient Methods

7 0.60982436 186 nips-2011-Noise Thresholds for Spectral Clustering

8 0.60382307 263 nips-2011-Sparse Manifold Clustering and Embedding

9 0.59562939 102 nips-2011-Generalised Coupled Tensor Factorisation

10 0.59093076 71 nips-2011-Directed Graph Embedding: an Algorithm based on Continuous Limits of Laplacian-type Operators

11 0.57706285 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration

12 0.57702714 121 nips-2011-Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent

13 0.57618666 29 nips-2011-Algorithms and hardness results for parallel large margin learning

14 0.5566045 36 nips-2011-Analysis and Improvement of Policy Gradient Estimation

15 0.546817 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes

16 0.54527384 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time

17 0.54518998 198 nips-2011-On U-processes and clustering performance

18 0.54455805 72 nips-2011-Distributed Delayed Stochastic Optimization

19 0.53728896 157 nips-2011-Learning to Search Efficiently in High Dimensions

20 0.53609741 66 nips-2011-Crowdclustering