nips nips2013 nips2013-12 knowledge-graph by maker-knowledge-mining

12 nips-2013-A Novel Two-Step Method for Cross Language Representation Learning


Source: pdf

Author: Min Xiao, Yuhong Guo

Abstract: Cross language text classification is an important learning task in natural language processing. A critical challenge of cross language learning arises from the fact that words of different languages are in disjoint feature spaces. In this paper, we propose a two-step representation learning method to bridge the feature spaces of different languages by exploiting a set of parallel bilingual documents. Specifically, we first formulate a matrix completion problem to produce a complete parallel document-term matrix for all documents in two languages, and then induce a low dimensional cross-lingual document representation by applying latent semantic indexing on the obtained matrix. We use a projected gradient descent algorithm to solve the formulated matrix completion problem with convergence guarantees. The proposed method is evaluated by conducting a set of experiments with cross language sentiment classification tasks on Amazon product reviews. The experimental results demonstrate that the proposed learning method outperforms a number of other cross language representation learning methods, especially when the number of parallel bilingual documents is small. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Cross language text classification is an important learning task in natural language processing. [sent-2, score-0.603]

2 A critical challenge of cross language learning arises from the fact that words of different languages are in disjoint feature spaces. [sent-3, score-0.509]

3 In this paper, we propose a two-step representation learning method to bridge the feature spaces of different languages by exploiting a set of parallel bilingual documents. [sent-4, score-0.502]

4 Specifically, we first formulate a matrix completion problem to produce a complete parallel document-term matrix for all documents in two languages, and then induce a low dimensional cross-lingual document representation by applying latent semantic indexing on the obtained matrix. [sent-5, score-0.723]

5 We use a projected gradient descent algorithm to solve the formulated matrix completion problem with convergence guarantees. [sent-6, score-0.213]

6 The proposed method is evaluated by conducting a set of experiments with cross language sentiment classification tasks on Amazon product reviews. [sent-7, score-0.603]

7 The experimental results demonstrate that the proposed learning method outperforms a number of other cross language representation learning methods, especially when the number of parallel bilingual documents is small. [sent-8, score-0.943]

8 Introduction: Cross language text classification is an important natural language processing task that exploits a large number of labeled documents in an auxiliary source language to train a classification model for classifying documents in a target language where labeled data is scarce. [sent-9, score-1.626]

9 An effective cross language learning system can greatly reduce the manual annotation effort in the target language for learning good classification models. [sent-10, score-0.715]

10 The challenge of cross language text classification lies in the language barrier. [sent-12, score-0.737]

11 That is, documents in different languages are expressed with different word vocabularies and thus have disjoint feature spaces. [sent-13, score-0.316]

12 In this paper, we propose a two-step learning method to induce cross-lingual feature representations for cross language text classification by exploiting a set of unlabeled parallel bilingual documents. [sent-15, score-1.102]

13 First we construct a concatenated bilingual document-term matrix where each document is represented in the concatenated vocabulary of two languages. [sent-16, score-0.412]

14 We then learn the unobserved feature entries of this sparse matrix by formulating a matrix completion problem and solving it using a projected gradient descent optimization algorithm. [sent-18, score-0.278]

15 By doing so, we expect to automatically capture important and robust low-rank information based on the word co-occurrence patterns expressed both within each language and across languages. [sent-19, score-0.314]

16 Next we perform latent semantic indexing over the recovered document-term matrix and induce a low-dimensional dense cross-lingual representation of the documents, on which standard monolingual classifiers can be applied. [sent-20, score-0.363]

17 To evaluate the effectiveness of the proposed learning method, we conduct a set of experiments with cross language sentiment classification tasks on multilingual Amazon product reviews. [sent-21, score-0.734]

18 The empirical results show that the proposed method significantly outperforms a number of cross language learning methods. [sent-22, score-0.418]

19 Moreover, the proposed method produces good performance even with a very small number of unlabeled parallel bilingual documents. [sent-23, score-0.542]

20 [21] proposed an instance and feature bi-weighting method by first translating documents from one language domain to the other one and then simultaneously re-weighting instances and features to address the distribution difference across domains. [sent-26, score-0.55]

21 [22] proposed to use the co-training method for cross language sentiment classification on parallel corpora. [sent-27, score-0.692]

22 [2] proposed a multi-view majority voting method to categorize documents in multiple views produced from machine translation tools. [sent-28, score-0.217]

23 [1] proposed a multi-view co-classification method for multilingual document categorization, which minimizes both the training loss for each view and the prediction disagreement between different language views. [sent-29, score-0.485]

24 Our proposed approach in this paper shares similarity with these approaches in exploiting parallel data produced by machine translation tools. [sent-30, score-0.247]

25 But our approach only requires a small set of unlabeled parallel documents, while these approaches require at least translating all the training documents in one language domain. [sent-31, score-0.792]

26 Another important group of cross language text classification methods in the literature constructs cross-lingual representations by exploiting bilingual word pairs [16, 7], parallel corpora [10, 20, 15, 19, 8], and other resources [3, 14]. [sent-32, score-0.96]

27 [16] proposed a cross-language structural correspondence learning method to induce language-independent features by using pivot word pairs produced by word translation oracles. [sent-33, score-0.208]

28 [10] proposed a cross-language latent semantic indexing (CL-LSI) method to induce cross-lingual representations by performing LSI over a dual-language document-term matrix, where each dual-language document contains its original words and the corresponding translation text. [sent-34, score-0.334]

29 The CL-KCCA method [20] first learns two projections (one for each language) by conducting kernel canonical correlation analysis over a paired bilingual corpus and then uses them to project documents from language-specific feature spaces to the shared multilingual semantic feature space. [sent-36, score-0.647]

30 [15] employed cross-lingual oriented principal component analysis (CL-OPCA) over concatenated parallel documents to learn a multilingual projection by simultaneously minimizing the projected distance between parallel documents and maximizing the projected covariance of documents across languages. [sent-37, score-1.083]

31 Some other work uses multilingual topic models such as the coupled probabilistic latent semantic analysis and the bilingual latent Dirichlet allocation to extract latent cross-lingual topics as interlingual representations [19]. [sent-38, score-0.548]

32 [14] proposed to use language-specific part-of-speech (POS) taggers to tag each word and then map those language-specific POS tags to twelve universal POS tags as interlingual features for cross language fine-grained genre classification. [sent-39, score-0.549]

33 Similar to the multilingual semantic representation learning approaches such as CL-LSI, CL-KCCA and CL-OPCA, our two-step learning method exploits parallel documents. [sent-40, score-0.411]

34 But unlike these methods, which apply operations such as LSI, KCCA, and OPCA directly on the original concatenated document-term matrix, our method first fills the missing entries of the document-term matrix using matrix completion, and then performs LSI over the recovered low-rank matrix. [sent-41, score-0.15]

35 3 Approach In this section, we present the proposed two-step learning method for learning cross-lingual document representations. [sent-42, score-0.077]

36 We assume that a set of unlabeled parallel documents from the two languages is given, which can be used to capture the co-occurrence of terms across languages and to build connections between the vocabulary sets of the two languages. [sent-43, score-0.72]

37 We first construct a unified document-term matrix for all documents from the auxiliary source language domain and the target language domain, whose columns correspond to the word features from the unified vocabulary set of the two languages. [sent-44, score-0.988]

38 In this matrix, each pair of parallel documents is represented as a fully observed row vector, and each non-parallel document is represented as a partially observed row vector where only entries corresponding to words in its own language vocabulary are observed. [sent-45, score-0.698]
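The construction described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the function name, the assumption of bag-of-word count matrices per language, and the particular row ordering are all assumptions.

```python
import numpy as np

def build_unified_matrix(X_src, X_tgt, X_par_src, X_par_tgt):
    """Sketch: assemble the unified document-term matrix M0 and its observed mask.

    X_src:     (n_s, d_s) source-language documents (source vocabulary only)
    X_tgt:     (n_t, d_t) target-language documents (target vocabulary only)
    X_par_src: (n_p, d_s) source side of the parallel document pairs
    X_par_tgt: (n_p, d_t) target side of the parallel document pairs
    Columns of M0 span the concatenated source+target vocabulary (d = d_s + d_t).
    """
    n_s, d_s = X_src.shape
    n_t, d_t = X_tgt.shape
    d = d_s + d_t

    M0 = np.zeros((n_s + n_t + X_par_src.shape[0], d))
    observed = np.zeros_like(M0, dtype=bool)   # index set Omega

    # Non-parallel source documents: only source-vocabulary entries are observed.
    M0[:n_s, :d_s] = X_src
    observed[:n_s, :d_s] = True

    # Non-parallel target documents: only target-vocabulary entries are observed.
    M0[n_s:n_s + n_t, d_s:] = X_tgt
    observed[n_s:n_s + n_t, d_s:] = True

    # Parallel document pairs: fully observed rows spanning both vocabularies.
    M0[n_s + n_t:, :d_s] = X_par_src
    M0[n_s + n_t:, d_s:] = X_par_tgt
    observed[n_s + n_t:, :] = True

    return M0, observed
```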

39 Instead of learning a low-dimensional cross-lingual document representation from this matrix directly, we perform a two-step learning procedure: First we learn a low-rank document-term matrix by automatically filling the missing entries via matrix completion. [sent-46, score-0.229]

40 Next we produce cross-lingual representations by applying the latent semantic indexing method over the learned matrix. [sent-47, score-0.192]

41 Let $M^0 \in \mathbb{R}^{t \times d}$ be the unified document-term matrix, which is partially filled with observed nonnegative feature values, where $t$ is the number of documents and $d$ is the size of the unified vocabulary. [sent-48, score-0.209]

42 We use $\Omega$ to denote the index set of the observed features in $M^0$, such that $(i, j) \in \Omega$ if and only if $M^0_{ij}$ is observed; and use $\bar{\Omega}$ to denote the index set of the missing features in $M^0$, such that $(i, j) \in \bar{\Omega}$ if and only if $M^0_{ij}$ is unobserved. [sent-49, score-0.087]

43 For the $i$-th document in the data set from one language, if the document does not have a parallel translation in the other language, then all the features in row $M^0_{i:}$ corresponding to the words in the vocabulary of the other language are viewed as missing features. [sent-50, score-0.643]

44 Matrix Completion: Note that the document-term matrix $M^0$ has a large fraction of missing features, and the only bridge between the vocabulary sets of the two languages is the small set of parallel bilingual documents. [sent-52, score-0.573]

45 Learning from this partially observed matrix directly, treating missing features as zeros, would certainly lose a great deal of information. [sent-53, score-0.097]

46 On the other hand, a fully observed document-term matrix is naturally low-rank and sparse, as the vocabulary set is typically very large and each document only contains a small fraction of the words in the vocabulary. [sent-54, score-0.149]

47 Thus we propose to automatically fill the missing entries of $M^0$ based on the feature co-occurrence information expressed in the observed data, by conducting matrix completion to recover a low-rank and sparse matrix. [sent-55, score-0.194]

48 Specifically, we formulate the matrix completion as the following optimization problem:
$$\min_{M}\;\; \mathrm{rank}(M) + \mu \|M\|_1 \quad \text{subject to} \quad M_{ij} = M^0_{ij},\ \forall (i,j) \in \Omega; \qquad M_{ij} \ge 0,\ \forall (i,j) \in \bar{\Omega} \qquad (1)$$
where $\|\cdot\|_1$ denotes the $\ell_1$ norm and is used to enforce sparsity. [sent-56, score-0.095]
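The relaxed problem (2) that the subsequent text refers to is not reproduced in this extracted summary. A standard relaxation, consistent with the shrinkage (nuclear-norm) and nonnegative-projection steps described later, replaces the rank with the nuclear norm; whether the observed-entry constraints are kept exactly or moved into a squared-loss penalty weighted by $\rho$ (as the gradient step below suggests) is an assumption on our part, so the form given here is only one plausible reading:
$$\min_{M \ge 0}\;\; \|M\|_{*} + \mu \|M\|_{1} + \frac{\rho}{2} \sum_{(i,j) \in \Omega} \bigl(M_{ij} - M^{0}_{ij}\bigr)^{2}$$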

49 Moreover, the matrix M in cross-language learning tasks is typically very large, and thus a scalable optimization algorithm needs to be developed to conduct efficient optimization. [sent-64, score-0.098]

50 In the next section, we present a scalable projected gradient descent algorithm to solve this minimization problem. [sent-65, score-0.118]

51 Latent Semantic Indexing: After solving (2) for an optimal low-rank solution $M^*$, we can use each row of the sparse matrix $M^*$ as a vector representation of each document in the concatenated vocabulary space of the two languages. [sent-75, score-0.22]

52 However, exploiting such a matrix representation directly for cross language text classification lacks sufficient capacity for handling feature noise and sparseness, as each document is represented using only a small set of words from the vocabulary. [sent-76, score-0.702]

53 We thus propose to apply a latent semantic indexing (LSI) method on M ∗ to produce a low-dimensional semantic representation of the data. [sent-77, score-0.267]

54 LSI uses singular value decomposition to discover the important associative relationships of word features [10] and to create a reduced-dimension feature space. [sent-78, score-0.104]

55 Specifically, we first perform singular value decomposition over $M^*$, $M^* = U S V^\top$, and then obtain a low-dimensional representation matrix $Z$ via the projection $Z = M^* V_k$, where $V_k$ contains the top $k$ right singular vectors of $M^*$. [sent-79, score-0.064]
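A minimal sketch of this LSI step with NumPy follows; the function name and the default $k$ are illustrative, and at the scale of real vocabularies a truncated or randomized SVD would be used in place of the full decomposition.

```python
import numpy as np

def lsi_projection(M_star, k=50):
    """Sketch of the LSI step: truncated SVD of the completed matrix M*.

    Returns the k-dimensional document representation Z = M* V_k, where V_k
    holds the top-k right singular vectors. k=50 is a placeholder; the paper
    selects the dimensionality from {20, 50, 100, 200, 500}.
    """
    # Full SVD is shown for clarity; a truncated/randomized SVD
    # (e.g. sklearn.decomposition.TruncatedSVD) is preferable at scale.
    U, S, Vt = np.linalg.svd(M_star, full_matrices=False)
    V_k = Vt[:k].T                # top-k right singular vectors
    Z = M_star @ V_k              # low-dimensional cross-lingual features
    return Z, V_k
```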

56 Cross-language text classification can then be conducted over Z using monolingual classifiers. [sent-80, score-0.186]

57 Optimization Algorithm (Projected Gradient Descent): A number of algorithms have been developed to solve matrix completion problems in the literature [4, 11]. [sent-82, score-0.095]

58 We use a projected gradient descent algorithm to solve the non-smooth convex optimization problem in (2). [sent-83, score-0.118]

59 It first initializes $M$ as the nonnegative projection of the rank-1 approximation of $M^0$, and then iteratively updates $M$ using a projected gradient descent procedure. [sent-85, score-0.141]

60 Next we perform a shrinkage operation $M = S_\nu(M)$ over the resulting matrix from the first step to minimize its rank. [sent-88, score-0.057]

61 Finally, we project the resulting matrix onto the nonnegative feasible set by $M = \max(M, 0)$. [sent-90, score-0.059]
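The three steps just described (gradient step, shrinkage, nonnegative projection) can be sketched as below. Since the summary does not reproduce equations (2)-(4), the smooth term $g$ is assumed here to be a squared loss on the observed entries weighted by $\rho$ plus the $\ell_1$ term; the parameter names, default values, and the initialization details are likewise assumptions rather than the paper's exact algorithm.

```python
import numpy as np

def svt(M, nu):
    """Singular-value shrinkage S_nu: soft-threshold the singular values of M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - nu, 0.0)) @ Vt

def projected_gradient_descent(M0, observed, mu=1.0, rho=0.01, nu=1.0,
                               tau=1.0, n_iters=100):
    """Sketch of the three-step iteration described above (assumed objective)."""
    # Initialization: nonnegative projection of the rank-1 approximation of M0.
    U, s, Vt = np.linalg.svd(M0, full_matrices=False)
    M = np.maximum(s[0] * np.outer(U[:, 0], Vt[0]), 0.0)

    for _ in range(n_iters):
        # 1) Gradient step on the assumed smooth surrogate g:
        #    squared loss on observed entries plus the l1 term (constant
        #    gradient mu on the nonnegative feasible set).
        grad = rho * observed * (M - M0) + mu
        M = M - tau * grad
        # 2) Shrinkage step to reduce the rank (nuclear-norm proximal map).
        M = svt(M, nu)
        # 3) Projection onto the nonnegative feasible set.
        M = np.maximum(M, 0.0)
    return M
```

A practical implementation would stop on a convergence criterion rather than a fixed iteration count, and would use a truncated SVD inside svt for large document-term matrices.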

62 Convergence Analysis: Let $h(\cdot) = I(\cdot) - \tau \nabla g(\cdot)$ be the gradient descent operator used in the gradient step, let $P_C(\cdot) = \max(\cdot, 0)$ be the projection operator, and let $S_\nu(\cdot)$ be the shrinkage operator. [sent-94, score-0.18]

63 Below we prove the convergence of the projected gradient descent algorithm. [sent-95, score-0.118]

64 Then, following the gradient definition in (4), we have
$$\|h(M) - h(M')\|_F = \|(M - M') \circ Q\|_F = \Big(\sum_{ij} (M_{ij} - M'_{ij})^2 Q_{ij}^2\Big)^{1/2} \le \|M - M'\|_F.$$
The inequality becomes an equality if and only if $\|h(M) - h(M')\|_F = \|M - M'\|_F$. [sent-104, score-0.07]

65 The sequence $\{M^k\}$ generated by the projected gradient descent iterations in Algorithm 1 with $0 < \tau < \min(2, \tfrac{2}{\rho})$ converges to $M^*$, which is an optimal solution of (2). [sent-126, score-0.118]

66 Experiments: In this section, we evaluate the proposed two-step learning method by conducting extensive cross language sentiment classification experiments on multilingual Amazon product reviews. [sent-130, score-0.707]

67 Experimental Setting (Dataset): We used the multilingual Amazon product reviews dataset [16], which contains three categories (Books (B), DVD (D), Music (M)) of product reviews in four different languages (English (E), French (F), German (G), Japanese (J)). [sent-132, score-0.338]

68 For each category of the product reviews, there are 2000 positive and 2000 negative English reviews, and 1000 positive and 1000 negative reviews for each of the other three languages. [sent-133, score-0.057]

69 In addition, there are another 2000 unlabeled parallel reviews between English and each of the other three languages. [sent-134, score-0.403]

70 For example, the task EFB uses English Books reviews as the source language data and uses French Books reviews as the target language data. [sent-137, score-0.73]

71 Table 1: Average classification accuracies (%) and standard deviations (%) over 10 runs for the 18 cross language sentiment classification tasks. [sent-138, score-0.565]

72 The Target Bag-Of-Word (TBOW) baseline method trains a supervised monolingual classifier in the original bag-of-word feature space with the labeled training data from the target language domain. [sent-320, score-0.495]

73 The Cross-Lingual Kernel Canonical Correlation Analysis (CL-KCCA) method [20] first induces two language projections by using unlabeled parallel data and then trains a monolingual classifier on labeled data from both language domains in the projected low-dimensional space. [sent-322, score-1.076]

74 For all experiments, we used linear support vector machine (SVM) as the monolingual classification model. [sent-323, score-0.113]

75 Classification Accuracy: For each of the 18 cross language sentiment classification tasks, we used all documents from the two languages and the additional 2000 unlabeled parallel documents for representation learning. [sent-326, score-1.279]

76 Then we used all documents in the auxiliary source language and randomly chose 100 documents from the target language as labeled data for classification model training, and used the remaining data in the target language as test data. [sent-327, score-1.326]
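A sketch of this evaluation protocol using scikit-learn's LinearSVC is given below (the paper specifies a linear SVM but not a particular library, and all variable names are illustrative): the labeled source documents are combined with 100 randomly chosen labeled target documents for training, the remaining target documents serve as test data, and accuracy is averaged over repeated random selections.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def evaluate_task(Z_src, y_src, Z_tgt, y_tgt, n_labeled_tgt=100, n_runs=10,
                  seed=0):
    """Sketch of the evaluation protocol described above.

    Z_src, Z_tgt: learned cross-lingual representations of the source- and
    target-language documents; y_src, y_tgt: sentiment labels.
    """
    rng = np.random.RandomState(seed)
    accs = []
    for _ in range(n_runs):
        idx = rng.permutation(len(y_tgt))
        lab, test = idx[:n_labeled_tgt], idx[n_labeled_tgt:]
        # Train on all source documents plus the labeled target documents.
        X_train = np.vstack([Z_src, Z_tgt[lab]])
        y_train = np.concatenate([y_src, y_tgt[lab]])
        clf = LinearSVC().fit(X_train, y_train)
        accs.append(accuracy_score(y_tgt[test], clf.predict(Z_tgt[test])))
    return float(np.mean(accs)), float(np.std(accs))
```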

77 …1, 1, 10}, chose the ρ value from $\{10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1\}$, and chose the dimension $k$ from {20, 50, 100, 200, 500}. [sent-330, score-0.048]

78 We used the first task EFB to perform model parameter selection by running the algorithm 3 times based on random selections of 100 labeled target training documents. [sent-331, score-0.113]

79 We used the same procedure to select the dimensionality of the learned semantic representations for the other three approaches, CL-LSI, CL-OPCA and CL-KCCA, which produced k = 50 for CL-LSI and CL-OPCA, and k = 100 for CL-KCCA. [sent-334, score-0.107]

80 We then used the selected model parameters for all 18 tasks and ran each experiment 10 times based on random selections of 100 labeled target documents. [sent-335, score-0.152]

81 The average classification accuracies and standard deviations are reported in Table 1. [sent-336, score-0.055]

82 All these results demonstrate the efficacy and robustness of the proposed two-step representation learning method for cross language text classification. [sent-346, score-0.519]

83 Impact of the Size of Unlabeled Parallel Data: All four cross-lingual adaptation learning methods, CL-LSI, CL-KCCA, CL-OPCA, and TSL, exploit unlabeled parallel reviews for learning cross-lingual representations. [sent-348, score-0.426]

84 Next we investigated the performance of these methods with respect to different numbers of unlabeled parallel reviews. [sent-349, score-0.346]

85 For each number np in the set, we randomly chose np parallel documents from all the 2000 unlabeled parallel reviews to conduct experiments using the same setting from the previous experiments. [sent-351, score-0.818]

86 Each experiment was repeated 10 times based on random selections of labeled target training data. [sent-352, score-0.113]

87 The average test classification accuracies and standard deviations are plotted in Figure 1 and Figure 2. [sent-353, score-0.055]

88 From these results, we can see that the performance of all four methods in general improves as the amount of unlabeled parallel data increases. [sent-356, score-0.346]

89 The proposed method, TSL, nevertheless outperforms the other three cross-lingual adaptation learning methods across the range of different np values for 16 out of the 18 cross language sentiment classification tasks. [sent-357, score-0.576]

90 Conclusion: In this paper, we developed a novel two-step method to learn cross-lingual semantic data representations for cross language text classification by exploiting unlabeled parallel bilingual documents. [sent-361, score-1.126]

91 We first formulated a matrix completion problem to infer the unobserved feature values of the concatenated document-term matrix in the space of the unified vocabulary set of the source and target languages. [sent-362, score-0.344]

92 Then we performed latent semantic indexing over the completed low-rank document-term matrix to produce a low-dimensional cross-lingual representation of the documents. [sent-363, score-0.226]

93 Monolingual classifiers were then used to conduct cross language text classification based on the learned document representation. [sent-364, score-0.553]

94 To investigate the effectiveness of the proposed learning method, we conducted extensive experiments with tasks of cross language sentiment classification on Amazon product reviews. [sent-365, score-0.568]

95 Moreover, the proposed approach needs far fewer parallel documents to produce a good cross language text classification system. [sent-367, score-0.811]

96 Learning from multiple partially observed views - an application to multilingual text categorization. [sent-377, score-0.216]

97 Cross-lingual sentiment analysis for indian languages using linked wordnets. [sent-385, score-0.192]

98 Exploiting comparable corpora and bilingual dictionaries for cross-language text categorization. [sent-408, score-0.277]

99 An EM based training algorithm for cross-language text categorization. [sent-482, score-0.073]

100 Inferring a semantic representation of text via cross-language correlation analysis. [sent-506, score-0.178]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('cl', 0.401), ('tsl', 0.351), ('mij', 0.318), ('lsi', 0.301), ('language', 0.265), ('opca', 0.251), ('unlabeled', 0.183), ('bilingual', 0.177), ('kcca', 0.173), ('parallel', 0.163), ('pc', 0.16), ('documents', 0.157), ('multilingual', 0.143), ('cross', 0.134), ('monolingual', 0.113), ('sentiment', 0.111), ('classi', 0.095), ('languages', 0.081), ('semantic', 0.077), ('efb', 0.075), ('text', 0.073), ('english', 0.067), ('tbow', 0.063), ('completion', 0.059), ('document', 0.058), ('reviews', 0.057), ('vocabulary', 0.055), ('indexing', 0.053), ('french', 0.052), ('target', 0.051), ('projected', 0.05), ('efm', 0.05), ('egm', 0.05), ('ejb', 0.05), ('ejm', 0.05), ('word', 0.049), ('concatenated', 0.043), ('translation', 0.041), ('cation', 0.039), ('tasks', 0.039), ('descent', 0.038), ('efd', 0.038), ('egb', 0.038), ('ejd', 0.038), ('fem', 0.038), ('geb', 0.038), ('ged', 0.038), ('jeb', 0.038), ('jed', 0.038), ('labeled', 0.037), ('japanese', 0.036), ('matrix', 0.036), ('missing', 0.035), ('conducting', 0.035), ('source', 0.035), ('german', 0.034), ('egd', 0.033), ('jem', 0.033), ('accuracy', 0.033), ('amazon', 0.032), ('latent', 0.032), ('genre', 0.031), ('gradient', 0.03), ('representations', 0.03), ('domain', 0.03), ('feature', 0.029), ('accuracies', 0.029), ('gem', 0.029), ('representation', 0.028), ('pos', 0.027), ('xue', 0.027), ('corpora', 0.027), ('features', 0.026), ('deviations', 0.026), ('feb', 0.025), ('selections', 0.025), ('interlingual', 0.025), ('uni', 0.025), ('books', 0.024), ('fed', 0.024), ('translating', 0.024), ('induce', 0.024), ('np', 0.024), ('chose', 0.024), ('exploiting', 0.024), ('adaptation', 0.023), ('nonnegative', 0.023), ('operator', 0.023), ('conduct', 0.023), ('lexicons', 0.022), ('pakdd', 0.022), ('yuhong', 0.022), ('lled', 0.022), ('shrinkage', 0.021), ('categorization', 0.021), ('ij', 0.02), ('auxiliary', 0.019), ('proposed', 0.019), ('resources', 0.018), ('amini', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 12 nips-2013-A Novel Two-Step Method for Cross Language Representation Learning

Author: Min Xiao, Yuhong Guo

Abstract: Cross language text classification is an important learning task in natural language processing. A critical challenge of cross language learning arises from the fact that words of different languages are in disjoint feature spaces. In this paper, we propose a two-step representation learning method to bridge the feature spaces of different languages by exploiting a set of parallel bilingual documents. Specifically, we first formulate a matrix completion problem to produce a complete parallel document-term matrix for all documents in two languages, and then induce a low dimensional cross-lingual document representation by applying latent semantic indexing on the obtained matrix. We use a projected gradient descent algorithm to solve the formulated matrix completion problem with convergence guarantees. The proposed method is evaluated by conducting a set of experiments with cross language sentiment classification tasks on Amazon product reviews. The experimental results demonstrate that the proposed learning method outperforms a number of other cross language representation learning methods, especially when the number of parallel bilingual documents is small. 1

2 0.10621507 172 nips-2013-Learning word embeddings efficiently with noise-contrastive estimation

Author: Andriy Mnih, Koray Kavukcuoglu

Abstract: Continuous-valued word embeddings learned by neural language models have recently been shown to capture semantic and syntactic information about words very well, setting performance records on several word similarity tasks. The best results are obtained by learning high-dimensional embeddings from very large quantities of data, which makes scalability of the training method a critical factor. We propose a simple and scalable new approach to learning word embeddings based on training log-bilinear models with noise-contrastive estimation. Our approach is simpler, faster, and produces better results than the current state-of-theart method. We achieve results comparable to the best ones reported, which were obtained on a cluster, using four times less data and more than an order of magnitude less computing time. We also investigate several model types and find that the embeddings learned by the simpler models perform at least as well as those learned by the more complex ones. 1

3 0.083739057 65 nips-2013-Compressive Feature Learning

Author: Hristo S. Paskov, Robert West, John C. Mitchell, Trevor Hastie

Abstract: This paper addresses the problem of unsupervised feature learning for text data. Our method is grounded in the principle of minimum description length and uses a dictionary-based compression scheme to extract a succinct feature set. Specifically, our method finds a set of word k-grams that minimizes the cost of reconstructing the text losslessly. We formulate document compression as a binary optimization task and show how to solve it approximately via a sequence of reweighted linear programs that are efficient to solve and parallelizable. As our method is unsupervised, features may be extracted once and subsequently used in a variety of tasks. We demonstrate the performance of these features over a range of scenarios including unsupervised exploratory analysis and supervised text categorization. Our compressed feature space is two orders of magnitude smaller than the full k-gram space and matches the text categorization accuracy achieved in the full feature space. This dimensionality reduction not only results in faster training times, but it can also help elucidate structure in unsupervised learning tasks and reduce the amount of training data necessary for supervised learning. 1

4 0.077989444 81 nips-2013-DeViSE: A Deep Visual-Semantic Embedding Model

Author: Andrea Frome, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc'Aurelio Ranzato, Tomas Mikolov

Abstract: Modern visual recognition systems are often limited in their ability to scale to large numbers of object categories. This limitation is in part due to the increasing difficulty of acquiring sufficient training data in the form of labeled images as the number of object categories grows. One remedy is to leverage data from other sources – such as text data – both to train visual models and to constrain their predictions. In this paper we present a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data as well as semantic information gleaned from unannotated text. We demonstrate that this model matches state-of-the-art performance on the 1000-class ImageNet object recognition challenge while making more semantically reasonable errors, and also show that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training. Semantic knowledge improves such zero-shot predictions achieving hit rates of up to 18% across thousands of novel labels never seen by the visual model. 1

5 0.076950088 356 nips-2013-Zero-Shot Learning Through Cross-Modal Transfer

Author: Richard Socher, Milind Ganjoo, Christopher D. Manning, Andrew Ng

Abstract: This work introduces a model that can recognize objects in images even if no training data is available for the object class. The only necessary knowledge about unseen visual categories comes from unsupervised text corpora. Unlike previous zero-shot learning models, which can only differentiate between unseen classes, our model can operate on a mixture of seen and unseen classes, simultaneously obtaining state of the art performance on classes with thousands of training images and reasonable performance on unseen classes. This is achieved by seeing the distributions of words in texts as a semantic space for understanding what objects look like. Our deep learning model does not require any manually defined semantic or visual features for either words or images. Images are mapped to be close to semantic word vectors corresponding to their classes, and the resulting image embeddings can be used to distinguish whether an image is of a seen or unseen class. We then use novelty detection methods to differentiate unseen classes from seen classes. We demonstrate two novelty detection strategies; the first gives high accuracy on unseen classes, while the second is conservative in its prediction of novelty and keeps the seen classes’ accuracy high. 1

6 0.073360533 96 nips-2013-Distributed Representations of Words and Phrases and their Compositionality

7 0.068058021 211 nips-2013-Non-Linear Domain Adaptation with Boosting

8 0.061175145 335 nips-2013-Transfer Learning in a Transductive Setting

9 0.058571614 99 nips-2013-Dropout Training as Adaptive Regularization

10 0.058249082 5 nips-2013-A Deep Architecture for Matching Short Texts

11 0.057679396 149 nips-2013-Latent Structured Active Learning

12 0.056465842 75 nips-2013-Convex Two-Layer Modeling

13 0.055377476 276 nips-2013-Reshaping Visual Datasets for Domain Adaptation

14 0.054821119 216 nips-2013-On Flat versus Hierarchical Classification in Large-Scale Taxonomies

15 0.054746065 174 nips-2013-Lexical and Hierarchical Topic Regression

16 0.04983224 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion

17 0.049183469 164 nips-2013-Learning and using language via recursive pragmatic reasoning about other agents

18 0.048309665 307 nips-2013-Speedup Matrix Completion with Side Information: Application to Multi-Label Learning

19 0.04764777 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints

20 0.046560138 108 nips-2013-Error-Minimizing Estimates and Universal Entry-Wise Error Bounds for Low-Rank Matrix Completion


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.11), (1, 0.057), (2, -0.046), (3, -0.002), (4, 0.084), (5, -0.063), (6, -0.024), (7, 0.029), (8, -0.003), (9, 0.031), (10, -0.048), (11, 0.028), (12, -0.029), (13, -0.016), (14, 0.027), (15, -0.074), (16, 0.072), (17, 0.095), (18, 0.047), (19, 0.046), (20, -0.078), (21, -0.089), (22, 0.013), (23, -0.019), (24, 0.033), (25, 0.008), (26, 0.094), (27, -0.056), (28, -0.01), (29, -0.024), (30, -0.005), (31, -0.029), (32, -0.003), (33, 0.001), (34, 0.03), (35, -0.027), (36, 0.018), (37, -0.05), (38, -0.004), (39, 0.102), (40, -0.023), (41, 0.001), (42, 0.006), (43, 0.026), (44, 0.006), (45, -0.067), (46, -0.05), (47, -0.024), (48, -0.085), (49, -0.064)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94853842 12 nips-2013-A Novel Two-Step Method for Cross Language Representation Learning

Author: Min Xiao, Yuhong Guo

Abstract: Cross language text classification is an important learning task in natural language processing. A critical challenge of cross language learning arises from the fact that words of different languages are in disjoint feature spaces. In this paper, we propose a two-step representation learning method to bridge the feature spaces of different languages by exploiting a set of parallel bilingual documents. Specifically, we first formulate a matrix completion problem to produce a complete parallel document-term matrix for all documents in two languages, and then induce a low dimensional cross-lingual document representation by applying latent semantic indexing on the obtained matrix. We use a projected gradient descent algorithm to solve the formulated matrix completion problem with convergence guarantees. The proposed method is evaluated by conducting a set of experiments with cross language sentiment classification tasks on Amazon product reviews. The experimental results demonstrate that the proposed learning method outperforms a number of other cross language representation learning methods, especially when the number of parallel bilingual documents is small. 1

2 0.6744749 172 nips-2013-Learning word embeddings efficiently with noise-contrastive estimation

Author: Andriy Mnih, Koray Kavukcuoglu

Abstract: Continuous-valued word embeddings learned by neural language models have recently been shown to capture semantic and syntactic information about words very well, setting performance records on several word similarity tasks. The best results are obtained by learning high-dimensional embeddings from very large quantities of data, which makes scalability of the training method a critical factor. We propose a simple and scalable new approach to learning word embeddings based on training log-bilinear models with noise-contrastive estimation. Our approach is simpler, faster, and produces better results than the current state-of-theart method. We achieve results comparable to the best ones reported, which were obtained on a cluster, using four times less data and more than an order of magnitude less computing time. We also investigate several model types and find that the embeddings learned by the simpler models perform at least as well as those learned by the more complex ones. 1

3 0.63096368 96 nips-2013-Distributed Representations of Words and Phrases and their Compositionality

Author: Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, Jeff Dean

Abstract: The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of “Canada” and “Air” cannot be easily combined to obtain “Air Canada”. Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.

4 0.59438413 65 nips-2013-Compressive Feature Learning

Author: Hristo S. Paskov, Robert West, John C. Mitchell, Trevor Hastie

Abstract: This paper addresses the problem of unsupervised feature learning for text data. Our method is grounded in the principle of minimum description length and uses a dictionary-based compression scheme to extract a succinct feature set. Specifically, our method finds a set of word k-grams that minimizes the cost of reconstructing the text losslessly. We formulate document compression as a binary optimization task and show how to solve it approximately via a sequence of reweighted linear programs that are efficient to solve and parallelizable. As our method is unsupervised, features may be extracted once and subsequently used in a variety of tasks. We demonstrate the performance of these features over a range of scenarios including unsupervised exploratory analysis and supervised text categorization. Our compressed feature space is two orders of magnitude smaller than the full k-gram space and matches the text categorization accuracy achieved in the full feature space. This dimensionality reduction not only results in faster training times, but it can also help elucidate structure in unsupervised learning tasks and reduce the amount of training data necessary for supervised learning. 1

5 0.57304353 356 nips-2013-Zero-Shot Learning Through Cross-Modal Transfer

Author: Richard Socher, Milind Ganjoo, Christopher D. Manning, Andrew Ng

Abstract: This work introduces a model that can recognize objects in images even if no training data is available for the object class. The only necessary knowledge about unseen visual categories comes from unsupervised text corpora. Unlike previous zero-shot learning models, which can only differentiate between unseen classes, our model can operate on a mixture of seen and unseen classes, simultaneously obtaining state of the art performance on classes with thousands of training images and reasonable performance on unseen classes. This is achieved by seeing the distributions of words in texts as a semantic space for understanding what objects look like. Our deep learning model does not require any manually defined semantic or visual features for either words or images. Images are mapped to be close to semantic word vectors corresponding to their classes, and the resulting image embeddings can be used to distinguish whether an image is of a seen or unseen class. We then use novelty detection methods to differentiate unseen classes from seen classes. We demonstrate two novelty detection strategies; the first gives high accuracy on unseen classes, while the second is conservative in its prediction of novelty and keeps the seen classes’ accuracy high. 1

6 0.5608266 211 nips-2013-Non-Linear Domain Adaptation with Boosting

7 0.55154258 10 nips-2013-A Latent Source Model for Nonparametric Time Series Classification

8 0.54900569 81 nips-2013-DeViSE: A Deep Visual-Semantic Embedding Model

9 0.53318465 276 nips-2013-Reshaping Visual Datasets for Domain Adaptation

10 0.52869779 223 nips-2013-On the Relationship Between Binary Classification, Bipartite Ranking, and Binary Class Probability Estimation

11 0.52828372 90 nips-2013-Direct 0-1 Loss Minimization and Margin Maximization with Boosting

12 0.52174985 335 nips-2013-Transfer Learning in a Transductive Setting

13 0.50082505 216 nips-2013-On Flat versus Hierarchical Classification in Large-Scale Taxonomies

14 0.49620736 336 nips-2013-Translating Embeddings for Modeling Multi-relational Data

15 0.49016565 263 nips-2013-Reasoning With Neural Tensor Networks for Knowledge Base Completion

16 0.48585433 275 nips-2013-Reservoir Boosting : Between Online and Offline Ensemble Learning

17 0.47819194 164 nips-2013-Learning and using language via recursive pragmatic reasoning about other agents

18 0.47465152 274 nips-2013-Relevance Topic Model for Unstructured Social Group Activity Recognition

19 0.47008017 98 nips-2013-Documents as multiple overlapping windows into grids of counts

20 0.45874578 88 nips-2013-Designed Measurements for Vector Count Data


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(16, 0.04), (33, 0.106), (34, 0.059), (39, 0.335), (41, 0.023), (49, 0.02), (55, 0.02), (56, 0.083), (70, 0.011), (85, 0.029), (89, 0.018), (93, 0.132), (95, 0.014), (99, 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.75374252 12 nips-2013-A Novel Two-Step Method for Cross Language Representation Learning

Author: Min Xiao, Yuhong Guo

Abstract: Cross language text classification is an important learning task in natural language processing. A critical challenge of cross language learning arises from the fact that words of different languages are in disjoint feature spaces. In this paper, we propose a two-step representation learning method to bridge the feature spaces of different languages by exploiting a set of parallel bilingual documents. Specifically, we first formulate a matrix completion problem to produce a complete parallel document-term matrix for all documents in two languages, and then induce a low dimensional cross-lingual document representation by applying latent semantic indexing on the obtained matrix. We use a projected gradient descent algorithm to solve the formulated matrix completion problem with convergence guarantees. The proposed method is evaluated by conducting a set of experiments with cross language sentiment classification tasks on Amazon product reviews. The experimental results demonstrate that the proposed learning method outperforms a number of other cross language representation learning methods, especially when the number of parallel bilingual documents is small. 1

2 0.65064979 76 nips-2013-Correlated random features for fast semi-supervised learning

Author: Brian McWilliams, David Balduzzi, Joachim Buhmann

Abstract: This paper presents Correlated Nystr¨ m Views (XNV), a fast semi-supervised alo gorithm for regression and classification. The algorithm draws on two main ideas. First, it generates two views consisting of computationally inexpensive random features. Second, multiview regression, using Canonical Correlation Analysis (CCA) on unlabeled data, biases the regression towards useful features. It has been shown that CCA regression can substantially reduce variance with a minimal increase in bias if the views contains accurate estimators. Recent theoretical and empirical work shows that regression with random features closely approximates kernel regression, implying that the accuracy requirement holds for random views. We show that XNV consistently outperforms a state-of-the-art algorithm for semi-supervised learning: substantially improving predictive performance and reducing the variability of performance on a wide variety of real-world datasets, whilst also reducing runtime by orders of magnitude. 1

3 0.58422405 118 nips-2013-Fast Determinantal Point Process Sampling with Application to Clustering

Author: Byungkon Kang

Abstract: Determinantal Point Process (DPP) has gained much popularity for modeling sets of diverse items. The gist of DPP is that the probability of choosing a particular set of items is proportional to the determinant of a positive definite matrix that defines the similarity of those items. However, computing the determinant requires time cubic in the number of items, and is hence impractical for large sets. In this paper, we address this problem by constructing a rapidly mixing Markov chain, from which we can acquire a sample from the given DPP in sub-cubic time. In addition, we show that this framework can be extended to sampling from cardinalityconstrained DPPs. As an application, we show how our sampling algorithm can be used to provide a fast heuristic for determining the number of clusters, resulting in better clustering.

4 0.54644942 341 nips-2013-Universal models for binary spike patterns using centered Dirichlet processes

Author: Il M. Park, Evan W. Archer, Kenneth Latimer, Jonathan W. Pillow

Abstract: Probabilistic models for binary spike patterns provide a powerful tool for understanding the statistical dependencies in large-scale neural recordings. Maximum entropy (or “maxent”) models, which seek to explain dependencies in terms of low-order interactions between neurons, have enjoyed remarkable success in modeling such patterns, particularly for small groups of neurons. However, these models are computationally intractable for large populations, and low-order maxent models have been shown to be inadequate for some datasets. To overcome these limitations, we propose a family of “universal” models for binary spike patterns, where universality refers to the ability to model arbitrary distributions over all 2m binary patterns. We construct universal models using a Dirichlet process centered on a well-behaved parametric base measure, which naturally combines the flexibility of a histogram and the parsimony of a parametric model. We derive computationally efficient inference methods using Bernoulli and cascaded logistic base measures, which scale tractably to large populations. We also establish a condition for equivalence between the cascaded logistic and the 2nd-order maxent or “Ising” model, making cascaded logistic a reasonable choice for base measure in a universal model. We illustrate the performance of these models using neural data. 1

5 0.52001452 146 nips-2013-Large Scale Distributed Sparse Precision Estimation

Author: Huahua Wang, Arindam Banerjee, Cho-Jui Hsieh, Pradeep Ravikumar, Inderjit Dhillon

Abstract: We consider the problem of sparse precision matrix estimation in high dimensions using the CLIME estimator, which has several desirable theoretical properties. We present an inexact alternating direction method of multiplier (ADMM) algorithm for CLIME, and establish rates of convergence for both the objective and optimality conditions. Further, we develop a large scale distributed framework for the computations, which scales to millions of dimensions and trillions of parameters, using hundreds of cores. The proposed framework solves CLIME in columnblocks and only involves elementwise operations and parallel matrix multiplications. We evaluate our algorithm on both shared-memory and distributed-memory architectures, which can use block cyclic distribution of data and parameters to achieve load balance and improve the efficiency in the use of memory hierarchies. Experimental results show that our algorithm is substantially more scalable than state-of-the-art methods and scales almost linearly with the number of cores. 1

6 0.51580083 65 nips-2013-Compressive Feature Learning

7 0.51100874 211 nips-2013-Non-Linear Domain Adaptation with Boosting

8 0.50450546 172 nips-2013-Learning word embeddings efficiently with noise-contrastive estimation

9 0.50384361 339 nips-2013-Understanding Dropout

10 0.50312853 215 nips-2013-On Decomposing the Proximal Map

11 0.48576602 99 nips-2013-Dropout Training as Adaptive Regularization

12 0.47656888 82 nips-2013-Decision Jungles: Compact and Rich Models for Classification

13 0.47493359 94 nips-2013-Distributed $k$-means and $k$-median Clustering on General Topologies

14 0.47022778 30 nips-2013-Adaptive dropout for training deep neural networks

15 0.46683282 251 nips-2013-Predicting Parameters in Deep Learning

16 0.46216962 5 nips-2013-A Deep Architecture for Matching Short Texts

17 0.46140134 276 nips-2013-Reshaping Visual Datasets for Domain Adaptation

18 0.45933181 69 nips-2013-Context-sensitive active sensing in humans

19 0.45830297 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables

20 0.45697364 340 nips-2013-Understanding variable importances in forests of randomized trees