
74 nips-2011-Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection

Source: pdf

Author: Richard Socher, Eric H. Huang, Jeffrey Pennington, Christopher D. Manning, Andrew Y. Ng

Abstract: Paraphrase detection is the task of examining two sentences and determining whether they have the same meaning. In order to obtain high accuracy on this task, thorough syntactic and semantic analysis of the two statements is needed. We introduce a method for paraphrase detection based on recursive autoencoders (RAE). Our unsupervised RAEs are based on a novel unfolding objective and learn feature vectors for phrases in syntactic trees. These features are used to measure the word- and phrase-wise similarity between two sentences. Since sentences may be of arbitrary length, the resulting matrix of similarity measures is of variable size. We introduce a novel dynamic pooling layer which computes a fixed-sized representation from the variable-sized matrices. The pooled representation is then used as input to a classifier. Our method outperforms other state-of-the-art approaches on the challenging MSRP paraphrase corpus.

Reference: text

Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We introduce a method for paraphrase detection based on recursive autoencoders (RAE). [sent-8, score-0.682]

2 Our unsupervised RAEs are based on a novel unfolding objective and learn feature vectors for phrases in syntactic trees. [sent-9, score-0.68]

3 Since sentences may be of arbitrary length, the resulting matrix of similarity measures is of variable size. [sent-11, score-0.31]

4 We introduce a novel dynamic pooling layer which computes a fixed-sized representation from the variable-sized matrices. [sent-12, score-0.355]

5 Our method outperforms other state-of-the-art approaches on the challenging MSRP paraphrase corpus. [sent-14, score-0.408]

6 We present a joint model that incorporates the similarities between both single word features as well as multi-word phrases extracted from the nodes of parse trees. [sent-20, score-0.538]

7 The first component is an unfolding recursive autoencoder (RAE) for unsupervised feature learning from unlabeled parse trees. [sent-23, score-0.816]

8 It learns feature representations for each node in the tree such that the word vectors underneath each node can be recursively reconstructed. [sent-25, score-0.478]

9 These feature representations are used to compute a similarity matrix that compares both the single words as well as all nonterminal node features in both sentences. [sent-26, score-0.375]

10 In order to keep as much of the resulting global information of this comparison as possible and deal with the arbitrary length of the two sentences, we then introduce our second component: a new dynamic pooling layer which outputs a fixed-size representation. [sent-27, score-0.355]

11 Any classifier such as a softmax classifier can then be used to classify whether the two sentences are paraphrases or not. [sent-28, score-0.315]

12 We first describe the unsupervised feature learning with RAEs followed by a description of pooling and classification. [sent-29, score-0.254]

13 The recursive autoencoder learns phrase features for each node in a parse tree. [sent-34, score-0.73]

14 Using a novel dynamic pooling layer we can compare the variable-sized sentences and classify pairs as being paraphrases or not. [sent-36, score-0.67]

15 2 Recursive Autoencoders In this section we describe two variants of unsupervised recursive autoencoders which can be used to learn features from parse trees. [sent-37, score-0.485]

16 The RAE aims to find vector representations for variable-sized phrases spanned by each node of a parse tree. [sent-38, score-0.431]

17 The word vectors inside the embedding matrix capture distributional syntactic and semantic information via the word’s co-occurrence statistics. [sent-46, score-0.404]

18 2 (left) shows an instance of a recursive autoencoder (RAE) applied to a given parse tree as introduced by [12]. [sent-56, score-0.531]

19 Initial experiments showed that having a syntactically plausible tree structure is important for paraphrase detection. [sent-58, score-0.439]

20 Each child can be either an input word vector xi or a nonterminal node in the tree. [sent-65, score-0.292]

21 For simplicity we left out the reconstruction layer at the first node y1 which is the same standard autoencoder for both models. [sent-71, score-0.404]

22 Left: A standard autoencoder that tries to reconstruct only its direct children. [sent-72, score-0.255]

23 Right: The unfolding autoencoder which tries to reconstruct all leaf nodes underneath each node. [sent-73, score-0.631]

24 During training, the goal is to minimize the reconstruction error of all input pairs at nonterminal nodes p in a given parse tree T: $E_{rec}(T) = \sum_{p \in T} E_{rec}(p)$ (3). For the example in Fig. [sent-81, score-0.305]

25 3 Unfolding Recursive Autoencoder The unfolding RAE has the same encoding scheme as the standard RAE. [sent-88, score-0.3]

26 For instance, at node $y_2$, the reconstruction error is the difference between the leaf nodes underneath that node $[x_1; x_2; x_3]$ and their reconstructed counterparts $[x'_1; x'_2; x'_3]$. [sent-91, score-0.328]

27 The unfolding produces the reconstructed leaves by starting at $y_2$ and computing $[x'_1; y'_1] = f(W_d y_2 + b_d)$. [sent-92, score-0.259]

28 The unfolding autoencoder essentially tries to encode each hidden layer such that it best reconstructs its entire subtree to the leaf nodes. [sent-103, score-0.633]

29 Another potential problem of the standard RAE is that it gives equal weight to the last merged phrases even if one is only a single word (in Fig. [sent-105, score-0.313]

30 In contrast, the unfolding RAE captures the increased importance of a child when the child represents a larger subtree. [sent-107, score-0.354]

31 While the top layer at each node has to have the same dimensionality as each child (in order for the same network to be recursively compatible), the hidden layer may have arbitrary dimensionality. [sent-111, score-0.359]

32 After the unsupervised training of the RAE, we demonstrate that the learned feature representations capture syntactic and semantic similarities and can be used for paraphrase detection. [sent-119, score-0.763]

33 3 An Architecture for Variable-Sized Similarity Matrices Now that we have described the unsupervised feature learning, we explain how to use these features to classify sentence pairs as being in a paraphrase relationship or not. [sent-120, score-0.687]

34 1 Computing Sentence Similarity Matrices Our method incorporates both single word and phrase similarities in one framework. [sent-122, score-0.298]

35 First, the RAE computes phrase vectors for the nodes in a given parse tree. [sent-123, score-0.316]

36 We then compute Euclidean distances between all word and phrase vectors of the two sentences. [sent-124, score-0.316]

37 For computing the similarity matrix, the rows and columns are first filled by the words in their original sentence order. [sent-127, score-0.334]

38 However, since the matrix dimensions vary based on the sentence lengths one cannot simply feed the similarity matrix into a standard neural network or classifier. [sent-132, score-0.349]

39 2 Figure 3: Example of the dynamic min-pooling layer finding the smallest number in a pooling window region of the original similarity matrix S. [sent-138, score-0.479]

40 2 Dynamic Pooling Consider a similarity matrix S generated by sentences of lengths n and m. [sent-140, score-0.31]

41 Our first step in constructing such a map is to partition the rows and columns of S into $n_p$ roughly equal parts, producing an $n_p \times n_p$ grid. [sent-143, score-0.255]

42 This procedure will have a slightly finer granularity for the single word similarities which is desired for our task since word overlap is a good indicator for paraphrases. [sent-154, score-0.303]

43 In the rare cases when $n_p > R$, the pooling layer needs to first up-sample. [sent-155, score-0.375]

44 The unfolding RAE captures most closely both syntactic and semantic similarities. [sent-167, score-0.51]

45 there are similar words or phrases in both sentences, we keep this information by applying a min function to the pooling regions. [sent-168, score-0.423]

46 This dynamic pooling layer could make use of overlapping pooling regions, but for simplicity, we consider only non-overlapping pooling regions. [sent-170, score-0.749]

47 For all paraphrase experiments we used the Microsoft Research paraphrase corpus (MSRP) introduced by Dolan et al. [sent-176, score-0.843]

48 The average sentence length is 21, the shortest sentence has 7 words and the longest 36. [sent-179, score-0.435]

49 3,900 are labeled as being in the paraphrase relationship (technically defined as “mostly bidirectional entailment”). [sent-180, score-0.408]

50 8) was set to have 200 units for both standard and unfolding RAEs. [sent-192, score-0.259]

51 1 Qualitative Evaluation of Nearest Neighbors In order to show that the learned feature representations capture important semantic and syntactic information even for higher nodes in the tree, we visualize nearest neighbor phrases of varying length. [sent-194, score-0.51]

52 After embedding sentences from the Gigaword corpus, we compute nearest neighbors for all nodes in all trees. [sent-195, score-0.285]

53 In Table 1 the first phrase is a randomly chosen phrase and the remaining phrases are the closest phrases in the dataset that are not in the same sentence. [sent-196, score-0.616]

54 We compare the two autoencoder models above: RAE and unfolding RAE without hidden layers, as well as a recursive averaging baseline (R. [sent-199, score-0.699]

55 The unfolding RAE can reconstruct perfectly almost all phrases of 2 and 3 words and many with up to 5 words. [sent-212, score-0.527]

56 Recursive averaging is almost entirely focused on an exact string match of the last merged words of the current phrase in the tree. [sent-217, score-0.272]

57 This leads the nearest neighbors to incorrectly add various extra information which would break the paraphrase relationship if we only considered the top node vectors and ignores syntactic similarity. [sent-218, score-0.734]

58 Finally, the unfolding RAE captures most closely the underlying syntactic and semantic structure. [sent-220, score-0.51]

59 2 Reconstructing Phrases via Recursive Decoding In this section we analyze the information captured by the unfolding RAE’s 100-dimensional phrase vectors. [sent-222, score-0.386]

60 In order to show how much of the information can be recovered we recursively reconstruct sentences after encoding them. [sent-224, score-0.303]

61 It starts from a phrase vector of a nonterminal node in the parse tree. [sent-226, score-0.386]

62 We then unfold the tree as given during encoding and find the nearest neighbor word to each of the reconstructed leaf node vectors. [sent-227, score-0.351]

63 Table 2 shows that the unfolding RAE can very well reconstruct phrases of up to length five. [sent-228, score-0.482]

64 Longer phrases retain some correct words and usually the correct part of speech but the semantics of the words get merged. [sent-230, score-0.271]

65 The results are from the unfolding RAE that directly computes the parent representation as in Eq. [sent-231, score-0.292]

66 3 Evaluation on Full-Sentence Paraphrasing We now turn to evaluating the unsupervised features and our dynamic pooling architecture in our main task of paraphrase detection. [sent-234, score-0.787]

67 For instance, numbers often have very similar representations, but even small differences are crucial to reject the paraphrase relation in the MSRP dataset. [sent-236, score-0.408]

68 The first is 1 if two sentences contain exactly the same numbers or no number and 0 otherwise, the second is 1 if both sentences contain the same numbers and the third is 1 if the set of numbers in one sentence is a strict subset of the numbers in the other sentence. [sent-238, score-0.567]

69 Since our pooling-layer cannot capture sentence length or the number of exact string matches, we also add the difference in sentence length and the percentage of words and phrases in one sentence that are in the other sentence and vice-versa. [sent-239, score-1.074]

70 For all of our models and training setups, we perform 10-fold cross-validation on the training set to choose the best regularization parameters and $n_p$, the size of the pooling matrix $S \in \mathbb{R}^{n_p \times n_p}$. [sent-241, score-0.312]

71 The best pooling size was consistently $n_p = 15$, slightly less than the average sentence length. [sent-244, score-0.477]

72 6 Table 3: Test results on the MSRP paraphrase corpus. [sent-273, score-0.408]

73 We observe that the dynamic pooling layer is very powerful because it captures the global structure of the similarity matrix which in turn captures the syntactic and semantic similarities of the two sentences. [sent-279, score-0.808]

74 With the help of this powerful dynamic pooling layer and good initial word vectors even the standard RAE and recursive averaging perform well on this dataset with an accuracy of 75. [sent-280, score-0.746]

75 Next, we compare the dynamic pooling to simpler feature extraction methods. [sent-287, score-0.262]

76 Our comparison shows that the dynamic pooling architecture is important for achieving high accuracy. [sent-288, score-0.295]

77 The low performance shows that our dynamic pooling layer better captures the global similarity information than aggregate statistics. [sent-293, score-0.488]

78 In order to better recover exact string matches it may be necessary to explore overlapping pooling regions. [sent-302, score-0.292]

79 The performance shows that while the unfolding RAE is by itself very powerful, the dynamic pooling layer is needed to extract all information from its trees. [sent-306, score-0.614]

80 Our unfolding RAE and dynamic similarity pooling architecture achieves state-of-the-art performance without hand-designed semantic taxonomies and features such as WordNet. [sent-308, score-0.734]

81 In Table 4 we show several examples of correctly classified paraphrase candidate pairs together with their similarity matrix after dynamic min-pooling. [sent-310, score-0.597]

82 The first and last pair are simple cases of paraphrase and not paraphrase. [sent-311, score-0.408]

83 Even though there is a clear diagonal with good string matches, the gap in the center shows that the first sentence contains much extra information. [sent-316, score-0.263]

84 5 Related Work The field of paraphrase detection has progressed immensely in recent years. [sent-318, score-0.408]

85 In their approach they choose for each open-class word the single most similar word in the other sentence. [sent-322, score-0.264]

86 Simple paraphrase pairs have clear diagonal structure due to perfect word matches with Euclidean distance 0 (dark blue). [sent-336, score-0.567]

87 There are two shortcomings of such methods: They ignore (i) the syntactic structure of the sentences (by comparing only single words) and (ii) the global structure of such a similarity matrix (by computing only the mean). [sent-341, score-0.463]

88 Most recently, Das and Smith [15] adopted the idea that paraphrases have related syntactic structure. [sent-343, score-0.282]

89 We merge these word-based models and syntactic models in one joint framework: Our matrix consists of phrase similarities and instead of just taking the mean of the similarities we can capture the global layout of the matrix via our min-pooling layer. [sent-346, score-0.418]

90 The idea of applying an autoencoder in a recursive setting was introduced by Pollack [9] and extended recently by [10]. [sent-347, score-0.373]

91 One of the major shortcomings of previous applications of recursive autoencoders to natural language sentences was their binary word representation as discussed in Sec. [sent-349, score-0.652]

92 Recently, Bottou discussed related ideas of recursive autoencoders [25] and recursive image and text understanding but without experimental results. [sent-352, score-0.471]

93 Supervised recursive neural networks have been used for parsing images and natural language sentences by Socher et al. [sent-354, score-0.443]

94 Lastly, [12] introduced the standard recursive autoencoder as mentioned in Sect. [sent-356, score-0.373]

95 The RAE captures syntactic and semantic information as shown qualitatively with nearest neighbor embeddings and quantitatively on a paraphrase detection task. [sent-360, score-0.697]

96 Our RAE phrase features allow us to compare both single word vectors as well as phrases and complete syntactic trees. [sent-361, score-0.65]

97 In order to make use of the global comparison of variable length sentences in a similarity matrix we introduce a new dynamic pooling architecture that produces a fixed-sized representation. [sent-362, score-0.605]

98 We show that this pooled representation captures enough information about the sentence pair to determine the paraphrase relationship on the MSRP dataset with a higher accuracy than any previously published results. [sent-363, score-0.642]

99 Unsupervised construction of large paraphrase corpora: exploiting massively parallel news sources. [sent-386, score-0.408]

100 Learning continuous phrase representations and syntactic parsing with recursive neural networks. [sent-534, score-0.524]
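
The similarity matrix and dynamic min-pooling steps summarized in the sentences above (Euclidean distances between all word and phrase vectors, a partition of the rows and columns into roughly equal parts, and a min over each non-overlapping region) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names `distance_matrix`, `split_bounds` and `dynamic_min_pool`, the floor-based choice of region boundaries, and up-sampling by repeating rows and columns are all choices made for this example.

```python
import numpy as np

def distance_matrix(A, B):
    """Euclidean distances between every row of A (n, d) and every row of B (m, d)."""
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def split_bounds(length, n_parts):
    """Integer cut points dividing range(length) into n_parts roughly equal chunks.

    Assumption: floor-based boundaries; any roughly equal split would do.
    Requires length >= n_parts so that no chunk is empty.
    """
    return [(i * length) // n_parts for i in range(n_parts + 1)]

def dynamic_min_pool(S, n_p):
    """Map a variable-sized matrix S to a fixed n_p x n_p matrix by min-pooling."""
    S = np.asarray(S, dtype=float)
    # Assumption: when a dimension is smaller than n_p, up-sample by
    # repeating rows or columns until it is at least n_p.
    for axis in (0, 1):
        if S.shape[axis] < n_p:
            reps = int(np.ceil(n_p / S.shape[axis]))
            S = np.repeat(S, reps, axis=axis)
    rows = split_bounds(S.shape[0], n_p)
    cols = split_bounds(S.shape[1], n_p)
    pooled = np.empty((n_p, n_p))
    for i in range(n_p):
        for j in range(n_p):
            # Non-overlapping pooling region; min keeps the closest match.
            region = S[rows[i]:rows[i + 1], cols[j]:cols[j + 1]]
            pooled[i, j] = region.min()
    return pooled
```

Taking the minimum rather than the mean keeps a small distance (a close word or phrase match) visible in the pooled output even when the rest of its region holds large distances.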

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('rae', 0.487), ('paraphrase', 0.408), ('unfolding', 0.259), ('pooling', 0.197), ('recursive', 0.197), ('sentence', 0.195), ('sentences', 0.186), ('phrases', 0.181), ('autoencoder', 0.176), ('syntactic', 0.153), ('word', 0.132), ('paraphrases', 0.129), ('phrase', 0.127), ('parse', 0.127), ('similarity', 0.094), ('layer', 0.093), ('erec', 0.09), ('np', 0.085), ('msrp', 0.077), ('spooled', 0.077), ('autoencoders', 0.077), ('node', 0.076), ('string', 0.068), ('dynamic', 0.065), ('language', 0.06), ('reconstruction', 0.059), ('semantic', 0.059), ('unsupervised', 0.057), ('nonterminal', 0.056), ('wd', 0.054), ('pollack', 0.052), ('raes', 0.052), ('underneath', 0.052), ('killed', 0.049), ('representations', 0.047), ('words', 0.045), ('children', 0.044), ('reconstruct', 0.042), ('suffering', 0.042), ('encoding', 0.041), ('captures', 0.039), ('similarities', 0.039), ('postpone', 0.039), ('qualifying', 0.039), ('socher', 0.039), ('nearest', 0.038), ('watch', 0.037), ('tries', 0.037), ('deep', 0.036), ('hidden', 0.035), ('recursively', 0.034), ('collobert', 0.034), ('dolan', 0.034), ('political', 0.034), ('leaf', 0.033), ('architecture', 0.033), ('parent', 0.033), ('averaging', 0.032), ('nodes', 0.032), ('das', 0.032), ('tree', 0.031), ('matrix', 0.03), ('vectors', 0.03), ('neighbors', 0.029), ('decoding', 0.028), ('child', 0.028), ('emnlp', 0.028), ('features', 0.027), ('corpus', 0.027), ('matches', 0.027), ('distances', 0.027), ('layers', 0.026), ('stanford', 0.026), ('australasian', 0.026), ('baldwin', 0.026), ('bloom', 0.026), ('bremer', 0.026), ('defence', 0.026), ('embargo', 0.026), ('hewitt', 0.026), ('islam', 0.026), ('kissinger', 0.026), ('lleyton', 0.026), ('malaysia', 0.026), ('merrill', 0.026), ('mihalcea', 0.026), ('modesto', 0.026), ('peace', 0.026), ('pennington', 0.026), ('plaintiffs', 0.026), ('racquet', 0.026), ('rus', 0.026), ('saddam', 0.026), ('sally', 0.026), ('secretaries', 0.026), ('seventeen', 0.026), ('sporting', 0.026), ('summit', 0.026), 
('tennis', 0.026), ('tiburtina', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999976 74 nips-2011-Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection

Author: Richard Socher, Eric H. Huang, Jeffrey Pennington, Christopher D. Manning, Andrew Y. Ng

2 0.10473345 176 nips-2011-Multi-View Learning of Word Embeddings via CCA

Author: Paramveer Dhillon, Dean P. Foster, Lyle H. Ungar

Abstract: Recently, there has been substantial interest in using large amounts of unlabeled data to learn word representations which can then be used as features in supervised classifiers for NLP tasks. However, most current approaches are slow to train, do not model the context of the word, and lack theoretical grounding. In this paper, we present a new learning method, Low Rank Multi-View Learning (LR-MVL) which uses a fast spectral method to estimate low dimensional context-specific word representations from unlabeled data. These representation features can then be used with any supervised learner. LR-MVL is extremely fast, gives guaranteed convergence to a global optimum, is theoretically elegant, and achieves state-of-the-art performance on named entity recognition (NER) and chunking problems.

1 Introduction and Related Work

Over the past decade there has been increased interest in using unlabeled data to supplement the labeled data in semi-supervised learning settings to overcome the inherent data sparsity and get improved generalization accuracies in high dimensional domains like NLP. Approaches like [1, 2] have been empirically very successful and have achieved excellent accuracies on a variety of NLP tasks. However, it is often difficult to adapt these approaches to use in conjunction with an existing supervised NLP system as these approaches enforce a particular choice of model. An increasingly popular alternative is to learn representational embeddings for words from a large collection of unlabeled data (typically using a generative model), and to use these embeddings to augment the feature set of a supervised learner. Embedding methods produce features in low dimensional spaces or over a small vocabulary size, unlike the traditional approach of working in the original high dimensional vocabulary space with only one dimension “on” at a given time. Broadly, these embedding methods fall into two categories:

1. Clustering based word representations: Clustering methods, often hierarchical, are used to group distributionally similar words based on their contexts. The two dominant approaches are Brown Clustering [3] and [4]. As recently shown, HMMs can also be used to induce a multinomial distribution over possible clusters [5].
2. Dense representations: These representations are dense, low dimensional and real-valued. Each dimension of these representations captures latent information about a combination of syntactic and semantic word properties. They can either be induced using neural networks like C&W embeddings [6] and Hierarchical log-linear (HLBL) embeddings [7] or by eigen-decomposition of the word co-occurrence matrix, e.g. Latent Semantic Analysis/Latent Semantic Indexing (LSA/LSI) [8].

Unfortunately, most of these representations are 1) slow to train, 2) sensitive to the scaling of the embeddings (especially $\ell_2$ based approaches like LSA/PCA), 3) can get stuck in local optima (like EM trained HMM) and 4) learn a single embedding for a given word type; i.e. all the occurrences of the word “bank” will have the same embedding, irrespective of whether the context of the word suggests it means “a financial institution” or “a river bank”. In this paper, we propose a novel context-specific word embedding method called Low Rank Multi-View Learning, LR-MVL, which is fast to train and is guaranteed to converge to the optimal solution. As presented here, our LR-MVL embeddings are context-specific, but context oblivious embeddings (like the ones used by [6, 7]) can be trivially gotten from our model. Furthermore, building on recent advances in spectral learning for sequence models like HMMs [9, 10, 11] we show that LR-MVL has strong theoretical grounding. Particularly, we show that LR-MVL estimates low dimensional context-specific word embeddings which preserve all the information in the data if the data were generated by an HMM.
Moreover, LR-MVL being linear does not face the danger of getting stuck in local optima as is the case for an EM trained HMM. LR-MVL falls into category (2) mentioned above; it learns real-valued context-specific word embeddings by performing Canonical Correlation Analysis (CCA) [12] between the past and future views of low rank approximations of the data. However, LR-MVL is more general than those methods, which work on bigram or trigram co-occurrence matrices, in that it uses longer word sequence information to estimate context-specific embeddings and also for the reasons mentioned in the last paragraph. The remainder of the paper is organized as follows. In the next section we give a brief overview of CCA, which forms the core of our method. Section 3 describes our proposed LR-MVL algorithm in detail and gives theory supporting its performance. Section 4 demonstrates the effectiveness of LR-MVL on the NLP tasks of Named Entity Recognition and Chunking. We conclude with a brief summary in Section 5.

2 Brief Review: Canonical Correlation Analysis (CCA)

CCA [12] is the analog to Principal Component Analysis (PCA) for pairs of matrices. PCA computes the directions of maximum covariance between elements in a single matrix, whereas CCA computes the directions of maximal correlation between a pair of matrices. Unlike PCA, CCA does not depend on how the observations are scaled. This invariance of CCA to linear data transformations allows proofs that keeping the dominant singular vectors (those with largest singular values) will faithfully capture any state information. More specifically, given a set of n paired observation vectors $\{(l_1, r_1), \ldots, (l_n, r_n)\}$ (in our case the two matrices are the left (L) and right (R) context matrices of a word), we would like to simultaneously find the directions $\Phi_l$ and $\Phi_r$ that maximize the correlation of the projections of L onto $\Phi_l$ with the projections of R onto $\Phi_r$.
This is expressed as

$$\max_{\Phi_l, \Phi_r} \frac{E[\langle L, \Phi_l \rangle \langle R, \Phi_r \rangle]}{\sqrt{E[\langle L, \Phi_l \rangle^2] \, E[\langle R, \Phi_r \rangle^2]}} \qquad (1)$$

where E denotes the empirical expectation. We use the notation $C_{lr}$ ($C_{ll}$) to denote the cross (auto) covariance matrices between L and R (i.e. L’R and L’L respectively). The left and right canonical correlates are the solutions $\Phi_l$, $\Phi_r$ of the following equations:

$$C_{ll}^{-1} C_{lr} C_{rr}^{-1} C_{rl} \Phi_l = \lambda \Phi_l, \qquad C_{rr}^{-1} C_{rl} C_{ll}^{-1} C_{lr} \Phi_r = \lambda \Phi_r \qquad (2)$$

3 Low Rank Multi-View Learning (LR-MVL)

In LR-MVL, we compute the CCA between the past and future views of the data on a large unlabeled corpus to find the common latent structure, i.e., the hidden state associated with each token. These induced representations of the tokens can then be used as features in a supervised classifier (typically discriminative). The context around a word, consisting of the h words to the right and left of it, sits in a high dimensional space, since for a vocabulary of size v, each of the h words in the context requires an indicator function of dimension v. The key move in LR-MVL is to project the v-dimensional word space down to a k dimensional state space. Thus, all eigenvector computations are done in a space that is v/k times smaller than the original space. Since a typical vocabulary contains at least 50,000 words, and we use state spaces of order k ≈ 50 dimensions, this gives a 1,000-fold reduction in the size of calculations that are needed. The core of our LR-MVL algorithm is a fast spectral method for learning a v × k matrix A which maps each of the v words in the vocabulary to a k-dimensional state vector. We call this matrix the “eigenfeature dictionary”. We now describe the LR-MVL method, give a theorem that provides intuition into how it works, and formally present the LR-MVL algorithm. The Experiments section then shows that this low rank approximation allows us to achieve state-of-the-art performance on NLP tasks.

3.1 The LR-MVL method

Given an unlabeled token sequence $w = \{w_0, w_1, \ldots, w_n\}$ we want to learn a low (k)-dimensional state vector $\{z_0, z_1, \ldots, z_n\}$ for each observed token. The key is to find a v × k matrix A (Algorithm 1) that maps each of the v words in the vocabulary to a reduced rank k-dimensional state vector, which is later used to induce context specific embeddings for the tokens (Algorithm 2). For supervised learning, these context specific embeddings are supplemented with other information about each token $w_t$, such as its identity, orthographic features such as prefixes and suffixes or membership in domain-specific lexicons, and used as features in a classifier. Section 3.4 gives the algorithm more formally, but the key steps in the algorithm are, in general terms:

• Take the h words to the left and to the right of each target word $w_t$ (the “Left” and “Right” contexts), and project them each down to k dimensions using A.
• Take the CCA between the reduced rank left and right contexts, and use the resulting model to estimate a k dimensional state vector (the “hidden state”) for each token.
• Take the CCA between the hidden states and the tokens $w_t$. The singular vectors associated with $w_t$ form a new estimate of the eigenfeature dictionary.

LR-MVL can be viewed as a type of co-training [13]: The state of each token $w_t$ is similar to that of the tokens both before and after it, and it is also similar to the states of the other occurrences of the same word elsewhere in the document (used in the outer iteration). LR-MVL takes advantage of these two different types of similarity by alternately estimating word state using CCA on the smooths of the states of the words before and after each target token and using the average over the states associated with all other occurrences of that word.
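
The CCA computation that the steps above rely on can be illustrated with a small NumPy sketch. It is not taken from the paper: the function name `cca`, the centering of the inputs, the small ridge term `reg`, and the whitening-plus-SVD route are assumptions chosen for the example. The returned singular values are the canonical correlations, and their squares correspond to the eigenvalues λ of the eigenproblems in the CCA review.

```python
import numpy as np

def cca(X, Y, d, reg=1e-8):
    """Return the top-d canonical directions for X (n, p) and Y (n, q).

    Assumptions for this sketch: inputs are centered here, and a small
    ridge term `reg` keeps the covariance matrices invertible.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])  # auto-covariance of X
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])  # auto-covariance of Y
    Cxy = X.T @ Y / n                             # cross-covariance

    def inv_sqrt(C):
        # Inverse matrix square root of a symmetric positive definite matrix.
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    # SVD of the whitened cross-covariance; singular values are the
    # canonical correlations, sorted from largest to smallest.
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    Phi_x = Wx @ U[:, :d]
    Phi_y = Wy @ Vt.T[:, :d]
    return Phi_x, Phi_y, s[:d]
```

Because both views are whitened before the SVD, rescaling either input leaves the canonical correlations essentially unchanged, which is the scale invariance the CCA review points out.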
3.2 Theoretical Properties of LR-MVL

We now present the theory behind the LR-MVL algorithm; particularly we show that the reduced rank matrix A allows a significant data reduction while preserving the information in our data and the estimated state does the best possible job of capturing any label information that can be inferred by a linear model. Let L be an n × hv matrix giving the words in the left context of each of the n tokens, where the context is of length h, R be the corresponding n × hv matrix for the right context, and W be an n × v matrix of indicator functions for the words themselves. We will use the following assumptions at various points in our proof:

Assumption 1. L, W, and R come from a rank k HMM i.e. it has a rank k observation matrix and rank k transition matrix both of which have the same domain. For example, if the dimension of the hidden state is k and the vocabulary size is v then the observation matrix, which is k × v, has rank k. This rank condition is similar to the one used by [10].

Assumption 1A. For the three views, L, W and R assume that there exists a “hidden state H” of dimension n × k, where each row $H_i$ has the same non-singular variance-covariance matrix and such that $E(L_i | H_i) = H_i \beta_L^T$ and $E(R_i | H_i) = H_i \beta_R^T$ and $E(W_i | H_i) = H_i \beta_W^T$ where all β’s are of rank k, where $L_i$, $R_i$ and $W_i$ are the rows of L, R and W respectively. Assumption 1A follows from Assumption 1.

Assumption 2. ρ(L, W), ρ(L, R) and ρ(W, R) all have rank k, where $\rho(X_1, X_2)$ is the expected correlation between $X_1$ and $X_2$. Assumption 2 is a rank condition similar to that in [9].

Assumption 3. ρ([L, R], W) has k distinct singular values. Assumption 3 just makes the proof a little cleaner, since if there are repeated singular values, then the singular vectors are not unique. Without it, we would have to phrase results in terms of subspaces with identical singular values.
We also need to define the CCA function that computes the left and right singular vectors for a pair of matrices:

Definition 1 (CCA). Compute the CCA between two matrices $X_1$ and $X_2$. Let $\Phi_{X_1}$ be a matrix containing the d largest singular vectors for $X_1$ (sorted from the largest on down). Likewise for $\Phi_{X_2}$. Define the function $\mathrm{CCA}_d(X_1, X_2) = [\Phi_{X_1}, \Phi_{X_2}]$. When we want just one of these Φ’s, we will use $\mathrm{CCA}_d(X_1, X_2)_{left} = \Phi_{X_1}$ for the left singular vectors and $\mathrm{CCA}_d(X_1, X_2)_{right} = \Phi_{X_2}$ for the right singular vectors. Note that the resulting singular vectors, $[\Phi_{X_1}, \Phi_{X_2}]$ can be used to give two redundant estimates, $X_1 \Phi_{X_1}$ and $X_2 \Phi_{X_2}$ of the “hidden” state relating $X_1$ and $X_2$, if such a hidden state exists.

Definition 2. Define the symbol “≈” to mean $X_1 \approx X_2 \iff \lim_{n \to \infty} X_1 = \lim_{n \to \infty} X_2$, where n is the sample size.

Lemma 1. Define A by the following limit of the right singular vectors: $\mathrm{CCA}_k([L, R], W)_{right} \approx A$. Under assumptions 2, 3 and 1A, such that if $\mathrm{CCA}_k(L, R) \equiv [\Phi_L, \Phi_R]$ then $\mathrm{CCA}_k([L\Phi_L, R\Phi_R], W)_{right} \approx A$.

Lemma 1 shows that instead of finding the CCA between the full context and the words, we can take the CCA between the Left and Right contexts, estimate a k dimensional state from them, and take the CCA of that state with the words and get the same result. See the supplementary material for the Proof. Let $\tilde{A}_h$ denote a matrix formed by stacking h copies of A on top of each other. Right multiplying L or R by $\tilde{A}_h$ projects each of the words in that context into the k-dimensional reduced rank space. The following theorem addresses the core of the LR-MVL algorithm, showing that there is an A which gives the desired dimensionality reduction. Specifically, it shows that the previous lemma also holds in the reduced rank space.

Theorem 1. Under assumptions 1, 2 and 3 there exists a unique matrix A such that if $\mathrm{CCA}_k(L\tilde{A}_h, R\tilde{A}_h) \equiv [\tilde{\Phi}_L, \tilde{\Phi}_R]$ then $\mathrm{CCA}_k([L\tilde{A}_h \tilde{\Phi}_L, R\tilde{A}_h \tilde{\Phi}_R], W)_{right} \approx A$, where $\tilde{A}_h$ is the stacked form of A.
See the supplementary material for the Proof (footnote 1).

Footnote 1: It is worth noting that our matrix A corresponds to the matrix $\hat{U}$ used by [9, 10]. They showed that U is sufficient to compute the probability of a sequence of words generated by an HMM; although we do not show it here (due to limited space), our A provides a more statistically efficient estimate of U than their $\hat{U}$, and hence can also be used to estimate the sequence probabilities.

Under the above assumptions, there is asymptotically (in the limit of infinite data) no benefit to first estimating state by finding the CCA between the left and right contexts and then finding the CCA between the estimated state and the words. One could instead just directly find the CCA between the combined left and right contexts and the words. However, because of the Zipfian distribution of words, many words are rare or even unique, and hence one is not in the asymptotic limit. In this case, CCA between the rare words and context will not be informative, whereas finding the CCA between the left and right contexts gives a good state vector estimate even for unique words. One can then fruitfully find the CCA between the contexts and the estimated state vector for their associated words.

3.3 Using Exponential Smooths

In practice, we replace the projected left and right contexts with exponential smooths of them at a few different time scales. Each smooth is a weighted average of the previous (or next) token’s state, Z_{t−1} (or Z_{t+1}), and the previous (or next) token’s smoothed state, S_{t−1} (or S_{t+1}). This gives a further dimension reduction by a factor of context length h (say 100 words) divided by the number of smooths (often 5-7). We use a mixture of both very short and very long contexts which capture short and long range dependencies as required by NLP problems such as NER, Chunking, WSD etc. Since exponential smooths are linear, we preserve the linearity of our method.
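The exponential smooths can be illustrated with a small sketch. It uses the update S_t = (1 − α) S_{t−1} + α Z_{t−1} for the left view (and the mirrored update for the right view) from Algorithm 1. The function names `left_smooths` and `right_smooths`, the plain-list data layout, and the zero initial smooth are assumptions made for illustration only, not details taken from the paper.

```python
def left_smooths(states, alphas):
    """Return one row per token holding the concatenated left-context smooths.

    states: list of state vectors Z_t (lists of floats), one per token.
    alphas: smoothing rates; a small alpha gives a long memory.
    Assumption: the smooth before the first token is the zero vector.
    """
    dim = len(states[0])
    rows = [[] for _ in states]
    for alpha in alphas:
        s = [0.0] * dim  # assumed initial smooth
        for t in range(len(states)):
            if t > 0:
                prev = states[t - 1]
                # S_t = (1 - alpha) * S_{t-1} + alpha * Z_{t-1}
                s = [(1 - alpha) * si + alpha * zi for si, zi in zip(s, prev)]
            rows[t].extend(s)
    return rows


def right_smooths(states, alphas):
    """Right view: the left view of the reversed sequence, reversed back."""
    return left_smooths(states[::-1], alphas)[::-1]
```

With k-dimensional states and m smoothing rates, each row has k·m entries, which is where the reduction from a context of h words to a handful of smooths comes from.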
3.4 The LR-MVL Algorithm

The LR-MVL algorithm (using exponential smooths) is given in Algorithm 1; it computes the pair of CCAs described above in Theorem 1.

Algorithm 1 LR-MVL Algorithm - Learning from Large amounts of Unlabeled Data
1: Input: Token sequence $W_{n \times v}$, state space size k, smoothing rates $\alpha^j$
2: Initialize the eigenfeature dictionary A to random values N(0, 1).
3: repeat
4:   Set the state $Z_t$ (1 < t ≤ n) of each token $w_t$ to the eigenfeature vector of the corresponding word.
       $Z_t = (A_w : w = w_t)$
5:   Smooth the state estimates before and after each token to get a pair of views for each smoothing rate $\alpha^j$.
       $S_t^{(l,j)} = (1 - \alpha^j) S_{t-1}^{(l,j)} + \alpha^j Z_{t-1}$   // left view L
       $S_t^{(r,j)} = (1 - \alpha^j) S_{t+1}^{(r,j)} + \alpha^j Z_{t+1}$   // right view R
     where the t-th rows of L and R are, respectively, concatenations of the smooths $S_t^{(l,j)}$ and $S_t^{(r,j)}$ for each of the $\alpha^{(j)}$s.
6:   Find the left and right canonical correlates, which are the eigenvectors $\Phi_l$ and $\Phi_r$ of
       $(L^T L)^{-1} L^T R (R^T R)^{-1} R^T L \Phi_l = \lambda \Phi_l$
       $(R^T R)^{-1} R^T L (L^T L)^{-1} L^T R \Phi_r = \lambda \Phi_r$
7:   Project the left and right views on to the space spanned by the top k/2 left and right CCAs respectively
       $X_l = L \Phi_l^{(k/2)}$ and $X_r = R \Phi_r^{(k/2)}$
     where $\Phi_l^{(k/2)}$, $\Phi_r^{(k/2)}$ are matrices composed of the singular vectors of $\Phi_l$, $\Phi_r$ with the k/2 largest magnitude singular values. Estimate the state for each word $w_t$ as the union of the left and right estimates: $Z = [X_l, X_r]$
8:   Estimate the eigenfeatures of each word type, w, as the average of the states estimated for that word.
       $A_w = avg(Z_t : w_t = w)$
9:   Compute the change in A from the previous iteration
10: until $|\Delta A| < \epsilon$
11: Output: $\Phi_l^k$, $\Phi_r^k$, A.

A few iterations (∼ 5) of the above algorithm are sufficient to converge to the solution. (Since the problem is convex, there is a single solution, so there is no issue of local minima.)
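Step 8 of Algorithm 1 reduces to a per-word-type average of the token-level state estimates. A minimal sketch of that step alone follows; the name `update_eigenfeatures` and the dictionary representation of A are illustrative assumptions, and the CCA of steps 6 and 7 is deliberately not shown.

```python
from collections import defaultdict


def update_eigenfeatures(tokens, states):
    """Average the estimated states Z_t over all tokens of each word type.

    tokens: list of word strings w_t.
    states: list of state vectors Z_t (lists of floats), aligned with tokens.
    Returns a dict mapping each word type w to A_w = avg(Z_t : w_t = w).
    """
    sums = {}
    counts = defaultdict(int)
    for word, z in zip(tokens, states):
        if word not in sums:
            sums[word] = [0.0] * len(z)
        sums[word] = [a + b for a, b in zip(sums[word], z)]
        counts[word] += 1
    return {w: [v / counts[w] for v in total] for w, total in sums.items()}
```

For example, `update_eigenfeatures(["a", "b", "a"], [[1.0, 2.0], [0.0, 0.0], [3.0, 4.0]])` assigns the word type "a" the mean of its two token states.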
As [14] show for PCA, one can start with a random matrix that is only slightly larger than the true rank k of the correlation matrix, and with extremely high likelihood converge in a few iterations to within a small distance of the true principal components. In our case, if the assumptions detailed above (1, 1A, 2 and 3) are satisfied, our method converges equally rapidly to the true canonical variates.

As mentioned earlier, we get further dimensionality reduction in Step 5, by replacing the Left and Right context matrices with a set of exponentially smoothed values of the reduced rank projections of the context words. Step 6 finds the CCA between the Left and Right contexts. Step 7 estimates the state by combining the estimates from the left and right contexts, since we don’t know which will best estimate the state. Step 8 takes the CCA between the estimated state Z and the matrix of words W. Because W is a vector of indicator functions, this CCA takes the trivial form of a set of averages.

Once we have estimated the CCA model, it is used to generate context specific embeddings for the tokens from training, development and test sets (as described in Algorithm 2). These embeddings are further supplemented with other baseline features and used in a supervised learner to predict the label of the token.

Algorithm 2 LR-MVL Algorithm - Inducing Context Specific Embeddings for Train/Dev/Test Data
1: Input: Model ($\Phi_l^k$, $\Phi_r^k$, A) output from above algorithm and Token sequences $W_{train}$, ($W_{dev}$, $W_{test}$)
2: Project the left and right views L and R after smoothing onto the space spanned by the top k left and right CCAs respectively
     $X_l = L \Phi_l^k$ and $X_r = R \Phi_r^k$
   and the words onto the eigenfeature dictionary
     $X_w = W_{train} A$
3: Form the final embedding matrix $X_{train:embed}$ by concatenating these three estimates of state
     $X_{train:embed} = [X_l, X_w, X_r]$
4: Output: The embedding matrices $X_{train:embed}$, ($X_{dev:embed}$, $X_{test:embed}$) with context-specific representations for the tokens.
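Algorithm 2 can be sketched in a few lines once the model is available. The sketch below assumes the smoothed views L and R and the projection matrices are given as nested lists, and that A is a dictionary from word type to eigenfeature vector (so multiplying the indicator matrix W by A becomes a lookup). The helper names and the zero vector used for unseen words are assumptions for illustration, not part of the paper.

```python
def matmul(X, P):
    """Multiply a row-major matrix X (n x d) by P (d x k) using plain lists."""
    return [[sum(x[i] * P[i][j] for i in range(len(P)))
             for j in range(len(P[0]))] for x in X]


def context_specific_embeddings(L, R, tokens, phi_l, phi_r, A, k):
    """Concatenate [X_l, X_w, X_r] for every token.

    L, R: smoothed left/right views, one row per token.
    phi_l, phi_r: projection matrices (top-k canonical correlates).
    A: dict word type -> k-dimensional eigenfeature vector.
    """
    X_l = matmul(L, phi_l)
    X_r = matmul(R, phi_r)
    # Assumption: words missing from A get a zero vector.
    X_w = [A.get(w, [0.0] * k) for w in tokens]
    return [xl + xw + xr for xl, xw, xr in zip(X_l, X_w, X_r)]
```

Dropping the `X_l` and `X_r` parts and keeping only the lookup in `A` gives the context-oblivious variant, with one embedding per word type.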
These embeddings are augmented with baseline set of features mentioned in Sections 4.1.1 and 4.1.2 before learning the final classifier. Note that we can get context “oblivious” embeddings i.e. one embedding per word type, just by using the eigenfeature dictionary ($A_{v \times k}$) output by Algorithm 1.

4 Experimental Results

In this section we present the experimental results of LR-MVL on Named Entity Recognition (NER) and Syntactic Chunking tasks. We compare LR-MVL to state-of-the-art semi-supervised approaches like [1] (Alternating Structures Optimization (ASO)) and [2] (Semi-supervised extension of CRFs) as well as embeddings like C&W, HLBL and Brown Clustering.

4.1 Datasets and Experimental Setup

For the NER experiments we used the data from CoNLL 2003 shared task and for Chunking experiments we used the CoNLL 2000 shared task data (footnote 2) with standard training, development and testing set splits. The CoNLL ’03 and the CoNLL ’00 datasets had ∼ 204K/51K/46K and ∼ 212K/−/47K tokens respectively for Train/Dev./Test sets.

4.1.1 Named Entity Recognition (NER)

We use the same set of baseline features as used by [15, 16] in their experiments. The detailed list of features is as below:
• Current Word w_i; Its type information: all-capitalized, is-capitalized, all-digits and so on; Prefixes and suffixes of w_i
• Word tokens in window of 2 around the current word i.e. d = (w_{i−2}, w_{i−1}, w_i, w_{i+1}, w_{i+2}); and capitalization pattern in the window.
• Previous two predictions y_{i−1} and y_{i−2} and conjunction of d and y_{i−1}
• Embedding features (LR-MVL, C&W, HLBL, Brown etc.) in a window of 2 around the current word (if applicable).

Following [17] we use regularized averaged perceptron model with above set of baseline features for the NER task. We also used their BILOU text chunk representation and fast greedy inference as it was shown to give superior performance.
Footnote 2: More details about the data and competition are available at http://www.cnts.ua.ac.be/conll2003/ner/ and http://www.cnts.ua.ac.be/conll2000/chunking/

We also augment the above set of baseline features with gazetteers, as is standard practice in NER experiments. We tuned our free parameter, namely the size of the LR-MVL embedding, on the development set. We scaled our embedding features to have an ℓ2 norm of 1 for each token and further multiplied them by a normalization constant (also chosen by cross validation), so that when they are used in conjunction with other categorical features in a linear classifier, they do not exert extra influence. The size of LR-MVL embeddings (state-space) that gave the best performance on the development set was k = 50 (50 each for X_l, X_w, X_r in Algorithm 2) i.e. the total size of embeddings was 50×3, and the best normalization constant was 0.5. We omit validation plots due to paucity of space.

4.1.2 Chunking

For our chunking experiments we use a similar base set of features as above:
• Current Word w_i and word tokens in window of 2 around the current word i.e. d = (w_{i−2}, w_{i−1}, w_i, w_{i+1}, w_{i+2});
• POS tags t_i in a window of 2 around the current word.
• Word conjunction features w_i ∩ w_{i+1}, i ∈ {−1, 0} and Tag conjunction features t_i ∩ t_{i+1}, i ∈ {−2, −1, 0, 1} and t_i ∩ t_{i+1} ∩ t_{i+2}, i ∈ {−2, −1, 0}.
• Embedding features in a window of 2 around the current word (when applicable).

Since CoNLL 00 chunking data does not have a development set, we randomly sampled 1000 sentences from the training data (8936 sentences) for development. So, we trained our chunking models on 7936 training sentences and evaluated their F1 score on the 1000 development sentences and used a CRF (footnote 3) as the supervised classifier. We tuned the size of embedding and the magnitude of ℓ2 regularization penalty in CRF on the development set and took log (or -log of the magnitude) of the value of the features (footnote 4).
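The per-token scaling described for the NER embeddings in Section 4.1.1 (unit ℓ2 norm, then a constant multiplier, with 0.5 reported as the best value) can be sketched as follows. The function name `scale_embedding` and the handling of an all-zero vector are assumptions for illustration.

```python
import math


def scale_embedding(vec, constant=0.5):
    """Rescale one token's embedding to l2 norm 1, then multiply by a constant.

    Assumption: an all-zero vector is returned unchanged, since it has no
    direction to normalize.
    """
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        return list(vec)
    return [constant * v / norm for v in vec]
```

After this step every non-zero embedding has ℓ2 norm equal to the constant, which keeps dense real-valued features on a scale comparable to the binary categorical features they are combined with.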
The regularization penalty that gave best performance on development set was 2 and here again the best size of LR-MVL embeddings (state-space) was k = 50. Finally, we trained the CRF on the entire (“original”) training data i.e. 8936 sentences.

4.1.3 Unlabeled Data and Induction of embeddings

For inducing the embeddings we used the RCV1 corpus containing Reuters newswire from Aug ’96 to Aug ’97 and containing about 63 million tokens in 3.3 million sentences (footnote 5). Case was left intact and we did not do the “cleaning” as done by [18, 16] i.e. remove all sentences which are less than 90% lowercase a-z, as our multi-view learning approach is robust to such noisy data, like news byline text (mostly all caps) which does not correlate strongly with the text of the article. We induced our LR-MVL embeddings over a period of 3 days (70 core hours on 3.0 GHz CPU) on the entire RCV1 data by performing 4 iterations, a vocabulary size of 300k and using a variety of smoothing rates (α in Algorithm 1) to capture correlations between shorter and longer contexts α = [0.005, 0.01, 0.05, 0.1, 0.5, 0.9]; theoretically we could tune the smoothing parameters on the development set but we found this mixture of long and short term dependencies to work well in practice. As far as the other embeddings are concerned i.e. C&W, HLBL and Brown Clusters, we downloaded them from http://metaoptimize.com/projects/wordreprs. The details about their induction and parameter tuning can be found in [16]; we report their best numbers here. It is also worth noting that the unsupervised training of LR-MVL was (> 1.5 times) faster than other embeddings (footnote 6).

4.2 Results

The results for NER and Chunking are shown in Tables 1 and 2, respectively, which show that LR-MVL performs significantly better than state-of-the-art competing methods on both NER and Chunking tasks.
Footnote 3: http://www.chokkan.org/software/crfsuite/
Footnote 4: Our embeddings are learnt using a linear model whereas CRF is a log-linear model, so to keep things on same scale we did this normalization.
Footnote 5: We chose this particular dataset to make a fair comparison with [1, 16], who report results using RCV1 as unlabeled data.
Footnote 6: As some of these embeddings were trained on GPGPU which makes our method even faster comparatively.

Embedding/Model                 F1-Score Dev. Set   F1-Score Test Set
No Gazetteers
  Baseline                      90.03               84.39
  C&W, 200-dim                  92.46               87.46
  HLBL, 100-dim                 92.00               88.13
  Brown 1000 clusters           92.32               88.52
  Ando & Zhang ’05              93.15               89.31
  Suzuki & Isozaki ’08          93.66               89.36
  LR-MVL (CO) 50 × 3-dim        93.11               89.55
  LR-MVL 50 × 3-dim             93.61               89.91
With Gazetteers
  HLBL, 100-dim                 92.91               89.35
  C&W, 200-dim                  92.98               88.88
  Brown, 1000 clusters          93.25               89.41
  LR-MVL (CO) 50 × 3-dim        93.91               89.89
  LR-MVL 50 × 3-dim             94.41               90.06

Table 1: NER Results. Note: 1). LR-MVL (CO) are Context Oblivious embeddings which are gotten from (A) in Algorithm 1. 2). F1-score = Harmonic Mean of Precision and Recall. 3). The current state-of-the-art for this NER task is 90.90 (Test Set) but using 700 billion tokens of unlabeled data [19].

Embedding/Model                 Test Set F1-Score
  Baseline                      93.79
  HLBL, 50-dim                  94.00
  C&W, 50-dim                   94.10
  Brown 3200 Clusters           94.11
  Ando & Zhang ’05              94.39
  Suzuki & Isozaki ’08          94.67
  LR-MVL (CO) 50 × 3-dim        95.02
  LR-MVL 50 × 3-dim             95.44

Table 2: Chunking Results.

It is important to note that in problems like NER, the final accuracy depends on performance on rare-words and since LR-MVL is robustly able to correlate past with future views, it is able to learn better representations for rare words resulting in overall better accuracy. On rare-words (occurring < 10 times in corpus), we got 11.7%, 10.7% and 9.6% relative reduction in error over C&W, HLBL and Brown respectively for NER; on chunking the corresponding numbers were 6.7%, 7.1% and 8.7%.
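The two quantities used in this discussion, F1 and relative reduction in error, are simple to compute. The sketch below is illustrative only: the function names are assumptions, and treating the error as 100 minus the F1 score is one common convention that the text above does not spell out.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def relative_error_reduction(baseline_score, new_score):
    """Fraction of the baseline error removed by the new system.

    Scores are percentages; the error is taken to be 100 - score, which is
    an assumption about the convention, not a detail from the paper.
    """
    baseline_error = 100.0 - baseline_score
    new_error = 100.0 - new_score
    return (baseline_error - new_error) / baseline_error
```

As a usage example on the overall test-set scores without gazetteers (not the rare-word figures quoted above, whose underlying scores are not listed), `relative_error_reduction(88.52, 89.91)` compares Brown clusters with LR-MVL and comes out at roughly 0.12.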
Also, it is worth mentioning that modeling the context in embeddings gives decent improvements in accuracies on both NER and Chunking problems. For the case of NER, the polysemous words were mostly like Chicago, Wales, Oakland etc., which could either be a location or organization (Sports teams, Banks etc.). When we do not use the gazetteer features (which are known lists of cities, persons, organizations etc.), we got a higher increase in F-score by modeling context than when we already had gazetteer features, which captured most of the information about polysemous words for the NER dataset, so that modeling the context did not help as much. The polysemous words for Chunking dataset were like spot (VP/NP), never (VP/ADVP), more (NP/VP/ADVP/ADJP) etc. and in this case embeddings with context helped significantly, giving 3.1-6.5% relative improvement in accuracy over context oblivious embeddings.

5 Summary and Conclusion

In this paper, we presented a novel CCA-based multi-view learning method, LR-MVL, for large scale sequence learning problems such as arise in NLP. LR-MVL is a spectral method that works in low dimensional state-space so it is computationally efficient, and can be used to train using large amounts of unlabeled data; moreover it does not get stuck in local optima like an EM trained HMM. The embeddings learnt using LR-MVL can be used as features with any supervised learner. LR-MVL has strong theoretical grounding; is much simpler and faster than competing methods and achieves state-of-the-art accuracies on NER and Chunking problems.

Acknowledgements: The authors would like to thank Alexander Yates, Ted Sandler and the three anonymous reviewers for providing valuable feedback. We would also like to thank Lev Ratinov and Joseph Turian for answering our questions regarding their paper [16].

References

[1] Ando, R., Zhang, T.: A framework for learning predictive structures from multiple tasks and unlabeled data.
Journal of Machine Learning Research 6 (2005) 1817–1853
[2] Suzuki, J., Isozaki, H.: Semi-supervised sequential labeling and segmentation using giga-word scale unlabeled data. In: ACL. (2008)
[3] Brown, P., deSouza, P., Mercer, R., Pietra, V.D., Lai, J.: Class-based n-gram models of natural language. Comput. Linguist. 18 (December 1992) 467–479
[4] Pereira, F., Tishby, N., Lee, L.: Distributional clustering of English words. In: 31st Annual Meeting of the ACL. (1993) 183–190
[5] Huang, F., Yates, A.: Distributional representations for handling sparsity in supervised sequence-labeling. ACL ’09, Stroudsburg, PA, USA, Association for Computational Linguistics (2009) 495–503
[6] Collobert, R., Weston, J.: A unified architecture for natural language processing: deep neural networks with multitask learning. ICML ’08, New York, NY, USA, ACM (2008) 160–167
[7] Mnih, A., Hinton, G.: Three new graphical models for statistical language modelling. ICML ’07, New York, NY, USA, ACM (2007) 641–648
[8] Dumais, S., Furnas, G., Landauer, T., Deerwester, S., Harshman, R.: Using latent semantic analysis to improve access to textual information. In: SIGCHI Conference on human factors in computing systems, ACM (1988) 281–285
[9] Hsu, D., Kakade, S., Zhang, T.: A spectral algorithm for learning hidden markov models. In: COLT. (2009)
[10] Siddiqi, S., Boots, B., Gordon, G.J.: Reduced-rank hidden Markov models. In: AISTATS2010. (2010)
[11] Song, L., Boots, B., Siddiqi, S.M., Gordon, G.J., Smola, A.J.: Hilbert space embeddings of hidden Markov models. In: ICML. (2010)
[12] Hotelling, H.: Canonical correlation analysis (cca). Journal of Educational Psychology (1935)
[13] Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: COLT’ 98. (1998) 92–100
[14] Halko, N., Martinsson, P.G., Tropp, J.: Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. (Dec 2010)
[15] Zhang, T., Johnson, D.: A robust risk minimization based named entity recognition system. CONLL ’03 (2003) 204–207
[16] Turian, J., Ratinov, L., Bengio, Y.: Word representations: a simple and general method for semi-supervised learning. ACL ’10, Stroudsburg, PA, USA, Association for Computational Linguistics (2010) 384–394
[17] Ratinov, L., Roth, D.: Design challenges and misconceptions in named entity recognition. In: CONLL. (2009) 147–155
[18] Liang, P.: Semi-supervised learning for natural language. Master’s thesis, Massachusetts Institute of Technology (2005)
[19] Lin, D., Wu, X.: Phrase clustering for discriminative learning. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - Volume 2. ACL ’09, Stroudsburg, PA, USA, Association for Computational Linguistics (2009) 1030–1038

3 0.10373282 244 nips-2011-Selecting Receptive Fields in Deep Networks

Author: Adam Coates, Andrew Y. Ng

Abstract: Recent deep learning and unsupervised feature learning systems that learn from unlabeled data have achieved high performance in benchmarks by using extremely large architectures with many features (hidden units) at each layer. Unfortunately, for such large architectures the number of parameters can grow quadratically in the width of the network, thus necessitating hand-coded “local receptive ﬁelds” that limit the number of connections from lower level features to higher ones (e.g., based on spatial locality). In this paper we propose a fast method to choose these connections that may be incorporated into a wide variety of unsupervised training methods. Speciﬁcally, we choose local receptive ﬁelds that group together those low-level features that are most similar to each other according to a pairwise similarity metric. This approach allows us to harness the advantages of local receptive ﬁelds (such as improved scalability, and reduced data requirements) when we do not know how to specify such receptive ﬁelds by hand or where our unsupervised training algorithm has no obvious generalization to a topographic setting. We produce results showing how this method allows us to use even simple unsupervised training algorithms to train successful multi-layered networks that achieve state-of-the-art results on CIFAR and STL datasets: 82.0% and 60.1% accuracy, respectively. 1

4 0.07502804 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition

Author: Jia Deng, Sanjeev Satheesh, Alexander C. Berg, Fei Li

Abstract: We present a novel approach to efﬁciently learn a label tree for large scale classiﬁcation with many classes. The key contribution of the approach is a technique to simultaneously determine the structure of the tree and learn the classiﬁers for each node in the tree. This approach also allows ﬁne grained control over the efﬁciency vs accuracy trade-off in designing a label tree, leading to more balanced trees. Experiments are performed on large scale image classiﬁcation with 10184 classes and 9 million images. We demonstrate signiﬁcant improvements in test accuracy and efﬁciency with less training time and more balanced trees compared to the previous state of the art by Bengio et al. 1

5 0.071080565 127 nips-2011-Image Parsing with Stochastic Scene Grammar

Author: Yibiao Zhao, Song-chun Zhu

Abstract: This paper proposes a parsing algorithm for scene understanding which includes four aspects: computing 3D scene layout, detecting 3D objects (e.g. furniture), detecting 2D faces (windows, doors etc.), and segmenting background. In contrast to previous scene labeling work that applied discriminative classiﬁers to pixels (or super-pixels), we use a generative Stochastic Scene Grammar (SSG). This grammar represents the compositional structures of visual entities from scene categories, 3D foreground/background, 2D faces, to 1D lines. The grammar includes three types of production rules and two types of contextual relations. Production rules: (i) AND rules represent the decomposition of an entity into sub-parts; (ii) OR rules represent the switching among sub-types of an entity; (iii) SET rules represent an ensemble of visual entities. Contextual relations: (i) Cooperative “+” relations represent positive links between binding entities, such as hinged faces of a object or aligned boxes; (ii) Competitive “-” relations represents negative links between competing entities, such as mutually exclusive boxes. We design an efﬁcient MCMC inference algorithm, namely Hierarchical cluster sampling, to search in the large solution space of scene conﬁgurations. The algorithm has two stages: (i) Clustering: It forms all possible higher-level structures (clusters) from lower-level entities by production rules and contextual relations. (ii) Sampling: It jumps between alternative structures (clusters) in each layer of the hierarchy to ﬁnd the most probable conﬁguration (represented by a parse tree). In our experiment, we demonstrate the superiority of our algorithm over existing methods on public dataset. In addition, our approach achieves richer structures in the parse tree. 1

6 0.069163434 250 nips-2011-Shallow vs. Deep Sum-Product Networks

7 0.065113321 113 nips-2011-Hierarchical Matching Pursuit for Image Classification: Architecture and Fast Algorithms

8 0.06133806 261 nips-2011-Sparse Filtering

9 0.054666866 156 nips-2011-Learning to Learn with Compound HD Models

10 0.054079272 151 nips-2011-Learning a Tree of Metrics with Disjoint Visual Features

11 0.049063999 115 nips-2011-Hierarchical Topic Modeling for Analysis of Time-Evolving Personal Choices

12 0.046949964 126 nips-2011-Im2Text: Describing Images Using 1 Million Captioned Photographs

13 0.046856843 234 nips-2011-Reconstructing Patterns of Information Diffusion from Incomplete Observations

14 0.046754707 58 nips-2011-Complexity of Inference in Latent Dirichlet Allocation

15 0.046032377 267 nips-2011-Spectral Methods for Learning Multivariate Latent Tree Structure

16 0.045609474 124 nips-2011-ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning

17 0.044945545 140 nips-2011-Kernel Embeddings of Latent Tree Graphical Models

18 0.044388592 141 nips-2011-Large-Scale Category Structure Aware Image Categorization

19 0.0441566 287 nips-2011-The Manifold Tangent Classifier

20 0.043648068 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.123), (1, 0.06), (2, -0.042), (3, 0.048), (4, 0.007), (5, -0.036), (6, 0.046), (7, 0.038), (8, -0.011), (9, -0.144), (10, 0.011), (11, 0.008), (12, -0.006), (13, 0.001), (14, 0.015), (15, -0.018), (16, -0.097), (17, 0.014), (18, -0.009), (19, -0.051), (20, -0.009), (21, -0.046), (22, -0.054), (23, 0.008), (24, 0.005), (25, 0.051), (26, -0.015), (27, -0.014), (28, 0.021), (29, -0.061), (30, -0.033), (31, -0.002), (32, 0.045), (33, 0.052), (34, 0.072), (35, 0.011), (36, -0.028), (37, 0.062), (38, -0.022), (39, -0.014), (40, 0.016), (41, 0.029), (42, 0.042), (43, 0.001), (44, -0.067), (45, 0.13), (46, 0.078), (47, 0.119), (48, -0.004), (49, -0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.919487 74 nips-2011-Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection

Author: Richard Socher, Eric H. Huang, Jeffrey Pennin, Christopher D. Manning, Andrew Y. Ng

2 0.64388084 93 nips-2011-Extracting Speaker-Specific Information with a Regularized Siamese Deep Network

Author: Ke Chen, Ahmad Salman

Abstract: Speech conveys different yet mixed information ranging from linguistic to speaker-speciﬁc components, and each of them should be exclusively used in a speciﬁc task. However, it is extremely difﬁcult to extract a speciﬁc information component given the fact that nearly all existing acoustic representations carry all types of speech information. Thus, the use of the same representation in both speech and speaker recognition hinders a system from producing better performance due to interference of irrelevant information. In this paper, we present a deep neural architecture to extract speaker-speciﬁc information from MFCCs. As a result, a multi-objective loss function is proposed for learning speaker-speciﬁc characteristics and regularization via normalizing interference of non-speaker related information and avoiding information loss. With LDC benchmark corpora and a Chinese speech corpus, we demonstrate that a resultant speaker-speciﬁc representation is insensitive to text/languages spoken and environmental mismatches and hence outperforms MFCCs and other state-of-the-art techniques in speaker recognition. We discuss relevant issues and relate our approach to previous work. 1

3 0.58298236 176 nips-2011-Multi-View Learning of Word Embeddings via CCA

Author: Paramveer Dhillon, Dean P. Foster, Lyle H. Ungar

4 0.53167278 250 nips-2011-Shallow vs. Deep Sum-Product Networks

Author: Olivier Delalleau, Yoshua Bengio

Abstract: We investigate the representational power of sum-product networks (computation networks analogous to neural networks, but whose individual units compute either products or weighted sums), through a theoretical analysis that compares deep (multiple hidden layers) vs. shallow (one hidden layer) architectures. We prove there exist families of functions that can be represented much more efﬁciently with a deep network than with a shallow one, i.e. with substantially fewer hidden units. Such results were not available until now, and contribute to motivate recent research involving learning of deep sum-product networks, and more generally motivate research in Deep Learning. 1 Introduction and prior work Many learning algorithms are based on searching a family of functions so as to identify one member of said family which minimizes a training criterion. The choice of this family of functions and how members of that family are parameterized can be a crucial one. Although there is no universally optimal choice of parameterization or family of functions (or “architecture”), as demonstrated by the no-free-lunch results [37], it may be the case that some architectures are appropriate (or inappropriate) for a large class of learning tasks and data distributions, such as those related to Artiﬁcial Intelligence (AI) tasks [4]. Different families of functions have different characteristics that can be appropriate or not depending on the learning task of interest. One of the characteristics that has spurred much interest and research in recent years is depth of the architecture. In the case of a multi-layer neural network, depth corresponds to the number of (hidden and output) layers. A ﬁxedkernel Support Vector Machine is considered to have depth 2 [4] and boosted decision trees to have depth 3 [7]. 
Here we use the word circuit or network to talk about a directed acyclic graph, where each node is associated with some output value which can be computed based on the values associated with its predecessor nodes. The arguments of the learned function are set at the input nodes of the circuit (which have no predecessor) and the outputs of the function are read off the output nodes of the circuit. Different families of functions correspond to different circuits and allowed choices of computations in each node. Learning can be performed by changing the computation associated with a node, or rewiring the circuit (possibly changing the number of nodes). The depth of the circuit is the length of the longest path in the graph from an input node to an output node. Deep Learning algorithms [3] are tailored to learning circuits with variable depth, typically greater than depth 2. They are based on the idea of multiple levels of representation, with the intuition that the raw input can be represented at different levels of abstraction, with more abstract features of the input or more abstract explanatory factors represented by deeper circuits. These algorithms are often based on unsupervised learning, opening the door to semi-supervised learning and efficient use of large quantities of unlabeled data [3]. Analogies with the structure of the cerebral cortex (in particular the visual cortex) [31] and similarities between features learned with some Deep Learning algorithms and those hypothesized in the visual cortex [17] further motivate investigations into deep architectures. It has been suggested that deep architectures are more powerful in the sense of being able to more efficiently represent highly-varying functions [4, 3]. In this paper, we measure “efficiency” in terms of the number of computational units in the network.
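The definition of depth used here (the length of the longest path from an input node to an output node of a directed acyclic graph) can be made concrete with a short sketch. The representation of the circuit as a dictionary from each node to its list of predecessors, and the name `circuit_depth`, are assumptions chosen for illustration.

```python
from functools import lru_cache


def circuit_depth(preds):
    """Length of the longest input-to-output path in a DAG.

    preds: dict mapping every node name to the list of its predecessor
    nodes; input nodes map to an empty list.
    """
    @lru_cache(maxsize=None)
    def depth(node):
        parents = preds[node]
        if not parents:
            return 0  # an input node starts every path
        return 1 + max(depth(p) for p in parents)

    return max(depth(node) for node in preds)
```

For a circuit with four inputs, two hidden units that each read two inputs, and one output unit that reads both hidden units, `circuit_depth` returns 2, which matches the convention that a network with a single hidden layer has depth two.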
An efﬁcient representation is important mainly because: (i) it uses less memory and is faster to compute, and (ii) given a ﬁxed amount of training samples and computational power, better generalization is expected. The ﬁrst successful algorithms for training deep architectures appeared in 2006, with efﬁcient training procedures for Deep Belief Networks [14] and deep auto-encoders [13, 27, 6], both exploiting the general idea of greedy layer-wise pre-training [6]. Since then, these ideas have been investigated further and applied in many settings, demonstrating state-of-the-art learning performance in object recognition [16, 28, 18, 15] and segmentation [20], audio classiﬁcation [19, 10], natural language processing [9, 36, 21, 32], collaborative ﬁltering [30], modeling textures [24], modeling motion [34, 33], information retrieval [29, 26], and semi-supervised learning [36, 22]. Poon and Domingos [25] introduced deep sum-product networks as a method to compute partition functions of tractable graphical models. These networks are analogous to traditional artiﬁcial neural networks but with nodes that compute either products or weighted sums of their inputs. Analogously to neural networks, we deﬁne “hidden” nodes as those nodes that are neither input nodes nor output nodes. If the nodes are organized in layers, we deﬁne the “hidden” layers to be those that are neither the input layer nor the output layer. Poon and Domingos [25] report experiments with networks much deeper (30+ hidden layers) than those typically used until now, e.g. in Deep Belief Networks [14, 3], where the number of hidden layers is usually on the order of three to ﬁve. Whether such deep architectures have theoretical advantages compared to so-called “shallow” architectures (i.e. those with a single hidden layer) remains an open question. 
After all, in the case of a sum-product network, the output value can always be written as a sum of products of input variables (possibly raised to some power by allowing multiple connections from the same input), and consequently it is easily rewritten as a shallow network with a sum output unit and product hidden units. The argument supported by our theoretical analysis is that a deep architecture is able to compute some functions much more efficiently than a shallow one. Until recently, very few theoretical results supported the idea that deep architectures could present an advantage in terms of representing some functions more efficiently. Most related results originate from the analysis of boolean circuits (see e.g. [2] for a review). Well-known results include the proof that solving the n-bit parity task with a depth-2 circuit requires an exponential number of gates [1, 38], and more generally that there exist functions computable with a polynomial-size depth-k circuit that would require exponential size when restricted to depth k − 1 [11]. Another recent result on boolean circuits by Braverman [8] offers proof of a longstanding conjecture, showing that bounded-depth boolean circuits are unable to distinguish some (non-uniform) input distributions from the uniform distribution (i.e. they are “fooled” by such input distributions). In particular, Braverman’s result suggests that shallow circuits can in general be fooled more easily than deep ones, i.e., that they would have more difficulty efficiently representing high-order dependencies (those involving many input variables). It is not obvious that circuit complexity results (that typically consider only boolean or at least discrete nodes) are directly applicable in the context of typical machine learning algorithms such as neural networks (that compute continuous representations of their input). Orponen [23] surveys theoretical results in computational complexity that are relevant to learning algorithms.
For instance, Håstad and Goldmann [12] extended some results to the case of networks of linear threshold units with positivity constraints on the weights. Bengio et al. [5, 7] investigate, respectively, complexity issues in networks of Gaussian radial basis functions and decision trees, showing intrinsic limitations of these architectures, e.g. on tasks similar to the parity problem. Utgoff and Stracuzzi [35] informally discuss the advantages of depth in boolean circuits in the context of learning architectures. Bengio [3] suggests that some polynomials could be represented more efficiently by deep sum-product networks, but without providing any formal statement or proofs. This work partly addresses this void by demonstrating families of circuits for which a deep architecture can be exponentially more efficient than a shallow one in the context of real-valued polynomials.

Note that we do not address in this paper the problem of learning these parameters: even if an efficient deep representation exists for the function we seek to approximate, in general there is no guarantee for standard optimization algorithms to easily converge to this representation. This paper focuses on the representational power of deep sum-product circuits compared to shallow ones, and studies it by considering particular families of target functions (to be represented by the learner). We first formally define sum-product networks. We consider two families of functions represented by deep sum-product networks (families F and G). For each family, we establish a lower bound on the minimal number of hidden units a depth-2 sum-product network would require to represent a function of this family, showing it is much less efficient than the deep representation.

2 Sum-product networks

Definition 1. A sum-product network is a network composed of units that either compute the product of their inputs or a weighted sum of their inputs (where weights are strictly positive).
Here, we restrict our definition of the generic term “sum-product network” to networks whose summation units have positive incoming weights (see footnote 1), while others are called “negative-weight” networks.

Definition 2. A “negative-weight” sum-product network may contain summation units whose weights are non-positive (i.e. less than or equal to zero).

Finally, we formally define what we mean by deep vs. shallow networks in the rest of the paper.

Definition 3. A “shallow” sum-product network contains a single hidden layer (i.e. a total of three layers when counting the input and output layers, and a depth equal to two).

Definition 4. A “deep” sum-product network contains more than one hidden layer (i.e. a total of at least four layers, and a depth at least three).

3 The family F

3.1 Definition

The first family of functions we study, denoted by F, is made of functions built from deep sum-product networks that alternate layers of product and sum units with two inputs each (details are provided below). The basic idea we use here is that composing layers (i.e. using a deep architecture) is equivalent to using a factorized representation of the polynomial function computed by the network. Such a factorized representation can be exponentially more compact than its expansion as a sum of products (which can be associated to a shallow network with product units in its hidden layer and a sum unit as output). This is what we formally show in what follows.

[Figure 1: Sum-product network computing the function $f \in F$ such that $i = \lambda_{11} = \mu_{11} = 1$. Two product units, $\ell^1_1 = x_1 x_2$ and $\ell^1_2 = x_3 x_4$, feed the sum unit $\ell^2_1 = \lambda_{11}\ell^1_1 + \mu_{11}\ell^1_2 = x_1 x_2 + x_3 x_4 = f(x_1, x_2, x_3, x_4)$.]

Let $n = 4^i$, with $i$ a positive integer value. Denote by $\ell^0$ the input layer containing scalar variables $\{x_1, \ldots, x_n\}$, such that $\ell^0_j = x_j$ for $1 \le j \le n$.
Now define $f \in F$ as any function computed by a sum-product network (deep for $i \ge 2$) composed of alternating product and sum layers:

• $\ell^{2k+1}_j = \ell^{2k}_{2j-1} \cdot \ell^{2k}_{2j}$ for $0 \le k \le i-1$ and $1 \le j \le 2^{2(i-k)-1}$
• $\ell^{2k}_j = \lambda_{jk}\,\ell^{2k-1}_{2j-1} + \mu_{jk}\,\ell^{2k-1}_{2j}$ for $1 \le k \le i$ and $1 \le j \le 2^{2(i-k)}$

where the weights $\lambda_{jk}$ and $\mu_{jk}$ of the summation units are strictly positive. The output of the network is given by $f(x_1, \ldots, x_n) = \ell^{2i}_1 \in \mathbb{R}$, the unique unit in the last layer. The corresponding (shallow) network for $i = 1$ and additive weights set to one is shown in Figure 1 (this architecture is also the basic building block of bigger networks for $i > 1$). Note that both the input size $n = 4^i$ and the network’s depth $2i$ increase with parameter $i$.

Footnote 1: This condition (positive incoming weights) is required by some of the proofs presented here.

3.2 Theoretical results

The main result of this section is presented below in Corollary 1, providing a lower bound on the minimum number of hidden units required by a shallow sum-product network to represent a function $f \in F$. The high-level proof sketch consists in the following steps:

(1) Count the number of unique products found in the polynomial representation of $f$ (Lemma 1 and Proposition 1).
(2) Show that the only possible architecture for a shallow sum-product network to compute $f$ is to have a hidden layer made of product units, with a sum unit as output (Lemmas 2 to 5).
(3) Conclude that the number of hidden units must be at least the number of unique products computed in step (1) (Lemma 6 and Corollary 1).

Lemma 1. Any element $\ell^k_j$ can be written as a (positively) weighted sum of products of input variables, such that each input variable $x_t$ is used in exactly one unit of $\ell^k$. Moreover, the number $m_k$ of products found in the sum computed by $\ell^k_j$ does not depend on $j$ and obeys the following recurrence rule for $k \ge 0$: if $k+1$ is odd, then $m_{k+1} = m_k^2$, otherwise $m_{k+1} = 2 m_k$.

Proof. We prove the lemma by induction on $k$. It is obviously true for $k = 0$ since $\ell^0_j = x_j$.
Assuming this is true for some $k \ge 0$, we consider two cases:

• If $k+1$ is odd, then $\ell^{k+1}_j = \ell^k_{2j-1} \cdot \ell^k_{2j}$. By the inductive hypothesis, it is the product of two (positively) weighted sums of products of input variables, and no input variable can appear in both $\ell^k_{2j-1}$ and $\ell^k_{2j}$, so the result is also a (positively) weighted sum of products of input variables. Additionally, if the number of products in $\ell^k_{2j-1}$ and $\ell^k_{2j}$ is $m_k$, then $m_{k+1} = m_k^2$, since all products involved in the multiplication of the two units are different (since they use disjoint subsets of input variables), and the sums have positive weights. Finally, by the induction assumption, an input variable appears in exactly one unit of $\ell^k$. This unit is an input to a single unit of $\ell^{k+1}$, that will thus be the only unit of $\ell^{k+1}$ where this input variable appears.
• If $k+1$ is even, then $\ell^{k+1}_j = \lambda_{jk}\,\ell^k_{2j-1} + \mu_{jk}\,\ell^k_{2j}$. Again, from the induction assumption, it must be a (positively) weighted sum of products of input variables, but with $m_{k+1} = 2 m_k$ such products. As in the previous case, an input variable will appear in the single unit of $\ell^{k+1}$ that has as input the single unit of $\ell^k$ in which this variable must appear.

Proposition 1. The number of products in the sum computed in the output unit $\ell^{2i}_1$ of a network computing a function in $F$ is $m_{2i} = 2^{\sqrt{n}-1}$.

Proof. We first prove by induction on $k \ge 1$ that for odd $k$, $m_k = 2^{2^{(k+1)/2}-2}$, and for even $k$, $m_k = 2^{2^{k/2}-1}$. This is obviously true for $k = 1$ since $2^{2^{(1+1)/2}-2} = 2^0 = 1$, and all units in $\ell^1$ are single products of the form $x_r x_s$. Assuming this is true for some $k \ge 1$, then:

• if $k+1$ is odd, then from Lemma 1 and the induction assumption, we have: $m_{k+1} = m_k^2 = \left(2^{2^{k/2}-1}\right)^2 = 2^{2^{k/2+1}-2} = 2^{2^{((k+1)+1)/2}-2}$
• if $k+1$ is even, then instead we have: $m_{k+1} = 2 m_k = 2 \cdot 2^{2^{(k+1)/2}-2} = 2^{2^{(k+1)/2}-1}$

which shows the desired result for $k+1$, and thus concludes the induction proof.
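As a sanity check on the induction, the recurrence of Lemma 1 can be iterated directly and compared with the closed forms for odd and even $k$. The following is only an illustrative sketch: the helper names `product_counts` and `closed_form` are hypothetical and introduced here, not taken from the paper.

```python
import math

def product_counts(i):
    """Iterate the recurrence of Lemma 1 and return [m_0, ..., m_{2i}]."""
    m = [1]  # m_0 = 1: every input unit is a single variable
    for k in range(2 * i):
        if (k + 1) % 2 == 1:
            m.append(m[k] ** 2)  # odd layer (products): m_{k+1} = m_k^2
        else:
            m.append(2 * m[k])   # even layer (sums): m_{k+1} = 2 m_k
    return m

def closed_form(k):
    """Closed form used in the proof of Proposition 1, valid for k >= 1."""
    if k % 2 == 1:
        return 2 ** (2 ** ((k + 1) // 2) - 2)
    return 2 ** (2 ** (k // 2) - 1)

for i in range(1, 5):
    n = 4 ** i
    m = product_counts(i)
    # The recurrence and the closed form agree on every layer.
    assert all(m[k] == closed_form(k) for k in range(1, 2 * i + 1))
    # The output layer matches 2^(sqrt(n) - 1); sqrt(n) = 2^i is an integer.
    assert m[-1] == 2 ** (math.isqrt(n) - 1)
    print(f"i={i}  n={n}  m_2i={m[-1]}")
```

Exact integer arithmetic is used throughout (including `math.isqrt`), so the comparison is not affected by floating-point rounding.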
Applying this result with $k = 2i$ (which is even) yields

$$m_{2i} = 2^{2^{2i/2}-1} = 2^{\sqrt{2^{2i}}-1} = 2^{\sqrt{n}-1}.$$

Lemma 2. The products computed in the output unit $\ell^{2i}_1$ can be split in two groups, one with products containing only variables $x_1, \ldots, x_{n/2}$ and one containing only variables $x_{n/2+1}, \ldots, x_n$.

Proof. This is obvious since the last unit is a “sum” unit that adds two terms whose inputs are these two groups of variables (see e.g. Fig. 1).

Lemma 3. The products computed in the output unit $\ell^{2i}_1$ involve more than one input variable.

Proof. It is straightforward to show by induction on $k \ge 1$ that the products computed by $\ell^k_j$ all involve more than one input variable, thus it is true in particular for the output layer ($k = 2i$).

Lemma 4. Any shallow sum-product network computing $f \in F$ must have a “sum” unit as output.

Proof. By contradiction, suppose the output unit of such a shallow sum-product network is multiplicative. This unit must have more than one input, because in the case that it has only one input, the output would be either a (weighted) sum of input variables (which would violate Lemma 3), or a single product of input variables (which would violate Proposition 1), depending on the type (sum or product) of the single input hidden unit. Thus the last unit must compute a product of two or more hidden units. It can be re-written as a product of two factors, where each factor corresponds to either one hidden unit, or a product of multiple hidden units (it does not matter here which specific factorization is chosen among all possible ones). Regardless of the type (sum or product) of the hidden units involved, those two factors can thus be written as weighted sums of products of variables $x_t$ (with positive weights, and input variables potentially raised to powers above one). From Lemma 1, both $x_1$ and $x_n$ must be present in the final output, and thus they must appear in at least one of these two factors.
Without loss of generality, assume $x_1$ appears in the first factor. Variables $x_{n/2+1}, \ldots, x_n$ then cannot be present in the second factor, since otherwise one product in the output would contain both $x_1$ and one of these variables (this product cannot cancel out since weights must be positive), violating Lemma 2. But with a similar reasoning, since as a result $x_n$ must appear in the first factor, variables $x_1, \ldots, x_{n/2}$ cannot be present in the second factor either. Consequently, no input variable can be present in the second factor, leading to the desired contradiction.

Lemma 5. Any shallow sum-product network computing $f \in F$ must have only multiplicative units in its hidden layer.

Proof. By contradiction, suppose there exists a “sum” unit in the hidden layer, written $s = \sum_{t \in S} \alpha_t x_t$ with $S$ the set of input indices appearing in this sum, and $\alpha_t > 0$ for all $t \in S$. Since according to Lemma 4 the output unit must also be a sum (and have positive weights according to Definition 1), then the final output will also contain terms of the form $\beta_t x_t$ for $t \in S$, with $\beta_t > 0$. This violates Lemma 3, establishing the contradiction.

Lemma 6. Any shallow negative-weight sum-product network (see Definition 2) computing $f \in F$ must have at least $2^{\sqrt{n}-1}$ hidden units, if its output unit is a sum and its hidden units are products.

Proof. Such a network computes a weighted sum of its hidden units, where each hidden unit is a product of input variables, i.e. its output can be written as $\sum_j w_j \prod_t x_t^{\gamma_{jt}}$ with $w_j \in \mathbb{R}$ and $\gamma_{jt} \in \{0, 1\}$. In order to compute a function in $F$, this shallow network thus needs a number of hidden units at least equal to the number of unique products in that function. From Proposition 1, this number is equal to $2^{\sqrt{n}-1}$.

Corollary 1. Any shallow sum-product network computing $f \in F$ must have at least $2^{\sqrt{n}-1}$ hidden units.

Proof.
This is a direct corollary of Lemmas 4 (showing the output unit is a sum), 5 (showing that hidden units are products), and 6 (showing the desired result for any shallow network with this specific structure, regardless of the sign of weights).

3.3 Discussion

Corollary 1 above shows that in order to compute some function in $F$ with $n$ inputs, the number of units in a shallow network has to be at least $2^{\sqrt{n}-1}$ (i.e. grows exponentially in $\sqrt{n}$). On the other hand, the total number of units in the deep (for $i > 1$) network computing the same function, as described in Section 3.1, is equal to $1 + 2 + 4 + 8 + \ldots + 2^{2i-1}$ (since all units are binary), which is also equal to $2^{2i} - 1 = n - 1$ (i.e. grows only quadratically in $\sqrt{n}$). It shows that some deep sum-product network with $n$ inputs and depth $O(\log n)$ can represent with $O(n)$ units what would require $O(2^{\sqrt{n}})$ units for a depth-2 network. Lemma 6 also shows a similar result regardless of the sign of the weights in the summation units of the depth-2 network, but assumes a specific architecture for this network (products in the hidden layer with a sum as output).

4 The family G

In this section we present similar results with a different family of functions, denoted by G. Compared to F, one important difference of deep sum-product networks built to define functions in G is that they can vary their input size independently of their depth. Their analysis thus provides additional insight when comparing the representational efficiency of deep vs. shallow sum-product networks in the case of a fixed dataset.

4.1 Definition

Networks in family G also alternate sum and product layers, but their units have as inputs all units from the previous layer except one. More formally, define the family $G = \cup_{n \ge 2,\, i \ge 0}\, G_{in}$ of functions represented by sum-product networks, where the sub-family $G_{in}$ is made of all sum-product networks with $n$ input variables and $2i + 2$ layers (including the input layer $\ell^0$), such that:

1. $\ell^1$ contains summation units; further layers alternate multiplicative and summation units.
2. Summation units have positive weights.
3. All layers are of size $n$, except the last layer $\ell^{2i+1}$ that contains a single sum unit that sums all units in the previous layer $\ell^{2i}$.
4. In each layer $\ell^k$ for $1 \le k \le 2i$, each unit $\ell^k_j$ takes as inputs $\{\ell^{k-1}_m \mid m \ne j\}$.

An example of a network belonging to $G_{1,3}$ (i.e. with three layers above the input layer and three input variables) is shown in Figure 2.

[Figure 2: Sum-product network computing a function of $G_{1,3}$ (summation units’ weights are all 1’s). The first layer computes $\ell^1_1 = x_2 + x_3$, $\ell^1_2 = x_1 + x_3$ and $\ell^1_3 = x_1 + x_2$; the product units compute, e.g., $\ell^2_1 = x_1^2 + x_1 x_2 + x_1 x_3 + x_2 x_3$ and $\ell^2_3 = x_3^2 + x_1 x_2 + x_1 x_3 + x_2 x_3$; the output is $\ell^3_1 = x_1^2 + x_2^2 + x_3^2 + 3(x_1 x_2 + x_1 x_3 + x_2 x_3) = g(x_1, x_2, x_3)$.]

4.2 Theoretical results

The main result is stated in Proposition 3 below, establishing a lower bound on the number of hidden units of a shallow sum-product network computing $g \in G$. The proof sketch is as follows:

1. We show that the polynomial expansion of $g$ must contain a large set of products (Proposition 2 and Corollary 2).
2. We use both the number of products in that set as well as their degree to establish the desired lower bound (Proposition 3).

We will also need the following lemma, which states that when $n - 1$ items each belong to $n - 1$ sets among a total of $n$ sets, then we can associate to each item one of the sets it belongs to without using the same set for different items.

Lemma 7. Let $S_1, \ldots, S_n$ be $n$ sets ($n \ge 2$) containing elements of $\{P_1, \ldots, P_{n-1}\}$, such that for any $q$, $|\{r \mid P_q \in S_r\}| \ge n - 1$ (i.e. each element $P_q$ belongs to at least $n - 1$ sets). Then there exist $r_1, \ldots, r_{n-1}$ different indices such that $P_q \in S_{r_q}$ for $1 \le q \le n - 1$.

Proof. Omitted due to lack of space (very easy to prove by construction).

Proposition 2.
For any $0 \le j \le i$, and any product of variables $P = \prod_{t=1}^n x_t^{\alpha_t}$ such that $\alpha_t \in \mathbb{N}$ and $\sum_t \alpha_t = (n-1)^j$, there exists a unit in $\ell^{2j}$ whose computed value, when expanded as a weighted sum of products, contains $P$ among these products.

Proof. We prove this proposition by induction on $j$. First, for $j = 0$, this is obvious since any $P$ of this form must be made of a single input variable $x_t$, that appears in $\ell^0_t = x_t$.

Suppose now the proposition is true for some $j < i$. Consider a product $P = \prod_{t=1}^n x_t^{\alpha_t}$ such that $\alpha_t \in \mathbb{N}$ and $\sum_t \alpha_t = (n-1)^{j+1}$. $P$ can be factored in $n - 1$ sub-products of degree $(n-1)^j$, i.e. written $P = P_1 \ldots P_{n-1}$ with $P_q = \prod_{t=1}^n x_t^{\beta_{qt}}$, $\beta_{qt} \in \mathbb{N}$ and $\sum_t \beta_{qt} = (n-1)^j$ for all $q$. By the induction hypothesis, each $P_q$ can be found in at least one unit $\ell^{2j}_{k_q}$. As a result, by property 4 (in the definition of family G), each $P_q$ will also appear in the additive layer $\ell^{2j+1}$, in at least $n - 1$ different units (the only sum unit that may not contain $P_q$ is the one that does not have $\ell^{2j}_{k_q}$ as input). By Lemma 7, we can thus find a set of units $\ell^{2j+1}_{r_q}$ such that for any $1 \le q \le n - 1$, the product $P_q$ appears in $\ell^{2j+1}_{r_q}$, with indices $r_q$ being different from each other. Let $1 \le s \le n$ be such that $s \ne r_q$ for all $q$. Then, from property 4 of family G, the multiplicative unit $\ell^{2(j+1)}_s$ computes the product $\prod_{q=1}^{n-1} \ell^{2j+1}_{r_q}$, and as a result, when expanded as a sum of products, it contains in particular $P_1 \ldots P_{n-1} = P$. The proposition is thus true for $j + 1$, and by induction, is true for all $j \le i$.

Corollary 2. The output $g_{in}$ of a sum-product network in $G_{in}$, when expanded as a sum of products, contains all products of variables of the form $\prod_{t=1}^n x_t^{\alpha_t}$ such that $\alpha_t \in \mathbb{N}$ and $\sum_t \alpha_t = (n-1)^i$.

Proof. Applying Proposition 2 with $j = i$, we obtain that all products of this form can be found in the multiplicative units of $\ell^{2i}$.
Since the output unit $\ell^{2i+1}_1$ computes a sum of these multiplicative units (weighted with positive weights), those products are also present in the output.

Proposition 3. A shallow negative-weight sum-product network computing $g_{in} \in G_{in}$ must have at least $(n-1)^i$ hidden units.

Proof. First suppose the output unit of the shallow network is a sum. Then it may be able to compute $g_{in}$, assuming we allow multiplicative units in the hidden layer to use powers of their inputs in the product they compute (which we allow here for the proof to be more generic). However, it will require at least as many of these units as the number of unique products that can be found in the expansion of $g_{in}$. In particular, from Corollary 2, it will require at least the number of unique tuples of the form $(\alpha_1, \ldots, \alpha_n)$ such that $\alpha_t \in \mathbb{N}$ and $\sum_{t=1}^n \alpha_t = (n-1)^i$. Denoting $d_{ni} = (n-1)^i$, this number is known to be equal to $\binom{n + d_{ni} - 1}{d_{ni}}$, and it is easy to verify it is higher than (or equal to) $d_{ni}$ for any $n \ge 2$ and $i \ge 0$.

Now suppose the output unit is multiplicative. Then there can be no multiplicative hidden unit, otherwise it would mean one could factor some input variable $x_t$ in the computed function output: this is not possible since by Corollary 2, for any variable $x_t$ there exist products in the output function that do not involve $x_t$. So all hidden units must be additive, and since the computed function contains products of degree $d_{ni}$, there must be at least $d_{ni}$ such hidden units.

4.3 Discussion

Proposition 3 shows that in order to compute the same function as $g_{in} \in G_{in}$, the number of units in the shallow network has to grow exponentially in $i$, i.e. in the network’s depth (while the deep network’s size grows linearly in $i$). The shallow network also needs to grow polynomially in the number of input variables $n$ (with a degree equal to $i$), while the deep network grows only linearly in $n$.
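For small $n$ and $i$, the statements of Corollary 2 and the counts used in Proposition 3 can be checked by expanding a network of family G explicitly. The sketch below assumes all summation weights are 1 (as in Figure 2) and stores a polynomial as a dictionary from exponent tuples to coefficients; the helper names (`poly_sum`, `poly_prod`, `expand_g`, `all_monomials`) are hypothetical and introduced only for this illustration.

```python
from collections import defaultdict
from itertools import combinations_with_replacement
from math import comb

def poly_sum(polys):
    """Add polynomials stored as {exponent tuple: coefficient} (weights = 1)."""
    out = defaultdict(int)
    for p in polys:
        for mono, c in p.items():
            out[mono] += c
    return dict(out)

def poly_prod(polys):
    """Multiply polynomials term by term, adding exponent tuples."""
    n = len(next(iter(polys[0])))
    out = {(0,) * n: 1}
    for p in polys:
        nxt = defaultdict(int)
        for m1, c1 in out.items():
            for m2, c2 in p.items():
                nxt[tuple(a + b for a, b in zip(m1, m2))] += c1 * c2
        out = dict(nxt)
    return out

def expand_g(n, i):
    """Expand the output of a G_{in} network with unit weights."""
    layer = [{tuple(int(t == j) for t in range(n)): 1} for j in range(n)]
    for k in range(1, 2 * i + 1):
        combine = poly_sum if k % 2 == 1 else poly_prod
        # Property 4: unit j takes every unit of the previous layer except j.
        layer = [combine([layer[m] for m in range(n) if m != j])
                 for j in range(n)]
    return poly_sum(layer)  # single output unit summing the last layer

def all_monomials(n, d):
    """All exponent tuples (alpha_1..alpha_n) with sum equal to d."""
    return {tuple(c.count(t) for t in range(n))
            for c in combinations_with_replacement(range(n), d)}

for n, i in [(2, 1), (3, 1), (3, 2), (4, 1)]:
    g = expand_g(n, i)
    d = (n - 1) ** i
    assert set(g) == all_monomials(n, d)       # Corollary 2
    assert len(g) == comb(n + d - 1, d) >= d   # count used in Proposition 3
    print(f"n={n} i={i}: deep units={2 * i * n + 1}, "
          f"products in expansion={len(g)}, lower bound (n-1)^i={d}")
```

The deep unit count `2 * i * n + 1` follows from property 3 of the definition ($2i$ layers of size $n$ plus one output unit). The exhaustive expansion is only practical for very small $n$ and $i$, since the number of products itself grows quickly.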
Put differently, some deep sum-product network with $n$ inputs and depth $O(i)$ can represent with $O(ni)$ units what would require $O((n-1)^i)$ units for a depth-2 network. Note that in the similar results found for family F, the depth-2 network computing the same function as a function in F had to be constrained to either have a specific combination of sum and hidden units (in Lemma 6) or to have non-negative weights (in Corollary 1). On the contrary, the result presented here for family G holds without requiring any of these assumptions.

5 Conclusion

We compared a deep sum-product network and a shallow sum-product network representing the same function, taken from two families of functions F and G. For both families, we have shown that the number of units in the shallow network has to grow exponentially, compared to a linear growth in the deep network, so as to represent the same functions. The deep version thus offers a much more compact representation of the same functions.

This work focuses on two specific families of functions: finding more general parameterization of functions leading to similar results would be an interesting topic for future research. Another open question is whether it is possible to represent such functions only approximately (e.g. up to an error bound $\epsilon$) with a much smaller shallow network. Results by Braverman [8] on boolean circuits suggest that similar results as those presented in this paper may still hold, but this topic has yet to be formally investigated in the context of sum-product networks. A related problem is also to look into functions defined only on discrete input variables: our proofs do not trivially extend to this situation because we cannot assume anymore that two polynomials yielding the same output values must have the same expansion coefficients (since the number of input combinations becomes finite).
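The deep versus shallow comparison for family F can be made concrete for small $i$ by building the network of Section 3.1 and expanding it. The sketch below is a hedged illustration, not the authors' code: it assumes all weights equal 1, represents each unit as a set of monomials (each monomial being a frozenset of variable indices, which suffices because every variable appears at most once per product and positive weights rule out cancellations), and uses the hypothetical helper name `expand_f`.

```python
import math

def expand_f(i):
    """Expand f in F with n = 4**i inputs; return (monomials, deep unit count)."""
    n = 4 ** i
    layer = [{frozenset([j])} for j in range(n)]  # input layer: x_j
    units = 0
    for k in range(1, 2 * i + 1):
        pairs = list(zip(layer[0::2], layer[1::2]))  # units 2j-1 and 2j
        if k % 2 == 1:
            # Product layer: multiply every monomial of one input by every
            # monomial of the other (variable sets are disjoint).
            layer = [{a | b for a in p for b in q} for p, q in pairs]
        else:
            # Sum layer: the monomials of both inputs are simply collected.
            layer = [p | q for p, q in pairs]
        units += len(layer)
    (output,) = layer  # the last layer has a single unit
    return output, units

for i in range(1, 4):
    n = 4 ** i
    monomials, deep_units = expand_f(i)
    shallow_bound = 2 ** (math.isqrt(n) - 1)
    assert len(monomials) == shallow_bound   # Proposition 1
    assert deep_units == n - 1               # unit count from Section 3.3
    half = set(range(n // 2))
    # Lemma 2: each product lies entirely in one half of the variables.
    assert all(m <= half or m.isdisjoint(half) for m in monomials)
    # Lemma 3: each product involves more than one variable.
    assert all(len(m) > 1 for m in monomials)
    print(f"i={i} n={n}: deep units={deep_units}, "
          f"shallow hidden units needed >= {shallow_bound}")
```

The loop stops at $i = 3$ because the explicit expansion itself grows as $2^{\sqrt{n}-1}$, which is exactly the blow-up a shallow network has to pay for.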
Acknowledgments

The authors would like to thank Razvan Pascanu and David Warde-Farley for their help in improving this manuscript, as well as the anonymous reviewers for their careful reviews. This work was partially funded by NSERC, CIFAR, and the Canada Research Chairs.

References

[1] Ajtai, M. (1983). $\Sigma^1_1$-formulae on finite structures. Annals of Pure and Applied Logic, 24(1), 1–48.
[2] Allender, E. (1996). Circuit complexity before the dawn of the new millennium. In 16th Annual Conference on Foundations of Software Technology and Theoretical Computer Science, pages 1–18. Lecture Notes in Computer Science 1180, Springer Verlag.
[3] Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), 1–127. Also published as a book. Now Publishers, 2009.
[4] Bengio, Y. and LeCun, Y. (2007). Scaling learning algorithms towards AI. In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines. MIT Press.
[5] Bengio, Y., Delalleau, O., and Le Roux, N. (2006). The curse of highly variable functions for local kernel machines. In NIPS’05, pages 107–114. MIT Press, Cambridge, MA.
[6] Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007). Greedy layer-wise training of deep networks. In NIPS 19, pages 153–160. MIT Press.
[7] Bengio, Y., Delalleau, O., and Simard, C. (2010). Decision trees do not generalize to new variations. Computational Intelligence, 26(4), 449–467.
[8] Braverman, M. (2011). Poly-logarithmic independence fools bounded-depth boolean circuits. Communications of the ACM, 54(4), 108–115.
[9] Collobert, R. and Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML 2008, pages 160–167.
[10] Dahl, G. E., Ranzato, M., Mohamed, A., and Hinton, G. E. (2010). Phone recognition with the mean-covariance restricted Boltzmann machine. In Advances in Neural Information Processing Systems (NIPS).
[11] Håstad, J. (1986). Almost optimal lower bounds for small depth circuits. In Proceedings of the 18th annual ACM Symposium on Theory of Computing, pages 6–20, Berkeley, California. ACM Press.
[12] Håstad, J. and Goldmann, M. (1991). On the power of small-depth threshold circuits. Computational Complexity, 1, 113–129.
[13] Hinton, G. E. and Salakhutdinov, R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504–507.
[14] Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.
[15] Kavukcuoglu, K., Sermanet, P., Boureau, Y.-L., Gregor, K., Mathieu, M., and LeCun, Y. (2010). Learning convolutional feature hierarchies for visual recognition. In NIPS’10.
[16] Larochelle, H., Erhan, D., Courville, A., Bergstra, J., and Bengio, Y. (2007). An empirical evaluation of deep architectures on problems with many factors of variation. In ICML’07, pages 473–480. ACM.
[17] Lee, H., Ekanadham, C., and Ng, A. (2008). Sparse deep belief net model for visual area V2. In NIPS’07, pages 873–880. MIT Press, Cambridge, MA.
[18] Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. (2009a). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML 2009. Montreal (Qc), Canada.
[19] Lee, H., Pham, P., Largman, Y., and Ng, A. (2009b). Unsupervised feature learning for audio classification using convolutional deep belief networks. In NIPS’09, pages 1096–1104.
[20] Levner, I. (2008). Data Driven Object Segmentation. Ph.D. thesis, Department of Computer Science, University of Alberta.
[21] Mnih, A. and Hinton, G. E. (2009). A scalable hierarchical distributed language model. In NIPS’08, pages 1081–1088.
[22] Mobahi, H., Collobert, R., and Weston, J. (2009). Deep learning from temporal coherence in video. In ICML’2009, pages 737–744.
[23] Orponen, P. (1994). Computational complexity of neural networks: a survey. Nordic Journal of Computing, 1(1), 94–110.
[24] Osindero, S. and Hinton, G. E. (2008). Modeling image patches with a directed hierarchy of Markov random field. In NIPS’07, pages 1121–1128, Cambridge, MA. MIT Press.
[25] Poon, H. and Domingos, P. (2011). Sum-product networks: A new deep architecture. In UAI’2011, Barcelona, Spain.
[26] Ranzato, M. and Szummer, M. (2008). Semi-supervised learning of compact document representations with deep networks. In ICML.
[27] Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. (2007). Efficient learning of sparse representations with an energy-based model. In NIPS’06, pages 1137–1144. MIT Press.
[28] Ranzato, M., Boureau, Y.-L., and LeCun, Y. (2008). Sparse feature learning for deep belief networks. In NIPS’07, pages 1185–1192, Cambridge, MA. MIT Press.
[29] Salakhutdinov, R. and Hinton, G. E. (2007). Semantic hashing. In Proceedings of the 2007 Workshop on Information Retrieval and applications of Graphical Models (SIGIR 2007), Amsterdam. Elsevier.
[30] Salakhutdinov, R., Mnih, A., and Hinton, G. E. (2007). Restricted Boltzmann machines for collaborative filtering. In ICML 2007, pages 791–798, New York, NY, USA.
[31] Serre, T., Kreiman, G., Kouh, M., Cadieu, C., Knoblich, U., and Poggio, T. (2007). A quantitative theory of immediate visual recognition. Progress in Brain Research, Computational Neuroscience: Theoretical Insights into Brain Function, 165, 33–56.
[32] Socher, R., Lin, C., Ng, A. Y., and Manning, C. (2011). Learning continuous phrase representations and syntactic parsing with recursive neural networks. In ICML’2011.
[33] Taylor, G. and Hinton, G. (2009). Factored conditional restricted Boltzmann machines for modeling motion style. In ICML 2009, pages 1025–1032.
[34] Taylor, G., Hinton, G. E., and Roweis, S. (2007). Modeling human motion using binary latent variables. In NIPS’06, pages 1345–1352. MIT Press, Cambridge, MA.
[35] Utgoff, P. E. and Stracuzzi, D. J. (2002). Many-layered learning. Neural Computation, 14, 2497–2539.
[36] Weston, J., Ratle, F., and Collobert, R. (2008). Deep learning via semi-supervised embedding. In ICML 2008, pages 1168–1175, New York, NY, USA.
[37] Wolpert, D. H. (1996). The lack of a priori distinction between learning algorithms. Neural Computation, 8(7), 1341–1390.
[38] Yao, A. (1985). Separating the polynomial-time hierarchy by oracles. In Proceedings of the 26th Annual IEEE Symposium on Foundations of Computer Science, pages 1–10.

5 0.50439668 167 nips-2011-Maximum Covariance Unfolding : Manifold Learning for Bimodal Data

Author: Vijay Mahadevan, Chi W. Wong, Jose C. Pereira, Tom Liu, Nuno Vasconcelos, Lawrence K. Saul

Abstract: We propose maximum covariance unfolding (MCU), a manifold learning algorithm for simultaneous dimensionality reduction of data from different input modalities. Given high dimensional inputs from two different but naturally aligned sources, MCU computes a common low dimensional embedding that maximizes the cross-modal (inter-source) correlations while preserving the local (intra-source) distances. In this paper, we explore two applications of MCU. First we use MCU to analyze EEG-fMRI data, where an important goal is to visualize the fMRI voxels that are most strongly correlated with changes in EEG traces. To perform this visualization, we augment MCU with an additional step for metric learning in the high dimensional voxel space. Second, we use MCU to perform cross-modal retrieval of matched image and text samples from Wikipedia. To manage large applications of MCU, we develop a fast implementation based on ideas from spectral graph theory. These ideas transform the original problem for MCU, one of semidefinite programming, into a simpler problem in semidefinite quadratic linear programming.

6 0.4999128 244 nips-2011-Selecting Receptive Fields in Deep Networks

7 0.47708592 156 nips-2011-Learning to Learn with Compound HD Models

8 0.45694801 6 nips-2011-A Global Structural EM Algorithm for a Model of Cancer Progression

9 0.43429658 150 nips-2011-Learning a Distance Metric from a Network

10 0.41895819 115 nips-2011-Hierarchical Topic Modeling for Analysis of Time-Evolving Personal Choices

11 0.41238123 181 nips-2011-Multiple Instance Learning on Structured Data

12 0.40866777 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition

13 0.40795195 287 nips-2011-The Manifold Tangent Classifier

14 0.40792617 7 nips-2011-A Machine Learning Approach to Predict Chemical Reactions

15 0.40581787 184 nips-2011-Neuronal Adaptation for Sampling-Based Probabilistic Inference in Perceptual Bistability

16 0.40226892 217 nips-2011-Practical Variational Inference for Neural Networks

17 0.39019513 234 nips-2011-Reconstructing Patterns of Information Diffusion from Incomplete Observations

18 0.38104942 242 nips-2011-See the Tree Through the Lines: The Shazoo Algorithm

19 0.37192404 151 nips-2011-Learning a Tree of Metrics with Disjoint Visual Features

20 0.37067431 232 nips-2011-Ranking annotators for crowdsourced labeling tasks

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.03), (4, 0.058), (20, 0.028), (26, 0.013), (31, 0.067), (33, 0.034), (43, 0.032), (45, 0.073), (57, 0.062), (65, 0.059), (70, 0.348), (74, 0.041), (83, 0.024), (99, 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.76532501 74 nips-2011-Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection

Author: Richard Socher, Eric H. Huang, Jeffrey Pennin, Christopher D. Manning, Andrew Y. Ng

2 0.73967487 212 nips-2011-Periodic Finite State Controllers for Efficient POMDP and DEC-POMDP Planning

Author: Joni K. Pajarinen, Jaakko Peltonen

Abstract: Applications such as robot control and wireless communication require planning under uncertainty. Partially observable Markov decision processes (POMDPs) plan policies for single agents under uncertainty and their decentralized versions (DEC-POMDPs) ﬁnd a policy for multiple agents. The policy in inﬁnite-horizon POMDP and DEC-POMDP problems has been represented as ﬁnite state controllers (FSCs). We introduce a novel class of periodic FSCs, composed of layers connected only to the previous and next layer. Our periodic FSC method ﬁnds a deterministic ﬁnite-horizon policy and converts it to an initial periodic inﬁnitehorizon policy. This policy is optimized by a new inﬁnite-horizon algorithm to yield deterministic periodic policies, and by a new expectation maximization algorithm to yield stochastic periodic policies. Our method yields better results than earlier planning methods and can compute larger solutions than with regular FSCs.

3 0.54872686 222 nips-2011-Prismatic Algorithm for Discrete D.C. Programming Problem

Author: Yoshinobu Kawahara, Takashi Washio

Abstract: In this paper, we propose the ﬁrst exact algorithm for minimizing the difference of two submodular functions (D.S.), i.e., the discrete version of the D.C. programming problem. The developed algorithm is a branch-and-bound-based algorithm which responds to the structure of this problem through the relationship between submodularity and convexity. The D.S. programming problem covers a broad range of applications in machine learning. In fact, this generalizes any set-function optimization. We empirically investigate the performance of our algorithm, and illustrate the difference between exact and approximate solutions respectively obtained by the proposed and existing algorithms in feature selection and discriminative structure learning.

4 0.37958229 127 nips-2011-Image Parsing with Stochastic Scene Grammar

Author: Yibiao Zhao, Song-chun Zhu

5 0.37806064 219 nips-2011-Predicting response time and error rates in visual search

Author: Bo Chen, Vidhya Navalpakkam, Pietro Perona

Abstract: A model of human visual search is proposed. It predicts both response time (RT) and error rates (ER) as a function of image parameters such as target contrast and clutter. The model is an ideal observer, in that it optimizes the Bayes ratio of target present vs target absent. The ratio is computed on the firing pattern of V1/V2 neurons, modeled by Poisson distributions. The optimal mechanism for integrating information over time is shown to be a ‘soft max’ of diffusions, computed over the visual field by ‘hypercolumns’ of neurons that share the same receptive field and have different response properties to image features. An approximation of the optimal Bayesian observer, based on integrating local decisions, rather than diffusions, is also derived; it is shown experimentally to produce very similar predictions to the optimal observer in common psychophysics conditions. A psychophysics experiment is proposed that may discriminate which mechanism is used in the human brain.

Figure 1: Visual search. (A) Clutter and camouflage make visual search difficult. (B, C) Psychologists and neuroscientists build synthetic displays to study visual search. In (B) the target ‘pops out’ (∆θ = 45°), while in (C) the target requires more time to be detected (∆θ = 10°) [1].

6 0.37567157 93 nips-2011-Extracting Speaker-Specific Information with a Regularized Siamese Deep Network

7 0.37535155 124 nips-2011-ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning

8 0.37430933 244 nips-2011-Selecting Receptive Fields in Deep Networks

9 0.37173125 156 nips-2011-Learning to Learn with Compound HD Models

10 0.37147817 242 nips-2011-See the Tree Through the Lines: The Shazoo Algorithm

11 0.37064859 180 nips-2011-Multiple Instance Filtering

12 0.37041041 189 nips-2011-Non-parametric Group Orthogonal Matching Pursuit for Sparse Learning with Multiple Kernels

13 0.36981121 35 nips-2011-An ideal observer model for identifying the reference frame of objects

14 0.36931461 113 nips-2011-Hierarchical Matching Pursuit for Image Classification: Architecture and Fast Algorithms

15 0.36898491 150 nips-2011-Learning a Distance Metric from a Network

16 0.36898208 168 nips-2011-Maximum Margin Multi-Instance Learning

17 0.36869603 43 nips-2011-Bayesian Partitioning of Large-Scale Distance Data

18 0.36840236 75 nips-2011-Dynamical segmentation of single trials from population neural data

19 0.3682414 276 nips-2011-Structured sparse coding via lateral inhibition

20 0.36780673 151 nips-2011-Learning a Tree of Metrics with Disjoint Visual Features