nips nips2011 nips2011-216 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Fahad S. Khan, Joost Weijer, Andrew D. Bagdanov, Maria Vanrell
Abstract: We describe a novel technique for feature combination in the bag-of-words model of image classification. Our approach builds discriminative compound words from primitive cues learned independently from training images. Our main observation is that modeling joint-cue distributions independently is more statistically robust for typical classification problems than attempting to empirically estimate the dependent, joint-cue distribution directly. We use information-theoretic vocabulary compression to find discriminative combinations of cues, and the resulting vocabulary of portmanteau words is compact, has the cue binding property, and supports individual weighting of cues in the final image representation. State-of-the-art results on both the Oxford Flower-102 and Caltech-UCSD Bird-200 datasets demonstrate the effectiveness of our technique compared to other, significantly more complex approaches to multi-cue image representation.
Reference: text
sentIndex sentText sentNum sentScore
1 Our approach builds discriminative compound words from primitive cues learned independently from training images. [sent-3, score-0.572]
2 We use information-theoretic vocabulary compression to find discriminative combinations of cues, and the resulting vocabulary of portmanteau words is compact, has the cue binding property, and supports individual weighting of cues in the final image representation. [sent-5, score-1.896]
3 These cues are then quantized into visual words and the final image representation is a histogram over these visual vocabularies. [sent-12, score-0.617]
4 In this paper we investigate visual vocabularies which are used to represent images whose local features are described by both shape and color. [sent-15, score-0.709]
5 To extend BOW to multiple cues, two properties are especially important: cue binding and cue weighting. [sent-16, score-0.752]
6 A visual vocabulary is said to have the binding property when two independent cues appearing at the same location in an image remain coupled in the final image representation. [sent-17, score-0.961]
7 For example, if every local patch in an image is independently described by a shape word and a color word, in the final image representation using compound words the binding property ensures that shape and color words coming from the same feature location are coupled in the final representation. [sent-18, score-1.666]
8 The term binding is borrowed from the neuroscience field where it is used to describe the way in which humans select and integrate the separate cues of objects in the correct combinations in order to accurately recognize them [17]. [sent-19, score-0.333]
9 A portmanteau is a combination of two or more words to form a neologism that communicates a concept better than any individual word (e.g., smog from smoke and fog). [sent-20, score-1.039]
10 We use the term to describe our vocabularies to emphasize the connotation of combining color and shape words into new, more meaningful representations. [sent-23, score-0.94]
11 The property of cue weighting implies that it is possible to adapt the relevance of each cue depending on the dataset. [sent-24, score-0.29]
12 The importance of cue weighting can be seen from the success of Multiple Kernel Learning (MKL) techniques where weights for each cue are automatically learned [3, 13, 21, 14, 1, 20]. [sent-25, score-0.686]
13 When each cue has its own visual vocabulary, the result is known as a late fusion image representation, in which an image is represented as one histogram over shape-words and another histogram over color-words. [sent-27, score-1.154]
14 Such a representation does not have the cue binding property, meaning that it is impossible to know exactly which color-shape events co-occurred at local features. [sent-28, score-0.48]
15 Another approach, called early fusion, constructs a single visual vocabulary of joint color-shape words. [sent-30, score-0.506]
16 Representations over early fusion vocabularies have the cue binding property, meaning that the spatial co-occurrence of shape and color events is preserved. [sent-31, score-1.447]
17 However, cue weighting in early fusion vocabularies is very cumbersome, since it must be performed before vocabulary construction, making cross-validation very expensive. [sent-32, score-1.377]
18 The method proposed in [10] combines cue binding and weighting. [sent-34, score-0.444]
19 However, their final image representation size is equal to the number of vocabulary words times the number of classes, and is therefore not feasible for the large datasets considered in this paper. [sent-35, score-0.589]
20 A straightforward, if combinatorially inconvenient, approach to ensuring the binding property is to create a new vocabulary that contains one word for each combination of original shape and color words. [sent-36, score-0.988]
21 Considering that each of the original shape and color vocabularies may contain thousands of words, the resulting joint vocabulary may contain millions. [sent-37, score-1.215]
22 Such large vocabularies are impractical as estimating joint color-shape statistics is often infeasible due to the difficulty of sampling from limited training data. [sent-38, score-0.487]
23 Because of this and other problems, this type of joint feature representation has not been further pursued as a way of ensuring that image representations have the binding property. [sent-40, score-0.372]
24 In recent years a number of vocabulary compression techniques have appeared that derive small, discriminative vocabularies from very large ones [16, 7, 5]. [sent-41, score-0.86]
25 Most of these techniques are based on information theoretic clustering algorithms that attempt to combine words that are equivalently discriminative for the set of object categories being considered. [sent-42, score-0.272]
26 Because these techniques are guided by the discriminative power of clusters of visual words, estimates of class-conditional visual word probabilities are essential. [sent-43, score-0.349]
27 These recent developments in vocabulary compression allow us to reconsider the direct, Cartesian product approach to building compound vocabularies. [sent-44, score-0.557]
28 These vocabulary compression techniques have been demonstrated on single-cue vocabularies with a few tens of thousands of words. [sent-45, score-0.81]
29 Starting from even moderately sized shape and color vocabularies results in a compound shape-color vocabulary an order of magnitude larger. [sent-46, score-1.321]
30 We show that for typical datasets a strong independence assumption about the joint color-shape distribution leads to more robust estimates of the class-conditional distributions needed for vocabulary compression. [sent-48, score-0.557]
31 In addition, our estimation technique allows flexible cue-specific weighting that cannot be easily performed with other cue combination techniques that maintain the binding property. [sent-49, score-0.588]
32 Portmanteau vocabularies: In this section we propose a new multi-cue vocabulary construction method that results in compact vocabularies which possess both the cue binding and the cue weighting properties described above. [sent-50, score-2.086]
33 Our approach is to build portmanteau vocabularies of discriminative, compound shape and color words chosen from independently learned color and shape lexicons. [sent-51, score-1.924]
34 The term portmanteau is used in natural language for words which are a blend of two other words and which combine their meanings. [sent-52, score-0.659]
35 A simple way to ensure the binding property is by considering a product vocabulary that contains a new word for every combination of shape and color terms. [sent-54, score-1.013]
36 Let S = {s1, ..., sM} and C = {c1, ..., cN} represent the visual shape and color vocabularies, respectively. [sent-61, score-0.471]
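To make the product construction concrete, the following minimal sketch (our illustration, not the authors' code; the toy index arrays are hypothetical) shows how each local feature's shape-word and color-word indices combine into a single compound-word index, which is what preserves cue binding:

```python
import numpy as np

# Hypothetical per-feature quantization results for one image:
# shape_idx[i] in {0..M-1} and color_idx[i] in {0..N-1} for feature i.
M, N = 5000, 100                      # vocabulary sizes used later in the paper
shape_idx = np.array([17, 4211, 17])  # toy shape-word assignments
color_idx = np.array([3, 98, 42])     # toy color-word assignments

# Each (shape, color) pair maps to one of M*N compound words, so cues
# observed at the same location stay coupled in the representation.
compound_idx = shape_idx * N + color_idx            # values in {0..M*N-1}
hist = np.bincount(compound_idx, minlength=M * N)   # naive joint histogram
print(hist.sum(), hist.size)                        # 3 features, 500,000 words
```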
37 Because of these drawbacks, compound product vocabularies have, to the best of our knowledge, not been pursued in the literature. [sent-75, score-0.635]
38 Compact Portmanteau Vocabularies: In recent years, several algorithms for feature clustering have been proposed which compress large vocabularies into small ones [16, 7, 5]. [sent-78, score-0.483]
39 In our algorithm, the loss in mutual information is measured between the original product vocabulary and the resulting clusters. [sent-82, score-0.383]
40 The algorithm joins words which have similar discriminative power over the set of classes in the image categorization problem. [sent-83, score-0.317]
41 More precisely, the drop in mutual information I between the vocabulary W and the class labels R when going from the original set of vocabulary words W to the clustered representation W^R = {W1, W2, ..., WJ} is I(W; R) − I(W^R; R) = Σ_j Σ_{wt ∈ Wj} p(wt) KL(p(R|wt) || p(R|Wj)). [sent-87, score-0.847]
42 The clusters show consistency over both color and shape. [sent-99, score-0.268]
43 The DITC algorithm then iteratively alternates between re-estimation of the cluster distributions and assignment of compound visual words to their closest cluster. [sent-103, score-0.375]
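The following sketch reconstructs this DITC-style alternation (the function, its random initialization, and the fixed iteration count are our simplifications of the algorithms in [16, 7, 5], not the authors' implementation):

```python
import numpy as np

def kl(p, q, eps=1e-12):
    # KL divergence between each row of p and a single distribution q.
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def ditc(p_w, p_r_given_w, J, n_iter=20, seed=0):
    # p_w: (V,) prior over the V compound words.
    # p_r_given_w: (V, L) class posteriors p(R|w) for each word.
    # Returns an index array mapping each word to one of J clusters.
    rng = np.random.default_rng(seed)
    assign = rng.integers(0, J, size=p_w.shape[0])
    for _ in range(n_iter):
        # 1) Update cluster distributions p(R|Wj) as p(w)-weighted means.
        p_r_given_W = np.zeros((J, p_r_given_w.shape[1]))
        for j in range(J):
            m = assign == j
            if m.any():
                p_r_given_W[j] = np.average(p_r_given_w[m], axis=0, weights=p_w[m])
        # 2) Reassign every word to the cluster with the smallest KL divergence,
        #    which minimizes the drop in mutual information stated above.
        d = np.stack([kl(p_r_given_w, p_r_given_W[j]) for j in range(J)], axis=1)
        assign = d.argmin(axis=1)
    return assign
```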
44 We call the compact vocabulary output by the DITC algorithm the portmanteau vocabulary, and its words, accordingly, portmanteau words. [sent-107, score-1.676]
45 The final image representation p(W^R) is a distribution over the portmanteau words. [sent-108, score-0.559]
46 Joint distribution estimation: In solving the high-dimensionality problem of the compound vocabularies, we have seemingly made the estimation problem even harder. [sent-110, score-0.625]
47 To solve this problem we propose to estimate the class-conditional distributions by assuming independence of color and shape, given the class: p(sm, cn|R) ∝ p(sm|R) p(cn|R). (2) [sent-113, score-0.412]
48 Note that we do not assume independence of the cues themselves, but rather the less restrictive independence of the cues given the class. [sent-114, score-0.534]
49 Instead of directly estimating the empirical joint distribution p(S, C|R), we reduce the number of parameters to estimate to (M + N ) × L, which in the vocabulary configurations discussed in this paper represents a reduction in complexity of two orders of magnitude. [sent-115, score-0.399]
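A minimal NumPy sketch of this estimate (toy sizes and array names are ours) forms each class-conditional joint as an outer product of the independently estimated cue distributions:

```python
import numpy as np

# Toy sizes; the paper's configuration is M = 5000 shape words, N = 100
# color words and, e.g., L = 102 classes on Flower-102, giving
# (M + N) x L = 520,200 parameters under the independence assumption versus
# M x N x L = 51,000,000 for the direct empirical joint -- roughly two
# orders of magnitude fewer.
L, M, N = 5, 50, 10
rng = np.random.default_rng(0)
p_s_given_r = rng.dirichlet(np.ones(M), size=L)   # p(S|R), shape (L, M)
p_c_given_r = rng.dirichlet(np.ones(N), size=L)   # p(C|R), shape (L, N)

# Class-conditional independence: p(sm, cn|R) proportional to p(sm|R) p(cn|R).
joint = p_s_given_r[:, :, None] * p_c_given_r[:, None, :]   # shape (L, M, N)
joint /= joint.sum(axis=(1, 2), keepdims=True)              # normalize per class
```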
50 We will show in Section 2.3 that estimating the joint distribution p(S, C|R) in this way allows us to introduce cue weighting. [sent-117, score-0.331]
51 DITC is used to obtain ten portmanteau clusters, whose means p(R|Wj) are plotted in different colors. [sent-133, score-0.441]
52 Note that none of the portmanteau means are especially discriminative for one particular class. [sent-136, score-0.49]
53 The plotted lines show the curves for a color cue vocabulary of 100 words and a shape cue vocabulary of 5,000 words, resulting in a product vocabulary of 500,000 words. [sent-142, score-2.143]
54 Increasing the number of training samples, or starting with smaller color and shape vocabularies and hence reducing the number of parameters to estimate, will improve direct empirical estimates of p(S, C). [sent-144, score-0.925]
55 However, figure 1 shows that for typical vocabulary settings on large datasets the independence assumption results in equivalently good or better estimates of the joint distribution. [sent-145, score-0.521]
56 Cue weighting: Constructing the compact portmanteau vocabularies based on the independence assumption significantly reduces the number of parameters to estimate. [sent-147, score-1.084]
57 Furthermore, as we will see in this section, it allows us to control the relative contribution of color and shape cues in the final representation. [sent-148, score-0.576]
58 We introduce a weighting parameter α ∈ [0, 1] in the estimate of p(C, S): pα(sm, cn|R) ∝ p(sm|R)^α p(cn|R)^(1−α), (3) where an α close to zero results in a larger influence of the color words, and an α close to one leads to a vocabulary which focuses predominantly on shape. [sent-149, score-0.746]
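A sketch of equation (3) (the function name is ours; it reuses the arrays from the previous sketch):

```python
def weighted_joint(p_s_given_r, p_c_given_r, alpha):
    # p_alpha(sm, cn|R) proportional to p(sm|R)**alpha * p(cn|R)**(1 - alpha):
    # alpha near 0 emphasizes color, alpha near 1 emphasizes shape.
    j = (p_s_given_r ** alpha)[:, :, None] * (p_c_given_r ** (1 - alpha))[:, None, :]
    return j / j.sum(axis=(1, 2), keepdims=True)

shape_heavy = weighted_joint(p_s_given_r, p_c_given_r, alpha=0.9)
color_heavy = weighted_joint(p_s_given_r, p_c_given_r, alpha=0.1)
```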
59 To illustrate the influence of α on the vocabulary construction, we show samples from portmanteau words obtained on the Oxford Flower-102 dataset (see figure 4) in figure 2. [sent-150, score-0.9]
60 The DITC algorithm is applied to reduce the product vocabulary of 500,000 compound words to 100 portmanteau words. [sent-151, score-1.065]
61 For low α the portmanteau words exhibit homogeneity of color but lack within-cluster shape consistency. [sent-158, score-0.534]
62 On the other hand for high α the words show strong shape homogeneity such as low and high frequency lines and blobs, while color is more uniformly distributed. [sent-159, score-0.534]
63 For α = 0.5 the clustering is more consistent in both color and shape. [sent-161, score-0.248]
64 We show p(R|wt) for a subset of 20 words in grey, and p(R|Wj) ∝ Σ_{wt ∈ Wj} p(wt) p(R|wt) for the ten portmanteau words in color. [sent-166, score-0.717]
65 Image representation with portmanteau vocabularies: We summarize our approach to constructing portmanteau vocabularies for image representation. [sent-174, score-1.832]
66 Image representation by a portmanteau vocabulary built from color and shape cues follows these steps: [sent-176, score-1.37]
67 1. Independent color and shape vocabularies are constructed by standard K-means clustering over color and shape descriptors extracted from training images. [sent-177, score-1.301]
68 2. Empirical class-conditional word distributions p(S|R) and p(C|R) are computed from the training set, and the joint cue distribution p(S, C|R) is estimated assuming conditional independence as in equation (4). [sent-179, score-0.522]
69 3. The portmanteau vocabulary is computed with the DITC algorithm. [sent-181, score-0.758]
70 The output of the DITC is a list of indexes which, for each member of the compound vocabulary, maps it to one of the J portmanteau words. [sent-182, score-0.939]
71 4. Using the index list output by DITC, the original image features are revisited, and the index corresponding to the compound shape-color word at each feature is used to represent each image as a histogram over the portmanteau vocabulary. [sent-184, score-0.903]
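Steps 3 and 4 reduce to a table lookup followed by a histogram; a sketch (the helper name is ours) assuming the quantized word indices and the DITC index list are available:

```python
import numpy as np

def portmanteau_histogram(shape_idx, color_idx, ditc_index, J, N):
    # Map each feature's compound shape-color word through the DITC index
    # list, then histogram over the J portmanteau words.
    compound = shape_idx * N + color_idx          # compound word id per feature
    port = ditc_index[compound]                   # portmanteau word id per feature
    h = np.bincount(port, minlength=J).astype(float)
    return h / max(h.sum(), 1.0)                  # normalized image representation
```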
72 For color we use the color name descriptor, which is computed by converting sRGB values to color names according to [19], after which each patch is represented as a histogram over the eleven color names. [sent-188, score-0.932]
73 The shape and color vocabularies are constructed using the standard K-means algorithm. [sent-189, score-0.822]
74 In all our experiments we use a shape vocabulary of 5000 words and a color vocabulary of 100 words. [sent-190, score-1.185]
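Step 1 in code, as a hedged sketch: the descriptor matrices are hypothetical stand-ins, and MiniBatchKMeans is our speed-motivated substitute for the plain K-means named in the text:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Hypothetical pooled training descriptors: SIFT patches (128-d) and
# color-name histograms (11-d, following [19]).
sift_descs = np.random.rand(20000, 128)
cn_descs = np.random.rand(20000, 11)

# Paper sizes: 5000 shape words and 100 color words.
shape_vocab = MiniBatchKMeans(n_clusters=5000, batch_size=4096,
                              n_init=3, random_state=0).fit(sift_descs)
color_vocab = MiniBatchKMeans(n_clusters=100, batch_size=4096,
                              n_init=3, random_state=0).fit(cn_descs)

shape_idx = shape_vocab.predict(sift_descs)   # one shape word per patch
color_idx = color_vocab.predict(cn_descs)     # one color word per patch
```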
75 We performed several experiments to validate our approach to building multi-cue vocabularies by comparing with other methods which are based on exactly the same initial SIFT and CN descriptors: • Shape and Color only: a single vocabulary of 5000 SIFT words and one of 100 CN words. [sent-194, score-0.878]
76 The relative weight of shape and color is optimized by cross-validation. [sent-196, score-0.397]
77 Note that cross-validation on cue weighting parameters for early fusion must be done over the entire BOW pipeline, from vocabulary construction to classification. [sent-197, score-0.931]
78 In all cases the color-shape visual vocabularies are compressed to 500 visual words and spatial pyramids are constructed for the final image representation as in [11]. [sent-203, score-0.85]
79 Since our cue weighting is done after the initial vocabulary and histogram construction, cross-validation is significantly faster than for early fusion. [sent-214, score-0.823]
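The reason the search is cheap: only the DITC compression (and the histogram remapping) depends on α, so everything upstream is computed once. A sketch of the loop (the structure is ours), reusing the helpers from the sketches above:

```python
import numpy as np

clusterings = {}
for alpha in (0.1, 0.3, 0.5, 0.7, 0.9):
    joint = weighted_joint(p_s_given_r, p_c_given_r, alpha)   # (L, M, N)
    flat = joint.reshape(joint.shape[0], -1)                  # (L, M*N) compound words
    p_w = flat.mean(axis=0)                                   # word prior (uniform class prior assumed)
    p_r_given_w = (flat / flat.shape[0] / np.maximum(p_w, 1e-12)).T  # Bayes: (M*N, L)
    clusterings[alpha] = ditc(p_w, p_r_given_w, J=10)         # J = 500 in the paper
# Each clustering is then scored by remapping the cached histograms with
# portmanteau_histogram() and cross-validating the classifier; the feature
# pipeline itself is never rerun.
```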
80 The bottom three rows of table 1(a) give the results of our approach to image representation with portmanteau vocabularies in a variety of configurations. [sent-215, score-0.984]
81 However, weighting the two visual cues using the α parameter described in equation (3) in the independent estimation of p(s, c|class) improves the results significantly. [sent-217, score-0.377]
82 This dataset contains many bird species that closely resemble each other in terms of color and shape cues, making the recognition task extremely difficult. [sent-222, score-0.45]
83 Interestingly, on this dataset color outperforms shape alone and early fusion yields only a small improvement over color. [sent-224, score-0.602]
84 Results based on portmanteau vocabularies outperform early fusion, and estimation based on the independence assumption provides better results than direct empirical estimation. [sent-225, score-1.065]
85 These results are further improved by the introduction of cue weighting with a final score of 22. [sent-226, score-0.396]
86 Comparison with the state-of-the-art: Recently, an extensive performance evaluation of color descriptors was presented by van de Sande et al. [sent-230, score-0.288]
87 We construct a visual vocabulary of 5000 visual words for both OpponentSIFT and C-SIFT and apply the DITC algorithm to compress each to 500 visual words. [sent-233, score-0.709]
88 As shown in table 1(b), our approach provides significantly better results than both OpponentSIFT and C-SIFT, possibly due to the fact that neither supports cue weighting. [sent-234, score-0.29]
89 These approaches combine multiple cues and multiple kernels and apply per-class cue weighting. [sent-271, score-0.505]
90 The technique described in [3] is based on geometric blur, grayscale SIFT, color SIFT and full image color histograms, while the approach in [13] also employs HSV, SIFT int, SIFT bd, and HOG descriptors in the MKL framework of [21]. [sent-273, score-0.585]
91 In contrast to these approaches, we learn a single global, class-independent cue weighting parameter to balance color and shape cues. [sent-279, score-0.793]
92 8% using segmented images and a combination of four different visual cues in a multiple kernel learning framework. [sent-283, score-0.35]
93 Our performance, however, is obtained on unsegmented images using only color and shape cues. [sent-284, score-0.434]
94 Conclusions: In this paper we propose a new method to construct multi-cue, visual portmanteau vocabularies that combine color and shape cues. [sent-286, score-1.319]
95 When constructing a multi-cue vocabulary two properties are especially desirable: cue binding and cue weighting. [sent-287, score-1.069]
96 Starting from the multi-cue product vocabulary, we compress this representation to form discriminative compound terms, or portmanteaux, which are used in the final image representation. [sent-288, score-0.851]
97 Experiments demonstrate that assuming independence of the visual cues given the categories provides a more robust estimate of the joint-cue distributions than direct empirical estimation. [sent-289, score-0.479]
98 Assuming independence also has the advantage of both reducing the complexity of the representation by two orders of magnitude and allowing flexible cue weighting. [sent-290, score-0.414]
99 Our final image representation is compact, maintains the cue binding property, admits cue weighting and yields state-of-the-art performance on the image categorization problem. [sent-291, score-1.108]
100 Results demonstrate the superiority of our approach over existing approaches that combine color and shape cues. [sent-293, score-0.397]
wordName wordTfidf (topN-words)
[('vocabularies', 0.425), ('portmanteau', 0.423), ('vocabulary', 0.335), ('cue', 0.29), ('ditc', 0.272), ('color', 0.224), ('cues', 0.179), ('shape', 0.173), ('compound', 0.164), ('binding', 0.154), ('fusion', 0.125), ('words', 0.118), ('weighting', 0.106), ('image', 0.1), ('independence', 0.088), ('mkl', 0.087), ('cn', 0.081), ('visual', 0.074), ('discriminative', 0.067), ('word', 0.063), ('sift', 0.062), ('joost', 0.06), ('opponentsift', 0.06), ('wt', 0.058), ('sm', 0.058), ('bow', 0.057), ('early', 0.056), ('weijer', 0.053), ('nal', 0.046), ('fahad', 0.045), ('shahbaz', 0.045), ('clusters', 0.044), ('compact', 0.042), ('joint', 0.041), ('wj', 0.04), ('hundred', 0.04), ('descriptors', 0.037), ('images', 0.037), ('classi', 0.036), ('histogram', 0.036), ('representation', 0.036), ('compress', 0.034), ('oxford', 0.034), ('object', 0.034), ('khan', 0.033), ('compression', 0.033), ('categorization', 0.032), ('direct', 0.032), ('portmanteaux', 0.03), ('datasets', 0.03), ('bird', 0.029), ('zisserman', 0.029), ('categories', 0.029), ('eccv', 0.029), ('van', 0.027), ('manik', 0.027), ('maria', 0.027), ('sande', 0.027), ('estimates', 0.027), ('gure', 0.026), ('product', 0.025), ('nilsback', 0.024), ('dataset', 0.024), ('clustering', 0.024), ('empirical', 0.023), ('primitive', 0.023), ('cartesian', 0.023), ('pyramids', 0.023), ('mutual', 0.023), ('kernel', 0.022), ('late', 0.022), ('divisive', 0.022), ('training', 0.021), ('spanish', 0.021), ('pursued', 0.021), ('francis', 0.021), ('barcelona', 0.021), ('cumbersome', 0.021), ('representations', 0.02), ('ower', 0.02), ('scene', 0.02), ('combination', 0.02), ('property', 0.019), ('dhillon', 0.019), ('construction', 0.019), ('distributions', 0.019), ('homogeneity', 0.019), ('lazebnik', 0.019), ('varma', 0.019), ('multiple', 0.018), ('plotted', 0.018), ('grey', 0.018), ('estimation', 0.018), ('descriptor', 0.018), ('iccv', 0.017), ('cvpr', 0.017), ('schmid', 0.017), ('robust', 0.017), ('thousands', 0.017), ('list', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999911 216 nips-2011-Portmanteau Vocabularies for Multi-Cue Image Representation
Author: Fahad S. Khan, Joost Weijer, Andrew D. Bagdanov, Maria Vanrell
Abstract: We describe a novel technique for feature combination in the bag-of-words model of image classification. Our approach builds discriminative compound words from primitive cues learned independently from training images. Our main observation is that modeling joint-cue distributions independently is more statistically robust for typical classification problems than attempting to empirically estimate the dependent, joint-cue distribution directly. We use information-theoretic vocabulary compression to find discriminative combinations of cues, and the resulting vocabulary of portmanteau words is compact, has the cue binding property, and supports individual weighting of cues in the final image representation. State-of-the-art results on both the Oxford Flower-102 and Caltech-UCSD Bird-200 datasets demonstrate the effectiveness of our technique compared to other, significantly more complex approaches to multi-cue image representation.
2 0.082790777 141 nips-2011-Large-Scale Category Structure Aware Image Categorization
Author: Bin Zhao, Fei Li, Eric P. Xing
Abstract: Most previous research on image categorization has focused on medium-scale data sets, while large-scale image categorization with millions of images from thousands of categories remains a challenge. With the emergence of structured large-scale dataset such as the ImageNet, rich information about the conceptual relationships between images, such as a tree hierarchy among various image categories, become available. As human cognition of complex visual world benefits from underlying semantic relationships between object classes, we believe a machine learning system can and should leverage such information as well for better performance. In this paper, we employ such semantic relatedness among image categories for large-scale image categorization. Specifically, a category hierarchy is utilized to properly define loss function and select common set of features for related categories. An efficient optimization method based on proximal approximation and accelerated parallel gradient method is introduced. Experimental results on a subset of ImageNet containing 1.2 million images from 1000 categories demonstrate the effectiveness and promise of our proposed approach. 1
3 0.079593189 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition
Author: Alessandro Bergamo, Lorenzo Torresani, Andrew W. Fitzgibbon
Abstract: We introduce PiCoDes: a very compact image descriptor which nevertheless allows high performance on object category recognition. In particular, we address novel-category recognition: the task of defining indexing structures and image representations which enable a large collection of images to be searched for an object category that was not known when the index was built. Instead, the training images defining the category are supplied at query time. We explicitly learn descriptors of a given length (from as small as 16 bytes per image) which have good object-recognition performance. In contrast to previous work in the domain of object recognition, we do not choose an arbitrary intermediate representation, but explicitly learn short codes. In contrast to previous approaches to learn compact codes, we optimize explicitly for (an upper bound on) classification performance. Optimization directly for binary features is difficult and nonconvex, but we present an alternation scheme and convex upper bound which demonstrate excellent performance in practice. PiCoDes of 256 bytes match the accuracy of the current best known classifier for the Caltech256 benchmark, but they decrease the database storage size by a factor of 100 and speed-up the training and testing of novel classes by orders of magnitude.
4 0.077682093 176 nips-2011-Multi-View Learning of Word Embeddings via CCA
Author: Paramveer Dhillon, Dean P. Foster, Lyle H. Ungar
Abstract: Recently, there has been substantial interest in using large amounts of unlabeled data to learn word representations which can then be used as features in supervised classifiers for NLP tasks. However, most current approaches are slow to train, do not model the context of the word, and lack theoretical grounding. In this paper, we present a new learning method, Low Rank Multi-View Learning (LR-MVL) which uses a fast spectral method to estimate low dimensional context-specific word representations from unlabeled data. These representation features can then be used with any supervised learner. LR-MVL is extremely fast, gives guaranteed convergence to a global optimum, is theoretically elegant, and achieves state-of-the-art performance on named entity recognition (NER) and chunking problems.
5 0.075026385 126 nips-2011-Im2Text: Describing Images Using 1 Million Captioned Photographs
Author: Vicente Ordonez, Girish Kulkarni, Tamara L. Berg
Abstract: We develop and demonstrate automatic image description methods using a large captioned photo collection. One contribution is our technique for the automatic collection of this new dataset – performing a huge number of Flickr queries and then filtering the noisy results down to 1 million images with associated visually relevant captions. Such a collection allows us to approach the extremely challenging problem of description generation using relatively simple non-parametric methods and produces surprisingly effective results. We also develop methods incorporating many state-of-the-art, but fairly noisy, estimates of image content to produce even more pleasing results. Finally we introduce a new objective performance measure for image captioning. 1
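A minimal sketch of the non-parametric caption-transfer idea described in this abstract: represent every captioned image with a global descriptor, retrieve the nearest neighbors of a query, and reuse their captions. The descriptor, distance, and k are illustrative assumptions.

```python
import numpy as np

def transfer_captions(query_feat, db_feats, db_captions, k=3):
    """Return the captions of the k database images whose global
    descriptors are closest to the query's (plain squared-L2 retrieval).
    db_feats: (n, d) descriptors; db_captions: list of n strings."""
    d2 = ((db_feats - query_feat) ** 2).sum(axis=1)
    nearest = np.argsort(d2)[:k]
    return [db_captions[i] for i in nearest]
```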
6 0.072519332 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding
7 0.068018332 112 nips-2011-Heavy-tailed Distances for Gradient Based Image Descriptors
8 0.067304544 247 nips-2011-Semantic Labeling of 3D Point Clouds for Indoor Scenes
9 0.065495513 154 nips-2011-Learning person-object interactions for action recognition in still images
10 0.064674206 165 nips-2011-Matrix Completion for Multi-label Image Classification
11 0.059785958 91 nips-2011-Exploiting spatial overlap to efficiently compute appearance distances between image windows
12 0.057928868 235 nips-2011-Recovering Intrinsic Images with a Global Sparsity Prior on Reflectance
13 0.056699377 294 nips-2011-Unifying Framework for Fast Learning Rate of Non-Sparse Multiple Kernel Learning
14 0.056027889 138 nips-2011-Joint 3D Estimation of Objects and Scene Layout
15 0.055189129 189 nips-2011-Non-parametric Group Orthogonal Matching Pursuit for Sparse Learning with Multiple Kernels
16 0.054264754 151 nips-2011-Learning a Tree of Metrics with Disjoint Visual Features
17 0.052354209 304 nips-2011-Why The Brain Separates Face Recognition From Object Recognition
18 0.051413283 113 nips-2011-Hierarchical Matching Pursuit for Image Classification: Architecture and Fast Algorithms
19 0.051396649 180 nips-2011-Multiple Instance Filtering
20 0.050212506 157 nips-2011-Learning to Search Efficiently in High Dimensions
topicId topicWeight
[(0, 0.134), (1, 0.077), (2, -0.062), (3, 0.089), (4, 0.046), (5, 0.022), (6, 0.056), (7, 0.007), (8, -0.016), (9, 0.049), (10, 0.011), (11, 0.067), (12, 0.068), (13, 0.039), (14, -0.017), (15, 0.028), (16, -0.011), (17, 0.016), (18, -0.01), (19, 0.047), (20, -0.009), (21, 0.002), (22, -0.016), (23, -0.02), (24, 0.034), (25, 0.031), (26, 0.02), (27, -0.024), (28, 0.094), (29, 0.057), (30, -0.106), (31, -0.02), (32, -0.001), (33, -0.008), (34, -0.024), (35, 0.004), (36, 0.053), (37, -0.048), (38, -0.017), (39, -0.046), (40, -0.02), (41, 0.056), (42, -0.055), (43, 0.059), (44, 0.007), (45, 0.0), (46, 0.045), (47, -0.002), (48, 0.026), (49, -0.049)]
simIndex simValue paperId paperTitle
same-paper 1 0.93230343 216 nips-2011-Portmanteau Vocabularies for Multi-Cue Image Representation
Author: Fahad S. Khan, Joost Weijer, Andrew D. Bagdanov, Maria Vanrell
Abstract: We describe a novel technique for feature combination in the bag-of-words model of image classification. Our approach builds discriminative compound words from primitive cues learned independently from training images. Our main observation is that modeling joint-cue distributions independently is more statistically robust for typical classification problems than attempting to empirically estimate the dependent, joint-cue distribution directly. We use information-theoretic vocabulary compression to find discriminative combinations of cues and the resulting vocabulary of portmanteau1 words is compact, has the cue binding property, and supports individual weighting of cues in the final image representation. State-of-the-art results on both the Oxford Flower-102 and Caltech-UCSD Bird-200 datasets demonstrate the effectiveness of our technique compared to other, significantly more complex approaches to multi-cue image representation. 1
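As an illustration of the two ingredients named in this abstract — cue binding via compound (shape, color) words, and information-theoretic compression of the resulting product vocabulary — the sketch below uses a generic agglomerative merge criterion. It is a hedged stand-in for the compression step, not the authors' exact algorithm.

```python
import numpy as np

def compound_word_ids(shape_ids, color_ids, n_color_words):
    """Cue binding: index each local feature by its joint (shape, color)
    pair, giving one compound word per feature location."""
    return shape_ids * n_color_words + color_ids

def compress_vocabulary(p_c_given_w, p_w, n_keep):
    """Greedy agglomerative compression of the product vocabulary:
    repeatedly merge the pair of compound words whose merge loses the
    least information about the class label (weighted Jensen-Shannon
    cost). p_c_given_w: (n_words, n_classes); p_w: (n_words,) priors."""
    def kl(p, q):
        return float(np.sum(p * np.log((p + 1e-12) / (q + 1e-12))))

    clusters = [(p_c_given_w[i], p_w[i], [i]) for i in range(len(p_w))]
    while len(clusters) > n_keep:
        best, best_cost = None, np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pa, wa, _ = clusters[a]
                pb, wb, _ = clusters[b]
                pm = (wa * pa + wb * pb) / (wa + wb)
                cost = wa * kl(pa, pm) + wb * kl(pb, pm)
                if cost < best_cost:
                    best, best_cost = (a, b), cost
        a, b = best
        pa, wa, ia = clusters[a]
        pb, wb, ib = clusters[b]
        merged = ((wa * pa + wb * pb) / (wa + wb), wa + wb, ia + ib)
        clusters = [c for j, c in enumerate(clusters) if j not in (a, b)]
        clusters.append(merged)
    return [members for _, _, members in clusters]  # portmanteau words
```

Each returned member list is one portmanteau word: a cluster of compound (shape, color) indices that behave similarly with respect to the class labels.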
2 0.68677622 126 nips-2011-Im2Text: Describing Images Using 1 Million Captioned Photographs
Author: Vicente Ordonez, Girish Kulkarni, Tamara L. Berg
Abstract: We develop and demonstrate automatic image description methods using a large captioned photo collection. One contribution is our technique for the automatic collection of this new dataset – performing a huge number of Flickr queries and then filtering the noisy results down to 1 million images with associated visually relevant captions. Such a collection allows us to approach the extremely challenging problem of description generation using relatively simple non-parametric methods and produces surprisingly effective results. We also develop methods incorporating many state-of-the-art, but fairly noisy, estimates of image content to produce even more pleasing results. Finally we introduce a new objective performance measure for image captioning. 1
3 0.68337363 235 nips-2011-Recovering Intrinsic Images with a Global Sparsity Prior on Reflectance
Author: Carsten Rother, Martin Kiefel, Lumin Zhang, Bernhard Schölkopf, Peter V. Gehler
Abstract: We address the challenging task of decoupling material properties from lighting properties given a single image. In the last two decades virtually all works have concentrated on exploiting edge information to address this problem. We take a different route by introducing a new prior on reflectance, that models reflectance values as being drawn from a sparse set of basis colors. This results in a Random Field model with global, latent variables (basis colors) and pixel-accurate output reflectance values. We show that without edge information high-quality results can be achieved, that are on par with methods exploiting this source of information. Finally, we are able to improve on state-of-the-art results by integrating edge information into our model. We believe that our new approach is an excellent starting point for future developments in this field. 1
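The "sparse set of basis colors" prior in this abstract can be pictured with a toy quantization step: cluster per-pixel reflectance estimates onto a handful of basis colors. The paper infers the basis colors jointly with pixel reflectances in a random field; the plain k-means below is only an illustrative assumption.

```python
import numpy as np

def assign_basis_colors(reflectance, n_basis=8, n_iter=20, seed=0):
    """Toy illustration: quantize an (H, W, 3) per-pixel reflectance map
    onto n_basis colors with plain k-means. Returns the basis colors and
    a per-pixel label map."""
    rng = np.random.default_rng(seed)
    pixels = reflectance.reshape(-1, 3)
    centers = pixels[rng.choice(len(pixels), n_basis, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distances from every pixel to every basis color.
        d = ((pixels[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for k in range(n_basis):
            if (labels == k).any():
                centers[k] = pixels[labels == k].mean(0)
    return centers, labels.reshape(reflectance.shape[:2])
```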
4 0.67891979 293 nips-2011-Understanding the Intrinsic Memorability of Images
Author: Phillip Isola, Devi Parikh, Antonio Torralba, Aude Oliva
Abstract: Artists, advertisers, and photographers are routinely presented with the task of creating an image that a viewer will remember. While it may seem like image memorability is purely subjective, recent work shows that it is not an inexplicable phenomenon: variation in memorability of images is consistent across subjects, suggesting that some images are intrinsically more memorable than others, independent of a subject's contexts and biases. In this paper, we used the publicly available memorability dataset of Isola et al. [13], and augmented the object and scene annotations with interpretable spatial, content, and aesthetic image properties. We used a feature-selection scheme with desirable explaining-away properties to determine a compact set of attributes that characterizes the memorability of any individual image. We find that images of enclosed spaces containing people with visible faces are memorable, while images of vistas and peaceful scenes are not. Contrary to popular belief, unusual or aesthetically pleasing scenes do not tend to be highly memorable. This work represents one of the first attempts at understanding intrinsic image memorability, and opens a new domain of investigation at the interface between human cognition and computer vision. 1
5 0.63470566 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition
Author: Alessandro Bergamo, Lorenzo Torresani, Andrew W. Fitzgibbon
Abstract: We introduce PiCoDes: a very compact image descriptor which nevertheless allows high performance on object category recognition. In particular, we address novel-category recognition: the task of defining indexing structures and image representations which enable a large collection of images to be searched for an object category that was not known when the index was built. Instead, the training images defining the category are supplied at query time. We explicitly learn descriptors of a given length (from as small as 16 bytes per image) which have good object-recognition performance. In contrast to previous work in the domain of object recognition, we do not choose an arbitrary intermediate representation, but explicitly learn short codes. In contrast to previous approaches to learn compact codes, we optimize explicitly for (an upper bound on) classification performance. Optimization directly for binary features is difficult and nonconvex, but we present an alternation scheme and convex upper bound which demonstrate excellent performance in practice. PiCoDes of 256 bytes match the accuracy of the current best known classifier for the Caltech256 benchmark, but they decrease the database storage size by a factor of 100 and speed up the training and testing of novel classes by orders of magnitude.
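The form of the descriptor described above — a short binary code obtained by thresholding linear projections of an image feature — can be sketched as follows. The projections here are random placeholders; the paper learns them by optimizing a convex upper bound on classification performance.

```python
import numpy as np

def binary_code(x, W, b):
    """One bit per projection: the sign of a linear function of the feature."""
    return (W @ x + b > 0).astype(np.uint8)

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 512))   # 128 projections of a 512-d feature
b = rng.standard_normal(128)
x = rng.standard_normal(512)          # stand-in for an image feature vector
packed = np.packbits(binary_code(x, W, b))   # 128 bits -> a 16-byte descriptor
```

Packing the bits makes the "16 bytes per image" figure from the abstract concrete: a 128-bit code stores as 16 bytes.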
6 0.59895188 91 nips-2011-Exploiting spatial overlap to efficiently compute appearance distances between image windows
7 0.59549266 141 nips-2011-Large-Scale Category Structure Aware Image Categorization
8 0.5698576 112 nips-2011-Heavy-tailed Distances for Gradient Based Image Descriptors
9 0.54703581 165 nips-2011-Matrix Completion for Multi-label Image Classification
10 0.54189104 138 nips-2011-Joint 3D Estimation of Objects and Scene Layout
11 0.5335685 154 nips-2011-Learning person-object interactions for action recognition in still images
12 0.5201444 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding
13 0.51369369 168 nips-2011-Maximum Margin Multi-Instance Learning
14 0.50116903 247 nips-2011-Semantic Labeling of 3D Point Clouds for Indoor Scenes
15 0.49895588 166 nips-2011-Maximal Cliques that Satisfy Hard Constraints with Application to Deformable Object Model Learning
16 0.48249814 290 nips-2011-Transfer Learning by Borrowing Examples for Multiclass Object Detection
17 0.47606799 113 nips-2011-Hierarchical Matching Pursuit for Image Classification: Architecture and Fast Algorithms
18 0.47452503 266 nips-2011-Spatial distance dependent Chinese restaurant processes for image segmentation
19 0.47319719 105 nips-2011-Generalized Lasso based Approximation of Sparse Coding for Visual Recognition
20 0.46286082 223 nips-2011-Probabilistic Joint Image Segmentation and Labeling
topicId topicWeight
[(0, 0.012), (4, 0.035), (20, 0.051), (26, 0.012), (31, 0.052), (33, 0.096), (43, 0.042), (45, 0.111), (57, 0.064), (62, 0.282), (65, 0.028), (74, 0.037), (83, 0.025), (84, 0.02), (99, 0.037)]
simIndex simValue paperId paperTitle
same-paper 1 0.76697993 216 nips-2011-Portmanteau Vocabularies for Multi-Cue Image Representation
Author: Fahad S. Khan, Joost Weijer, Andrew D. Bagdanov, Maria Vanrell
Abstract: We describe a novel technique for feature combination in the bag-of-words model of image classification. Our approach builds discriminative compound words from primitive cues learned independently from training images. Our main observation is that modeling joint-cue distributions independently is more statistically robust for typical classification problems than attempting to empirically estimate the dependent, joint-cue distribution directly. We use information-theoretic vocabulary compression to find discriminative combinations of cues and the resulting vocabulary of portmanteau1 words is compact, has the cue binding property, and supports individual weighting of cues in the final image representation. State-of-the-art results on both the Oxford Flower-102 and Caltech-UCSD Bird-200 datasets demonstrate the effectiveness of our technique compared to other, significantly more complex approaches to multi-cue image representation. 1
2 0.6861527 161 nips-2011-Linearized Alternating Direction Method with Adaptive Penalty for Low-Rank Representation
Author: Zhouchen Lin, Risheng Liu, Zhixun Su
Abstract: Many machine learning and signal processing problems can be formulated as linearly constrained convex programs, which can be efficiently solved by the alternating direction method (ADM). However, usually the subproblems in ADM are easily solvable only when the linear mappings in the constraints are identities. To address this issue, we propose a linearized ADM (LADM) method by linearizing the quadratic penalty term and adding a proximal term when solving the subproblems. For fast convergence, we also allow the penalty to change adaptively according to a novel update rule. We prove the global convergence of LADM with adaptive penalty (LADMAP). As an example, we apply LADMAP to solve low-rank representation (LRR), which is an important subspace clustering technique yet suffers from high computation cost. By combining LADMAP with a skinny SVD representation technique, we are able to reduce the complexity O(n³) of the original ADM-based method to O(rn²), where r and n are the rank and size of the representation matrix, respectively, hence making LRR possible for large-scale applications. Numerical experiments verify that for LRR our LADMAP-based methods are much faster than state-of-the-art algorithms. 1
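A hedged sketch of the linearized-ADM-with-adaptive-penalty recipe on a toy problem, min ||x||_1 s.t. Ax = b, rather than LRR itself: linearizing the quadratic penalty turns each x-subproblem into a single soft-thresholding, and the penalty beta grows over iterations. The unconditional growth rule below simplifies the paper's progress-dependent update.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ladmap_l1(A, b, beta=1e-2, beta_max=1e4, rho=1.5, n_iter=300):
    """Linearized ADM with (simplified) adaptive penalty for
    min ||x||_1  s.t.  Ax = b."""
    m, n = A.shape
    x, lam = np.zeros(n), np.zeros(m)
    eta = 1.02 * np.linalg.norm(A, 2) ** 2   # requires eta > ||A||_2^2
    for _ in range(n_iter):
        grad = A.T @ (lam + beta * (A @ x - b))   # gradient of the penalty at x
        x = soft_threshold(x - grad / (beta * eta), 1.0 / (beta * eta))
        lam = lam + beta * (A @ x - b)            # multiplier (dual) ascent
        beta = min(beta_max, rho * beta)          # grow the penalty
    return x
```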
3 0.62802029 208 nips-2011-Optimistic Optimization of a Deterministic Function without the Knowledge of its Smoothness
Author: Rémi Munos
Abstract: We consider a global optimization problem of a deterministic function f in a semi-metric space, given a finite budget of n evaluations. The function f is assumed to be locally smooth (around one of its global maxima) with respect to a semi-metric ℓ. We describe two algorithms based on optimistic exploration that use a hierarchical partitioning of the space at all scales. A first contribution is an algorithm, DOO, that requires the knowledge of ℓ. We report a finite-sample performance bound in terms of a measure of the quantity of near-optimal states. We then define a second algorithm, SOO, which does not require the knowledge of the semi-metric ℓ under which f is smooth, and whose performance is almost as good as that of an optimally-fitted DOO.
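A one-dimensional sketch of the SOO idea: grow a partition tree and, in each sweep, expand the best cell at depth h only if its value beats the best cells at all shallower depths, so no knowledge of ℓ is needed. The ternary split and the h_max schedule are common illustrative choices, not the paper's exact constants.

```python
import math

def soo_maximize(f, lo=0.0, hi=1.0, n_evals=200):
    """Simultaneous optimistic optimization sketch on [lo, hi].
    leaves[h] holds (center value, cell lo, cell hi) tuples at depth h."""
    mid = (lo + hi) / 2.0
    leaves = {0: [(f(mid), lo, hi)]}
    n_used, best_val, best_x = 1, leaves[0][0][0], mid
    while n_used < n_evals:
        h_max = int(math.sqrt(n_used)) + 1     # depth budget grows with evals
        v_best = -math.inf
        for h in sorted(leaves):               # sweep depths shallow to deep
            if h > h_max or not leaves[h]:
                continue
            i = max(range(len(leaves[h])), key=lambda j: leaves[h][j][0])
            v, a, b = leaves[h][i]
            if v <= v_best:                    # dominated by a shallower cell
                continue
            v_best = v
            leaves[h].pop(i)                   # expand: split into three cells
            w = (b - a) / 3.0
            for idx, (ca, cb) in enumerate([(a, a + w), (a + w, a + 2 * w),
                                            (a + 2 * w, b)]):
                if idx == 1:
                    val = v                    # middle child keeps parent's center
                else:
                    val = f((ca + cb) / 2.0)
                    n_used += 1
                leaves.setdefault(h + 1, []).append((val, ca, cb))
                if val > best_val:
                    best_val, best_x = val, (ca + cb) / 2.0
            if n_used >= n_evals:
                break
    return best_x, best_val

# e.g. soo_maximize(lambda x: -(x - 0.7) ** 2) homes in on x = 0.7
```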
4 0.58566028 106 nips-2011-Generalizing from Several Related Classification Tasks to a New Unlabeled Sample
Author: Gilles Blanchard, Gyemin Lee, Clayton Scott
Abstract: We consider the problem of assigning class labels to an unlabeled test data set, given several labeled training data sets drawn from similar distributions. This problem arises in several applications where data distributions fluctuate because of biological, technical, or other sources of variation. We develop a distribution-free, kernel-based approach to the problem. This approach involves identifying an appropriate reproducing kernel Hilbert space and optimizing a regularized empirical risk over the space. We present generalization error analysis, describe universal kernels, and establish universal consistency of the proposed methodology. Experimental results on flow cytometry data are presented. 1
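One standard way to build a kernel across tasks drawn from different distributions, in the spirit of the abstract above, is to compare datasets through kernel mean embeddings. The RBF base kernel and the product form of the final kernel are illustrative assumptions.

```python
import numpy as np

def rbf(X, Y, gamma=1.0):
    """RBF kernel matrix between the rows of X and Y."""
    d2 = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * d2)

def mean_embedding_kernel(X1, X2, gamma=1.0):
    """<mu_P, mu_Q> in the RKHS: the mean of k(x, x') over all pairs,
    i.e. a similarity between two training sets viewed as samples
    from two distributions."""
    return rbf(X1, X2, gamma).mean()

def joint_kernel(X1, x1, X2, x2, gamma=1.0):
    """Hypothetical product kernel on (distribution, point) pairs:
    K((P, x), (Q, x')) = <mu_P, mu_Q> * k(x, x')."""
    return (mean_embedding_kernel(X1, X2, gamma)
            * rbf(x1[None, :], x2[None, :], gamma)[0, 0])
```

A regularized kernel method (e.g. an SVM) over this joint kernel can then generalize from several labeled training sets to a new, unlabeled one.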
5 0.53790176 141 nips-2011-Large-Scale Category Structure Aware Image Categorization
Author: Bin Zhao, Fei Li, Eric P. Xing
Abstract: Most previous research on image categorization has focused on medium-scale data sets, while large-scale image categorization with millions of images from thousands of categories remains a challenge. With the emergence of structured large-scale datasets such as ImageNet, rich information about the conceptual relationships between images, such as a tree hierarchy among various image categories, becomes available. As human cognition of the complex visual world benefits from underlying semantic relationships between object classes, we believe a machine learning system can and should leverage such information as well for better performance. In this paper, we employ such semantic relatedness among image categories for large-scale image categorization. Specifically, a category hierarchy is utilized to properly define the loss function and select a common set of features for related categories. An efficient optimization method based on proximal approximation and an accelerated parallel gradient method is introduced. Experimental results on a subset of ImageNet containing 1.2 million images from 1000 categories demonstrate the effectiveness and promise of our proposed approach. 1
6 0.52956349 266 nips-2011-Spatial distance dependent Chinese restaurant processes for image segmentation
7 0.52809221 165 nips-2011-Matrix Completion for Multi-label Image Classification
8 0.52677846 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding
9 0.52290803 223 nips-2011-Probabilistic Joint Image Segmentation and Labeling
10 0.52257252 103 nips-2011-Generalization Bounds and Consistency for Latent Structural Probit and Ramp Loss
11 0.51807928 227 nips-2011-Pylon Model for Semantic Segmentation
12 0.51798731 188 nips-2011-Non-conjugate Variational Message Passing for Multinomial and Binary Regression
13 0.51610816 168 nips-2011-Maximum Margin Multi-Instance Learning
14 0.51201719 154 nips-2011-Learning person-object interactions for action recognition in still images
15 0.51161391 157 nips-2011-Learning to Search Efficiently in High Dimensions
16 0.51055878 151 nips-2011-Learning a Tree of Metrics with Disjoint Visual Features
17 0.50795436 126 nips-2011-Im2Text: Describing Images Using 1 Million Captioned Photographs
18 0.50562716 99 nips-2011-From Stochastic Nonlinear Integrate-and-Fire to Generalized Linear Models
19 0.50548851 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition
20 0.50348645 112 nips-2011-Heavy-tailed Distances for Gradient Based Image Descriptors