nips nips2013 nips2013-65 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Hristo S. Paskov, Robert West, John C. Mitchell, Trevor Hastie
Abstract: This paper addresses the problem of unsupervised feature learning for text data. Our method is grounded in the principle of minimum description length and uses a dictionary-based compression scheme to extract a succinct feature set. Specifically, our method finds a set of word k-grams that minimizes the cost of reconstructing the text losslessly. We formulate document compression as a binary optimization task and show how to solve it approximately via a sequence of reweighted linear programs that are efficient to solve and parallelizable. As our method is unsupervised, features may be extracted once and subsequently used in a variety of tasks. We demonstrate the performance of these features over a range of scenarios including unsupervised exploratory analysis and supervised text categorization. Our compressed feature space is two orders of magnitude smaller than the full k-gram space and matches the text categorization accuracy achieved in the full feature space. This dimensionality reduction not only results in faster training times, but it can also help elucidate structure in unsupervised learning tasks and reduce the amount of training data necessary for supervised learning. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract. This paper addresses the problem of unsupervised feature learning for text data. [sent-11, score-0.331]
2 Our method is grounded in the principle of minimum description length and uses a dictionary-based compression scheme to extract a succinct feature set. [sent-12, score-0.63]
3 Specifically, our method finds a set of word k-grams that minimizes the cost of reconstructing the text losslessly. [sent-13, score-0.253]
4 We formulate document compression as a binary optimization task and show how to solve it approximately via a sequence of reweighted linear programs that are efficient to solve and parallelizable. [sent-14, score-0.6]
5 We demonstrate the performance of these features over a range of scenarios including unsupervised exploratory analysis and supervised text categorization. [sent-16, score-0.425]
6 Our compressed feature space is two orders of magnitude smaller than the full k-gram space and matches the text categorization accuracy achieved in the full feature space. [sent-17, score-0.875]
7 This dimensionality reduction not only results in faster training times, but it can also help elucidate structure in unsupervised learning tasks and reduce the amount of training data necessary for supervised learning. [sent-18, score-0.343]
8 1 Introduction. Machine learning algorithms rely critically on the features used to represent data; the feature set provides the primary interface through which an algorithm can reason about the data at hand. [sent-19, score-0.269]
9 Various heuristics have been proposed for feature selection, one class of which works by evaluating each feature separately with respect to its discriminative power. [sent-23, score-0.327]
10 Unsupervised feature selection methods [19, 18, 29, 13] are particularly attractive. [sent-26, score-0.189]
11 In this work we present a novel unsupervised method for feature selection for text data based on ideas from data compression and formulated as an optimization problem. [sent-34, score-0.768]
12 The basic intuition is that substrings appearing frequently in a corpus represent a recurring theme in some of the documents, and hence pertain to class representation. [sent-36, score-0.216]
13 Our solution invokes the principle of minimum description length (MDL) [23]: First, we compress the corpus using a dictionary-based lossless compression method. [sent-41, score-0.674]
14 Then, the substrings that are used to reconstruct each document serve as the feature set. [sent-42, score-0.457]
15 We formulate the compression task as a numerical optimization problem. [sent-43, score-0.381]
16 Our method reduces the feature set size by two orders of magnitude without incurring a loss of performance on several text categorization tasks. [sent-47, score-0.45]
17 Moreover, it shortens training times and requires significantly less labeled training data on some text categorization tasks. [sent-48, score-0.346]
18 2 Compression and Machine Learning. Our work draws on a deep connection between data compression and machine learning, exemplified early on by the celebrated MDL principle [23]. [sent-49, score-0.381]
19 More recently, researchers have experimented with off-the-shelf compression algorithms as machine learning subroutines. [sent-50, score-0.381]
20 The authors of [7] deplore that “it is hard to see how efficient feature selection could be incorporated” into the compression algorithm. [sent-56, score-0.57]
21 But Sculley and Brodley [24] show that many compression-based distance measures can be interpreted as operating in an implicit high-dimensional feature space, spanned by the dictionary elements found during compression. [sent-57, score-0.386]
22 This addresses the above-cited concern of [7] about the impossibility of feature selection for compression-based methods. [sent-59, score-0.189]
23 Instead of using an off-the-shelf compression algorithm as a black-box kernel operating in an implicit high-dimensional feature space, we develop an optimization-based compression scheme whose explicit job it is to perform feature selection. [sent-60, score-1.11]
24 Imagine we want to extract features from an entire corpus. [sent-63, score-0.172]
25 We would proceed by concatenating all documents in the corpus into a single large document D, which we would compress using a Lempel–Ziv algorithm. [sent-64, score-0.484]
26 The problem is that the extracted substrings are dependent on the order in which we concatenate the documents to form the input D. [sent-65, score-0.271]
27 For the sake of concreteness, consider LZ77 [28], a prominent member of the Lempel–Ziv family (but the argument applies equally to most standard compression algorithms). [sent-66, score-0.381]
28 Three different solutions are shown for representing the 8-word document D = manamana in terms of dictionary and pointers. [sent-72, score-0.362]
29 Center: balance of dictionary and pointer costs (λ = 1). [sent-78, score-0.615]
30 LZ77 scans the input for the longest prefix of the remaining string that matches previously seen text; it then outputs a pointer to that previous instance—we interpret this substring as a feature—and continues with the remaining input string (if no prefix matches, the single next character is output). [sent-80, score-0.509]
31 This approach produces different feature sets depending on the order in which documents are concatenated. [sent-81, score-0.271]
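To make the order dependence concrete, the following minimal sketch (not the authors' code) runs a simplified word-level, greedy longest-match pass in the spirit of LZ77 over the same two toy documents concatenated in both orders. Window limits and the actual encoding of real LZ77 are omitted; the toy documents are assumptions for illustration.

```python
# Simplified LZ77-style pass (illustration only): emit the longest substring of
# the remaining input that already occurs in the processed prefix, otherwise a
# single token. Used here only to show that the extracted substrings depend on
# the document concatenation order.

def lz77_substrings(tokens):
    """Return the list of substrings (as tuples) a greedy longest-match pass emits."""
    feats, pos = [], 0
    while pos < len(tokens):
        best = 1  # fall back to a single token if no earlier match exists
        for length in range(len(tokens) - pos, 1, -1):
            candidate = tokens[pos:pos + length]
            prefix = tokens[:pos]
            if any(prefix[i:i + length] == candidate
                   for i in range(len(prefix) - length + 1)):
                best = length
                break
        feats.append(tuple(tokens[pos:pos + best]))
        pos += best
    return feats

doc_a = "sat on the mat".split()
doc_g = "the cat sat on the mat the cat".split()

# Same corpus, two concatenation orders -> different multi-word features.
print(set(lz77_substrings(doc_a + doc_g)))
print(set(lz77_substrings(doc_g + doc_a)))
```

The two print statements yield different multi-word substrings, which is exactly the instability that the order-invariant criterion below avoids.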
32 However, our formulation is not affected by the concatenation order of corpus documents and does not suffer from LZ77’s instability issues. [sent-85, score-0.285]
33 3 Compressive Feature Learning. The MDL principle implies that a good feature representation for a document D = x_1 x_2 … x_n is one that allows D to be described succinctly. [sent-86, score-0.28]
34 Our dictionary-based compression scheme accomplishes this by representing D as a dictionary—a subset of D’s substrings—and a sequence of pointers indicating where copies of each dictionary element should be placed in order to fully reconstruct the document. [sent-90, score-0.862]
35 The compressed representation is chosen so as to minimize the cost of storing each dictionary element in plaintext as well as all necessary pointers. [sent-91, score-0.577]
36 This scheme achieves a shorter description length whenever it can reuse dictionary elements at different locations in D. [sent-92, score-0.295]
37 1, which shows three ways of representing a document D in terms of a dictionary and pointers. [sent-94, score-0.362]
38 These representations are obtained by using the same pointer storage cost λ for each pointer and varying λ. [sent-95, score-0.912]
39 The two extreme solutions focus solely on minimizing either the dictionary cost (λ = 0) or the pointer cost (λ = 8), while the middle solution (λ = 1) trades off between the two. [sent-96, score-0.785]
40 We are particularly interested in this tradeoff: when all pointers have the same cost, the dictionary and pointer costs pull the solution in opposite directions. [sent-97, score-0.823]
41 Varying λ allows us to ‘interpolate’ between the two extremes of minimum dictionary cost and minimum pointer cost. [sent-98, score-0.685]
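As a concrete illustration of this tradeoff (a worked sketch, not taken from the paper's code), the snippet below scores three plausible lossless representations of the 8-word document D = m a n a m a n a under the costs used in the text: storing a dictionary string costs its length in words, and every pointer costs λ. The three candidate representations are assumptions chosen to mirror the figure's description.

```python
# Description length = plaintext dictionary storage + lambda * number of pointers.

def representation_cost(dictionary, num_pointers, lam):
    """Total cost of one lossless representation of D."""
    return sum(len(s) for s in dictionary) + lam * num_pointers

# Three candidate lossless representations of D = m a n a m a n a.
candidates = {
    "unigrams {m, a, n}, 8 pointers": ({("m",), ("a",), ("n",)}, 8),
    "{m a n a}, 2 pointers":          ({("m", "a", "n", "a")}, 2),
    "whole document, 1 pointer":      ({("m", "a", "n", "a", "m", "a", "n", "a")}, 1),
}

for lam in (0, 1, 8):
    best = min(candidates.items(),
               key=lambda kv: representation_cost(kv[1][0], kv[1][1], lam))
    print(f"lambda = {lam}: best = {best[0]}")
```

At λ = 0 the minimal dictionary {m, a, n} wins, at λ = 1 the single string m a n a with two pointers wins, and at λ = 8 storing the whole document with one pointer wins, matching the three solutions described above.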
42 To formalize our compression criterion, let S = { x_i … x_{i+j} } denote the set of substrings of D and let p_1, …, p_m enumerate the pointers p_i = (s, l), each of which places a copy of a substring s ∈ S at location l in D. [sent-100, score-0.381]
43 For each s ∈ S, define J(s) ⊆ {1, …, m} to be the set of indices of all pointers which share the same string s. [sent-115, score-0.225]
44 Given a binary vector w ∈ {0, 1}^m, w reconstructs word x_j if for some w_i = 1 the corresponding pointer p_i = (s, l) satisfies l ≤ j < l + |s|. [sent-116, score-0.581]
45 This notation uses w_i to indicate whether pointer p_i should be used to reconstruct (part of) D by pasting a copy of string s into location l. [sent-117, score-0.529]
46 Compressing D can be cast as a binary linear minimization problem over w; this bit vector tells us which pointers to use in the compressed representation of D and it implicitly defines the dictionary (a subset of S). [sent-119, score-0.723]
47 Next, we assume the pointer storage cost of setting w_i = 1 is given by d_i ≥ 0 and that the cost of storing any s ∈ S is c(s). [sent-122, score-0.62]
48 Note that s must be stored in the dictionary if ‖w_{J(s)}‖_∞ = 1, [sent-123, score-0.215]
49 i.e., if some pointer using s is used in the compression of D. [sent-125, score-0.781]
50 Putting everything together, our lossless compression criterion is: min_w  w^T d + Σ_{s∈S} c(s) ‖w_{J(s)}‖_∞  subject to  Xw ≥ 1,  w ∈ {0, 1}^m.  (1) [sent-126, score-0.498]
51 Finally, multiple documents can be compressed jointly by concatenating them in any order into a large document and disallowing any pointers that span document boundaries. [sent-127, score-0.956]
52 Since this objective is invariant to the document concatenation order, it does not suffer from the same problems as LZ77 (cf. Section 2). [sent-128, score-0.201]
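The following sketch sets up the linear-programming relaxation of criterion (1) on a toy document, replacing each term ‖w_{J(s)}‖_∞ with an auxiliary variable t_s ≥ w_i for i ∈ J(s) and relaxing w to [0, 1]. It is an illustration under assumed costs (uniform pointer cost λ, dictionary cost c(s) = word length), not the authors' solver, and the reweighting and ADMM machinery discussed next are omitted.

```python
import numpy as np
from scipy.optimize import linprog

doc = "m a n a m a n a".split()
k, lam = 4, 1.0                          # max substring length, pointer cost (assumed)
n = len(doc)

# Enumerate pointers p_i = (s, l): substring s of length <= k at location l.
pointers = [(tuple(doc[l:l + j]), l) for j in range(1, k + 1)
            for l in range(n - j + 1)]
m = len(pointers)
strings = sorted({s for s, _ in pointers})           # candidate dictionary S
J = {s: [i for i, (si, _) in enumerate(pointers) if si == s] for s in strings}

# Coverage matrix X: X[j, i] = 1 if pointer i reconstructs word j.
X = np.zeros((n, m))
for i, (s, l) in enumerate(pointers):
    X[l:l + len(s), i] = 1.0

# Variables z = [w (m), t (|S|)]; objective = lam * sum(w) + sum_s |s| * t_s.
cost = np.concatenate([np.full(m, lam),
                       np.array([len(s) for s in strings], float)])
# Constraints: -X w <= -1 (coverage) and w_i - t_s <= 0 for every i in J(s).
A_cov = np.hstack([-X, np.zeros((n, len(strings)))])
rows = []
for si, s in enumerate(strings):
    for i in J[s]:
        row = np.zeros(m + len(strings))
        row[i], row[m + si] = 1.0, -1.0
        rows.append(row)
A_ub = np.vstack([A_cov, np.array(rows)])
b_ub = np.concatenate([-np.ones(n), np.zeros(len(rows))])

res = linprog(cost, A_ub=A_ub, b_ub=b_ub,
              bounds=[(0, 1)] * (m + len(strings)), method="highs")
w = res.x[:m]
used = [pointers[i] for i in range(m) if w[i] > 0.5]  # crude rounding for display
print("objective:", round(res.fun, 2))
print("pointers used:", used)
```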
53 At a high level, reweighting can be motivated by noting that (2) recovers the correct binary solution if the smoothing parameter is sufficiently small and we use as weights a nearly binary solution to (1). [sent-140, score-0.203]
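The exact reweighted problem (2) is not reproduced in this excerpt, so the loop below only sketches the general reweighted-LP idea on a small synthetic covering problem: each round re-solves the relaxation with weights d_i / (w_i + ε) computed from the previous solution, which drives the relaxed solution toward a binary one. The problem sizes, cost distribution, and number of rounds are arbitrary choices, not the paper's update rule.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n_items, n_sets = 12, 20
X = (rng.random((n_items, n_sets)) < 0.3).astype(float)  # toy coverage matrix
X[:, 0] = 1.0                                            # guarantee feasibility
d = rng.uniform(1.0, 3.0, n_sets)                        # costs

eps, w = 1e-3, np.ones(n_sets)
for _ in range(10):
    weights = d / (w + eps)          # small w_i -> large weight -> pushed to 0
    res = linprog(weights, A_ub=-X, b_ub=-np.ones(n_items),
                  bounds=[(0, 1)] * n_sets, method="highs")
    w = res.x

print("near-binary solution:", np.round(w, 3))
print("covers all items:", bool((X @ (w > 0.5) >= 1).all()))
print("cost of rounded solution:", float(d @ (w > 0.5)))
```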
54 However, if all potential dictionary elements are no longer than k words in length, we can use problem structure to achieve a run time of O(k^2 n) per step of ADMM, i.e., linear in the input length for fixed k. [sent-145, score-0.245]
55 If N documents that are n_1, …, n_N words long are compressed jointly and no k-gram spans two documents, then H is block-diagonal with block i an n_i × n_i (k − 1)-banded matrix. [sent-179, score-0.322]
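The matrix H from the ADMM subproblem is not spelled out in this excerpt, so the snippet below only illustrates the complexity claim: a symmetric positive-definite (k − 1)-banded system of size n can be factored and solved with a banded Cholesky routine in O(k^2 n) time. The sizes and the particular banded matrix are placeholders.

```python
import numpy as np
from scipy.linalg import solveh_banded

n, k = 200_000, 5
bw = k - 1                               # bandwidth of each diagonal block

# Upper banded storage: last row = main diagonal, rows above = superdiagonals.
ab = np.zeros((bw + 1, n))
ab[-1, :] = 2.0 * bw + 1.0               # diagonally dominant => SPD
for r in range(bw):                      # stored row r holds the (bw - r)-th superdiagonal
    ab[r, bw - r:] = -1.0

rhs = np.random.default_rng(0).standard_normal(n)
x = solveh_banded(ab, rhs)               # banded Cholesky solve, O(k^2 n)
print(x[:5])
```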
56 Since each newsgroup discusses a different topic, some more closely related than others, we investigate our compressed features’ ability to elucidate class structure in supervised and unsupervised learning scenarios. [sent-183, score-0.466]
57 5 Feature Extraction and Training. We compute a bag-of-k-grams representation from a compressed document by counting the number of pointers that use each substring in the compressed version of the document. [sent-186, score-0.971]
58 This method retrieves the canonical bag-of-k-grams representation when all pointers are used, i.e., when every w_i = 1. [sent-187, score-0.178]
59 Our compression criterion therefore leads to a less redundant representation. [sent-190, score-0.416]
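A minimal sketch of this counting step (the pointer lists here are illustrative, not actual output of the method): the feature vector of a document is the multiset of substrings used by the pointers kept in its compressed representation, and using every pointer recovers the ordinary bag of words.

```python
from collections import Counter

def bag_of_kgrams(used_pointers):
    """used_pointers: iterable of (substring, location) pairs kept by the
    compression; returns a substring -> count mapping."""
    return Counter(s for s, _ in used_pointers)

# Compressed representation of D = m a n a m a n a: two "m a n a" pointers.
compressed = [(("m", "a", "n", "a"), 0), (("m", "a", "n", "a"), 4)]
# All unigram pointers used: the canonical bag of words.
all_pointers = [(("m",), 0), (("a",), 1), (("n",), 2), (("a",), 3),
                (("m",), 4), (("a",), 5), (("n",), 6), (("a",), 7)]

print(bag_of_kgrams(compressed))
print(bag_of_kgrams(all_pointers))
```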
60 Note that we extract features for a document corpus by compressing all of its documents jointly and then splitting into testing and training sets. [sent-191, score-0.733]
61 Each substring’s dictionary cost was its word length, and the pointer cost was set to a uniform value λ with 0 ≤ λ ≤ 5. [sent-196, score-0.841]
62 Since this regularizer (the Elastic Net) is a mix of L1 and L2 penalties, it is useful for feature selection but can also be used as a simple L2 ridge penalty. [sent-203, score-0.189]
63 Before training, we normalize each document by its L1 norm and then normalize features by their standard deviation. [sent-204, score-0.283]
64 We use this scheme so as to prevent overly long documents from dominating the feature normalization. [sent-205, score-0.362]
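The snippet below strings these steps together in scikit-learn as an illustration, not as the authors' implementation: each document's count vector is L1-normalized, features are scaled by their training-set standard deviation, and an Elastic-Net-regularized logistic classifier is fit. The regularization strength, l1_ratio, and the synthetic count matrices are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def preprocess(counts_train, counts_test):
    # L1-normalize each document (row) so long documents do not dominate.
    def l1_norm(M):
        norms = np.abs(M).sum(axis=1, keepdims=True)
        return M / np.maximum(norms, 1e-12)
    Xtr, Xte = l1_norm(counts_train), l1_norm(counts_test)
    # Scale each feature (column) by its training-set standard deviation.
    std = Xtr.std(axis=0)
    std[std == 0] = 1.0
    return Xtr / std, Xte / std

# Toy data standing in for bag-of-k-gram count matrices.
rng = np.random.default_rng(0)
counts_train = rng.poisson(1.0, size=(200, 50)).astype(float)
counts_test = rng.poisson(1.0, size=(50, 50)).astype(float)
y_train = rng.integers(0, 2, size=200)

Xtr, Xte = preprocess(counts_train, counts_test)
clf = LogisticRegression(penalty="elasticnet", solver="saga",
                         l1_ratio=0.5, C=1.0, max_iter=5000)
clf.fit(Xtr, y_train)
print("nonzero coefficients:", int(np.count_nonzero(clf.coef_)))
```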
65 LZ77 Comparison. Our first experiment demonstrates LZ77’s sensitivity to document ordering on a simple binary classification task of predicting which of two newsgroups (A or G) a document is from. [sent-207, score-0.332]
66 Features were computed by concatenating documents in different orders: [sent-210, score-0.192]
67 (1) by class, i.e., all documents in A before those in G, or G before A; (2) randomly; (3) by alternating the class every other document. [sent-212, score-0.178]
68 The corresponding figure shows the testing error compared to features computed from our criterion. [sent-214, score-0.191]
69 As predicted in Section 2, document ordering has a marked impact on performance, with the by-class and random orders performing significantly worse than the alternating ordering. [sent-217, score-0.289]
70 Moreover, order invariance and the ability to tune the pointer cost let our criterion select a better set of 5-grams. [sent-218, score-0.505]
71 The four leftmost results are on features from running LZ77 on documents ordered by class (AG, GA), randomly (Rand), or by alternating classes (Alt); the rightmost is on our compressed features. [sent-226, score-0.606]
72 PCA. Next, we investigate our features in a typical exploratory analysis scenario: a researcher looking for interesting structure by plotting all pairs of the top 10 principal components of the data. [sent-227, score-0.172]
73 Figure 3 plots the pair of principal components that best exemplifies class structure using (1) compressed features and (2) all 5-grams. [sent-238, score-0.464]
74 For the sake of fairness, the components were picked by training a logistic regression on every pair of the top 10 principal components and selecting the pair with the lowest training error. [sent-239, score-0.166]
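A sketch of this component-pair selection on synthetic data (for large sparse k-gram matrices one would typically substitute a truncated SVD for dense PCA): fit a logistic regression on every pair of the top 10 components and keep the pair with the lowest training error. The data and planted structure below are assumptions.

```python
import itertools
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 40))
y = rng.integers(0, 2, size=300)
X[y == 1, 3] += 2.0                      # plant some class structure

Z = PCA(n_components=10).fit_transform(X)

best_pair, best_err = None, np.inf
for i, j in itertools.combinations(range(10), 2):
    clf = LogisticRegression(max_iter=1000).fit(Z[:, [i, j]], y)
    err = 1.0 - clf.score(Z[:, [i, j]], y)
    if err < best_err:
        best_pair, best_err = (i, j), err
print("best pair of components:", best_pair, "training error:", round(best_err, 3))
```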
75 In both the binary and multiclass scenarios, PCA is inundated by millions of features when using all 5-grams and cannot display good class structure. [sent-240, score-0.212]
76 In contrast, compression reduces the feature set to tens of thousands (by two orders of magnitude) and clearly shows class structure. [sent-241, score-0.609]
77 We also did not L1-normalize documents in the binary task because it was found to be counterproductive on the training set. [sent-271, score-0.241]
78 Our classification performance is state of the art in both tasks, with the compressed and all-5-gram features tied in performance. [sent-272, score-0.428]
79 Since both datasets feature copious amounts of labeled data, we expect the 5-gram features to do well because of the power of the Elastic-Net regularizer. [sent-273, score-0.269]
80 What is remarkable is that the compression retains useful features without using any label information. [sent-274, score-0.517]
81 There are tens of millions of 5-grams, but compression reduces them to hundreds of thousands (by two orders of magnitude). [sent-275, score-0.476]
82 Cross-validation takes 1 hour with compressed features and 8–16 hours for all 5-grams on our reference computer depending on the sparsity of the resulting classifier. [sent-277, score-0.428]
83 Figure 4 plots testing error as the amount of training data varies, comparing compressed features to full 5-grams. [sent-286, score-0.548]
84 We explore the latter with and without feature selection enabled. [sent-308, score-0.189]
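The learning-curve comparison can be sketched as below, with synthetic stand-ins for the compressed and full 5-gram feature matrices: train on increasing fractions of the training set and record test error for each feature set. Feature dimensions, labels, and classifier settings are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_train, n_test = 1000, 400
X_small = rng.standard_normal((n_train + n_test, 30))      # stand-in: compressed features
X_full = np.hstack([X_small,
                    rng.standard_normal((n_train + n_test, 300))])  # stand-in: all 5-grams
y = (X_small[:, 0] + 0.5 * rng.standard_normal(n_train + n_test) > 0).astype(int)

def test_error(X, frac):
    n = int(frac * n_train)
    clf = LogisticRegression(max_iter=2000).fit(X[:n], y[:n])
    return 1.0 - clf.score(X[n_train:], y[n_train:])

for frac in (0.1, 0.25, 0.5, 1.0):
    print(f"{frac:>4}: compressed {test_error(X_small, frac):.3f}  "
          f"full {test_error(X_full, frac):.3f}")
```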
85 For the A–G task, the compressed features require substantially less data than the full 5-grams to come close to their best testing error. [sent-315, score-0.483]
86 The B–H task is harder and all three classifiers benefit from more training data, although the gap between compressed features and all 5-grams is widest when less than half of the training data is available. [sent-316, score-0.558]
87 In all cases, the compressed features outperform the full 5-grams, indicating that the latter may benefit from even more training data. [sent-317, score-0.493]
88 In future work it will be interesting to investigate the efficacy of compressed features on more intelligent sampling schemes such as active learning. [sent-318, score-0.428]
89 6 Discussion. We develop a feature selection method for text based on lossless data compression. [sent-319, score-0.404]
90 It selects a compact feature set that can require significantly less training data and reveals unsupervised problem structure (e.g., the class structure visible in a PCA projection). [sent-323, score-0.263]
91 Our compression scheme is more robust and less arbitrary compared to a setup which uses off-the-shelf compression algorithms to extract features from a document corpus. [sent-326, score-1.125]
92 Importantly, the algorithm we present is based on iterative reweighting and ADMM and is fast enough—linear in the input size when k is fixed, and highly parallelizable—to allow for computing a regularization path of features by varying the pointer cost. [sent-328, score-0.634]
93 Thus, we may adapt the compression to the data at hand and select features that better elucidate its structure. [sent-329, score-0.568]
94 Finally, even though we focus on text data in this paper, our method is applicable to any sequential data where the sequence elements are drawn from a finite set (such as the universe of words in the case of text data). [sent-330, score-0.337]
95 We also plan to experiment with approximate text representations obtained by making our criterion lossy. [sent-332, score-0.168]
96 Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. [sent-389, score-0.272]
97 Proposing a new term weighting scheme for text categorization. [sent-419, score-0.177]
98 Improving multiclass text classification with error-correcting output coding and sub-class partitions. [sent-433, score-0.171]
99 Toward integrating feature selection algorithms for classification and clustering. [sent-438, score-0.189]
100 A comparative study on feature selection in text categorization. [sent-492, score-0.322]
wordName wordTfidf (topN-words)
[('pointer', 0.4), ('compression', 0.381), ('compressed', 0.292), ('dictionary', 0.215), ('pointers', 0.178), ('newsgroups', 0.175), ('dj', 0.156), ('document', 0.147), ('wj', 0.142), ('documents', 0.138), ('features', 0.136), ('substrings', 0.133), ('feature', 0.133), ('text', 0.133), ('imdb', 0.089), ('ziv', 0.089), ('qj', 0.087), ('categorization', 0.083), ('corpus', 0.083), ('lossless', 0.082), ('admm', 0.082), ('misclassification', 0.076), ('sculley', 0.076), ('cost', 0.07), ('reweighting', 0.067), ('mdl', 0.067), ('orders', 0.067), ('training', 0.065), ('xw', 0.065), ('unsupervised', 0.065), ('compress', 0.062), ('lempel', 0.062), ('substring', 0.062), ('xz', 0.062), ('classi', 0.059), ('supervised', 0.058), ('selection', 0.056), ('bigram', 0.055), ('reconstructs', 0.055), ('testing', 0.055), ('concatenating', 0.054), ('hastie', 0.053), ('elucidate', 0.051), ('west', 0.051), ('characters', 0.051), ('brodley', 0.05), ('categorizing', 0.05), ('paskov', 0.05), ('word', 0.05), ('mitchell', 0.049), ('overly', 0.047), ('string', 0.047), ('frank', 0.046), ('unigrams', 0.044), ('reconstruct', 0.044), ('scheme', 0.044), ('pca', 0.044), ('storage', 0.042), ('compressing', 0.041), ('universe', 0.041), ('banded', 0.041), ('en', 0.041), ('alternating', 0.04), ('tasks', 0.039), ('spam', 0.039), ('tit', 0.039), ('multiclass', 0.038), ('operating', 0.038), ('compressive', 0.038), ('wi', 0.038), ('binary', 0.038), ('stanford', 0.037), ('knapsack', 0.037), ('length', 0.036), ('principal', 0.036), ('extract', 0.036), ('criterion', 0.035), ('instability', 0.035), ('impact', 0.035), ('magnitude', 0.034), ('exempli', 0.034), ('reweighted', 0.034), ('ag', 0.034), ('separately', 0.033), ('scenarios', 0.033), ('splitting', 0.032), ('pre', 0.032), ('regularization', 0.031), ('alt', 0.031), ('rand', 0.031), ('cholesky', 0.03), ('solution', 0.03), ('words', 0.03), ('dual', 0.03), ('evenly', 0.029), ('concatenation', 0.029), ('subproblems', 0.028), ('tens', 0.028), ('discriminative', 0.028), ('solvable', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 1.000001 65 nips-2013-Compressive Feature Learning
Author: Hristo S. Paskov, Robert West, John C. Mitchell, Trevor Hastie
Abstract: This paper addresses the problem of unsupervised feature learning for text data. Our method is grounded in the principle of minimum description length and uses a dictionary-based compression scheme to extract a succinct feature set. Specifically, our method finds a set of word k-grams that minimizes the cost of reconstructing the text losslessly. We formulate document compression as a binary optimization task and show how to solve it approximately via a sequence of reweighted linear programs that are efficient to solve and parallelizable. As our method is unsupervised, features may be extracted once and subsequently used in a variety of tasks. We demonstrate the performance of these features over a range of scenarios including unsupervised exploratory analysis and supervised text categorization. Our compressed feature space is two orders of magnitude smaller than the full k-gram space and matches the text categorization accuracy achieved in the full feature space. This dimensionality reduction not only results in faster training times, but it can also help elucidate structure in unsupervised learning tasks and reduce the amount of training data necessary for supervised learning. 1
2 0.13678145 251 nips-2013-Predicting Parameters in Deep Learning
Author: Misha Denil, Babak Shakibi, Laurent Dinh, Marc'Aurelio Ranzato, Nando de Freitas
Abstract: We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. 1
3 0.12319611 55 nips-2013-Bellman Error Based Feature Generation using Random Projections on Sparse Spaces
Author: Mahdi Milani Fard, Yuri Grinberg, Amir massoud Farahmand, Joelle Pineau, Doina Precup
Abstract: This paper addresses the problem of automatic generation of features for value function approximation in reinforcement learning. Bellman Error Basis Functions (BEBFs) have been shown to improve policy evaluation, with a convergence rate similar to that of value iteration. We propose a simple, fast and robust algorithm based on random projections, which generates BEBFs for sparse feature spaces. We provide a finite sample analysis of the proposed method, and prove that projections logarithmic in the dimension of the original space guarantee a contraction in the error. Empirical results demonstrate the strength of this method in domains in which choosing a good state representation is challenging. 1
4 0.10767839 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
Author: David J. Weiss, Ben Taskar
Abstract: Discriminative methods for learning structured models have enabled wide-spread use of very rich feature representations. However, the computational cost of feature extraction is prohibitive for large-scale or time-sensitive applications, often dominating the cost of inference in the models. Significant efforts have been devoted to sparsity-based model selection to decrease this cost. Such feature selection methods control computation statically and miss the opportunity to finetune feature extraction to each input at run-time. We address the key challenge of learning to control fine-grained feature extraction adaptively, exploiting nonhomogeneity of the data. We propose an architecture that uses a rich feedback loop between extraction and prediction. The run-time control policy is learned using efficient value-function approximation, which adaptively determines the value of information of features at the level of individual variables for each input. We demonstrate significant speedups over state-of-the-art methods on two challenging datasets. For articulated pose estimation in video, we achieve a more accurate state-of-the-art model that is also faster, with similar results on an OCR task. 1
5 0.10229409 321 nips-2013-Supervised Sparse Analysis and Synthesis Operators
Author: Pablo Sprechmann, Roee Litman, Tal Ben Yakar, Alexander M. Bronstein, Guillermo Sapiro
Abstract: In this paper, we propose a new computationally efficient framework for learning sparse models. We formulate a unified approach that contains as particular cases models promoting sparse synthesis and analysis type of priors, and mixtures thereof. The supervised training of the proposed model is formulated as a bilevel optimization problem, in which the operators are optimized to achieve the best possible performance on a specific task, e.g., reconstruction or classification. By restricting the operators to be shift invariant, our approach can be thought as a way of learning sparsity-promoting convolutional operators. Leveraging recent ideas on fast trainable regressors designed to approximate exact sparse codes, we propose a way of constructing feed-forward networks capable of approximating the learned models at a fraction of the computational cost of exact solvers. In the shift-invariant case, this leads to a principled way of constructing a form of taskspecific convolutional networks. We illustrate the proposed models on several experiments in music analysis and image processing applications. 1
6 0.088515438 153 nips-2013-Learning Feature Selection Dependencies in Multi-task Learning
7 0.083739057 12 nips-2013-A Novel Two-Step Method for Cross Language Representation Learning
8 0.08356902 88 nips-2013-Designed Measurements for Vector Count Data
9 0.08301913 59 nips-2013-Blind Calibration in Compressed Sensing using Message Passing Algorithms
10 0.075594433 232 nips-2013-Online PCA for Contaminated Data
11 0.074891418 304 nips-2013-Sparse nonnegative deconvolution for compressive calcium imaging: algorithms and phase transitions
12 0.074688844 75 nips-2013-Convex Two-Layer Modeling
13 0.07169535 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA
14 0.070754379 301 nips-2013-Sparse Additive Text Models with Low Rank Background
15 0.067566596 248 nips-2013-Point Based Value Iteration with Optimal Belief Compression for Dec-POMDPs
16 0.067261763 315 nips-2013-Stochastic Ratio Matching of RBMs for Sparse High-Dimensional Inputs
17 0.066861637 190 nips-2013-Mid-level Visual Element Discovery as Discriminative Mode Seeking
18 0.066707976 356 nips-2013-Zero-Shot Learning Through Cross-Modal Transfer
19 0.066362597 211 nips-2013-Non-Linear Domain Adaptation with Boosting
20 0.066235997 216 nips-2013-On Flat versus Hierarchical Classification in Large-Scale Taxonomies
topicId topicWeight
[(0, 0.195), (1, 0.076), (2, -0.049), (3, -0.026), (4, 0.075), (5, -0.025), (6, -0.073), (7, 0.062), (8, -0.025), (9, -0.016), (10, 0.009), (11, 0.078), (12, -0.027), (13, -0.003), (14, 0.001), (15, -0.017), (16, 0.082), (17, 0.043), (18, 0.028), (19, -0.01), (20, -0.093), (21, -0.059), (22, 0.001), (23, -0.005), (24, 0.036), (25, -0.056), (26, 0.033), (27, -0.014), (28, 0.014), (29, -0.119), (30, 0.021), (31, 0.113), (32, 0.035), (33, 0.086), (34, 0.113), (35, -0.054), (36, 0.011), (37, -0.038), (38, -0.039), (39, -0.035), (40, 0.045), (41, 0.045), (42, 0.042), (43, -0.033), (44, -0.065), (45, -0.046), (46, 0.106), (47, -0.079), (48, -0.05), (49, -0.174)]
simIndex simValue paperId paperTitle
same-paper 1 0.93566519 65 nips-2013-Compressive Feature Learning
Author: Hristo S. Paskov, Robert West, John C. Mitchell, Trevor Hastie
Abstract: This paper addresses the problem of unsupervised feature learning for text data. Our method is grounded in the principle of minimum description length and uses a dictionary-based compression scheme to extract a succinct feature set. Specifically, our method finds a set of word k-grams that minimizes the cost of reconstructing the text losslessly. We formulate document compression as a binary optimization task and show how to solve it approximately via a sequence of reweighted linear programs that are efficient to solve and parallelizable. As our method is unsupervised, features may be extracted once and subsequently used in a variety of tasks. We demonstrate the performance of these features over a range of scenarios including unsupervised exploratory analysis and supervised text categorization. Our compressed feature space is two orders of magnitude smaller than the full k-gram space and matches the text categorization accuracy achieved in the full feature space. This dimensionality reduction not only results in faster training times, but it can also help elucidate structure in unsupervised learning tasks and reduce the amount of training data necessary for supervised learning. 1
2 0.6979751 88 nips-2013-Designed Measurements for Vector Count Data
Author: Liming Wang, David Carlson, Miguel Rodrigues, David Wilcox, Robert Calderbank, Lawrence Carin
Abstract: We consider design of linear projection measurements for a vector Poisson signal model. The projections are performed on the vector Poisson rate, X ∈ R^n_+, and the observed data are a vector of counts, Y ∈ Z^m_+. The projection matrix is designed by maximizing mutual information between Y and X, I(Y ; X). When there is a latent class label C ∈ {1, . . . , L} associated with X, we consider the mutual information with respect to Y and C, I(Y ; C). New analytic expressions for the gradient of I(Y ; X) and I(Y ; C) are presented, with gradient performed with respect to the measurement matrix. Connections are made to the more widely studied Gaussian measurement model. Example results are presented for compressive topic modeling of a document corpora (word counting), and hyperspectral compressive sensing for chemical classification (photon counting). 1
3 0.67331886 321 nips-2013-Supervised Sparse Analysis and Synthesis Operators
Author: Pablo Sprechmann, Roee Litman, Tal Ben Yakar, Alexander M. Bronstein, Guillermo Sapiro
Abstract: In this paper, we propose a new computationally efficient framework for learning sparse models. We formulate a unified approach that contains as particular cases models promoting sparse synthesis and analysis type of priors, and mixtures thereof. The supervised training of the proposed model is formulated as a bilevel optimization problem, in which the operators are optimized to achieve the best possible performance on a specific task, e.g., reconstruction or classification. By restricting the operators to be shift invariant, our approach can be thought as a way of learning sparsity-promoting convolutional operators. Leveraging recent ideas on fast trainable regressors designed to approximate exact sparse codes, we propose a way of constructing feed-forward networks capable of approximating the learned models at a fraction of the computational cost of exact solvers. In the shift-invariant case, this leads to a principled way of constructing a form of taskspecific convolutional networks. We illustrate the proposed models on several experiments in music analysis and image processing applications. 1
4 0.63939184 12 nips-2013-A Novel Two-Step Method for Cross Language Representation Learning
Author: Min Xiao, Yuhong Guo
Abstract: Cross language text classification is an important learning task in natural language processing. A critical challenge of cross language learning arises from the fact that words of different languages are in disjoint feature spaces. In this paper, we propose a two-step representation learning method to bridge the feature spaces of different languages by exploiting a set of parallel bilingual documents. Specifically, we first formulate a matrix completion problem to produce a complete parallel document-term matrix for all documents in two languages, and then induce a low dimensional cross-lingual document representation by applying latent semantic indexing on the obtained matrix. We use a projected gradient descent algorithm to solve the formulated matrix completion problem with convergence guarantees. The proposed method is evaluated by conducting a set of experiments with cross language sentiment classification tasks on Amazon product reviews. The experimental results demonstrate that the proposed learning method outperforms a number of other cross language representation learning methods, especially when the number of parallel bilingual documents is small. 1
5 0.58556902 59 nips-2013-Blind Calibration in Compressed Sensing using Message Passing Algorithms
Author: Christophe Schulke, Francesco Caltagirone, Florent Krzakala, Lenka Zdeborova
Abstract: Compressed sensing (CS) is a concept that allows to acquire compressible signals with a small number of measurements. As such it is very attractive for hardware implementations. Therefore, correct calibration of the hardware is a central issue. In this paper we study the so-called blind calibration, i.e. when the training signals that are available to perform the calibration are sparse but unknown. We extend the approximate message passing (AMP) algorithm used in CS to the case of blind calibration. In the calibration-AMP, both the gains on the sensors and the elements of the signals are treated as unknowns. Our algorithm is also applicable to settings in which the sensors distort the measurements in other ways than multiplication by a gain, unlike previously suggested blind calibration algorithms based on convex relaxations. We study numerically the phase diagram of the blind calibration problem, and show that even in cases where convex relaxation is possible, our algorithm requires a smaller number of measurements and/or signals in order to perform well. 1
6 0.55064702 304 nips-2013-Sparse nonnegative deconvolution for compressive calcium imaging: algorithms and phase transitions
7 0.54938871 245 nips-2013-Pass-efficient unsupervised feature selection
8 0.5485931 55 nips-2013-Bellman Error Based Feature Generation using Random Projections on Sparse Spaces
9 0.5212881 247 nips-2013-Phase Retrieval using Alternating Minimization
10 0.50685072 76 nips-2013-Correlated random features for fast semi-supervised learning
11 0.50627196 135 nips-2013-Heterogeneous-Neighborhood-based Multi-Task Local Learning Algorithms
12 0.50318295 301 nips-2013-Sparse Additive Text Models with Low Rank Background
13 0.50032192 274 nips-2013-Relevance Topic Model for Unstructured Social Group Activity Recognition
14 0.49514213 10 nips-2013-A Latent Source Model for Nonparametric Time Series Classification
15 0.49266815 212 nips-2013-Non-Uniform Camera Shake Removal Using a Spatially-Adaptive Sparse Penalty
16 0.48799643 358 nips-2013-q-OCSVM: A q-Quantile Estimator for High-Dimensional Distributions
17 0.47904637 244 nips-2013-Parametric Task Learning
18 0.47787055 98 nips-2013-Documents as multiple overlapping windows into grids of counts
19 0.47734886 117 nips-2013-Fast Algorithms for Gaussian Noise Invariant Independent Component Analysis
20 0.47614741 85 nips-2013-Deep content-based music recommendation
topicId topicWeight
[(2, 0.011), (16, 0.034), (33, 0.108), (34, 0.074), (41, 0.018), (49, 0.024), (56, 0.085), (70, 0.023), (85, 0.024), (89, 0.03), (93, 0.489), (95, 0.016)]
simIndex simValue paperId paperTitle
1 0.9449712 339 nips-2013-Understanding Dropout
Author: Pierre Baldi, Peter J. Sadowski
Abstract: Dropout is a relatively new algorithm for training neural networks which relies on stochastically “dropping out” neurons during training in order to avoid the co-adaptation of feature detectors. We introduce a general formalism for studying dropout on either units or connections, with arbitrary probability values, and use it to analyze the averaging and regularizing properties of dropout in both linear and non-linear networks. For deep neural networks, the averaging properties of dropout are characterized by three recursive equations, including the approximation of expectations by normalized weighted geometric means. We provide estimates and bounds for these approximations and corroborate the results with simulations. Among other results, we also show how dropout performs stochastic gradient descent on a regularized error function. 1
2 0.89185464 211 nips-2013-Non-Linear Domain Adaptation with Boosting
Author: Carlos J. Becker, Christos M. Christoudias, Pascal Fua
Abstract: A common assumption in machine vision is that the training and test samples are drawn from the same distribution. However, there are many problems when this assumption is grossly violated, as in bio-medical applications where different acquisitions can generate drastic variations in the appearance of the data due to changing experimental conditions. This problem is accentuated with 3D data, for which annotation is very time-consuming, limiting the amount of data that can be labeled in new acquisitions for training. In this paper we present a multitask learning algorithm for domain adaptation based on boosting. Unlike previous approaches that learn task-specific decision boundaries, our method learns a single decision boundary in a shared feature space, common to all tasks. We use the boosting-trick to learn a non-linear mapping of the observations in each task, with no need for specific a-priori knowledge of its global analytical form. This yields a more parameter-free domain adaptation approach that successfully leverages learning on new tasks where labeled data is scarce. We evaluate our approach on two challenging bio-medical datasets and achieve a significant improvement over the state of the art. 1
same-paper 3 0.88952929 65 nips-2013-Compressive Feature Learning
Author: Hristo S. Paskov, Robert West, John C. Mitchell, Trevor Hastie
Abstract: This paper addresses the problem of unsupervised feature learning for text data. Our method is grounded in the principle of minimum description length and uses a dictionary-based compression scheme to extract a succinct feature set. Specifically, our method finds a set of word k-grams that minimizes the cost of reconstructing the text losslessly. We formulate document compression as a binary optimization task and show how to solve it approximately via a sequence of reweighted linear programs that are efficient to solve and parallelizable. As our method is unsupervised, features may be extracted once and subsequently used in a variety of tasks. We demonstrate the performance of these features over a range of scenarios including unsupervised exploratory analysis and supervised text categorization. Our compressed feature space is two orders of magnitude smaller than the full k-gram space and matches the text categorization accuracy achieved in the full feature space. This dimensionality reduction not only results in faster training times, but it can also help elucidate structure in unsupervised learning tasks and reduce the amount of training data necessary for supervised learning. 1
4 0.85989928 146 nips-2013-Large Scale Distributed Sparse Precision Estimation
Author: Huahua Wang, Arindam Banerjee, Cho-Jui Hsieh, Pradeep Ravikumar, Inderjit Dhillon
Abstract: We consider the problem of sparse precision matrix estimation in high dimensions using the CLIME estimator, which has several desirable theoretical properties. We present an inexact alternating direction method of multiplier (ADMM) algorithm for CLIME, and establish rates of convergence for both the objective and optimality conditions. Further, we develop a large scale distributed framework for the computations, which scales to millions of dimensions and trillions of parameters, using hundreds of cores. The proposed framework solves CLIME in columnblocks and only involves elementwise operations and parallel matrix multiplications. We evaluate our algorithm on both shared-memory and distributed-memory architectures, which can use block cyclic distribution of data and parameters to achieve load balance and improve the efficiency in the use of memory hierarchies. Experimental results show that our algorithm is substantially more scalable than state-of-the-art methods and scales almost linearly with the number of cores. 1
5 0.85362732 172 nips-2013-Learning word embeddings efficiently with noise-contrastive estimation
Author: Andriy Mnih, Koray Kavukcuoglu
Abstract: Continuous-valued word embeddings learned by neural language models have recently been shown to capture semantic and syntactic information about words very well, setting performance records on several word similarity tasks. The best results are obtained by learning high-dimensional embeddings from very large quantities of data, which makes scalability of the training method a critical factor. We propose a simple and scalable new approach to learning word embeddings based on training log-bilinear models with noise-contrastive estimation. Our approach is simpler, faster, and produces better results than the current state-of-theart method. We achieve results comparable to the best ones reported, which were obtained on a cluster, using four times less data and more than an order of magnitude less computing time. We also investigate several model types and find that the embeddings learned by the simpler models perform at least as well as those learned by the more complex ones. 1
6 0.7176702 215 nips-2013-On Decomposing the Proximal Map
7 0.67761546 12 nips-2013-A Novel Two-Step Method for Cross Language Representation Learning
8 0.63210875 99 nips-2013-Dropout Training as Adaptive Regularization
9 0.61670226 82 nips-2013-Decision Jungles: Compact and Rich Models for Classification
10 0.61656177 94 nips-2013-Distributed $k$-means and $k$-median Clustering on General Topologies
11 0.59828925 96 nips-2013-Distributed Representations of Words and Phrases and their Compositionality
12 0.58855289 30 nips-2013-Adaptive dropout for training deep neural networks
13 0.58354169 276 nips-2013-Reshaping Visual Datasets for Domain Adaptation
14 0.56772196 251 nips-2013-Predicting Parameters in Deep Learning
15 0.55900419 5 nips-2013-A Deep Architecture for Matching Short Texts
16 0.55602646 69 nips-2013-Context-sensitive active sensing in humans
17 0.53645438 263 nips-2013-Reasoning With Neural Tensor Networks for Knowledge Base Completion
18 0.53525186 64 nips-2013-Compete to Compute
19 0.53012365 183 nips-2013-Mapping paradigm ontologies to and from the brain
20 0.52957076 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization