nips nips2010 nips2010-251 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yariv Maron, Elie Bienenstock, Michael James
Abstract: Motivated by an application to unsupervised part-of-speech tagging, we present an algorithm for the Euclidean embedding of large sets of categorical data based on co-occurrence statistics. We use the CODE model of Globerson et al. but constrain the embedding to lie on a high-dimensional unit sphere. This constraint allows for efficient optimization, even in the case of large datasets and high embedding dimensionality. Using k-means clustering of the embedded data, our approach efficiently produces state-of-the-art results. We analyze the reasons why the sphere constraint is beneficial in this application, and conjecture that these reasons might apply quite generally to other large-scale tasks. 1 Introduction The embedding of objects in a low-dimensional Euclidean space is a form of dimensionality reduction that has been used in the past mostly to create 2D representations of data for the purpose of visualization and exploratory data analysis [10, 13]. Most methods work on objects of a single type, endowed with a measure of similarity. Other methods, such as [3], embed objects of heterogeneous types, based on their co-occurrence statistics. In this paper we demonstrate that the latter can be successfully applied to unsupervised part-of-speech (POS) induction, an extensively studied, challenging problem in natural language processing [1, 4, 5, 6, 7]. The problem we address is distributional POS tagging, in which words are to be tagged based on the statistics of their immediate left and right context in a corpus (ignoring morphology and other features). The induction task is fully unsupervised, i.e., it uses no annotations. This task has been addressed in the past using a variety of methods. Some approaches, such as [1], combine a Markovian assumption with clustering. Many recent works use HMMs, perhaps due to their excellent performance on the supervised version of the task [7, 2, 5]. Using a latent-descriptor clustering approach, [15] obtain the best results to date for distributional-only unsupervised POS tagging of the widely-used WSJ corpus. Using a heterogeneous-data embedding approach for this task, we define separate embedding functions for the objects
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Motivated by an application to unsupervised part-of-speech tagging, we present an algorithm for the Euclidean embedding of large sets of categorical data based on co-occurrence statistics. [sent-5, score-0.362]
2 but constrain the embedding to lie on a high-dimensional unit sphere. [sent-7, score-0.217]
3 This constraint allows for efficient optimization, even in the case of large datasets and high embedding dimensionality. [sent-8, score-0.257]
4 Using k-means clustering of the embedded data, our approach efficiently produces state-of-the-art results. [sent-9, score-0.096]
5 We analyze the reasons why the sphere constraint is beneficial in this application, and conjecture that these reasons might apply quite generally to other large-scale tasks. [sent-10, score-0.149]
6 1 Introduction The embedding of objects in a low-dimensional Euclidean space is a form of dimensionality reduction that has been used in the past mostly to create 2D representations of data for the purpose of visualization and exploratory data analysis [10, 13]. [sent-11, score-0.318]
7 Other methods, such as [3], embed objects of heterogeneous types, based on their co-occurrence statistics. [sent-13, score-0.106]
8 In this paper we demonstrate that the latter can be successfully applied to unsupervised part-of-speech (POS) induction, an extensively studied, challenging problem in natural language processing [1, 4, 5, 6, 7]. [sent-14, score-0.2]
9 The problem we address is distributional POS tagging, in which words are to be tagged based on the statistics of their immediate left and right context in a corpus (ignoring morphology and other features). [sent-15, score-0.186]
10 Using a latent-descriptor clustering approach, [15] obtain the best results to date for distributional-only unsupervised POS tagging of the widely-used WSJ corpus. [sent-22, score-0.523]
11 Using a heterogeneous-data embedding approach for this task, we define separate embedding functions for the objects "left word" and "right word" based on their co-occurrence statistics, i. [sent-23, score-0.494]
12 We are interested in modeling the statistical interactions between left words and right words, as relevant to POS tagging, rather than their joint distribution. [sent-26, score-0.077]
13 Indeed, modeling the joint distribution directly results in models that do not handle rare words well. [sent-27, score-0.118]
14 This embedding model incorporates the marginal probabilities, or unigram frequencies, in a way that results in appropriate handling of both frequent and rare words. [sent-29, score-0.335]
15 The size of the dataset (number of points to embed) and the embedding dimensionality are several-fold larger than in the applications studied in [3], making the optimization methods used by these authors impractical. [sent-30, score-0.217]
16 Importantly, in order to handle both the large dataset and the relatively high dimensionality of the embedding needed for this application, we constrain the embedding to lie on the unit sphere. [sent-32, score-0.434]
17 The spherical constraint causes the regularization term—the partition function—to be nearly constant and also makes the stochastic gradient ascent smoother; this allows a several-fold computational improvement, and yields excellent performance. [sent-34, score-0.364]
18 After convergence of the embedding model, we use a k-means algorithm to cluster all the words of the corpus, based on their embeddings. [sent-35, score-0.291]
19 The induced POS labels are evaluated using the standard setting for this task, yielding state-of-the-art tagging performance. [sent-36, score-0.419]
20 , an ordered pair of adjacent words in the corpus, as joint random variables (X,Y), each taking values in W, the set of word types occurring in the corpus. [sent-40, score-0.367]
21 Since X and Y, the first and second words in a bigram, play different roles, we build a heterogeneous model, i. [sent-41, score-0.091]
22 Both map W into S, the unit sphere in the r-dimensional Euclidean space. [sent-44, score-0.139]
23 We use for the word-type frequencies: is the number of word tokens of type x divided by the total number of tokens in the corpus. [sent-45, score-0.578]
24 We refer to as the empirical marginal distribution, or unigram frequency. [sent-46, score-0.113]
25 We use for the empirical joint distribution of X and Y, i. [sent-47, score-0.069]
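As a concrete illustration of the empirical statistics referred to in extracts 23-25 (the blanks are where the paper's notation was lost in extraction), here is a minimal Python sketch; the function and variable names are ours, not the paper's.

    # Minimal sketch (not the authors' code): empirical unigram frequencies and the
    # empirical joint over adjacent word pairs, computed from a corpus given as a token list.
    from collections import Counter

    def empirical_stats(tokens):
        unigram = Counter(tokens)                      # token count of each word type
        bigram = Counter(zip(tokens, tokens[1:]))      # counts of adjacent (left, right) pairs
        n_tok = len(tokens)
        n_big = max(len(tokens) - 1, 1)
        p_word = {w: c / n_tok for w, c in unigram.items()}    # unigram frequency of each type
        p_pair = {xy: c / n_big for xy, c in bigram.items()}   # empirical joint over bigrams
        return p_word, p_pair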
26 Because our ultimate goal is the clustering of word types for POS tagging, we want the embedding to be insensitive to the marginals: two word types with similar context distributions should be mapped to neighboring points in S even if their unigram frequencies are very different. [sent-50, score-1.003]
27 Instead, it turns out (see Discussion) that, thanks to the sphere constraint, we can approximate this dynamic variable, Z, using a constant, , which arises from a coarse approximation in which all pairs of embedded variables are distributed uniformly and independently on the sphere. [sent-57, score-0.168]
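For orientation, the CODE model of Globerson et al. [3] is usually written in the following form; the paper's own equations are not preserved by this extraction, so this is a reconstruction under that assumption rather than a quotation:

    p(x, y) = \frac{1}{Z} \, \bar{p}(x) \, \bar{p}(y) \, e^{-\|\phi(x) - \psi(y)\|^2},
    \qquad
    Z = \sum_{x', y'} \bar{p}(x') \, \bar{p}(y') \, e^{-\|\phi(x') - \psi(y')\|^2}.

The constant approximation described above replaces Z by its value under the coarse assumption that all embedded points are independent and uniformly distributed on the sphere S, which makes Z depend only on the dimension r and not on the current embeddings.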
28 To implement the first sum in (5) and (6) − representing an attraction force between the embeddings of the words in a bigram − we sample bigrams from the empirical joint . [sent-71, score-0.559]
29 In order to speed up the convergence process, we use a learning rate that decreases as word types are repeatedly observed. [sent-74, score-0.29]
30 If is the number of times word type w has been previously encountered, we use: . [sent-75, score-0.31]
31 This modified learning rate also reduces the variability of the tagging accuracy, while slightly increasing its mean. [sent-77, score-0.328]
32 The second sum in (5) and in (6) − representing a repulsion force − involves not the empirical joint but the product of the empirical marginals. [sent-78, score-0.257]
33 After each step, the updated vectors are projected back onto the sphere S. [sent-80, score-0.109]
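Equations (5) and (6) themselves are lost in the extraction, so the following sketch only mirrors the verbal description in extracts 28-33: sample a bigram from the empirical joint for the attraction step, sample a pair from the product of the marginals for the repulsion step (scaled by the constant approximation of Z), use a per-type decreasing learning rate (the exact schedule is not recoverable here; 1/n_w is our placeholder), and renormalize the touched vectors to the unit sphere.

    # Sketch of one S-CODE-style stochastic update; it follows the verbal description only,
    # not the paper's equations (5)-(6), and the 1/n_w learning-rate decay is an assumption.
    import numpy as np

    def sgd_step(phi, psi, sample_bigram, sample_marginals, counts, z_const, eta0=0.1):
        # Attraction: a bigram (x, y) drawn from the empirical joint pulls phi[x] and psi[y] together.
        x, y = sample_bigram()
        counts[x] = counts.get(x, 0) + 1
        counts[y] = counts.get(y, 0) + 1
        dx = psi[y] - phi[x]
        phi[x] = phi[x] + (eta0 / counts[x]) * dx
        psi[y] = psi[y] - (eta0 / counts[y]) * dx

        # Repulsion: a pair (x2, y2) drawn from the product of the empirical marginals is
        # pushed apart, scaled by the constant approximation z_const of the partition function.
        x2, y2 = sample_marginals()
        d2 = psi[y2] - phi[x2]
        phi[x2] = phi[x2] - (eta0 / counts.get(x2, 1)) * z_const * d2
        psi[y2] = psi[y2] + (eta0 / counts.get(y2, 1)) * z_const * d2

        # Project every updated vector back onto the unit sphere S.
        for table, w in ((phi, x), (psi, y), (phi, x2), (psi, y2)):
            table[w] = table[w] / np.linalg.norm(table[w])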
34 After convergence, for any word w, we have two embedded vectors, and . [sent-81, score-0.31]
35 These vectors are concatenated to form a single geometric description of word type w. [sent-82, score-0.31]
36 The collection of all these vectors is then clustered using a weighted k-means clustering algorithm: in each iteration, a cluster’s centroid is updated as the weighted mean of its currently assigned constituent vectors, with the weight of the vector for word w equal to . [sent-83, score-0.288]
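A compact sketch of the weighted k-means step described above; the blank at the end of extract 36 is where the paper's weight symbol was lost, and we assume here that the weight of word w is its unigram frequency.

    # Sketch of weighted k-means over the concatenated word vectors; the per-word weight
    # (assumed here to be the unigram frequency) enters only through the centroid update.
    import numpy as np

    def weighted_kmeans(vectors, weights, k, n_iter=50, seed=0):
        words = list(vectors)
        X = np.stack([vectors[w] for w in words])
        wt = np.array([weights[w] for w in words])
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(words), size=k, replace=False)]
        for _ in range(n_iter):
            dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
            assign = dist.argmin(axis=1)                       # nearest-centroid assignment
            for c in range(k):
                mask = assign == c
                if mask.any():                                 # weighted mean of assigned vectors
                    centroids[c] = np.average(X[mask], axis=0, weights=wt[mask])
        return dict(zip(words, assign.tolist()))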
37 The number of clusters chosen depends on whether evaluation is to be done against the PTB45 or the PTB17 tagset (see below, Section 2. [sent-84, score-0.101]
38 2 Evaluation and data The resulting assignment of cluster labels to word types is used to label the corpus. [sent-87, score-0.363]
39 The standard practice for evaluating the performance of the induced labels is to either map them to the gold-standard tags, or to use an information-theoretic measure. [sent-88, score-0.121]
40 The first criterion maps each cluster to the POS tag that it best matches according to the hand-annotated labels. [sent-90, score-0.375]
41 The match is determined by finding the tag that is most frequently assigned to any token of any word type in the cluster. [sent-91, score-0.506]
42 Because the criterion is free to assign several clusters to the same POS tag, this evaluation technique is called many-to-one mapping, or MTO. [sent-92, score-0.156]
43 Once the map is constructed, the accuracy score is obtained as the fraction of all tokens whose inferred tag under the map matches the hand-annotated tag. [sent-93, score-0.49]
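A short sketch of the many-to-one (MTO) scoring just described; `induced` and `gold` are hypothetical per-token label lists, not names from the paper.

    # Sketch of many-to-one (MTO) evaluation: map each induced cluster to the gold tag
    # it co-occurs with most often, then score token-level accuracy under that map.
    from collections import Counter, defaultdict

    def mto_accuracy(induced, gold):
        cooc = defaultdict(Counter)
        for c, t in zip(induced, gold):
            cooc[c][t] += 1
        mapping = {c: cnt.most_common(1)[0][0] for c, cnt in cooc.items()}
        return sum(mapping[c] == t for c, t in zip(induced, gold)) / len(gold)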
44 The second criterion, 1-to-1 mapping, is similar to the first, but the mapping is restricted from assigning multiple clusters to a single tag; hence it is called one-to-one mapping, or 1-to-1. [sent-94, score-0.128]
45 Most authors construct the 1-to-1 mapping greedily, assigning maximal-score label-to-tag matches first; some authors, e. [sent-95, score-0.106]
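For contrast with MTO, a sketch of the greedy one-to-one mapping mentioned here (maximal-score cluster-tag matches are assigned first, and each tag is used at most once); input names are again hypothetical.

    # Sketch of greedy 1-to-1 mapping: take cluster-tag pairs in decreasing co-occurrence
    # order, never reusing a cluster or a tag; unmapped clusters score as incorrect.
    from collections import Counter

    def one_to_one_accuracy(induced, gold):
        pairs = Counter(zip(induced, gold))
        mapping, used = {}, set()
        for (c, t), _ in pairs.most_common():
            if c not in mapping and t not in used:
                mapping[c] = t
                used.add(t)
        return sum(mapping.get(c) == t for c, t in zip(induced, gold)) / len(gold)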
46 We note that we and other authors found the most reliable criterion for comparing unsupervised POS taggers to be MTO. [sent-100, score-0.222]
47 To compare performance with other models consistently, we ignore capitalization, leaving 43,766 word types. [sent-103, score-0.251]
48 Evaluation is done against the full tag set (PTB45), and against a coarse tag set (PTB17) [12]. [sent-104, score-0.392]
49 MTO17 and MTO50 refer to the number of tokens tagged correctly under the many-to-1 mapping for the PTB17 and PTB45 tagsets respectively. [sent-108, score-0.262]
50 The type-accuracy curves use the same mapping (footnote 1: Source code is available at the author’s website: faculty.) [sent-109, score-0.12]
51 and tagsets, but record the fraction of word types whose inferred tag matches their "modal" annotated tag, i. [sent-113, score-0.566]
52 , the annotated tag co-occurring most frequently with this word type. [sent-115, score-0.492]
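A small sketch of the type-accuracy measure described in extracts 51-52: a word type counts as correct when its single inferred tag (under the same mapping) equals the gold tag most frequently annotated for that type; input names are hypothetical.

    # Sketch of type accuracy: compare each word type's single inferred tag with its
    # "modal" gold tag, i.e. the gold tag that co-occurs most often with that type.
    from collections import Counter, defaultdict

    def type_accuracy(tokens, inferred_tag_of_type, gold):
        by_type = defaultdict(Counter)
        for w, t in zip(tokens, gold):
            by_type[w][t] += 1
        modal = {w: cnt.most_common(1)[0][0] for w, cnt in by_type.items()}
        hits = sum(inferred_tag_of_type[w] == modal[w] for w in modal)
        return hits / len(modal)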
53 Figure 1: Scores against number of iterations (bigram updates); x-axis in units of 100,000 bigram updates. [sent-128, score-0.221]
54 MTO17 is the Many-to-1 tagging accuracy score based on 17 induced labels mapped to 17 tags. [sent-130, score-0.533]
55 MTO50 is the Many-to-1 score based on 50 induced labels mapped to 45 tags. [sent-131, score-0.169]
56 Type Accuracy 17 (50) is the average accuracy per word type, where the gold-standard tag of a word type is the modal annotated tag of that type (see text). [sent-132, score-1.138]
57 Figure 2: Comparison of models with different dimensionalities: r = 2, 5, 10, 25; x-axis in units of 100,000 bigram updates. [sent-141, score-0.221]
58 MTO17 is the Many-to-1 score based on 17 induced labels mapped to PTB17 tags. [sent-142, score-0.169]
59 Unlike previous applications of CODE [3] (which often emphasize visualization of data and thus require a low dimension), this unsupervised POS-tagging application benefits from high values of r. [sent-145, score-0.151]
60 Larger values of r cause both the tagging accuracy to improve and the variability during convergence to decrease. [sent-146, score-0.364]
61 PTB45-45 maps 45 induced labels to 45 tags, while PTB45-50 maps 50 induced labels to 45 tags. [sent-219, score-0.3]
62 Under the Many-to-1 criterion, which we find to be the most appropriate of the three for the evaluation of unsupervised POS taggers, S-CODE is superior to HMM results, and scores comparably to [15], the highest-performing model to date on this task. [sent-221, score-0.266]
63 This robustness lends promise for the usefulness of this method for other applications in which the partition function is impractical to compute. [sent-225, score-0.082]
64 Such embeddings have been used mostly for the purpose of visualization and exploratory data analysis. [sent-228, score-0.144]
65 In the task at hand, the sets X and Y to be embedded are large (43K), making most conventional embedding approaches, including CODE (as implemented in [3]), impractical. [sent-233, score-0.276]
66 It uses stochastic gradient ascent to maximize the likelihood of the model. [sent-235, score-0.068]
67 The first component embodies an attraction force, pulling toward in proportion to the empirical joint . [sent-240, score-0.178]
68 The second component, the gradient of the regularization term, , embodies a repulsion force; it keeps the solution away from the trivial state where all x's and y's are mapped to the same point, and more generally attempts to keep Z small. [sent-241, score-0.235]
69 The repulsion force pushes away from in proportion to the product of the empirical marginals and , and is scaled by . [sent-242, score-0.188]
70 The computational complexity of Z, the partition function, is . [sent-243, score-0.082]
71 In the application studied here, the use of the spherical constraint of S-CODE has two important consequences. [sent-244, score-0.167]
72 Indeed, when using the spherical constraint, we observed that Z, when actually computed and updated every 10^6 steps, does not deviate much from its initial value. [sent-246, score-0.127]
73 Note that the absolute minimum of Z—obtained for a that maps all of W to a single point on S and a that maps all of W to the opposite point—is ; the absolute maximum of Z, obtained for and that map all of W to the same point, is 1. [sent-250, score-0.148]
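Under the exponential-of-squared-distance form sketched after extract 27 (our assumption; the extreme values themselves are blanked out above except for the maximum), these two extremes work out to

    Z_{\min} = e^{-\|u - (-u)\|^2} = e^{-4} \quad \text{(antipodal unit vectors, distance 2)},
    \qquad
    Z_{\max} = e^{0} = 1,

which is consistent with the value 1 quoted above for the maximum.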
74 Note that the only effect of changing in the stochastic gradient algorithm is to change the relative strength of the attraction and repulsion terms. [sent-255, score-0.193]
75 This required us to compute the partition function, which is highly computationally intensive. [sent-259, score-0.082]
76 We therefore computed the partition function only once every q update steps (where one update step is the sampling of one bigram). [sent-260, score-0.14]
77 We found that for q = 10^5 the partition function and likelihood changed smoothly enough and converged, and the embeddings yielded tagging performances that did not differ significantly from those obtained with S-CODE. [sent-261, score-0.573]
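For reference, a sketch of the exact partition function that this comparison recomputes every q steps, again assuming the CODE-style form given after extract 27; the dense |W| x |W| distance matrix is only illustrative and is exactly the quadratic cost that the constant approximation avoids.

    # Sketch of the exact partition function under the assumed CODE-style model,
    # p(x, y) proportional to p(x) p(y) exp(-||phi(x) - psi(y)||^2).
    # The dense pairwise matrix makes this quadratic in the vocabulary size, which is
    # why the baseline recomputes it only once every q update steps.
    import numpy as np

    def partition_function(phi, psi, p_word):
        words = list(p_word)
        P = np.stack([phi[w] for w in words])          # |W| x r left embeddings
        Q = np.stack([psi[w] for w in words])          # |W| x r right embeddings
        p = np.array([p_word[w] for w in words])
        sq = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
        return float(p @ np.exp(-sq) @ p)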
78 The second important consequence of imposing the spherical constraint is that it makes the stochastic gradient-ascent procedure markedly smoother. [sent-262, score-0.167]
79 As a result, a relatively large step size can be used, achieving convergence and excellent tagging performance in about 10 minutes of computation time on a desktop machine. [sent-263, score-0.375]
80 CODE requires a smaller step size as well as the recomputation of the partition function, and, as a result, computation time in this application was 6 times longer than with S-CODE. [sent-264, score-0.082]
81 When gauging the applicability of S-CODE to different large-scale embedding problems, one should try to gain some understanding of why the spherical constraint stabilizes the partition function, and whether Z will stabilize around the same value for other problems. [sent-265, score-0.466]
82 The computational complexity of the algorithm is O(Nr), and the memory requirement is O(|W|r) where N is the number of word tokens, and |W| is the number of word types. [sent-271, score-0.502]
83 2 Comparison to other POS induction models Even though embedding models have been studied extensively, they are not widely used for POS tagging (see however [18]). [sent-274, score-0.613]
84 For the unsupervised POS tagging task, HMMs have until recently dominated the field. [sent-275, score-0.444]
85 Here we show that an embedding model substantially outperforms HMMs, and achieves the same level of performance as the best distributional-only model to date [15]. [sent-276, score-0.259]
86 In contrast, the approach presented here maps each word type to a single point in . [sent-282, score-0.369]
87 Hence, it assigns a single tag to each word type, like a number of other recent approaches [15, 16, 17]. [sent-283, score-0.447]
88 , of assigning different tags to the same word depending on context, as in "I long to see a long movie. [sent-286, score-0.35]
89 In the past, left and right distributions were extracted by factoring the bigram matrix and using the left and right eigenvectors. [sent-292, score-0.191]
90 Such a linear method does not handle rare words well. [sent-293, score-0.085]
91 This approach allows words with similar contexts but different unigram frequencies to be embedded near each other. [sent-295, score-0.223]
92 In future work, and as a more radical deviation from the CODE model, one may then give up altogether modeling the distribution of X and Y, instead relying on a heuristically motivated objective function of sphere-constrained embeddings and , to be maximized. [sent-299, score-0.071]
93 Although S-CODE and LDC [15] achieve essentially the same level of performance on taggings that induce 17, 45, or 50 labels (Table 1), S-CODE proves superior for the induction of very fine-grained taggings. [sent-301, score-0.111]
94 The appeal of S-CODE lies not only in its strong performance on the unsupervised POS tagging problem, but also in its simplicity, its robustness, and its mathematical grounding. [sent-309, score-0.444]
95 Modeling the joint probability of word type co-occurrence through distances between Euclidean embeddings, without relying on discrete categories or states, is a novel and promising approach for POS tagging. [sent-311, score-0.343]
96 The spherical constraint introduced here permits the approximation of the partition function by a constant, which is the key to the efficiency of the algorithm for large datasets. [sent-312, score-0.29]
97 While the accuracy and computational efficiency of S-CODE are matched by the recent LDC algorithm [15], S-CODE is more robust, showing very little change in performance over a wide range of implementation choices. [sent-314, score-0.077]
98 A comparison of Bayesian estimators for unsupervised Hidden Markov Model POS taggers. [sent-322, score-0.116]
99 Building a large annotated corpus of English: The Penn Treebank. [sent-353, score-0.116]
100 A unified architecture for natural language processing: Deep neural networks with multitask learning. [sent-400, score-0.084]
wordName wordTfidf (topN-words)
[('pos', 0.533), ('tagging', 0.328), ('word', 0.251), ('embedding', 0.217), ('tag', 0.196), ('bigram', 0.191), ('ldc', 0.152), ('hmms', 0.15), ('tokens', 0.134), ('spherical', 0.127), ('unsupervised', 0.116), ('sphere', 0.109), ('elie', 0.102), ('repulsion', 0.102), ('language', 0.084), ('partition', 0.082), ('code', 0.081), ('unigram', 0.077), ('bigrams', 0.076), ('lamar', 0.076), ('yariv', 0.076), ('embeddings', 0.071), ('corpus', 0.071), ('induction', 0.068), ('tags', 0.067), ('linguistics', 0.065), ('find', 0.064), ('embedded', 0.059), ('maps', 0.059), ('type', 0.059), ('attraction', 0.058), ('maron', 0.058), ('clusters', 0.057), ('criterion', 0.055), ('aria', 0.051), ('embodies', 0.051), ('haghighi', 0.051), ('taggers', 0.051), ('tagsets', 0.051), ('vem', 0.051), ('force', 0.05), ('mapped', 0.049), ('induced', 0.048), ('excellent', 0.047), ('heterogeneous', 0.047), ('mark', 0.046), ('annotated', 0.045), ('modal', 0.045), ('evaluation', 0.044), ('euclidean', 0.044), ('words', 0.044), ('labels', 0.043), ('frequencies', 0.043), ('fairly', 0.042), ('date', 0.042), ('penn', 0.041), ('goldwater', 0.041), ('efficiency', 0.041), ('cooccurrence', 0.041), ('rare', 0.041), ('constraint', 0.04), ('mapping', 0.039), ('types', 0.039), ('fernando', 0.038), ('sharon', 0.038), ('dimensionalities', 0.038), ('tagged', 0.038), ('exploratory', 0.038), ('brown', 0.038), ('clustering', 0.037), ('empirical', 0.036), ('accuracy', 0.036), ('ascent', 0.035), ('matches', 0.035), ('linguistic', 0.035), ('visualization', 0.035), ('intuitive', 0.034), ('morphological', 0.033), ('significantly', 0.033), ('louis', 0.033), ('distributional', 0.033), ('gradient', 0.033), ('joint', 0.033), ('co', 0.032), ('assigning', 0.032), ('website', 0.031), ('embed', 0.031), ('yielded', 0.03), ('hmm', 0.03), ('updates', 0.03), ('cluster', 0.03), ('map', 0.03), ('meeting', 0.029), ('globerson', 0.029), ('acl', 0.029), ('score', 0.029), ('update', 0.029), ('smoothly', 0.029), ('categorical', 0.029), ('objects', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 251 nips-2010-Sphere Embedding: An Application to Part-of-Speech Induction
Author: Yariv Maron, Elie Bienenstock, Michael James
Abstract: Motivated by an application to unsupervised part-of-speech tagging, we present an algorithm for the Euclidean embedding of large sets of categorical data based on co-occurrence statistics. We use the CODE model of Globerson et al. but constrain the embedding to lie on a high-dimensional unit sphere. This constraint allows for efficient optimization, even in the case of large datasets and high embedding dimensionality. Using k-means clustering of the embedded data, our approach efficiently produces state-of-the-art results. We analyze the reasons why the sphere constraint is beneficial in this application, and conjecture that these reasons might apply quite generally to other large-scale tasks. 1 Introduction The embedding of objects in a low-dimensional Euclidean space is a form of dimensionality reduction that has been used in the past mostly to create 2D representations of data for the purpose of visualization and exploratory data analysis [10, 13]. Most methods work on objects of a single type, endowed with a measure of similarity. Other methods, such as [3], embed objects of heterogeneous types, based on their co-occurrence statistics. In this paper we demonstrate that the latter can be successfully applied to unsupervised part-of-speech (POS) induction, an extensively studied, challenging problem in natural language processing [1, 4, 5, 6, 7]. The problem we address is distributional POS tagging, in which words are to be tagged based on the statistics of their immediate left and right context in a corpus (ignoring morphology and other features). The induction task is fully unsupervised, i.e., it uses no annotations. This task has been addressed in the past using a variety of methods. Some approaches, such as [1], combine a Markovian assumption with clustering. Many recent works use HMMs, perhaps due to their excellent performance on the supervised version of the task [7, 2, 5]. Using a latent-descriptor clustering approach, [15] obtain the best results to date for distributional-only unsupervised POS tagging of the widely-used WSJ corpus. Using a heterogeneous-data embedding approach for this task, we define separate embedding functions for the objects
2 0.1556434 264 nips-2010-Synergies in learning words and their referents
Author: Mark Johnson, Katherine Demuth, Bevan Jones, Michael J. Black
Abstract: This paper presents Bayesian non-parametric models that simultaneously learn to segment words from phoneme strings and learn the referents of some of those words, and shows that there is a synergistic interaction in the acquisition of these two kinds of linguistic information. The models themselves are novel kinds of Adaptor Grammars that are an extension of an embedding of topic models into PCFGs. These models simultaneously segment phoneme sequences into words and learn the relationship between non-linguistic objects to the words that refer to them. We show (i) that modelling inter-word dependencies not only improves the accuracy of the word segmentation but also of word-object relationships, and (ii) that a model that simultaneously learns word-object relationships and word segmentation segments more accurately than one that just learns word segmentation on its own. We argue that these results support an interactive view of language acquisition that can take advantage of synergies such as these. 1
3 0.15032478 285 nips-2010-Why are some word orders more common than others? A uniform information density account
Author: Luke Maurits, Dan Navarro, Amy Perfors
Abstract: Languages vary widely in many ways, including their canonical word order. A basic aspect of the observed variation is the fact that some word orders are much more common than others. Although this regularity has been recognized for some time, it has not been well-explained. In this paper we offer an informationtheoretic explanation for the observed word-order distribution across languages, based on the concept of Uniform Information Density (UID). We suggest that object-first languages are particularly disfavored because they are highly nonoptimal if the goal is to distribute information content approximately evenly throughout a sentence, and that the rest of the observed word-order distribution is at least partially explainable in terms of UID. We support our theoretical analysis with data from child-directed speech and experimental work. 1
4 0.12178835 135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks
Author: Samy Bengio, Jason Weston, David Grangier
Abstract: Multi-class classification becomes challenging at test time when the number of classes is very large and testing against every possible class can become computationally infeasible. This problem can be alleviated by imposing (or learning) a structure over the set of classes. We propose an algorithm for learning a treestructure of classifiers which, by optimizing the overall tree loss, provides superior accuracy to existing tree labeling methods. We also propose a method that learns to embed labels in a low dimensional space that is faster than non-embedding approaches and has superior accuracy to existing embedding approaches. Finally we combine the two ideas resulting in the label embedding tree that outperforms alternative methods including One-vs-Rest while being orders of magnitude faster. 1
5 0.11060884 94 nips-2010-Feature Set Embedding for Incomplete Data
Author: David Grangier, Iain Melvin
Abstract: We present a new learning strategy for classification problems in which train and/or test data suffer from missing features. In previous work, instances are represented as vectors from some feature space and one is forced to impute missing values or to consider an instance-specific subspace. In contrast, our method considers instances as sets of (feature,value) pairs which naturally handle the missing value case. Building onto this framework, we propose a classification strategy for sets. Our proposal maps (feature,value) pairs into an embedding space and then nonlinearly combines the set of embedded vectors. The embedding and the combination parameters are learned jointly on the final classification objective. This simple strategy allows great flexibility in encoding prior knowledge about the features in the embedding step and yields advantageous results compared to alternative solutions over several datasets. 1
6 0.1009852 286 nips-2010-Word Features for Latent Dirichlet Allocation
7 0.085589588 280 nips-2010-Unsupervised Kernel Dimension Reduction
8 0.062820308 29 nips-2010-An Approximate Inference Approach to Temporal Optimization in Optimal Control
9 0.058422115 209 nips-2010-Pose-Sensitive Embedding by Nonlinear NCA Regression
10 0.057189818 150 nips-2010-Learning concept graphs from text with stick-breaking priors
11 0.055419628 206 nips-2010-Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine
12 0.055278216 97 nips-2010-Functional Geometry Alignment and Localization of Brain Areas
13 0.053197399 276 nips-2010-Tree-Structured Stick Breaking for Hierarchical Data
14 0.051362757 166 nips-2010-Minimum Average Cost Clustering
15 0.049737692 283 nips-2010-Variational Inference over Combinatorial Spaces
16 0.049131893 221 nips-2010-Random Projections for $k$-means Clustering
17 0.048778508 273 nips-2010-Towards Property-Based Classification of Clustering Paradigms
18 0.047814015 230 nips-2010-Robust Clustering as Ensembles of Affinity Relations
19 0.04638527 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
20 0.045394424 112 nips-2010-Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning
topicId topicWeight
[(0, 0.154), (1, 0.047), (2, -0.009), (3, -0.028), (4, -0.104), (5, 0.023), (6, 0.066), (7, -0.042), (8, 0.005), (9, 0.006), (10, 0.063), (11, 0.013), (12, 0.056), (13, -0.041), (14, 0.054), (15, -0.018), (16, 0.052), (17, 0.001), (18, -0.111), (19, -0.134), (20, -0.007), (21, 0.006), (22, 0.12), (23, -0.004), (24, 0.215), (25, -0.115), (26, 0.082), (27, -0.057), (28, -0.001), (29, -0.021), (30, 0.116), (31, -0.056), (32, -0.148), (33, 0.066), (34, -0.039), (35, -0.104), (36, 0.12), (37, -0.066), (38, 0.009), (39, 0.019), (40, 0.07), (41, 0.085), (42, -0.034), (43, -0.073), (44, 0.09), (45, -0.018), (46, -0.001), (47, 0.07), (48, 0.131), (49, 0.006)]
simIndex simValue paperId paperTitle
same-paper 1 0.93640727 251 nips-2010-Sphere Embedding: An Application to Part-of-Speech Induction
Author: Yariv Maron, Elie Bienenstock, Michael James
Abstract: Motivated by an application to unsupervised part-of-speech tagging, we present an algorithm for the Euclidean embedding of large sets of categorical data based on co-occurrence statistics. We use the CODE model of Globerson et al. but constrain the embedding to lie on a high-dimensional unit sphere. This constraint allows for efficient optimization, even in the case of large datasets and high embedding dimensionality. Using k-means clustering of the embedded data, our approach efficiently produces state-of-the-art results. We analyze the reasons why the sphere constraint is beneficial in this application, and conjecture that these reasons might apply quite generally to other large-scale tasks. 1 Introduction The embedding of objects in a low-dimensional Euclidean space is a form of dimensionality reduction that has been used in the past mostly to create 2D representations of data for the purpose of visualization and exploratory data analysis [10, 13]. Most methods work on objects of a single type, endowed with a measure of similarity. Other methods, such as [3], embed objects of heterogeneous types, based on their co-occurrence statistics. In this paper we demonstrate that the latter can be successfully applied to unsupervised part-of-speech (POS) induction, an extensively studied, challenging problem in natural language processing [1, 4, 5, 6, 7]. The problem we address is distributional POS tagging, in which words are to be tagged based on the statistics of their immediate left and right context in a corpus (ignoring morphology and other features). The induction task is fully unsupervised, i.e., it uses no annotations. This task has been addressed in the past using a variety of methods. Some approaches, such as [1], combine a Markovian assumption with clustering. Many recent works use HMMs, perhaps due to their excellent performance on the supervised version of the task [7, 2, 5]. Using a latent-descriptor clustering approach, [15] obtain the best results to date for distributional-only unsupervised POS tagging of the widely-used WSJ corpus. Using a heterogeneous-data embedding approach for this task, we define separate embedding functions for the objects
2 0.79600644 264 nips-2010-Synergies in learning words and their referents
Author: Mark Johnson, Katherine Demuth, Bevan Jones, Michael J. Black
Abstract: This paper presents Bayesian non-parametric models that simultaneously learn to segment words from phoneme strings and learn the referents of some of those words, and shows that there is a synergistic interaction in the acquisition of these two kinds of linguistic information. The models themselves are novel kinds of Adaptor Grammars that are an extension of an embedding of topic models into PCFGs. These models simultaneously segment phoneme sequences into words and learn the relationship between non-linguistic objects to the words that refer to them. We show (i) that modelling inter-word dependencies not only improves the accuracy of the word segmentation but also of word-object relationships, and (ii) that a model that simultaneously learns word-object relationships and word segmentation segments more accurately than one that just learns word segmentation on its own. We argue that these results support an interactive view of language acquisition that can take advantage of synergies such as these. 1
3 0.76971519 285 nips-2010-Why are some word orders more common than others? A uniform information density account
Author: Luke Maurits, Dan Navarro, Amy Perfors
Abstract: Languages vary widely in many ways, including their canonical word order. A basic aspect of the observed variation is the fact that some word orders are much more common than others. Although this regularity has been recognized for some time, it has not been well-explained. In this paper we offer an informationtheoretic explanation for the observed word-order distribution across languages, based on the concept of Uniform Information Density (UID). We suggest that object-first languages are particularly disfavored because they are highly nonoptimal if the goal is to distribute information content approximately evenly throughout a sentence, and that the rest of the observed word-order distribution is at least partially explainable in terms of UID. We support our theoretical analysis with data from child-directed speech and experimental work. 1
4 0.59596825 125 nips-2010-Inference and communication in the game of Password
Author: Yang Xu, Charles Kemp
Abstract: Communication between a speaker and hearer will be most efficient when both parties make accurate inferences about the other. We study inference and communication in a television game called Password, where speakers must convey secret words to hearers by providing one-word clues. Our working hypothesis is that human communication is relatively efficient, and we use game show data to examine three predictions. First, we predict that speakers and hearers are both considerate, and that both take the other’s perspective into account. Second, we predict that speakers and hearers are calibrated, and that both make accurate assumptions about the strategy used by the other. Finally, we predict that speakers and hearers are collaborative, and that they tend to share the cognitive burden of communication equally. We find evidence in support of all three predictions, and demonstrate in addition that efficient communication tends to break down when speakers and hearers are placed under time pressure.
5 0.54020536 286 nips-2010-Word Features for Latent Dirichlet Allocation
Author: James Petterson, Wray Buntine, Shravan M. Narayanamurthy, Tibério S. Caetano, Alex J. Smola
Abstract: We extend Latent Dirichlet Allocation (LDA) by explicitly allowing for the encoding of side information in the distribution over words. This results in a variety of new capabilities, such as improved estimates for infrequently occurring words, as well as the ability to leverage thesauri and dictionaries in order to boost topic cohesion within and across languages. We present experiments on multi-language topic synchronisation where dictionary information is used to bias corresponding words towards similar topics. Results indicate that our model substantially improves topic cohesion when compared to the standard LDA model. 1
6 0.41846237 207 nips-2010-Phoneme Recognition with Large Hierarchical Reservoirs
7 0.3988398 107 nips-2010-Global seismic monitoring as probabilistic inference
8 0.3981038 155 nips-2010-Learning the context of a category
9 0.38599417 135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks
10 0.3856723 94 nips-2010-Feature Set Embedding for Incomplete Data
11 0.38506737 61 nips-2010-Direct Loss Minimization for Structured Prediction
12 0.37702185 28 nips-2010-An Alternative to Low-level-Sychrony-Based Methods for Speech Detection
13 0.35590199 283 nips-2010-Variational Inference over Combinatorial Spaces
14 0.3496781 289 nips-2010-b-Bit Minwise Hashing for Estimating Three-Way Similarities
15 0.32723099 255 nips-2010-Static Analysis of Binary Executables Using Structural SVMs
16 0.32135558 209 nips-2010-Pose-Sensitive Embedding by Nonlinear NCA Regression
17 0.32124975 62 nips-2010-Discriminative Clustering by Regularized Information Maximization
18 0.31947279 206 nips-2010-Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine
19 0.30238265 157 nips-2010-Learning to localise sounds with spiking neural networks
20 0.30133486 280 nips-2010-Unsupervised Kernel Dimension Reduction
topicId topicWeight
[(0, 0.268), (13, 0.06), (17, 0.019), (27, 0.077), (30, 0.129), (35, 0.021), (45, 0.185), (50, 0.044), (52, 0.037), (60, 0.037), (77, 0.03), (90, 0.022)]
simIndex simValue paperId paperTitle
1 0.83371025 14 nips-2010-A Reduction from Apprenticeship Learning to Classification
Author: Umar Syed, Robert E. Schapire
Abstract: We provide new theoretical results for apprenticeship learning, a variant of reinforcement learning in which the true reward function is unknown, and the goal is to perform well relative to an observed expert. We study a common approach to learning from expert demonstrations: using a classification algorithm to learn to imitate the expert’s behavior. Although this straightforward learning strategy is widely-used in practice, it has been subject to very little formal analysis. We prove that, if the learned classifier has error rate ǫ, the difference between the value of the apprentice’s policy and the expert’s policy is O(√ǫ). Further, we prove that this difference is only O(ǫ) when the expert’s policy is close to optimal. This latter result has an important practical consequence: Not only does imitating a near-optimal expert result in a better policy, but far fewer demonstrations are required to successfully imitate such an expert. This suggests an opportunity for substantial savings whenever the expert is known to be good, but demonstrations are expensive or difficult to obtain. 1
same-paper 2 0.78357518 251 nips-2010-Sphere Embedding: An Application to Part-of-Speech Induction
Author: Yariv Maron, Elie Bienenstock, Michael James
Abstract: Motivated by an application to unsupervised part-of-speech tagging, we present an algorithm for the Euclidean embedding of large sets of categorical data based on co-occurrence statistics. We use the CODE model of Globerson et al. but constrain the embedding to lie on a high-dimensional unit sphere. This constraint allows for efficient optimization, even in the case of large datasets and high embedding dimensionality. Using k-means clustering of the embedded data, our approach efficiently produces state-of-the-art results. We analyze the reasons why the sphere constraint is beneficial in this application, and conjecture that these reasons might apply quite generally to other large-scale tasks. 1 Introduction The embedding of objects in a low-dimensional Euclidean space is a form of dimensionality reduction that has been used in the past mostly to create 2D representations of data for the purpose of visualization and exploratory data analysis [10, 13]. Most methods work on objects of a single type, endowed with a measure of similarity. Other methods, such as [3], embed objects of heterogeneous types, based on their co-occurrence statistics. In this paper we demonstrate that the latter can be successfully applied to unsupervised part-of-speech (POS) induction, an extensively studied, challenging problem in natural language processing [1, 4, 5, 6, 7]. The problem we address is distributional POS tagging, in which words are to be tagged based on the statistics of their immediate left and right context in a corpus (ignoring morphology and other features). The induction task is fully unsupervised, i.e., it uses no annotations. This task has been addressed in the past using a variety of methods. Some approaches, such as [1], combine a Markovian assumption with clustering. Many recent works use HMMs, perhaps due to their excellent performance on the supervised version of the task [7, 2, 5]. Using a latent-descriptor clustering approach, [15] obtain the best results to date for distributional-only unsupervised POS tagging of the widely-used WSJ corpus. Using a heterogeneous-data embedding approach for this task, we define separate embedding functions for the objects
3 0.75045323 5 nips-2010-A Dirty Model for Multi-task Learning
Author: Ali Jalali, Sujay Sanghavi, Chao Ruan, Pradeep K. Ravikumar
Abstract: We consider multi-task learning in the setting of multiple linear regression, and where some relevant features could be shared across the tasks. Recent research has studied the use of ℓ1 /ℓq norm block-regularizations with q > 1 for such blocksparse structured problems, establishing strong guarantees on recovery even under high-dimensional scaling where the number of features scale with the number of observations. However, these papers also caution that the performance of such block-regularized methods are very dependent on the extent to which the features are shared across tasks. Indeed they show [8] that if the extent of overlap is less than a threshold, or even if parameter values in the shared features are highly uneven, then block ℓ1 /ℓq regularization could actually perform worse than simple separate elementwise ℓ1 regularization. Since these caveats depend on the unknown true parameters, we might not know when and which method to apply. Even otherwise, we are far away from a realistic multi-task setting: not only do the set of relevant features have to be exactly the same across tasks, but their values have to as well. Here, we ask the question: can we leverage parameter overlap when it exists, but not pay a penalty when it does not ? Indeed, this falls under a more general question of whether we can model such dirty data which may not fall into a single neat structural bracket (all block-sparse, or all low-rank and so on). With the explosion of such dirty high-dimensional data in modern settings, it is vital to develop tools – dirty models – to perform biased statistical estimation tailored to such data. Here, we take a first step, focusing on developing a dirty model for the multiple regression problem. Our method uses a very simple idea: we estimate a superposition of two sets of parameters and regularize them differently. We show both theoretically and empirically, our method strictly and noticeably outperforms both ℓ1 or ℓ1 /ℓq methods, under high-dimensional scaling and over the entire range of possible overlaps (except at boundary cases, where we match the best method). 1 Introduction: Motivation and Setup High-dimensional scaling. In fields across science and engineering, we are increasingly faced with problems where the number of variables or features p is larger than the number of observations n. Under such high-dimensional scaling, for any hope of statistically consistent estimation, it becomes vital to leverage any potential structure in the problem such as sparsity (e.g. in compressed sensing [3] and LASSO [14]), low-rank structure [13, 9], or sparse graphical model structure [12]. It is in such high-dimensional contexts in particular that multi-task learning [4] could be most useful. Here, 1 multiple tasks share some common structure such as sparsity, and estimating these tasks jointly by leveraging this common structure could be more statistically efficient. Block-sparse Multiple Regression. A common multiple task learning setting, and which is the focus of this paper, is that of multiple regression, where we have r > 1 response variables, and a common set of p features or covariates. The r tasks could share certain aspects of their underlying distributions, such as common variance, but the setting we focus on in this paper is where the response variables have simultaneously sparse structure: the index set of relevant features for each task is sparse; and there is a large overlap of these relevant features across the different regression problems. 
Such “simultaneous sparsity” arises in a variety of contexts [15]; indeed, most applications of sparse signal recovery in contexts ranging from graphical model learning, kernel learning, and function estimation have natural extensions to the simultaneous-sparse setting [12, 2, 11]. It is useful to represent the multiple regression parameters via a matrix, where each column corresponds to a task, and each row to a feature. Having simultaneous sparse structure then corresponds to the matrix being largely “block-sparse” – where each row is either all zero or mostly non-zero, and the number of non-zero rows is small. A lot of recent research in this setting has focused on ℓ1 /ℓq norm regularizations, for q > 1, that encourage the parameter matrix to have such blocksparse structure. Particular examples include results using the ℓ1 /ℓ∞ norm [16, 5, 8], and the ℓ1 /ℓ2 norm [7, 10]. Dirty Models. Block-regularization is “heavy-handed” in two ways. By strictly encouraging sharedsparsity, it assumes that all relevant features are shared, and hence suffers under settings, arguably more realistic, where each task depends on features specific to itself in addition to the ones that are common. The second concern with such block-sparse regularizers is that the ℓ1 /ℓq norms can be shown to encourage the entries in the non-sparse rows taking nearly identical values. Thus we are far away from the original goal of multitask learning: not only do the set of relevant features have to be exactly the same, but their values have to as well. Indeed recent research into such regularized methods [8, 10] caution against the use of block-regularization in regimes where the supports and values of the parameters for each task can vary widely. Since the true parameter values are unknown, that would be a worrisome caveat. We thus ask the question: can we learn multiple regression models by leveraging whatever overlap of features there exist, and without requiring the parameter values to be near identical? Indeed this is an instance of a more general question on whether we can estimate statistical models where the data may not fall cleanly into any one structural bracket (sparse, block-sparse and so on). With the explosion of dirty high-dimensional data in modern settings, it is vital to investigate estimation of corresponding dirty models, which might require new approaches to biased high-dimensional estimation. In this paper we take a first step, focusing on such dirty models for a specific problem: simultaneously sparse multiple regression. Our approach uses a simple idea: while any one structure might not capture the data, a superposition of structural classes might. Our method thus searches for a parameter matrix that can be decomposed into a row-sparse matrix (corresponding to the overlapping or shared features) and an elementwise sparse matrix (corresponding to the non-shared features). As we show both theoretically and empirically, with this simple fix we are able to leverage any extent of shared features, while allowing disparities in support and values of the parameters, so that we are always better than both the Lasso or block-sparse regularizers (at times remarkably so). The rest of the paper is organized as follows: In Sec 2. basic definitions and setup of the problem are presented. Main results of the paper is discussed in sec 3. Experimental results and simulations are demonstrated in Sec 4. Notation: For any matrix M , we denote its j th row as Mj , and its k-th column as M (k) . 
The set of all non-zero rows (i.e. all rows with at least one non-zero element) is denoted by RowSupp(M ) (k) and its support by Supp(M ). Also, for any matrix M , let M 1,1 := j,k |Mj |, i.e. the sums of absolute values of the elements, and M 1,∞ := j 2 Mj ∞ where, Mj ∞ (k) := maxk |Mj |. 2 Problem Set-up and Our Method Multiple regression. We consider the following standard multiple linear regression model: ¯ y (k) = X (k) θ(k) + w(k) , k = 1, . . . , r, where y (k) ∈ Rn is the response for the k-th task, regressed on the design matrix X (k) ∈ Rn×p (possibly different across tasks), while w(k) ∈ Rn is the noise vector. We assume each w(k) is drawn independently from N (0, σ 2 ). The total number of tasks or target variables is r, the number of features is p, while the number of samples we have for each task is n. For notational convenience, ¯ we collate these quantities into matrices Y ∈ Rn×r for the responses, Θ ∈ Rp×r for the regression n×r parameters and W ∈ R for the noise. ¯ Dirty Model. In this paper we are interested in estimating the true parameter Θ from data by lever¯ aging any (unknown) extent of simultaneous-sparsity. In particular, certain rows of Θ would have many non-zero entries, corresponding to features shared by several tasks (“shared” rows), while certain rows would be elementwise sparse, corresponding to those features which are relevant for some tasks but not all (“non-shared rows”), while certain rows would have all zero entries, corresponding to those features that are not relevant to any task. We are interested in estimators Θ that automatically adapt to different levels of sharedness, and yet enjoy the following guarantees: Support recovery: We say an estimator Θ successfully recovers the true signed support if ¯ sign(Supp(Θ)) = sign(Supp(Θ)). We are interested in deriving sufficient conditions under which ¯ the estimator succeeds. We note that this is stronger than merely recovering the row-support of Θ, which is union of its supports for the different tasks. In particular, denoting Uk for the support of the ¯ k-th column of Θ, and U = k Uk . Error bounds: We are also interested in providing bounds on the elementwise ℓ∞ norm error of the estimator Θ, ¯ Θ−Θ 2.1 ∞ = max max j=1,...,p k=1,...,r (k) Θj (k) ¯ − Θj . Our Method Our method explicitly models the dirty block-sparse structure. We estimate a sum of two parameter matrices B and S with different regularizations for each: encouraging block-structured row-sparsity in B and elementwise sparsity in S. The corresponding “clean” models would either just use blocksparse regularizations [8, 10] or just elementwise sparsity regularizations [14, 18], so that either method would perform better in certain suited regimes. Interestingly, as we will see in the main results, by explicitly allowing to have both block-sparse and elementwise sparse component, we are ¯ able to outperform both classes of these “clean models”, for all regimes Θ. Algorithm 1 Dirty Block Sparse Solve the following convex optimization problem: (S, B) ∈ arg min S,B 1 2n r k=1 y (k) − X (k) S (k) + B (k) 2 2 + λs S 1,1 + λb B 1,∞ . (1) Then output Θ = B + S. 3 Main Results and Their Consequences We now provide precise statements of our main results. A number of recent results have shown that the Lasso [14, 18] and ℓ1 /ℓ∞ block-regularization [8] methods succeed in recovering signed supports with controlled error bounds under high-dimensional scaling regimes. Our first two theorems extend these results to our dirty model setting. 
In Theorem 1, we consider the case of deterministic design matrices X (k) , and provide sufficient conditions guaranteeing signed support recovery, and elementwise ℓ∞ norm error bounds. In Theorem 2, we specialize this theorem to the case where the 3 rows of the design matrices are random from a general zero mean Gaussian distribution: this allows us to provide scaling on the number of observations required in order to guarantee signed support recovery and bounded elementwise ℓ∞ norm error. Our third result is the most interesting in that it explicitly quantifies the performance gains of our method vis-a-vis Lasso and the ℓ1 /ℓ∞ block-regularization method. Since this entailed finding the precise constants underlying earlier theorems, and a correspondingly more delicate analysis, we follow Negahban and Wainwright [8] and focus on the case where there are two-tasks (i.e. r = 2), and where we have standard Gaussian design matrices as in Theorem 2. Further, while each of two tasks depends on s features, only a fraction α of these are common. It is then interesting to see how the behaviors of the different regularization methods vary with the extent of overlap α. Comparisons. Negahban and Wainwright [8] show that there is actually a “phase transition” in the scaling of the probability of successful signed support-recovery with the number of observations. n Denote a particular rescaling of the sample-size θLasso (n, p, α) = s log(p−s) . Then as Wainwright [18] show, when the rescaled number of samples scales as θLasso > 2 + δ for any δ > 0, Lasso succeeds in recovering the signed support of all columns with probability converging to one. But when the sample size scales as θLasso < 2−δ for any δ > 0, Lasso fails with probability converging to one. For the ℓ1 /ℓ∞ -reguralized multiple linear regression, define a similar rescaled sample size n θ1,∞ (n, p, α) = s log(p−(2−α)s) . Then as Negahban and Wainwright [8] show there is again a transition in probability of success from near zero to near one, at the rescaled sample size of θ1,∞ = (4 − 3α). Thus, for α < 2/3 (“less sharing”) Lasso would perform better since its transition is at a smaller sample size, while for α > 2/3 (“more sharing”) the ℓ1 /ℓ∞ regularized method would perform better. As we show in our third theorem, the phase transition for our method occurs at the rescaled sample size of θ1,∞ = (2 − α), which is strictly before either the Lasso or the ℓ1 /ℓ∞ regularized method except for the boundary cases: α = 0, i.e. the case of no sharing, where we match Lasso, and for α = 1, i.e. full sharing, where we match ℓ1 /ℓ∞ . Everywhere else, we strictly outperform both methods. Figure 3 shows the empirical performance of each of the three methods; as can be seen, they agree very well with the theoretical analysis. (Further details in the experiments Section 4). 3.1 Sufficient Conditions for Deterministic Designs We first consider the case where the design matrices X (k) for k = 1, · · ·, r are deterministic, and start by specifying the assumptions we impose on the model. We note that similar sufficient conditions for the deterministic X (k) ’s case were imposed in papers analyzing Lasso [18] and block-regularization methods [8, 10]. (k) A0 Column Normalization Xj 2 ≤ √ 2n for all j = 1, . . . , p, k = 1, . . . , r. ¯ Let Uk denote the support of the k-th column of Θ, and U = supports for each task. 
Then we require that k r A1 Incoherence Condition γb := 1 − max c j∈U (k) (k) Xj , XUk (k) (k) XUk , XUk Uk denote the union of −1 c We will also find it useful to define γs := 1−max1≤k≤r maxj∈Uk (k) > 0. 1 k=1 (k) Xj , XUk Note that by the incoherence condition A1, we have γs > 0. A2 Eigenvalue Condition Cmin := min λmin 1≤k≤r A3 Boundedness Condition Dmax := max 1≤k≤r 1 (k) (k) XUk , XUk n 1 (k) (k) XUk , XUk n (k) (k) XUk , XUk −1 . 1 > 0. −1 ∞,1 < ∞. Further, we require the regularization penalties be set as λs > 2(2 − γs )σ log(pr) √ γs n and 4 λb > 2(2 − γb )σ log(pr) √ . γb n (2) 1 0.9 0.8 0.8 Dirty Model L1/Linf Reguralizer Probability of Success Probability of Success 1 0.9 0.7 0.6 0.5 0.4 LASSO 0.3 0.2 0 0.5 1 1.5 1.7 2 2.5 Control Parameter θ 3 3.1 3.5 0.6 0.5 0.4 L1/Linf Reguralizer 0.3 LASSO 0.2 p=128 p=256 p=512 0.1 Dirty Model 0.7 p=128 p=256 p=512 0.1 0 0.5 4 1 1.333 (a) α = 0.3 1.5 2 Control Parameter θ (b) α = 2.5 3 2 3 1 0.9 Dirty Model Probability of Success 0.8 0.7 L1/Linf Reguralizer 0.6 0.5 LASSO 0.4 0.3 0.2 p=128 p=256 p=512 0.1 0 0.5 1 1.2 1.5 1.6 2 Control Parameter θ 2.5 (c) α = 0.8 Figure 1: Probability of success in recovering the true signed support using dirty model, Lasso and ℓ1 /ℓ∞ regularizer. For a 2-task problem, the probability of success for different values of feature-overlap fraction α is plotted. As we can see in the regimes that Lasso is better than, as good as and worse than ℓ1 /ℓ∞ regularizer ((a), (b) and (c) respectively), the dirty model outperforms both of the methods, i.e., it requires less number of observations for successful recovery of the true signed support compared to Lasso and ℓ1 /ℓ∞ regularizer. Here p s = ⌊ 10 ⌋ always. Theorem 1. Suppose A0-A3 hold, and that we obtain estimate Θ from our algorithm with regularization parameters chosen according to (2). Then, with probability at least 1 − c1 exp(−c2 n) → 1, we are guaranteed that the convex program (1) has a unique optimum and (a) The estimate Θ has no false inclusions, and has bounded ℓ∞ norm error so that ¯ Supp(Θ) ⊆ Supp(Θ), and ¯ Θ−Θ ∞,∞ 4σ 2 log (pr) + λs Dmax . n Cmin ≤ bmin ¯ (b) sign(Supp(Θ)) = sign Supp(Θ) provided that min ¯ (j,k)∈Supp(Θ) ¯(k) θj > bmin . Here the positive constants c1 , c2 depend only on γs , γb , λs , λb and σ, but are otherwise independent of n, p, r, the problem dimensions of interest. Remark: Condition (a) guarantees that the estimate will have no false inclusions; i.e. all included features will be relevant. If in addition, we require that it have no false exclusions and that recover the support exactly, we need to impose the assumption in (b) that the non-zero elements are large enough to be detectable above the noise. 3.2 General Gaussian Designs Often the design matrices consist of samples from a Gaussian ensemble. Suppose that for each task (k) k = 1, . . . , r the design matrix X (k) ∈ Rn×p is such that each row Xi ∈ Rp is a zero-mean Gaussian random vector with covariance matrix Σ(k) ∈ Rp×p , and is independent of every other (k) row. Let ΣV,U ∈ R|V|×|U | be the submatrix of Σ(k) with rows corresponding to V and columns to U . We require these covariance matrices to satisfy the following conditions: r C1 Incoherence Condition γb := 1 − max c j∈U (k) (k) Σj,Uk , ΣUk ,Uk k=1 5 −1 >0 1 C2 Eigenvalue Condition Cmin := min λmin Σ(k),Uk Uk > 0 so that the minimum eigenvalue 1≤k≤r is bounded away from zero. C3 Boundedness Condition Dmax := (k) ΣUk ,Uk −1 ∞,1 < ∞. 
3.2 General Gaussian Designs
Often the design matrices consist of samples from a Gaussian ensemble. Suppose that for each task k = 1, . . . , r the design matrix X^(k) ∈ R^{n×p} is such that each row X_i^(k) ∈ R^p is a zero-mean Gaussian random vector with covariance matrix Σ^(k) ∈ R^{p×p}, and is independent of every other row. Let Σ^(k)_{V,U} ∈ R^{|V|×|U|} be the submatrix of Σ^(k) with rows corresponding to V and columns to U. We require these covariance matrices to satisfy the following conditions:
C1 Incoherence Condition: γ_b := 1 − max_{j∈U^c} Σ_{k=1}^{r} ‖Σ^(k)_{j,U_k} (Σ^(k)_{U_k,U_k})^{-1}‖_1 > 0.
C2 Eigenvalue Condition: C_min := min_{1≤k≤r} λ_min(Σ^(k)_{U_k,U_k}) > 0, so that the minimum eigenvalue is bounded away from zero.
C3 Boundedness Condition: D_max := max_{1≤k≤r} ‖(Σ^(k)_{U_k,U_k})^{-1}‖_{∞,1} < ∞.
These conditions are analogues of the conditions for deterministic designs; they are now imposed on the covariance matrices of the (randomly generated) rows of the design matrices. Further, defining s := max_k |U_k|, we require the regularization penalties be set as
  λ_s > √(4σ² C_min log(pr)) / (γ_s [√(n C_min) − √(2 s log(pr))])   and   λ_b > √(4σ² C_min r (r log 2 + log p)) / (γ_b [√(n C_min) − √(2 s r (r log 2 + log p))]).   (3)
Theorem 2. Suppose assumptions C1–C3 hold, and that the number of samples scales as n > max{ 2 s log(pr) / (C_min γ_s²), 2 s r (r log 2 + log p) / (C_min γ_b²) }. Suppose we obtain the estimate Θ̂ from our algorithm with regularization parameters chosen according to (3). Then, with probability at least 1 − c_1 exp(−c_2 (r log 2 + log p)) − c_3 exp(−c_4 log(rs)) → 1 for some positive numbers c_1–c_4, we are guaranteed that the estimate Θ̂ is unique and satisfies the following conditions:
(a) The estimate Θ̂ has no false inclusions, and has bounded ℓ∞-norm error, so that
  Supp(Θ̂) ⊆ Supp(Θ̄), and ‖Θ̂ − Θ̄‖_{∞,∞} ≤ √(50σ² log(rs) / (n C_min)) + λ_s (√(4s / (C_min n)) + D_max) =: g_min.
(b) sign(Supp(Θ̂)) = sign(Supp(Θ̄)), provided that min_{(j,k)∈Supp(Θ̄)} |θ̄_j^(k)| > g_min.
3.3 Sharp Transition for 2-Task Gaussian Designs
This is one of the most important results of this paper. Here, we perform a more delicate and finer analysis to establish precise quantitative gains of our method. We focus on the special case where r = 2 and the design matrix has rows generated from the standard Gaussian distribution N(0, I_{p×p}), so that C1–C3 hold with C_min = D_max = 1. As we will see both analytically and experimentally, our method strictly outperforms both Lasso and ℓ1/ℓ∞ block-regularization in all cases, except at the extreme endpoints of no support sharing (where it matches Lasso) and full support sharing (where it matches ℓ1/ℓ∞). We now present our analytical results; the empirical comparisons are presented next, in Section 4. The results will be in terms of a particular rescaling of the sample size n,
  θ(n, p, s, α) := n / ((2 − α) s log(p − (2 − α)s)).
We will also require the assumptions that
  F1: λ_s > √(4σ² (1 − s/n)(log r + log(p − (2 − α)s))) / (√n − √s − √((2 − α) s (log r + log(p − (2 − α)s)))),
  F2: λ_b > √(4σ² (1 − s/n) r (r log 2 + log(p − (2 − α)s))) / (√n − √s − √((1 − α/2) s r (r log 2 + log(p − (2 − α)s)))).
Theorem 3. Consider a 2-task regression problem (n, p, s, α), where the design matrix has rows generated from the standard Gaussian distribution N(0, I_{p×p}). Suppose max_{j∈B*} |θ_j^{*(1)} − θ_j^{*(2)}| = o(λ_s), where B* is the submatrix of Θ* with rows where both entries are non-zero. Then the estimate Θ̂ of the problem (1) satisfies the following:
(Success) Suppose the regularization coefficients satisfy F1–F2. Further, assume that the number of samples scales as θ(n, p, s, α) > 1. Then, with probability at least 1 − c_1 exp(−c_2 n) for some positive numbers c_1 and c_2, we are guaranteed that Θ̂ satisfies the support-recovery and ℓ∞ error-bound conditions (a)–(b) in Theorem 2.
(Failure) If θ(n, p, s, α) < 1, there is no solution (B̂, Ŝ), for any choices of λ_s and λ_b, such that sign(Supp(Θ̂)) = sign(Supp(Θ̄)).
We note that we require the gap |θ_j^{*(1)} − θ_j^{*(2)}| to be small only on rows where both entries are non-zero. As we show in a more general theorem in the appendix, even in the case where the gap is large, the dependence of the sample scaling on the gap is quite weak.
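As a quick numerical illustration of Theorem 3, the sketch below (ours; the values of n, p, s, α and σ are made up for illustration) evaluates the rescaled sample size θ(n, p, s, α) and the smallest penalty levels admitted by F1–F2 as reconstructed above.

```python
# Sketch (ours): evaluate the Theorem 3 rescaling and the F1-F2 penalty lower bounds
# (as reconstructed above) for one illustrative 2-task configuration.
import numpy as np

def rescaled_sample_size(n, p, s, alpha):
    return n / ((2.0 - alpha) * s * np.log(p - (2.0 - alpha) * s))

def penalty_lower_bounds(n, p, s, alpha, sigma=1.0, r=2):
    """Smallest lambda_s, lambda_b admitted by F1 and F2 (our reading of those conditions)."""
    ls = np.log(r) + np.log(p - (2.0 - alpha) * s)
    lb = r * np.log(2.0) + np.log(p - (2.0 - alpha) * s)
    lam_s = np.sqrt(4.0 * sigma**2 * (1.0 - s / n) * ls) / (
        np.sqrt(n) - np.sqrt(s) - np.sqrt((2.0 - alpha) * s * ls))
    lam_b = np.sqrt(4.0 * sigma**2 * (1.0 - s / n) * r * lb) / (
        np.sqrt(n) - np.sqrt(s) - np.sqrt((1.0 - alpha / 2.0) * s * r * lb))
    return lam_s, lam_b

n, p, s, alpha = 800, 256, 25, 2.0 / 3.0
theta = rescaled_sample_size(n, p, s, alpha)
print("theta =", round(theta, 2), "(success predicted)" if theta > 1 else "(failure predicted)")
print("lambda_s, lambda_b must exceed", penalty_lower_bounds(n, p, s, alpha))
```

Note that for small n the denominators in F1–F2 become negative, in which case these particular bounds admit no penalty level; they are finite-sample conditions with unoptimized constants.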
4 Empirical Results
In this section, we investigate the performance of our dirty block-sparse estimator on synthetic and real-world data. The synthetic experiments explore the accuracy of Theorem 3, and compare our estimator with LASSO and the ℓ1/ℓ∞ regularizer; we see that Theorem 3 is indeed very accurate. Next, we apply our method to a real-world dataset containing handwritten digits, and again compare against LASSO and the ℓ1/ℓ∞ regularizer, showing that the dirty model outperforms both in practice. For each method, the parameters are chosen via cross-validation; see the supplemental material for more details.
4.1 Synthetic Data Simulation
We consider a 2-task (r = 2) regression problem as discussed in Theorem 3, for a range of parameters (n, p, s, α). Each entry of the design matrices X^(k) is i.i.d. Gaussian with mean 0 and variance 1. For each fixed set of (n, s, p, α), we generate 100 instances of the problem. In each instance, given p, s, α, the locations of the non-zero entries of the true Θ̄ are chosen at random; each non-zero entry is then drawn i.i.d. Gaussian with mean 0 and variance 1, and n samples are generated from this model. We then attempt to estimate Θ̄ using three methods: our dirty model, the ℓ1/ℓ∞ regularizer, and LASSO. In each case, and for each instance, the regularization coefficients are found by cross-validation. After solving the three problems, we compare the signed support of the solution with the true signed support and decide whether or not the program was successful in signed support recovery. We describe this process in more detail below; a sketch of a single trial is given after Figure 2.
Performance Analysis: We ran the algorithm for three values of the overlap ratio α ∈ {0.3, 2/3, 0.8} and three values of the number of features p ∈ {128, 256, 512}. For any instance of the problem (n, p, s, α), if the recovered matrix Θ̂ has the same signed support as the true Θ̄, we count it as a success, and otherwise as a failure (even a single element with a different sign counts as a failure). As Theorem 3 predicts and Fig. 1 shows, the right scaling for the number of observations is n / (s log(p − (2 − α)s)): under this rescaling, all curves stack on top of each other and transition at 2 − α. Also, the number of observations required by the dirty model for true signed support recovery is always smaller than that required by LASSO and the ℓ1/ℓ∞ regularizer. Fig. 1(a) shows the probability of success for the case α = 0.3 (where LASSO is better than the ℓ1/ℓ∞ regularizer); the dirty model outperforms both methods. When α = 2/3 (see Fig. 1(b)), LASSO and the ℓ1/ℓ∞ regularizer perform the same, but the dirty model requires almost 33% fewer observations for the same performance. As α grows toward 1, e.g., α = 0.8 as shown in Fig. 1(c), ℓ1/ℓ∞ performs better than LASSO; still, the dirty model performs better than both methods in this case as well.
[Figure 2: phase transition threshold plotted against the shared-support parameter α for the dirty model, LASSO and the ℓ1/ℓ∞ regularizer, with p ∈ {128, 256, 512}.]
Figure 2: Verification of the result of Theorem 3 on the behavior of the phase transition threshold as the parameter α is varied in a 2-task (n, p, s, α) problem, for the dirty model, LASSO and the ℓ1/ℓ∞ regularizer. The y-axis is n / (s log(p − (2 − α)s)), where n is the number of samples at which the threshold was observed. Here s = ⌊p/10⌋. Our dirty model shows a gain in sample complexity over the entire range of sharing α. The pre-constant in Theorem 3 is also validated.
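Below is a minimal sketch (ours, in Python with NumPy and CVXPY, not the authors' code) of a single trial of this procedure. It assumes the dirty-model program (1) minimizes the per-task squared losses plus λ_s‖S‖_{1,1} + λ_b‖B‖_{1,∞} over the decomposition Θ = B + S; the penalty values, support-detection tolerance, and variable names are our own choices for illustration, whereas in the experiments above the penalties are chosen by cross-validation.

```python
# One synthetic trial (our sketch): generate a 2-task problem, fit B + S with CVXPY
# using our reading of program (1), and test exact signed-support recovery.
import numpy as np
import cvxpy as cp

def one_trial(n=200, p=128, s=12, alpha=2 / 3, sigma=0.5,
              lam_s=0.1, lam_b=0.1, tol=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    r, n_shared = 2, int(round(alpha * s))
    n_own = s - n_shared
    perm = rng.permutation(p)
    shared = perm[:n_shared]
    own = [perm[n_shared + k * n_own: n_shared + (k + 1) * n_own] for k in range(r)]
    theta_true = np.zeros((p, r))
    shared_vals = rng.standard_normal(n_shared)
    for k in range(r):
        theta_true[shared, k] = shared_vals        # identical on shared rows (small-gap assumption)
        theta_true[own[k], k] = rng.standard_normal(n_own)
    X = [rng.standard_normal((n, p)) for _ in range(r)]
    y = [X[k] @ theta_true[:, k] + sigma * rng.standard_normal(n) for k in range(r)]

    B, S = cp.Variable((p, r)), cp.Variable((p, r))
    loss = sum(cp.sum_squares(y[k] - X[k] @ (B[:, k] + S[:, k])) for k in range(r)) / (2 * n)
    penalty = lam_s * cp.sum(cp.abs(S)) + lam_b * cp.sum(cp.max(cp.abs(B), axis=1))
    cp.Problem(cp.Minimize(loss + penalty)).solve()

    theta_hat = B.value + S.value
    signed = np.sign(np.where(np.abs(theta_hat) > tol, theta_hat, 0.0))
    return np.array_equal(signed, np.sign(theta_true))

print("signed support recovered:", one_trial())
```

Repeating this trial 100 times per (n, p, s, α) and recording the success fraction yields curves of the kind shown in Figure 1; the 50% crossing of those curves, found by interpolation, is what is reported as the phase transition threshold in the scaling verification below.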
n    Metric                         Our Model            ℓ1/ℓ∞    LASSO
10   Average classification error   8.6%                 9.9%     10.8%
10   Variance of error              0.53%                0.64%    0.51%
10   Average row support size       B: 165, B+S: 171     170      123
10   Average support size           S: 18, B+S: 1651     1700     539
20   Average classification error   3.0%                 3.5%     4.1%
20   Variance of error              0.56%                0.62%    0.68%
20   Average row support size       B: 211, B+S: 226     217      173
20   Average support size           S: 34, B+S: 2118     2165     821
40   Average classification error   2.2%                 3.2%     2.8%
40   Variance of error              0.57%                0.68%    0.85%
40   Average row support size       B: 270, B+S: 299     368      354
40   Average support size           S: 67, B+S: 2761     3669     2053
Table 1: Handwriting classification results for our model, ℓ1/ℓ∞ and LASSO.
Scaling Verification: To verify that the phase transition threshold changes linearly with α, as predicted by Theorem 3, we plot the phase transition threshold versus α. For five values of α ∈ {0.05, 0.3, 2/3, 0.8, 0.95} and three values of p ∈ {128, 256, 512}, we find the phase transition threshold for the dirty model, LASSO and the ℓ1/ℓ∞ regularizer. We take the point where the probability of success in recovering the signed support exceeds 50% as the phase transition threshold, and find this point by interpolating between the two closest points. Fig. 2 shows that the phase transition threshold for the dirty model is always lower than the phase transition thresholds for LASSO and the ℓ1/ℓ∞ regularizer.
4.2 Handwritten Digits Dataset
We use the handwritten digit dataset [1], containing features of handwritten numerals (0–9) extracted from a collection of Dutch utility maps. This dataset has been used by a number of papers [17, 6] as a reliable benchmark for handwriting recognition algorithms. There are thus r = 10 tasks, and each handwritten sample consists of p = 649 features. Table 1 shows the results of our analysis for different sizes n of the training set. We measure the classification error for each digit to obtain a 10-vector of errors, and report the average and the variance of this error vector to show how the error is distributed over the tasks. We compare our method with the ℓ1/ℓ∞ regularizer and LASSO; again, for all methods, parameters are chosen via cross-validation. For our method we separate out the B and S matrices that it finds, so as to illustrate how many features it identifies as “shared” and how many as “non-shared”. For the other methods we report the straight row-support and support numbers, since they do not make such a separation.
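For concreteness, here is a small sketch (ours) of how the summary statistics in Table 1 could be computed from a per-digit error vector and the estimated B and S matrices. Reading “row support size” as the number of rows with at least one non-zero entry, “support size” as the total number of non-zero entries, and the numerical tolerance are our assumptions.

```python
# Sketch (ours) of the statistics reported in Table 1, given a per-digit error vector
# and the estimated B and S matrices (each of shape p x r).
import numpy as np

def table_statistics(per_task_errors, B, S, tol=1e-6):
    errors = np.asarray(per_task_errors)        # shape (r,): one error rate per digit task
    nz_B, nz_S = np.abs(B) > tol, np.abs(S) > tol
    nz_total = nz_B | nz_S                      # support of Theta = B + S (ignoring exact cancellation)
    return {
        "average classification error": errors.mean(),
        "variance of error": errors.var(),
        "row support size": {"B": int(nz_B.any(axis=1).sum()),
                             "B+S": int(nz_total.any(axis=1).sum())},
        "support size": {"S": int(nz_S.sum()), "B+S": int(nz_total.sum())},
    }
```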
Acknowledgements
We acknowledge support from NSF grant IIS-101842 and the NSF CAREER program, grant 0954059.
References
[1] A. Asuncion and D. J. Newman. UCI Machine Learning Repository, http://www.ics.uci.edu/~mlearn/MLRepository.html. University of California, School of Information and Computer Science, Irvine, CA, 2007.
[2] F. Bach. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179–1225, 2008.
[3] R. Baraniuk. Compressive sensing. IEEE Signal Processing Magazine, 24(4):118–121, 2007.
[4] R. Caruana. Multitask learning. Machine Learning, 28:41–75, 1997.
[5] C. Zhang and J. Huang. Model selection consistency of the lasso selection in high-dimensional linear regression. Annals of Statistics, 36:1567–1594, 2008.
[6] X. He and P. Niyogi. Locality preserving projections. In NIPS, 2003.
[7] K. Lounici, A. B. Tsybakov, M. Pontil, and S. A. van de Geer. Taking advantage of sparsity in multi-task learning. In 22nd Conference on Learning Theory (COLT), 2009.
[8] S. Negahban and M. J. Wainwright. Joint support recovery under high-dimensional scaling: Benefits and perils of ℓ1,∞-regularization. In Advances in Neural Information Processing Systems (NIPS), 2008.
[9] S. Negahban and M. J. Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. In ICML, 2010.
[10] G. Obozinski, M. J. Wainwright, and M. I. Jordan. Support union recovery in high-dimensional multivariate regression. Annals of Statistics, 2010.
[11] P. Ravikumar, H. Liu, J. Lafferty, and L. Wasserman. Sparse additive models. Journal of the Royal Statistical Society, Series B, 2009.
[12] P. Ravikumar, M. J. Wainwright, and J. Lafferty. High-dimensional Ising model selection using ℓ1-regularized logistic regression. Annals of Statistics, 2009.
[13] B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. In Allerton Conference, Allerton House, Illinois, 2007.
[14] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.
[15] J. A. Tropp, A. C. Gilbert, and M. J. Strauss. Algorithms for simultaneous sparse approximation. Signal Processing, Special issue on “Sparse approximations in signal and image processing”, 86:572–602, 2006.
[16] B. Turlach, W. N. Venables, and S. J. Wright. Simultaneous variable selection. Technometrics, 27:349–363, 2005.
[17] M. van Breukelen, R. P. W. Duin, D. M. J. Tax, and J. E. den Hartog. Handwritten digit recognition by combined classifiers. Kybernetika, 34(4):381–386, 1998.
[18] M. J. Wainwright. Sharp thresholds for noisy and high-dimensional recovery of sparsity using ℓ1-constrained quadratic programming (lasso). IEEE Transactions on Information Theory, 55:2183–2202, 2009.
4 0.74378961 242 nips-2010-Slice sampling covariance hyperparameters of latent Gaussian models
Author: Iain Murray, Ryan P. Adams
Abstract: The Gaussian process (GP) is a popular way to specify dependencies between random variables in a probabilistic model. In the Bayesian framework the covariance structure can be specified using unknown hyperparameters. Integrating over these hyperparameters considers different possible explanations for the data when making predictions. This integration is often performed using Markov chain Monte Carlo (MCMC) sampling. However, with non-Gaussian observations standard hyperparameter sampling approaches require careful tuning and may converge slowly. In this paper we present a slice sampling approach that requires little tuning while mixing well in both strong- and weak-data regimes.
5 0.6678586 270 nips-2010-Tight Sample Complexity of Large-Margin Learning
Author: Sivan Sabato, Nathan Srebro, Naftali Tishby
Abstract: We obtain a tight distribution-specific characterization of the sample complexity of large-margin classification with L2 regularization: We introduce the γ-adapted-dimension, which is a simple function of the spectrum of a distribution’s covariance matrix, and show distribution-specific upper and lower bounds on the sample complexity, both governed by the γ-adapted-dimension of the source distribution. We conclude that this new quantity tightly characterizes the true sample complexity of large-margin classification. The bounds hold for a rich family of sub-Gaussian distributions.
6 0.66695696 58 nips-2010-Decomposing Isotonic Regression for Efficiently Solving Large Problems
7 0.66236448 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework
8 0.66208673 88 nips-2010-Extensions of Generalized Binary Search to Group Identification and Exponential Costs
9 0.66202384 155 nips-2010-Learning the context of a category
10 0.65928316 220 nips-2010-Random Projection Trees Revisited
11 0.65671575 40 nips-2010-Beyond Actions: Discriminative Models for Contextual Group Activities
12 0.65450364 222 nips-2010-Random Walk Approach to Regret Minimization
13 0.6537866 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
14 0.6530543 75 nips-2010-Empirical Risk Minimization with Approximations of Probabilistic Grammars
15 0.65289283 117 nips-2010-Identifying graph-structured activation patterns in networks
16 0.65183204 280 nips-2010-Unsupervised Kernel Dimension Reduction
17 0.65059996 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes
18 0.64830929 268 nips-2010-The Neural Costs of Optimal Control
19 0.64636207 148 nips-2010-Learning Networks of Stochastic Differential Equations
20 0.64318871 163 nips-2010-Lower Bounds on Rate of Convergence of Cutting Plane Methods