jmlr jmlr2011 jmlr2011-64 knowledge-graph by maker-knowledge-mining

64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning


Source: pdf

Author: Paramveer S. Dhillon, Dean Foster, Lyle H. Ungar

Abstract: We propose a framework MIC (Multiple Inclusion Criterion) for learning sparse models based on the information-theoretic Minimum Description Length (MDL) principle. MIC provides an elegant way of incorporating arbitrary sparsity patterns in the feature space by using two-part MDL coding schemes. We present MIC-based models for the problems of grouped feature selection (MIC-GROUP) and multi-task feature selection (MIC-MULTI). MIC-GROUP assumes that the features are divided into groups and induces two-level sparsity, selecting a subset of the feature groups, and also selecting features within each selected group. MIC-MULTI applies when there are multiple related tasks that share the same set of potentially predictive features. It also induces two-level sparsity, selecting a subset of the features, and then selecting which of the tasks each feature should be added to. Lastly, we propose a model, TRANSFEAT, that can be used to transfer knowledge from a set of previously learned tasks to a new task that is expected to share similar features. All three methods are designed for selecting a small set of predictive features from a large pool of candidate features. We demonstrate the effectiveness of our approach with experimental results on data from genomics and from word sense disambiguation problems. Keywords: feature selection, minimum description length principle, multi-task learning

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 MIC provides an elegant way of incorporating arbitrary sparsity patterns in the feature space by using two-part MDL coding schemes. [sent-11, score-0.331]

2 We present MIC-based models for the problems of grouped feature selection (MIC-GROUP) and multi-task feature selection (MIC-MULTI). [sent-12, score-0.366]

3 MIC-GROUP assumes that the features are divided into groups and induces two-level sparsity, selecting a subset of the feature groups, and also selecting features within each selected group. [sent-13, score-0.316]

4 As a running example, we consider the problem of disambiguating word senses based on their context. [sent-32, score-0.418]

5 Also, since the features that are useful for predicting one sense are likely to be useful for predicting the other senses (perhaps with a coefficient of different sign). [sent-41, score-0.375]

6 In particular we propose to solve these two related problems, simultaneous feature selection for a set of multiple related tasks and grouped feature selection for a single task, by using coding schemes inspired by the Minimum Description Length (MDL) principle. [sent-55, score-0.634]

7 In multi-task learning, each feature is added to the models of a (possibly empty) subset of the tasks, and in group feature selection a (possibly empty) subset of the features is selected from each group (feature class). [sent-60, score-0.472]

8 As an example, consider the task of predicting whether a word has a given sense when one already has models for predicting senses for synonyms of that word. [sent-62, score-0.447]

9 TRANSFEAT is most beneficial when the word under consideration has considerably less labeled data available than the synonyms of that word (for example), so that building a supervised learning model for that word alone does not yield high predictive accuracy. [sent-65, score-0.481]

10 Related Work. The main contribution of this paper is to propose a joint framework for the related tasks of simultaneous feature selection for multiple related tasks and grouped feature selection for a single task. [sent-75, score-0.464]

11 Our approach is different from these methods in that we use ℓ0 penalty-based greedy feature selection methods which minimize a cost function provided by MDL based coding schemes. [sent-101, score-0.37]

12 (2009) have also used coding schemes similar to the MDL for enforcing arbitrary structured sparsity patterns over the feature space. [sent-104, score-0.37]

13 CODING THE DATA: D(Y|ŵ). The Kraft inequality in information theory (Cover and Thomas, 2006) implies that for any probability distribution {pi} over a finite or countable set, there exists a corresponding code with codeword length ⌈− lg pi⌉ (the logarithm is base 2). [sent-140, score-0.344]
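
As a small self-contained illustration of this correspondence (our own example, not code from the paper), the Python snippet below computes idealized code lengths − lg pi for a toy distribution and checks the Kraft inequality:

    import math

    def idealized_code_length(p):
        """Idealized (non-integer) code length in bits for an outcome of probability p."""
        return -math.log2(p)

    probs = [0.5, 0.25, 0.125, 0.125]                # a toy distribution over four outcomes
    lengths = [idealized_code_length(p) for p in probs]
    print(lengths)                                   # [1.0, 2.0, 3.0, 3.0]
    print(sum(2 ** -l for l in lengths) <= 1.0)      # Kraft inequality holds: True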

14 We dropped the ceiling on − lg P(Y|w) since we use “idealized” code lengths (Barron et al. [sent-145, score-0.317]

15 So, the first step in coding w could be to say where the non-zero entries are located; if only a few features enter the model, this can be done relatively efficiently by listing the indices of the features in the set {1, 2, . [sent-157, score-0.338]

16 This requires ⌈lg p⌉ bits or approximately lg p bits. [sent-161, score-0.353]

17 In fact, this is done explicitly in the Minimum Message Length (MML) principle, a Bayesian analogue of MDL, which chooses the model ŵ with maximum P(w|Y), that is, the model that minimizes − lg P(w|Y) = − lg P(w) − lg P(Y|w) + const. [sent-167, score-0.777]

18 The number of bits needed to convey an integer i is lg∗ i + b, where lg∗ i = lg i + lg lg i + lg lg lg i + . . . (keeping only the positive terms). [sent-173, score-1.554]

19 We require the lg∗ instead of a simple lg because the number of bits Sender uses to convey the integer i will vary, and she needs to tell the Receiver how many bits to expect. [sent-179, score-0.706]
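
The iterated logarithm above is easy to compute directly; the sketch below (our own code, summing only the positive terms, with the additive constant b set to an assumed value) shows the cost of conveying an integer with this universal code:

    import math

    def lg_star(i):
        """lg* i = lg i + lg lg i + lg lg lg i + ..., keeping only the positive terms."""
        total, x = 0.0, float(i)
        while True:
            x = math.log2(x)
            if x <= 0:
                break
            total += x
        return total

    B = math.log2(2.865)   # assumed value for the additive constant b (Rissanen's normalizer)
    for k in (1, 4, 20, 2000):
        print(k, round(lg_star(k) + B, 2))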

20 The reason for choosing 2 bits over a more conservative penalty such as the lg n bits of BIC (Bayesian Information Criterion) is that using fewer bits allows us to select even those features which provide only marginal benefit. [sent-188, score-0.399]
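
For a sense of scale (our own arithmetic, with hypothetical sample sizes), the flat 2-bit coefficient cost is compared below with the BIC-style lg n cost mentioned above:

    import math

    for n in (100, 1000, 10000):              # hypothetical sample sizes
        print(n, 2, round(math.log2(n), 2))   # AIC-like 2 bits vs BIC-like lg n bits per coefficient
    # At n = 1000, BIC charges about 10 bits per coefficient versus 2, so features with only
    # a marginal benefit are much less likely to enter the model under the BIC penalty.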

21 MIC provides an elegant way of incorporating arbitrary sparsity patterns in the feature space by using MDL coding schemes customized to the problem at hand. [sent-195, score-0.37]

22 In this section, we describe how MIC can be used to provide statistically efficient models for the problems of simultaneous feature selection for multiple related tasks and grouped feature selection for a single task. [sent-196, score-0.415]

23 For the problem of simultaneous feature selection for a set of related tasks (which is addressed using MIC-MULTI), we assume a set of h regression or classification tasks which can potentially share a set of p features and a total of n labeled training examples. [sent-199, score-0.343]

24 Similarly, for the related problem of grouped feature selection (which is addressed using MIC-GROUP), we have a total of p candidate features which are further divided into K groups (of equal or unequal size). [sent-205, score-0.311]

25 If we expect nearly all the responses to be correlated with the predictive features, we could give all the responses nonzero coefficients (using 2h bits in total to code the h response coefficients) and simply specify the feature that we are talking about by using lg p bits, as in Section 3. [sent-225, score-0.82]

26 From now on we will refer to this approach as the FULL-MIC-MULTI (fully dependent MIC-MULTI) coding scheme, as it assumes that a selected feature will be added to the models of all the tasks, in much the same way as BBLASSO (Obozinski et al. [sent-228, score-0.306]

27 Another limiting case is the one when we do feature selection for all the tasks independently (the baseline “Independent” Coding Scheme); the coding scheme in that case takes the form given in Equation 4. [sent-230, score-0.471]

28 A more flexible coding scheme would allow us to specify only the subset of the responses to which we want to give nonzero coefficients. [sent-232, score-0.329]

29 Then, we can use lg p bits to specify our feature (number 2609) once, and then we can list the particular responses that have nonzero coefficients with feature 2609, thereby avoiding paying the cost of lg(mh) four times to specify each coefficient in isolation. [sent-234, score-0.745]

30 We then need lg (h choose k) additional bits to specify the particular subset. [sent-237, score-0.353]

31 We have already described the cost for ℓH above; it is equal to: ℓH = lg∗ k + ch + lg (h choose k). [sent-240, score-0.315]

32 For ℓI, which specifies the size of the code for the given feature, we use lg p bits, which is equivalent to a uniform prior over the features, that is, each feature is equally likely to be selected. [sent-244, score-0.443]

33 This can be accomplished by simply keeping a linear array of features and coding the indices of the features with nonzero coefficients. [sent-245, score-0.375]

34 Thus, we can represent the total model cost for MIC-MULTI as: SM = lg∗ k + ch + lg (h choose k) + lg p + 2k. [sent-246, score-0.315]

35 The model likelihood under the Gaussian assumption can be written as: P(Yi|ŵq) = (1 / √((2π)^h |Σ|)) exp(−(1/2) εi^T Σ^−1 εi) (6), so that SE = − lg ∏_{i=1}^{n} P(Yi|ŵq) (7) = (1 / (2 ln 2)) [ n ln((2π)^h |Σ|) + ∑_{i=1}^{n} (Yi − Xi·ŵq)^T Σ^−1 (Yi − Xi·ŵq) ], with subscript i denoting the ith row. [sent-253, score-0.477]
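
A minimal sketch (with our own variable names and synthetic data) of how this reconstructed data cost SE could be evaluated:

    import numpy as np

    def data_cost_bits(Y, X, W, Sigma):
        """SE in bits: (1 / (2 ln 2)) * [ n ln((2 pi)^h |Sigma|) + sum_i eps_i^T Sigma^{-1} eps_i ],
        with eps_i = Y_i - X_i W (the i-th row of residuals)."""
        n, h = Y.shape
        eps = Y - X @ W
        quad = np.einsum('ij,jk,ik->', eps, np.linalg.inv(Sigma), eps)
        return (n * np.log((2 * np.pi) ** h * np.linalg.det(Sigma)) + quad) / (2 * np.log(2))

    rng = np.random.default_rng(0)            # tiny synthetic check: n = 50, p = 5, h = 2
    X = rng.normal(size=(50, 5))
    W = rng.normal(size=(5, 2))
    Y = X @ W + 0.1 * rng.normal(size=(50, 2))
    print(data_cost_bits(Y, X, W, np.cov((Y - X @ W).T)))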

36 Note that we do not need to pay an extra coding cost for estimating Σ as we are using a prequential coding scheme; Σ is calculated using information that was already paid for. [sent-259, score-0.384]

37 The coding costs for these three methods are compared in Table 1 for p = 2000 features, h = 20 responses, and for various values of k, the number of responses to which we add the current feature under consideration. [sent-272, score-0.385]
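
That comparison can be re-derived from the formulas above; the sketch below (our own code) evaluates the fully dependent and partially dependent model-coding costs for p = 2000 and h = 20, using the 2-bits-per-coefficient charge from above, an assumed value for the constant ch, and omitting the independent baseline, whose exact form is not reproduced in this excerpt:

    from math import comb, log2

    def lg_star(i):
        """Iterated logarithm, positive terms only."""
        total, x = 0.0, float(i)
        while True:
            x = log2(x)
            if x <= 0:
                break
            total += x
        return total

    def partial_cost(k, p=2000, h=20, c_h=log2(2.865)):
        """PARTIAL MIC-MULTI model cost for one feature added to k of h tasks:
        lg* k + ch + lg C(h, k) + lg p + 2k (the value of ch is an assumption)."""
        return lg_star(k) + c_h + log2(comb(h, k)) + log2(p) + 2 * k

    def full_cost(p=2000, h=20):
        """FULL-MIC-MULTI: lg p bits to name the feature plus 2 bits per coefficient."""
        return log2(p) + 2 * h

    for k in (1, 5, 10, 15, 20):
        print(k, round(partial_cost(k), 1), round(full_cost(), 1))

Note that the binomial term peaks near k = h/2 and falls back to zero at k = h, which is the non-monotonicity referred to later in the text.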

38 For example, genes can be divided into gene classes based on what pathway they occur in, or features of a word can be grouped based on whether they are based on specific neighbouring words, parts of speech, or more global document properties. [sent-296, score-0.297]

39 The problem of grouped feature selection (MIC-GROUP) is very closely related to the problem of simultaneous feature selection for a set of related tasks (MIC-MULTI), as has also been pointed out by Obozinski et al. [sent-299, score-0.415]

40 The multi-task problem we described earlier can also be thought of as a grouped feature selection scenario in which a group is defined by fixing a specific feature and ranging over multiple tasks. [sent-301, score-0.372]

41 Grouped feature selection is based on the fact that some feature groups contain more highly predictive features than others. [sent-304, score-0.431]

42 CODING SCHEMES FOR MIC-GROUP. Since the problem of grouped feature selection is similar to the problem of simultaneous feature selection for a set of related tasks, we can propose a coding scheme analogous to the coding scheme for MIC-MULTI. [sent-307, score-0.792]

43 In a similar fashion, the number of bits to code the model can be represented as SM = lg∗ k + ch + lg (hsingle choose k) + lg p + 2k, which corresponds to Equation 5. [sent-310, score-0.443]

44 Although this coding scheme works very well in practice, it turns out that we are not exploiting the full flexibility that MDL-based coding offers us. [sent-313, score-0.393]

45 So, we propose a new coding scheme, which is computationally more efficient than MIC-GROUP (I), as it does not require a stepwise search for subset selection, though the predictive accuracy of both these coding schemes is comparable. [sent-314, score-0.458]

46 Coding the data with MIC-GROUP-SC (MIC-GROUP Switch Coding): in this new coding scheme, SE is the message length for a single feature and ∆SE represents the increase in the likelihood of the data from adding that feature to the model. [sent-316, score-0.492]

47 The intuition behind this coding scheme is that once we have selected (at least) one feature from a given group, then it should become cheaper to select further features from that same group. [sent-325, score-0.385]

48 For simplicity of analysis and ease of comparison with the coding schemes for MIC-MULTI, we are assuming that all groups are of the same size hsingle, though in reality the groups may be of unequal size and the same coding scheme still holds. [sent-326, score-0.496]

49 So ℓC is lg K, where K is the total number of groups (feature classes) in the data. [sent-331, score-0.291]

50 Therefore ℓC is as follows (note that we also add 1 bit to lg K to ensure that the group whose index starts with 1 is not confused with the "switch"). [sent-335, score-0.305]

51 ℓC = 1 + lg K if the feature class is not yet in the model, and ℓC = 1 + lg Q if the feature class is already in the model. [sent-336, score-0.518]
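
A minimal sketch of that case analysis (our own names; Q is taken here to be the number of feature classes already in the model, which this excerpt does not spell out):

    from math import log2

    def class_cost_bits(feature_class, classes_in_model, K):
        """Switch-coding cost lC for naming a feature class: 1 + lg K bits if the class
        is not yet in the model, 1 + lg Q bits if it already is."""
        if feature_class in classes_in_model:
            return 1 + log2(len(classes_in_model))    # Q = number of classes already selected
        return 1 + log2(K)

    selected = {"pathway_3", "pathway_12"}            # hypothetical feature classes
    print(class_cost_bits("pathway_7", selected, K=100))   # new class:    1 + lg 100
    print(class_cost_bits("pathway_3", selected, K=100))   # reused class: 1 + lg 2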

52 This corresponds to lg mi bits, where mi is the total number of features in the feature class of which the ith feature is a part. [sent-338, score-0.684]

53 As mentioned earlier, this coding scheme is computationally cheaper than MIC-GROUP (I), as it does not require a subset search every time a feature is added to the model, and it provides predictive accuracy comparable to MIC-GROUP (I). [sent-342, score-0.367]

54 Note that, analogous to MIC-MULTI, it is possible to come up with a new coding scheme called FULL MIC-GROUP (I) which, just like its MIC-MULTI counterpart, would add all the features from a given group (feature class) into the model. [sent-343, score-0.338]

55 Second, the lg (h choose k) term in PARTIAL MIC-MULTI's coding cost does not increase monotonically with k, so even if adding the feature to an intermediate number of tasks does not look promising, adding it to all of them might still be worthwhile. [sent-396, score-0.379]

56 Because each response had four features, those responses (6 − 20) that did not have all of the first four features had other features randomly distributed among the remaining features (5, 6, . [sent-461, score-0.376]

57 Also, as shorthand we denote − ∑_{i=1}^{n} lg Q(Xi) as − lg Q(X^n) throughout this section, and P^(n) denotes the marginal distribution on the first n outcomes that is induced by the probabilistic source P. [sent-849, score-0.518]

58 The ℓn(Z) term corresponds to the number of bits required to code the model, and the − lg Z(X^n) term corresponds to the data-likelihood term in the two-part MDL coding scheme. [sent-875, score-0.591]
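
Written out (our reconstruction from this description; the symbol L and the argmin are our own labels), the two-part code length being minimized is:

    L(Z, X^n) \;=\; \underbrace{\ell_n(Z)}_{\text{model bits}} \;+\; \underbrace{-\lg Z(X^n)}_{\text{data bits}},
    \qquad \hat{Z} \;=\; \arg\min_{Z} L(Z, X^n).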

59 However, due to the complexity of the forward greedy feature selection the sparsistency condition in this case depends only on the feature (design) matrix X, unlike Lasso and Group Lasso. [sent-896, score-0.325]

60 For our MIC based methods this penalty is a combination of RIC, AIC (to code the coefficients) and other coding schemes which incorporate the structure of the problem at hand. [sent-899, score-0.315]

61 However, we could modify our coding schemes slightly by using the BIC penalty (lg n bits) to code the coefficients instead of AIC to ensure sparsistency of MIC. [sent-901, score-0.348]

62 However, we prefer that our methods not be sparsistent, since in that case we achieve performance competitive with the true underlying model; that is, we get finite risk-inflation of about 2 lg p (Foster and George, 1994), whereas if we chose sparsistency then MIC would have infinite risk-inflation. [sent-902, score-0.292]

63 A Model for "Intra-Domain" Adaptation: TRANSFEAT. In the previous sections we proposed MIC-based methods for the related problems of simultaneous feature selection for a set of multiple related tasks (MIC-MULTI) and grouped feature selection for a single task (MIC-GROUP). [sent-906, score-0.415]

64 Note that this is a multi-class problem; we have a single task, which is to predict the correct sense of the word and we have h possible choices (the word senses) for that task. [sent-922, score-0.331]

65 The main reason for doing this is that not all senses of all words are similar to all senses of some other word. [sent-927, score-0.534]

66 Thus, transfer learning only makes sense at the level of individual word senses rather than at the level of whole words. [sent-928, score-0.497]

67 • Make separate feature matrices for these h prediction problems, because the original feature matrix X (of size n × p) contained features which would be useful for the multiclass problem of "What is the exact sense of the word?" [sent-929, score-0.36]

68 We do this by characterizing each binary problem by those features from the original p features which are positively correlated with that particular word sense. [sent-932, score-0.309]

69 • Next, cluster the different word senses using "foreground-background" clustering, which puts all singleton points into a "background cluster" that we then ignore. • Learn separate MIC-GROUP-SC models for each word sense. [sent-937, score-0.596]

70 • For each word sense in a cluster, use TRANSFEAT to learn a feature relevance prior from the remaining word senses in that cluster that have more observations than the target word sense, on the features of that word sense. [sent-939, score-0.481]

71 In general, features with positive coefficients are associated with the given sense and those with negative coefficients with other senses of that word. [sent-941, score-0.375]

72 We learn the feature relevance prior only from distributionally similar word senses, in contrast to Ando (2006), who shares knowledge across "all" the senses of "all" the words. [sent-944, score-0.544]

73 For example, one sense of “fire” (as in “fire someone”) should share features with one sense of “dismiss” (as in “dismiss someone”), but other senses of “fire” (as in “fire the gun”) do not. [sent-946, score-0.404]

74 Thus, knowledge can only be fruitfully transferred between the shared senses of different words, even though the models being learned are for disambiguating different senses of a single word. [sent-950, score-0.534]

75 We hold out each word sense in the cluster once and learn a prior from the remaining word senses in that cluster. [sent-952, score-0.625]

76 If at least one sense of the word "arrest" that we are trying to model is similar to the other word senses (for "kill" and "capture"), some of the same features should be beneficial for all of them. [sent-954, score-0.677]

77 TRANSFEAT Formulation. We now describe TRANSFEAT in detail and show how it can be used to learn better feature selection models by replacing the overly simplistic uniform-prior assumption in the model coding schemes of the MIC methods with a learned feature relevance prior. [sent-956, score-0.511]

78 We define a binary random variable fi ∈ {1, 0} that denotes the event of the ith feature being in or not being in the model for the test word sense, and model it as being from a Bernoulli distribution parameterized by θi: p(fi|θi) = θi^{fi} (1 − θi)^{1−fi}. [sent-957, score-0.4]

79 Given the data for the ith feature for all the training word senses, we can write: Dfi = {fi1, . . . [sent-958, score-0.318]

80 As can be seen from Equation 17, the probability that a feature is selected for the held out test word sense is a “smoothed” average of the number of times it was selected in the models for the senses of other words that are similar to it. [sent-971, score-0.573]
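
Equation 17 itself is not shown in this excerpt; a Beta-Bernoulli smoothing consistent with this description (our own assumed form and names, with an arbitrary default for b) would look like:

    def smoothed_selection_prob(k, l, a=1.0, b=10.0):
        """Assumed form of the smoothed estimate: posterior mean of a Beta(a, b)-Bernoulli model,
        where k = number of similar word senses whose models selected the feature and
        l = number of similar word senses considered. The default value of b is arbitrary."""
        return (k + a) / (l + a + b)

    print(smoothed_selection_prob(0, 0))   # no transfer (k = l = 0): falls back to the prior a / (a + b)
    print(smoothed_selection_prob(3, 4))   # feature selected in 3 of the 4 similar word senses

In the no-transfer limit this reduces to the prior a / (a + b), which the text below says is tuned so that the coding scheme matches the baseline of Equation 4.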

81 Using similar reasoning, we can extend the above concept to the groups (feature classes) so that the probability that a group (feature class) is selected is also a “smoothed” average of the number of times it was selected in the models for the senses of other words that are similar to it. [sent-972, score-0.345]

82 Note that in the case when we have previously selected features from a given feature class, the most efficient way to code the feature class is to use the minimum of the TRANSFEAT cost and the actual "switch" coding cost, as described in Section 4. [sent-974, score-0.617]

83 Cluster the different word senses by "foreground-background" clustering. [sent-985, score-0.418]

84 For each cluster k in {1, . . . , c}: word_sensesk = sk // number of word senses in the kth cluster. [sent-989, score-0.418]

85 // Uniform prior assumption. Then, for each cluster i and for each word sense t in word_sensesi: learn a TRANSFEAT model on all word senses in the cluster which have more data (observations) than the t-th word sense. [sent-991, score-0.596]

86 CHOICE OF HYPERPARAMETERS. The hyperparameters a and b in Equation 17 control the "smoothing" of our probability estimates, that is, how strongly we want the evidence obtained from similar word senses to affect the model that we learn for the test word sense. [sent-997, score-0.569]

87 In all our experiments we set a = 1 and choose b so that in the limiting case of no transfer (that is, k = l = 0 in Equation 17) the coding scheme reduces to the baseline feature selection scheme described in Equation 4. [sent-998, score-0.422]

88 SIMILARITY METRIC. Finding a good similarity metric between different word senses is perhaps one of the biggest challenges that we faced. [sent-1004, score-0.418]

89 There are many ways in which word senses can be judged as similar, including having similar “meanings” or similar syntactic usages. [sent-1006, score-0.418]

90 We thus need a clustering method that gives tight clusters of word senses, and does not attempt to cluster those word senses which are not similar to any other word sense in the corpus. [sent-1015, score-0.776]

91 This algorithm gives highly cohesive clusters of word senses (the foreground) and puts all the remaining word senses in the background. [sent-1018, score-0.836]

92 The main difference between these two data sets is that the SENSEVAL-2 data contains "fine-grained" senses of the words and as a result tends to have more senses per word than the "coarse-grained" verb senses in OntoNotes. [sent-1027, score-0.952]

93 RESULTS. We cluster the word senses based on all the features, that is, semantic + syntactic similarity features. [sent-1035, score-0.445]

94 In order to ensure fairness of comparison, we compute the predicted sense for each observation by selecting the word sense model (from among the different senses for that word) with the highest score for that observation. [sent-1063, score-0.476]

95 Some examples will help to emphasize the point that we made earlier that transfer helps the most in cases in which the target word sense has much less data than the word senses from which knowledge is being transferred. [sent-1080, score-0.648]

96 “kill” had roughly 6 times more data than all other word senses in its cluster (i. [sent-1081, score-0.445]

97 Similarly, for the case of word “do” which had roughly 10 times more data than the other word senses in its cluster (e. [sent-1088, score-0.596]

98 Transfer makes the biggest difference when the target words have much less data than the word senses they are generalizing from, but even in cases where the words have comparable amounts of data we still get a 1. [sent-1093, score-0.418]

99 For example, “begin” had ∼ 8 times more data (on average per sense) than the other word senses in its cluster (i. [sent-1107, score-0.445]

100 No Hypercompression Inequality: ∀K > 0, PTrue[− lg PTrue(X^n) ≥ − lg PModel(X^n) + K] ≤ 2^(−K). [sent-1129, score-0.518]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ulti', 0.311), ('roup', 0.311), ('mic', 0.308), ('senses', 0.267), ('lg', 0.259), ('ptrue', 0.239), ('mdl', 0.229), ('rans', 0.227), ('eat', 0.188), ('coding', 0.18), ('word', 0.151), ('feature', 0.126), ('hillon', 0.116), ('ngar', 0.116), ('oster', 0.116), ('enalization', 0.111), ('wq', 0.109), ('bbl', 0.099), ('bits', 0.094), ('sender', 0.087), ('tdl', 0.081), ('wsd', 0.081), ('asso', 0.08), ('features', 0.079), ('responses', 0.079), ('enseval', 0.07), ('nto', 0.07), ('parse', 0.064), ('ndo', 0.064), ('otes', 0.064), ('ando', 0.061), ('se', 0.061), ('response', 0.06), ('foster', 0.059), ('pmic', 0.058), ('code', 0.058), ('verbs', 0.057), ('dhillon', 0.053), ('transfer', 0.05), ('tasks', 0.049), ('gsea', 0.047), ('yeast', 0.047), ('group', 0.046), ('nwald', 0.044), ('baseline', 0.043), ('gn', 0.043), ('sm', 0.042), ('lasso', 0.041), ('fi', 0.041), ('hang', 0.041), ('ric', 0.041), ('selection', 0.04), ('schemes', 0.039), ('coef', 0.038), ('penalty', 0.038), ('multi', 0.038), ('nonzero', 0.037), ('disambiguation', 0.037), ('cv', 0.036), ('transfeat', 0.035), ('grouped', 0.034), ('scheme', 0.033), ('gene', 0.033), ('accuracies', 0.033), ('ull', 0.033), ('sparsistency', 0.033), ('gr', 0.033), ('ch', 0.032), ('breast', 0.032), ('groups', 0.032), ('nets', 0.031), ('stepwise', 0.031), ('eatures', 0.031), ('oding', 0.03), ('synthetic', 0.03), ('description', 0.029), ('arrest', 0.029), ('pmodel', 0.029), ('ungar', 0.029), ('sense', 0.029), ('partial', 0.029), ('mn', 0.029), ('receiver', 0.029), ('predictive', 0.028), ('earning', 0.027), ('length', 0.027), ('cluster', 0.027), ('elastic', 0.027), ('lexical', 0.027), ('obozinski', 0.026), ('residuals', 0.026), ('cients', 0.026), ('codes', 0.026), ('dataset', 0.026), ('sparsity', 0.025), ('cancer', 0.025), ('leukemia', 0.025), ('zhang', 0.024), ('cost', 0.024), ('bblasso', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999994 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning

Author: Paramveer S. Dhillon, Dean Foster, Lyle H. Ungar

Abstract: We propose a framework MIC (Multiple Inclusion Criterion) for learning sparse models based on the information-theoretic Minimum Description Length (MDL) principle. MIC provides an elegant way of incorporating arbitrary sparsity patterns in the feature space by using two-part MDL coding schemes. We present MIC-based models for the problems of grouped feature selection (MIC-GROUP) and multi-task feature selection (MIC-MULTI). MIC-GROUP assumes that the features are divided into groups and induces two-level sparsity, selecting a subset of the feature groups, and also selecting features within each selected group. MIC-MULTI applies when there are multiple related tasks that share the same set of potentially predictive features. It also induces two-level sparsity, selecting a subset of the features, and then selecting which of the tasks each feature should be added to. Lastly, we propose a model, TRANSFEAT, that can be used to transfer knowledge from a set of previously learned tasks to a new task that is expected to share similar features. All three methods are designed for selecting a small set of predictive features from a large pool of candidate features. We demonstrate the effectiveness of our approach with experimental results on data from genomics and from word sense disambiguation problems. Keywords: feature selection, minimum description length principle, multi-task learning

2 0.10398453 59 jmlr-2011-Learning with Structured Sparsity

Author: Junzhou Huang, Tong Zhang, Dimitris Metaxas

Abstract: This paper investigates a learning formulation called structured sparsity, which is a natural extension of the standard sparsity concept in statistical learning and compressive sensing. By allowing arbitrary structures on the feature set, this concept generalizes the group sparsity idea that has become popular in recent years. A general theory is developed for learning with structured sparsity, based on the notion of coding complexity associated with the structure. It is shown that if the coding complexity of the target signal is small, then one can achieve improved performance by using coding complexity regularization methods, which generalize the standard sparse regularization. Moreover, a structured greedy algorithm is proposed to efficiently solve the structured sparsity problem. It is shown that the greedy algorithm approximately solves the coding complexity optimization problem under appropriate conditions. Experiments are included to demonstrate the advantage of structured sparsity over standard sparsity on some real applications. Keywords: structured sparsity, standard sparsity, group sparsity, tree sparsity, graph sparsity, sparse learning, feature selection, compressive sensing

3 0.076624565 68 jmlr-2011-Natural Language Processing (Almost) from Scratch

Author: Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, Pavel Kuksa

Abstract: We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements. Keywords: natural language processing, neural networks

4 0.073623545 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices

Author: Jérémie Bigot, Rolando J. Biscay, Jean-Michel Loubes, Lillian Muñiz-Alvarez

Abstract: In this paper, we consider the Group Lasso estimator of the covariance matrix of a stochastic process corrupted by an additive noise. We propose to estimate the covariance matrix in a highdimensional setting under the assumption that the process has a sparse representation in a large dictionary of basis functions. Using a matrix regression model, we propose a new methodology for high-dimensional covariance matrix estimation based on empirical contrast regularization by a group Lasso penalty. Using such a penalty, the method selects a sparse set of basis functions in the dictionary used to approximate the process, leading to an approximation of the covariance matrix into a low dimensional space. Consistency of the estimator is studied in Frobenius and operator norms and an application to sparse PCA is proposed. Keywords: group Lasso, ℓ1 penalty, high-dimensional covariance estimation, basis expansion, sparsity, oracle inequality, sparse PCA

5 0.073428497 55 jmlr-2011-Learning Multi-modal Similarity

Author: Brian McFee, Gert Lanckriet

Abstract: In many applications involving multi-media data, the definition of similarity between items is integral to several key tasks, including nearest-neighbor retrieval, classification, and recommendation. Data in such regimes typically exhibits multiple modalities, such as acoustic and visual content of video. Integrating such heterogeneous data to form a holistic similarity space is therefore a key challenge to be overcome in many real-world applications. We present a novel multiple kernel learning technique for integrating heterogeneous data into a single, unified similarity space. Our algorithm learns an optimal ensemble of kernel transformations which conform to measurements of human perceptual similarity, as expressed by relative comparisons. To cope with the ubiquitous problems of subjectivity and inconsistency in multimedia similarity, we develop graph-based techniques to filter similarity measurements, resulting in a simplified and robust training procedure. Keywords: multiple kernel learning, metric learning, similarity

6 0.068024896 97 jmlr-2011-Union Support Recovery in Multi-task Learning

7 0.065135859 78 jmlr-2011-Producing Power-Law Distributions and Damping Word Frequencies with Two-Stage Language Models

8 0.06180235 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding

9 0.041847609 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

10 0.041089758 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models

11 0.040256854 101 jmlr-2011-Variable Sparsity Kernel Learning

12 0.038465187 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms

13 0.037851322 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models

14 0.036677845 31 jmlr-2011-Efficient and Effective Visual Codebook Generation Using Additive Kernels

15 0.034437884 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review

16 0.033634055 66 jmlr-2011-Multiple Kernel Learning Algorithms

17 0.032342013 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing

18 0.031755388 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods

19 0.030232942 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling

20 0.029892534 82 jmlr-2011-Robust Gaussian Process Regression with a Student-tLikelihood


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.184), (1, -0.073), (2, 0.119), (3, 0.051), (4, -0.077), (5, 0.004), (6, -0.0), (7, 0.047), (8, -0.016), (9, -0.083), (10, 0.083), (11, -0.144), (12, 0.066), (13, -0.149), (14, 0.018), (15, -0.065), (16, -0.01), (17, 0.055), (18, 0.029), (19, 0.05), (20, -0.114), (21, 0.017), (22, -0.185), (23, 0.067), (24, 0.056), (25, -0.067), (26, -0.115), (27, -0.084), (28, -0.315), (29, 0.003), (30, 0.1), (31, -0.08), (32, 0.159), (33, 0.044), (34, 0.1), (35, 0.009), (36, 0.086), (37, -0.039), (38, -0.035), (39, 0.019), (40, 0.062), (41, 0.074), (42, -0.103), (43, 0.354), (44, 0.138), (45, 0.083), (46, -0.005), (47, 0.033), (48, 0.074), (49, -0.199)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94856799 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning

Author: Paramveer S. Dhillon, Dean Foster, Lyle H. Ungar

Abstract: We propose a framework MIC (Multiple Inclusion Criterion) for learning sparse models based on the information theoretic Minimum Description Length (MDL) principle. MIC provides an elegant way of incorporating arbitrary sparsity patterns in the feature space by using two-part MDL coding schemes. We present MIC based models for the problems of grouped feature selection (MICG ROUP) and multi-task feature selection (MIC-M ULTI). MIC-G ROUP assumes that the features are divided into groups and induces two level sparsity, selecting a subset of the feature groups, and also selecting features within each selected group. MIC-M ULTI applies when there are multiple related tasks that share the same set of potentially predictive features. It also induces two level sparsity, selecting a subset of the features, and then selecting which of the tasks each feature should be added to. Lastly, we propose a model, T RANS F EAT, that can be used to transfer knowledge from a set of previously learned tasks to a new task that is expected to share similar features. All three methods are designed for selecting a small set of predictive features from a large pool of candidate features. We demonstrate the effectiveness of our approach with experimental results on data from genomics and from word sense disambiguation problems.1 Keywords: feature selection, minimum description length principle, multi-task learning

2 0.50860399 68 jmlr-2011-Natural Language Processing (Almost) from Scratch

Author: Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, Pavel Kuksa

Abstract: We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements. Keywords: natural language processing, neural networks

3 0.42208862 59 jmlr-2011-Learning with Structured Sparsity

Author: Junzhou Huang, Tong Zhang, Dimitris Metaxas

Abstract: This paper investigates a learning formulation called structured sparsity, which is a natural extension of the standard sparsity concept in statistical learning and compressive sensing. By allowing arbitrary structures on the feature set, this concept generalizes the group sparsity idea that has become popular in recent years. A general theory is developed for learning with structured sparsity, based on the notion of coding complexity associated with the structure. It is shown that if the coding complexity of the target signal is small, then one can achieve improved performance by using coding complexity regularization methods, which generalize the standard sparse regularization. Moreover, a structured greedy algorithm is proposed to efficiently solve the structured sparsity problem. It is shown that the greedy algorithm approximately solves the coding complexity optimization problem under appropriate conditions. Experiments are included to demonstrate the advantage of structured sparsity over standard sparsity on some real applications. Keywords: structured sparsity, standard sparsity, group sparsity, tree sparsity, graph sparsity, sparse learning, feature selection, compressive sensing

4 0.33397552 31 jmlr-2011-Efficient and Effective Visual Codebook Generation Using Additive Kernels

Author: Jianxin Wu, Wei-Chian Tan, James M. Rehg

Abstract: Common visual codebook generation methods used in a bag of visual words model, for example, k-means or Gaussian Mixture Model, use the Euclidean distance to cluster features into visual code words. However, most popular visual descriptors are histograms of image measurements. It has been shown that with histogram features, the Histogram Intersection Kernel (HIK) is more effective than the Euclidean distance in supervised learning tasks. In this paper, we demonstrate that HIK can be used in an unsupervised manner to significantly improve the generation of visual codebooks. We propose a histogram kernel k-means algorithm which is easy to implement and runs almost as fast as the standard k-means. The HIK codebooks have consistently higher recognition accuracy over k-means codebooks by 2–4% in several benchmark object and scene recognition data sets. The algorithm is also generalized to arbitrary additive kernels. Its speed is thousands of times faster than a naive implementation of the kernel k-means algorithm. In addition, we propose a one-class SVM formulation to create more effective visual code words. Finally, we show that the standard kmedian clustering method can be used for visual codebook generation and can act as a compromise between the HIK / additive kernel and the k-means approaches. Keywords: visual codebook, additive kernel, histogram intersection kernel

5 0.32290965 78 jmlr-2011-Producing Power-Law Distributions and Damping Word Frequencies with Two-Stage Language Models

Author: Sharon Goldwater, Thomas L. Griffiths, Mark Johnson

Abstract: Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that can generically produce power laws, breaking generative models into two stages. The first stage, the generator, can be any standard probabilistic model, while the second stage, the adaptor, transforms the word frequencies of this model to provide a closer match to natural language. We show that two commonly used Bayesian models, the Dirichlet-multinomial model and the Dirichlet process, can be viewed as special cases of our framework. We discuss two stochastic processes—the Chinese restaurant process and its two-parameter generalization based on the Pitman-Yor process—that can be used as adaptors in our framework to produce power-law distributions over word frequencies. We show that these adaptors justify common estimation procedures based on logarithmic or inverse-power transformations of empirical frequencies. In addition, taking the Pitman-Yor Chinese restaurant process as an adaptor justifies the appearance of type frequencies in formal analyses of natural language and improves the performance of a model for unsupervised learning of morphology. Keywords: nonparametric Bayes, Pitman-Yor process, language model, unsupervised

6 0.29600736 55 jmlr-2011-Learning Multi-modal Similarity

7 0.2895889 97 jmlr-2011-Union Support Recovery in Multi-task Learning

8 0.22829053 76 jmlr-2011-Parameter Screening and Optimisation for ILP using Designed Experiments

9 0.2222217 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices

10 0.20363194 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms

11 0.19125113 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

12 0.17607099 101 jmlr-2011-Variable Sparsity Kernel Learning

13 0.17586654 49 jmlr-2011-Kernel Regression in the Presence of Correlated Errors

14 0.17181528 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding

15 0.16730431 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms

16 0.16488151 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership

17 0.16115759 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review

18 0.15857065 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models

19 0.15834361 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods

20 0.1571331 56 jmlr-2011-Learning Transformation Models for Ranking and Survival Analysis


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.051), (6, 0.011), (9, 0.044), (10, 0.032), (24, 0.063), (31, 0.071), (32, 0.028), (41, 0.033), (50, 0.392), (60, 0.03), (65, 0.011), (67, 0.012), (70, 0.012), (71, 0.017), (73, 0.023), (78, 0.057), (90, 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.68278646 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning

Author: Paramveer S. Dhillon, Dean Foster, Lyle H. Ungar

Abstract: We propose a framework MIC (Multiple Inclusion Criterion) for learning sparse models based on the information theoretic Minimum Description Length (MDL) principle. MIC provides an elegant way of incorporating arbitrary sparsity patterns in the feature space by using two-part MDL coding schemes. We present MIC based models for the problems of grouped feature selection (MICG ROUP) and multi-task feature selection (MIC-M ULTI). MIC-G ROUP assumes that the features are divided into groups and induces two level sparsity, selecting a subset of the feature groups, and also selecting features within each selected group. MIC-M ULTI applies when there are multiple related tasks that share the same set of potentially predictive features. It also induces two level sparsity, selecting a subset of the features, and then selecting which of the tasks each feature should be added to. Lastly, we propose a model, T RANS F EAT, that can be used to transfer knowledge from a set of previously learned tasks to a new task that is expected to share similar features. All three methods are designed for selecting a small set of predictive features from a large pool of candidate features. We demonstrate the effectiveness of our approach with experimental results on data from genomics and from word sense disambiguation problems.1 Keywords: feature selection, minimum description length principle, multi-task learning

2 0.3164396 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling

Author: Ricardo Henao, Ole Winther

Abstract: In this paper we consider sparse and identifiable linear latent variable (factor) and linear Bayesian network models for parsimonious analysis of multivariate data. We propose a computationally efficient method for joint parameter and model inference, and model comparison. It consists of a fully Bayesian hierarchy for sparse models using slab and spike priors (two-component δ-function and continuous mixtures), non-Gaussian latent factors and a stochastic search over the ordering of the variables. The framework, which we call SLIM (Sparse Linear Identifiable Multivariate modeling), is validated and bench-marked on artificial and real biological data sets. SLIM is closest in spirit to LiNGAM (Shimizu et al., 2006), but differs substantially in inference, Bayesian network structure learning and model comparison. Experimentally, SLIM performs equally well or better than LiNGAM with comparable computational complexity. We attribute this mainly to the stochastic search strategy used, and to parsimony (sparsity and identifiability), which is an explicit part of the model. We propose two extensions to the basic i.i.d. linear framework: non-linear dependence on observed variables, called SNIM (Sparse Non-linear Identifiable Multivariate modeling) and allowing for correlations between latent variables, called CSLIM (Correlated SLIM), for the temporal and/or spatial data. The source code and scripts are available from http://cogsys.imm.dtu.dk/slim/. Keywords: parsimony, sparsity, identifiability, factor models, linear Bayesian networks

3 0.31112552 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing

Author: Jennifer Gillenwater, Kuzman Ganchev, João Graça, Fernando Pereira, Ben Taskar

Abstract: A strong inductive bias is essential in unsupervised grammar induction. In this paper, we explore a particular sparsity bias in dependency grammars that encourages a small number of unique dependency types. We use part-of-speech (POS) tags to group dependencies by parent-child types and investigate sparsity-inducing penalties on the posterior distributions of parent-child POS tag pairs in the posterior regularization (PR) framework of Graça et al. (2007). In experiments with 12 different languages, we achieve significant gains in directed attachment accuracy over the standard expectation maximization (EM) baseline, with an average accuracy improvement of 6.5%, outperforming EM by at least 1% for 9 out of 12 languages. Furthermore, the new method outperforms models based on standard Bayesian sparsity-inducing parameter priors with an average improvement of 5% and positive gains of at least 1% for 9 out of 12 languages. On English text in particular, we show that our approach improves performance over other state-of-the-art techniques.

4 0.31109843 91 jmlr-2011-The Sample Complexity of Dictionary Learning

Author: Daniel Vainsencher, Shie Mannor, Alfred M. Bruckstein

Abstract: A large set of signals can sometimes be described sparsely using a dictionary, that is, every element can be represented as a linear combination of few elements from the dictionary. Algorithms for various signal processing applications, including classification, denoising and signal separation, learn a dictionary from a given set of signals to be represented. Can we expect that the error in representing by such a dictionary a previously unseen signal from the same source will be of similar magnitude as those for the given examples? We assume signals are generated from a fixed distribution, and study these questions from a statistical learning theory perspective. We develop generalization bounds on the quality of the learned dictionary for two types of constraints on the coefficient selection, as measured by the expected L2 error in representation when the dictionary is used. For the case of l1 regularized coefficient selection we provide a generalnp ln(mλ)/m , where n is the dimension, p is the number of ization bound of the order of O elements in the dictionary, λ is a bound on the l1 norm of the coefficient vector and m is the number of samples, which complements existing results. For the case of representing a new signal as a combination of at most k dictionary elements, we provide a bound of the order O( np ln(mk)/m) under an assumption on the closeness to orthogonality of the dictionary (low Babel function). We further show that this assumption holds for most dictionaries in high dimensions in a strong probabilistic sense. Our results also include bounds that converge as 1/m, not previously known for this problem. We provide similar results in a general setting using kernels with weak smoothness requirements. Keywords: dictionary learning, generalization bound, sparse representation

5 0.30812937 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets

Author: Bruno Pelletier, Pierre Pudlo

Abstract: Following Hartigan (1975), a cluster is defined as a connected component of the t-level set of the underlying density, that is, the set of points for which the density is greater than t. A clustering algorithm which combines a density estimate with spectral clustering techniques is proposed. Our algorithm is composed of two steps. First, a nonparametric density estimate is used to extract the data points for which the estimated density takes a value greater than t. Next, the extracted points are clustered based on the eigenvectors of a graph Laplacian matrix. Under mild assumptions, we prove the almost sure convergence in operator norm of the empirical graph Laplacian operator associated with the algorithm. Furthermore, we give the typical behavior of the representation of the data set into the feature space, which establishes the strong consistency of our proposed algorithm. Keywords: spectral clustering, graph, unsupervised classification, level sets, connected components

6 0.3069669 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

7 0.30629447 48 jmlr-2011-Kernel Analysis of Deep Networks

8 0.30382234 16 jmlr-2011-Clustering Algorithms for Chains

9 0.30306017 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates

10 0.30288166 59 jmlr-2011-Learning with Structured Sparsity

11 0.30106935 29 jmlr-2011-Efficient Learning with Partially Observed Attributes

12 0.30031934 12 jmlr-2011-Bayesian Co-Training

13 0.30012798 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms

14 0.30004519 104 jmlr-2011-X-Armed Bandits

15 0.29998967 78 jmlr-2011-Producing Power-Law Distributions and Damping Word Frequencies with Two-Stage Language Models

16 0.29978868 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models

17 0.29867956 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes

18 0.29817519 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms

19 0.29739919 76 jmlr-2011-Parameter Screening and Optimisation for ILP using Designed Experiments

20 0.29727939 21 jmlr-2011-Cumulative Distribution Networks and the Derivative-sum-product Algorithm: Models and Inference for Cumulative Distribution Functions on Graphs