jmlr jmlr2010 jmlr2010-42 knowledge-graph by maker-knowledge-mining

42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data


Source: pdf

Author: Gideon S. Mann, Andrew McCallum

Abstract: In this paper, we present an overview of generalized expectation criteria (GE), a simple, robust, scalable method for semi-supervised training using weakly-labeled data. GE fits model parameters by favoring models that match certain expectation constraints, such as marginal label distributions, on the unlabeled data. This paper shows how to apply generalized expectation criteria to two classes of parametric models: maximum entropy models and conditional random fields. Experimental results demonstrate accuracy improvements over supervised training and a number of other state-of-the-art semi-supervised learning methods for these models. Keywords: generalized expectation criteria, semi-supervised learning, logistic regression, conditional random fields

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 GE fits model parameters by favoring models that match certain expectation constraints, such as marginal label distributions, on the unlabeled data. [sent-8, score-0.646]

2 This paper shows how to apply generalized expectation criteria to two classes of parametric models: maximum entropy models and conditional random fields. [sent-9, score-0.415]

3 To use this data, we present generalized expectation criteria (GE), a method initially described as expectation regularization in Mann and McCallum (2007). [sent-16, score-0.373]

4 To calculate the divergence from the input distribution there is no need for additional training data, as the expected distribution on the unlabeled data can be used. [sent-18, score-0.389]
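
As a rough illustration of how such a term can be computed, the sketch below implements the label-regularization flavor of GE for a maximum entropy (multinomial logistic regression) model: the penalty is the KL divergence between a target label distribution (the input distribution) and the model's average predicted label distribution over the unlabeled pool. The function names, the λ weighting, and the numerical smoothing are assumptions made for illustration, not the paper's exact implementation.

```python
import numpy as np

def softmax(scores):
    # Row-wise softmax: p_theta(y | x) for each example.
    z = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def label_regularization(theta, X_unlabeled, target_marginals, lam=1.0):
    # GE / label-regularization penalty (sketch): lam * KL(g_tilde || q),
    # where q is the model's expected label distribution over the unlabeled
    # pool and g_tilde is the target marginal (e.g., known label proportions).
    probs = softmax(X_unlabeled @ theta)      # p_theta(y | x_u) for every x_u
    q = probs.mean(axis=0)                    # expected label marginal under the model
    g = np.asarray(target_marginals, dtype=float)
    kl = np.sum(g * np.log(np.maximum(g, 1e-12) / np.maximum(q, 1e-12)))
    return lam * float(kl)
```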

5 We demonstrate that for both of these models, label regularization is able to provide performance gains over other supervised and semi-supervised learning methods. [sent-27, score-0.545]

6 Traditional supervised learning takes as input fully labeled data, a set of tuples D = {(x, y)}, where x is the input and y the desired output. [sent-35, score-0.379]

7 In semisupervised learning approaches, a small amount of labeled data is augmented by unlabeled data, a set of elements U = {x}, which it exploits to choose a similar function: J(D ∪ U) = f. [sent-39, score-0.476]

8 This section presents the main methods for semi-supervised learning from labeled and unlabeled data: 1) bootstrapping, 2) expectation maximization, 3) feature discovery, 4) decision boundaries in sparse region methods, and 5) graph-based methods. [sent-40, score-0.552]

9 Instead, in addition to unlabeled data there is side information that has been provided to the learner, for example, expectation constraints like marginal label distributions g_y = p(y). [sent-43, score-0.677]

10 In bootstrapping, or self-training approaches, a classifier is first trained on the fully labeled instances and then applied to unlabeled instances. [sent-50, score-0.591]

11 Some subset of those newly labeled instances are then used (in conjunction with the original labeled instances) to retrain the model. [sent-51, score-0.401]
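
A generic self-training loop of this kind might look like the sketch below; the logistic-regression base learner, the 0.95 confidence threshold, and the fixed number of rounds are illustrative choices rather than the setup of any particular cited system.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_labeled, y_labeled, X_unlabeled, confidence=0.95, rounds=5):
    X, y, pool = X_labeled, y_labeled, X_unlabeled
    model = None
    for _ in range(rounds):
        model = LogisticRegression(max_iter=1000).fit(X, y)  # retrain on current labeled set
        if len(pool) == 0:
            break
        probs = model.predict_proba(pool)
        keep = probs.max(axis=1) >= confidence               # most confident predictions only
        if not keep.any():
            break
        pseudo = model.classes_[probs[keep].argmax(axis=1)]  # predictions become training labels
        X = np.vstack([X, pool[keep]])
        y = np.concatenate([y, pseudo])
        pool = pool[~keep]                                   # remove newly labeled instances from the pool
    return model
```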

12 To apply EM to semisupervised data, the log-likelihood function, log L(θ; x, y), is set to be a composite of labeled and unlabeled data (possibly with a weighting factor to down-weight the contribution to the likelihood from the unlabeled data). [sent-64, score-0.78]

13 Another twist on this technique is to blend generative and discriminative models, by combining ML estimates over the labeled data with EM parameter estimates over the unlabeled data, for a joint model which combines a CRF and an HMM (Suzuki et al. [sent-77, score-0.582]

14 As an alternative to estimating a classifier directly with unlabeled data, a number of groups have explored using the unlabeled data for feature induction or feature discovery, and those features are then embedded into a traditional supervised learning problem. [sent-87, score-0.778]

15 A latent clustering is applied over the unlabeled data to learn a function f_l, which is used to provide additional features f_l(x) = z; these features are then used to augment the original labeled data, D̄ = ∪_{(x_d, y_d) ∈ D} (x_d ∪ z_d, y_d), and then supervised learning proceeds as usual. [sent-88, score-0.796]
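
A sketch of this feature-discovery pattern is given below: cluster the unlabeled data to obtain f_l, append the (one-hot) cluster assignment z_d = f_l(x_d) to each labeled example, and then train a standard supervised classifier on the augmented representation. The use of k-means and one-hot cluster indicators is an assumption for illustration, not the specific induction method of any cited work.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def train_with_cluster_features(X_labeled, y_labeled, X_unlabeled, n_clusters=50):
    # Learn a latent clustering f_l on the unlabeled data.
    clusterer = KMeans(n_clusters=n_clusters, n_init=10).fit(X_unlabeled)

    def augment(X):
        # Append the cluster assignment z = f_l(x) as one-hot features.
        z = np.eye(n_clusters)[clusterer.predict(X)]
        return np.hstack([X, z])

    # Supervised learning then proceeds as usual on the augmented labeled data.
    clf = LogisticRegression(max_iter=1000).fit(augment(X_labeled), y_labeled)
    return clf, augment
```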

16 One illustrative example is entropy regularization (Grandvalet and Bengio, 2004), where a traditional conditional label log-likelihood objective function is augmented with an additional term that minimizes the entropy of the label distribution on the unlabeled data: O(θ; D, U) = L(θ; D) + H(θ; U) = ∑_d log p(y_d | x_d) − λ ∑_u ∑_y p(y | x_u) log p(y | x_u). [sent-115, score-0.64] [sent-117, score-0.711]

18 In entropy regularization, the hyper-parameter λ has a dramatic effect on the performance of the learner, since it must be tuned with regard to the amount of labeled and unlabeled data. [sent-119, score-0.651]

19 Entropy regularization is particularly difficult to apply in cases of very small amounts of labeled data, since in one degenerate case, the model could select one output label for all possible inputs. [sent-120, score-0.594]

20 Benchmark tests have shown that entropy regularization performs as well as TSVMs (when the SVM is given a linear kernel) (Chapelle et al. [sent-129, score-0.365]

21 In these methods, a graph, typically with weighted edges, is constructed spanning the labeled and unlabeled instances. [sent-135, score-0.476]

22 Zhu and Ghahramani (2002) propose label propagation, where labels propagate from labeled instances to unlabeled instances (see Algorithm 2). [sent-137, score-0.822]
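
A minimal sketch of this propagation scheme (in the spirit of the Zhu and Ghahramani algorithm) is shown below; the dense affinity matrix, uniform initialization, and fixed iteration count are simplifications made for clarity.

```python
import numpy as np

def label_propagation(W, labeled_idx, y_labeled, n_labels, iters=100):
    # W: (n, n) affinity matrix over all labeled + unlabeled instances (nonzero row sums).
    # y_labeled: integer labels in {0, ..., n_labels-1} for the nodes in labeled_idx.
    n = W.shape[0]
    T = W / W.sum(axis=1, keepdims=True)          # row-normalized transition matrix
    F = np.full((n, n_labels), 1.0 / n_labels)    # soft label distribution for every node
    clamp = np.zeros((len(labeled_idx), n_labels))
    clamp[np.arange(len(labeled_idx)), y_labeled] = 1.0
    F[labeled_idx] = clamp
    for _ in range(iters):
        F = T @ F                                  # labels flow along weighted edges
        F[labeled_idx] = clamp                     # re-clamp the labeled instances
    return F.argmax(axis=1)                        # hard labels for all instances
```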

23 By sub-sampling unlabeled data, one can reduce run-time from O(n^3) to O(m^2 n), where m is the subsampled number of unlabeled data points (Delalleau et al. [sent-142, score-0.608]

24 At the very least, labeled and unlabeled data must be stored in order to classify new examples. [sent-154, score-0.476]

25 Graph-based methods have used class proportions for post-processing to set thresholds on label propagation (Zhu et al. [sent-184, score-0.363]

26 Schuurmans (1997) uses predicted label distributions on unlabeled data for model structure selection (as opposed to parameter estimation). [sent-186, score-0.572]

27 Generalized expectation criteria are unique in that they use the expected distribution as the sole criterion for optimizing the model parameters on one set of unlabeled data (though they may also use labeled data). [sent-213, score-0.642]

28 (2002) model, since if the model exactly matched the label expectation on a per-instance basis, in application it would assign all instances to the majority class. [sent-220, score-0.365]

29 , label marginals in label regularization), constraints over individual instances, and expectation constraints which are more expressive than the base model can model directly (e. [sent-229, score-0.598]

30 Alternatively, an entropy regularization term (Grandvalet and Bengio, 2004) can be combined into the above objective function in the same manner: O(θ; U, D) = L(θ; D) + G(θ; U) + H(θ; U). [sent-244, score-0.365]
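
To make the structure of this combined objective explicit, it can be sketched as follows; the KL form shown for the GE term follows the label-regularization description above, while the weight λ_G and the exact sign conventions are assumptions that should be checked against the paper itself.

```latex
O(\theta;\,\mathcal{U},\mathcal{D})
  \;=\; \underbrace{\sum_{(x_d,\,y_d)\in\mathcal{D}} \log p_\theta(y_d \mid x_d)}_{L(\theta;\,\mathcal{D})}
  \;\underbrace{-\;\lambda_G\, D_{\mathrm{KL}}\!\Big(\tilde{g} \,\Big\|\, \tfrac{1}{|\mathcal{U}|}\textstyle\sum_{x_u\in\mathcal{U}} p_\theta(\cdot \mid x_u)\Big)}_{G(\theta;\,\mathcal{U})}
  \;+\; \underbrace{H(\theta;\,\mathcal{U})}_{\text{entropy-regularization term of Grandvalet and Bengio (2004)}}
```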

31 The effect of adding this term is to ensure that the model applied to the unlabeled data matches the label proportions. [sent-260, score-0.536]

32 (2008) compares traditional labeled data and labeled features (which can be used to build feature marginal distributions). [sent-269, score-0.391]

33 That study finds that given the same amount of time, human annotation in the form of labeled features and classifiers trained using GE criteria outperform human annotation of traditional labeled instances and maximum likelihood training. [sent-270, score-0.688]

34 For the structured output case, Mann and McCallum (2008) have shown that expectations over features for CRF learning, similar to the prototypes proposed in Haghighi and Klein (2006b), are more effective when used with GE than similar numbers of labeled tokens used to train a CRF. [sent-271, score-0.392]

35 (2009) presents an alternative objective function to learn using expectation constraints over unlabeled data. [sent-278, score-0.409]

36 In this setting, for an arbitrary label y_i you can find parameter settings where ∀x, p(y_i | x) = 1, by having parameters θ = {θ_k = ∞; θ_j = 0, ∀ j ≠ k} where ψ_k(x, y) = 1{y = y_i}. [sent-303, score-0.46]

37 Label regularization can occasionally find a degenerate solution where, rather than the expectation over all instances matching the input distribution, the distribution over labels for each individual instance matches the given distribution on every example. [sent-305, score-0.399]

38 For each label s, it requires one pass to create a lattice with ∑_{m=1}^{i−1} p(y_{i−1}, y_i, y_m = s | x; θ) for all pairs (y_i, y_{i+1}). [sent-352, score-0.443]

39 We present two sets of experiments: experiments on maximum entropy models and on conditional random fields for the special case of generalized expectation criteria, label regularization. [sent-356, score-0.557]

40 We present learning curves, where the amount of labeled training data is gradually increased from one instance per class up to thousands of instances and demonstrate that generalized expectation criteria are able to show improvements for both types of scenarios. [sent-358, score-0.469]

41 While Gaussian regularization can have an effect for label regularization, for more complicated GE variants it does not dramatically affect performance. [sent-373, score-0.422]

42 (2006), compare with the published results and show that label regularization is able to outperform previous methods. [sent-377, score-0.422]

43 Across all of the experiments, for supervised comparisons, we compare with naïve Bayes and maximum entropy models; for semi-supervised comparisons we compare with naïve Bayes trained with EM and maximum entropy models trained with entropy regularization. [sent-383, score-0.764]

44 Presumably, training a graph-based method on the entire unlabeled training set would have been technically infeasible. [sent-389, score-0.39]

45 While finite-state methods could also be applied in these cases, the cost of training label regularization would be prohibitive, and we found that these methods work well. [sent-394, score-0.465]

46 Table 2: Label regularization outperforms other semi-supervised learning methods at 100 labeled data points. [sent-437, score-0.362]

47 For the maximum entropy model trained with entropy regularization, after some experimentation, we weighted its contribution to the objective function with λ = # labeled data points / # unlabeled data points. [sent-448, score-0.884]

48 3 presents experiments showing robustness to noisy label proportions both when smoothed towards a uniform distribution and when sampled from a limited number of training examples. [sent-452, score-0.372]

49 Across the experiments, we observed that label regularization trains in time linear in the amount of unlabeled data, and since the resulting model is parametric, it is linear in the number of features for evaluation. [sent-453, score-0.773]

50 For the first set of experiments, we experimented with varying the number of supervised training examples, while keeping the unlabeled data set size the same (with sizes of the unlabeled data shown…

51 As Table 2 shows, for SecStr, label regularization outperforms the other methods at 100 labeled points, and approaches the cluster kernel method on 1000 points. [sent-464, score-0.631]

52 While performance results were not available for the other methods for two instances per class, we ran label regularization for this case and found that it outperforms the supervised SVM and maximum entropy model when they are trained with 100 labeled points. [sent-465, score-1.007]

53 In these experiments QC is not run over the complete data (presumably because of scalability problems), but operates on a subset, either selected randomly (randsub) or in a smarter fashion (smartonly and smartsub), while the label regularization method uses the complete data. [sent-466, score-0.422]

54 Figures 1, 2, 3, and 4 show classifier performance using a fixed amount of unlabeled data as greater amounts of labeled data are added. [sent-467, score-0.476]

55 Label regularization on SRAA shows a benefit over the fully supervised maximum entropy model but its accuracy is not as high as that obtained by the EM-trained naïve Bayes learner. [sent-469, score-0.488]

56 This may be partly explained by the fact that the baseline performance of the discriminative maximum entropy model is much lower than the generative naïve Bayes model, so that label regularization starts off at a considerable deficit. [sent-470, score-0.703]

57 While alternative methods often result in degradations of performance over their supervised counterparts (EM, entropy regularization, cluster kernels), in these experiments label regularization consistently yielded improved accuracy. [sent-471, score-0.757]

58 Additionally, the benefit of label regularization is more apparent as the feature sets and numbers of unlabeled instances increase, with the least improvements on one of the simplest tasks, the SRAA text classification task. [sent-472, score-0.783]

59 These experiments demonstrate that label regularization can at least match, and in many cases beat, alternative methods of semi-supervised learning, given minimal additional information and access to large samples of unlabeled data. [sent-473, score-0.726]

60 Figure 3: POS: Label regularization (GE) outperforms all other methods, though performance improvements over supervised maximum entropy methods appear to level off at 1300 labeled instances. [sent-476, score-0.66]

61 …additional, alternative modalities of supervision which will generate data that can be added to supervised classifiers and combined with unlabeled data in order to improve performance. [sent-477, score-0.427]

62 Figure 4: SRAA: Label regularization (GE) outperforms its supervised maximum entropy counterpart and entropy regularization and is the winner at one labeled instance per class. [sent-483, score-1.025]

63 Even when the human has no domain knowledge to contribute, label distribution estimates of sufficient accuracy should be obtainable from a reasonably small number of labeled examples. [sent-490, score-0.45]

64 One question that needs to be addressed is whether label regularization is improving performance solely by adjusting the label proportions or is operating in some other fashion. [sent-536, score-0.751]

65 (2008) reports results of experiments using GE criteria where the input distributions were only allowed to affect the features they were conditioned on (corresponding to only being able to adjust the label proportions). [sent-542, score-0.447]
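
A rough sketch of how such feature-conditional GE terms can be computed is given below: for each annotated feature k, the model's predicted label distribution over the unlabeled instances in which that feature fires is compared, via KL divergence, to a target distribution for that feature. The masking rule, normalization, and λ weighting are assumptions made for illustration rather than the exact formulation used in the cited work.

```python
import numpy as np

def feature_conditional_ge(theta, X_unlabeled, feature_targets, lam=1.0):
    # feature_targets: dict mapping feature index k -> target label distribution p~(y | k fires).
    scores = X_unlabeled @ theta
    scores = scores - scores.max(axis=1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)     # p_theta(y | x_u) for every unlabeled x_u
    total = 0.0
    for k, target in feature_targets.items():
        mask = X_unlabeled[:, k] > 0              # unlabeled instances where feature k fires
        if not mask.any():
            continue
        q = probs[mask].mean(axis=0)              # model's label distribution given feature k
        g = np.asarray(target, dtype=float)
        total += np.sum(g * np.log(np.maximum(g, 1e-12) / np.maximum(q, 1e-12)))
    return lam * float(total)
```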

66 Here we examine combining label regularization with entropy regularization where the objective function is augmented with more than one regularization criterion. [sent-551, score-0.977]

67 For many of the experiments, combining label regularization and entropy regularization does not lead to improvements. [sent-552, score-0.787]

68 Notably, on SecStr, combined entropy regularization and label regularization yields a performance of 66. [sent-554, score-0.787]

69 Figure 11: CoNLL03: On this data set, combining label regularization and entropy regularization does not lead to any benefit. [sent-557, score-0.787]

70 can sometimes be a benefit over the use of label regularization or entropy regularization alone. [sent-559, score-0.787]

71 In comparison, Figure 11 shows that there is little or no difference in performance when entropy regularization is combined with label regularization in the case of CoNLL03. [sent-560, score-0.787]

72 In this section, we examine the performance of label regularization for conditional random fields. [sent-562, score-0.465]

73 We compare against two previous methods for semi-supervised learning of conditional random fields, entropy regularization and the clustering method proposed by Miller et al. [sent-564, score-0.448]

74 We demonstrate that label regularization can achieve higher accuracy than purely supervised training, and can beat or match… (Table 3 systems: MRF (prototypes); MRF (prototypes) + cluster; HMM supervised only (100); CRF supervised only (100); CRF supervised (100) + Ent.) [sent-567, score-1.017]

75 Table 3: APT: A CRF trained semi-supervised with 100 labeled examples and 2k unlabeled examples has the highest performance, beating the strictly supervised CRF and the CRF trained semi-supervised with entropy regularization. [sent-578, score-0.89]

76 We set the label regularization λ as before, but set the entropy regularization λ to 0.01 times the number of labeled examples divided by the number of unlabeled examples. [sent-596, score-0.787] [sent-597, score-0.476]

78 Table 3 compares the performance of our system relative to previous supervised and semisupervised systems, where at the maximum amount of training data, label regularization achieves a 1. [sent-604, score-0.588]

79 With accurate input distributions, the CRF trained with label regularization achieves the highest performance in all settings but the one-labeled-example case, where entropy regularization is the highest performer. [sent-608, score-0.887]

80 Table 4: APT: Use of input distributions estimated from limited labeled data yields smaller improvements than the true distributions do, but still improves over the strictly supervised accuracy, and at large levels of training data nearly matches the accuracy achieved with the true distributions. [sent-654, score-0.457]

81 Table 5: CITE: Label regularization (GE) is able to significantly improve on the purely supervised cases at very low levels of training data. [sent-696, score-0.397]

82 regularization, label regularization improves over supervised learning across all amounts of training data. [sent-699, score-0.588]

83 When these noisy input distributions were used, the performance was less than with entropy regularization at lower levels of training data, but it still provided consistent gains in performance across all training settings. [sent-701, score-0.57]

84 For this experiment we used 1,000 unlabeled examples, with λ = 1 times the number of unlabeled data points, and the entropy regularizer set as before. [sent-710, score-0.783]

85 On this set of data, as shown in Table 5, label regularization gives a slight win at low levels of training data, but at higher levels, it does not provide any improvement. [sent-711, score-0.506]

86 The difference between the two data sets suggests that more work is needed to understand in what situations label regularization can be expected to work well. [sent-714, score-0.422]

87 Table 6: APT: Using clustering features gives an additional gain to label regularization for low levels of training data, but it doesn’t provide any additional benefit at higher levels of training data. [sent-755, score-0.677]

88 The strengths of label regularization over entropy regularization are two-fold. [sent-762, score-0.787]

89 GE rarely performs worse than purely supervised training (when the input distributions are accurate), whereas the performance of entropy regularization is erratic, sometimes yielding a gain in performance, in other cases leading to a severe decrease in performance. [sent-765, score-0.609]

90 In his method, the unlabeled data undergoes word clustering, and then features corresponding to clusters are added during supervised training. [sent-769, score-0.512]

91 Table 6 shows that for one labeled example, 2,000 unlabeled examples, and unsupervised clustering, the system achieves almost a 5% improvement by using these unsupervised cluster features (45. [sent-772, score-0.63]

92 At higher levels of training data, label regularization and clustering features interact poorly, and using them together leads to no improvement. [sent-775, score-0.593]

93 While at lower levels of training data label regularization is able to achieve more gains than the clustering method, at higher levels it only matches the clustering method's performance. [sent-776, score-0.587]

94 This method penalizes models by the divergence between the model’s expectations over the unlabeled data and input conditional probabilities, which can be estimated from labeled data or given as a priori knowledge by a human annotator. [sent-782, score-0.656]

95 An important special case, label regularization, is empirically explored for the case of maximum entropy models, where we find it to provide accuracy improvements over entropy regularization, naïve Bayes EM, the Quadratic Cost Criterion (a representative graph-based method), and a cluster kernel SVM. [sent-783, score-0.809]

96 We show that the method is robust to noise in the estimates of the input conditional probabilities, the meta-parameters need little or no tuning, and that it runs in linear time with increasing numbers of unlabeled examples. [sent-784, score-0.389]

97 Conditional random fields trained with label regularization outperform alternative methods for semi-supervised training such as entropy regularization, and can achieve 1% to 8% improvement over supervised only approaches. [sent-787, score-0.821]

98 Learning from labeled and unlabeled data using graph mincut. [sent-845, score-0.476]

99 Learning to classify text from labeled and unlabeled documents. [sent-1131, score-0.476]

100 Learning from labeled and unlabeled data with label propagation. [sent-1244, score-0.708]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ge', 0.46), ('unlabeled', 0.304), ('label', 0.232), ('regularization', 0.19), ('allum', 0.188), ('riteria', 0.176), ('entropy', 0.175), ('labeled', 0.172), ('mann', 0.15), ('crf', 0.15), ('xpectation', 0.133), ('druck', 0.129), ('supervised', 0.123), ('eneralized', 0.116), ('yi', 0.114), ('secstr', 0.113), ('sraa', 0.113), ('proportions', 0.097), ('ym', 0.097), ('em', 0.096), ('ann', 0.094), ('criteria', 0.09), ('mccallum', 0.09), ('haghighi', 0.087), ('xd', 0.084), ('expectation', 0.076), ('chapelle', 0.075), ('bioii', 0.075), ('tsvms', 0.075), ('na', 0.075), ('klein', 0.071), ('discriminative', 0.068), ('apartment', 0.063), ('yd', 0.063), ('pos', 0.063), ('acl', 0.061), ('trained', 0.058), ('prototypes', 0.058), ('instances', 0.057), ('zien', 0.055), ('apt', 0.05), ('expectations', 0.049), ('ganchev', 0.048), ('citation', 0.048), ('features', 0.047), ('human', 0.046), ('training', 0.043), ('conditional', 0.043), ('bootstrapping', 0.042), ('transductive', 0.042), ('input', 0.042), ('weakly', 0.041), ('levels', 0.041), ('clustering', 0.04), ('classi', 0.039), ('grenager', 0.039), ('nigam', 0.039), ('word', 0.038), ('schapire', 0.038), ('generative', 0.038), ('biocreativeii', 0.038), ('cmn', 0.038), ('cozman', 0.038), ('distantly', 0.038), ('jiao', 0.038), ('ve', 0.037), ('miller', 0.037), ('qc', 0.037), ('tagging', 0.037), ('cluster', 0.037), ('labeling', 0.037), ('elds', 0.037), ('xu', 0.036), ('icml', 0.036), ('harmonic', 0.036), ('distributions', 0.036), ('scholkopf', 0.036), ('sliding', 0.036), ('unsupervised', 0.035), ('bayes', 0.034), ('match', 0.034), ('priors', 0.034), ('hmm', 0.034), ('zhu', 0.034), ('propagation', 0.034), ('structured', 0.034), ('gideon', 0.033), ('suzuki', 0.033), ('tokens', 0.032), ('eu', 0.032), ('beat', 0.032), ('bellare', 0.032), ('er', 0.031), ('contrastive', 0.031), ('generalized', 0.031), ('tuning', 0.03), ('grammar', 0.03), ('blum', 0.03), ('constraints', 0.029), ('disambiguation', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999774 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data

Author: Gideon S. Mann, Andrew McCallum

Abstract: In this paper, we present an overview of generalized expectation criteria (GE), a simple, robust, scalable method for semi-supervised training using weakly-labeled data. GE fits model parameters by favoring models that match certain expectation constraints, such as marginal label distributions, on the unlabeled data. This paper shows how to apply generalized expectation criteria to two classes of parametric models: maximum entropy models and conditional random fields. Experimental results demonstrate accuracy improvements over supervised training and a number of other stateof-the-art semi-supervised learning methods for these models. Keywords: generalized expectation criteria, semi-supervised learning, logistic regression, conditional random fields

2 0.23407276 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models

Author: Kuzman Ganchev, João Graça, Jennifer Gillenwater, Ben Taskar

Abstract: We present posterior regularization, a probabilistic framework for structured, weakly supervised learning. Our framework efficiently incorporates indirect supervision via constraints on posterior distributions of probabilistic models with latent variables. Posterior regularization separates model complexity from the complexity of structural constraints it is desired to satisfy. By directly imposing decomposable regularization on the posterior moments of latent variables during learning, we retain the computational efficiency of the unconstrained model while ensuring desired constraints hold in expectation. We present an efficient algorithm for learning with posterior regularization and illustrate its versatility on a diverse set of structural constraints such as bijectivity, symmetry and group sparsity in several large scale experiments, including multi-view learning, cross-lingual dependency grammar induction, unsupervised part-of-speech induction, and bitext word alignment.1 Keywords: posterior regularization framework, unsupervised learning, latent variables models, prior knowledge, natural language processing

3 0.21402752 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions

Author: Shiliang Sun, John Shawe-Taylor

Abstract: In this paper, we propose a general framework for sparse semi-supervised learning, which concerns using a small portion of unlabeled data and a few labeled data to represent target functions and thus has the merit of accelerating function evaluations when predicting the output of a new example. This framework makes use of Fenchel-Legendre conjugates to rewrite a convex insensitive loss involving a regularization with unlabeled data, and is applicable to a family of semi-supervised learning methods such as multi-view co-regularized least squares and single-view Laplacian support vector machines (SVMs). As an instantiation of this framework, we propose sparse multi-view SVMs which use a squared ε-insensitive loss. The resultant optimization is an inf-sup problem and the optimal solutions have arguably saddle-point properties. We present a globally optimal iterative algorithm to optimize the problem. We give the margin bound on the generalization error of the sparse multi-view SVMs, and derive the empirical Rademacher complexity for the induced function class. Experiments on artificial and real-world data show their effectiveness. We further give a sequential training approach to show their possibility and potential for uses in large-scale problems and provide encouraging experimental results indicating the efficacy of the margin bound and empirical Rademacher complexity on characterizing the roles of unlabeled data for semi-supervised learning. Keywords: semi-supervised learning, Fenchel-Legendre conjugate, representer theorem, multiview regularization, support vector machine, statistical learning theory

4 0.11066916 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels

Author: Pinar Donmez, Guy Lebanon, Krishnakumar Balasubramanian

Abstract: Estimating the error rates of classifiers or regression models is a fundamental task in machine learning which has thus far been studied exclusively using supervised learning techniques. We propose a novel unsupervised framework for estimating these error rates using only unlabeled data and mild assumptions. We prove consistency results for the framework and demonstrate its practical applicability on both synthetic and real world data. Keywords: classification and regression, maximum likelihood, latent variable models

5 0.096311659 29 jmlr-2010-Covariance in Unsupervised Learning of Probabilistic Grammars

Author: Shay B. Cohen, Noah A. Smith

Abstract: Probabilistic grammars offer great flexibility in modeling discrete sequential data like natural language text. Their symbolic component is amenable to inspection by humans, while their probabilistic component helps resolve ambiguity. They also permit the use of well-understood, generalpurpose learning algorithms. There has been an increased interest in using probabilistic grammars in the Bayesian setting. To date, most of the literature has focused on using a Dirichlet prior. The Dirichlet prior has several limitations, including that it cannot directly model covariance between the probabilistic grammar’s parameters. Yet, various grammar parameters are expected to be correlated because the elements in language they represent share linguistic properties. In this paper, we suggest an alternative to the Dirichlet prior, a family of logistic normal distributions. We derive an inference algorithm for this family of distributions and experiment with the task of dependency grammar induction, demonstrating performance improvements with our priors on a set of six treebanks in different natural languages. Our covariance framework permits soft parameter tying within grammars and across grammars for text in different languages, and we show empirical gains in a novel learning setting using bilingual, non-parallel data. Keywords: dependency grammar induction, variational inference, logistic normal distribution, Bayesian inference

6 0.086619705 102 jmlr-2010-Semi-Supervised Novelty Detection

7 0.08162722 117 jmlr-2010-Why Does Unsupervised Pre-training Help Deep Learning?

8 0.075460985 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond

9 0.073896699 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide

10 0.073413134 22 jmlr-2010-Classification Using Geometric Level Sets

11 0.072286889 61 jmlr-2010-Learning From Crowds

12 0.070322558 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM

13 0.07010527 33 jmlr-2010-Efficient Heuristics for Discriminative Structure Learning of Bayesian Network Classifiers

14 0.068039872 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization

15 0.067871973 38 jmlr-2010-Expectation Truncation and the Benefits of Preselection In Training Generative Models

16 0.066485502 24 jmlr-2010-Collective Inference for Extraction MRFs Coupled with Symmetric Clique Potentials

17 0.061853331 109 jmlr-2010-Stochastic Composite Likelihood

18 0.059990231 90 jmlr-2010-Permutation Tests for Studying Classifier Performance

19 0.057687324 44 jmlr-2010-Graph Kernels

20 0.053659152 53 jmlr-2010-Inducing Tree-Substitution Grammars


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.278), (1, 0.02), (2, -0.183), (3, 0.177), (4, 0.067), (5, -0.038), (6, 0.134), (7, 0.119), (8, 0.045), (9, -0.072), (10, -0.128), (11, -0.166), (12, 0.346), (13, 0.166), (14, 0.165), (15, 0.076), (16, 0.003), (17, 0.043), (18, 0.115), (19, 0.131), (20, -0.03), (21, 0.098), (22, 0.051), (23, -0.1), (24, -0.004), (25, 0.018), (26, -0.072), (27, -0.043), (28, -0.041), (29, -0.04), (30, 0.06), (31, -0.038), (32, 0.003), (33, -0.102), (34, -0.034), (35, 0.107), (36, 0.028), (37, 0.042), (38, -0.03), (39, 0.023), (40, 0.011), (41, 0.042), (42, -0.097), (43, -0.085), (44, -0.015), (45, 0.013), (46, -0.05), (47, -0.111), (48, -0.043), (49, -0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9549008 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data

Author: Gideon S. Mann, Andrew McCallum

Abstract: In this paper, we present an overview of generalized expectation criteria (GE), a simple, robust, scalable method for semi-supervised training using weakly-labeled data. GE fits model parameters by favoring models that match certain expectation constraints, such as marginal label distributions, on the unlabeled data. This paper shows how to apply generalized expectation criteria to two classes of parametric models: maximum entropy models and conditional random fields. Experimental results demonstrate accuracy improvements over supervised training and a number of other stateof-the-art semi-supervised learning methods for these models. Keywords: generalized expectation criteria, semi-supervised learning, logistic regression, conditional random fields

2 0.75214869 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models

Author: Kuzman Ganchev, João Graça, Jennifer Gillenwater, Ben Taskar

Abstract: We present posterior regularization, a probabilistic framework for structured, weakly supervised learning. Our framework efficiently incorporates indirect supervision via constraints on posterior distributions of probabilistic models with latent variables. Posterior regularization separates model complexity from the complexity of structural constraints it is desired to satisfy. By directly imposing decomposable regularization on the posterior moments of latent variables during learning, we retain the computational efficiency of the unconstrained model while ensuring desired constraints hold in expectation. We present an efficient algorithm for learning with posterior regularization and illustrate its versatility on a diverse set of structural constraints such as bijectivity, symmetry and group sparsity in several large scale experiments, including multi-view learning, cross-lingual dependency grammar induction, unsupervised part-of-speech induction, and bitext word alignment.1 Keywords: posterior regularization framework, unsupervised learning, latent variables models, prior knowledge, natural language processing

3 0.72138405 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions

Author: Shiliang Sun, John Shawe-Taylor

Abstract: In this paper, we propose a general framework for sparse semi-supervised learning, which concerns using a small portion of unlabeled data and a few labeled data to represent target functions and thus has the merit of accelerating function evaluations when predicting the output of a new example. This framework makes use of Fenchel-Legendre conjugates to rewrite a convex insensitive loss involving a regularization with unlabeled data, and is applicable to a family of semi-supervised learning methods such as multi-view co-regularized least squares and single-view Laplacian support vector machines (SVMs). As an instantiation of this framework, we propose sparse multi-view SVMs which use a squared ε-insensitive loss. The resultant optimization is an inf-sup problem and the optimal solutions have arguably saddle-point properties. We present a globally optimal iterative algorithm to optimize the problem. We give the margin bound on the generalization error of the sparse multi-view SVMs, and derive the empirical Rademacher complexity for the induced function class. Experiments on artificial and real-world data show their effectiveness. We further give a sequential training approach to show their possibility and potential for uses in large-scale problems and provide encouraging experimental results indicating the efficacy of the margin bound and empirical Rademacher complexity on characterizing the roles of unlabeled data for semi-supervised learning. Keywords: semi-supervised learning, Fenchel-Legendre conjugate, representer theorem, multiview regularization, support vector machine, statistical learning theory

4 0.44035581 102 jmlr-2010-Semi-Supervised Novelty Detection

Author: Gilles Blanchard, Gyemin Lee, Clayton Scott

Abstract: A common setting for novelty detection assumes that labeled examples from the nominal class are available, but that labeled examples of novelties are unavailable. The standard (inductive) approach is to declare novelties where the nominal density is low, which reduces the problem to density level set estimation. In this paper, we consider the setting where an unlabeled and possibly contaminated sample is also available at learning time. We argue that novelty detection in this semi-supervised setting is naturally solved by a general reduction to a binary classification problem. In particular, a detector with a desired false positive rate can be achieved through a reduction to Neyman-Pearson classification. Unlike the inductive approach, semi-supervised novelty detection (SSND) yields detectors that are optimal (e.g., statistically consistent) regardless of the distribution on novelties. Therefore, in novelty detection, unlabeled data have a substantial impact on the theoretical properties of the decision rule. We validate the practical utility of SSND with an extensive experimental study. We also show that SSND provides distribution-free, learning-theoretic solutions to two well known problems in hypothesis testing. First, our results provide a general solution to the general two-sample problem, that is, the problem of determining whether two random samples arise from the same distribution. Second, a specialization of SSND coincides with the standard p-value approach to multiple testing under the so-called random effects model. Unlike standard rejection regions based on thresholded p-values, the general SSND framework allows for adaptation to arbitrary alternative distributions in multiple dimensions. Keywords: semi-supervised learning, novelty detection, Neyman-Pearson classification, learning reduction, two-sample problem, multiple testing

5 0.42171797 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels

Author: Pinar Donmez, Guy Lebanon, Krishnakumar Balasubramanian

Abstract: Estimating the error rates of classifiers or regression models is a fundamental task in machine learning which has thus far been studied exclusively using supervised learning techniques. We propose a novel unsupervised framework for estimating these error rates using only unlabeled data and mild assumptions. We prove consistency results for the framework and demonstrate its practical applicability on both synthetic and real world data. Keywords: classification and regression, maximum likelihood, latent variable models

6 0.40733689 61 jmlr-2010-Learning From Crowds

7 0.37876707 38 jmlr-2010-Expectation Truncation and the Benefits of Preselection In Training Generative Models

8 0.36118996 24 jmlr-2010-Collective Inference for Extraction MRFs Coupled with Symmetric Clique Potentials

9 0.33276582 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization

10 0.32425615 117 jmlr-2010-Why Does Unsupervised Pre-training Help Deep Learning?

11 0.32177526 33 jmlr-2010-Efficient Heuristics for Discriminative Structure Learning of Bayesian Network Classifiers

12 0.31414929 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization

13 0.30979079 3 jmlr-2010-A Fast Hybrid Algorithm for Large-Scalel1-Regularized Logistic Regression

14 0.30717593 29 jmlr-2010-Covariance in Unsupervised Learning of Probabilistic Grammars

15 0.27113801 22 jmlr-2010-Classification Using Geometric Level Sets

16 0.26621827 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM

17 0.25691554 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide

18 0.2568737 109 jmlr-2010-Stochastic Composite Likelihood

19 0.24847084 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond

20 0.23391612 9 jmlr-2010-An Efficient Explanation of Individual Classifications using Game Theory


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.012), (8, 0.02), (21, 0.023), (22, 0.027), (32, 0.055), (33, 0.013), (36, 0.036), (37, 0.444), (75, 0.119), (81, 0.012), (85, 0.071), (96, 0.052), (97, 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.9855032 41 jmlr-2010-Gaussian Processes for Machine Learning (GPML) Toolbox

Author: Carl Edward Rasmussen, Hannes Nickisch

Abstract: The GPML toolbox provides a wide range of functionality for Gaussian process (GP) inference and prediction. GPs are specified by mean and covariance functions; we offer a library of simple mean and covariance functions and mechanisms to compose more complex ones. Several likelihood functions are supported including Gaussian and heavy-tailed for regression as well as others suitable for classification. Finally, a range of inference methods is provided, including exact and variational inference, Expectation Propagation, and Laplace’s method dealing with non-Gaussian likelihoods and FITC for dealing with large regression tasks. Keywords: Gaussian processes, nonparametric Bayes, probabilistic regression and classification

2 0.93583614 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs

Author: Garvesh Raskutti, Martin J. Wainwright, Bin Yu

Abstract: Methods based on ℓ1 -relaxation, such as basis pursuit and the Lasso, are very popular for sparse regression in high dimensions. The conditions for success of these methods are now well-understood: (1) exact recovery in the noiseless setting is possible if and only if the design matrix X satisfies the restricted nullspace property, and (2) the squared ℓ2 -error of a Lasso estimate decays at the minimax optimal rate k log p , where k is the sparsity of the p-dimensional regression problem with additive n Gaussian noise, whenever the design satisfies a restricted eigenvalue condition. The key issue is thus to determine when the design matrix X satisfies these desirable properties. Thus far, there have been numerous results showing that the restricted isometry property, which implies both the restricted nullspace and eigenvalue conditions, is satisfied when all entries of X are independent and identically distributed (i.i.d.), or the rows are unitary. This paper proves directly that the restricted nullspace and eigenvalue conditions hold with high probability for quite general classes of Gaussian matrices for which the predictors may be highly dependent, and hence restricted isometry conditions can be violated with high probability. In this way, our results extend the attractive theoretical guarantees on ℓ1 -relaxations to a much broader class of problems than the case of completely independent or unitary designs. Keywords: Lasso, basis pursuit, random matrix theory, Gaussian comparison inequality, concentration of measure

3 0.88841981 110 jmlr-2010-The SHOGUN Machine Learning Toolbox

Author: Sören Sonnenburg, Gunnar Rätsch, Sebastian Henschel, Christian Widmer, Jonas Behr, Alexander Zien, Fabio de Bona, Alexander Binder, Christian Gehl, Vojtěch Franc

Abstract: We have developed a machine learning toolbox, called SHOGUN, which is designed for unified large-scale learning for a broad range of feature types and learning settings. It offers a considerable number of machine learning models such as support vector machines, hidden Markov models, multiple kernel learning, linear discriminant analysis, and more. Most of the specific algorithms are able to deal with several different data classes. We have used this toolbox in several applications from computational biology, some of them coming with no less than 50 million training examples and others with 7 billion test examples. With more than a thousand installations worldwide, SHOGUN is already widely adopted in the machine learning community and beyond. SHOGUN is , implemented in C++ and interfaces to MATLABTM R, Octave, Python, and has a stand-alone command line interface. The source code is freely available under the GNU General Public License, Version 3 at http://www.shogun-toolbox.org. Keywords: support vector machines, kernels, large-scale learning, Python, Octave, R

same-paper 4 0.88326246 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data

Author: Gideon S. Mann, Andrew McCallum

Abstract: In this paper, we present an overview of generalized expectation criteria (GE), a simple, robust, scalable method for semi-supervised training using weakly-labeled data. GE fits model parameters by favoring models that match certain expectation constraints, such as marginal label distributions, on the unlabeled data. This paper shows how to apply generalized expectation criteria to two classes of parametric models: maximum entropy models and conditional random fields. Experimental results demonstrate accuracy improvements over supervised training and a number of other stateof-the-art semi-supervised learning methods for these models. Keywords: generalized expectation criteria, semi-supervised learning, logistic regression, conditional random fields

5 0.85723436 57 jmlr-2010-Iterative Scaling and Coordinate Descent Methods for Maximum Entropy Models

Author: Fang-Lan Huang, Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin

Abstract: Maximum entropy (Maxent) is useful in natural language processing and many other areas. Iterative scaling (IS) methods are one of the most popular approaches to solve Maxent. With many variants of IS methods, it is difficult to understand them and see the differences. In this paper, we create a general and unified framework for iterative scaling methods. This framework also connects iterative scaling and coordinate descent methods. We prove general convergence results for IS methods and analyze their computational complexity. Based on the proposed framework, we extend a coordinate descent method for linear SVM to Maxent. Results show that it is faster than existing iterative scaling methods. Keywords: maximum entropy, iterative scaling, coordinate descent, natural language processing, optimization

6 0.70711529 104 jmlr-2010-Sparse Spectrum Gaussian Process Regression

7 0.67919016 118 jmlr-2010-libDAI: A Free and Open Source C++ Library for Discrete Approximate Inference in Graphical Models

8 0.62469357 116 jmlr-2010-WEKA−Experiences with a Java Open-Source Project

9 0.58217198 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions

10 0.57924229 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming

11 0.56340599 24 jmlr-2010-Collective Inference for Extraction MRFs Coupled with Symmetric Clique Potentials

12 0.55575311 96 jmlr-2010-Rate Minimaxity of the Lasso and Dantzig Selector for thelqLoss inlrBalls

13 0.54416519 109 jmlr-2010-Stochastic Composite Likelihood

14 0.53432763 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning

15 0.52783406 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring

16 0.52309322 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data

17 0.52118665 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide

18 0.52083391 15 jmlr-2010-Approximate Tree Kernels

19 0.51794577 29 jmlr-2010-Covariance in Unsupervised Learning of Probabilistic Grammars

20 0.51705372 54 jmlr-2010-Information Retrieval Perspective to Nonlinear Dimensionality Reduction for Data Visualization