nips nips2009 nips2009-192 knowledge-graph by maker-knowledge-mining

192 nips-2009-Posterior vs Parameter Sparsity in Latent Variable Models


Source: pdf

Author: Kuzman Ganchev, Ben Taskar, Fernando Pereira, João Gama

Abstract: We address the problem of learning structured unsupervised models with moment sparsity typical in many natural language induction tasks. For example, in unsupervised part-of-speech (POS) induction using hidden Markov models, we introduce a bias for words to be labeled by a small number of tags. In order to express this bias of posterior sparsity as opposed to parametric sparsity, we extend the posterior regularization framework [7]. We evaluate our methods on three languages — English, Bulgarian and Portuguese — showing consistent and significant accuracy improvement over EM-trained HMMs, and HMMs with sparsity-inducing Dirichlet priors trained by variational EM. We increase accuracy with respect to EM by 2.3%-6.5% in a purely unsupervised setting as well as in a weakly-supervised setting where the closed-class words are provided. Finally, we show improvements using our method when using the induced clusters as features of a discriminative model in a semi-supervised setting. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 For example, in unsupervised part-of-speech (POS) induction using hidden Markov models, we introduce a bias for words to be labeled by a small number of tags. [sent-4, score-0.286]

2 In order to express this bias of posterior sparsity as opposed to parametric sparsity, we extend the posterior regularization framework [7]. [sent-5, score-0.456]

3 We evaluate our methods on three languages — English, Bulgarian and Portuguese — showing consistent and significant accuracy improvement over EM-trained HMMs, and HMMs with sparsity-inducing Dirichlet priors trained by variational EM. [sent-6, score-0.158]

4 We increase accuracy with respect to EM by 2.3%-6.5% in a purely unsupervised setting as well as in a weakly-supervised setting where the closed-class words are provided. [sent-9, score-0.158]

5 In this paper we explore the problem of biasing such unsupervised models to favor a novel kind of sparsity that expresses our expectations about the role of the latent variables. [sent-13, score-0.314]

6 We extend the posterior regularization framework [7] to achieve that kind of posterior sparsity on the unlabeled training data. [sent-15, score-0.519]

7 In unsupervised part-of-speech (POS) tagging, a well studied yet challenging problem, the new method consistently and significantly improves performance over a non-sparse baseline and over a variational Bayes baseline with a Dirichlet prior used to encourage sparsity [9, 4]. [sent-16, score-0.399]

8 A common approach to unsupervised POS tagging is to train a hidden Markov model where the hidden states are the possible tags and the observations are word sequences. [sent-17, score-0.898]

9 Unfortunately, while supervised training of HMMs achieves relatively high accuracy, the unsupervised models tend to perform poorly. [sent-19, score-0.199]

10 One well-known reason for this is that EM tends to allow each word to be generated by most hidden states some of the time. [sent-20, score-0.278]

11 To solve this problem, several studies [14, 17, 6] investigated weakly-supervised approaches where the model is given the list of possible tags for each word. [sent-22, score-0.292]

12 The task is then to disambiguate among the possible tags for each word type. [sent-23, score-0.437]

13 Recent work has made use of smaller dictionaries, trying to model the set of possible tags for each word [18, 5], or use a small number of “prototypes” for each tag [8]. [sent-24, score-0.769]

14 All these approaches initialize the model in a way that encourages sparsity by zeroing out impossible tags. [sent-25, score-0.201]

15 This has been explored in a Bayesian setting, where a prior is used to encourage sparsity in the model parameters [4, 9, 6]. [sent-27, score-0.258]

16 This sparse prior, which prefers each tag to have few word types associated with it, indirectly achieves sparsity over the posteriors, meaning each word type should have few possible tags. [sent-28, score-0.892]

17 Our method differs in that it encourages sparsity in the model posteriors, more directly encoding the desiderata. [sent-29, score-0.201]

18 Additionally our method can be applied to log-linear models where sparsity in the parameters leads to dense posteriors. [sent-30, score-0.172]

19 We use a first-order HMM as our model to compare the different training conditions: classical expectation-maximization (EM) training without modifications to encourage sparsity, the sparse prior used by [9] with variational Bayes EM (VEM), and our sparse posterior regularization (Sparse). [sent-32, score-0.547]

20 We find that our method consistently improves performance with respect to both baselines in a completely unsupervised scenario, as well as in a weakly-supervised scenario where the tags of closed-class words are supplied. [sent-34, score-0.45]

21 Interestingly, while VEM achieves a state-size distribution (number of words assigned to hidden states) that is closer to the empirical tag distribution than EM and Sparse, its state-token distribution is a worse match to the empirical tag-token distribution than the competing methods. [sent-35, score-0.49]

22 Finally, we show that states assigned by the model are useful as features for a supervised POS tagger. [sent-36, score-0.138]

23 Posterior Regularization: In order to express the desired preference for posterior sparsity, we use the posterior regularization (PR) framework [7], which incorporates side information into parameter estimation in the form of linear constraints on posterior expectations. [sent-37, score-0.49]

24 This allows tractable learning and inference even when the constraints would be intractable to encode directly in the model, for instance to enforce that each hidden state in an HMM is used only once in expectation. [sent-38, score-0.165]

25 PR can be seen as a penalty on the standard marginal likelihood objective, which we define first. Marginal likelihood: L(θ) = E[− log pθ(x)] = E[− log Σz pθ(z, x)] over the parameters θ, where E is the empirical expectation over the unlabeled sample x, and z are the hidden states. [sent-40, score-0.155]

26 Posterior information in PR is specified with sets Qx of distributions over the hidden variables z, defined by linear constraints on feature expectations: Qx = {q(z|x) : Eq[f(x, z)] ≤ b}. [sent-42, score-0.136]

27 However, in PR, q is a projection of the posteriors onto the constraint set Qx for each example x: arg min over q ∈ Qx of KL(q(z|x) ‖ pθ(z|x)). [sent-46, score-0.146]

28 Left panel: initial tag distributions (columns) for 15 instances of a word. [sent-63, score-0.332]

29 Right panel: q concentrates the posteriors for all instances on the NN tag, reducing the ℓ1/ℓ∞ norm from just under 4 to a little over 1. [sent-65, score-0.135]

30 The new posteriors q(z|x) are used to compute sufficient statistics for this instance and hence to update the model’s parameters in the M step. [sent-66, score-0.108]

31 There is one dual variable per expectation constraint, which can be optimized by projected gradient descent, where the gradient for λ is b − Eq[f(x, z)]. [sent-69, score-0.119]

32 Relaxing Posterior Regularization: In this work, we modify PR so that, instead of imposing hard constraints on q(z|x), the constraints can be relaxed at a cost specified by a penalty. [sent-71, score-0.112]

33 Using soft constraints, the non-differentiable portion of the dual objective turns into simplex constraints on the dual variables, allowing us to use an efficient projected gradient method. [sent-75, score-0.219]

34 For POS tagging, we will design R(b) to encourage each word type to be observed with a small number of POS tags in the projected posteriors q. [sent-79, score-0.637]
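
A reading aid for the objective sketched above: one plausible way to write the relaxed PR training problem, consistent with the quantities named in this section (penalty strength σ, regularizer R(b), constraint features f) but reconstructed here rather than copied from the paper, is

min over θ of  L(θ) + E[ min over q, b of  KL(q(z|x) ‖ pθ(z|x)) + σ R(b)  s.t.  Eq[f(x, z)] ≤ b ],

where the outer E is the empirical expectation over unlabeled examples x, as in the marginal likelihood above.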

35 ℓ1/ℓ∞ regularization: We now choose the posterior constraint regularizer R(b) to encourage each word to be associated with only a few parts of speech. [sent-84, score-0.439]

36 Let feature fwti have value 1 whenever the ith occurrence of word w has part of speech tag t. [sent-85, score-0.57]

37 For every word w, we would like there to be only a few POS tags t such that there are occurrences i where t has nonzero probability. [sent-86, score-0.485]

38 This can be achieved if it “costs” a lot to allow an occurrence of a word to take a tag, but once that happens, it should be “free” for other occurrences of the word to receive that same tag. [sent-87, score-0.375]

39 More precisely, we would like the sum (ℓ1 norm) over tags t and word types w of the maxima (ℓ∞ norm) of the expectation of taking tag t over all occurrences of w to be small. [sent-88, score-0.86]
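
As a concrete illustration of this measure (a hedged sketch, not the authors' code; the variable names and toy input are assumptions), the following Python snippet computes the ℓ1/ℓ∞ value from per-occurrence tag posteriors:

import numpy as np

def l1_linf(posteriors_by_word):
    # posteriors_by_word: dict mapping each word type to an array of shape
    # (num_occurrences, num_tags), each row a posterior distribution over tags.
    # For each (word, tag) pair take the max over occurrences (the l-infinity
    # part), then sum over tags and word types (the l1 part).
    total = 0.0
    for word, post in posteriors_by_word.items():
        post = np.asarray(post)
        total += post.max(axis=0).sum()
    return total

# toy usage: one word seen three times, posteriors over four tags
example = {"bank": np.array([[0.70, 0.20, 0.05, 0.05],
                             [0.60, 0.30, 0.05, 0.05],
                             [0.10, 0.80, 0.05, 0.05]])}
print(l1_linf(example))  # 0.7 + 0.8 + 0.05 + 0.05 = 1.6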

40 Table 1 shows the value of the ℓ1/ℓ∞ sparsity measure for three different corpora, comparing a fully supervised HMM with a fully unsupervised HMM learned by standard EM; standard EM has a 3-4 times larger ℓ1/ℓ∞ value than the supervised model. [sent-89, score-0.34]

41 Note that setting σ = 0 recovers normal EM, where q is the model posterior distribution. [sent-94, score-0.113]

42 As σ → ∞, the constraints force each occurrence of a word type to have the same posterior distribution, effectively reducing the model to a 0th-order Markov chain in the E step. [sent-95, score-0.351]

43 In Equation 9, z ranges over assignments to the hidden tag variables for all of the occurrences in the training data, f(z) is the vector of fwti feature values for assignment z, λ is the vector of dual parameters λwti, and the primal parameters are q(z) ∝ pθ(z) exp(−λ · f(z)). [sent-98, score-0.59]

44 For simplicity suppose we are only regularizing one word and our model pθ is just a product distribution over 15 instances of the word. [sent-101, score-0.145]

45 The left panel in Figure 1 shows the posteriors under pθ . [sent-102, score-0.152]

46 The center panel of the figure shows the λ values determined by Equation 9, and the right panel shows the projected distribution q, which concentrates most of the posterior on the bottom row. [sent-104, score-0.261]

47 Note that we are not requiring the posteriors to be sparse, which would be equivalent to preferring that the distribution is peaked; rather, we want a word to concentrate its tag posterior on a few tags across all instances of the word. [sent-105, score-0.99]

48 Indeed, most of the instances (columns) become less peaked than in the original posterior to allow posterior mass to be redistributed away from the outlier tags. [sent-106, score-0.226]
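
The following Python sketch reproduces the flavor of this example; it applies the reweighting q(z) ∝ pθ(z) exp(−λ · f(z)) to a product distribution over instances of a single word, with hand-picked λ values used purely for illustration (the paper obtains λ from the dual optimization in Equation 9):

import numpy as np

rng = np.random.default_rng(0)

# model posteriors p for 15 instances (rows) of one word over 5 tags (columns);
# tag 0 plays the role of NN and tends to receive the most mass
p = rng.dirichlet(alpha=[3.0, 1.0, 1.0, 1.0, 1.0], size=15)

# dual variables lambda_wti: penalizing every tag except tag 0 pushes
# posterior mass toward the first column for all instances of the word
lam = np.zeros_like(p)
lam[:, 1:] = 2.0  # illustrative values, not the optimum of the dual

q = p * np.exp(-lam)               # unnormalized reweighted posteriors
q /= q.sum(axis=1, keepdims=True)  # renormalize each instance

def l1_linf(post):
    # sum over tags of the max over occurrences, as in the sparsity measure
    return post.max(axis=0).sum()

print("l1/linf before:", l1_linf(p))
print("l1/linf after: ", l1_linf(q))  # smaller: mass concentrates on one tag for the word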

49 Bayesian Estimators: Recent advances in inference methods for sparsifying Bayesian estimation have been applied to unsupervised POS tagging [4, 9, 6]. [sent-109, score-0.248]

50 In the Bayesian setting, preference for sparsity is expressed as a prior distribution over model structures and parameters, rather than as constraints on feature posteriors. [sent-110, score-0.292]

51 To compare these two approaches, in Section 5 we compare our method to a Bayesian approach proposed by Johnson [9], which relies on a Dirichlet prior to encourage sparsity in a first-order HMM for POS tagging. [sent-111, score-0.258]

52 The complete description of the model is: θi ∼ Dir(αi), P(ti | ti−1 = tag) ∼ Multi(θi); φi ∼ Dir(λi), P(wi | ti = tag) ∼ Multi(φi). Here, αi controls sparsity over the state transition matrix and λi controls the sparsity of the state emission probabilities. [sent-112, score-0.465]

53 In contrast, as λi approaches zero, it encourages the model to have highly skewed P (wi |ti = tag) distributions, that is, each tag is encouraged to generate a few words with high probability, and the rest with very low probability. [sent-114, score-0.41]
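
A minimal Python sketch of this generative setup (an illustration of the parameter-sparsity bias, not the code of [9]; the vocabulary size and hyperparameter values are assumptions, with 0.1 and 0.001 chosen to echo the 10^-1 and 10^-3 settings used in the experiments below):

import numpy as np

rng = np.random.default_rng(0)
num_tags, vocab_size = 5, 50

alpha = 0.1   # transition hyperparameter: small values favor sparse rows
lam = 0.001   # emission hyperparameter: smaller values -> each tag emits few word types

# one Dirichlet-distributed row per conditioning tag
transitions = rng.dirichlet([alpha] * num_tags, size=num_tags)   # P(t_i | t_{i-1})
emissions   = rng.dirichlet([lam] * vocab_size, size=num_tags)   # P(w_i | t_i)

# with lam near zero, most of each row's mass sits on a handful of words:
# each tag generates few word types with high probability (parameter sparsity),
# which is not the same as requiring each word type to take few tags
print((emissions > 0.01).sum(axis=1))  # number of "active" words per tag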

54 This is not exactly the constraint we would like to enforce: there are some POS tags that generate many different words with relatively high probability (for example, nouns and verbs), while each word is associated with a small number of tags. [sent-115, score-0.524]

55 Experiments: We now compare first-order HMMs trained using the three methods described earlier: the classical EM algorithm (EM), our ℓ1/ℓ∞ posterior regularization method (Sparse), and the model presented in Section 4 (VEM). [sent-126, score-0.227]

56 We also report results on the full Penn treebank tag set in the supplementary materials. [sent-128, score-0.392]

57 Table 1 gives statistics for each corpus as well as the sparsity for a first-order HMM trained using the labeled data and using standard EM with unlabeled data. [sent-131, score-0.288]

58 ℓ1/ℓ∞ is the value of the sparsity measure for a fully supervised HMM trained on all available data, and EM ℓ1/ℓ∞ is the value of the sparsity measure for a fully unsupervised HMM trained using standard EM on all available data. [sent-144, score-0.624]

59 For VEM we tested 4 different prior combinations (all combinations of 10^-1 and 10^-3 for the emission and transition priors), based on Johnson’s results [9]. [sent-151, score-0.117]

60 All models are first-order HMMs: EM trained using expectation maximization, VEM trained using variational EM with the observation priors shown in parentheses, and Sparse trained using PR with the constraint strength (σ) in parentheses. [sent-226, score-0.281]

61 (a) 1-many accuracy vs. ℓ1/ℓ∞, (b) 1-1 accuracy vs. ℓ1/ℓ∞, (c) tens of thousands of tokens assigned to each hidden state vs. rank, (d) mutual information in bits between the gold tag distribution and the hidden state distribution. [sent-238, score-0.83]

62 Predictions were obtained using posterior decoding since this consistently showed small improvements over Viterbi decoding. [sent-242, score-0.113]

63 We evaluate the accuracy of the models using two established mappings between hidden states and POS tags: 1-Many maps each hidden state to the tag with which it co-occurs the most; 1-1 [8] greedily picks a tag for each state under the constraint of never using the same tag twice. [sent-243, score-1.342]

64 If the numbers of hidden states and tags are not the same, some hidden states will be unassigned (and hence always wrong) or some tags not used. [sent-245, score-0.85]
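
As a hedged sketch of these two mappings (not the authors' evaluation code; the greedy order used for 1-1 is one reasonable choice, by co-occurrence count), the following Python snippet computes both accuracies from parallel lists of predicted state ids and gold tags:

from collections import Counter

def one_to_many_accuracy(states, gold):
    counts = Counter(zip(states, gold))
    best = {}  # state -> gold tag it co-occurs with most often
    for (s, t), c in counts.items():
        if s not in best or c > counts[(s, best[s])]:
            best[s] = t
    return sum(best.get(s) == t for s, t in zip(states, gold)) / len(gold)

def one_to_one_accuracy(states, gold):
    counts = Counter(zip(states, gold))
    mapping, used_tags = {}, set()
    # greedily accept (state, tag) pairs by co-occurrence count,
    # never reusing a state or a tag; unassigned states count as wrong
    for (s, t), _ in counts.most_common():
        if s not in mapping and t not in used_tags:
            mapping[s] = t
            used_tags.add(t)
    return sum(mapping.get(s) == t for s, t in zip(states, gold)) / len(gold)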

65 In all our experiments the number of hidden states is the same as the number of POS tags. [sent-246, score-0.133]

66 The left two plots show scatter graphs of accuracy with respect to the ℓ1/ℓ∞ value, where accuracy is measured with either the 1-many mapping (left) or the 1-1 mapping (center). [sent-252, score-0.126]

67 The third plot shows the number of tokens assigned to each hidden state at decoding time, in frequency rank order. [sent-254, score-0.152]

68 We assume that we are given the POS tags of closed classes along with the words in each closed class. [sent-263, score-0.415]

69 In the models, we set the emission probability from a closed-class tag to any word not in its class to zero. [sent-264, score-0.54]

70 Also, any word appearing in a closed class is assumed to have zero probability of being generated by an open-class tag. [sent-265, score-0.182]
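
A minimal Python sketch of these two constraints, assuming an emission matrix indexed as emissions[tag, word] and a dict closed_class mapping each closed-class tag id to the set of word ids in its class (these names are assumptions, and this illustrates the masking described above rather than the paper's implementation):

import numpy as np

def apply_closed_class_constraints(emissions, closed_class):
    emissions = emissions.copy()
    num_tags, vocab_size = emissions.shape
    closed_words = sorted(set().union(*closed_class.values())) if closed_class else []

    for tag in range(num_tags):
        if tag in closed_class:
            # a closed-class tag may only emit the words in its class
            mask = np.zeros(vocab_size, dtype=bool)
            mask[list(closed_class[tag])] = True
            emissions[tag, ~mask] = 0.0
        else:
            # open-class tags get zero probability on closed-class words
            emissions[tag, closed_words] = 0.0

    # renormalize each row so it remains a distribution
    # (assumes every row keeps some probability mass)
    return emissions / emissions.sum(axis=1, keepdims=True)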

71 This improves performance significantly for all languages, but our sparse training procedure is still able to outperform EM training significantly as shown in Table 3. [sent-266, score-0.16]

72 Note that for these experiments we do not use an unknown word, since doing so for closed-class words would allow closed-class tags to generate unknown words. [sent-267, score-0.378]

73 0) Table 3: Results with given closed-class tags, using posterior decoding, and projection at test time. [sent-292, score-0.113]

74 Supervised POS tagging: As a further comparison of the models trained using the different methods, we use them to generate features for a supervised POS tagger. [sent-296, score-0.28]

75 The basic supervised model has features for the identity of the current token as well as suffixes of length 2 and 3. [sent-297, score-0.118]

76 For each unsupervised training procedure (EM, Sparse, VEM) we train 10 models using different random initializations, obtaining 10 state identities per training method for each token. [sent-300, score-0.2]
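
A hedged sketch of the kind of feature extraction described here: baseline features are the token identity plus its suffixes of length 2 and 3, and each of the 10 differently initialized unsupervised runs contributes the hidden-state id it assigns to the token (the function name and feature string formats are assumptions):

def token_features(token, unsup_states):
    # token: the word string; unsup_states: hidden-state ids assigned to this
    # token by the differently initialized runs of one training method
    feats = ["word=" + token,
             "suffix2=" + token[-2:],
             "suffix3=" + token[-3:]]
    feats += ["state_run%d=%d" % (i, s) for i, s in enumerate(unsup_states)]
    return feats

# e.g. token_features("walking", [3, 3, 7, 1, 3, 3, 0, 3, 7, 3])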

77 Figure 3 shows the average accuracy of the supervised model as we vary the type of unsupervised features. [sent-302, score-0.205]

78 Related Work: Our learning method is very closely related to the work of Mann and McCallum [11, 12], who concurrently developed the idea of using penalties based on posterior expectations of features to guide learning. [sent-307, score-0.139]

79 They call their method generalized expectation (GE) constraints or alternatively expectation regularization. [sent-308, score-0.142]

80 They compare the two frameworks on several datasets and find that performance is similar, and we suspect that this would be true for the sparsity constraints also. [sent-314, score-0.228]

81 [15] show promising results in weakly-supervised POS tagging, where a tag dictionary is provided. [sent-318, score-0.332]

82 This sparse grammar and the dictionary are provided as input for training an unsupervised HMM. [sent-320, score-0.299]

83 Results show that using a sparse grammar, and hence enforcing sparsity over the possible transitions, leads to better results. [sent-321, score-0.442]

84 This method is different from ours in the sense that our method focuses on learning the sparsity pattern that their method uses as input. [sent-322, score-0.172]

85 Conclusion: We presented a new regularization method for unsupervised training of probabilistic models that favors a kind of sparsity that is pervasive in natural language processing. [sent-323, score-0.397]

86 In the case of part-of-speech induction, the preference can be summarized as “each word occurs as only a few different parts-of-speech,” but the approach is more general and could be applied to other tasks. [sent-324, score-0.182]

87 Our method uses the posterior regularization framework to specify preferences about model posteriors directly, without having to say how these should be encoded in model parameters. [sent-326, score-0.279]

88 This means that the sparse regularization penalty could be used for a log-linear model, where sparse parameters do not correspond to posterior sparsity. [sent-327, score-0.367]

89 We evaluated the new regularization method on the task of unsupervised POS tagging, encoding the prior knowledge that each word should have a small set of tags as a mixed-norm penalty. [sent-328, score-0.631]

90 We compared our method to a previously proposed Bayesian method (VEM) for encouraging sparsity of model parameters [9] and found that ours performs better in practice. [sent-329, score-0.172]

91 We explain this advantage by noting that VEM encodes a preference that each POS tag should generate a few words, which goes in the wrong direction. [sent-330, score-0.369]

92 In reality, in POS tagging (as in several other language processing tasks), a few event types (tags, such as NN) generate the bulk of the word occurrences, but each word is only associated with a few tags. [sent-331, score-0.456]

93 An analysis of sparsity shows that both VEM and Sparse achieve a similar posterior sparsity as measured by the ℓ1/ℓ∞ metric. [sent-333, score-0.457]

94 While VEM better models the empirical sizes of states (tags), the states it assigns have lower mutual information with the true tags, suggesting that parameter sparsity is not as good at generating good tag assignments. [sent-334, score-0.654]

95 In contrast, Sparse’s sparsity seems to help build a model that contains more information about the correct tag assignments. [sent-335, score-0.504]

96 Finally, we evaluated the worth of states assigned by unsupervised learning as features for supervised tagger training with small training sets. [sent-336, score-0.309]

97 In future work, we would like to evaluate the usefulness of these sparser annotations for downstream tasks, for example determining whether Sparse POS tags are better for unsupervised parsing. [sent-339, score-0.401]

98 Finally, we would like to apply the ℓ1/ℓ∞ posterior regularizer to other applications such as unsupervised grammar induction, where we would like sparsity in production rules. [sent-340, score-0.557]

99 Similarly, it would be interesting to use this to regularize a log-linear model, where parameter sparsity does not achieve the same goal. [sent-341, score-0.172]

100 A comparison of Bayesian estimators for unsupervised Hidden Markov Model POS taggers. [sent-374, score-0.109]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('vem', 0.51), ('tag', 0.332), ('pos', 0.332), ('tags', 0.292), ('em', 0.239), ('sparsity', 0.172), ('word', 0.145), ('tagging', 0.139), ('posterior', 0.113), ('unsupervised', 0.109), ('posteriors', 0.108), ('qx', 0.105), ('hmm', 0.103), ('sparse', 0.098), ('vb', 0.082), ('hidden', 0.08), ('pr', 0.078), ('eq', 0.077), ('bulgarian', 0.075), ('bultree', 0.075), ('johnson', 0.072), ('emission', 0.063), ('grammar', 0.061), ('treebank', 0.06), ('supervised', 0.059), ('encourage', 0.059), ('regularization', 0.058), ('acl', 0.057), ('hmms', 0.057), ('fwti', 0.056), ('simov', 0.056), ('unk', 0.056), ('constraints', 0.056), ('trained', 0.056), ('states', 0.053), ('words', 0.049), ('penn', 0.049), ('gra', 0.049), ('induction', 0.048), ('occurrences', 0.048), ('nn', 0.048), ('kl', 0.048), ('gao', 0.045), ('mutual', 0.044), ('panel', 0.044), ('objective', 0.044), ('dual', 0.043), ('expectation', 0.043), ('tokens', 0.043), ('ganchev', 0.042), ('gold', 0.041), ('ge', 0.04), ('jj', 0.04), ('english', 0.038), ('constraint', 0.038), ('mann', 0.038), ('occurrence', 0.037), ('bellare', 0.037), ('cwt', 0.037), ('floresta', 0.037), ('jianfeng', 0.037), ('rse', 0.037), ('sinta', 0.037), ('tica', 0.037), ('wti', 0.037), ('closed', 0.037), ('preference', 0.037), ('accuracy', 0.037), ('projected', 0.033), ('ti', 0.033), ('languages', 0.033), ('token', 0.033), ('lrec', 0.033), ('portuguese', 0.033), ('ravi', 0.033), ('latent', 0.033), ('unlabeled', 0.032), ('variational', 0.032), ('sp', 0.032), ('training', 0.031), ('digamma', 0.03), ('corpora', 0.029), ('encourages', 0.029), ('state', 0.029), ('corpus', 0.028), ('production', 0.028), ('multi', 0.028), ('none', 0.028), ('dt', 0.028), ('language', 0.027), ('bayesian', 0.027), ('prior', 0.027), ('concentrates', 0.027), ('bertsekas', 0.027), ('verbs', 0.027), ('vs', 0.026), ('mapping', 0.026), ('regularizer', 0.026), ('features', 0.026), ('gs', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 192 nips-2009-Posterior vs Parameter Sparsity in Latent Variable Models

Author: Kuzman Ganchev, Ben Taskar, Fernando Pereira, João Gama

Abstract: We address the problem of learning structured unsupervised models with moment sparsity typical in many natural language induction tasks. For example, in unsupervised part-of-speech (POS) induction using hidden Markov models, we introduce a bias for words to be labeled by a small number of tags. In order to express this bias of posterior sparsity as opposed to parametric sparsity, we extend the posterior regularization framework [7]. We evaluate our methods on three languages — English, Bulgarian and Portuguese — showing consistent and significant accuracy improvement over EM-trained HMMs, and HMMs with sparsity-inducing Dirichlet priors trained by variational EM. We increase accuracy with respect to EM by 2.3%-6.5% in a purely unsupervised setting as well as in a weakly-supervised setting where the closed-class words are provided. Finally, we show improvements using our method when using the induced clusters as features of a discriminative model in a semi-supervised setting. 1

2 0.097099543 29 nips-2009-An Infinite Factor Model Hierarchy Via a Noisy-Or Mechanism

Author: Douglas Eck, Yoshua Bengio, Aaron C. Courville

Abstract: The Indian Buffet Process is a Bayesian nonparametric approach that models objects as arising from an infinite number of latent factors. Here we extend the latent factor model framework to two or more unbounded layers of latent factors. From a generative perspective, each layer defines a conditional factorial prior distribution over the binary latent variables of the layer below via a noisy-or mechanism. We explore the properties of the model with two empirical studies, one digit recognition task and one music tag data experiment. 1

3 0.096666351 100 nips-2009-Gaussian process regression with Student-t likelihood

Author: Jarno Vanhatalo, Pasi Jylänki, Aki Vehtari

Abstract: In the Gaussian process regression the observation model is commonly assumed to be Gaussian, which is convenient in computational perspective. However, the drawback is that the predictive accuracy of the model can be significantly compromised if the observations are contaminated by outliers. A robust observation model, such as the Student-t distribution, reduces the influence of outlying observations and improves the predictions. The problem, however, is the analytically intractable inference. In this work, we discuss the properties of a Gaussian process regression model with the Student-t likelihood and utilize the Laplace approximation for approximate inference. We compare our approach to a variational approximation and a Markov chain Monte Carlo scheme, which utilize the commonly used scale mixture representation of the Student-t distribution. 1

4 0.09247046 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms

Author: Novi Quadrianto, John Lim, Dale Schuurmans, Tibério S. Caetano

Abstract: We develop a convex relaxation of maximum a posteriori estimation of a mixture of regression models. Although our relaxation involves a semidefinite matrix variable, we reformulate the problem to eliminate the need for general semidefinite programming. In particular, we provide two reformulations that admit fast algorithms. The first is a max-min spectral reformulation exploiting quasi-Newton descent. The second is a min-min reformulation consisting of fast alternating steps of closed-form updates. We evaluate the methods against Expectation-Maximization in a real problem of motion segmentation from video data. 1

5 0.083263546 157 nips-2009-Multi-Label Prediction via Compressed Sensing

Author: John Langford, Tong Zhang, Daniel J. Hsu, Sham M. Kakade

Abstract: We consider multi-label prediction problems with large output spaces under the assumption of output sparsity – that the target (label) vectors have small support. We develop a general theory for a variant of the popular error correcting output code scheme, using ideas from compressed sensing for exploiting this sparsity. The method can be regarded as a simple reduction from multi-label regression problems to binary regression problems. We show that the number of subproblems need only be logarithmic in the total number of possible labels, making this approach radically more efficient than others. We also state and prove robustness guarantees for this method in the form of regret transform bounds (in general), and also provide a more detailed analysis for the linear prediction setting. 1

6 0.078447595 197 nips-2009-Randomized Pruning: Efficiently Calculating Expectations in Large Dynamic Programs

7 0.075816043 72 nips-2009-Distribution Matching for Transduction

8 0.074502915 4 nips-2009-A Bayesian Analysis of Dynamics in Free Recall

9 0.074139886 129 nips-2009-Learning a Small Mixture of Trees

10 0.069354184 96 nips-2009-Filtering Abstract Senses From Image Search Results

11 0.06539844 260 nips-2009-Zero-shot Learning with Semantic Output Codes

12 0.064966425 222 nips-2009-Sparse Estimation Using General Likelihoods and Non-Factorial Priors

13 0.064214028 97 nips-2009-Free energy score space

14 0.062729269 15 nips-2009-A Rate Distortion Approach for Semi-Supervised Conditional Random Fields

15 0.062651061 217 nips-2009-Sharing Features among Dynamical Systems with Beta Processes

16 0.059617311 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

17 0.057531722 175 nips-2009-Occlusive Components Analysis

18 0.056139816 190 nips-2009-Polynomial Semantic Indexing

19 0.05482284 104 nips-2009-Group Sparse Coding

20 0.054610625 171 nips-2009-Nonparametric Bayesian Models for Unsupervised Event Coreference Resolution


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.188), (1, -0.033), (2, -0.048), (3, -0.06), (4, 0.035), (5, -0.093), (6, 0.054), (7, -0.037), (8, 0.008), (9, 0.106), (10, -0.005), (11, 0.009), (12, -0.057), (13, -0.015), (14, -0.006), (15, 0.03), (16, 0.042), (17, -0.01), (18, -0.056), (19, -0.021), (20, -0.004), (21, -0.085), (22, -0.032), (23, 0.114), (24, 0.082), (25, -0.052), (26, 0.014), (27, 0.095), (28, 0.098), (29, 0.039), (30, -0.128), (31, -0.093), (32, -0.066), (33, 0.009), (34, -0.011), (35, 0.017), (36, -0.017), (37, -0.041), (38, -0.097), (39, 0.021), (40, 0.026), (41, -0.091), (42, 0.04), (43, 0.117), (44, 0.12), (45, 0.07), (46, -0.068), (47, 0.031), (48, -0.022), (49, 0.064)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93931699 192 nips-2009-Posterior vs Parameter Sparsity in Latent Variable Models

Author: Kuzman Ganchev, Ben Taskar, Fernando Pereira, João Gama

Abstract: We address the problem of learning structured unsupervised models with moment sparsity typical in many natural language induction tasks. For example, in unsupervised part-of-speech (POS) induction using hidden Markov models, we introduce a bias for words to be labeled by a small number of tags. In order to express this bias of posterior sparsity as opposed to parametric sparsity, we extend the posterior regularization framework [7]. We evaluate our methods on three languages — English, Bulgarian and Portuguese — showing consistent and significant accuracy improvement over EM-trained HMMs, and HMMs with sparsity-inducing Dirichlet priors trained by variational EM. We increase accuracy with respect to EM by 2.3%-6.5% in a purely unsupervised setting as well as in a weakly-supervised setting where the closed-class words are provided. Finally, we show improvements using our method when using the induced clusters as features of a discriminative model in a semi-supervised setting. 1

2 0.56538951 222 nips-2009-Sparse Estimation Using General Likelihoods and Non-Factorial Priors

Author: David P. Wipf, Srikantan S. Nagarajan

Abstract: Finding maximally sparse representations from overcomplete feature dictionaries frequently involves minimizing a cost function composed of a likelihood (or data fit) term and a prior (or penalty function) that favors sparsity. While typically the prior is factorial, here we examine non-factorial alternatives that have a number of desirable properties relevant to sparse estimation and are easily implemented using an efficient and globally-convergent, reweighted 1 -norm minimization procedure. The first method under consideration arises from the sparse Bayesian learning (SBL) framework. Although based on a highly non-convex underlying cost function, in the context of canonical sparse estimation problems, we prove uniform superiority of this method over the Lasso in that, (i) it can never do worse, and (ii) for any dictionary and sparsity profile, there will always exist cases where it does better. These results challenge the prevailing reliance on strictly convex penalty functions for finding sparse solutions. We then derive a new non-factorial variant with similar properties that exhibits further performance improvements in some empirical tests. For both of these methods, as well as traditional factorial analogs, we demonstrate the effectiveness of reweighted 1 -norm algorithms in handling more general sparse estimation problems involving classification, group feature selection, and non-negativity constraints. As a byproduct of this development, a rigorous reformulation of sparse Bayesian classification (e.g., the relevance vector machine) is derived that, unlike the original, involves no approximation steps and descends a well-defined objective function. 1

3 0.53532183 189 nips-2009-Periodic Step Size Adaptation for Single Pass On-line Learning

Author: Chun-nan Hsu, Yu-ming Chang, Hanshen Huang, Yuh-jye Lee

Abstract: It has been established that the second-order stochastic gradient descent (2SGD) method can potentially achieve generalization performance as well as empirical optimum in a single pass (i.e., epoch) through the training examples. However, 2SGD requires computing the inverse of the Hessian matrix of the loss function, which is prohibitively expensive. This paper presents Periodic Step-size Adaptation (PSA), which approximates the Jacobian matrix of the mapping function and explores a linear relation between the Jacobian and Hessian to approximate the Hessian periodically and achieve near-optimal results in experiments on a wide variety of models and tasks. 1

4 0.52536881 25 nips-2009-Adaptive Design Optimization in Experiments with People

Author: Daniel Cavagnaro, Jay Myung, Mark A. Pitt

Abstract: In cognitive science, empirical data collected from participants are the arbiters in model selection. Model discrimination thus depends on designing maximally informative experiments. It has been shown that adaptive design optimization (ADO) allows one to discriminate models as efficiently as possible in simulation experiments. In this paper we use ADO in a series of experiments with people to discriminate the Power, Exponential, and Hyperbolic models of memory retention, which has been a long-standing problem in cognitive science, providing an ideal setting in which to test the application of ADO for addressing questions about human cognition. Using an optimality criterion based on mutual information, ADO is able to find designs that are maximally likely to increase our certainty about the true model upon observation of the experiment outcomes. Results demonstrate the usefulness of ADO and also reveal some challenges in its implementation. 1

5 0.51770878 197 nips-2009-Randomized Pruning: Efficiently Calculating Expectations in Large Dynamic Programs

Author: Alexandre Bouchard-côté, Slav Petrov, Dan Klein

Abstract: Pruning can massively accelerate the computation of feature expectations in large models. However, any single pruning mask will introduce bias. We present a novel approach which employs a randomized sequence of pruning masks. Formally, we apply auxiliary variable MCMC sampling to generate this sequence of masks, thereby gaining theoretical guarantees about convergence. Because each mask is generally able to skip large portions of an underlying dynamic program, our approach is particularly compelling for high-degree algorithms. Empirically, we demonstrate our method on bilingual parsing, showing decreasing bias as more masks are incorporated, and outperforming fixed tic-tac-toe pruning. 1

6 0.51760566 75 nips-2009-Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models

7 0.5077095 100 nips-2009-Gaussian process regression with Student-t likelihood

8 0.5035122 15 nips-2009-A Rate Distortion Approach for Semi-Supervised Conditional Random Fields

9 0.49643186 29 nips-2009-An Infinite Factor Model Hierarchy Via a Noisy-Or Mechanism

10 0.49395978 114 nips-2009-Indian Buffet Processes with Power-law Behavior

11 0.48639429 167 nips-2009-Non-Parametric Bayesian Dictionary Learning for Sparse Image Representations

12 0.48444504 217 nips-2009-Sharing Features among Dynamical Systems with Beta Processes

13 0.48133516 72 nips-2009-Distribution Matching for Transduction

14 0.47088203 76 nips-2009-Efficient Learning using Forward-Backward Splitting

15 0.46387133 68 nips-2009-Dirichlet-Bernoulli Alignment: A Generative Model for Multi-Class Multi-Label Multi-Instance Corpora

16 0.45674354 49 nips-2009-Breaking Boundaries Between Induction Time and Diagnosis Time Active Information Acquisition

17 0.45395178 129 nips-2009-Learning a Small Mixture of Trees

18 0.45105541 56 nips-2009-Conditional Neural Fields

19 0.44766229 4 nips-2009-A Bayesian Analysis of Dynamics in Free Recall

20 0.44536135 233 nips-2009-Streaming Pointwise Mutual Information


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(7, 0.013), (9, 0.242), (24, 0.041), (25, 0.054), (35, 0.046), (36, 0.143), (39, 0.06), (58, 0.074), (61, 0.03), (71, 0.095), (80, 0.01), (81, 0.015), (86, 0.072), (91, 0.013)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.82854497 192 nips-2009-Posterior vs Parameter Sparsity in Latent Variable Models

Author: Kuzman Ganchev, Ben Taskar, Fernando Pereira, João Gama

Abstract: We address the problem of learning structured unsupervised models with moment sparsity typical in many natural language induction tasks. For example, in unsupervised part-of-speech (POS) induction using hidden Markov models, we introduce a bias for words to be labeled by a small number of tags. In order to express this bias of posterior sparsity as opposed to parametric sparsity, we extend the posterior regularization framework [7]. We evaluate our methods on three languages — English, Bulgarian and Portuguese — showing consistent and significant accuracy improvement over EM-trained HMMs, and HMMs with sparsity-inducing Dirichlet priors trained by variational EM. We increase accuracy with respect to EM by 2.3%-6.5% in a purely unsupervised setting as well as in a weakly-supervised setting where the closed-class words are provided. Finally, we show improvements using our method when using the induced clusters as features of a discriminative model in a semi-supervised setting. 1

2 0.66123617 260 nips-2009-Zero-shot Learning with Semantic Output Codes

Author: Mark Palatucci, Dean Pomerleau, Geoffrey E. Hinton, Tom M. Mitchell

Abstract: We consider the problem of zero-shot learning, where the goal is to learn a classifier f : X → Y that must predict novel values of Y that were omitted from the training set. To achieve this, we define the notion of a semantic output code classifier (SOC) which utilizes a knowledge base of semantic properties of Y to extrapolate to novel classes. We provide a formalism for this type of classifier and study its theoretical properties in a PAC framework, showing conditions under which the classifier can accurately predict novel classes. As a case study, we build a SOC classifier for a neural decoding task and show that it can often predict words that people are thinking about from functional magnetic resonance images (fMRI) of their neural activity, even without training examples for those words. 1

3 0.66047031 217 nips-2009-Sharing Features among Dynamical Systems with Beta Processes

Author: Alan S. Willsky, Erik B. Sudderth, Michael I. Jordan, Emily B. Fox

Abstract: We propose a Bayesian nonparametric approach to the problem of modeling related time series. Using a beta process prior, our approach is based on the discovery of a set of latent dynamical behaviors that are shared among multiple time series. The size of the set and the sharing pattern are both inferred from data. We develop an efficient Markov chain Monte Carlo inference method that is based on the Indian buffet process representation of the predictive distribution of the beta process. In particular, our approach uses the sum-product algorithm to efficiently compute Metropolis-Hastings acceptance probabilities, and explores new dynamical behaviors via birth/death proposals. We validate our sampling algorithm using several synthetic datasets, and also demonstrate promising results on unsupervised segmentation of visual motion capture data.

4 0.6596756 129 nips-2009-Learning a Small Mixture of Trees

Author: M. P. Kumar, Daphne Koller

Abstract: The problem of approximating a given probability distribution using a simpler distribution plays an important role in several areas of machine learning, for example variational inference and classification. Within this context, we consider the task of learning a mixture of tree distributions. Although mixtures of trees can be learned by minimizing the KL-divergence using an EM algorithm, its success depends heavily on the initialization. We propose an efficient strategy for obtaining a good initial set of trees that attempts to cover the entire observed distribution by minimizing the α-divergence with α = ∞. We formulate the problem using the fractional covering framework and present a convergent sequential algorithm that only relies on solving a convex program at each iteration. Compared to previous methods, our approach results in a significantly smaller mixture of trees that provides similar or better accuracies. We demonstrate the usefulness of our approach by learning pictorial structures for face recognition.

5 0.65778875 132 nips-2009-Learning in Markov Random Fields using Tempered Transitions

Author: Ruslan Salakhutdinov

Abstract: Markov random fields (MRF’s), or undirected graphical models, provide a powerful framework for modeling complex dependencies among random variables. Maximum likelihood learning in MRF’s is hard due to the presence of the global normalizing constant. In this paper we consider a class of stochastic approximation algorithms of the Robbins-Monro type that use Markov chain Monte Carlo to do approximate maximum likelihood learning. We show that using MCMC operators based on tempered transitions enables the stochastic approximation algorithm to better explore highly multimodal distributions, which considerably improves parameter estimates in large, densely-connected MRF’s. Our results on MNIST and NORB datasets demonstrate that we can successfully learn good generative models of high-dimensional, richly structured data that perform well on digit and object recognition tasks.

6 0.65772623 150 nips-2009-Maximum likelihood trajectories for continuous-time Markov chains

7 0.65738577 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs

8 0.65699583 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference

9 0.65271336 72 nips-2009-Distribution Matching for Transduction

10 0.65213293 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models

11 0.65118814 97 nips-2009-Free energy score space

12 0.65046817 41 nips-2009-Bayesian Source Localization with the Multivariate Laplace Prior

13 0.65024734 174 nips-2009-Nonparametric Latent Feature Models for Link Prediction

14 0.6466589 27 nips-2009-Adaptive Regularization of Weight Vectors

15 0.64661872 226 nips-2009-Spatial Normalized Gamma Processes

16 0.6434831 71 nips-2009-Distribution-Calibrated Hierarchical Classification

17 0.64300442 215 nips-2009-Sensitivity analysis in HMMs with application to likelihood maximization

18 0.64122069 204 nips-2009-Replicated Softmax: an Undirected Topic Model

19 0.63936961 38 nips-2009-Augmenting Feature-driven fMRI Analyses: Semi-supervised learning and resting state activity

20 0.63926786 169 nips-2009-Nonlinear Learning using Local Coordinate Coding