jmlr jmlr2010 jmlr2010-38 knowledge-graph by maker-knowledge-mining

38 jmlr-2010-Expectation Truncation and the Benefits of Preselection In Training Generative Models


Source: pdf

Author: Jörg Lücke, Julian Eggert

Abstract: We show how a preselection of hidden variables can be used to efficiently train generative models with binary hidden variables. The approach is based on Expectation Maximization (EM) and uses an efficiently computable approximation to the sufficient statistics of a given model. The computational cost to compute the sufficient statistics is strongly reduced by selecting, for each data point, the relevant hidden causes. The approximation is applicable to a wide range of generative models and provides an interpretation of the benefits of preselection in terms of a variational EM approximation. To empirically show that the method maximizes the data likelihood, it is applied to different types of generative models including: a version of non-negative matrix factorization (NMF), a model for non-linear component extraction (MCA), and a linear generative model similar to sparse coding. The derived algorithms are applied to both artificial and realistic data, and are compared to other models in the literature. We find that the training scheme can reduce computational costs by orders of magnitude and allows for a reliable extraction of hidden causes. Keywords: maximum likelihood, deterministic approximations, variational EM, generative models, component extraction, multiple-cause models

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 63073 Offenbach am Main, Germany. Editor: Aapo Hyvärinen. Abstract: We show how a preselection of hidden variables can be used to efficiently train generative models with binary hidden variables. [sent-7, score-0.84]

2 The approximation is applicable to a wide range of generative models and provides an interpretation of the benefits of preselection in terms of a variational EM approximation. [sent-10, score-0.719]

3 To empirically show that the method maximizes the data likelihood, it is applied to different types of generative models including: a version of non-negative matrix factorization (NMF), a model for non-linear component extraction (MCA), and a linear generative model similar to sparse coding. [sent-11, score-0.637]

4 Keywords: maximum likelihood, deterministic approximations, variational EM, generative models, component extraction, multiple-cause models. [sent-14, score-0.415]

5 The general idea of candidate preselection followed by recurrent recognition will, in this paper, be formulated in terms of a variational approximation that allows for efficient training of probabilistic generative models. [sent-27, score-0.697]

6 H′ determines how many candidates are selected and γ fixes the maximal number of non-zero hidden units s_h. [sent-110, score-0.653]
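As an illustration of how such a truncated state set could be enumerated in practice, here is a minimal Python sketch. It is not the authors' code; the function name and the exact composition of the state set are assumptions (the paper's K^n may, for example, additionally contain all singleton states).

```python
from itertools import combinations
import numpy as np

def truncated_states(candidates, H, gamma):
    """Enumerate binary states of length H that are zero outside the H'
    preselected `candidates` and have at most `gamma` non-zero entries."""
    states = []
    for k in range(gamma + 1):                      # 0, 1, ..., gamma active causes
        for active in combinations(candidates, k):  # which of the candidates are on
            s = np.zeros(H, dtype=int)
            s[list(active)] = 1
            states.append(s)
    return np.array(states)

# Example: H = 10 hidden units, H' = 4 candidates, gamma = 2
K_n = truncated_states(candidates=[1, 3, 6, 8], H=10, gamma=2)
print(K_n.shape)  # (11, 10): the empty state, 4 singletons and 6 pairs
```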

7 Note that while the approximation scheme presumably results in good approximations for many data points, it can be expected to give poor results for data points generated by more than γ hidden causes (i. [sent-115, score-0.556]

8 Given selection functions and update rules, Algorithm 1 describes an approximation scheme applicable to generative models with binary hidden variables. [sent-136, score-0.57]
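The following Python sketch illustrates how such a scheme could be organized. It is not the paper's Algorithm 1; the interface (selection_score, log_joint, m_step) is a hypothetical one introduced only to make the flow concrete.

```python
from itertools import combinations
import numpy as np

def et_em_iteration(Y, Theta, H, H_prime, gamma,
                    selection_score, log_joint, m_step):
    """One schematic EM iteration with preselection (hypothetical interface).

    selection_score(y, h, Theta): scalar score used to rank hidden unit h
    log_joint(y, s, Theta):       log p(s, y | Theta) of the generative model
    m_step(Y, suff_stats, Theta): returns updated parameters Theta
    """
    suff_stats = []
    for y in Y:
        scores = np.array([selection_score(y, h, Theta) for h in range(H)])
        candidates = np.argsort(scores)[-H_prime:]      # keep the H' best candidates
        states = []                                     # truncated state set K^n
        for k in range(gamma + 1):
            for active in combinations(candidates, k):
                s = np.zeros(H, dtype=int)
                s[list(active)] = 1
                states.append(s)
        states = np.array(states)
        log_p = np.array([log_joint(y, s, Theta) for s in states])
        q = np.exp(log_p - log_p.max())
        q /= q.sum()                                    # truncated posterior over K^n
        suff_stats.append((states, q))
    return m_step(Y, suff_stats, Theta)
```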

9 Data generated by the truncated generative model with c = 1 contains up to two bars (we set K = {s | ∑_j s_j ≤ γ} with γ = 2). [sent-161, score-0.519]

10 If we train the truncated generative model with c = 1 on data from which data points with ∑_j s_j > γ were removed, we can expect to approximately recover the true generating basis functions W of the original model. [sent-163, score-0.503]

11 First Variational Approximation (Data Classification): As in Section 2, let us consider a generative model with a set of hidden variables denoted by s, a set of observed variables denoted by y, and a set of parameters denoted by Θ. [sent-165, score-0.408]

12 To distinguish this generative model from models introduced later, it will from now on be referred to as the original generative model. [sent-167, score-0.588]

13 We will formalize the classification of data points by introducing two new generative models defined based on the original model. [sent-168, score-0.4]

14 We will refer to the new generative models as truncated models because their prior distributions are truncated to be zero outside of specific subsets. [sent-173, score-0.447]

15 Figure 1 shows the truncated generative models and data they generate for a concrete example (a model that linearly combines bar-like generative fields; compare Section 4. [sent-176, score-0.675]

16 The first summand is the likelihood L1 (Θ) of the truncated generative model with c = 1, and the second is the likelihood L0 (Θ) of the model with c = 0. [sent-214, score-0.463]

17 If the truncated generative model with c = 1 is optimized on the truncated data class c = 1, the displayed generative fields W are learned. [sent-225, score-0.713]

18 These parameter values can be expected to also result in close to maximum likelihood values for the generative model with c = 0 on data class c = 0, and also correspond to approximately optimal likelihood values of the original generative model on the original data. [sent-226, score-0.713]

19 Second Variational Approximation (Preselection): The starting point for the second variational approximation is the likelihood L_1(Θ) of the truncated generative model with c = 1. [sent-232, score-0.556]

20 The computational gain of preselection compared to an approximation without preselection is reflected by the reduced size of K^n compared to K (see Figure 2 for an example and Appendix C for a detailed complexity analysis). [sent-273, score-0.565]

21 Second, a variational step that approximates the posterior (18) by an approximate posterior (20) defined through preselection (Section 3. [sent-278, score-0.483]

22 On the other hand, the ET approximations can get more severe if the approximate classification of data points becomes imprecise (first variational step), if the preselection step does not include the relevant candidates (second variational step), or if too few data points are used. [sent-297, score-0.734]

23 We will thus monitor the quality of the data point classification (first variational approximation) and the quality of the approximation by preselection (second variational approximation). [sent-306, score-0.605]

24 With the exact posterior p(s | y^(n), Θ) = p(s, y^(n) | Θ) / ∑_{s'} p(s', y^(n) | Θ) (Equation 21), the preselection approximation has a high quality if, for the data points in M, the values D_KL(q̃^(n), p) are close to zero. [sent-310, score-0.426]
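As a sanity check of this quality measure, the following small Python sketch computes D_KL between the truncated, renormalized posterior and the exact posterior when all hidden states can still be enumerated. The function name and the toy numbers are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def truncation_kl(log_joint_all, subset_idx):
    """D_KL between the truncated posterior (joints restricted to a subset
    of states and renormalized) and the exact posterior over all states.

    log_joint_all: array of log p(s, y | Theta) for every state s
    subset_idx:    indices of the states kept in K^n
    """
    p = np.exp(log_joint_all - log_joint_all.max())
    p /= p.sum()                                  # exact posterior p(s | y, Theta)
    q = np.zeros_like(p)
    q[subset_idx] = p[subset_idx]
    q /= q.sum()                                  # truncated posterior q^(n)
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

# Toy check: if the kept states carry almost all posterior mass,
# the divergence is close to zero.
log_joint = np.array([-1.0, -1.5, -8.0, -9.0])
print(truncation_kl(log_joint, subset_idx=[0, 1]))   # small value
```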

25 Training Generative Models: The approximation scheme defined and discussed in the previous sections is so far independent of the specific choice of a generative model, except for the assumption of binary hidden variables. [sent-316, score-0.478]

26 Non-negative Matrix Factorization (NMF): The first generative model considered uses prior (22) and combines the generative fields W_h = (W_1h, . [sent-333, score-0.566]

27 We use a Gaussian noise model such that the probability of a data vector y given s is defined by p(y | s, Θ) = ∏_{d=1}^{D} p(y_d | W̄_d(s, W), σ) with p(y_d | w, σ) = N(y_d; w, σ²) (24), where W̄_d(s, W) = ∑_h W_dh s_h (25) and W ∈ ℝ^(D×H). [sent-337, score-0.526]
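For concreteness, a minimal Python sketch of this observation model (Equations 24 and 25). The variable names and the made-up numbers are assumptions for illustration only.

```python
import numpy as np

def log_p_y_given_s(y, s, W, sigma):
    """log p(y | s, Theta) for the linear Gaussian observation model:
    each y_d is Gaussian with mean W_bar_d(s, W) = sum_h W_dh * s_h
    and variance sigma^2 (cf. Equations 24 and 25)."""
    mean = W @ s                                   # W_bar_d(s, W) for all d at once
    D = y.shape[0]
    return (-0.5 * D * np.log(2.0 * np.pi * sigma**2)
            - 0.5 * np.sum((y - mean) ** 2) / sigma**2)

# Small illustration with made-up numbers (not from the paper):
W = np.array([[1.0, 0.0], [0.5, 2.0]])             # D = 2 observed, H = 2 hidden
s = np.array([1, 0])
y = np.array([1.1, 0.4])
print(log_p_y_given_s(y, s, W, sigma=0.5))
```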

28 Generative models corresponding to the class of non-negative Matrix Factorization (NMF) methods are based on a linear combination of generative fields but rely on non-negative data points and generative fields. [sent-349, score-0.683]

29 As indicated, the sufficient statistics for ET-NMF are given by the first- and second-order moments, ⟨s_h⟩_{q^(n)} and ⟨s_h' s_h⟩_{q^(n)}, of the exact posterior; to find approximations to these intractable expectation values we first have to find appropriate selection functions S_h. [sent-363, score-1.052]
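Once a truncated state set K^n and the corresponding log-joints are available, these moments can be approximated by a weighted sum over K^n. A minimal numpy sketch (not the authors' code; names and toy numbers are assumptions):

```python
import numpy as np

def truncated_moments(states, log_joint):
    """Approximate <s_h> and <s_h s_h'> under the truncated posterior q^(n),
    obtained by normalizing the joints p(s, y^(n) | Theta) over the states in K^n.

    states:    array (|K^n|, H) of binary states
    log_joint: array (|K^n|,) of log p(s, y^(n) | Theta)
    """
    q = np.exp(log_joint - log_joint.max())
    q /= q.sum()                                   # q^(n)(s) over K^n
    first = q @ states                             # <s_h>_{q^(n)},      shape (H,)
    second = states.T @ (states * q[:, None])      # <s_h s_h'>_{q^(n)}, shape (H, H)
    return first, second

# Toy usage with two hidden units and four truncated states:
states = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
log_joint = np.array([-3.0, -1.0, -2.0, -2.5])
m1, m2 = truncated_moments(states, log_joint)
```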

30 If we knew that the joint probability was small for a given h, we would know that the sum over all s in (6) restricted to s with s_h = 1 is small as well. [sent-365, score-0.463]

31 Importantly, however, we know that if S_h is sufficiently small, then the contribution of all joint probabilities p(s, y^(n) | Θ) with s_h = 1 can be neglected. [sent-382, score-0.463]

32 That is, we consider hidden causes in the form of horizontal and vertical bars (five pixels each) on a 5 × 5 grid. [sent-393, score-0.465]

33 Figure panel B: Time course of the generative fields W of the NMF-like generative model if Expectation Truncation is used for learning. [sent-426, score-0.566]

34 When we used T^init = 15 and stretched the cooling schedule in Figure 4B to 200 iterations, the algorithm found all bars in all of 50 trials. [sent-435, score-0.397]

35 Likewise, increasing the number of data points increased reliability (> 94% reliability for N > 1000 data points with T = 13 and 100 iterations of cooling). [sent-436, score-0.425]

36 Note, however, that the standard bars benchmark test (Földiák, 1990) uses non-linearly overlapping bars (we will come to the standard version of the bars test in the next section). [sent-442, score-0.528]

37 In our generative setting it is in principle also possible to learn the model parameters σ and π (compare Lücke and Sahani, 2008). [sent-444, score-0.455]

38 The data points were randomly selected, but it was ensured that twenty of them were generated by at most γ causes (bright green lines) and the other twenty by more than γ causes (dark red lines). [sent-448, score-0.427]

39 As can be observed, the approximation quality of the twenty data points generated by ≤ γ causes quickly increases whereas the approximation for the other twenty data points approaches zero. [sent-449, score-0.501]

40 Figure 4 panel (data point classification): curves for data points generated by ≤ γ causes and by > γ causes; N^cut. [sent-451, score-0.505]

41 Figure 4: A, data likelihood during ten trials using the same set of N = 500 data points but different random initializations. [sent-452, score-0.902]

42 Twenty of the data points were generated by ≤ γ causes (bright green lines) and the other twenty by more than γ causes (dark red lines). [sent-456, score-0.395]

43 Bright green bars were used to mark the data points generated by ≤ γ causes, dark red bars to mark data points generated by more than γ causes. [sent-458, score-0.566]

44 Bright green bars display the values of all data points generated by at most γ causes, dark red bars display the values of all other data points. [sent-465, score-0.498]

45 Despite the restriction to binary hidden variables in our generative version of NMF, the resulting reconstructions are very similar to those of standard NMF as shown in Figure 6C. [sent-493, score-0.408]

46 For these data, the overall average reconstruction error, (1/(ND)) ∑_n ||y^(n) − ∑_h W_h ⟨s_h⟩_{q^(n)}||², of the generative version is less than 5% larger than the reconstruction error of standard NMF. [sent-494, score-0.816]

47 Maximal Causes Analysis (MCA): The second generative model we consider was suggested to extract the hidden causes from data whose components combine non-linearly. [sent-496, score-0.569]

48 For the activities of the binary hidden variables s_h we use the same prior as for the NMF model in Section 4. [sent-502, score-0.588]

49 To generate data according to the bars test, we use the same parameter settings as for the artificial data in Figure 3A, that is, D = 5 × 5, a bar pixel value of 10 with other pixels zero, and Gaussian generating noise with standard deviation 2. [sent-519, score-0.510]
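A Python sketch that generates such bars-test data. The non-linear (pixel-wise maximum) combination of overlapping bars and the bar appearance probability of 2/10 are assumptions based on the standard bars test, not values quoted from this sentence.

```python
import numpy as np

def generate_bars_data(N, grid=5, bar_value=10.0, noise_std=2.0,
                       p_bar=2.0 / 10.0, rng=None):
    """Generate bars-test images: horizontal and vertical bars of value
    `bar_value` on a grid x grid image, combined by taking the pixel-wise
    maximum (non-linear occlusion), plus Gaussian noise.  The appearance
    probability p_bar is an assumption (about two active bars per image)."""
    rng = np.random.default_rng() if rng is None else rng
    H = 2 * grid                                   # grid horizontal + grid vertical bars
    fields = np.zeros((H, grid, grid))
    for i in range(grid):
        fields[i, i, :] = bar_value                # horizontal bars
        fields[grid + i, :, i] = bar_value         # vertical bars
    S = rng.random((N, H)) < p_bar                 # binary hidden causes
    Y = np.zeros((N, grid * grid))
    for n in range(N):
        active = fields[S[n]]
        img = active.max(axis=0) if active.size else np.zeros((grid, grid))
        Y[n] = img.reshape(-1) + noise_std * rng.standard_normal(grid * grid)
    return Y, S

Y, S = generate_bars_data(N=500)
print(Y.shape, S.sum(axis=1).mean())               # roughly two bars per image
```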

50 Figure panels: A, data points; B, generative fields W_d1 to W_d10 at iterations 0, 10, 40 and 100; C, likelihood values L; D, approximation quality Q^(n) for data points generated by ≤ γ causes and by > γ causes. [sent-535, score-0.392]

51 Values are plotted for twenty randomly selected data points generated by at most γ causes (bright green lines) and twenty randomly selected data points generated by more than γ causes (dark red lines). [sent-541, score-0.522]

52 To probe the reliability of MCA_ET, we ran 50 trials of the bars test with the bars test parameters as given above. [sent-549, score-0.552]

53 In 46 of the 50 trials MCA_ET extracted all bars (92% reliability), and in four of the trials 9 of 10 bars were extracted. [sent-551, score-0.604]

54 When we ran 100 trials using the same parameters but N = 2000 data points instead of N = 500, the algorithm extracted all bars in all trials. [sent-556, score-0.414]

55 In 41 of the 50 trials MCA_ET extracted all bars (82% reliability) and found 9 of 10 bars in the other nine trials. [sent-561, score-0.495]

56 With N = 10 000 and slower cooling (same cooling schedule as in Figure 4B but stretched to 200 iterations), the algorithm found all bars in all of the 50 trials. [sent-567, score-0.429]

57 In earlier work, generative modelling approaches to the bars test achieved only relatively low reliability values. [sent-569, score-0.550]

58 The unrestricted version of MCA3 extracts all bars in 90% of the trials with noisy bars (with Poisson noise) and in 81% of the cases for the noiseless bars. [sent-580, score-0.461]

59 For the test we adopted the same parameter setting for such input as used by Spratling (2006) and Lücke and Sahani (2008), that is, N = 400 example patterns, 16 bars, D = 9 × 9, bars appearing with probability 2/16, and H = 32 hidden units. [sent-588, score-0.512]

60 We applied MCA_ET to the data using the same parameters as for the standard bars test except for a higher initial temperature (T = 23) and a longer cooling time (the cooling schedule in Figure 4B was stretched by a factor of four to 400 iterations). [sent-591, score-0.501]

61 When we repeated the experiment with the same generating and model parameters but with N = 800 instead of N = 400 data points, the algorithm extracted all bars in all of 50 trials (mean number of bars extracted equal to 16. [sent-606, score-0.594]

62 Figure panel E: Ten examples of patches generated according to the MCA generative model using the generative fields in C. [sent-645, score-0.597]

63 Instead of defining explicit prior and noise distributions, SC-like models are often, more informally, written in the form y = ∑_{h=1}^{H} s_h W_h + σ η = W s + σ η (32), where the entries of η are independently and identically drawn from a Gaussian distribution with zero mean and unit variance. [sent-660, score-0.521]
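A minimal Python sampler for this model. The Bernoulli parameterization of prior (22) with a single probability π, as well as the variable names and shapes, are assumptions made only for illustration.

```python
import numpy as np

def sample_linca(N, W, pi, sigma, rng=None):
    """Draw N samples from the SC-like linear model of Equation (32):
    y = W s + sigma * eta, with s_h ~ Bernoulli(pi) (assumed parameterization
    of prior 22) and eta drawn i.i.d. from N(0, 1)."""
    rng = np.random.default_rng() if rng is None else rng
    D, H = W.shape
    S = (rng.random((N, H)) < pi).astype(float)        # binary hidden causes
    Y = S @ W.T + sigma * rng.standard_normal((N, D))  # linear superposition + noise
    return Y, S

# Hypothetical usage with random generative fields:
rng = np.random.default_rng(0)
W = rng.random((25, 10))                               # D = 25 observed, H = 10 hidden
Y, S = sample_linca(N=100, W=W, pi=0.2, sigma=0.1, rng=rng)
print(Y.shape)                                         # (100, 25)
```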

64 The model (32) with binary factors s_h distributed according to the Bernoulli prior (22) can be trained with Expectation Truncation. [sent-661, score-0.463]

65 For continuous factors s_h with non-Gaussian priors this special case represents the standard version of ICA (see, e. [sent-666, score-0.463]

66 For bars without noise we extract all bars in 49 of the 50 trials (98%) and nine of ten bars in one trial. [sent-720, score-0.700]

67 Alternatively, reliability increased when we increased the sample size: all bars were found in 100 of 100 trials if N = 1000 data points were used. [sent-727, score-0.519]

68 Values are plotted for twenty randomly selected data points generated by at most γ causes (bright green lines) and twenty randomly selected data points generated by more than γ causes (dark red lines). [sent-743, score-0.522]

69 Discussion: We have studied an approximation to EM to train multiple-cause generative models with binary hidden variables. [sent-797, score-0.473]

70 In contrast to exact EM, there is no guarantee that the data likelihood under a given generative model is always increased or remains unchanged. [sent-801, score-0.417]

71 By applying approximation methods to these integrals over relatively small state spaces, learning algorithms for generative models with continuous-valued latents could be derived. [sent-826, score-0.405]

72 However, the applicability of ET to generative models with binary hidden variables is also not without limits. [sent-829, score-0.430]

73 For the generative models discussed in Section 4, the tractable selection functions (28) and (33) perform well in this respect, and successfully maximize the likelihood and recover close to optimal parameter values. [sent-836, score-0.422]

74 Examples are the generative models discussed in Section 4 but with priors that fix the number of active causes (e. [sent-845, score-0.439]

75 Importantly, only the preselection part depends on the total number of hidden variables H. [sent-859, score-0.413]

76 Instead, ET uses two variational steps: one that takes the form of a data point selection, and a second that is based on a preselection of hidden variables (see Section 3 for more details). [sent-906, score-0.523]

77 At the same time, the preselection of relevant hidden variables per data point allowed a massive reduction of the computational cost. [sent-909, score-0.413]

78 , 2009) can be seen as ET without preselection (instead of implicit data point selection, more terms of the truncated sums were used here). [sent-923, score-0.407]

79 We applied the algorithm to the standard bars test with 10 bars and to a more recent benchmark with 16 bars and larger overlap. [sent-962, score-0.528]

80 In both cases the algorithm extracted all causes with close to 100% reliability (50 successful trials out of 50) provided that we used sufficiently many data points. [sent-963, score-0.395]

81 The initial temperature for annealing can be chosen by observing that, for a given model and application, a critical temperature exists above which all generative fields converge to the same average field (no differentiation into different components). [sent-969, score-0.403]

82 If the data can be explained well by combinations of binary hidden causes, a generative model should be chosen that appropriately reflects the data generation process. [sent-984, score-0.462]

83 This scheme is formulated as a deterministic variational EM approximation to maximize the data likelihood under a given generative model. [sent-996, score-0.55]

84 In standard benchmarks on artificial data we found that the derived algorithms increased the likelihood to values very close to the optimum, extracted hidden causes with high reliability, and reduced the computational cost potentially by orders of magnitude. [sent-999, score-0.404]

85 We discuss the consistency of maximum likelihood parameters of the original and truncated generative models. [sent-1009, score-0.403]

86 We can approximately recover the distribution p(y | Θ∗) by (globally) maximizing the data likelihood under the mixed generative model on {y^(n)}_{n=1,…,N}. [sent-1029, score-0.391]

87 Furthermore, we can recover the distributions p(y | c = 1, Θ∗) and p(y | c = 0, Θ∗) by (globally) maximizing the data likelihoods of the truncated generative models on {y^(n)}_{n∈M^opt} and {y^(n)}_{n∉M^opt}, respectively. [sent-1033, score-0.550]

88 Figure 12: Recovery of the generating distributions through the original and the truncated generative models; the diagram shows the generated data points {y^(n)}_{n=1,…,N} and the subsets {y^(n)}_{n∈M^opt} and {y^(n)}_{n∉M^opt}, from which maximum likelihood recovers the distributions p(y^(n) | Θ†), p(y^(n) | c = 1, Θ†_1) and p(y^(n) | c = 0, Θ†_0). [sent-1037, score-0.718]

89 While, for example, the assumption of invariance under transformations T can be fulfilled exactly (depending on the generative model), the assumptions that the true data distribution can be matched exactly or that the variational approximation in Section 3. [sent-1063, score-0.463]

90 , D} | y_d < W_dh} and note that if y_d < W_dh and s_h = 1, then p(y_d^(n) | W_dh, σ) < p(y_d^(n) | W̄_d(s, W), σ). [sent-1098, score-0.693]

91 Each term involves the computation of W̄_d(s, W), which is equal to ∑_h W_dh s_h in the cases of NMF and LinCA, and equal to max_h{W_dh s_h} in the case of MCA. [sent-1110, score-0.926]
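The two combination rules side by side, as a small Python sketch (illustrative only; the toy matrix is not from the paper):

```python
import numpy as np

def combine_linear(W, s):
    """W_bar_d(s, W) = sum_h W_dh * s_h  (NMF / LinCA)."""
    return W @ s

def combine_max(W, s):
    """W_bar_d(s, W) = max_h W_dh * s_h  (MCA)."""
    return (W * s[None, :]).max(axis=1)

W = np.array([[1.0, 3.0], [2.0, 2.0]])
s = np.array([1, 1])
print(combine_linear(W, s))   # [4. 4.]
print(combine_max(W, s))      # [3. 2.]
```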

92 , in vector form y^(n) ≈ W s^(n) (39), with non-negative entries for Y, W and S, meaning that the generative fields W_h are combined linearly using the source activations s_h^(n) to approximate the data vectors y^(n). [sent-1189, score-0.817]

93 Introducing ⟨f(y, s_h)⟩ := (1/N) ∑_{n=1}^{N} f(y^(n), s_h^(n)) for a more compact notation and using the vectorial form of (40), the task is to minimize E(W, S) = ⟨||y − W s||²⟩ (41), that is, the average Euclidean reconstruction error over all data points y^(n). [sent-1192, score-1.056]
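A direct Python transcription of this average reconstruction error; the shapes (Y is N×D, W is D×H, S is N×H) and the random toy data are assumptions.

```python
import numpy as np

def reconstruction_error(Y, W, S):
    """Average Euclidean reconstruction error E(W, S) = <||y - W s||^2>
    over all data points (cf. Equation 41)."""
    residual = Y - S @ W.T          # y^(n) - W s^(n) for every data point n
    return float(np.mean(np.sum(residual ** 2, axis=1)))

# Toy usage with random non-negative data:
rng = np.random.default_rng(0)
Y, W, S = rng.random((100, 25)), rng.random((25, 10)), rng.random((100, 10))
print(reconstruction_error(Y, W, S))
```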

94 As for the classical NMF, we combine the generative fields W_h linearly with the source activities s_h using W̄(s, W) := W s, with the difference that the activities are now binary and constitute the hidden variables of the system (remark that we now write s_h instead of s_h^(n)). [sent-1200, score-1.819]

95 That is, the averaging runs over the subset of data vectors n ∈ M and the set of source vectors s ∈ K^n gained by the source preselection and sparseness assumptions: ⟨f(y, s_h)⟩_ET := ∑_{n∈M} ∑_{s∈K^n} p(s | y^(n), Θ) f(y^(n), s_h) = ∑_{n∈M} ⟨f(y^(n), s_h)⟩_{q^(n)}. [sent-1220, score-0.425]

96 The classical NMF uses data points in fixed association with their corresponding continuous activation vectors and minimizes the Euclidean cost function (41) for both the source activations and the generative fields, the latter according to the multiplicative update rule (42). [sent-1235, score-0.483]

97 ET-NMF explores a subset of allowed binary source activations for each data point and minimizes in its M-step the Euclidean cost function (45) for the generative fields only, according to the multiplicative update rule (47). [sent-1236, score-0.415]
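Since the exact form of update rules (42) and (47) is not reproduced in this summary, the following Python sketch shows the standard Lee-Seung multiplicative update of W under the Euclidean cost as a stand-in; it is not claimed to be identical to the paper's Equation (47).

```python
import numpy as np

def nmf_multiplicative_update_W(Y, W, S, eps=1e-12):
    """One Lee-Seung style multiplicative update of the generative fields W
    for the Euclidean cost <||y - W s||^2>, with the activations S held fixed.
    Standard NMF update, used here only as a stand-in for Eqs. (42)/(47)."""
    numer = Y.T @ S                               # shape (D, H)
    denom = W @ (S.T @ S) + eps                   # shape (D, H); eps avoids division by zero
    return W * numer / denom

# Hypothetical usage with non-negative random data:
rng = np.random.default_rng(0)
Y = rng.random((100, 25))                         # N = 100 data points, D = 25
W = rng.random((25, 10))                          # D x H generative fields
S = rng.random((100, 10))                         # N x H activations
for _ in range(50):
    W = nmf_multiplicative_update_W(Y, W, S)
```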

98 The full Expectation Truncation formalism then comprises the calculation of the expectation values ⟨s_h⟩_{q^(n)} and ⟨s_h' s_h⟩_{q^(n)} (needed for (48)), and afterwards the adjustment of the generative fields according to (47). [sent-1238, score-1.242]

99 Let further s_h denote a hidden state with just one non-zero entry at position h. [sent-1244, score-0.588]

100 To determine if cause h′ is represented, we compute the approximate posterior probabilities p(s_h | y^{h′}, Θ) for each vector s_h using the approximation provided by ET. [sent-1245, score-0.618]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('sh', 0.463), ('nmf', 0.315), ('generative', 0.283), ('preselection', 0.261), ('wh', 0.191), ('old', 0.181), ('bars', 0.176), ('cke', 0.172), ('ggert', 0.146), ('mca', 0.146), ('ucke', 0.146), ('runcation', 0.14), ('wdh', 0.14), ('causes', 0.134), ('linca', 0.134), ('sahani', 0.127), ('hidden', 0.125), ('yd', 0.115), ('variational', 0.11), ('trials', 0.109), ('mcaet', 0.102), ('xpectation', 0.099), ('reliability', 0.091), ('cooling', 0.083), ('opt', 0.079), ('mae', 0.072), ('truncation', 0.072), ('em', 0.071), ('points', 0.068), ('truncated', 0.06), ('likelihood', 0.06), ('latents', 0.057), ('posterior', 0.056), ('schedule', 0.054), ('init', 0.051), ('spratling', 0.051), ('cut', 0.047), ('elds', 0.046), ('temperature', 0.045), ('bright', 0.044), ('approximation', 0.043), ('units', 0.039), ('ful', 0.038), ('ica', 0.038), ('generating', 0.038), ('approximations', 0.037), ('update', 0.037), ('ub', 0.036), ('noise', 0.036), ('reconstruction', 0.035), ('extracted', 0.034), ('expectation', 0.033), ('selection', 0.033), ('frankfurt', 0.033), ('stretched', 0.033), ('xperiments', 0.033), ('twenty', 0.032), ('wht', 0.032), ('patches', 0.031), ('annealing', 0.03), ('pixels', 0.03), ('equation', 0.03), ('olshausen', 0.03), ('seung', 0.03), ('iterations', 0.029), ('maxima', 0.029), ('summands', 0.029), ('lled', 0.028), ('cause', 0.028), ('nally', 0.028), ('hyv', 0.028), ('hereby', 0.027), ('quality', 0.027), ('scheme', 0.027), ('ten', 0.027), ('data', 0.027), ('sums', 0.026), ('candidates', 0.026), ('multiplicative', 0.024), ('coding', 0.024), ('dark', 0.024), ('tractable', 0.024), ('increased', 0.024), ('recovered', 0.024), ('rinen', 0.024), ('exact', 0.023), ('extraction', 0.022), ('global', 0.022), ('source', 0.022), ('models', 0.022), ('activations', 0.022), ('hochreiter', 0.022), ('rner', 0.022), ('yh', 0.022), ('proportional', 0.022), ('lee', 0.021), ('dayan', 0.021), ('mixed', 0.021), ('bar', 0.021), ('sorting', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999917 38 jmlr-2010-Expectation Truncation and the Benefits of Preselection In Training Generative Models

Author: Jörg Lücke, Julian Eggert

Abstract: We show how a preselection of hidden variables can be used to efficiently train generative models with binary hidden variables. The approach is based on Expectation Maximization (EM) and uses an efficiently computable approximation to the sufficient statistics of a given model. The computational cost to compute the sufficient statistics is strongly reduced by selecting, for each data point, the relevant hidden causes. The approximation is applicable to a wide range of generative models and provides an interpretation of the benefits of preselection in terms of a variational EM approximation. To empirically show that the method maximizes the data likelihood, it is applied to different types of generative models including: a version of non-negative matrix factorization (NMF), a model for non-linear component extraction (MCA), and a linear generative model similar to sparse coding. The derived algorithms are applied to both artificial and realistic data, and are compared to other models in the literature. We find that the training scheme can reduce computational costs by orders of magnitude and allows for a reliable extraction of hidden causes. Keywords: maximum likelihood, deterministic approximations, variational EM, generative models, component extraction, multiple-cause models

2 0.082523368 33 jmlr-2010-Efficient Heuristics for Discriminative Structure Learning of Bayesian Network Classifiers

Author: Franz Pernkopf, Jeff A. Bilmes

Abstract: We introduce a simple order-based greedy heuristic for learning discriminative structure within generative Bayesian network classifiers. We propose two methods for establishing an order of N features. They are based on the conditional mutual information and classification rate (i.e., risk), respectively. Given an ordering, we can find a discriminative structure with O N k+1 score evaluations (where constant k is the tree-width of the sub-graph over the attributes). We present results on 25 data sets from the UCI repository, for phonetic classification using the TIMIT database, for a visual surface inspection task, and for two handwritten digit recognition tasks. We provide classification performance for both discriminative and generative parameter learning on both discriminatively and generatively structured networks. The discriminative structure found by our new procedures significantly outperforms generatively produced structures, and achieves a classification accuracy on par with the best discriminative (greedy) Bayesian network learning approach, but does so with a factor of ∼10-40 speedup. We also show that the advantages of generative discriminatively structured Bayesian network classifiers still hold in the case of missing features, a case where generative classifiers have an advantage over discriminative classifiers. Keywords: Bayesian networks, classification, discriminative learning, structure learning, graphical model, missing feature

3 0.071560867 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models

Author: Kuzman Ganchev, João Graça, Jennifer Gillenwater, Ben Taskar

Abstract: We present posterior regularization, a probabilistic framework for structured, weakly supervised learning. Our framework efficiently incorporates indirect supervision via constraints on posterior distributions of probabilistic models with latent variables. Posterior regularization separates model complexity from the complexity of structural constraints it is desired to satisfy. By directly imposing decomposable regularization on the posterior moments of latent variables during learning, we retain the computational efficiency of the unconstrained model while ensuring desired constraints hold in expectation. We present an efficient algorithm for learning with posterior regularization and illustrate its versatility on a diverse set of structural constraints such as bijectivity, symmetry and group sparsity in several large scale experiments, including multi-view learning, cross-lingual dependency grammar induction, unsupervised part-of-speech induction, and bitext word alignment.1 Keywords: posterior regularization framework, unsupervised learning, latent variables models, prior knowledge, natural language processing

4 0.069689885 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding

Author: Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro

Abstract: Sparse coding—that is, modelling data vectors as sparse linear combinations of basis elements—is widely used in machine learning, neuroscience, signal processing, and statistics. This paper focuses on the large-scale matrix factorization problem that consists of learning the basis set in order to adapt it to specific data. Variations of this problem include dictionary learning in signal processing, non-negative matrix factorization and sparse principal component analysis. In this paper, we propose to address these tasks with a new online optimization algorithm, based on stochastic approximations, which scales up gracefully to large data sets with millions of training samples, and extends naturally to various matrix factorization formulations, making it suitable for a wide range of learning problems. A proof of convergence is presented, along with experiments with natural images and genomic data demonstrating that it leads to state-of-the-art performance in terms of speed and optimization for both small and large data sets. Keywords: basis pursuit, dictionary learning, matrix factorization, online learning, sparse coding, sparse principal component analysis, stochastic approximations, stochastic optimization, nonnegative matrix factorization

5 0.067871973 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data

Author: Gideon S. Mann, Andrew McCallum

Abstract: In this paper, we present an overview of generalized expectation criteria (GE), a simple, robust, scalable method for semi-supervised training using weakly-labeled data. GE fits model parameters by favoring models that match certain expectation constraints, such as marginal label distributions, on the unlabeled data. This paper shows how to apply generalized expectation criteria to two classes of parametric models: maximum entropy models and conditional random fields. Experimental results demonstrate accuracy improvements over supervised training and a number of other stateof-the-art semi-supervised learning methods for these models. Keywords: generalized expectation criteria, semi-supervised learning, logistic regression, conditional random fields

6 0.06631308 17 jmlr-2010-Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing

7 0.058114819 23 jmlr-2010-Classification with Incomplete Data Using Dirichlet Process Priors

8 0.049823202 117 jmlr-2010-Why Does Unsupervised Pre-training Help Deep Learning?

9 0.04788259 75 jmlr-2010-Mean Field Variational Approximation for Continuous-Time Bayesian Networks

10 0.045010861 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide

11 0.044501629 108 jmlr-2010-Stochastic Complexity and Generalization Error of a Restricted Boltzmann Machine in Bayesian Estimation

12 0.044330068 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values

13 0.043627556 14 jmlr-2010-Approximate Riemannian Conjugate Gradient Learning for Fixed-Form Variational Bayes

14 0.041582592 29 jmlr-2010-Covariance in Unsupervised Learning of Probabilistic Grammars

15 0.040162586 107 jmlr-2010-Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion

16 0.039791029 109 jmlr-2010-Stochastic Composite Likelihood

17 0.038731363 64 jmlr-2010-Learning Non-Stationary Dynamic Bayesian Networks

18 0.037992667 52 jmlr-2010-Incremental Sigmoid Belief Networks for Grammar Learning

19 0.036509942 41 jmlr-2010-Gaussian Processes for Machine Learning (GPML) Toolbox

20 0.035636611 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.164), (1, 0.043), (2, -0.087), (3, -0.002), (4, 0.003), (5, 0.011), (6, 0.156), (7, -0.058), (8, 0.038), (9, -0.02), (10, -0.063), (11, 0.094), (12, 0.007), (13, -0.013), (14, 0.022), (15, 0.061), (16, 0.05), (17, 0.04), (18, 0.006), (19, 0.03), (20, 0.059), (21, 0.018), (22, -0.138), (23, -0.098), (24, -0.074), (25, -0.088), (26, -0.11), (27, -0.018), (28, 0.034), (29, -0.092), (30, 0.286), (31, -0.022), (32, 0.128), (33, -0.06), (34, -0.009), (35, 0.111), (36, -0.172), (37, 0.125), (38, 0.217), (39, -0.101), (40, -0.106), (41, 0.153), (42, -0.121), (43, 0.184), (44, -0.098), (45, 0.073), (46, -0.131), (47, -0.185), (48, 0.116), (49, -0.109)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93066764 38 jmlr-2010-Expectation Truncation and the Benefits of Preselection In Training Generative Models

Author: Jörg Lücke, Julian Eggert

Abstract: We show how a preselection of hidden variables can be used to efficiently train generative models with binary hidden variables. The approach is based on Expectation Maximization (EM) and uses an efficiently computable approximation to the sufficient statistics of a given model. The computational cost to compute the sufficient statistics is strongly reduced by selecting, for each data point, the relevant hidden causes. The approximation is applicable to a wide range of generative models and provides an interpretation of the benefits of preselection in terms of a variational EM approximation. To empirically show that the method maximizes the data likelihood, it is applied to different types of generative models including: a version of non-negative matrix factorization (NMF), a model for non-linear component extraction (MCA), and a linear generative model similar to sparse coding. The derived algorithms are applied to both artificial and realistic data, and are compared to other models in the literature. We find that the training scheme can reduce computational costs by orders of magnitude and allows for a reliable extraction of hidden causes. Keywords: maximum likelihood, deterministic approximations, variational EM, generative models, component extraction, multiple-cause models

2 0.50032449 33 jmlr-2010-Efficient Heuristics for Discriminative Structure Learning of Bayesian Network Classifiers

Author: Franz Pernkopf, Jeff A. Bilmes

Abstract: We introduce a simple order-based greedy heuristic for learning discriminative structure within generative Bayesian network classifiers. We propose two methods for establishing an order of N features. They are based on the conditional mutual information and classification rate (i.e., risk), respectively. Given an ordering, we can find a discriminative structure with O N k+1 score evaluations (where constant k is the tree-width of the sub-graph over the attributes). We present results on 25 data sets from the UCI repository, for phonetic classification using the TIMIT database, for a visual surface inspection task, and for two handwritten digit recognition tasks. We provide classification performance for both discriminative and generative parameter learning on both discriminatively and generatively structured networks. The discriminative structure found by our new procedures significantly outperforms generatively produced structures, and achieves a classification accuracy on par with the best discriminative (greedy) Bayesian network learning approach, but does so with a factor of ∼10-40 speedup. We also show that the advantages of generative discriminatively structured Bayesian network classifiers still hold in the case of missing features, a case where generative classifiers have an advantage over discriminative classifiers. Keywords: Bayesian networks, classification, discriminative learning, structure learning, graphical model, missing feature

3 0.39487809 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models

Author: Kuzman Ganchev, João Graça, Jennifer Gillenwater, Ben Taskar

Abstract: We present posterior regularization, a probabilistic framework for structured, weakly supervised learning. Our framework efficiently incorporates indirect supervision via constraints on posterior distributions of probabilistic models with latent variables. Posterior regularization separates model complexity from the complexity of structural constraints it is desired to satisfy. By directly imposing decomposable regularization on the posterior moments of latent variables during learning, we retain the computational efficiency of the unconstrained model while ensuring desired constraints hold in expectation. We present an efficient algorithm for learning with posterior regularization and illustrate its versatility on a diverse set of structural constraints such as bijectivity, symmetry and group sparsity in several large scale experiments, including multi-view learning, cross-lingual dependency grammar induction, unsupervised part-of-speech induction, and bitext word alignment.1 Keywords: posterior regularization framework, unsupervised learning, latent variables models, prior knowledge, natural language processing

4 0.34888369 17 jmlr-2010-Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing

Author: Ryo Yoshida, Mike West

Abstract: We describe a class of sparse latent factor models, called graphical factor models (GFMs), and relevant sparse learning algorithms for posterior mode estimation. Linear, Gaussian GFMs have sparse, orthogonal factor loadings matrices, that, in addition to sparsity of the implied covariance matrices, also induce conditional independence structures via zeros in the implied precision matrices. We describe the models and their use for robust estimation of sparse latent factor structure and data/signal reconstruction. We develop computational algorithms for model exploration and posterior mode search, addressing the hard combinatorial optimization involved in the search over a huge space of potential sparse configurations. A mean-field variational technique coupled with annealing is developed to successively generate “artificial” posterior distributions that, at the limiting temperature in the annealing schedule, define required posterior modes in the GFM parameter space. Several detailed empirical studies and comparisons to related approaches are discussed, including analyses of handwritten digit image and cancer gene expression data. Keywords: annealing, graphical factor models, variational mean-field method, MAP estimation, sparse factor analysis, gene expression profiling

5 0.3315191 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding

Author: Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro

Abstract: Sparse coding—that is, modelling data vectors as sparse linear combinations of basis elements—is widely used in machine learning, neuroscience, signal processing, and statistics. This paper focuses on the large-scale matrix factorization problem that consists of learning the basis set in order to adapt it to specific data. Variations of this problem include dictionary learning in signal processing, non-negative matrix factorization and sparse principal component analysis. In this paper, we propose to address these tasks with a new online optimization algorithm, based on stochastic approximations, which scales up gracefully to large data sets with millions of training samples, and extends naturally to various matrix factorization formulations, making it suitable for a wide range of learning problems. A proof of convergence is presented, along with experiments with natural images and genomic data demonstrating that it leads to state-of-the-art performance in terms of speed and optimization for both small and large data sets. Keywords: basis pursuit, dictionary learning, matrix factorization, online learning, sparse coding, sparse principal component analysis, stochastic approximations, stochastic optimization, nonnegative matrix factorization

6 0.31583193 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data

7 0.29540709 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data

8 0.29069245 14 jmlr-2010-Approximate Riemannian Conjugate Gradient Learning for Fixed-Form Variational Bayes

9 0.27742931 109 jmlr-2010-Stochastic Composite Likelihood

10 0.27580673 64 jmlr-2010-Learning Non-Stationary Dynamic Bayesian Networks

11 0.2663511 23 jmlr-2010-Classification with Incomplete Data Using Dirichlet Process Priors

12 0.23902677 108 jmlr-2010-Stochastic Complexity and Generalization Error of a Restricted Boltzmann Machine in Bayesian Estimation

13 0.2328874 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values

14 0.2269754 75 jmlr-2010-Mean Field Variational Approximation for Continuous-Time Bayesian Networks

15 0.22057758 117 jmlr-2010-Why Does Unsupervised Pre-training Help Deep Learning?

16 0.21196803 22 jmlr-2010-Classification Using Geometric Level Sets

17 0.20859888 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide

18 0.20708412 77 jmlr-2010-Model-based Boosting 2.0

19 0.19964923 29 jmlr-2010-Covariance in Unsupervised Learning of Probabilistic Grammars

20 0.19828612 63 jmlr-2010-Learning Instance-Specific Predictive Models


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(8, 0.011), (21, 0.011), (32, 0.577), (33, 0.018), (36, 0.032), (37, 0.047), (75, 0.107), (81, 0.012), (85, 0.052), (96, 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.92735744 73 jmlr-2010-Maximum Likelihood in Cost-Sensitive Learning: Model Specification, Approximations, and Upper Bounds

Author: Jacek P. Dmochowski, Paul Sajda, Lucas C. Parra

Abstract: The presence of asymmetry in the misclassification costs or class prevalences is a common occurrence in the pattern classification domain. While much interest has been devoted to the study of cost-sensitive learning techniques, the relationship between cost-sensitive learning and the specification of the model set in a parametric estimation framework remains somewhat unclear. To that end, we differentiate between the case of the model including the true posterior, and that in which the model is misspecified. In the former case, it is shown that thresholding the maximum likelihood (ML) estimate is an asymptotically optimal solution to the risk minimization problem. On the other hand, under model misspecification, it is demonstrated that thresholded ML is suboptimal and that the risk-minimizing solution varies with the misclassification cost ratio. Moreover, we analytically show that the negative weighted log likelihood (Elkan, 2001) is a tight, convex upper bound of the empirical loss. Coupled with empirical results on several real-world data sets, we argue that weighted ML is the preferred cost-sensitive technique. Keywords: empirical risk minimization, loss function, cost-sensitive learning, imbalanced data sets

2 0.89857948 85 jmlr-2010-On the Foundations of Noise-free Selective Classification

Author: Ran El-Yaniv, Yair Wiener

Abstract: We consider selective classification, a term we adopt here to refer to ‘classification with a reject option.’ The essence in selective classification is to trade-off classifier coverage for higher accuracy. We term this trade-off the risk-coverage (RC) trade-off. Our main objective is to characterize this trade-off and to construct algorithms that can optimally or near optimally achieve the best possible trade-offs in a controlled manner. For noise-free models we present in this paper a thorough analysis of selective classification including characterizations of RC trade-offs in various interesting settings. Keywords: classification with a reject option, selective classification, perfect learning, high performance classification, risk-coverage trade-off

same-paper 3 0.87563437 38 jmlr-2010-Expectation Truncation and the Benefits of Preselection In Training Generative Models

Author: Jörg Lücke, Julian Eggert

Abstract: We show how a preselection of hidden variables can be used to efficiently train generative models with binary hidden variables. The approach is based on Expectation Maximization (EM) and uses an efficiently computable approximation to the sufficient statistics of a given model. The computational cost to compute the sufficient statistics is strongly reduced by selecting, for each data point, the relevant hidden causes. The approximation is applicable to a wide range of generative models and provides an interpretation of the benefits of preselection in terms of a variational EM approximation. To empirically show that the method maximizes the data likelihood, it is applied to different types of generative models including: a version of non-negative matrix factorization (NMF), a model for non-linear component extraction (MCA), and a linear generative model similar to sparse coding. The derived algorithms are applied to both artificial and realistic data, and are compared to other models in the literature. We find that the training scheme can reduce computational costs by orders of magnitude and allows for a reliable extraction of hidden causes. Keywords: maximum likelihood, deterministic approximations, variational EM, generative models, component extraction, multiple-cause models

4 0.86538887 98 jmlr-2010-Regularized Discriminant Analysis, Ridge Regression and Beyond

Author: Zhihua Zhang, Guang Dai, Congfu Xu, Michael I. Jordan

Abstract: Fisher linear discriminant analysis (FDA) and its kernel extension—kernel discriminant analysis (KDA)—are well known methods that consider dimensionality reduction and classification jointly. While widely deployed in practical problems, there are still unresolved issues surrounding their efficient implementation and their relationship with least mean squares procedures. In this paper we address these issues within the framework of regularized estimation. Our approach leads to a flexible and efficient implementation of FDA as well as KDA. We also uncover a general relationship between regularized discriminant analysis and ridge regression. This relationship yields variations on conventional FDA based on the pseudoinverse and a direct equivalence to an ordinary least squares estimator. Keywords: Fisher discriminant analysis, reproducing kernel, generalized eigenproblems, ridge regression, singular value decomposition, eigenvalue decomposition

5 0.55394423 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization

Author: Ming Yuan, Marten Wegkamp

Abstract: In this paper, we investigate the problem of binary classification with a reject option in which one can withhold the decision of classifying an observation at a cost lower than that of misclassification. Since the natural loss function is non-convex so that empirical risk minimization easily becomes infeasible, the paper proposes minimizing convex risks based on surrogate convex loss functions. A necessary and sufficient condition for infinite sample consistency (both risks share the same minimizer) is provided. Moreover, we show that the excess risk can be bounded through the excess surrogate risk under appropriate conditions. These bounds can be tightened by a generalized margin condition. The impact of the results is illustrated on several commonly used surrogate loss functions. Keywords: classification, convex surrogate loss, empirical risk minimization, generalized margin condition, reject option

6 0.54008615 60 jmlr-2010-Learnability, Stability and Uniform Convergence

7 0.53442127 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning

8 0.51177418 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels

9 0.50206602 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond

10 0.48621157 102 jmlr-2010-Semi-Supervised Novelty Detection

11 0.4860031 106 jmlr-2010-Stability Bounds for Stationary φ-mixing and β-mixing Processes

12 0.48543841 24 jmlr-2010-Collective Inference for Extraction MRFs Coupled with Symmetric Clique Potentials

13 0.48058438 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes

14 0.47301868 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values

15 0.47269252 109 jmlr-2010-Stochastic Composite Likelihood

16 0.46740034 22 jmlr-2010-Classification Using Geometric Level Sets

17 0.46705258 101 jmlr-2010-Second-Order Bilinear Discriminant Analysis

18 0.46682584 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data

19 0.46622258 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization

20 0.46597973 25 jmlr-2010-Composite Binary Losses