jmlr jmlr2010 jmlr2010-109 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Joshua V. Dillon, Guy Lebanon
Abstract: Maximum likelihood estimators are often of limited practical use due to the intensive computation they require. We propose a family of alternative estimators that maximize a stochastic variation of the composite likelihood function. Each of the estimators resolves the computation-accuracy tradeoff differently, and taken together they span a continuous spectrum of computation-accuracy tradeoff resolutions. We prove the consistency of the estimators, provide formulas for their asymptotic variance, statistical robustness, and computational complexity. We discuss experimental results in the context of Boltzmann machines and conditional random fields. The theoretical and experimental studies demonstrate the effectiveness of the estimators when the computational resources are insufficient. They also demonstrate that in some cases reduced computational complexity is associated with robustness, thereby increasing statistical accuracy. Keywords: Markov random fields, composite likelihood, maximum likelihood estimation
Reference: text
sentIndex sentText sentNum sentScore
1 College of Computing, Georgia Institute of Technology, Atlanta, GA, USA. Editor: Fernando Pereira. Abstract: Maximum likelihood estimators are often of limited practical use due to the intensive computation they require. [sent-5, score-0.275]
2 We propose a family of alternative estimators that maximize a stochastic variation of the composite likelihood function. [sent-6, score-0.429]
3 Keywords: Markov random fields, composite likelihood, maximum likelihood estimation 1. [sent-12, score-0.311]
4 ..., X^{(n)}), X^{(i)} ∈ R^m, (1) and is sampled iid from a parametric distribution p_{θ_0} with θ_0 ∈ Θ ⊂ R^r; a maximum likelihood estimator (MLE) θ̂_n^{ml} is a maximizer of the log-likelihood function ℓ_n(θ; D) = ∑_{i=1}^n log p_θ(X^{(i)}), (2) θ̂_n^{ml} = arg max_θ ℓ_n(θ; D). [sent-17, score-0.267]
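To make the computational contrast concrete, here is a minimal sketch (not from the paper) of the exact log-likelihood for a small binary Boltzmann machine; the pairwise parameter matrix W and the helper names are illustrative assumptions. The point is that the normalizer log Z(θ) sums over all 2^m configurations, which is exactly the bottleneck that motivates the cheaper estimators discussed below.

```python
import itertools
import numpy as np
from scipy.special import logsumexp

def full_loglik(X, W):
    """Exact log-likelihood for a binary Boltzmann machine with pairwise
    parameters W; computing log Z(theta) enumerates all 2^m configurations,
    so the cost grows exponentially in the number of variables m."""
    m = X.shape[1]
    configs = np.array(list(itertools.product([0, 1], repeat=m)), dtype=float)
    log_pots = 0.5 * np.einsum('bi,ij,bj->b', configs, W, configs)  # unnormalized log-probabilities
    logZ = logsumexp(log_pots)
    data_log_pots = 0.5 * np.einsum('bi,ij,bj->b', X, W, X)
    return float(np.sum(data_log_pots - logZ))
```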
5 In contrast to many previously proposed approximate estimators, our estimators are statistically consistent and admit a precise quantification of both computational complexity and statistical accuracy through their asymptotic variance. [sent-37, score-0.27]
6 This “win-win” situation conflicts with the conventional wisdom stating that moving from the MLE to pseudo likelihood and other related estimators results in a computational win but a statistical loss. [sent-40, score-0.584]
7 In the case of MCMC, a number of asymptotic results exist (Casella and Robert, 2004), but since MCMC plays a role inside each gradient descent or EM iteration it is hard to analyze the asymptotic behavior of the resulting parameter estimates. [sent-55, score-0.264]
8 Our work draws on the composite likelihood method for parameter estimation proposed by Lindsay (1988) which in turn generalized the pseudo likelihood of Besag (1974). [sent-57, score-0.812]
9 A selection of more recent studies on pseudo and composite likelihood are Arnold and Strauss (1991), Liang and Yu (2003), Varin and Vidoni (2005), Sutton and McCallum (2007) and Hjort and Varin (2008). [sent-58, score-0.653]
10 Most of the recent studies in this area examine the behavior of the pseudo or composite likelihood in a particular modeling situation. [sent-59, score-0.62]
11 The work of Liang and Jordan (2008) is also interesting in that the authors employ composite likelihood m-estimators and asymptotic arguments to compare the risk of discriminative and generative models. [sent-63, score-0.465]
12 (4) We start by reviewing the pseudo log-likelihood function (Besag, 1974) associated with the data D (1), pℓ_n(θ; D) := ∑_{i=1}^n ∑_{j=1}^m log p_θ(X_j^{(i)} | X_{−j}^{(i)}). [sent-78, score-0.44]
13 (5) The maximum pseudo likelihood estimator (MPLE) θ̂_n is consistent, that is, θ̂_n → θ_0 with probability 1, but possesses considerably higher asymptotic variance than the MLE's (nI(θ_0))^{−1}. [sent-79, score-0.726]
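For comparison with the full-likelihood sketch above, the following is a minimal sketch of Besag's pseudo log-likelihood (5) for the same toy binary Boltzmann machine; each conditional p(x_j | x_{−j}) needs a normalizer over only the two values of x_j, so no global Z(θ) is computed. The model, parameter names, and toy data are illustrative assumptions rather than the paper's experimental setup.

```python
import numpy as np

def boltzmann_log_pot(x, W):
    """Unnormalized log-probability of a binary configuration x under symmetric W."""
    return 0.5 * x @ W @ x

def conditional_logp(x, j, W):
    """log p(x_j | x_{-j}); the local normalizer sums over only two configurations."""
    logs = []
    for v in (0.0, 1.0):
        xv = x.copy()
        xv[j] = v
        logs.append(boltzmann_log_pot(xv, W))
    return logs[int(x[j])] - np.logaddexp(logs[0], logs[1])

def pseudo_loglik(X, W):
    """Pseudo log-likelihood: sum over samples i and coordinates j of log p(x_j | x_-j)."""
    n, m = X.shape
    return sum(conditional_logp(X[i], j, W) for i in range(n) for j in range(m))

# toy usage with random parameters and data
rng = np.random.default_rng(0)
m = 5
W = rng.normal(size=(m, m)); W = (W + W.T) / 2; np.fill_diagonal(W, 0.0)
X = rng.integers(0, 2, size=(20, m)).astype(float)
print(pseudo_loglik(X, W))
```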
14 The MLE has low asymptotic variance but high computational complexity while the MPLE has higher asymptotic variance but low computational complexity. [sent-88, score-0.391]
15 To this end we define the stochastic composite likelihood whose maximization provides a family of consistent estimators with statistical accuracy and computational complexity spanning the entire accuracy-complexity spectrum. [sent-90, score-0.484]
16 Stochastic composite likelihood generalizes the likelihood and pseudo likelihood functions by constructing an objective function that is a stochastic sum of likelihood objects. [sent-91, score-1.231]
17 We start by defining the notion of m-pairs and likelihood objects and then proceed to stochastic composite likelihood. [sent-92, score-0.379]
18 The likelihood object associated with an m-pair (A, B) and X is defined as S_θ(A, B) := log p_θ(X_A | X_B), where X_S is defined in (4). [sent-97, score-0.323]
19 The composite log-likelihood function (Lindsay, 1988) is a collection of likelihood objects defined by a finite sequence of m-pairs (A1 , B1 ), . [sent-98, score-0.344]
20 (6) There is a certain lack of flexibility associated with the composite likelihood framework as each likelihood object is either selected or not for the entire sample X (1) , . [sent-102, score-0.532]
21 For example, available computational resources may allow the computation of the log-likelihood for 20% of the samples, and the pseudo likelihood for the remaining 80%. [sent-107, score-0.501]
22 In the case of composite likelihood if we select the full likelihood component (or the pseudo likelihood or any other likelihood object) then this component is applied to all samples indiscriminately. [sent-108, score-1.196]
23 In SCL, different likelihood objects Sθ (A j , B j ) may be selected for different samples with the possibility of some likelihood objects being selected for only a small fraction of the data samples. [sent-109, score-0.508]
24 For example, we may wish to avoid selecting a pseudo likelihood component for a certain sample X (i) if the full likelihood component was already selected for it. [sent-112, score-0.722]
25 The stochastic composite log-likelihood (SCL) is scℓ_n(θ; D, Z) := (1/n) ∑_{i=1}^n m_θ(X^{(i)}, Z^{(i)}), where m_θ(X, Z) := ∑_{j=1}^k β_j Z_j log p_θ(X_{A_j} | X_{B_j}), (7) and where, for brevity, we typically omit D, Z in favor of scℓ_n(θ). [sent-132, score-0.386]
26 A multinomial model Z ∼ Mult(1, λ) implies that for each sample Z (i) a multivariate Bernoulli experiment is conducted with precisely one likelihood object being selected depending on the selection probabilities λ1 , . [sent-162, score-0.254]
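The following sketch shows one natural way to evaluate the SCL objective (7) given indicator draws Z; it is an assumption about an implementation, not the authors' code. Each `components[j]` is a callable returning the likelihood object S_θ(A_j, B_j) for a single sample, for instance built from the conditionals in the sketch above; independent Bernoulli(λ_j) selection is shown, whereas a Mult(1, λ) model would pick exactly one object per sample.

```python
import numpy as np

def draw_indicators(n, lam, rng):
    """Independent Bernoulli selection: Z_j^(i) ~ Bernoulli(lambda_j)."""
    return rng.random((n, len(lam))) < np.asarray(lam)

def scl_objective(X, Z, beta, components):
    """Stochastic composite log-likelihood (7):
    (1/n) * sum_i sum_j beta_j * Z[i, j] * S_theta(A_j, B_j) evaluated at X[i]."""
    n = X.shape[0]
    total = 0.0
    for i in range(n):
        for j, S in enumerate(components):
            if Z[i, j]:
                total += beta[j] * S(X[i])
    return total / n
```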
27 In contrast to the log-likelihood and pseudo log-likelihood functions, the SCL function and its maximizer are random variables that depend on the indicator variables Z (1) , . [sent-176, score-0.309]
28 Choosing appropriate values of (λ, β) retrieves the special cases of the MLE, MPLE, and maximum composite likelihood, with each selection being associated with a distinct statistical accuracy and computational complexity. [sent-186, score-0.368]
29 The k-order PL function offers a practical alternative to FL (1-order PL corresponds to the traditional pseudo likelihood and 2-order PL to its analog, p(X_{i,j} | X_{{i,j}^c})). [sent-198, score-0.501]
30 The pseudo likelihood function (5) is illustrated in the second box, where each row corresponds to one of the five PL components. [sent-218, score-0.501]
31 The third SCL policy reflects a stochastic combination of first and second order pseudo likelihood components. [sent-223, score-0.654]
32 Additional insight may be gained at this point by considering Figure 3 which plots several SCL estimators as points in the plane whose x and y coordinates correspond to normalized asymptotic variance and computational complexity respectively. [sent-229, score-0.294]
33 The policies are full likelihood (FL, top), pseudo likelihood (PL, middle), and a stochastic combination of first and second order pseudo likelihood with the first order components selected with probability 0. [sent-254, score-1.362]
34 The sample runs for the policies are illustrated by placing a diamond box in table entries corresponding to selected likelihood objects (rows corresponding to likelihood objects and columns to X (1) , . [sent-257, score-0.595]
35 The FLOP count of each likelihood object determines the shade of the diamond boxes, while the total FLOP counts per example and per likelihood object are displayed as table marginals (bottom row and right column for each policy). [sent-261, score-0.5]
36 It can be shown that a selection equivalent to the pseudo likelihood function, that is, S = {(A1 , B1 ), . [sent-289, score-0.534]
37 Proposition 2 Making the assumptions of Proposition 1 as well as convexity of Θ ⊂ R^r, we have the following convergence in distribution: √n(θ̂_n^{msl} − θ_0) ⇝ N(0, ϒΣϒ), where ϒ^{−1} = ∑_{j=1}^k β_j λ_j Var_{θ_0}(∇S_{θ_0}(A_j, B_j)) and Σ = Var_{θ_0}(∑_{j=1}^k β_j λ_j ∇S_{θ_0}(A_j, B_j)). [sent-299, score-0.257]
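A plug-in estimate of the sandwich covariance ϒΣϒ of Proposition 2 can be sketched as follows; it assumes per-sample score vectors ∇S_θ(A_j, B_j) are available as arrays (for example via automatic differentiation or finite differences), and the function and argument names are hypothetical.

```python
import numpy as np

def sandwich_covariance(grads, beta, lam):
    """Plug-in estimate of Upsilon * Sigma * Upsilon from Proposition 2.
    grads[j] is an (n, r) array whose rows are per-sample scores of the j-th
    likelihood object, evaluated at the parameter estimate."""
    r = grads[0].shape[1]
    Upsilon_inv = np.zeros((r, r))
    combined = np.zeros_like(grads[0])
    for j in range(len(grads)):
        Upsilon_inv += beta[j] * lam[j] * np.cov(grads[j], rowvar=False)
        combined = combined + beta[j] * lam[j] * grads[j]
    Sigma = np.cov(combined, rowvar=False)
    Upsilon = np.linalg.inv(Upsilon_inv)
    return Upsilon @ Sigma @ Upsilon
```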
38 Figures 1,2,3 provide the asymptotic variance for some SCL estimators and describe how it can be used to gain insight into which estimator to use. [sent-305, score-0.308]
39 The fact that √n(θ̂_n − θ_0) converges in distribution to a Gaussian with zero mean (for the MLE and similarly for SCL estimators as we show above) implies that the estimator's asymptotic behavior, up to n^{−1/2} order, is determined exclusively by the asymptotic variance. [sent-306, score-0.347]
40 We thus follow the statistical convention of conducting first order asymptotic analysis and concentrate on the estimator’s asymptotic variance. [sent-309, score-0.264]
41 Robustness of θ̂^{msl}: We observed in our experiments (see Section 8) that the SCL estimator sometimes performed better on a held-out test set than did the maximum likelihood estimator. [sent-316, score-0.266]
42 Approaches based on pseudo likelihood and composite likelihood are naturally well-suited in this case due to the cancellation of the normalization term in the probability ratios defining conditional distributions. [sent-356, score-0.836]
43 (14) The primary bottlenecks in obtaining the maximum likelihood are the computations of log Z(θ) and ∇ log Z(θ). [sent-361, score-0.252]
44 In contrast, the conditional distributions that form the composite likelihood of (14) are given by (note the cancellation of Z(θ)) p_θ(x_A | x_B) = ∑_{x′_{(A∪B)^c}} exp(∑_{C∈C} θ_C f_C((x_A, x_B, x′_{(A∪B)^c})_C)) / ∑_{x′_A} ∑_{x″_{(A∪B)^c}} exp(∑_{C∈C} θ_C f_C((x′_A, x_B, x″_{(A∪B)^c})_C)). [sent-363, score-0.335]
45 In general, selecting small |A j | and B j = (A j )c leads to efficient computation of the composite likelihood and its gradient. [sent-366, score-0.311]
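As a concrete illustration of the cancellation of Z(θ) in the displayed conditional, here is a brute-force sketch of p(x_A | x_B) for a small binary MRF; it enumerates only the variables outside A ∪ B, which is practical precisely when |A_j| is small and B_j = (A_j)^c as noted above. The generic `log_potential` callable standing in for ∑_C θ_C f_C(x_C) is an assumption for illustration.

```python
import itertools
import numpy as np
from scipy.special import logsumexp

def conditional_logprob(xA, xB, A, B, m, log_potential):
    """log p(x_A | x_B) for a binary MRF on m variables; the global
    normalizer Z(theta) cancels in the numerator/denominator ratio."""
    rest = [i for i in range(m) if i not in set(A) | set(B)]

    def marginal_log(xa):
        # sum the unnormalized probability over variables outside A and B
        logs = []
        for xr in itertools.product([0, 1], repeat=len(rest)):
            x = np.zeros(m)
            x[list(A)] = xa
            x[list(B)] = xB
            if rest:
                x[rest] = xr
            logs.append(log_potential(x))
        return logsumexp(logs)

    num = marginal_log(np.asarray(xA, dtype=float))
    den = logsumexp([marginal_log(np.asarray(xa, dtype=float))
                     for xa in itertools.product([0, 1], repeat=len(A))])
    return num - den
```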
46 Automatic Selection of β: As Proposition 2 indicates, the weight vector β and selection probabilities λ play an important role in the statistical accuracy of the estimator through its asymptotic variance. [sent-372, score-0.26]
47 A common solution is to consider the determinant as a one-dimensional measure of the size of the variance matrix, and minimize J(β) = log det(ϒΣϒ) = log det Σ + 2 log det ϒ. [sent-377, score-0.266]
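One simple way to carry out this minimization is sketched below: a random search over the weight simplex using the log-determinant criterion, reusing the `sandwich_covariance` helper sketched after Proposition 2. This is an illustrative assumption about a possible implementation, not the optimization procedure used in the paper.

```python
import numpy as np

def logdet_criterion(beta, lam, grads):
    """J(beta) = log det(Upsilon Sigma Upsilon) for a candidate weight vector beta."""
    V = sandwich_covariance(grads, beta, lam)  # helper from the earlier sketch
    sign, logdet = np.linalg.slogdet(V)
    return logdet if sign > 0 else np.inf

def select_beta(lam, grads, n_trials=2000, seed=0):
    """Random search over the probability simplex for a beta minimizing J(beta)."""
    rng = np.random.default_rng(seed)
    k = len(lam)
    best_beta, best_J = np.full(k, 1.0 / k), np.inf
    for _ in range(n_trials):
        beta = rng.dirichlet(np.ones(k))
        J = logdet_criterion(beta, lam, grads)
        if J < best_J:
            best_beta, best_J = beta, J
    return best_beta, best_J
```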
48 1 Toy Example: Boltzmann Machines. We illustrate the improvement in asymptotic variance of the MSCLE associated with adding higher order Boltzmann machine likelihood components with increasingly higher probability. [sent-408, score-0.404]
49 Figure 3 displays the asymptotic accuracy and complexity for different SCL policies for m = 9 binary valued vertices of a Boltzmann machine. [sent-414, score-0.259]
50 We explore three policies in which we denote pseudo likelihood components of size, or order, k. [sent-415, score-0.533]
51 By taking different linear combinations of various sized pseudo likelihood components, we span a continuous spectrum of accuracy-complexity resolutions. [sent-417, score-0.501]
52 The likelihood components were mixtures of full and pseudo (|A j | = 1) likelihood (rows 1,3) and pseudo and 2nd order pseudo (|A j | = 2) likelihood (rows 2,4). [sent-451, score-1.535]
53 The figure demonstrates how the train log-likelihood increases with the weight and selection probability of the full likelihood (rows 1,3) and of the 2nd order pseudo likelihood (rows 2,4). [sent-455, score-0.868]
54 This increase in train log-likelihood is also correlated with an increase in computational complexity as higher order likelihood components require more computation. [sent-456, score-0.321]
55 The pure policies PL1 and PL2 are indicated by black circles and the computational complexity of the full likelihood indicated by a dashed line (corresponding normalized asymptotic variance is 1). [sent-472, score-0.475]
56 As the graph size increases, the computational cost increases dramatically, in particular for the full likelihood policy and to a lesser extent for the pseudo likelihood policy. [sent-474, score-0.811]
57 decreasing the influence of full likelihood in favor of pseudo likelihood. [sent-475, score-0.501]
58 The fact that this happens for (relatively) weak regularization, σ2 = 10, indicates that lower order pseudo likelihood has a regularization effect which improves prediction accuracy when the model is not regularized enough. [sent-476, score-0.525]
59 Rows 1,3 are stochastic mixtures of full (FL) and pseudo (PL1) log-likelihood components while rows 2,4 are PL1 and 2nd order pseudo (PL2). [sent-592, score-0.71]
60 The points represent different stochastic combinations of full and pseudo likelihood components. [sent-595, score-0.536]
61 We consider SCL components of the form p(X2 ,Y2 |Y1 ,Y3 ), p(X2 , X3 ,Y2 ,Y3 |Y1 ,Y4 ) which we refer to as first and second order pseudo likelihood (with higher order components generalizing in a straightforward manner). [sent-603, score-0.565]
62 Figures 9 and 10 depict train and test negative log-likelihood, that is, perplexity, for the SCL estimator θ̂^{msl} with a pseudo/full likelihood selection policy (PL1/FL). [sent-616, score-0.483]
63 The lower order pseudo likelihood component is always selected and has weight 1 − β. [sent-618, score-0.556]
64 As expected the test set perplexity dominates the train-set perplexity. [sent-619, score-0.257]
65 We next demonstrate the effect of using a 1st order/2nd order pseudo likelihood selection policy (PL1/PL2). [sent-623, score-0.652]
66 Recall, our notion of pseudo likelihood never entails conditioning on x, although in principle it could. [sent-624, score-0.501]
67 Figure 9: Train set (top) and test set (bottom) negative log-likelihood (perplexity) for the Boltzmann chain MRF with pseudo/full likelihood selection policy (PL1/FL). [sent-722, score-0.412]
68 The best achievable test set perplexity is about 190. [sent-728, score-0.257]
69 Unsurprisingly the test set perplexity dominates the train set perplexity at each σ2 (column). [sent-729, score-0.551]
70 pseudo likelihood is more traditional, for example, p(Y2 | Y1, Y3, X2) and p(Y2, Y3 | Y1, Y4, X2, X3) are valid 1st and 2nd order pseudo likelihood components. [sent-732, score-1.002]
71 Figure 10: Train set and test set perplexities for the Boltzmann chain MRF with PL1/FL selection policy (see above layout description). [sent-755, score-0.306]
72 Note that as the regularizer is weakened the range in perplexity spanned by λ increases and the lower bound decreases. [sent-758, score-0.294]
73 We demonstrate learning on two selection policies: pseudo/full likelihood (Figures 15 and 16) and 1st/2nd order pseudo likelihood (Figures 17 and 18). [sent-761, score-0.726]
74 Intuitively, this seems reasonable as the component likelihood range and variance are constrained by the conditional nature of CRFs. [sent-763, score-0.264]
75 Adding a lower order component such as pseudo likelihood acts as a regularizer that prevents overfitting. [sent-769, score-0.59]
76 Figure 11: Train set (top) and test set (bottom) perplexity for the Boltzmann chain MRF with 1st/2nd order pseudo likelihood selection policy (PL1/PL2). [sent-854, score-0.949]
77 The best achievable test set perplexity is about 189. [sent-859, score-0.257]
78 In comparing these results to PL1/FL, we note that the test set contours exhibit lower perplexity over larger areas. [sent-861, score-0.28]
79 Figure 12: Train (top) and test (bottom) perplexities for the Boltzmann chain MRF with PL1/PL2 selection policy (x-axis: PL2 weight, y-axis: perplexity; see above and previous). [sent-888, score-0.306]
80 10) test perplexity despite PL1/FL including FL as a special case (i. [sent-890, score-0.257]
81 In Figure 21 we used the β heuristic to evaluate train and test perplexity over a (λ, σ2 ) grid. [sent-898, score-0.356]
82 In particular, different (β, λ) parameterizations of the stochastic composite likelihood en- [sent-910, score-0.346]
83 Each point represents the negative log-likelihood (perplexity) and the number of flops required to evaluate the composite likelihood and its gradient under a particular instantiation of the selection policy. [sent-923, score-0.344]
84 The heuristic outperformed the optimal on the test set and had slightly higher perplexity on the training set. [sent-972, score-0.29]
85 Figure 15: Train set (top) and test set (bottom) perplexity for the CRF with pseudo/full likelihood selection policy (PL1/FL). [sent-1061, score-0.6]
86 The best achievable test set perplexity is about 5. [sent-1066, score-0.257]
87 This is partially attributable to a smaller range of the objective: the CRF is already conditional, hence the per-component perplexity range is reduced. [sent-1070, score-0.252]
88 Figure 16: Train (top) and test (bottom) perplexities for a CRF with PL1/FL selection policy (x-axis: FL weight, y-axis: perplexity; see above and Fig. [sent-1091, score-0.266]
89 Figure 17: Train set (top) and test set (bottom) perplexity for a CRF with 1st/2nd order pseudo likelihood selection policy (PL1/PL2). [sent-1193, score-0.909]
90 Figure 18: Train (top) and test (bottom) perplexities for a CRF with PL1/PL2 selection policy (x-axis: PL2 weight, y-axis: perplexity; see above and Fig. [sent-1219, score-0.266]
91 Although increasing λ only brings minor improvement to both the training and testing perplexities, it is worth noting that the test perplexity meets that of the PL1/FL. [sent-1221, score-0.257]
92 Since c_1, c_2 were chosen arbitrarily, θ̂_n^{msl} → θ_0 with probability 1. [sent-1229, score-0.257]
93 Proposition 6 Making the assumptions of Proposition 1 as well as convexity of Θ ⊂ R^r, we have the following convergence in distribution: √n(θ̂_n^{msl} − θ_0) ⇝ N(0, ϒΣϒ). [sent-1230, score-0.257]
94 Each point represents the negative log-likelihood (perplexity) and the number of flops required to evaluate the composite likelihood and its gradient under a particular instance of the selection policy. [sent-1246, score-0.344]
95 Since θ̂_n^{msl} maximizes the SCL, ∇scℓ_n(θ̂_n^{msl}) = 0 and √n(θ̂_n^{msl} − θ_0) = −√n (∇²scℓ_n(θ′))^{−1} ∇scℓ_n(θ_0). [sent-1254, score-0.257]
96 Note that the perplexity is lowest when both components are always selected (λ = 1) and that the PL1/FL policy outperforms the PL1/PL2 policy as expected. [sent-1314, score-0.525]
97 ψ_θ(X, Z) := ∇m_θ(X, Z), ψ̇_θ(X, Z) := ∇²m_θ(X, Z) (matrix of second order derivatives), Ψ_n(θ) := (1/n) ∑_{i=1}^n ψ_θ(X^{(i)}, Z^{(i)}). [sent-1324, score-0.303]
98 However, since scℓ′_n(θ) can be made to achieve values arbitrarily close to µ(θ_0) as θ̂_n → θ_0, we have θ̂_n^{msl} ∈ S for n > N. [sent-1331, score-0.257]
99 1 Case Study • Boltzmann Machines: Figure 1, tabular comparison of policies for computation and accuracy; Figure 2, theoretical analysis of asymptotic variance for trace and determinant; Figure 3, computation/accuracy tradeoff. B. [sent-1345, score-0.374]
100 A note on composite likelihood inference and model selection. [sent-1490, score-0.311]
wordName wordTfidf (topN-words)
[('scl', 0.497), ('pseudo', 0.309), ('msl', 0.257), ('perplexity', 0.228), ('boltzmann', 0.195), ('likelihood', 0.192), ('illon', 0.171), ('xa', 0.15), ('ebanon', 0.147), ('xb', 0.139), ('asymptotic', 0.132), ('omposite', 0.132), ('pl', 0.125), ('flop', 0.124), ('ikelihood', 0.122), ('composite', 0.119), ('mle', 0.119), ('policy', 0.118), ('fl', 0.118), ('tochastic', 0.114), ('crf', 0.114), ('def', 0.101), ('sc', 0.097), ('op', 0.091), ('perplexities', 0.086), ('estimators', 0.083), ('policies', 0.072), ('mrf', 0.071), ('train', 0.066), ('regularizer', 0.066), ('tradeoff', 0.066), ('proposition', 0.058), ('mscle', 0.057), ('var', 0.054), ('approx', 0.049), ('variance', 0.048), ('det', 0.048), ('mple', 0.048), ('unachievable', 0.048), ('ferguson', 0.047), ('estimator', 0.045), ('crfs', 0.044), ('diamond', 0.044), ('chunking', 0.042), ('sentiment', 0.041), ('chain', 0.04), ('boxes', 0.039), ('ml', 0.038), ('dillon', 0.038), ('gorin', 0.038), ('bk', 0.037), ('stochastic', 0.035), ('vp', 0.034), ('heuristic', 0.033), ('objects', 0.033), ('selection', 0.033), ('fourteen', 0.033), ('bottom', 0.032), ('components', 0.032), ('determinant', 0.032), ('complexity', 0.031), ('log', 0.03), ('ak', 0.03), ('np', 0.03), ('pp', 0.03), ('pos', 0.03), ('tokens', 0.03), ('selected', 0.029), ('ger', 0.029), ('test', 0.029), ('discs', 0.029), ('mult', 0.029), ('pseudolikelihood', 0.029), ('varin', 0.029), ('rr', 0.028), ('mrfs', 0.027), ('maximizers', 0.027), ('consistency', 0.026), ('weight', 0.026), ('rows', 0.025), ('weaker', 0.025), ('effectiveness', 0.025), ('liang', 0.025), ('token', 0.025), ('figures', 0.025), ('shading', 0.024), ('aii', 0.024), ('conditional', 0.024), ('mcmc', 0.024), ('accuracy', 0.024), ('elds', 0.023), ('chains', 0.023), ('shaded', 0.023), ('acts', 0.023), ('sentences', 0.023), ('depicts', 0.023), ('heuristically', 0.023), ('contours', 0.023), ('region', 0.022), ('generative', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999887 109 jmlr-2010-Stochastic Composite Likelihood
Author: Joshua V. Dillon, Guy Lebanon
Abstract: Maximum likelihood estimators are often of limited practical use due to the intensive computation they require. We propose a family of alternative estimators that maximize a stochastic variation of the composite likelihood function. Each of the estimators resolves the computation-accuracy tradeoff differently, and taken together they span a continuous spectrum of computation-accuracy tradeoff resolutions. We prove the consistency of the estimators, provide formulas for their asymptotic variance, statistical robustness, and computational complexity. We discuss experimental results in the context of Boltzmann machines and conditional random fields. The theoretical and experimental studies demonstrate the effectiveness of the estimators when the computational resources are insufficient. They also demonstrate that in some cases reduced computational complexity is associated with robustness, thereby increasing statistical accuracy. Keywords: Markov random fields, composite likelihood, maximum likelihood estimation
2 0.15392256 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
Author: Pinar Donmez, Guy Lebanon, Krishnakumar Balasubramanian
Abstract: Estimating the error rates of classifiers or regression models is a fundamental task in machine learning which has thus far been studied exclusively using supervised learning techniques. We propose a novel unsupervised framework for estimating these error rates using only unlabeled data and mild assumptions. We prove consistency results for the framework and demonstrate its practical applicability on both synthetic and real world data. Keywords: classification and regression, maximum likelihood, latent variable models
3 0.073604099 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models
Author: Kuzman Ganchev, João Graça, Jennifer Gillenwater, Ben Taskar
Abstract: We present posterior regularization, a probabilistic framework for structured, weakly supervised learning. Our framework efficiently incorporates indirect supervision via constraints on posterior distributions of probabilistic models with latent variables. Posterior regularization separates model complexity from the complexity of structural constraints it is desired to satisfy. By directly imposing decomposable regularization on the posterior moments of latent variables during learning, we retain the computational efficiency of the unconstrained model while ensuring desired constraints hold in expectation. We present an efficient algorithm for learning with posterior regularization and illustrate its versatility on a diverse set of structural constraints such as bijectivity, symmetry and group sparsity in several large scale experiments, including multi-view learning, cross-lingual dependency grammar induction, unsupervised part-of-speech induction, and bitext word alignment.1 Keywords: posterior regularization framework, unsupervised learning, latent variables models, prior knowledge, natural language processing
4 0.072349146 108 jmlr-2010-Stochastic Complexity and Generalization Error of a Restricted Boltzmann Machine in Bayesian Estimation
Author: Miki Aoyagi
Abstract: In this paper, we consider the asymptotic form of the generalization error for the restricted Boltzmann machine in Bayesian estimation. It has been shown that obtaining the maximum pole of zeta functions is related to the asymptotic form of the generalization error for hierarchical learning models (Watanabe, 2001a,b). The zeta function is defined by using a Kullback function. We use two methods to obtain the maximum pole: a new eigenvalue analysis method and a recursive blowing up process. We show that these methods are effective for obtaining the asymptotic form of the generalization error of hierarchical learning models. Keywords: Boltzmann machine, non-regular learning machine, resolution of singularities, zeta function
5 0.068909384 25 jmlr-2010-Composite Binary Losses
Author: Mark D. Reid, Robert C. Williamson
Abstract: We study losses for binary classification and class probability estimation and extend the understanding of them from margin losses to general composite losses which are the composition of a proper loss with a link function. We characterise when margin losses can be proper composite losses, explicitly show how to determine a symmetric loss in full from half of one of its partial losses, introduce an intrinsic parametrisation of composite binary losses and give a complete characterisation of the relationship between proper losses and “classification calibrated” losses. We also consider the question of the “best” surrogate binary loss. We introduce a precise notion of “best” and show there exist situations where two convex surrogate losses are incommensurable. We provide a complete explicit characterisation of the convexity of composite binary losses in terms of the link function and the weight function associated with the proper loss which make up the composite loss. This characterisation suggests new ways of “surrogate tuning” as well as providing an explicit characterisation of when Bregman divergences on the unit interval are convex in their second argument. Finally, in an appendix we present some new algorithm-independent results on the relationship between properness, convexity and robustness to misclassification noise for binary losses and show that all convex proper losses are non-robust to misclassification noise. Keywords: surrogate loss, convexity, probability estimation, classification, Fisher consistency, classification-calibrated, regret bound, proper scoring rule, Bregman divergence, robustness, misclassification noise
6 0.067257434 115 jmlr-2010-Using Contextual Representations to Efficiently Learn Context-Free Languages
7 0.064461246 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
8 0.061853331 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data
9 0.055469289 41 jmlr-2010-Gaussian Processes for Machine Learning (GPML) Toolbox
10 0.049403388 73 jmlr-2010-Maximum Likelihood in Cost-Sensitive Learning: Model Specification, Approximations, and Upper Bounds
11 0.04806279 53 jmlr-2010-Inducing Tree-Substitution Grammars
12 0.047759227 33 jmlr-2010-Efficient Heuristics for Discriminative Structure Learning of Bayesian Network Classifiers
13 0.047294669 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
14 0.046583001 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
15 0.041774422 79 jmlr-2010-Near-optimal Regret Bounds for Reinforcement Learning
16 0.040581994 45 jmlr-2010-High-dimensional Variable Selection with Sparse Random Projections: Measurement Sparsity and Statistical Efficiency
17 0.040321849 69 jmlr-2010-Lp-Nested Symmetric Distributions
18 0.039791029 38 jmlr-2010-Expectation Truncation and the Benefits of Preselection In Training Generative Models
19 0.038483713 10 jmlr-2010-An Exponential Model for Infinite Rankings
20 0.036021829 29 jmlr-2010-Covariance in Unsupervised Learning of Probabilistic Grammars
topicId topicWeight
[(0, -0.192), (1, -0.011), (2, -0.054), (3, 0.015), (4, -0.05), (5, -0.061), (6, 0.144), (7, 0.062), (8, -0.027), (9, 0.039), (10, -0.063), (11, -0.065), (12, 0.02), (13, -0.223), (14, 0.05), (15, 0.017), (16, 0.054), (17, 0.054), (18, -0.049), (19, 0.166), (20, 0.155), (21, 0.117), (22, 0.049), (23, -0.132), (24, -0.095), (25, 0.022), (26, -0.211), (27, -0.047), (28, -0.032), (29, 0.258), (30, -0.16), (31, -0.121), (32, 0.055), (33, -0.017), (34, 0.091), (35, 0.03), (36, -0.232), (37, -0.111), (38, 0.056), (39, -0.17), (40, 0.079), (41, -0.036), (42, 0.099), (43, 0.134), (44, 0.007), (45, 0.079), (46, 0.04), (47, 0.042), (48, 0.02), (49, 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 0.94121957 109 jmlr-2010-Stochastic Composite Likelihood
Author: Joshua V. Dillon, Guy Lebanon
Abstract: Maximum likelihood estimators are often of limited practical use due to the intensive computation they require. We propose a family of alternative estimators that maximize a stochastic variation of the composite likelihood function. Each of the estimators resolve the computation-accuracy tradeoff differently, and taken together they span a continuous spectrum of computation-accuracy tradeoff resolutions. We prove the consistency of the estimators, provide formulas for their asymptotic variance, statistical robustness, and computational complexity. We discuss experimental results in the context of Boltzmann machines and conditional random fields. The theoretical and experimental studies demonstrate the effectiveness of the estimators when the computational resources are insufficient. They also demonstrate that in some cases reduced computational complexity is associated with robustness thereby increasing statistical accuracy. Keywords: Markov random fields, composite likelihood, maximum likelihood estimation
2 0.60997558 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
Author: Pinar Donmez, Guy Lebanon, Krishnakumar Balasubramanian
Abstract: Estimating the error rates of classifiers or regression models is a fundamental task in machine learning which has thus far been studied exclusively using supervised learning techniques. We propose a novel unsupervised framework for estimating these error rates using only unlabeled data and mild assumptions. We prove consistency results for the framework and demonstrate its practical applicability on both synthetic and real world data. Keywords: classification and regression, maximum likelihood, latent variable models
3 0.34037235 115 jmlr-2010-Using Contextual Representations to Efficiently Learn Context-Free Languages
Author: Alexander Clark, Rémi Eyraud, Amaury Habrard
Abstract: We present a polynomial update time algorithm for the inductive inference of a large class of context-free languages using the paradigm of positive data and a membership oracle. We achieve this result by moving to a novel representation, called Contextual Binary Feature Grammars (CBFGs), which are capable of representing richly structured context-free languages as well as some context sensitive languages. These representations explicitly model the lattice structure of the distribution of a set of substrings and can be inferred using a generalisation of distributional learning. This formalism is an attempt to bridge the gap between simple learnable classes and the sorts of highly expressive representations necessary for linguistic representation: it allows the learnability of a large class of context-free languages, that includes all regular languages and those context-free languages that satisfy two simple constraints. The formalism and the algorithm seem well suited to natural language and in particular to the modeling of first language acquisition. Preliminary experimental results confirm the effectiveness of this approach. Keywords: grammatical inference, context-free language, positive data only, membership queries
4 0.31306565 108 jmlr-2010-Stochastic Complexity and Generalization Error of a Restricted Boltzmann Machine in Bayesian Estimation
Author: Miki Aoyagi
Abstract: In this paper, we consider the asymptotic form of the generalization error for the restricted Boltzmann machine in Bayesian estimation. It has been shown that obtaining the maximum pole of zeta functions is related to the asymptotic form of the generalization error for hierarchical learning models (Watanabe, 2001a,b). The zeta function is defined by using a Kullback function. We use two methods to obtain the maximum pole: a new eigenvalue analysis method and a recursive blowing up process. We show that these methods are effective for obtaining the asymptotic form of the generalization error of hierarchical learning models. Keywords: Boltzmann machine, non-regular learning machine, resolution of singularities, zeta function
5 0.26787075 25 jmlr-2010-Composite Binary Losses
Author: Mark D. Reid, Robert C. Williamson
Abstract: We study losses for binary classification and class probability estimation and extend the understanding of them from margin losses to general composite losses which are the composition of a proper loss with a link function. We characterise when margin losses can be proper composite losses, explicitly show how to determine a symmetric loss in full from half of one of its partial losses, introduce an intrinsic parametrisation of composite binary losses and give a complete characterisation of the relationship between proper losses and “classification calibrated” losses. We also consider the question of the “best” surrogate binary loss. We introduce a precise notion of “best” and show there exist situations where two convex surrogate losses are incommensurable. We provide a complete explicit characterisation of the convexity of composite binary losses in terms of the link function and the weight function associated with the proper loss which make up the composite loss. This characterisation suggests new ways of “surrogate tuning” as well as providing an explicit characterisation of when Bregman divergences on the unit interval are convex in their second argument. Finally, in an appendix we present some new algorithm-independent results on the relationship between properness, convexity and robustness to misclassification noise for binary losses and show that all convex proper losses are non-robust to misclassification noise. Keywords: surrogate loss, convexity, probability estimation, classification, Fisher consistency, classification-calibrated, regret bound, proper scoring rule, Bregman divergence, robustness, misclassification noise
6 0.2589227 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
7 0.25235009 38 jmlr-2010-Expectation Truncation and the Benefits of Preselection In Training Generative Models
9 0.22791736 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models
10 0.22315745 113 jmlr-2010-Tree Decomposition for Large-Scale SVM Problems
11 0.22068433 37 jmlr-2010-Evolving Static Representations for Task Transfer
12 0.21807827 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm
13 0.20824549 93 jmlr-2010-PyBrain
14 0.20643121 41 jmlr-2010-Gaussian Processes for Machine Learning (GPML) Toolbox
15 0.20389459 73 jmlr-2010-Maximum Likelihood in Cost-Sensitive Learning: Model Specification, Approximations, and Upper Bounds
16 0.20340797 54 jmlr-2010-Information Retrieval Perspective to Nonlinear Dimensionality Reduction for Data Visualization
17 0.19889185 33 jmlr-2010-Efficient Heuristics for Discriminative Structure Learning of Bayesian Network Classifiers
18 0.19075201 13 jmlr-2010-Approximate Inference on Planar Graphs using Loop Calculus and Belief Propagation
19 0.18726006 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data
20 0.18671282 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
topicId topicWeight
[(1, 0.012), (3, 0.018), (4, 0.015), (8, 0.029), (15, 0.013), (21, 0.02), (32, 0.072), (35, 0.345), (36, 0.038), (37, 0.072), (75, 0.12), (81, 0.026), (85, 0.063), (88, 0.016), (96, 0.034), (97, 0.012)]
simIndex simValue paperId paperTitle
same-paper 1 0.7029323 109 jmlr-2010-Stochastic Composite Likelihood
Author: Joshua V. Dillon, Guy Lebanon
Abstract: Maximum likelihood estimators are often of limited practical use due to the intensive computation they require. We propose a family of alternative estimators that maximize a stochastic variation of the composite likelihood function. Each of the estimators resolve the computation-accuracy tradeoff differently, and taken together they span a continuous spectrum of computation-accuracy tradeoff resolutions. We prove the consistency of the estimators, provide formulas for their asymptotic variance, statistical robustness, and computational complexity. We discuss experimental results in the context of Boltzmann machines and conditional random fields. The theoretical and experimental studies demonstrate the effectiveness of the estimators when the computational resources are insufficient. They also demonstrate that in some cases reduced computational complexity is associated with robustness thereby increasing statistical accuracy. Keywords: Markov random fields, composite likelihood, maximum likelihood estimation
2 0.44282719 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
Author: Pinar Donmez, Guy Lebanon, Krishnakumar Balasubramanian
Abstract: Estimating the error rates of classifiers or regression models is a fundamental task in machine learning which has thus far been studied exclusively using supervised learning techniques. We propose a novel unsupervised framework for estimating these error rates using only unlabeled data and mild assumptions. We prove consistency results for the framework and demonstrate its practical applicability on both synthetic and real world data. Keywords: classification and regression, maximum likelihood, latent variable models
3 0.43652755 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
Author: Yevgeny Seldin, Naftali Tishby
Abstract: We derive PAC-Bayesian generalization bounds for supervised and unsupervised learning models based on clustering, such as co-clustering, matrix tri-factorization, graphical models, graph clustering, and pairwise clustering.1 We begin with the analysis of co-clustering, which is a widely used approach to the analysis of data matrices. We distinguish among two tasks in matrix data analysis: discriminative prediction of the missing entries in data matrices and estimation of the joint probability distribution of row and column variables in co-occurrence matrices. We derive PAC-Bayesian generalization bounds for the expected out-of-sample performance of co-clustering-based solutions for these two tasks. The analysis yields regularization terms that were absent in the previous formulations of co-clustering. The bounds suggest that the expected performance of co-clustering is governed by a trade-off between its empirical performance and the mutual information preserved by the cluster variables on row and column IDs. We derive an iterative projection algorithm for finding a local optimum of this trade-off for discriminative prediction tasks. This algorithm achieved stateof-the-art performance in the MovieLens collaborative filtering task. Our co-clustering model can also be seen as matrix tri-factorization and the results provide generalization bounds, regularization terms, and new algorithms for this form of matrix factorization. The analysis of co-clustering is extended to tree-shaped graphical models, which can be used to analyze high dimensional tensors. According to the bounds, the generalization abilities of treeshaped graphical models depend on a trade-off between their empirical data fit and the mutual information that is propagated up the tree levels. We also formulate weighted graph clustering as a prediction problem: given a subset of edge weights we analyze the ability of graph clustering to predict the remaining edge weights. The analysis of co-clustering easily
4 0.43595043 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
Author: Ming Yuan
Abstract: This paper considers the problem of estimating a high dimensional inverse covariance matrix that can be well approximated by “sparse” matrices. Taking advantage of the connection between multivariate linear regression and entries of the inverse covariance matrix, we propose an estimating procedure that can effectively exploit such “sparsity”. The proposed method can be computed using linear programming and therefore has the potential to be used in very high dimensional problems. Oracle inequalities are established for the estimation error in terms of several operator norms, showing that the method is adaptive to different types of sparsity of the problem. Keywords: covariance selection, Dantzig selector, Gaussian graphical model, inverse covariance matrix, Lasso, linear programming, oracle inequality, sparsity
5 0.43295991 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
Author: Shiliang Sun, John Shawe-Taylor
Abstract: In this paper, we propose a general framework for sparse semi-supervised learning, which concerns using a small portion of unlabeled data and a few labeled data to represent target functions and thus has the merit of accelerating function evaluations when predicting the output of a new example. This framework makes use of Fenchel-Legendre conjugates to rewrite a convex insensitive loss involving a regularization with unlabeled data, and is applicable to a family of semi-supervised learning methods such as multi-view co-regularized least squares and single-view Laplacian support vector machines (SVMs). As an instantiation of this framework, we propose sparse multi-view SVMs which use a squared ε-insensitive loss. The resultant optimization is an inf-sup problem and the optimal solutions have arguably saddle-point properties. We present a globally optimal iterative algorithm to optimize the problem. We give the margin bound on the generalization error of the sparse multi-view SVMs, and derive the empirical Rademacher complexity for the induced function class. Experiments on artificial and real-world data show their effectiveness. We further give a sequential training approach to show their possibility and potential for uses in large-scale problems and provide encouraging experimental results indicating the efficacy of the margin bound and empirical Rademacher complexity on characterizing the roles of unlabeled data for semi-supervised learning. Keywords: semi-supervised learning, Fenchel-Legendre conjugate, representer theorem, multiview regularization, support vector machine, statistical learning theory
6 0.43015796 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
7 0.42725307 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
8 0.42386401 102 jmlr-2010-Semi-Supervised Novelty Detection
10 0.42250186 104 jmlr-2010-Sparse Spectrum Gaussian Process Regression
11 0.422297 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking
12 0.42033419 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values
13 0.41972172 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
15 0.41849178 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
16 0.41776392 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data
17 0.4174476 63 jmlr-2010-Learning Instance-Specific Predictive Models
18 0.41512877 9 jmlr-2010-An Efficient Explanation of Individual Classifications using Game Theory
19 0.41503066 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
20 0.414386 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models