nips nips2012 nips2012-218 knowledge-graph by maker-knowledge-mining

218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions


Source: pdf

Author: Mathieu Sinn, Bei Chen

Abstract: Conditional Markov Chains (also known in the literature as Linear-Chain Conditional Random Fields) are a versatile class of discriminative models for the distribution of a sequence of hidden states conditional on a sequence of observable variables. Large-sample properties of Conditional Markov Chains were first studied in [1]. The paper extends this work in two directions: first, mixing properties of models with unbounded feature functions are established; second, necessary conditions for model identifiability and for the uniqueness of maximum likelihood estimates are given.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Conditional Markov Chains (also known as Linear-Chain Conditional Random Fields in the literature) are a versatile class of discriminative models for the distribution of a sequence of hidden states conditional on a sequence of observable variables. [sent-6, score-0.29]

2 The paper extends this work in two directions: first, mixing properties of models with unbounded feature functions are established; second, necessary conditions for model identifiability and the uniqueness of maximum likelihood estimates are given. [sent-8, score-0.448]

3 1 Introduction Conditional Random Fields (CRF) are a widely used class of discriminative models for the distribution of a set of hidden states conditional on a set of observable variables. [sent-9, score-0.214]

4 The fundamental assumption is that the hidden states, conditional on the observations, form a Markov random field [2,3]. [sent-10, score-0.19]

5 [1] defines CMCs of infinite length and studies ergodic properties of the joint sequences of observations and hidden states. [sent-15, score-0.257]

6 The analysis relies on fundamental results from the theory of weak ergodicity [6]. [sent-16, score-0.159]

7 The exposition is restricted to CMCs with bounded feature functions which precludes the application, e. [sent-17, score-0.074]

8 [5] considers weak consistency and central limit theorems for models with a more general structure. [sent-20, score-0.158]

9 Ergodicity and mixing of the models is assumed, but no explicit conditions on the feature functions or on the distribution of the observations are given. [sent-21, score-0.285]

10 The present paper studies mixing properties of Conditional Markov Chains with unbounded feature functions. [sent-23, score-0.245]

11 The results are fundamental for analyzing the consistency of Maximum Likelihood estimates and establishing Central Limit Theorems (which are very useful for constructing statistical hypothesis tests, e. [sent-24, score-0.095]

12 3 the ergodicity results from [1] are extended to models with unbounded feature functions. [sent-30, score-0.231]

13 A key result is that, in order to allow for unbounded feature functions, the observations need to follow a distribution such that Hoeffding-type concentration inequalities can be established. [sent-33, score-0.255]

14 Furthermore, the mixing rates depend on the tail behaviour of the distribution. [sent-34, score-0.165]

15 Furthermore, consider a probability space (Ω, F, P) and let X = (Xt )t∈Z , Y = (Yt )t∈Z be sequences of measurable mappings from Ω into X and Y, respectively. [sent-42, score-0.088]

16 Here, X is an infinite sequence of observations ranging in the domain X, and Y is an aligned sequence of hidden states taking values in the finite set Y. [sent-43, score-0.235]

17 Next we define Conditional Markov Chains, which parameterize the conditional distribution of Y given X. [sent-45, score-0.121]

18 Consider a vector f of real-valued functions f : X × Y × Y → R, called the feature functions. [sent-47, score-0.074]

19 Throughout this paper, we assume that the following condition is satisfied: (A1) All feature functions are finite: |f (x, i, j)| < ∞ for all x ∈ X and i, j ∈ Y. [sent-48, score-0.095]

20 Associated with the feature functions is a vector λ of real-valued model-weights. [sent-49, score-0.074]

21 In terms of statistical physics, m(x, i, j) measures the potential of the transition between the hidden states i and j from time t−1 to t, given the observation x at time t. [sent-51, score-0.131]

22 Intuitively, α_s^t(x, i) measures the potential of the hidden state i at time t given the observations x_s, ..., x_t. [sent-65, score-0.137]

23 Here it is assumed that at time s − 1 all hidden states have potential equal to 1. [sent-68, score-0.429]

24 Similarly, β_s^t(x, j) is the potential of j at time s assuming equal potential of all hidden states at time t. [sent-69, score-0.118]
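These recursions can be sketched for a finite toy chain. The feature map, weights, and observations below are invented for illustration and are not taken from the paper; the point is only the shape of the α and β recursions:

```python
import numpy as np

Y_STATES = [0, 1]
lam = np.array([0.5, -0.3])  # model weights (lambda); invented for the example

def f(x, i, j):
    # Illustrative feature vector; any map from X x Y x Y into R^d would do.
    return np.array([x * (i == j), float(i != j)])

def M(x):
    # Matrix of transition potentials m(x, i, j) = exp(lambda^T f(x, i, j)).
    return np.array([[np.exp(lam @ f(x, i, j)) for j in Y_STATES]
                     for i in Y_STATES])

def forward(xs):
    # alpha recursion, starting from potential 1 for every state at time s - 1.
    a = np.ones(len(Y_STATES))
    for x in xs:
        a = a @ M(x)
    return a

def backward(xs):
    # beta recursion, assuming equal potential of all states at the final time.
    b = np.ones(len(Y_STATES))
    for x in reversed(xs):
        b = M(x) @ b
    return b

xs = [0.2, -1.0, 0.7]
alpha, beta = forward(xs), backward(xs)
print(alpha.sum(), beta.sum())  # both equal the total potential over all paths
```

Both sums equal the total potential 1^T M(x_1) · · · M(x_n) 1 accumulated over all state paths, which is the consistency the definitions above rely on.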

25 P(Y_t = y_t, ..., Y_{t+k} = y_{t+k} | X) := lim_{n→∞} α_{t−n}^t(X, y_t) [ ∏_{i=1}^{k} m(X_{t+i}, y_{t+i−1}, y_{t+i}) ] β_{t+k}^{t+k+n}(X, y_{t+k}) / ( α_{t−n}^t(X)^T β_t^{t+k+n}(X) ). [sent-76, score-0.909]

26 Furthermore, the family of all marginal distributions obtained this way satisfies the consistency conditions of Kolmogorov’s Extension Theorem. [sent-78, score-0.104]

27 Hence we obtain a unique distribution for Y conditional on X parameterized by the feature functions f and the model weights λ. [sent-79, score-0.195]

28 We introduce the following notation: For any matrix P = (p_{ij}) with strictly positive entries let φ(P) denote the mixing coefficient φ(P) = min_{i,j,k,l} (p_{ik} p_{jl}) / (p_{jk} p_{il}). [sent-85, score-0.18]
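A direct computation of this coefficient, with the ratio p_ik p_jl / (p_jk p_il) as in the definition, for a made-up strictly positive matrix:

```python
import numpy as np

def phi(P):
    # phi(P) = min over i, j, k, l of (P[i,k] * P[j,l]) / (P[j,k] * P[i,l]);
    # requires strictly positive entries.
    n = P.shape[0]
    return min(
        P[i, k] * P[j, l] / (P[j, k] * P[i, l])
        for i in range(n) for j in range(n)
        for k in range(n) for l in range(n)
    )

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
print(phi(P))                       # 1/36 for this matrix
print(phi(np.full((2, 2), 0.5)))    # 1.0: identical rows mix in one step
```

The coefficient equals 1 exactly when all rows of P are proportional, i.e. when the chain forgets its state in a single step; smaller values indicate slower mixing.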

29 This coefficient will play a key role in the analysis of mixing properties. [sent-87, score-0.12]

30 The following proposition summarizes fundamental properties of the distribution of Y conditional on X, which directly follow from the above definition (also see Corollary 1 in [1]). [sent-88, score-0.196]

31 Then Y conditional on X forms a time-inhomogeneous Markov chain. [sent-91, score-0.121]

32 Moreover, if X is strictly stationary, then the joint distribution of the aligned sequences (X, Y ) is strictly stationary. [sent-92, score-0.191]

33 The conditional transition probabilities P_t(x, i, j) := P(Y_t = j | Y_{t−1} = i, X = x) of Y given X = x have the following form: P_t(x, i, j) = lim_{n→∞} m(x_t, i, j) β_t^n(x, j) / β_{t−1}^n(x, i). [sent-93, score-0.156]
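For a finite chain, the analogous finite-horizon formula is P_t(x, i, j) = m(x_t, i, j) β_t(x, j) / β_{t−1}(x, i). A quick sanity check, with randomly generated toy potentials (the whole setup is invented for the example), that every row of P_t is a probability distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
T, S = 4, 3                                   # chain length, number of states
Ms = rng.uniform(0.5, 2.0, size=(T, S, S))    # Ms[t-1][i][j] = m(x_t, i, j)

# Backward potentials: beta[T] = 1, beta[t-1] = M_t @ beta[t]
beta = [None] * (T + 1)
beta[T] = np.ones(S)
for t in range(T, 0, -1):
    beta[t - 1] = Ms[t - 1] @ beta[t]

def P_t(t):
    # Conditional transition matrix at time t (1-indexed):
    # P_t[i, j] = m(x_t, i, j) * beta_t(j) / beta_{t-1}(i)
    return Ms[t - 1] * beta[t][None, :] / beta[t - 1][:, None]

for t in range(1, T + 1):
    assert np.allclose(P_t(t).sum(axis=1), 1.0)   # each row sums to one
print(P_t(1))
```

The row sums are 1 by construction, since β_{t−1}(i) = Σ_j m(x_t, i, j) β_t(j) is exactly the backward recursion.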

34 3 Ergodicity In this section we establish conditions under which the aligned sequences (X, Y ) are jointly ergodic. [sent-95, score-0.213]

35 Let us first recall the definition of ergodicity of X (see [8]): By X we denote the space of sequences x = (xt )t∈Z in X , and by A the corresponding product σ-field. [sent-96, score-0.172]

36 Then ergodicity of X is formally defined as follows: (A2) X is ergodic, that is, PX (A) = PX (τ −1 A) for every A ∈ A, and PX (A) ∈ {0, 1} for every set A ∈ A satisfying A = τ −1 A. [sent-99, score-0.14]

37 In our later analysis, we will use the theorem to show that the time average of feature functions f (Xt , Yt−1 , Yt ) converges to the expected value E[f (Xt , Yt−1 , Yt )]. [sent-104, score-0.135]

38 Suppose that conditions (A1) and (A2) hold, and g : X × Y → R is a function which satisfies E[|g(Xt , Yt )|] < ∞. [sent-106, score-0.075]

39 Then lim_{n→∞} (1/n) Σ_{t=1}^{n} g(X_t, Y_t) = E[g(X_t, Y_t)] P-almost surely. [sent-107, score-0.145]
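This ergodic-averaging statement can be checked numerically on a toy two-state chain (the transition matrix, seed, and choice of g are invented for the example): the time average along one trajectory approaches the stationary expectation.

```python
import numpy as np

rng = np.random.default_rng(42)
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])        # ergodic transition matrix
pi = np.array([2/3, 1/3])         # its stationary distribution (left eigenvector)

n = 100_000
g = lambda state: float(state)    # g(X_t): indicator of being in state 1
x, total = 0, 0.0
for _ in range(n):
    x = rng.choice(2, p=P[x])
    total += g(x)

time_avg = total / n
exact = pi @ np.array([g(0), g(1)])   # E[g] under the stationary law = 1/3
print(time_avg, exact)
```

With this chain the second eigenvalue of P is 0.7, so correlations decay quickly and the time average is already close to 1/3 after 10^5 steps.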

40 Note that Z_t represents the hidden state at time t together with the entire aligned sequence of observations τ^{t−1} X. [sent-110, score-0.171]

41 The details of the proof that Z is ergodic can be found in an extended version of this paper, which is included in the supplementary material. [sent-115, score-0.107]

42 4 Mixing properties In this section we are going to study mixing properties of the aligned sequences (X, Y ). [sent-116, score-0.286]

43 To establish the results, we will assume that the distribution of the observations X satisfies conditions under which certain concentration inequalities hold true: (A3) Let A ⊂ X be a measurable set, with p := P(X_t ∈ A) and S_n(x) := (1/n) Σ_{t=1}^{n} 1(x_t ∈ A) for x ∈ X. [sent-117, score-0.29]
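In the i.i.d. case, the classical Hoeffding inequality P(|S_n − p| ≥ ε) ≤ 2 exp(−2nε²) is a concentration bound of exactly this type. A small simulation (all parameters invented for the example) comparing the bound with the observed deviation frequency:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, eps, trials = 0.3, 100, 0.1, 2000

# Empirical frequency of |S_n - p| >= eps over many i.i.d. samples of length n
S = rng.random((trials, n)) < p          # 1(x_t in A) with probability p
deviations = np.abs(S.mean(axis=1) - p) >= eps
empirical = deviations.mean()

hoeffding = 2 * np.exp(-2 * n * eps**2)  # = 2 e^{-2}, about 0.27
print(empirical, hoeffding)
```

The empirical frequency sits well below the bound; as the text notes, for dependent observations comparable inequalities require Martingale or mixing conditions on X.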

44 In the dependent case, concentration inequalities of this type can be obtained by imposing Martingale or mixing conditions on X (see [12,13]). [sent-120, score-0.261]

45 Furthermore, we will make the following assumption, which relates the feature functions to the tail behaviour of the distribution of X: (A4) Let h : [0, ∞) → [0, ∞) be a differentiable decreasing function with h(z) = O(z^{−(1+κ)}) for some κ > 0. [sent-121, score-0.139]

46 The following theorem establishes conditions under which the expected conditional covariances of square-integrable functions are summable. [sent-124, score-0.33]

47 The result is obtained by studying ergodic properties of the transition probability matrices. [sent-125, score-0.154]

48 Suppose that conditions (A1) - (A3) hold true, and g : X × Y → R is a function with finite second moment, E[|g(Xt , Yt )|2 ] < ∞. [sent-127, score-0.099]

49 Let γt,k (X) = Cov(g(Xt , Yt ), g(Xt+k , Yt+k ) | X) denote the covariance of g(Xt , Yt ) and g(Xt+k , Yt+k ) conditional on X. [sent-128, score-0.121]

50 According to the assumptions, we have E[|g(Xt )|] = E[|g(Xt+k )|] < ∞, so we only need to bound the expectation of the conditional covariance. [sent-133, score-0.121]

51 Now, using results from the theory of weak ergodicity (see Chapter 3 in [6]), we can establish bounds in terms of 1 − φ(P_{t+1}(x)). [sent-147, score-0.173]

52 The next theorem bounds the difference between the distribution of Y conditional on X and finite approximations of it. [sent-165, score-0.182]

53 Introduce the following notation: For t, k ≥ 0 with t + k ≤ n let [sent-166, score-0.764]

54 P^{(0:n)}(Y_t = y_t, ..., Y_{t+k} = y_{t+k} | X = x) := α_0^t(x, y_t) [ ∏_{i=1}^{k} m(x_{t+i}, y_{t+i−1}, y_{t+i}) ] β_{t+k}^n(x, y_{t+k}) / ( α_0^t(x)^T β_t^n(x) ). [sent-169, score-0.764]

55 Accordingly, write E^{(0:n)} for expectations taken with respect to P^{(0:n)}. [sent-170, score-0.189]

56 As emphasized by the superscripts, these quantities can be regarded as marginal distributions of Y conditional on the observations at times t = 0, 1, . [sent-171, score-0.157]

57 To simplify notation, the following theorem is stated for 1-dimensional conditional marginal distributions; however, the extension to the general case is straightforward. [sent-175, score-0.182]

58 Then the limit P^{(0:∞)}(Y_t = i | X) := lim_{n→∞} P^{(0:n)}(Y_t = i | X) is well-defined P-almost surely. [sent-178, score-0.203]

59 Moreover, there exists a measurable function C(x) of x ∈ X with finite expectation, E[|C(X)|] < ∞, and a function h(z) satisfying the conditions in (A4), such that |P^{(0:∞)}(Y_t = i | X) − P^{(0:n)}(Y_t = i | X)| ≤ C(τ^t X) h(n − t). [sent-179, score-0.131]

60 Define G_n(x) := M(x_1) · · · M(x_n) and write g_n(x, i, j) for the (i, j)-th component of G_n(x). [sent-184, score-0.157]

61 4 in [6], there exist numbers r_ij(x) such that lim_{n→∞} g_n(x, i, k) / g_n(x, j, k) = r_ij(x) for all k ∈ Y. [sent-190, score-0.517]

62 Consequently, the ratio of β_t^n(x, i) to β_t^n(x, j) converges to r_ij(x), and hence lim_{n→∞} α_0^t(x, i) β_t^n(x, i) / ( α_0^t(x)^T β_t^n(x) ) = 1 / ( q_i(x)^T r_i(x) ), where we use the notation q_i(x) = α_0^t(x) / α_0^t(x, i) and r_i(x) denotes the vector with k-th component r_ki(x). [sent-191, score-0.261]

63 To bound the latter expression, introduce the vectors r_i^n(x) and r̄_i^n(x) with k-th components r_{ki}^n(x) = min_{l∈Y} g_n(x, k, l) / g_n(x, i, l) and r̄_{ki}^n(x) = max_{l∈Y} g_n(x, k, l) / g_n(x, i, l). [sent-194, score-0.672]

64 The first step is to show the existence of a function C_1(x) with E[|C_1(X)|] < ∞ such that |r̄_{ki}^n(X) − r_{ki}^n(X)| ≤ C_1(τ^t X) (1 − ζ)^{n−t} for some ζ > 0. [sent-199, score-0.091]

65 Suppose that conditions (A1) - (A4) hold, and the function g : X × Y → R satisfies E[|g(Xt , Yt )|] < ∞. [sent-207, score-0.075]

66 Then lim_{n→∞} (1/n) Σ_{t=1}^{n} E^{(0:n)}[g(X_t, Y_t) | X] = E[g(X_t, Y_t)] P-almost surely. [sent-208, score-0.145]

67 Using the result from Theorem 3, we obtain | Σ_{t=1}^{n} E^{(0:n)}[g(X_t, Y_t) | X] − Σ_{t=1}^{n} E^{(0:∞)}[g(X_t, Y_t) | X] | ≤ Σ_{t=1}^{n} |g(X_t)| |C(τ^t X)| h(n − t), where h(z) is a function satisfying the conditions in assumption (A4). [sent-211, score-0.097]

68 Using the facts that X is ergodic and the expectations of |g(X_t)| and |C(τ^t X)| are finite, we obtain lim_{n→∞} (1/n) [ Σ_{t=1}^{n} E^{(0:n)}[g(X_t, Y_t) | X] − Σ_{t=1}^{n} E^{(0:∞)}[g(X_t, Y_t) | X] ] = 0. [sent-213, score-0.254]

69 Thus, lim_{n→∞} (1/n) Σ_{t=1}^{n} ( E^{(0:∞)}[g(X_t, Y_t) | X] − E[g(X_t, Y_t) | X] ) = 0. [sent-215, score-0.145]

70 Now, noting that E[g(X_t, Y_t) | X] = E[g(X_0, Y_0) | τ^t X] for every t, the theorem follows by the ergodicity of X. [sent-216, score-0.179]

71 5 Maximum Likelihood learning and model identifiability In this section we apply the previous results to analyze asymptotic properties of the empirical likelihood function. [sent-217, score-0.132]

72 , Yn ) of X and Y , where the distribution of Y conditional on X follows a conditional Markov chain with fixed feature functions f and unknown model weights λ∗ . [sent-224, score-0.316]

73 , Pλ and Eλ , to denote the conditional distributions. [sent-228, score-0.121]

74 In order to do so, introduce the shorthand notation f(x^n, y^n) = Σ_{t=1}^{n} f(x_t, y_{t−1}, y_t) for x^n = (x_0, . [sent-230, score-0.81]

75 Consider the normalized conditional likelihood L_n(λ) = (1/n) [ λ^T f(X^n, Y^n) − log Σ_{y^n ∈ Y^{n+1}} exp( λ^T f(X^n, y^n) ) ]. [sent-237, score-0.121]

76 Note that, in the context of finite Conditional Markov Chains, L_n(λ) is the exact likelihood of Y^n conditional on X^n. [sent-238, score-0.173]
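For a tiny chain, the log-partition term log Σ_{y^n} exp(λ^T f(X^n, y^n)) can be verified against the forward recursion by brute-force enumeration over all |Y|^{n+1} state sequences (the features and weights below are invented for the example):

```python
import itertools
import numpy as np

lam = np.array([0.4, -0.7])               # toy model weights
xs = [0.3, -0.5, 1.2]                     # observations x_1 .. x_n
states = [0, 1]

def f(x, i, j):
    # Illustrative feature vector for a transition i -> j under observation x.
    return np.array([x * (i == j), float(i != j)])

def score(ys):
    # lambda^T f(x^n, y^n), summed over the transitions y_0 -> ... -> y_n
    return sum(lam @ f(x, i, j) for x, i, j in zip(xs, ys[:-1], ys[1:]))

# Brute force over all |Y|^(n+1) state sequences
logZ_brute = np.log(sum(np.exp(score(ys))
                        for ys in itertools.product(states, repeat=len(xs) + 1)))

# Forward algorithm: alpha_0 = 1, alpha_t = alpha_{t-1} @ M(x_t)
alpha = np.ones(len(states))
for x in xs:
    M = np.array([[np.exp(lam @ f(x, i, j)) for j in states] for i in states])
    alpha = alpha @ M
logZ_forward = np.log(alpha.sum())

print(logZ_brute, logZ_forward)           # agree up to floating point
```

The forward recursion computes the same sum over state paths in O(n |Y|²) time instead of O(|Y|^{n+1}), which is what makes the exact likelihood tractable for finite chains.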

77 It is easy to see that L_n(λ) is strictly concave if and only if |Y| > 1 and there exists a sequence y^n such that at least one component of f(X^n, y^n) is non-zero. [sent-241, score-0.106]

78 1 Asymptotic properties of the likelihood function In addition to the conditions (A1)-(A4) stated earlier, we will make the following assumptions: (A5) The feature functions have finite second moments: Eλ∗ [|f (Xt , Yt−1 , Yt )|2 ] < ∞. [sent-245, score-0.233]

79 The next theorem establishes asymptotic properties of the likelihood function Ln (λ). [sent-247, score-0.236]

80 (iii) If the limit of the Hessian ∇²L_n(λ) is finite and non-singular, then the function L(λ) is strictly concave on Θ. [sent-253, score-0.126]

81 As a consequence, the Maximum Likelihood estimates are strongly consistent: lim_{n→∞} λ̂_n = λ* P_{λ*}-almost surely. [sent-254, score-0.168]

82 The statements are obtained analogously to Lemma 4-6 and Theorem 4 in [1], using the auxiliary results for Conditional Markov Chains with unbounded feature functions established in Theorem 1, Theorem 2, and Theorem 4. [sent-256, score-0.168]

83 Next, we study the asymptotic behaviour of the Hessian ∇²L_n(λ). [sent-257, score-0.085]

84 The following theorem establishes an expression for the limit of ∇²L_n(λ), correcting the expression given in Lemma 7 of [1], which is incorrect. [sent-272, score-0.162]

85 Then lim_{n→∞} ∇²L_n(λ) = − ( γ_λ(0) + 2 Σ_{k=1}^{∞} γ_λ(k) ) P_{λ*}-almost surely, where γ_λ(k) = E[Cov_λ( f(X_t, Y_{t−1}, Y_t), f(X_{t+k}, Y_{t+k−1}, Y_{t+k}) | X )] is the expectation of the conditional covariance (with respect to P_λ) between f(X_t, Y_{t−1}, Y_t) and f(X_{t+k}, Y_{t+k−1}, Y_{t+k}) given X. [sent-276, score-0.298]

86 The key step is to show the existence of a positive measurable function U_λ(k, x) such that lim_{n→∞} Σ_{k=1}^{n−1} Σ_{t=1}^{n−k} ∂²L_n(λ) / (∂λ_t ∂λ_{t+k}^T) ≤ lim_{n→∞} Σ_{k=1}^{n−1} E[U_λ(k, X)], with the limit on the right-hand side being finite. [sent-279, score-0.402]

87 Then the rest of the proof is straightforward: Theorem 4 shows that, for fixed k ≥ 0, lim_{n→∞} Σ_{t=1}^{n−k} ∂²L_n(λ) / (∂λ_t ∂λ_{t+k}^T) = γ_λ(k) P_{λ*}-almost surely. [sent-280, score-0.165]

88 Hence, in order to establish the theorem, it suffices to show that | Σ_{k=1}^{n−1} γ_λ(k) − lim_{n→∞} Σ_{k=1}^{n−1} Σ_{t=1}^{n−k} ∂²L_n(λ) / (∂λ_t ∂λ_{t+k}^T) | ≤ ε for all ε > 0. [sent-281, score-0.18]

89 Hence Σ_{k=N}^{∞} |γ_λ(k)| + lim_{n→∞} Σ_{k=N}^{n−1} E[U_λ(k, X)] ≤ ε. [sent-284, score-0.29]

90 On the other hand, Theorem 4 shows that Σ_{k=1}^{N−1} γ_λ(k) − lim_{n→∞} Σ_{k=1}^{N−1} Σ_{t=1}^{n−k} ∂²L_n(λ) / (∂λ_t ∂λ_{t+k}^T) = 0. For details on how to construct U_λ(k, x), see the extended version of this paper. [sent-286, score-0.145]

91 2 Model identifiability Let us conclude the analysis by investigating conditions under which the limit of the Hessian ∇²L_n(λ) is non-singular. [sent-288, score-0.133]

92 Note that ∇²L_n(λ) is negative definite for every n, so also the limit is negative definite, but not necessarily strictly negative definite. [sent-289, score-0.102]

93 Then the following conditions are necessary for the limit of ∇²L_n(λ) to be non-singular: (i) For each feature function f(x, i, j), there exists a set A ⊂ X with P(X_t ∈ A) > 0 such that, for every x ∈ A, we can find i, j, k, l ∈ Y with f(x, i, j) ≠ f(x, k, l). [sent-292, score-0.177]

94 Clearly, features violating condition (i) can be assigned arbitrary model weights without any effect on the conditional distributions. [sent-296, score-0.161]

95 6 Conclusions We have established ergodicity and various mixing properties of Conditional Markov Chains with unbounded feature functions. [sent-301, score-0.388]

96 In particular, the proof of Theorem 2 has shown that the sequence of observations needs to satisfy conditions under which Hoeffding-type concentration inequalities can be obtained. [sent-303, score-0.255]

97 The proof of Theorem 3 has revealed an interesting interplay between mixing rates, feature functions, and the tail behaviour of the distribution of observations. [sent-304, score-0.229]

98 By applying the mixing properties to the empirical likelihood functions we have obtained necessary conditions for the Maximum Likelihood estimates to be strongly consistent. [sent-305, score-0.312]

99 (2006) An introduction to conditional random fields for relational learning. [sent-320, score-0.15]

100 (2000) Concentration of measure inequalities for Markov chains and Φ-mixing processes. [sent-374, score-0.192]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('yt', 0.764), ('xt', 0.333), ('ln', 0.198), ('chains', 0.147), ('lim', 0.145), ('gn', 0.135), ('conditional', 0.121), ('ergodicity', 0.118), ('cov', 0.109), ('mixing', 0.1), ('ergodic', 0.087), ('px', 0.085), ('cmcs', 0.084), ('conditions', 0.075), ('unbounded', 0.069), ('markov', 0.065), ('theorem', 0.061), ('limit', 0.058), ('sequences', 0.054), ('zt', 0.052), ('likelihood', 0.052), ('rij', 0.051), ('aligned', 0.049), ('asymptotic', 0.048), ('hidden', 0.048), ('pt', 0.047), ('inequalities', 0.045), ('strictly', 0.044), ('feature', 0.044), ('establishes', 0.043), ('concentration', 0.041), ('hessian', 0.04), ('nite', 0.039), ('suppose', 0.038), ('limn', 0.038), ('sequence', 0.038), ('behaviour', 0.037), ('sinn', 0.037), ('ki', 0.036), ('observations', 0.036), ('transition', 0.035), ('establish', 0.035), ('measurable', 0.034), ('properties', 0.032), ('surely', 0.032), ('xs', 0.031), ('identi', 0.031), ('functions', 0.03), ('consistency', 0.029), ('central', 0.029), ('relational', 0.029), ('tail', 0.028), ('states', 0.026), ('martingale', 0.025), ('established', 0.025), ('hold', 0.024), ('concave', 0.024), ('xn', 0.024), ('estimates', 0.023), ('uniqueness', 0.023), ('expectations', 0.022), ('theorems', 0.022), ('write', 0.022), ('satisfying', 0.022), ('mccallum', 0.022), ('proposition', 0.022), ('notation', 0.022), ('establishing', 0.022), ('kth', 0.022), ('potential', 0.022), ('completes', 0.021), ('ability', 0.021), ('condition', 0.021), ('hence', 0.021), ('pereira', 0.021), ('fundamental', 0.021), ('ii', 0.021), ('satis', 0.02), ('fields', 0.02), ('key', 0.02), ('says', 0.02), ('weak', 0.02), ('proof', 0.02), ('furthermore', 0.019), ('going', 0.019), ('observable', 0.019), ('features', 0.019), ('corollary', 0.019), ('rn', 0.019), ('mathsinn', 0.018), ('mulhuddart', 0.018), ('neville', 0.018), ('pjk', 0.018), ('revised', 0.018), ('pjl', 0.018), ('basel', 0.018), ('bei', 0.018), ('pik', 0.018), ('pil', 0.018), ('sinai', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000007 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions

Author: Mathieu Sinn, Bei Chen

Abstract: Conditional Markov Chains (also known in the literature as Linear-Chain Conditional Random Fields) are a versatile class of discriminative models for the distribution of a sequence of hidden states conditional on a sequence of observable variables. Large-sample properties of Conditional Markov Chains were first studied in [1]. The paper extends this work in two directions: first, mixing properties of models with unbounded feature functions are established; second, necessary conditions for model identifiability and for the uniqueness of maximum likelihood estimates are given.

2 0.63836223 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

Author: Claudio Gentile, Francesco Orabona

Abstract: We present a novel multilabel/ranking algorithm working in partial information settings. The algorithm is based on 2nd-order descent methods, and relies on upper-confidence bounds to trade-off exploration and exploitation. We analyze this algorithm in a partial adversarial setting, where covariates can be adversarial, but multilabel probabilities are ruled by (generalized) linear models. We show O(T 1/2 log T ) regret bounds, which improve in several ways on the existing results. We test the effectiveness of our upper-confidence scheme by contrasting against full-information baselines on real-world multilabel datasets, often obtaining comparable performance. 1

3 0.42752248 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function

Author: Pedro Ortega, Jordi Grau-moya, Tim Genewein, David Balduzzi, Daniel Braun

Abstract: We propose a novel Bayesian approach to solve stochastic optimization problems that involve finding extrema of noisy, nonlinear functions. Previous work has focused on representing possible functions explicitly, which leads to a two-step procedure of first, doing inference over the function space and second, finding the extrema of these functions. Here we skip the representation step and directly model the distribution over extrema. To this end, we devise a non-parametric conjugate prior based on a kernel regressor. The resulting posterior distribution directly captures the uncertainty over the maximum of the unknown function. Given t observations of the function, the posterior can be evaluated efficiently in time O(t2 ) up to a multiplicative constant. Finally, we show how to apply our model to optimize a noisy, non-convex, high-dimensional objective function.

4 0.38947511 314 nips-2012-Slice Normalized Dynamic Markov Logic Networks

Author: Tivadar Papai, Henry Kautz, Daniel Stefankovic

Abstract: Markov logic is a widely used tool in statistical relational learning, which uses a weighted first-order logic knowledge base to specify a Markov random field (MRF) or a conditional random field (CRF). In many applications, a Markov logic network (MLN) is trained in one domain, but used in a different one. This paper focuses on dynamic Markov logic networks, where the size of the discretized time-domain typically varies between training and testing. It has been previously pointed out that the marginal probabilities of truth assignments to ground atoms can change if one extends or reduces the domains of predicates in an MLN. We show that in addition to this problem, the standard way of unrolling a Markov logic theory into a MRF may result in time-inhomogeneity of the underlying Markov chain. Furthermore, even if these representational problems are not significant for a given domain, we show that the more practical problem of generating samples in a sequential conditional random field for the next slice relying on the samples from the previous slice has high computational cost in the general case, due to the need to estimate a normalization factor for each sample. We propose a new discriminative model, slice normalized dynamic Markov logic networks (SN-DMLN), that suffers from none of these issues. It supports efficient online inference, and can directly model influences between variables within a time slice that do not have a causal direction, in contrast with fully directed models (e.g., DBNs). Experimental results show an improvement in accuracy over previous approaches to online inference in dynamic Markov logic networks. 1

5 0.31300741 324 nips-2012-Stochastic Gradient Descent with Only One Projection

Author: Mehrdad Mahdavi, Tianbao Yang, Rong Jin, Shenghuo Zhu, Jinfeng Yi

Abstract: Although many variants of stochastic gradient descent have been proposed for large-scale convex optimization, most of them require projecting the solution at each iteration to ensure that the obtained solution stays within the feasible domain. For complex domains (e.g., positive semidefinite cone), the projection step can be computationally expensive, making stochastic gradient descent unattractive for large-scale optimization problems. We address this limitation by developing novel stochastic optimization algorithms that do not need intermediate projections. Instead, only one projection at the last iteration is needed to obtain a feasible solution in the given domain. Our theoretical analysis shows that with a high probability, √ the proposed algorithms achieve an O(1/ T ) convergence rate for general convex optimization, and an O(ln T /T ) rate for strongly convex optimization under mild conditions about the domain and the objective function. 1

6 0.30606306 138 nips-2012-Fully Bayesian inference for neural models with negative-binomial spiking

7 0.29656285 293 nips-2012-Relax and Randomize : From Value to Algorithms

8 0.26641354 292 nips-2012-Regularized Off-Policy TD-Learning

9 0.24753529 80 nips-2012-Confusion-Based Online Learning and a Passive-Aggressive Scheme

10 0.22061239 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

11 0.21379836 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems

12 0.20431279 283 nips-2012-Putting Bayes to sleep

13 0.16157562 303 nips-2012-Searching for objects driven by context

14 0.13030192 72 nips-2012-Cocktail Party Processing via Structured Prediction

15 0.11693862 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs

16 0.11299695 321 nips-2012-Spectral learning of linear dynamics from generalised-linear observations with application to neural population data

17 0.10984046 66 nips-2012-Causal discovery with scale-mixture model for spatiotemporal variance dependencies

18 0.10851955 195 nips-2012-Learning visual motion in recurrent neural networks

19 0.10678002 41 nips-2012-Ancestor Sampling for Particle Gibbs

20 0.10384415 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.245), (1, -0.033), (2, 0.301), (3, 0.581), (4, 0.17), (5, -0.269), (6, -0.015), (7, -0.093), (8, 0.016), (9, 0.033), (10, 0.053), (11, -0.058), (12, 0.067), (13, 0.058), (14, 0.062), (15, -0.038), (16, 0.118), (17, 0.011), (18, 0.109), (19, -0.011), (20, -0.016), (21, -0.017), (22, 0.058), (23, 0.037), (24, 0.054), (25, 0.028), (26, 0.061), (27, -0.038), (28, -0.044), (29, 0.05), (30, 0.078), (31, 0.109), (32, -0.045), (33, -0.011), (34, -0.006), (35, -0.012), (36, -0.012), (37, 0.017), (38, -0.044), (39, -0.017), (40, 0.032), (41, 0.038), (42, -0.002), (43, -0.004), (44, 0.008), (45, -0.011), (46, 0.032), (47, 0.002), (48, 0.045), (49, 0.004)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98377758 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions

Author: Mathieu Sinn, Bei Chen

Abstract: Conditional Markov Chains (also known in the literature as Linear-Chain Conditional Random Fields) are a versatile class of discriminative models for the distribution of a sequence of hidden states conditional on a sequence of observable variables. Large-sample properties of Conditional Markov Chains were first studied in [1]. The paper extends this work in two directions: first, mixing properties of models with unbounded feature functions are established; second, necessary conditions for model identifiability and for the uniqueness of maximum likelihood estimates are given.

2 0.92764872 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

Author: Claudio Gentile, Francesco Orabona

Abstract: We present a novel multilabel/ranking algorithm working in partial information settings. The algorithm is based on 2nd-order descent methods, and relies on upper-confidence bounds to trade-off exploration and exploitation. We analyze this algorithm in a partial adversarial setting, where covariates can be adversarial, but multilabel probabilities are ruled by (generalized) linear models. We show O(T 1/2 log T ) regret bounds, which improve in several ways on the existing results. We test the effectiveness of our upper-confidence scheme by contrasting against full-information baselines on real-world multilabel datasets, often obtaining comparable performance. 1

3 0.81723475 80 nips-2012-Confusion-Based Online Learning and a Passive-Aggressive Scheme

Author: Liva Ralaivola

Abstract: This paper provides the first —to the best of our knowledge— analysis of online learning algorithms for multiclass problems when the confusion matrix is taken as a performance measure. The work builds upon recent and elegant results on noncommutative concentration inequalities, i.e. concentration inequalities that apply to matrices, and, more precisely, to matrix martingales. We do establish generalization bounds for online learning algorithms and show how the theoretical study motivates the proposition of a new confusion-friendly learning procedure. This learning algorithm, called COPA (for COnfusion Passive-Aggressive) is a passive-aggressive learning algorithm; it is shown that the update equations for COPA can be computed analytically and, henceforth, there is no need to recourse to any optimization package to implement it. 1

4 0.79483521 314 nips-2012-Slice Normalized Dynamic Markov Logic Networks

Author: Tivadar Papai, Henry Kautz, Daniel Stefankovic

Abstract: Markov logic is a widely used tool in statistical relational learning, which uses a weighted first-order logic knowledge base to specify a Markov random field (MRF) or a conditional random field (CRF). In many applications, a Markov logic network (MLN) is trained in one domain, but used in a different one. This paper focuses on dynamic Markov logic networks, where the size of the discretized time-domain typically varies between training and testing. It has been previously pointed out that the marginal probabilities of truth assignments to ground atoms can change if one extends or reduces the domains of predicates in an MLN. We show that in addition to this problem, the standard way of unrolling a Markov logic theory into a MRF may result in time-inhomogeneity of the underlying Markov chain. Furthermore, even if these representational problems are not significant for a given domain, we show that the more practical problem of generating samples in a sequential conditional random field for the next slice relying on the samples from the previous slice has high computational cost in the general case, due to the need to estimate a normalization factor for each sample. We propose a new discriminative model, slice normalized dynamic Markov logic networks (SN-DMLN), that suffers from none of these issues. It supports efficient online inference, and can directly model influences between variables within a time slice that do not have a causal direction, in contrast with fully directed models (e.g., DBNs). Experimental results show an improvement in accuracy over previous approaches to online inference in dynamic Markov logic networks. 1

5 0.7493183 293 nips-2012-Relax and Randomize : From Value to Algorithms

Author: Sasha Rakhlin, Ohad Shamir, Karthik Sridharan

Abstract: We show a principled way of deriving online learning algorithms from a minimax analysis. Various upper bounds on the minimax value, previously thought to be non-constructive, are shown to yield algorithms. This allows us to seamlessly recover known methods and to derive new ones, also capturing such “unorthodox” methods as Follow the Perturbed Leader and the R2 forecaster. Understanding the inherent complexity of the learning problem thus leads to the development of algorithms. To illustrate our approach, we present several new algorithms, including a family of randomized methods that use the idea of a “random playout”. New versions of the Follow-the-Perturbed-Leader algorithms are presented, as well as methods based on Littlestone’s dimension, efficient methods for matrix completion with trace norm, and algorithms for the problems of transductive learning and prediction with static experts.
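Follow the Perturbed Leader, one of the “unorthodox” methods the abstract mentions recovering, fits in a few lines: play the expert whose cumulative loss, shifted by an independent random perturbation, is smallest. The exponential perturbation and the toy loss sequence below are illustrative choices, not the paper's construction.

```python
import random

def ftpl_choice(cum_losses, eta=1.0, rng=random):
    """Follow the Perturbed Leader: subtract an independent random
    perturbation from each expert's cumulative loss and follow the
    minimizer.  Exponential perturbations are one standard choice."""
    perturbed = [L - eta * rng.expovariate(1.0) for L in cum_losses]
    return min(range(len(cum_losses)), key=lambda i: perturbed[i])

# Toy run: three static experts with fixed per-round losses.
rng = random.Random(0)
cum = [0.0, 0.0, 0.0]
for t in range(100):
    i = ftpl_choice(cum, eta=1.0, rng=rng)
    losses = [0.9, 0.1, 0.5]        # expert 1 is best in hindsight
    cum = [c + l for c, l in zip(cum, losses)]
```

With `eta=0` the rule degenerates to plain Follow the Leader (deterministic argmin of cumulative loss).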

6 0.72967809 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function

7 0.69290894 283 nips-2012-Putting Bayes to sleep

8 0.66782457 292 nips-2012-Regularized Off-Policy TD-Learning

9 0.64196032 138 nips-2012-Fully Bayesian inference for neural models with negative-binomial spiking

10 0.61640841 66 nips-2012-Causal discovery with scale-mixture model for spatiotemporal variance dependencies

11 0.53450972 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

12 0.5166834 72 nips-2012-Cocktail Party Processing via Structured Prediction

13 0.50634718 11 nips-2012-A Marginalized Particle Gaussian Process Regression

14 0.49169305 324 nips-2012-Stochastic Gradient Descent with Only One Projection

15 0.49154601 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems

16 0.45098391 321 nips-2012-Spectral learning of linear dynamics from generalised-linear observations with application to neural population data

17 0.4023737 303 nips-2012-Searching for objects driven by context

18 0.37957013 23 nips-2012-A lattice filter model of the visual pathway

19 0.37850052 263 nips-2012-Optimal Regularized Dual Averaging Methods for Stochastic Optimization

20 0.37720835 222 nips-2012-Multi-Task Averaging


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.025), (10, 0.094), (21, 0.044), (36, 0.025), (38, 0.115), (39, 0.03), (42, 0.024), (54, 0.028), (55, 0.014), (74, 0.037), (76, 0.138), (80, 0.258), (92, 0.068)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.94302982 100 nips-2012-Discriminative Learning of Sum-Product Networks

Author: Robert Gens, Pedro Domingos

Abstract: Sum-product networks are a new deep architecture that can perform fast, exact inference on high-treewidth models. Only generative methods for training SPNs have been proposed to date. In this paper, we present the first discriminative training algorithms for SPNs, combining the high accuracy of the former with the representational power and tractability of the latter. We show that the class of tractable discriminative SPNs is broader than the class of tractable generative ones, and propose an efficient backpropagation-style algorithm for computing the gradient of the conditional log likelihood. Standard gradient descent suffers from the diffusion problem, but networks with many layers can be learned reliably using “hard” gradient descent, where marginal inference is replaced by MPE inference (i.e., inferring the most probable state of the non-evidence variables). The resulting updates have a simple and intuitive form. We test discriminative SPNs on standard image classification tasks. We obtain the best results to date on the CIFAR-10 dataset, using fewer features than prior methods with an SPN architecture that learns local image structure discriminatively. We also report the highest published test accuracy on STL-10 even though we only use the labeled portion of the dataset.

2 0.93430823 314 nips-2012-Slice Normalized Dynamic Markov Logic Networks

Author: Tivadar Papai, Henry Kautz, Daniel Stefankovic

Abstract: Markov logic is a widely used tool in statistical relational learning, which uses a weighted first-order logic knowledge base to specify a Markov random field (MRF) or a conditional random field (CRF). In many applications, a Markov logic network (MLN) is trained in one domain, but used in a different one. This paper focuses on dynamic Markov logic networks, where the size of the discretized time-domain typically varies between training and testing. It has been previously pointed out that the marginal probabilities of truth assignments to ground atoms can change if one extends or reduces the domains of predicates in an MLN. We show that in addition to this problem, the standard way of unrolling a Markov logic theory into an MRF may result in time-inhomogeneity of the underlying Markov chain. Furthermore, even if these representational problems are not significant for a given domain, we show that the more practical problem of generating samples in a sequential conditional random field for the next slice, relying on the samples from the previous slice, has high computational cost in the general case, due to the need to estimate a normalization factor for each sample. We propose a new discriminative model, slice normalized dynamic Markov logic networks (SN-DMLN), that suffers from none of these issues. It supports efficient online inference, and can directly model influences between variables within a time slice that do not have a causal direction, in contrast with fully directed models (e.g., DBNs). Experimental results show an improvement in accuracy over previous approaches to online inference in dynamic Markov logic networks.

3 0.9337799 138 nips-2012-Fully Bayesian inference for neural models with negative-binomial spiking

Author: James Scott, Jonathan W. Pillow

Abstract: Characterizing the information carried by neural populations in the brain requires accurate statistical models of neural spike responses. The negative-binomial distribution provides a convenient model for over-dispersed spike counts, that is, responses with greater-than-Poisson variability. Here we describe a powerful data-augmentation framework for fully Bayesian inference in neural models with negative-binomial spiking. Our approach relies on a recently described latent-variable representation of the negative-binomial distribution, which equates it to a Polya-gamma mixture of normals. This framework provides a tractable, conditionally Gaussian representation of the posterior that can be used to design efficient EM and Gibbs sampling based algorithms for inference in regression and dynamic factor models. We apply the model to neural data from primate retina and show that it substantially outperforms Poisson regression on held-out data, and reveals latent structure underlying spike count correlations in simultaneously recorded spike trains.
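The over-dispersion the abstract exploits ("greater-than-Poisson variability") follows directly from the negative-binomial moments, and the distribution itself arises as a Gamma-Poisson mixture. The sketch below illustrates both facts; note that the paper's own data augmentation uses the Polya-gamma representation, which is a different (and not shown) construction.

```python
import math
import random

def nb_mean_var(r, p):
    """Moments of a negative-binomial count (r failures, success prob p):
    the variance r(1-p)/p^2 always exceeds the mean r(1-p)/p,
    i.e. the distribution is over-dispersed relative to Poisson."""
    mean = r * (1 - p) / p
    var = r * (1 - p) / p ** 2
    return mean, var

def nb_sample(r, p, rng):
    """Draw a negative-binomial count via the Gamma-Poisson mixture:
    a Poisson whose rate is itself Gamma(shape=r, scale=(1-p)/p)."""
    rate = rng.gammavariate(r, (1 - p) / p)
    # Knuth's Poisson sampler; adequate for moderate rates.
    L, k, q = math.exp(-rate), 0, 1.0
    while True:
        q *= rng.random()
        if q <= L:
            return k
        k += 1
```

For r = 4, p = 0.5 the mean is 4 but the variance is 8, so empirical draws from the mixture show clearly super-Poisson spread.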

same-paper 4 0.93138152 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions

Author: Mathieu Sinn, Bei Chen

Abstract: Conditional Markov Chains (also known as Linear-Chain Conditional Random Fields in the literature) are a versatile class of discriminative models for the distribution of a sequence of hidden states conditional on a sequence of observable variables. Large-sample properties of Conditional Markov Chains were first studied in [1]. This paper extends that work in two directions: first, mixing properties of models with unbounded feature functions are established; second, necessary conditions for model identifiability and the uniqueness of maximum likelihood estimates are given.
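For intuition about linear-chain models like the Conditional Markov Chains studied here, the conditional normalizer is computed by the standard forward recursion. This is a generic sketch over a toy potential table, not the paper's unbounded-feature setting; in a CRF the potentials would depend on the observed sequence.

```python
import math

def log_partition(log_potentials):
    """Forward recursion for a linear chain: log_potentials[t][i][j] is
    the log-score of moving from state i at time t to state j at t+1.
    Returns log Z, the log normalizer summed over all state sequences
    (with uniform unit scores on the initial states)."""
    n_states = len(log_potentials[0])
    alpha = [0.0] * n_states                 # log-score 0 for each start state
    for pot in log_potentials:
        alpha = [
            math.log(sum(math.exp(alpha[i] + pot[i][j])
                         for i in range(n_states)))
            for j in range(n_states)
        ]
    return math.log(sum(math.exp(a) for a in alpha))
```

As a sanity check, with all-zero potentials every state sequence has score 1, so log Z must equal the log of the number of sequences.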

5 0.92927182 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing

Author: Ashish Kapoor, Raajay Viswanathan, Prateek Jain

Abstract: In this paper, we present a Bayesian framework for multilabel classification using compressed sensing. The key idea in compressed sensing for multilabel classification is to first project the label vector to a lower dimensional space using a random transformation and then learn regression functions over these projections. Our approach considers both of these components in a single probabilistic model, thereby jointly optimizing over compression as well as learning tasks. We then derive an efficient variational inference scheme that provides joint posterior distribution over all the unobserved labels. The two key benefits of the model are that a) it can naturally handle datasets that have missing labels and b) it can also measure uncertainty in prediction. The uncertainty estimate provided by the model allows for active learning paradigms where an oracle provides information about labels that promise to be maximally informative for the prediction task. Our experiments show significant boost over prior methods in terms of prediction performance over benchmark datasets, both in the fully labeled and the missing labels case. Finally, we also highlight various useful active learning scenarios that are enabled by the probabilistic model.
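The projection step described in the abstract can be sketched with a random sign matrix. The decoder below is a deliberately naive correlation rule, standing in only for the sparse reconstruction or Bayesian posterior inference the actual method uses; all names and sizes are illustrative.

```python
import random

def compress_labels(y, m, rng):
    """Project a {0,1} label vector y (length d) down to m << d
    measurements with a random +/-1 matrix A, as in the
    compressed-sensing view of multilabel learning.  (The paper's
    model treats the projection and the regression jointly and
    probabilistically; this sketch shows only the projection.)"""
    d = len(y)
    A = [[rng.choice((-1.0, 1.0)) for _ in range(d)] for _ in range(m)]
    z = [sum(A[i][j] * y[j] for j in range(d)) for i in range(m)]
    return A, z

def correlation_decode(A, z, k):
    """Toy decoder: score each label by its correlation with the
    measurements and keep the top k.  Real recovery would use sparse
    reconstruction or, in the abstract's model, variational inference."""
    d = len(A[0])
    scores = [sum(A[i][j] * z[i] for i in range(len(z))) for j in range(d)]
    return sorted(range(d), key=lambda j: -scores[j])[:k]
```

With d = 50 labels compressed to m = 25 measurements, regression targets shrink by half while a sparse label set typically remains recoverable.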

6 0.92572874 204 nips-2012-MAP Inference in Chains using Column Generation

7 0.92475468 67 nips-2012-Classification Calibration Dimension for General Multiclass Losses

8 0.91173273 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems

9 0.90199023 11 nips-2012-A Marginalized Particle Gaussian Process Regression

10 0.89463508 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

11 0.89445543 251 nips-2012-On Lifting the Gibbs Sampling Algorithm

12 0.88630396 200 nips-2012-Local Supervised Learning through Space Partitioning

13 0.88326156 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders

14 0.88153237 171 nips-2012-Latent Coincidence Analysis: A Hidden Variable Model for Distance Metric Learning

15 0.88122648 197 nips-2012-Learning with Recursive Perceptual Representations

16 0.87216669 293 nips-2012-Relax and Randomize : From Value to Algorithms

17 0.87049258 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization

18 0.86963087 279 nips-2012-Projection Retrieval for Classification

19 0.86765355 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

20 0.86390448 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines