nips nips2008 nips2008-40 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Joris M. Mooij, Hilbert J. Kappen
Abstract: We propose a novel bound on single-variable marginal probability distributions in factor graphs with discrete variables. The bound is obtained by propagating local bounds (convex sets of probability distributions) over a subtree of the factor graph, rooted in the variable of interest. By construction, the method not only bounds the exact marginal probability distribution of a variable, but also its approximate Belief Propagation marginal (“belief”). Thus, apart from providing a practical means to calculate bounds on marginals, our contribution also lies in providing a better understanding of the error made by Belief Propagation. We show that our bound outperforms the state-of-the-art on some inference problems arising in medical diagnosis. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We propose a novel bound on single-variable marginal probability distributions in factor graphs with discrete variables. [sent-7, score-0.506]
2 The bound is obtained by propagating local bounds (convex sets of probability distributions) over a subtree of the factor graph, rooted in the variable of interest. [sent-8, score-1.122]
3 By construction, the method not only bounds the exact marginal probability distribution of a variable, but also its approximate Belief Propagation marginal (“belief”). [sent-9, score-0.457]
4 Thus, apart from providing a practical means to calculate bounds on marginals, our contribution also lies in providing a better understanding of the error made by Belief Propagation. [sent-10, score-0.282]
5 We show that our bound outperforms the state-of-the-art on some inference problems arising in medical diagnosis. [sent-11, score-0.199]
6 Thus it is desirable to calculate, in addition to the approximate results, tight bounds on the approximation error. [sent-18, score-0.23]
7 There exist various methods to bound the BP error [3, 4, 5, 6], which can be used, in conjunction with the results of BP, to calculate bounds on the exact marginals. [sent-19, score-0.51]
8 Upper bounds on the partition sum, e.g., [7, 8], can be combined with lower bounds on the partition sum, such as the well-known mean field bound or higher-order lower bounds [9], to obtain bounds on marginals. [sent-22, score-0.857]
9 Finally, a method called Bound Propagation [10] directly calculates bounds on marginals. [sent-23, score-0.23]
10 However, most of these bounds (with the exception of [3, 10]) have only been formulated for the special case of pairwise interactions, which limits their applicability, excluding for example the interesting class of Bayesian networks. [sent-24, score-0.23]
11 In this contribution we describe a novel bound on exact single-variable marginals in factor graphs which is not limited to pairwise interactions. [sent-25, score-0.597]
12 This has led to bounds which are at the same time bounds for the exact single-variable marginals as well as for the BP beliefs. [sent-27, score-0.634]
13 A particularly nice feature of our bounds is that their computational cost is relatively low, provided that the number of possible values of each variable in the factor graph is small. [sent-28, score-0.581]
14 On the other hand, the computational complexity is exponential in the number of possible values of the variables, which limits application to factor graphs in which each variable has a low number of possible values. [sent-29, score-0.315]
15 On these factor graphs however, our bound can significantly outperform existing methods, either in terms of accuracy or in terms of computation time (or both). [sent-30, score-0.423]
16 The basic idea underlying our method is that we recursively propagate bounds over a particular subtree of the factor graph. [sent-32, score-0.864]
17 The propagation rules are similar to those of Belief Propagation; however, instead of propagating messages, we propagate convex sets of messages. [sent-33, score-0.382]
18 This can be done in such a way that the final “beliefs” at the root node of the subtree are convex sets which contain the exact marginal of the root node (and, by construction, also its BP belief). [sent-34, score-1.024]
19 P(x) = (1/Z) ∏I∈F ψI (xNI ), where Z = ∑x∈XV ∏I∈F ψI (xNI ). (1) For each factor index I ∈ F, there is an associated subset NI ⊆ V of variable indices and the factor ψI is a nonnegative function ψI : XNI → [0, ∞). [sent-58, score-0.445]
20 In this article, we focus on the task of obtaining rigorous bounds on single-variable marginals P(xi ) = ∑xV\{i} P(x). [sent-63, score-0.343]
21 We can represent the structure of the probability distribution (1) using a factor graph (V, F, E). [sent-65, score-0.292]
22 This is a bipartite graph, consisting of variable nodes i ∈ V, factor nodes I ∈ F, and edges e ∈ E, with an edge {i, I} between i ∈ V and I ∈ F if and only if the factor ψI depends on xi (i.e., if and only if i ∈ NI ). [sent-66, score-0.732]
23 We will represent factor nodes visually as rectangles and variable nodes as circles. [sent-69, score-0.378]
24 Figure 1(a) shows a simple example of a factor graph. [sent-70, score-0.193]
25 The set of neighbors of a factor node I is precisely NI ; similarly, we denote the set of neighbors of a variable node i by Ni := {I ∈ F : i ∈ NI }. [sent-71, score-0.432]
26 We will assume throughout this article that the factor graph corresponding to (1) is connected. [sent-72, score-0.334]
27 We denote the set of extreme points of a convex set X ⊆ Rd by ext (X). [sent-74, score-0.302]
28 For a subset Y ⊆ Rd , the convex hull of Y is defined as the smallest convex set X ⊆ Rd with Y ⊆ X; we denote the convex hull of Y as conv (Y ). [sent-75, score-0.661]
29 Another important operation is the partial summation: given A ⊆ B ⊆ V and Ψ ∈ MB , define ∑xA Ψ to be the measure in MB\A obtained by summing Ψ over all xA ∈ XA , i.e., (∑xA Ψ)(xB\A ) := ∑xA ∈XA Ψ(xA , xB\A ). [sent-84, score-0.266]
30 Note that a simplex is convex; the simplex PA has precisely #(XA ) extreme points, each of which corresponds to putting all probability mass on one of the possible values of xA . [sent-105, score-0.197]
31 Then we define the box with lower bound a and upper bound b by B (a, b) := {x ∈ Rd : aα ≤ xα ≤ bα for all α = 1, . . . , d}. [sent-114, score-0.455]
32 Note that a box is convex; indeed, its extreme points are the “corners”, of which there are 2^d. [sent-118, score-0.202]
33 The smallest bounding box of X is defined as B (X) := B (a, b) where the lower bound a is given by the pointwise infimum of X and the upper bound b is given by the pointwise supremum of X, that is aα := inf{xα : x ∈ X} and bα := sup{xα : x ∈ X} for all α = 1, . . . , d. [sent-121, score-0.631]
34 Therefore, if X is convex, the smallest bounding box for X depends only on the extreme points ext (X), i.e., [sent-126, score-0.508]
35 B (X) = B (ext (X)); this bounding box can be easily calculated if the number of extreme points is not too large. [sent-128, score-0.376]
36 2 The basic tools To calculate marginals of subsets of variables in some factor graph, several operations performed on measures are relevant: normalization, taking products of measures, and summing over subsets of variables. [sent-130, score-0.567]
37 This will turn out to be useful later on, because our bounds make use of convex sets of measures that are propagated over the factor graph. [sent-132, score-0.608]
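The three operations listed above can be made concrete with a small sketch. The snippet below is ours, not the authors': it represents a measure on XA as a nonnegative numpy array with one axis per variable in A, and implements normalization, the product of measures, and partial summation.

```python
import numpy as np

# A measure on X_A is represented as a nonnegative array with one axis per variable in A.
def normalize(psi):
    """N(psi): rescale a nonzero measure so that its entries sum to one."""
    return psi / psi.sum()

def product(psi1, psi2):
    """Elementwise product of two measures on the same variable set."""
    return psi1 * psi2

def partial_sum(psi, axes):
    """Sum the measure over the variables indexed by `axes`, leaving a measure on the rest."""
    return psi.sum(axis=tuple(axes))

# Example: a random factor on two ternary variables, summed over the second variable
# and normalized, giving a single-variable probability distribution.
psi = np.random.rand(3, 3)
print(normalize(partial_sum(psi, [1])))
```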
38 The next lemma concerns the interplay between convexity and taking products; it says that if we take the product of convex sets of measures on different spaces, the resulting set is contained in the convex hull of the product of the extreme points of the convex sets. [sent-139, score-0.753]
39 The third lemma says that the product of several boxes on the same subset A of variables can be easily calculated: the product of the boxes is again a box, whose lower (upper) bound is the product of the lower (upper) bounds of the individual boxes. [sent-149, score-0.882]
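As an illustration of the two box operations used throughout (the product of boxes and the smallest bounding box of a set given its extreme points), here is a minimal sketch. It assumes nonnegative bounds, as is the case for measures, and the function names are ours rather than the paper's.

```python
import numpy as np

def box_product(boxes):
    """Product of boxes on the same variables: lower and upper bounds multiply
    elementwise (valid for nonnegative bounds, as for measures)."""
    lo = np.prod([b[0] for b in boxes], axis=0)
    hi = np.prod([b[1] for b in boxes], axis=0)
    return lo, hi

def smallest_bounding_box(extreme_points):
    """B(X) of a convex set X, computed from its extreme points as the
    pointwise infimum and supremum."""
    pts = np.asarray(extreme_points, dtype=float)
    return pts.min(axis=0), pts.max(axis=0)

# Example: product of two boxes in R^2, and the bounding box of a triangle.
b1 = (np.array([0.1, 0.2]), np.array([0.5, 0.6]))
b2 = (np.array([0.3, 0.1]), np.array([0.4, 0.9]))
print(box_product([b1, b2]))
print(smallest_bounding_box([[0.0, 0.0], [1.0, 0.0], [0.5, 0.8]]))
```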
40 It basically says that one can bound the marginal of a variable by replacing a factor depending on some other variables by a product of single-variable factors. [sent-156, score-0.611]
41 [Figure 1: panels (a)–(c) garbled in extraction] (d) Subtree of the factor graph; (e) Propagating convex sets of measures (boxes or simplices) on the subtree (d), leading to a bound Bi on the marginal probability of xi in G. [sent-159, score-1.064]
42 The positivity condition is a technical condition, which in our experience is fulfilled for many practically relevant factor graphs. [sent-168, score-0.193]
43 3 A simple example Before proceeding to our main result, we first illustrate for a simple case how the basic lemma can be employed to obtain computationally tractable bounds on marginals. [sent-170, score-0.296]
44 We derive a bound for the marginal of the variable xi in the factor graph in Figure 1(a). [sent-171, score-0.679]
45 We start by cloning the variable xj , i.e., adding a new variable xj′ that is constrained to take the same value as xj . [sent-174, score-0.381]
46 In terms of the factor graph, we add a variable node j′ and a factor node δ, connected to variable nodes j and j′ , with corresponding factor ψδ (xj , xj′ ) := δxj (xj′ ); see also Figure 1(b). [sent-175, score-1.101]
47 Clearly, the marginal of xi satisfies: P(xi ) = N (∑xj ∑xk ψJ ψK ψL ) = N (∑xj ∑xj′ ∑xk ψJ ψK ψL δxj (xj′ )), where it should be noted that the first occurrence of ψL is shorthand for ψL (xj , xk ), but the second occurrence is shorthand for ψL (xj′ , xk ). [sent-176, score-1.052]
48 Noting that ψδ ∈ M∗{j,j′} and applying the basic lemma with A = {i}, B = {k}, C = {j, j′ } and Ψ = ψJ ψK ψL yields: P(xi ) ∈ N (∑xj ∑xj′ ∑xk ψJ ψK ψL M∗{j,j′} ) ⊆ B N (∑xj ∑xj′ ∑xk ψJ ψK ψL Q∗{j,j′} ). [sent-177, score-0.490]
49 Applying the distributive law, we obtain (see also Figure 1(c)): P(xi ) ∈ B N (∑xj ψJ M∗{j} ∑xk ψK ∑xj′ ψL M∗{j′} ), which we relax to P(xi ) ∈ B N (B N (∑xj ψJ P{j} ) · B N (∑xk ψK B N (∑xj′ ψL P{j′} ))). [sent-178, score-1.151]
50 Now it may seem that this smallest bounding box would be difficult to compute. [sent-179, score-0.458]
51 Since smallest bounding boxes only depend on extreme points, we conclude that P(xi ) ∈ B N (B N (∑xj ψJ ext (P{j} )) · B N (∑xk ψK B N (∑xj′ ψL ext (P{j′} )))). [sent-181, score-0.898]
52 This expression can be calculated efficiently if the number of possible values of each variable is small. [sent-182, score-0.277]
53 First, one chooses a particular subtree of the factor graph, rooted in the variable whose marginal one wants to bound. [sent-185, score-0.862]
54 Then, one propagates messages (which are either bounding boxes or simplices) over this subtree, from the leaf nodes towards the root node. [sent-186, score-0.516]
55 The resulting “belief” at the root node is a box that bounds the exact marginal of the root node. [sent-188, score-0.803]
56 The choice of the subtree is arbitrary; different choices lead to different bounds in general. [sent-189, score-0.621]
57 We call the bipartite graph (V′ , F′ , E′ ) a subtree of (V, F, E) with root i if i ∈ V′ ⊆ V, F′ ⊆ F, E′ ⊆ E such that (V′ , F′ , E′ ) is a tree with root i and for all {j, J} ∈ E′ , j ∈ V′ and J ∈ F′ (i.e., the subtree has no dangling edges). [sent-192, score-0.787]
58 An illustration of a possible subtree of the factor graph in Figure 1(a) is the one shown in Figure 1(d). [sent-196, score-0.683]
59 The bound that we will obtain using this subtree corresponds to the example described in the previous subsection. [sent-197, score-0.558]
60 In the following, we will use the topology of the original factor graph (V, F, E) whenever we refer to neighbors of variables or factors. [sent-198, score-0.327]
61 Each edge of the subtree will carry one message, oriented such that it “flows” towards the root node. [sent-199, score-0.5]
62 In addition, we define messages entering the subtree for all “missing” edges in the subtree (see also Figure 1(e)). [sent-200, score-0.96]
63 Because of the bipartite character of the factor graph, we can distinguish two types of messages: messages BJ→j ⊆ Mj sent to a variable j ∈ V from a neighboring factor J ∈ Nj , and messages Bj→J ⊆ Mj sent to a factor J ∈ F from a neighboring variable j ∈ NJ . [sent-201, score-1.088]
64 The messages entering the subtree are all defined to be simplices; more precisely, we define the incoming messages Bj→J = Pj for all J ∈ F′ with {j, J} ∈ E \ E′ , and BJ→j = Pj for all j ∈ V′ with {j, J} ∈ E \ E′ . [sent-202, score-0.680]
65 We propagate messages towards the root i of the tree using the following update rules (note the similarity with the BP update rules). [sent-203, score-0.3]
66 The message sent from a variable j ∈ V′ to its parent J = par(j) ∈ F′ is defined as Bj→J = ∏K∈Nj \J BK→j if all incoming BK→j are boxes, and Bj→J = Pj if at least one of the BK→j is the simplex Pj , where the product of the boxes can be calculated using Lemma 3. [sent-204, score-0.620]
67 The final “belief” Bi at the root node i is calculated by Bi = B N (∏K∈Ni BK→i ) if all incoming BK→i are boxes, and Bi = Pi if at least one of the BK→i is the simplex Pi . [sent-207, score-0.470]
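To make the propagation scheme above more tangible, here is a rough Python sketch of one way to implement it. It is not the authors' code: the factor-to-variable update is not quoted in the sentences extracted above, so the sketch assumes it takes the form suggested by the worked example, namely the smallest bounding box of the normalized sums obtained from extreme points of the incoming messages. All names (SIMPLEX, factor_to_variable, root_belief) are ours.

```python
import itertools
import numpy as np

SIMPLEX = "simplex"  # stands for the full simplex P_j: an uninformative message

def box(lo, hi):
    return ("box", np.asarray(lo, dtype=float), np.asarray(hi, dtype=float))

def extreme_points(msg, n):
    """Extreme points of a message over a variable with n values:
    delta distributions for the simplex, corners for a box."""
    if msg == SIMPLEX:
        return [np.eye(n)[a] for a in range(n)]
    _, lo, hi = msg
    return [np.array(c) for c in itertools.product(*zip(lo, hi))]

def bounding_box(points):
    pts = np.array(points)
    return box(pts.min(axis=0), pts.max(axis=0))

def factor_to_variable(psi, target_axis, incoming):
    """Assumed update B N(sum over the other axes of psi times incoming messages):
    enumerate extreme points of each incoming set, normalize each summed result,
    and return the smallest bounding box of the results."""
    other = [ax for ax in range(psi.ndim) if ax != target_axis]
    ext_sets = [extreme_points(incoming[ax], psi.shape[ax]) for ax in other]
    results = []
    for combo in itertools.product(*ext_sets):
        m = psi
        for ax, vec in zip(other, combo):
            shape = [1] * psi.ndim
            shape[ax] = -1
            m = m * vec.reshape(shape)
        summed = m.sum(axis=tuple(other))
        results.append(summed / summed.sum())  # normalization N
    return bounding_box(results)

def variable_to_factor(incoming, n):
    """Product of incoming boxes (Lemma 3); the simplex if any message is a simplex."""
    if any(m == SIMPLEX for m in incoming):
        return SIMPLEX
    lo, hi = np.ones(n), np.ones(n)
    for _, l, h in incoming:
        lo, hi = lo * l, hi * h
    return box(lo, hi)

def root_belief(incoming, n):
    """Final belief B_i = B N(product of incoming boxes), or the simplex otherwise."""
    prod = variable_to_factor(incoming, n)
    if prod == SIMPLEX:
        return SIMPLEX
    return bounding_box([p / p.sum() for p in extreme_points(prod, n)])

# Toy run on the example of Figure 1: binary variables i, j, k; pairwise factors
# psi_J(xi, xj), psi_K(xi, xk), psi_L(xj, xk); subtree i-J-j and i-K-k-L, with the
# missing edge {j, L} entering as a simplex.
rng = np.random.default_rng(0)
psi_J, psi_K, psi_L = (rng.random((2, 2)) + 0.1 for _ in range(3))
B_Lk = factor_to_variable(psi_L, 1, {0: SIMPLEX})                  # message L -> k
B_Ki = factor_to_variable(psi_K, 0, {1: variable_to_factor([B_Lk], 2)})
B_Ji = factor_to_variable(psi_J, 0, {1: SIMPLEX})                  # j -> J is a simplex
B_i = root_belief([B_Ji, B_Ki], 2)
# Brute-force marginal of x_i, which should lie inside the box B_i:
joint = np.einsum('ij,ik,jk->ijk', psi_J, psi_K, psi_L)
P_i = joint.sum(axis=(1, 2))
print(B_i, P_i / P_i.sum())
```

If this reading of the update rules is correct, the brute-force marginal printed at the end falls inside the returned box for the toy example.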
68 We can now formulate our main result, which gives a rigorous bound on the exact single-variable marginal of the root node: Theorem 6 Let (V, F, E) be a factor graph with corresponding probability distribution (1). [sent-208, score-0.712]
69 Let i ∈ V and (V′ , F′ , E′ ) be a subtree of (V, F, E) with root i ∈ V′ . [sent-209, score-0.500]
70 Apply the “box propagation” algorithm described above to calculate the final “belief” Bi on the root node i; then the exact marginal satisfies P(xi ) ∈ Bi . [sent-210, score-0.251]
71 Proof sketch The first step consists in extending the subtree such that each factor node has the right number of neighboring variables by cloning the missing variables. [sent-212, score-0.778]
72 The second step consists of applying the basic lemma where the set C consists of all the variable nodes of the subtree which have connecting edges in E \ E′ , together with all the cloned variable nodes. [sent-213, score-0.674]
73 Then we apply the distributive law, which can be done because the extended subtree has no cycles. [sent-214, score-0.431]
74 Finally, we relax the bound by adding additional normalizations and smallest bounding boxes at each factor node in the subtree. [sent-215, score-0.744]
75 It should now be clear that the “box propagation” algorithm described above precisely calculates the smallest bounding box at the root node i that corresponds to this procedure. [sent-216, score-0.496]
76 Another method to obtain bounds on exact marginals is “Bound Propagation” [10]. [sent-224, score-0.404]
77 For a variable i ∈ V, we define the sets ∆i := ∪I∈Ni NI (consisting of all variables that appear in some factor in which i participates) and ∂i := ∆i \ {i} (the Markov blanket of i). [sent-226, score-0.343]
78 On the other hand, the advantage of this approach is that a bound on P(xj ) for j ∈ ∂i is also a bound on P(x∂i ), which in turn gives rise to a bound on P(xi ). [sent-229, score-0.501]
79 In this way, bounds can propagate through the graphical model, eventually yielding a new (tighter) bound on P(x∂i ). [sent-230, score-0.447]
80 Although the iteration can result in rather tight bounds, the main disadvantage of Bound Propagation is its computational cost: it is exponential in the Markov blanket and often many iterations are needed for the bounds to become tight. [sent-231, score-0.286]
81 Figure 2: Comparisons of various methods on different factor graphs: PROMEDAS (left), a large grid with strong interactions (middle) and a small grid with medium-strength interactions (right). [sent-251, score-0.429]
82 We have compared different methods for calculating bounds on single-variable marginals; for each method and each variable, we calculated the gap (tightness) of the bound, which we defined as the ℓ∞ distance between the upper and lower bound of the bounding box. [sent-254, score-0.571]
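For concreteness, the gap measure used here can be computed as below; a small sketch, not the authors' evaluation code.

```python
import numpy as np

def gap(lower, upper):
    """Tightness of a bound: the l-infinity distance between the upper and
    lower bound of the bounding box."""
    return float(np.max(np.asarray(upper) - np.asarray(lower)))

print(gap([0.20, 0.55], [0.45, 0.80]))  # 0.25
```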
83 We have investigated three different types of factor graphs; the results are shown in Figure 2. [sent-255, score-0.193]
84 The factor graphs used for our experiments are provided as supplementary material to the electronic version of this article at books. [sent-256, score-0.298]
85 These factor graphs have binary variables and singleton, pairwise and triple interactions (containing zeros). [sent-262, score-0.349]
86 For each instance, we calculated bounds for each “finding” variable in that instance using our method (“B OX P ROP”) and the method in [10]. [sent-264, score-0.346]
87 Note that the tightness of both bounds varies widely depending on the instance and on the variable of interest. [sent-265, score-0.336]
88 Our bound was tighter than the bound from [10] for all but one out of 1270 variables. [sent-266, score-0.369]
89 The single-variable factors were chosen as exp(θi xi ) with θi ∼ N (0, 1); the pair factors were exp(θij xi xj ) with θij ∼ N (0, 1). [sent-270, score-0.447]
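A sketch of how such a grid instance might be generated (our code, not the authors'; it assumes ±1-valued variables, which the extracted text does not state explicitly):

```python
import numpy as np

def random_grid_factors(N, rng=None):
    """Single-variable factors exp(theta_i * x_i) and pair factors
    exp(theta_ij * x_i * x_j) on an N x N grid, theta ~ N(0, 1), x in {-1, +1}."""
    rng = rng if rng is not None else np.random.default_rng()
    states = np.array([-1.0, +1.0])
    single = {(r, c): np.exp(rng.normal() * states)
              for r in range(N) for c in range(N)}
    pair = {}
    for r in range(N):
        for c in range(N):
            if c + 1 < N:   # horizontal edge
                pair[((r, c), (r, c + 1))] = np.exp(rng.normal() * np.outer(states, states))
            if r + 1 < N:   # vertical edge
                pair[((r, c), (r + 1, c))] = np.exp(rng.normal() * np.outer(states, states))
    return single, pair

single_factors, pair_factors = random_grid_factors(5, np.random.default_rng(1))
```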
90 We truncated the subtree to 400 nodes and the SAW tree to 10^5 nodes. [sent-271, score-0.486]
91 Note that our method yields the tightest bound for almost all variables. [sent-272, score-0.223]
92 Note that many methods return completely uninformative bounds in this case. [sent-278, score-0.23]
93 4 Conclusion and discussion We have described a novel bound on exact single-variable marginals, which is at the same time a bound on the (approximate) Belief Propagation marginals. [sent-279, score-0.395]
94 Contrary to many other existing bounds, it is formulated for the general case of factor graphs with discrete variables and factors depending on an arbitrary number of variables. [sent-280, score-0.356]
95 The bound is calculated by propagating convex sets of measures over a subtree of the factor graph, with update equations resembling those of BP. [sent-281, score-1.042]
96 For variables with a limited number of possible values, the bounds can be computed efficiently. [sent-282, score-0.265]
97 We have compared our bounds with existing methods and conclude that our method belongs to the best methods, but that it is difficult to say in general which method will yield the tightest bounds for a given variable in a specific factor graph. [sent-283, score-0.768]
98 Although our bounds are a step forward in quantifying the error of Belief Propagation, the actual error made by BP is often at least one order of magnitude lower than the tightness of these bounds. [sent-285, score-0.277]
99 Error bounds between marginal probabilities and beliefs of loopy belief propagation algorithm. [sent-314, score-0.672]
100 A new class of upper bounds on the log partition function. [sent-334, score-0.23]
wordName wordTfidf (topN-words)
[('subtree', 0.391), ('xa', 0.266), ('bounds', 0.23), ('factor', 0.193), ('bn', 0.181), ('conv', 0.173), ('bound', 0.167), ('xj', 0.161), ('propagation', 0.159), ('bp', 0.135), ('ext', 0.13), ('xb', 0.121), ('box', 0.121), ('belief', 0.119), ('boxes', 0.118), ('bounding', 0.117), ('bk', 0.115), ('marginals', 0.113), ('bj', 0.112), ('root', 0.109), ('messages', 0.109), ('xk', 0.102), ('ni', 0.099), ('graph', 0.099), ('mb', 0.098), ('ma', 0.095), ('qa', 0.093), ('leisink', 0.093), ('convex', 0.091), ('node', 0.09), ('marginal', 0.083), ('propagating', 0.082), ('extreme', 0.081), ('bi', 0.08), ('hull', 0.078), ('xi', 0.078), ('gaps', 0.075), ('pj', 0.072), ('boxprop', 0.069), ('cloning', 0.069), ('romedas', 0.069), ('simplices', 0.069), ('xni', 0.069), ('par', 0.069), ('xv', 0.069), ('lemma', 0.066), ('factors', 0.065), ('graphs', 0.063), ('nodes', 0.063), ('sent', 0.063), ('bl', 0.063), ('exact', 0.061), ('mooij', 0.061), ('measures', 0.061), ('grid', 0.06), ('smallest', 0.059), ('variable', 0.059), ('simplex', 0.058), ('interactions', 0.058), ('calculated', 0.057), ('blanket', 0.056), ('subtrees', 0.056), ('tightest', 0.056), ('calculate', 0.052), ('propagate', 0.05), ('interplay', 0.049), ('nj', 0.049), ('bipartite', 0.047), ('tightness', 0.047), ('kappen', 0.046), ('libdai', 0.046), ('promedas', 0.046), ('wemmenhove', 0.046), ('wiegerinck', 0.046), ('xnj', 0.046), ('loopy', 0.045), ('rd', 0.042), ('article', 0.042), ('distributive', 0.04), ('parent', 0.04), ('pa', 0.04), ('subsets', 0.04), ('incoming', 0.038), ('product', 0.037), ('says', 0.037), ('dutch', 0.037), ('edges', 0.036), ('beliefs', 0.036), ('jaakkola', 0.035), ('variables', 0.035), ('release', 0.035), ('tighter', 0.035), ('convexity', 0.034), ('products', 0.033), ('entering', 0.033), ('propagated', 0.033), ('medical', 0.032), ('message', 0.032), ('tree', 0.032), ('mj', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999997 40 nips-2008-Bounds on marginal probability distributions
Author: Joris M. Mooij, Hilbert J. Kappen
Abstract: We propose a novel bound on single-variable marginal probability distributions in factor graphs with discrete variables. The bound is obtained by propagating local bounds (convex sets of probability distributions) over a subtree of the factor graph, rooted in the variable of interest. By construction, the method not only bounds the exact marginal probability distribution of a variable, but also its approximate Belief Propagation marginal (“belief”). Thus, apart from providing a practical means to calculate bounds on marginals, our contribution also lies in providing a better understanding of the error made by Belief Propagation. We show that our bound outperforms the state-of-the-art on some inference problems arising in medical diagnosis. 1
2 0.16148145 50 nips-2008-Continuously-adaptive discretization for message-passing algorithms
Author: Michael Isard, John MacCormick, Kannan Achan
Abstract: Continuously-Adaptive Discretization for Message-Passing (CAD-MP) is a new message-passing algorithm for approximate inference. Most message-passing algorithms approximate continuous probability distributions using either: a family of continuous distributions such as the exponential family; a particle-set of discrete samples; or a fixed, uniform discretization. In contrast, CAD-MP uses a discretization that is (i) non-uniform, and (ii) adaptive to the structure of the marginal distributions. Non-uniformity allows CAD-MP to localize interesting features (such as sharp peaks) in the marginal belief distributions with time complexity that scales logarithmically with precision, as opposed to uniform discretization which scales at best linearly. We give a principled method for altering the non-uniform discretization according to information-based measures. CAD-MP is shown in experiments to estimate marginal beliefs much more precisely than competing approaches for the same computational expense. 1
3 0.15427199 53 nips-2008-Counting Solution Clusters in Graph Coloring Problems Using Belief Propagation
Author: Lukas Kroc, Ashish Sabharwal, Bart Selman
Abstract: We show that an important and computationally challenging solution space feature of the graph coloring problem (COL), namely the number of clusters of solutions, can be accurately estimated by a technique very similar to one for counting the number of solutions. This cluster counting approach can be naturally written in terms of a new factor graph derived from the factor graph representing the COL instance. Using a variant of the Belief Propagation inference framework, we can efficiently approximate cluster counts in random COL problems over a large range of graph densities. We illustrate the algorithm on instances with up to 100, 000 vertices. Moreover, we supply a methodology for computing the number of clusters exactly using advanced techniques from the knowledge compilation literature. This methodology scales up to several hundred variables. 1
4 0.15196386 161 nips-2008-On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization
Author: Sham M. Kakade, Karthik Sridharan, Ambuj Tewari
Abstract: This work characterizes the generalization ability of algorithms whose predictions are linear in the input vector. To this end, we provide sharp bounds for Rademacher and Gaussian complexities of (constrained) linear classes, which directly lead to a number of generalization bounds. This derivation provides simplified proofs of a number of corollaries including: risk bounds for linear prediction (including settings where the weight vectors are constrained by either L2 or L1 constraints), margin bounds (including both L2 and L1 margins, along with more general notions based on relative entropy), a proof of the PAC-Bayes theorem, and upper bounds on L2 covering numbers (with Lp norm constraints and relative entropy constraints). In addition to providing a unified analysis, the results herein provide some of the sharpest risk and margin bounds. Interestingly, our results show that the uniform convergence rates of empirical risk minimization algorithms tightly match the regret bounds of online learning algorithms for linear prediction, up to a constant factor of 2. 1
5 0.14594682 49 nips-2008-Clusters and Coarse Partitions in LP Relaxations
Author: David Sontag, Amir Globerson, Tommi S. Jaakkola
Abstract: We propose a new class of consistency constraints for Linear Programming (LP) relaxations for finding the most probable (MAP) configuration in graphical models. Usual cluster-based LP relaxations enforce joint consistency on the beliefs of a cluster of variables, with computational cost increasing exponentially with the size of the clusters. By partitioning the state space of a cluster and enforcing consistency only across partitions, we obtain a class of constraints which, although less tight, are computationally feasible for large clusters. We show how to solve the cluster selection and partitioning problem monotonically in the dual LP, using the current beliefs to guide these choices. We obtain a dual message passing algorithm and apply it to protein design problems where the variables have large state spaces and the usual cluster-based relaxations are very costly. The resulting method solves many of these problems exactly, and significantly faster than a method that does not use partitioning. 1
6 0.12502657 89 nips-2008-Gates
7 0.12337431 46 nips-2008-Characterizing response behavior in multisensory perception with conflicting cues
8 0.11470558 223 nips-2008-Structure Learning in Human Sequential Decision-Making
9 0.10469931 129 nips-2008-MAS: a multiplicative approximation scheme for probabilistic inference
10 0.10199085 2 nips-2008-A Convex Upper Bound on the Log-Partition Function for Binary Distributions
11 0.099823251 34 nips-2008-Bayesian Network Score Approximation using a Metagraph Kernel
12 0.095124744 235 nips-2008-The Infinite Hierarchical Factor Regression Model
13 0.091426611 239 nips-2008-Tighter Bounds for Structured Estimation
14 0.089704931 37 nips-2008-Biasing Approximate Dynamic Programming with a Lower Discount Factor
15 0.081341736 104 nips-2008-Improved Moves for Truncated Convex Models
16 0.078559905 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE
17 0.07813476 51 nips-2008-Convergence and Rate of Convergence of a Manifold-Based Dimension Reduction Algorithm
18 0.077861384 72 nips-2008-Empirical performance maximization for linear rank statistics
19 0.074518345 69 nips-2008-Efficient Exact Inference in Planar Ising Models
20 0.073121108 189 nips-2008-Rademacher Complexity Bounds for Non-I.I.D. Processes
topicId topicWeight
[(0, -0.229), (1, 0.026), (2, -0.108), (3, 0.065), (4, 0.004), (5, -0.17), (6, 0.074), (7, -0.016), (8, 0.045), (9, -0.143), (10, -0.067), (11, 0.133), (12, 0.134), (13, -0.031), (14, -0.11), (15, 0.091), (16, 0.129), (17, -0.097), (18, 0.027), (19, -0.019), (20, 0.08), (21, -0.107), (22, -0.161), (23, -0.127), (24, 0.048), (25, -0.056), (26, 0.073), (27, -0.036), (28, -0.085), (29, -0.076), (30, -0.164), (31, 0.06), (32, 0.086), (33, 0.108), (34, -0.14), (35, -0.007), (36, 0.046), (37, 0.054), (38, -0.041), (39, -0.094), (40, -0.041), (41, -0.011), (42, -0.081), (43, -0.015), (44, 0.065), (45, -0.109), (46, 0.051), (47, -0.046), (48, 0.111), (49, 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.96993232 40 nips-2008-Bounds on marginal probability distributions
Author: Joris M. Mooij, Hilbert J. Kappen
Abstract: We propose a novel bound on single-variable marginal probability distributions in factor graphs with discrete variables. The bound is obtained by propagating local bounds (convex sets of probability distributions) over a subtree of the factor graph, rooted in the variable of interest. By construction, the method not only bounds the exact marginal probability distribution of a variable, but also its approximate Belief Propagation marginal (“belief”). Thus, apart from providing a practical means to calculate bounds on marginals, our contribution also lies in providing a better understanding of the error made by Belief Propagation. We show that our bound outperforms the state-of-the-art on some inference problems arising in medical diagnosis. 1
2 0.8122763 89 nips-2008-Gates
Author: Tom Minka, John Winn
Abstract: Gates are a new notation for representing mixture models and context-sensitive independence in factor graphs. Factor graphs provide a natural representation for message-passing algorithms, such as expectation propagation. However, message passing in mixture models is not well captured by factor graphs unless the entire mixture is represented by one factor, because the message equations have a containment structure. Gates capture this containment structure graphically, allowing both the independences and the message-passing equations for a model to be readily visualized. Different variational approximations for mixture models can be understood as different ways of drawing the gates in a model. We present general equations for expectation propagation and variational message passing in the presence of gates. 1
3 0.75166726 50 nips-2008-Continuously-adaptive discretization for message-passing algorithms
Author: Michael Isard, John MacCormick, Kannan Achan
Abstract: Continuously-Adaptive Discretization for Message-Passing (CAD-MP) is a new message-passing algorithm for approximate inference. Most message-passing algorithms approximate continuous probability distributions using either: a family of continuous distributions such as the exponential family; a particle-set of discrete samples; or a fixed, uniform discretization. In contrast, CAD-MP uses a discretization that is (i) non-uniform, and (ii) adaptive to the structure of the marginal distributions. Non-uniformity allows CAD-MP to localize interesting features (such as sharp peaks) in the marginal belief distributions with time complexity that scales logarithmically with precision, as opposed to uniform discretization which scales at best linearly. We give a principled method for altering the non-uniform discretization according to information-based measures. CAD-MP is shown in experiments to estimate marginal beliefs much more precisely than competing approaches for the same computational expense. 1
4 0.63352352 53 nips-2008-Counting Solution Clusters in Graph Coloring Problems Using Belief Propagation
Author: Lukas Kroc, Ashish Sabharwal, Bart Selman
Abstract: We show that an important and computationally challenging solution space feature of the graph coloring problem (COL), namely the number of clusters of solutions, can be accurately estimated by a technique very similar to one for counting the number of solutions. This cluster counting approach can be naturally written in terms of a new factor graph derived from the factor graph representing the COL instance. Using a variant of the Belief Propagation inference framework, we can efficiently approximate cluster counts in random COL problems over a large range of graph densities. We illustrate the algorithm on instances with up to 100, 000 vertices. Moreover, we supply a methodology for computing the number of clusters exactly using advanced techniques from the knowledge compilation literature. This methodology scales up to several hundred variables. 1
5 0.60205263 129 nips-2008-MAS: a multiplicative approximation scheme for probabilistic inference
Author: Ydo Wexler, Christopher Meek
Abstract: We propose a multiplicative approximation scheme (MAS) for inference problems in graphical models, which can be applied to various inference algorithms. The method uses -decompositions which decompose functions used throughout the inference procedure into functions over smaller sets of variables with a known error . MAS translates these local approximations into bounds on the accuracy of the results. We show how to optimize -decompositions and provide a fast closed-form solution for an L2 approximation. Applying MAS to the Variable Elimination inference algorithm, we introduce an algorithm we call DynaDecomp which is extremely fast in practice and provides guaranteed error bounds on the result. The superior accuracy and efficiency of DynaDecomp is demonstrated. 1
6 0.58557832 49 nips-2008-Clusters and Coarse Partitions in LP Relaxations
7 0.56065029 2 nips-2008-A Convex Upper Bound on the Log-Partition Function for Binary Distributions
8 0.51854312 37 nips-2008-Biasing Approximate Dynamic Programming with a Lower Discount Factor
9 0.48208311 161 nips-2008-On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization
10 0.46813902 46 nips-2008-Characterizing response behavior in multisensory perception with conflicting cues
11 0.44772717 104 nips-2008-Improved Moves for Truncated Convex Models
12 0.44132298 115 nips-2008-Learning Bounded Treewidth Bayesian Networks
13 0.43740997 69 nips-2008-Efficient Exact Inference in Planar Ising Models
14 0.38673022 250 nips-2008-Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning
15 0.38339147 34 nips-2008-Bayesian Network Score Approximation using a Metagraph Kernel
16 0.38126934 235 nips-2008-The Infinite Hierarchical Factor Regression Model
17 0.37904552 51 nips-2008-Convergence and Rate of Convergence of a Manifold-Based Dimension Reduction Algorithm
18 0.3677426 82 nips-2008-Fast Computation of Posterior Mode in Multi-Level Hierarchical Models
19 0.36492124 239 nips-2008-Tighter Bounds for Structured Estimation
20 0.3491967 78 nips-2008-Exact Convex Confidence-Weighted Learning
topicId topicWeight
[(6, 0.05), (7, 0.069), (12, 0.064), (21, 0.013), (28, 0.188), (57, 0.063), (59, 0.02), (63, 0.034), (69, 0.239), (71, 0.042), (77, 0.064), (78, 0.025), (83, 0.046)]
simIndex simValue paperId paperTitle
1 0.94465005 7 nips-2008-A computational model of hippocampal function in trace conditioning
Author: Elliot A. Ludvig, Richard S. Sutton, Eric Verbeek, E. J. Kehoe
Abstract: We introduce a new reinforcement-learning model for the role of the hippocampus in classical conditioning, focusing on the differences between trace and delay conditioning. In the model, all stimuli are represented both as unindividuated wholes and as a series of temporal elements with varying delays. These two stimulus representations interact, producing different patterns of learning in trace and delay conditioning. The model proposes that hippocampal lesions eliminate long-latency temporal elements, but preserve short-latency temporal elements. For trace conditioning, with no contiguity between cue and reward, these long-latency temporal elements are necessary for learning adaptively timed responses. For delay conditioning, the continued presence of the cue supports conditioned responding, and the short-latency elements suppress responding early in the cue. In accord with the empirical data, simulated hippocampal damage impairs trace conditioning, but not delay conditioning, at medium-length intervals. With longer intervals, learning is impaired in both procedures, and, with shorter intervals, in neither. In addition, the model makes novel predictions about the response topography with extended cues or post-training lesions. These results demonstrate how temporal contiguity, as in delay conditioning, changes the timing problem faced by animals, rendering it both easier and less susceptible to disruption by hippocampal lesions. The hippocampus is an important structure in many types of learning and memory, with prominent involvement in spatial navigation, episodic and working memories, stimulus configuration, and contextual conditioning. One empirical phenomenon that has eluded many theories of the hippocampus is the dependence of aversive trace conditioning on an intact hippocampus (but see Rodriguez & Levy, 2001; Schmajuk & DiCarlo, 1992; Yamazaki & Tanaka, 2005). For example, trace eyeblink conditioning disappears following hippocampal lesions (Solomon et al., 1986; Moyer, Jr. et al., 1990), induces hippocampal neurogenesis (Gould et al., 1999), and produces unique activity patterns in hippocampal neurons (McEchron & Disterhoft, 1997). In this paper, we present a new abstract computational model of hippocampal function during trace conditioning. We build on a recent extension of the temporal-difference (TD) model of conditioning (Ludvig, Sutton & Kehoe, 2008; Sutton & Barto, 1990) to demonstrate how the details of stimulus representation can qualitatively alter learning during trace and delay conditioning. By gently tweaking this stimulus representation and reducing long-latency temporal elements, trace conditioning is severely impaired, whereas delay conditioning is hardly affected. In the model, the hippocampus is responsible for maintaining these long-latency elements, thus explaining the selective importance of this brain structure in trace conditioning. The difference between trace and delay conditioning is one of the most basic operational distinctions in classical conditioning (e.g., Pavlov, 1927). Figure 1 is a schematic of the two training procedures. In trace conditioning, a conditioned stimulus (CS) is followed some time later by a reward or uncon1 Trace Delay Stimulus Reward Figure 1: Event timelines in trace and delay conditioning. Time flows from left-to-right in the diagram. A vertical bar represents a punctate (short) event, and the extended box is a continuously available stimulus. 
In delay conditioning, the stimulus and reward overlap, whereas, in trace conditioning, there is a stimulus-free gap between the two punctate events. ditioned stimulus (US); the two stimuli are separated by a stimulus-free gap. In contrast, in delay conditioning, the CS remains on until presentation of the US. Trace conditioning is learned more slowly than delay conditioning, with poorer performance often observed even at asymptote. In both eyeblink conditioning (Moyer, Jr. et al., 1990; Solomon et al., 1986; Tseng et al., 2004) and fear conditioning (e.g., McEchron et al., 1998), hippocampal damage severely impairs the acquisition of conditioned responding during trace conditioning, but not delay conditioning. These selective hippocampal deficits with trace conditioning are modulated by the inter-stimulus interval (ISI) between CS onset and US onset. With very short ISIs (∼300 ms in eyeblink conditioning in rabbits), there is little deficit in the acquisition of responding during trace conditioning (Moyer, Jr. et al., 1990). Furthermore, with very long ISIs (>1000 ms), delay conditioning is also impaired by hippocampal lesions (Beylin et al., 2001). These interactions between ISI and the hippocampaldependency of conditioning are the primary data that motivate the new model. 1 TD Model of Conditioning Our full model of conditioning consists of three separate modules: the stimulus representation, learning algorithm, and response rule. The explanation of hippocampal function relies mostly on the details of the stimulus representation. To illustrate the implications of these representational issues, we have chosen the temporal-difference (TD) learning algorithm from reinforcement learning (Sutton & Barto, 1990, 1998) that has become the sine qua non for modeling reward learning in dopamine neurons (e.g., Ludvig et al., 2008; Schultz, Dayan, & Montague, 1997), and a simple, leaky-integrator response rule described below. We use these for simplicity and consistency with prior work; other learning algorithms and response rules might also yield similar conclusions. 1.1 Stimulus Representation In the model, stimuli are not coherent wholes, but are represented as a series of elements or internal microstimuli. There are two types of elements in the stimulus representation: the first is the presence microstimulus, which is exactly equivalent to the external stimulus (Sutton & Barto, 1990). This microstimulus is available whenever the corresponding stimulus is on (see Fig. 3). The second type of elements are the temporal microstimuli or spectral traces, which are a series of successively later and gradually broadening elements (see Grossberg & Schmajuk, 1989; Machado, 1997; Ludvig et al., 2008). Below, we show how the interaction between these two types of representational elements produces different styles of learning in delay and trace conditioning, resulting in differential sensitivity of these procedures to hippocampal manipulation. The temporal microstimuli are created in the model through coarse coding of a decaying memory trace triggered by stimulus onset. Figure 2 illustrates how this memory trace (left panel) is encoded by a series of basis functions evenly spaced across the height of the trace (middle panel). Each basis function effectively acts as a receptive field for trace height: As the memory trace fades, different basis functions become more or less active, each with a particular temporal profile (right panel). 
These activity profiles for the temporal microstimuli are then used to generate predictions of the US. For the basis functions, we chose simple Gaussians: 1 (y − µ)2 f (y, µ, σ) = √ exp(− ). 2σ 2 2π 2 (1) 0.4 Microstimulus Level Trace Height 1 0.75 + 0.5 0.25 0 0 20 40 60 Time Step 0.3 0.2 0.1 0 Temporal Basis Functions 0 20 40 60 Time Step Figure 2: Creating Microstimuli. The memory traces for a stimulus (left) are coarsely coded by a series of temporal basis functions (middle). The resultant time courses (right) of the temporal microstimuli are used to predict future occurrence of the US. A single basis function (middle) and approximately corresponding microstimulus (right) have been darkened. The inset in the right panel shows the levels of several microstimuli at the time indicated by the dashed line. Given these basis functions, the microstimulus levels xt (i) at time t are determined by the corresponding memory trace height: xt (i) = f (yt , i/m, σ)yt , (2) where f is the basis function defined above and m is the number of temporal microstimuli per stimulus. The trace level yt was set to 1 at stimulus onset and decreased exponentially, controlled by a single decay parameter, which was allowed to vary to simulate the effects of hippocampal lesions. Every stimulus, including the US, was represented by a single memory trace and resultant microstimuli. 1.2 Hippocampal Damage We propose that hippocampal damage results in the selective loss of the long-latency temporal elements of the stimulus representation. This idea is implemented in the model through a decrease in the memory decay constant from .985 to .97, approximately doubling the decay rate of the memory trace that determines the microstimuli. In effect, we assume that hippocampal damage results in a memory trace that decays more quickly, or, equivalently, is more susceptible to interference. Figure 3 shows the effects of this parameter manipulation on the time course of the elements in the stimulus representation. The presence microstimulus is not affected by this manipulation, but the temporal microstimuli are compressed for both the CS and the US. Each microstimulus has a briefer time course, and, as a group, they cover a shorter time span. Other means for eliminating or reducing the long-latency temporal microstimuli are certainly possible and would likely be compatible with our theory. For example, if one assumes that the stimulus representation contains multiple memory traces with different time constants, each with a separate set of microstimuli, then eliminating the slower memory traces would also remove the long-latency elements, and many of the results below hold (simulations not shown). The key point is that hippocampal damage reduces the number and magnitude of long-latency microstimuli. 1.3 Learning and Responding The model approaches conditioning as a reinforcement-learning prediction problem, wherein the agent tries to predict the upcoming rewards or USs. The model learns through linear TD(λ) (Ludvig et al., 2008; Schultz et al., 1997; Sutton, 1988; Sutton & Barto, 1990, 1998). At each time step, the US prediction (Vt ) is determined by: n T Vt (x) = wt x 0 = wt (i)x(i) i=1 3 , 0 (3) Microstimulus Level Normal Hippocampal 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0.2 0 0 0 500 1000 0 500 1000 Time (ms) Figure 3: Hippocampal effects on the stimulus representation. 
The left panel presents the stimulus representation in delay conditioning with the normal parameter settings, and the right panel presents the altered stimulus representation following simulated hippocampal damage. In the hippocampal representation, the temporal microstimuli for both CS (red, solid lines) and US (green, dashed lines) are all briefer and shallower. The presence microstimuli (blue square wave and black spike) are not affected by the hippocampal manipulation. where x is a vector of the activation levels x(i) for the various microstimuli, wt is a corresponding vector of adjustable weights wt (i) at time step t, and n is the total number of all microstimuli. The US prediction is constrained to be non-negative, with negative values rectified to 0. As is standard in TD models, this US prediction is compared to the reward received and the previous US prediction to generate a TD error (δt ): δt = rt + γVt (xt ) − Vt (xt−1 ), (4) where γ is a discount factor that determines the temporal horizon of the US prediction. This TD error is then used to update the weight vector based on the following update rule: wt+1 = wt + αδt et , (5) where α is a step-size parameter and et is a vector of eligibility trace levels (see Sutton & Barto, 1998), which together help determine the speed of learning. Each microstimulus has its own corresponding eligibility trace which continuously decays, but accumulates whenever that microstimulus is present: et+1 = γλet + xt , (6) where γ is the discount factor as above and λ is a decay parameter that determines the plasticity window. These US predictions are translated into responses through a simple, thresholded leakyintegrator response rule: at+1 = υat + Vt+1 (xt ) θ , (7) where υ is a decay constant, and θ is a threshold on the value function V . Our model is defined by Equations 1-7 and 7 additional parameters, which were fixed at the following values for the simulations below: λ = .95, α = .005, γ = .97, n = 50, σ = .08, υ = .93, θ = .25. In the simulated experiments, one time step was interpreted as 10 ms. 4 CR Magnitude ISI250 5 4 3 3 Delay!Normal 2 Delay!HPC Trace!Normal 1 Trace!HPC 0 250 500 50 3 2 1 50 ISI1000 5 4 4 0 ISI500 5 2 1 250 500 0 50 250 500 Trials Figure 4: Learning in the model for trace and delay conditioning with and without hippocampal (HPC) damage. The three panels present training with different interstimulus intervals (ISI). 2 Results We simulated 12 total conditions with the model: trace and delay conditioning, both with and without hippocampal damage, for short (250 ms), medium (500 ms), and long (1000 ms) ISIs. Each simulated experiment was run for 500 trials, with every 5th trial an unreinforced probe trial, during which no US was presented. For delay conditioning, the CS lasted the same duration as the ISI and terminated with US presentation. For trace conditioning, the CS was present for 5 time steps (50 ms). The US always lasted for a single time step, and an inter-trial interval of 5000 ms separated all trials (onset to onset). Conditioned responding (CR magnitude) was measured as the maximum height of the response curve on a given trial. 0.8 CR Magnitude US Prediction Figure 4 summarizes our results. The figure depicts how the CR magnitude changed across the 500 trials of acquisition training. In general, trace conditioning produced lower levels of responding than delay conditioning, but this effect was most pronounced with the longest ISI. The effects of simulated hippocampal damage varied with the ISI. 
With the shortest ISI (250 ms; left panel), there was little effect on responding in either trace or delay conditioning. There was a small deficit early in training with trace conditioning, but this difference disappeared quickly with further training. With the longest ISI (1000 ms; right panel), there was a profound effect on responding in both trace and delay conditioning, with trace conditioning completely eliminated. The intermediate ISI (500 ms; middle panel) produced the most complex and interesting results. With this interval, there was only a minor deficit in delay conditioning, but a substantial drop in trace conditioning, especially early in training. This pattern of results roughly matches the empirical data, capturing the selective deficit in trace conditioning caused by hippocampal lesions (Solomon et al., 1986) as well as the modulation of this deficit by ISI (Beylin et al., 2001; Moyer, Jr. et al., 1990). Delay Trace 0.6 0.4 0.2 0 0 250 500 750 Time (ms) 5 4 3 2 1 0 0 250 500 750 Time (ms) Figure 5: Time course of US prediction and CR magnitude for both trace (red, dashed line) and delay conditioning (blue, solid line) with a 500-ms ISI. 5 These differences in sensitivity to simulated hippocampal damage arose despite similar model performance during normal trace and delay conditioning. Figure 5 shows the time course of the US prediction (left panel) and CR magnitude (right panel) after trace and delay conditioning on a probe trial with a 500-ms ISI. In both instances, the US prediction grew throughout the trial as the usual time of the US became imminent. Note the sharp drop off in US prediction for delay conditioning exactly as the CS terminates. This change reflects the disappearance of the presence microstimulus, which supports much of the responding in delay conditioning (see Fig. 6). In both procedures, even after the usual time of the US (and CS termination in the case of delay conditioning), there was still some residual US prediction. These US predictions were caused by the long-latency microstimuli, which did not disappear exactly at CS offset, and were ordinarily (on non-probe trials) countered by negative weights on the US microstimuli. The CR magnitude tracked the US prediction curve quite closely, peaking around the time the US would have occurred for both trace and delay conditioning. There was little difference in either curve between trace and delay conditioning, yet altering the stimulus representation (see Fig. 3) had a more pronounced effect on trace conditioning. An examination of the weight distribution for trace and delay conditioning explains why hippocampal damage had a more pronounced effect on trace than delay conditioning. Figure 6 depicts some representative microstimuli (left column) as well as their corresponding weights (right columns) following trace or delay conditioning with or without simulated hippocampal damage. For clarity in the figure, we have grouped the weights into four categories: positive (+), large positive (+++), negative (-), and large negative (--). The left column also depicts how the model poses the computational problem faced by an animal during conditioning; the goal is to sum together weighted versions of the available microstimuli to produce the ideal US prediction curve in the bottom row. 
In normal delay conditioning, the model placed a high positive weight on the presence microstimulus, but balanced that with large negative weights on the early CS microstimuli, producing a prediction topography that roughly matched the ideal prediction (see Fig. 5, left panel). In normal trace conditioning, the model only placed a small positive weight on the presence microstimulus, but supplemented that with large positive weights on both the early and late CS microstimuli, also producing a prediction topography that roughly matched the ideal prediction. Weights Normal HPC Lesion Delay CS Presence Stimulus CS Early Microstimuli CS Late Microstimuli US Early Microstimuli Trace Delay Trace +++ + +++ + -- + -- + + +++ N/A N/A - -- - - Ideal Summed Prediction Figure 6: Schematic of the weights (right columns) on various microstimuli following trace and delay conditioning. The left column illustrates four representative microstimuli: the presence microstimulus, an early CS microstimulus, a late CS microstimulus, and a US microstimulus. The ideal prediction is the expectation of the sum of future discounted rewards. 6 Following hippocampal lesions, the late CS microstimuli were no longer available (N/A), and the system could only use the other microstimuli to generate the best possible prediction profile. In delay conditioning, the loss of these long-latency microstimuli had a small effect, notable only with the longest ISI (1000 ms) with these parameter settings. With trace conditioning, the loss of the long-latency microstimuli was catastrophic, as these microstimuli were usually the major basis for the prediction of the upcoming US. As a result, trace conditioning became much more difficult (or impossible in the case of the 1000-ms ISI), even though delay conditioning was less affected. The most notable (and defining) difference between trace and delay conditioning is that the CS and US overlap in delay conditioning, but not trace conditioning. In our model, this overlap is necessary, but not sufficient, for the the unique interaction between the presence microstimulus and temporal microstimuli in delay conditioning. For example, if the CS were extended to stay on beyond the time of US occurrence, this contiguity would be maintained, but negative weights on the early CS microstimuli would not suffice to suppress responding throughout this extended CS. In this case, the best solution to predicting the US for the model might be to put high weights on the long-latency temporal microstimuli (as in trace conditioning; see Fig 6), which would not persist as long as the now extended presence microstimulus. Indeed, with a CS that was three times as long as the ISI, we found that the US prediction, CR magnitude, and underlying weights were completely indistinguishable from trace conditioning (simulations not shown). Thus, the model predicts that this extended delay conditioning should be equally sensitive to hippocampal damage as trace conditioning for the same ISIs. This empirical prediction is a fundamental test of the representational assumptions underlying the model. The particular mechanism that we chose for simulating the loss of the long-latency microstimuli (increasing the decay rate of the memory trace) also leads to a testable model prediction. 
If one were to pre-train an animal with trace conditioning and then perform hippocampal lesions, there should be some loss of responding, but, more importantly, those CRs that do occur should appear earlier in the interval because the temporal microstimuli now follow a shorter time course (see Fig. 3). There is some evidence for additional short-latency CRs during trace conditioning in lesioned animals (e.g., Port et al., 1986; Solomon et al., 1986), but, to our knowledge, this precise model prediction has not been rigorously evaluated. 3 Discussion and Conclusion We evaluated a novel computational model for the role of the hippocampus in trace conditioning, based on a reinforcement-learning framework. We extended the microstimulus TD model presented by Ludvig et al. (2008) by suggesting a role for the hippocampus in maintaining long-latency elements of the temporal stimulus representation. The current model also introduced an additional element to the stimulus representation (the presence microstimulus) and a simple response rule for translating prediction into actions; we showed how these subtle innovations yield interesting interactions when comparing trace and delay conditioning. In addition, we adduced a pair of testable model predictions about the effects of extended stimuli and post-training lesions. There are several existing theories for the role of the hippocampus in trace conditioning, including the modulation of timing (Solomon et al., 1986), establishment of contiguity (e.g., Wallenstein et al., 1998), and overcoming of task difficulty (Beylin et al., 2001). Our new model provides a computational mechanism that links these three proposed explanations. In our model, for similar ISIs, delay conditioning requires learning to suppress responding early in the CS, whereas trace conditioning requires learning to create responding later in the trial, near the time of the US (see Fig. 6). As a result, for the same ISI, delay conditioning requires changing weights associated with earlier microstimuli than trace conditioning, though in opposite directions. These early microstimuli reach higher activation levels (see Fig. 2), producing higher eligibility traces, and are therefore learned about more quickly. This differential speed of learning for short-latency temporal microstimuli corresponds with much behavioural data that shorter ISIs tend to improve both the speed and asymptote of learning in eyeblink conditioning (e.g., Schneiderman & Gormerzano, 1964). Thus, the contiguity between the CS and US in delay conditioning alters the timing problem that the animal faces, effectively making the time interval to be learned shorter, and rendering the task easier for most ISIs. In future work, it will be important to characterize the exact mathematical properties that constrain the temporal microstimuli. Our simple Gaussian basis function approach suffices for the datasets 7 examined here (cf. Ludvig et al., 2008), but other related mathematical functions are certainly possible. For example, replacing the temporal microstimuli in our model with the spectral traces of Grossberg & Schmajuk (1989) produces results that are similar to ours, but using sequences of Gamma-shaped functions tends to fail, with longer intervals learned too slowly relative to shorter intervals. One important characteristic of the microstimulus series seems to be that the heights of individual elements should not decay too quickly. 
Another key challenge for future modeling is reconciling this abstract account of hippocampal function in trace conditioning with approaches that consider greater physiological detail (e.g., Rodriguez & Levy, 2001; Yamazaki & Tanaka, 2005). The current model also contributes to our understanding of the TD models of dopamine (e.g., Schultz et al., 1997) and classical conditioning (Sutton & Barto, 1990). These models have often given short shrift to issues of stimulus representation, focusing more closely on the properties of the learning algorithm (but see Ludvig et al., 2008). Here, we reveal how the interaction of various stimulus representations, in conjunction with the TD learning rule, produces a viable model of some of the differences between trace and delay conditioning.

References

Beylin, A. V., Gandhi, C. C., Wood, G. E., Talk, A. C., Matzel, L. D., & Shors, T. J. (2001). The role of the hippocampus in trace conditioning: Temporal discontinuity or task difficulty? Neurobiology of Learning & Memory, 76, 447-461.
Gould, E., Beylin, A., Tanapat, P., Reeves, A., & Shors, T. J. (1999). Learning enhances adult neurogenesis in the hippocampal formation. Nature Neuroscience, 2, 260-265.
Grossberg, S., & Schmajuk, N. A. (1989). Neural dynamics of adaptive timing and temporal discrimination during associative learning. Neural Networks, 2, 79-102.
Ludvig, E. A., Sutton, R. S., & Kehoe, E. J. (2008). Stimulus representation and the timing of reward-prediction errors in models of the dopamine system. Neural Computation, 20, 3034-3054.
Machado, A. (1997). Learning the temporal dynamics of behavior. Psychological Review, 104, 241-265.
McEchron, M. D., Bouwmeester, H., Tseng, W., Weiss, C., & Disterhoft, J. F. (1998). Hippocampectomy disrupts auditory trace fear conditioning and contextual fear conditioning in the rat. Hippocampus, 8, 638-646.
McEchron, M. D., & Disterhoft, J. F. (1997). Sequence of single neuron changes in CA1 hippocampus of rabbits during acquisition of trace eyeblink conditioned responses. Journal of Neurophysiology, 78, 1030-1044.
Moyer, J. R., Jr., Deyo, R. A., & Disterhoft, J. F. (1990). Hippocampectomy disrupts trace eye-blink conditioning in rabbits. Behavioral Neuroscience, 104, 243-252.
Pavlov, I. P. (1927). Conditioned Reflexes. London: Oxford University Press.
Port, R. L., Romano, A. G., Steinmetz, J. E., Mikhail, A. A., & Patterson, M. M. (1986). Retention and acquisition of classical trace conditioned responses by rabbits with hippocampal lesions. Behavioral Neuroscience, 100, 745-752.
Rodriguez, P., & Levy, W. B. (2001). A model of hippocampal activity in trace conditioning: Where’s the trace? Behavioral Neuroscience, 115, 1224-1238.
Schmajuk, N. A., & DiCarlo, J. J. (1992). Stimulus configuration, classical conditioning, and hippocampal function. Psychological Review, 99, 268-305.
Schneiderman, N., & Gormezano, I. (1964). Conditioning of the nictitating membrane of the rabbit as a function of CS-US interval. Journal of Comparative and Physiological Psychology, 57, 188-195.
Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275, 1593-1599.
Solomon, P. R., Vander Schaaf, E. R., Thompson, R. F., & Weisz, D. J. (1986). Hippocampus and trace conditioning of the rabbit’s classically conditioned nictitating membrane response. Behavioral Neuroscience, 100, 729-744.
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9-44.
Sutton, R. S., & Barto, A. G. (1990). Time-derivative models of Pavlovian reinforcement. In M. Gabriel & J. Moore (Eds.), Learning and Computational Neuroscience: Foundations of Adaptive Networks (pp. 497-537). Cambridge, MA: MIT Press.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.
Tseng, W., Guan, R., Disterhoft, J. F., & Weiss, C. (2004). Trace eyeblink conditioning is hippocampally dependent in mice. Hippocampus, 14, 58-65.
Wallenstein, G., Eichenbaum, H., & Hasselmo, M. (1998). The hippocampus as an associator of discontiguous events. Trends in Neurosciences, 21, 317-323.
Yamazaki, T., & Tanaka, S. (2005). A neural network model for trace conditioning. International Journal of Neural Systems, 15, 23-30.
same-paper 2 0.80582041 40 nips-2008-Bounds on marginal probability distributions
Author: Joris M. Mooij, Hilbert J. Kappen
Abstract: We propose a novel bound on single-variable marginal probability distributions in factor graphs with discrete variables. The bound is obtained by propagating local bounds (convex sets of probability distributions) over a subtree of the factor graph, rooted in the variable of interest. By construction, the method not only bounds the exact marginal probability distribution of a variable, but also its approximate Belief Propagation marginal (“belief”). Thus, apart from providing a practical means to calculate bounds on marginals, our contribution also lies in providing a better understanding of the error made by Belief Propagation. We show that our bound outperforms the state-of-the-art on some inference problems arising in medical diagnosis. 1
3 0.79700714 54 nips-2008-Covariance Estimation for High Dimensional Data Vectors Using the Sparse Matrix Transform
Author: Guangzhi Cao, Charles Bouman
Abstract: Covariance estimation for high dimensional vectors is a classically difficult problem in statistical analysis and machine learning. In this paper, we propose a maximum likelihood (ML) approach to covariance estimation, which employs a novel sparsity constraint. More specifically, the covariance is constrained to have an eigen decomposition which can be represented as a sparse matrix transform (SMT). The SMT is formed by a product of pairwise coordinate rotations known as Givens rotations. Using this framework, the covariance can be efficiently estimated using greedy minimization of the log likelihood function, and the number of Givens rotations can be efficiently computed using a cross-validation procedure. The resulting estimator is positive definite and well-conditioned even when the sample size is limited. Experiments on standard hyperspectral data sets show that the SMT covariance estimate is consistently more accurate than both traditional shrinkage estimates and recently proposed graphical lasso estimates for a variety of different classes and sample sizes. 1
Author: Liu Yang, Rong Jin, Rahul Sukthankar
Abstract: The cluster assumption is exploited by most semi-supervised learning (SSL) methods. However, if the unlabeled data is merely weakly related to the target classes, it becomes questionable whether driving the decision boundary to the low density regions of the unlabeled data will help the classification. In such case, the cluster assumption may not be valid; and consequently how to leverage this type of unlabeled data to enhance the classification accuracy becomes a challenge. We introduce “Semi-supervised Learning with Weakly-Related Unlabeled Data” (SSLW), an inductive method that builds upon the maximum-margin approach, towards a better usage of weakly-related unlabeled information. Although the SSLW could improve a wide range of classification tasks, in this paper, we focus on text categorization with a small training pool. The key assumption behind this work is that, even with different topics, the word usage patterns across different corpora tends to be consistent. To this end, SSLW estimates the optimal wordcorrelation matrix that is consistent with both the co-occurrence information derived from the weakly-related unlabeled documents and the labeled documents. For empirical evaluation, we present a direct comparison with a number of stateof-the-art methods for inductive semi-supervised learning and text categorization. We show that SSLW results in a significant improvement in categorization accuracy, equipped with a small training set and an unlabeled resource that is weakly related to the test domain.
5 0.68733919 4 nips-2008-A Scalable Hierarchical Distributed Language Model
Author: Andriy Mnih, Geoffrey E. Hinton
Abstract: Neural probabilistic language models (NPLMs) have been shown to be competitive with and occasionally superior to the widely-used n-gram language models. The main drawback of NPLMs is their extremely long training and testing times. Morin and Bengio have proposed a hierarchical language model built around a binary tree of words, which was two orders of magnitude faster than the nonhierarchical model it was based on. However, it performed considerably worse than its non-hierarchical counterpart in spite of using a word tree created using expert knowledge. We introduce a fast hierarchical language model along with a simple feature-based algorithm for automatic construction of word trees from the data. We then show that the resulting models can outperform non-hierarchical neural models as well as the best n-gram models. 1
7 0.68522775 118 nips-2008-Learning Transformational Invariants from Natural Movies
8 0.68495935 231 nips-2008-Temporal Dynamics of Cognitive Control
9 0.68438935 50 nips-2008-Continuously-adaptive discretization for message-passing algorithms
10 0.68436331 195 nips-2008-Regularized Policy Iteration
11 0.68435806 49 nips-2008-Clusters and Coarse Partitions in LP Relaxations
12 0.6836071 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
13 0.68034226 113 nips-2008-Kernelized Sorting
14 0.67925262 16 nips-2008-Adaptive Template Matching with Shift-Invariant Semi-NMF
15 0.67911887 138 nips-2008-Modeling human function learning with Gaussian processes
16 0.67897671 179 nips-2008-Phase transitions for high-dimensional joint support recovery
17 0.67873269 37 nips-2008-Biasing Approximate Dynamic Programming with a Lower Discount Factor
18 0.67855781 96 nips-2008-Hebbian Learning of Bayes Optimal Decisions
19 0.67849714 196 nips-2008-Relative Margin Machines
20 0.67829418 216 nips-2008-Sparse probabilistic projections