nips nips2007 nips2007-48 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: M. A. S. Elmohamed, Dexter Kozen, Daniel R. Sheldon
Abstract: We investigate a family of inference problems on Markov models, where many sample paths are drawn from a Markov chain and partial information is revealed to an observer who attempts to reconstruct the sample paths. We present algorithms and hardness results for several variants of this problem which arise by revealing different information to the observer and imposing different requirements for the reconstruction of sample paths. Our algorithms are analogous to the classical Viterbi algorithm for Hidden Markov Models, which finds the single most probable sample path given a sequence of observations. Our work is motivated by an important application in ecology: inferring bird migration paths from a large database of observations. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We investigate a family of inference problems on Markov models, where many sample paths are drawn from a Markov chain and partial information is revealed to an observer who attempts to reconstruct the sample paths. [sent-7, score-0.326]
2 Our algorithms are analogous to the classical Viterbi algorithm for Hidden Markov Models, which finds the single most probable sample path given a sequence of observations. [sent-9, score-0.332]
3 Our work is motivated by an important application in ecology: inferring bird migration paths from a large database of observations. [sent-10, score-0.624]
4 Each state generates an output symbol from alphabet Σ, and these output symbols constitute the data or observations. [sent-12, score-0.411]
5 A classical problem, solved by the Viterbi algorithm, is to find the most probable sample path given certain observations for a given Markov model. [sent-13, score-0.442]
6 We call this the single path problem; it is well suited to labeling or tagging a single sequence of data. [sent-14, score-0.29]
7 We introduce two generalizations of the single path problem for performing collective inference on Markov models, motivated by an effort to model bird migration patterns using a large database of static observations. [sent-16, score-0.684]
8 1 Observations report location, date, species and number of birds observed. [sent-18, score-0.323]
9 The eBird data set is very rich; the human eye can easily discern migration patterns from animations showing the observations as they unfold over time on a map of North America. [sent-19, score-0.347]
10 Conclusions about migration patterns are made by the human observer. [sent-21, score-0.237]
11 Our goal is to build a mathematical framework to infer dynamic migration models from the static eBird data. [sent-22, score-0.263]
12 Quantitative migration models are of great scientific and practical import: for example, this problem arose out of an interdisciplinary project at Cornell University to model the possible spread of avian influenza in North America through wild bird migration. [sent-23, score-0.405]
13 The migratory behavior for a species of birds can be modeled using a single generative process that independently governs how individual birds fly between locations, giving rise to the following [sent-24, score-0.524]
14 inference problem: a hidden experiment simultaneously draws many independent sample paths from a Markov chain, and the observations reveal aggregate information about the collection of sample paths at each time step, from which the observer attempts to reconstruct the paths. (Footnote: http://ebird.net/visualization) [sent-27, score-0.763]
15 In the multiple path problem, we assume that exactly M independent sample paths are drawn from the Markov model, and the observations reveal the number of paths that output symbol α at time t, for each α and t. [sent-30, score-1.13]
16 The observer seeks the most likely collection of paths given the observations. [sent-31, score-0.363]
17 The fractional path problem is a further generalization in which paths are divisible entities. [sent-32, score-0.723]
18 The observations reveal the fraction of paths that output symbol α at time t, and the observer’s job is to find the most likely (in a sense to be defined later) weighted collection of paths given the observations. [sent-33, score-0.849]
19 Conceptually, the fractional path problem can be derived from the multiple path problem by letting M go to infinity; or it has a probabilistic interpretation in terms of distributions over paths. [sent-34, score-0.796]
20 After discussing some preliminaries in section 2, sections 3 and 4 present algorithms for the multiple and fractional path problems, respectively, using network flow techniques on the trellis graph of the Markov model. [sent-35, score-0.645]
21 The multiple path problem in its most general form is NP-hard, but can be solved as an integer program. [sent-36, score-0.376]
22 The special case when output symbols uniquely identify their associated states can be solved efficiently as a flow problem; although the single path problem is trivial in this case, the multiple and fractional path problems remain interesting. [sent-37, score-0.987]
23 The fractional path problem can be solved by linear programming. [sent-38, score-0.457]
24 We also introduce a practical extension to the fractional path problem, including slack variables allowing the solution to deviate slightly from potentially noisy observations. [sent-39, score-0.6]
25 In section 5, we demonstrate our techniques with visualizations for the migration of Archilochus colubris, the Ruby-throated Hummingbird, devoting some attention to a challenging problem we have neglected so far: estimating species distributions from eBird observations. [sent-40, score-0.391]
26 Each state generates a unique output symbol from alphabet Σ, given by the mapping σ : V → Σ. [sent-49, score-0.238]
27 Let Y = V^T be the set of all possible sample paths of length T. [sent-54, score-0.256]
28 We represent a path y ∈ Y as a row vector y = (y1, . . . , yT), [sent-55, score-0.29]
29 and a collection of M paths as the M × T matrix Y = (y_it), with each row y_i· representing an independent sample path. [sent-58, score-0.293]
30 Then, for example, we write Prπ [Yt = u] to be the probability under π that the tth state is u, and Eπ [f (Y )] to be the expected value of f (Y ) for any function f of a random path Y drawn from π. [sent-64, score-0.325]
31 States u and v output the symbol 0, and state w outputs the symbol 1. [sent-69, score-0.286]
32 (a) The bold path is feasible for the specified observations, with probability p(s, u)p(u, u)p(u, w). [sent-70, score-0.335]
33 The bold path has cost c(s, u) + c(u, u) + c(u, w). [sent-72, score-0.331]
34 Given observed output symbols α1, . . . , αT, we seek the most probable path y given these observations. [sent-76, score-0.332]
35 We call path y feasible if σ(yt ) = αt for all t; then we wish to maximize λ(y) over feasible y. [sent-77, score-0.38]
36 Here, the states are replicated for each time step, and edges connect a state at time t to its possible successors at time t + 1, labeled with the transition probability. [sent-79, score-0.23]
37 A feasible path must pass through partition Vαt at step t, so we can prune all edges incident on other partitions, leaving only feasible paths. [sent-80, score-0.419]
38 By defining the cost of an edge as c(u, v) = −log p(u, v), and letting the path cost c(y) be the sum of its edge costs, straightforward algebra shows that arg max_y λ(y) = arg min_y c(y), [sent-81, score-0.42]
39 i.e., the path of maximum probability becomes the path of minimum cost under this transformation. [sent-83, score-0.621]
40 Thus the Viterbi algorithm finds the shortest feasible path in the trellis using edge lengths c(u, v). [sent-84, score-0.494]
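The shortest-path view of Viterbi can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: the dictionary-based model representation and the toy states below are invented, and the recursion simply minimizes the summed edge costs c(u, v) = −log p(u, v) over feasible trellis paths.

```python
import math

def viterbi(p_init, p_trans, sigma, obs):
    """Most probable sample path consistent with the observed output
    symbols, computed as a shortest path on the trellis with edge costs
    c(u, v) = -log p(u, v)."""
    states = list(p_init)
    T = len(obs)
    INF = math.inf
    # Cost of feasible length-1 prefixes (the first state must emit obs[0]).
    cost = {v: -math.log(p_init[v])
            if p_init.get(v, 0.0) > 0 and sigma[v] == obs[0] else INF
            for v in states}
    back = []
    for t in range(1, T):
        step, new_cost = {}, {}
        for v in states:
            best_u, best = None, INF
            if sigma[v] == obs[t]:  # prune states outside partition V_{alpha_t}
                for u in states:
                    puv = p_trans.get(u, {}).get(v, 0.0)
                    if puv > 0 and cost[u] - math.log(puv) < best:
                        best_u, best = u, cost[u] - math.log(puv)
            new_cost[v], step[v] = best, best_u
        cost, back = new_cost, back + [step]
    # Backtrack from the minimum-cost (maximum-probability) end state.
    v = min(cost, key=cost.get)
    total, path = cost[v], [v]
    for step in reversed(back):
        v = step[v]
        path.append(v)
    path.reverse()
    return path, math.exp(-total)
```

On a toy model where states 'a' and 'b' emit '0' and 'c' emits '1', the recursion picks the feasible path of largest product probability, exactly as the cost transformation above predicts.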
41 3 Multiple Path Problem In the multiple path problem, M sample paths are drawn from the model and the observations reveal the number of paths Nt (α) that output α at time t, for all α and t; or, equivalently, the multiset At of output symbols at time t. [sent-85, score-1.235]
42 Here (1) is the probability of a collection of M paths, λ(Y) = Π_{i=1}^{M} λ(y_i·) (with each λ(y_i·) expanding to a product over t), and (2) is the objective of maximizing λ(Y) subject to the observations. Reduction to the Single Path Problem: a naive approach to the multiple path problem reduces it to the single path problem by creating a new Markov model on the product state set V^M, where a state is a tuple ⟨v1, . . . , vM⟩ of states of the original model. [sent-93, score-0.678]
43 A state from the product space V^M corresponds to an entire column of the matrix Y, and by changing the order of multiplication in (1), we see that the probability of a path in the new model is equal to the probability of the entire collection of paths in the old model. [sent-103, score-0.618]
44 Thus we can write the optimization problem in (2) as the following flow integer program, with additional constraints that the flow paths generate the correct observations. [sent-113, score-0.333]
45 The decision variable x^t_{uv} indicates the flow traveling from u to v at time t; or, the number of sample paths that transition from u to v at time t. [sent-114, score-0.791]
46 (IP)   min Σ_{u,v,t} c(u,v) x^t_{uv}
s.t.   Σ_u x^t_{uv} = Σ_w x^{t+1}_{vw}   for all v, t,   (3)
       Σ_{u∈Vα, v∈V} x^t_{uv} = N_t(α)   for all α, t,   (4)
       x^t_{uv} ∈ ℕ   for all u, v, t. [sent-117, score-0.626]
47 The flow conservation constraints (3) are standard: the flow into v at time t is equal to the flow leaving v at time t + 1. [sent-118, score-0.42]
48 These also imply that exactly M units of flow pass through each level of the trellis: summing over all α, Σ_{u,v} x^t_{uv} = Σ_α Σ_{u∈Vα, v∈V} x^t_{uv} = Σ_α N_t(α) = M. [sent-120, score-0.949]
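For intuition, the objective that IP optimizes can be brute-forced on a toy instance: enumerate every unordered collection of M sample paths, keep those whose per-timestep symbol counts match N_t(α), and take the most probable. This sketch is only an illustration of the problem statement (the model and observations are invented), and it is exponential in M and T, which is why the paper works with the flow formulation instead.

```python
import itertools
import math
from collections import Counter

def best_collection(states, p_init, p_trans, sigma, M, counts):
    """Brute-force the multiple path problem: find the most probable
    collection of M sample paths whose symbol counts at each time step
    match the observed multisets (counts[t] maps symbol -> N_t(symbol))."""
    T = len(counts)

    def prob(y):
        p = p_init.get(y[0], 0.0)
        for a, b in zip(y, y[1:]):
            p *= p_trans.get(a, {}).get(b, 0.0)
        return p

    support = [y for y in itertools.product(states, repeat=T) if prob(y) > 0]
    best, best_p = None, 0.0
    for combo in itertools.combinations_with_replacement(support, M):
        # Check the observation constraints: symbol counts at every t.
        if all(Counter(sigma[y[t]] for y in combo) == counts[t] for t in range(T)):
            p = math.prod(prob(y) for y in combo)
            if p > best_p:
                best, best_p = combo, p
    return best, best_p
```

With two paths and the one-to-many output map below, the optimizer trades off which path emits each required symbol, mirroring the role of constraints (4).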
49 3 An Efficient Special Case In the special case when σ is one-to-one, the output symbols uniquely identify their generating states, so we may assume that Σ = V , and the output symbol is always the name of the current state. [sent-129, score-0.329]
50 To see how the problem IP simplifies, we now have Vu = {u} for all u, so each partition consists of a single state, and the observations completely specify the flow through each node in the trellis: Σ_v x^t_{uv} = N_t(u) for all u, t. (4') [sent-130, score-0.57]
51 Substituting the new observation constraints (4') for time t+1 into the RHS of the flow conservation constraints (3) for time t yields the following replacements: Σ_u x^t_{uv} = N_{t+1}(v) for all v, t. (3') [sent-131, score-0.329]
52 This gives an equivalent set of constraints, each of which refers only to the variables x^t_{uv} for a single t. [sent-132, score-0.736]
53 State u at time t has a supply of Nt (u) units of flow coming from the previous step, and we must route Nt+1 (v) units of flow to state v at time t + 1, so we place a demand of Nt+1 (v) at the corresponding node. [sent-138, score-0.289]
54 (IPt)   min Σ_{u,v} c(u,v) x^t_{uv}
55 s.t.   Σ_u x^t_{uv} = N_{t+1}(v)   ∀v,   (3')
         Σ_v x^t_{uv} = N_t(u)   ∀u,   (4')
         x^t_{uv} ∈ ℕ   ∀u, v.
[Figure (a): a transportation instance between consecutive trellis levels, with supplies N_t(·), demands N_{t+1}(·), and edge costs c(u, v).] [sent-145, score-0.445] [sent-147, score-1.181]
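Each per-timestep subproblem (IPt) is a small transportation problem. The sketch below solves one by exhaustive enumeration over integer flow matrices; the supply, demand, and cost values are invented, and a real instance would use a min-cost-flow or transportation solver rather than enumeration.

```python
import itertools

def solve_transport(supply, demand, cost):
    """Brute-force the transportation subproblem (IPt): route integer
    units from supply nodes (states at time t) to demand nodes (states
    at time t+1) at minimum total cost.  Exponential; tiny instances only."""
    us, vs = sorted(supply), sorted(demand)
    cells = [(u, v) for u in us for v in vs]
    best_cost, best_x = None, None
    for vals in itertools.product(*(range(min(supply[u], demand[v]) + 1)
                                    for u, v in cells)):
        x = dict(zip(cells, vals))
        if any(sum(x[u, v] for v in vs) != supply[u] for u in us):
            continue  # supply constraints (4')
        if any(sum(x[u, v] for u in us) != demand[v] for v in vs):
            continue  # demand constraints (3')
        c = sum(cost[u, v] * x[u, v] for u, v in cells)
        if best_cost is None or c < best_cost:
            best_cost, best_x = c, x
    return best_cost, best_x
```

Solving one such subproblem per time step is exactly the decomposition the one-to-one special case permits.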
56 4 Fractional Path Problem In the fractional path problem, a path is a divisible entity. [sent-150, score-0.757]
57 The observations specify qt (α), the fraction of paths that output α at time t, and the observer chooses π(y) fractional units of each path y, totaling one unit, such that qt (α) units output α at time t. [sent-151, score-1.737]
58 Put another way, π is a distribution over paths such that Prπ[Yt ∈ Vα] = qt(α), [sent-153, score-0.537]
59 i.e., qt specifies the marginal distribution over symbols at time t. [sent-155, score-0.42]
60 By taking the logarithm, an equivalent objective is to maximize Eπ [log λ(Y )], so we seek the distribution π that maximizes the expected log-probability of a path Y drawn from π. [sent-156, score-0.29]
61 Conceptually, the fractional path problem arises by letting M → ∞ in the multiple path problem and normalizing to let qt (α) = Nt (α)/M specify the fraction of paths that output α at time t. [sent-157, score-1.463]
62 Operationally, the fractional path problem is modeled by the LP relaxation of IP, which routes one splittable unit of flow through the trellis. [sent-158, score-0.531]
63 (RELAX)   min Σ_{u,v,t} c(u,v) x^t_{uv}
s.t.   Σ_u x^t_{uv} = Σ_w x^{t+1}_{vw}   for all v, t,
       Σ_{u∈Vα, v∈V} x^t_{uv} = q_t(α)   for all α, t,   (5)
       x^t_{uv} ≥ 0   for all u, v, t. [sent-161, score-1.198]
64 Given any distribution π, let x^t_{uv} = Prπ[Yt = u, Yt+1 = v]; then x is a flow, because the probability a path enters v at time t is equal to the probability it leaves v at time t + 1. [sent-163, score-0.789]
65 Conversely, given a unit flow x, any path decomposition assigning flow π(y) to each y ∈ Y is a probability distribution because the total flow is one. [sent-164, score-0.336]
66 Furthermore, under this correspondence, x satisfies the marginal constraints (5) if and only if π has the correct marginals: Σ_{u∈Vα} Σ_{v∈V} x^t_{uv} = Σ_{u∈Vα} Σ_{v∈V} Pr[Yt = u, Yt+1 = v] = Σ_{u∈Vα} Pr[Yt = u] = Pr[Yt ∈ Vα]. [sent-166, score-0.491]
67 Finally, we can rewrite the objective function in terms of paths: Σ_{u,v,t} c(u,v) x^t_{uv} = Σ_{y∈Y} π(y) c(y) = Eπ[c(Y)] = Eπ[−log λ(Y)]. [sent-167, score-0.291]
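This flow/distribution correspondence is easy to check numerically on a made-up example: build x^t_{uv} from a distribution π over paths, verify flow conservation, and confirm that the linear cost of x equals the expected path cost under π (edge costs only; the initial-state term of the trellis is omitted here for brevity). The model and π below are invented.

```python
import math

# Hypothetical two-state chain and a distribution pi over length-3 paths.
p_trans = {'a': {'a': 0.6, 'b': 0.4}, 'b': {'b': 1.0}}
pi = {('a', 'a', 'b'): 0.5, ('a', 'b', 'b'): 0.3, ('a', 'a', 'a'): 0.2}
states, T = ['a', 'b'], 3

# x^t_{uv} = Pr_pi[Y_t = u, Y_{t+1} = v].
x = {}
for y, w in pi.items():
    for t in range(T - 1):
        key = (t, y[t], y[t + 1])
        x[key] = x.get(key, 0.0) + w

# Flow conservation: flow entering v at time t equals flow leaving v at t+1.
conserved = all(
    abs(sum(x.get((t, u, v), 0.0) for u in states)
        - sum(x.get((t + 1, v, w2), 0.0) for w2 in states)) < 1e-12
    for t in range(T - 2) for v in states)

def c(u, v):
    return -math.log(p_trans[u][v])  # trellis edge cost

# Linear cost of the flow vs. expected path cost under pi.
lhs = sum(c(u, v) * val for (t, u, v), val in x.items())
rhs = sum(w * sum(c(y[t], y[t + 1]) for t in range(T - 1))
          for y, w in pi.items())
```

Both checks hold by construction, matching the argument that RELAX searches over exactly the distributions with the prescribed marginals.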
68 y∈Y By switching signs and changing from minimization to maximization, we see that RELAX solves the fractional path problem. [sent-168, score-0.43]
69 1 Incorporating Slack In our application, the marginal distributions qt (·) are themselves estimates, and it is useful to allow the LP to deviate slightly from these marginals to find a better overall solution. [sent-172, score-0.346]
70 To accomplish this, we add slack variables δ^t_α into the marginal constraints (5), and charge for the slack in the objective function. [sent-173, score-0.359]
71 The new marginal constraints are Σ_{u∈Vα, v∈V} x^t_{uv} = q_t(α) + δ^t_α for all α, t, (5') and we add the term Σ_{α,t} γ^t_α |δ^t_α| to the objective function to charge for the slack, using a standard LP trick [8] to model the absolute value term. [sent-174, score-0.813]
72 The slack costs γ^t_α can be tailored to individual input values; for example, one may want to charge more to deviate from a confident estimate. [sent-175, score-0.259]
73 We also add the necessary constraints to ensure that the new marginals q̃_t(α) = q_t(α) + δ^t_α form a valid probability distribution for all t. [sent-177, score-0.639]
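The standard absolute-value trick mentioned above can be stated concretely: each slack δ is split as δ = d⁺ − d⁻ with d⁺, d⁻ ≥ 0, and the objective charges γ(d⁺ + d⁻); since d⁺ + d⁻ ≥ |d⁺ − d⁻| for nonnegative d⁺, d⁻, minimization drives at most one of the pair to be nonzero and the charge equals γ|δ|. A minimal sketch (the function name is ours):

```python
def split_abs(delta):
    """LP trick for an absolute-value penalty gamma*|delta|: substitute
    delta = d_plus - d_minus with d_plus, d_minus >= 0 and charge
    gamma*(d_plus + d_minus).  The optimal split puts the positive part
    in d_plus and the negative part in d_minus, so the charge is |delta|."""
    d_plus, d_minus = max(delta, 0.0), max(-delta, 0.0)  # the optimal split
    assert d_plus - d_minus == delta  # still feasible for constraint (5')
    return d_plus + d_minus
```

Any other feasible split is at least as expensive; for instance, (d⁺, d⁻) = (0.5, 0.7) also represents δ = −0.2 but costs 1.2 instead of 0.2.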
74 5 Demonstration In this section, we demonstrate our techniques by using the fractional path problem to create visualizations showing likely migration routes of Archilochus colubris, the Ruby-throated Hummingbird, a common bird whose range is relatively well covered by eBird observations. [sent-178, score-0.885]
75 We must specify the Markov model governing transitions between locations (grid cells) in successive weeks; also, we require estimates qt (·) for the weekly distributions of hummingbirds across locations. [sent-180, score-0.43]
76 In the appendix, we outline one approach based on harmonic energy minimization [12], but we may use any technique that produces weekly distributions q_t(u) and slack costs γ^t_u. [sent-182, score-0.58]
77 Finally, although our final observations qt (·) are distributions over states (locations) and not output symbols — i. [sent-184, score-0.576]
78 On the eBird website, birdwatchers submit checklists of birds they observe, indicating a count for each species, along with the location, date, time and additional information. [sent-190, score-0.265]
79 2 Migration Inference: Given weekly distributions q_t(u) and slack costs γ^t_u (see the appendix), it remains to specify the Markov model. [sent-199, score-0.586]
80 To reduce problem size, we omitted variables x^t_{uv} from the LP when d(u, v) > 1350 km, effectively setting p(u, v) = 0. [sent-202, score-0.445]
81 We also found it useful to impose upper bounds δ^t_u ≤ q_t(u) on the slack variables, so no single value could increase by more than a factor of two. [sent-203, score-0.417]
82 Figure 3 displays the migration paths our model inferred for the four weeks starting on the dates indicated. [sent-205, score-0.537]
83 The top row shows the distribution and paths inferred by the model; grid cells colored in lighter shades have more birds (higher values for q_t(u)). (Footnote 3: Users may enter historical observations.) [sent-206, score-0.347] [sent-209, score-0.482]
85 Arrows indicate flight paths (x^t_{uv}) between the week shown and the following week, with line width proportional to flow x^t_{uv}. [sent-210, score-0.806]
86 In the bottom row, the raw data is given for comparison. [sent-211, score-0.291]
87 The inferred distributions and paths are consistent with both seasonal ranges and written accounts of migration routes. [sent-214, score-0.493]
88 For example, in the summary paragraph on migration from the Archilochus colubris species account in Birds of North America [13], Robinson et al. [sent-215, score-0.432]
89 A Estimating Weekly Distributions from eBird Our goal is to estimate qt (u), the fraction of birds in grid cell u during week t. [sent-328, score-0.674]
90 Given enough observations, we can estimate qt (u) using the average number of birds counted per checklist, a quantity we call the rate rt (u). [sent-329, score-0.514]
91 However, even for a bird with good eBird coverage, there are cells with few or no observations during some weeks. [sent-330, score-0.259]
92 This technique uses a graph-based similarity structure, in our case the 3-dimensional lattice built on points ut , where ut represents cell u during week t. [sent-332, score-0.484]
93 Point ut is connected to its four grid neighbors in time slice t by edges of unit weight, excluding edges between cells separated by water (specifically, when the line connecting the centers is more than half water). [sent-334, score-0.438]
94 The value of f at interior point ut is determined by the expected value of the following random experiment: perform a random walk starting from ut , following outgoing edges with probability proportional to their weight. [sent-338, score-0.456]
95 We derive a measure of confidence in the value f (ut ) from the same experiment: let h(ut ) be the expected number of steps for the random walk from ut to hit the boundary (the hitting time of the boundary set [14]). [sent-341, score-0.428]
96 When h(ut ) is small, ut is close to the boundary and we are more confident in f (ut ). [sent-342, score-0.253]
97 As point ut gains more observations, its behavior approaches that of a hard boundary: with probability approaching one, the walk from ut will reach an observation in the first step, so f (ut ) will approach rt (u), the average of the observations. [sent-344, score-0.407]
98 Since f(u_t) approximates the rate r_t(u), we multiply by the land mass of cell u to get an estimate q̂_t(u) for the (relative) number of birds in cell u at time t. [sent-347, score-0.655]
99 Finally, we normalize q̂ for each time slice t, taking q_t(u) = q̂_t(u) / Σ_u q̂_t(u). [sent-348, score-0.897]
100 For slack costs, we set γ^t_u = γ_0/h(u_t) to be inversely proportional to the boundary hitting time, with γ_0 ≈ 261 chosen in conjunction with the transition costs in section 5. [sent-349, score-0.331]
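The appendix's harmonic interpolation and hitting-time confidence can be sketched by simple fixed-point iteration on a tiny graph. This assumes unit edge weights and hypothetical node names; the paper's computation runs on the weighted 3-dimensional lattice described above.

```python
def harmonic_interpolate(neighbors, boundary, iters=2000):
    """Gauss-Seidel sketch of harmonic energy minimization: f is fixed on
    boundary nodes and equals the average of its neighbors elsewhere,
    i.e., the expected boundary value reached by a random walk started
    at the node (unit edge weights assumed)."""
    f = {u: boundary.get(u, 0.0) for u in neighbors}
    for _ in range(iters):
        for u in neighbors:
            if u not in boundary:
                f[u] = sum(f[v] for v in neighbors[u]) / len(neighbors[u])
    return f

def hitting_time(neighbors, boundary, iters=2000):
    """h(u) = expected number of steps for the walk from u to hit the
    boundary set; small h means a confident interpolated value."""
    h = {u: 0.0 for u in neighbors}
    for _ in range(iters):
        for u in neighbors:
            if u not in boundary:
                h[u] = 1.0 + sum(h[v] for v in neighbors[u]) / len(neighbors[u])
    return h
```

On a 5-node path with boundary values 0 and 1 at the ends, f converges to the linear interpolation, and h is largest at the midpoint, farthest from the boundary.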
wordName wordTfidf (topN-words)
[('uv', 0.291), ('path', 0.29), ('qt', 0.281), ('ow', 0.26), ('paths', 0.256), ('ebird', 0.256), ('migration', 0.237), ('birds', 0.201), ('ut', 0.169), ('nt', 0.162), ('trellis', 0.159), ('xt', 0.154), ('fractional', 0.14), ('slack', 0.136), ('bird', 0.131), ('species', 0.122), ('symbols', 0.112), ('week', 0.105), ('symbol', 0.095), ('ipt', 0.091), ('north', 0.087), ('boundary', 0.084), ('yt', 0.084), ('observations', 0.083), ('weekly', 0.079), ('archilochus', 0.073), ('colubris', 0.073), ('hummingbird', 0.073), ('observer', 0.07), ('markov', 0.065), ('viterbi', 0.063), ('cornell', 0.061), ('output', 0.061), ('lp', 0.06), ('units', 0.059), ('america', 0.056), ('ip', 0.056), ('multisets', 0.055), ('ornithology', 0.055), ('routes', 0.055), ('costs', 0.048), ('letting', 0.048), ('alphabet', 0.047), ('unit', 0.046), ('grid', 0.046), ('constraints', 0.046), ('cells', 0.045), ('feasible', 0.045), ('supply', 0.044), ('weeks', 0.044), ('specify', 0.042), ('probable', 0.042), ('interior', 0.042), ('cost', 0.041), ('cell', 0.041), ('charge', 0.041), ('pr', 0.04), ('edges', 0.039), ('states', 0.039), ('demand', 0.038), ('walk', 0.037), ('collection', 0.037), ('caruana', 0.037), ('checklists', 0.037), ('divisible', 0.037), ('fink', 0.037), ('hochachka', 0.037), ('saleh', 0.037), ('wild', 0.037), ('harmonic', 0.036), ('transition', 0.036), ('state', 0.035), ('hmms', 0.035), ('reveal', 0.034), ('deviate', 0.034), ('citizen', 0.032), ('land', 0.032), ('robinson', 0.032), ('visualizations', 0.032), ('rt', 0.032), ('marginals', 0.031), ('integer', 0.031), ('conservation', 0.029), ('subproblem', 0.029), ('phillips', 0.029), ('subproblems', 0.029), ('transportation', 0.029), ('multiple', 0.028), ('locations', 0.028), ('graph', 0.028), ('probabilities', 0.028), ('time', 0.027), ('slice', 0.027), ('hitting', 0.027), ('vm', 0.027), ('vw', 0.027), ('solved', 0.027), ('relax', 0.026), ('lab', 0.026), ('static', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 48 nips-2007-Collective Inference on Markov Models for Modeling Bird Migration
Author: M.a. S. Elmohamed, Dexter Kozen, Daniel R. Sheldon
Abstract: We investigate a family of inference problems on Markov models, where many sample paths are drawn from a Markov chain and partial information is revealed to an observer who attempts to reconstruct the sample paths. We present algorithms and hardness results for several variants of this problem which arise by revealing different information to the observer and imposing different requirements for the reconstruction of sample paths. Our algorithms are analogous to the classical Viterbi algorithm for Hidden Markov Models, which finds the single most probable sample path given a sequence of observations. Our work is motivated by an important application in ecology: inferring bird migration paths from a large database of observations. 1
2 0.15798862 213 nips-2007-Variational Inference for Diffusion Processes
Author: Cédric Archambeau, Manfred Opper, Yuan Shen, Dan Cornford, John S. Shawe-taylor
Abstract: Diffusion processes are a family of continuous-time continuous-state stochastic processes that are in general only partially observed. The joint estimation of the forcing parameters and the system noise (volatility) in these dynamical systems is a crucial, but non-trivial task, especially when the system is nonlinear and multimodal. We propose a variational treatment of diffusion processes, which allows us to compute type II maximum likelihood estimates of the parameters by simple gradient techniques and which is computationally less demanding than most MCMC approaches. We also show how a cheap estimate of the posterior over the parameters can be constructed based on the variational free energy. 1
3 0.13352956 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs
Author: Ambuj Tewari, Peter L. Bartlett
Abstract: We present an algorithm called Optimistic Linear Programming (OLP) for learning to optimize average reward in an irreducible but otherwise unknown Markov decision process (MDP). OLP uses its experience so far to estimate the MDP. It chooses actions by optimistically maximizing estimated future rewards over a set of next-state transition probabilities that are close to the estimates, a computation that corresponds to solving linear programs. We show that the total expected reward obtained by OLP up to time T is within C(P ) log T of the reward obtained by the optimal policy, where C(P ) is an explicit, MDP-dependent constant. OLP is closely related to an algorithm proposed by Burnetas and Katehakis with four key differences: OLP is simpler, it does not require knowledge of the supports of transition probabilities, the proof of the regret bound is simpler, but our regret bound is a constant factor larger than the regret of their algorithm. OLP is also similar in flavor to an algorithm recently proposed by Auer and Ortner. But OLP is simpler and its regret bound has a better dependence on the size of the MDP. 1
4 0.11527874 17 nips-2007-A neural network implementing optimal state estimation based on dynamic spike train decoding
Author: Omer Bobrowski, Ron Meir, Shy Shoham, Yonina Eldar
Abstract: It is becoming increasingly evident that organisms acting in uncertain dynamical environments often employ exact or approximate Bayesian statistical calculations in order to continuously estimate the environmental state, integrate information from multiple sensory modalities, form predictions and choose actions. What is less clear is how these putative computations are implemented by cortical neural networks. An additional level of complexity is introduced because these networks observe the world through spike trains received from primary sensory afferents, rather than directly. A recent line of research has described mechanisms by which such computations can be implemented using a network of neurons whose activity directly represents a probability distribution across the possible “world states”. Much of this work, however, uses various approximations, which severely restrict the domain of applicability of these implementations. Here we make use of rigorous mathematical results from the theory of continuous time point process filtering, and show how optimal real-time state estimation and prediction may be implemented in a general setting using linear neural networks. We demonstrate the applicability of the approach with several examples, and relate the required network properties to the statistical nature of the environment, thereby quantifying the compatibility of a given network with its environment. 1
5 0.1138767 119 nips-2007-Learning with Tree-Averaged Densities and Distributions
Author: Sergey Kirshner
Abstract: We utilize the ensemble of trees framework, a tractable mixture over superexponential number of tree-structured distributions [1], to develop a new model for multivariate density estimation. The model is based on a construction of treestructured copulas – multivariate distributions with uniform on [0, 1] marginals. By averaging over all possible tree structures, the new model can approximate distributions with complex variable dependencies. We propose an EM algorithm to estimate the parameters for these tree-averaged models for both the real-valued and the categorical case. Based on the tree-averaged framework, we propose a new model for joint precipitation amounts data on networks of rain stations. 1
6 0.11323982 120 nips-2007-Linear programming analysis of loopy belief propagation for weighted matching
7 0.10438471 214 nips-2007-Variational inference for Markov jump processes
8 0.098225713 197 nips-2007-The Infinite Markov Model
9 0.093391351 146 nips-2007-On higher-order perceptron algorithms
10 0.091222294 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons
11 0.089061983 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning
12 0.083421536 21 nips-2007-Adaptive Online Gradient Descent
13 0.076709047 199 nips-2007-The Price of Bandit Information for Online Optimization
14 0.076359682 13 nips-2007-A Unified Near-Optimal Estimator For Dimension Reduction in $l \alpha$ ($0<\alpha\leq 2$) Using Stable Random Projections
15 0.067526981 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs
16 0.056365397 187 nips-2007-Structured Learning with Approximate Inference
17 0.05593431 20 nips-2007-Adaptive Embedded Subgraph Algorithms using Walk-Sum Analysis
18 0.055375334 204 nips-2007-Theoretical Analysis of Heuristic Search Methods for Online POMDPs
19 0.05448978 206 nips-2007-Topmoumoute Online Natural Gradient Algorithm
20 0.052552231 210 nips-2007-Unconstrained On-line Handwriting Recognition with Recurrent Neural Networks
topicId topicWeight
[(0, -0.19), (1, -0.082), (2, 0.04), (3, -0.004), (4, 0.015), (5, -0.118), (6, -0.009), (7, 0.03), (8, -0.042), (9, -0.187), (10, 0.082), (11, 0.143), (12, -0.071), (13, 0.071), (14, -0.023), (15, 0.025), (16, -0.12), (17, -0.026), (18, 0.041), (19, -0.032), (20, -0.03), (21, 0.068), (22, -0.067), (23, -0.034), (24, 0.101), (25, -0.07), (26, 0.009), (27, 0.031), (28, 0.032), (29, -0.031), (30, -0.073), (31, -0.003), (32, 0.057), (33, 0.123), (34, 0.076), (35, -0.066), (36, 0.025), (37, -0.044), (38, -0.039), (39, 0.044), (40, -0.107), (41, -0.044), (42, -0.109), (43, 0.062), (44, 0.07), (45, -0.09), (46, -0.247), (47, -0.1), (48, -0.092), (49, 0.095)]
simIndex simValue paperId paperTitle
same-paper 1 0.95513797 48 nips-2007-Collective Inference on Markov Models for Modeling Bird Migration
Author: M.a. S. Elmohamed, Dexter Kozen, Daniel R. Sheldon
Abstract: We investigate a family of inference problems on Markov models, where many sample paths are drawn from a Markov chain and partial information is revealed to an observer who attempts to reconstruct the sample paths. We present algorithms and hardness results for several variants of this problem which arise by revealing different information to the observer and imposing different requirements for the reconstruction of sample paths. Our algorithms are analogous to the classical Viterbi algorithm for Hidden Markov Models, which finds the single most probable sample path given a sequence of observations. Our work is motivated by an important application in ecology: inferring bird migration paths from a large database of observations. 1
2 0.55494106 213 nips-2007-Variational Inference for Diffusion Processes
Author: Cédric Archambeau, Manfred Opper, Yuan Shen, Dan Cornford, John S. Shawe-taylor
Abstract: Diffusion processes are a family of continuous-time continuous-state stochastic processes that are in general only partially observed. The joint estimation of the forcing parameters and the system noise (volatility) in these dynamical systems is a crucial, but non-trivial task, especially when the system is nonlinear and multimodal. We propose a variational treatment of diffusion processes, which allows us to compute type II maximum likelihood estimates of the parameters by simple gradient techniques and which is computationally less demanding than most MCMC approaches. We also show how a cheap estimate of the posterior over the parameters can be constructed based on the variational free energy. 1
3 0.53666627 17 nips-2007-A neural network implementing optimal state estimation based on dynamic spike train decoding
Author: Omer Bobrowski, Ron Meir, Shy Shoham, Yonina Eldar
Abstract: It is becoming increasingly evident that organisms acting in uncertain dynamical environments often employ exact or approximate Bayesian statistical calculations in order to continuously estimate the environmental state, integrate information from multiple sensory modalities, form predictions and choose actions. What is less clear is how these putative computations are implemented by cortical neural networks. An additional level of complexity is introduced because these networks observe the world through spike trains received from primary sensory afferents, rather than directly. A recent line of research has described mechanisms by which such computations can be implemented using a network of neurons whose activity directly represents a probability distribution across the possible “world states”. Much of this work, however, uses various approximations, which severely restrict the domain of applicability of these implementations. Here we make use of rigorous mathematical results from the theory of continuous time point process filtering, and show how optimal real-time state estimation and prediction may be implemented in a general setting using linear neural networks. We demonstrate the applicability of the approach with several examples, and relate the required network properties to the statistical nature of the environment, thereby quantifying the compatibility of a given network with its environment. 1
4 0.50790948 119 nips-2007-Learning with Tree-Averaged Densities and Distributions
Author: Sergey Kirshner
Abstract: We utilize the ensemble of trees framework, a tractable mixture over superexponential number of tree-structured distributions [1], to develop a new model for multivariate density estimation. The model is based on a construction of treestructured copulas – multivariate distributions with uniform on [0, 1] marginals. By averaging over all possible tree structures, the new model can approximate distributions with complex variable dependencies. We propose an EM algorithm to estimate the parameters for these tree-averaged models for both the real-valued and the categorical case. Based on the tree-averaged framework, we propose a new model for joint precipitation amounts data on networks of rain stations. 1
5 0.49085677 214 nips-2007-Variational inference for Markov jump processes
Author: Manfred Opper, Guido Sanguinetti
Abstract: Markov jump processes play an important role in a large number of application domains. However, realistic systems are analytically intractable and they have traditionally been analysed using simulation based techniques, which do not provide a framework for statistical inference. We propose a mean field approximation to perform posterior inference and parameter estimation. The approximation allows a practical solution to the inference problem, while still retaining a good degree of accuracy. We illustrate our approach on two biologically motivated systems.
6 0.44688228 146 nips-2007-On higher-order perceptron algorithms
7 0.43061554 176 nips-2007-Sequential Hypothesis Testing under Stochastic Deadlines
8 0.41797891 120 nips-2007-Linear programming analysis of loopy belief propagation for weighted matching
9 0.36962393 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs
10 0.36270899 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning
11 0.35343388 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs
12 0.35067037 199 nips-2007-The Price of Bandit Information for Online Optimization
13 0.32549867 59 nips-2007-Continuous Time Particle Filtering for fMRI
14 0.32320896 21 nips-2007-Adaptive Online Gradient Descent
15 0.3149035 20 nips-2007-Adaptive Embedded Subgraph Algorithms using Walk-Sum Analysis
16 0.3129757 130 nips-2007-Modeling Natural Sounds with Modulation Cascade Processes
17 0.30703941 68 nips-2007-Discovering Weakly-Interacting Factors in a Complex Stochastic Process
18 0.30643028 150 nips-2007-Optimal models of sound localization by barn owls
19 0.30208436 197 nips-2007-The Infinite Markov Model
20 0.30101448 210 nips-2007-Unconstrained On-line Handwriting Recognition with Recurrent Neural Networks
topicId topicWeight
[(5, 0.057), (9, 0.01), (13, 0.045), (16, 0.013), (18, 0.02), (21, 0.04), (31, 0.035), (34, 0.035), (35, 0.025), (45, 0.328), (47, 0.074), (83, 0.118), (85, 0.047), (87, 0.026), (90, 0.049)]
simIndex simValue paperId paperTitle
1 0.82956874 1 nips-2007-A Bayesian Framework for Cross-Situational Word-Learning
Author: Noah Goodman, Joshua B. Tenenbaum, Michael J. Black
Abstract: For infants, early word learning is a chicken-and-egg problem. One way to learn a word is to observe that it co-occurs with a particular referent across different situations. Another way is to use the social context of an utterance to infer the intended referent of a word. Here we present a Bayesian model of cross-situational word learning, and an extension of this model that also learns which social cues are relevant to determining reference. We test our model on a small corpus of mother-infant interaction and find it performs better than competing models. Finally, we show that our model accounts for experimental phenomena including mutual exclusivity, fast-mapping, and generalization from social cues. To understand the difficulty of an infant word-learner, imagine walking down the street with a friend who suddenly says “dax blicket philbin na fivy!” while at the same time wagging her elbow. If you knew any of these words you might infer from the syntax of her sentence that blicket is a novel noun, and hence the name of a novel object. At the same time, if you knew that this friend indicated her attention by wagging her elbow at objects, you might infer that she intends to refer to an object in a nearby show window. On the other hand if you already knew that “blicket” meant the object in the window, you might be able to infer these elements of syntax and social cues. Thus, the problem of early word-learning is a classic chicken-and-egg puzzle: in order to learn word meanings, learners must use their knowledge of the rest of language (including rules of syntax, parts of speech, and other word meanings) as well as their knowledge of social situations. 
But in order to learn about the facts of their language they must first learn some words, and in order to determine which cues matter for establishing reference (for instance, pointing and looking at an object but normally not waggling your elbow) they must first have a way to know the intended referent in some situations. For theories of language acquisition, there are two common ways out of this dilemma. The first involves positing a wide range of innate structures which determine the syntax and categories of a language and which social cues are informative. (Though even when all of these elements are innately determined, using them to learn a language from evidence may not be trivial [1].) The other alternative involves bootstrapping: learning some words, then using those words to learn how to learn more. This paper gives a proposal for the second alternative. We first present a Bayesian model of how learners could use a statistical strategy—cross-situational word-learning—to learn how words map to objects, independent of syntactic and social cues. We then extend this model to a true bootstrapping situation: using social cues to learn words while using words to learn social cues. Finally, we examine several important phenomena in word learning: mutual exclusivity (the tendency to assign novel words to novel referents), fast-mapping (the ability to assign a novel word in a linguistic context to a novel referent after only a single use), and social generalization (the ability to use social context to learn the referent of a novel word). Without adding additional specialized machinery, we show how these can be explained within our model as the result of domain-general probabilistic inference mechanisms operating over the linguistic domain.

Figure 1: Graphical model describing the generation of words (W_s) from an intention (I_s) and lexicon (ℓ), and intention from the objects present in a situation (O_s).
The plate indicates multiple copies of the model for different situation/utterance pairs (s). Dotted portions indicate additions to include the generation of social cues S_s from intentions.

1 The Model

Behind each linguistic utterance is a meaning that the speaker intends to communicate. Our model operates by attempting to infer this intended meaning (which we call the intent) on the basis of the utterance itself and observations of the physical and social context. For the purpose of modeling early word learning—which consists primarily of learning words for simple object categories—in our model, we assume that intents are simply groups of objects. To state the model formally, we assume the non-linguistic situation consists of a set O_s of objects and that utterances are unordered sets of words W_s (see Footnote 1). The lexicon ℓ is a (many-to-many) map from words to objects, which captures the meaning of those words. (Syntax enters our model only obliquely by different treatment of words depending on whether they are in the lexicon or not—that is, whether they are common nouns or other types of words.) In this setting the speaker's intention will be captured by a set of objects in the situation to which she intends to refer: I_s ⊆ O_s. This setup is indicated in the graphical model of Fig. 1. Different situation-utterance pairs (W_s, O_s) are independent given the lexicon ℓ, giving:

    P(W | ℓ, O) = ∏_s ∑_{I_s} P(W_s | I_s, ℓ) · P(I_s | O_s).   (1)

We further simplify by assuming that P(I_s | O_s) ∝ 1 (which could be refined by adding a more detailed model of the communicative intentions a person is likely to form in different situations). We will assume that words in the utterance are generated independently given the intention and the lexicon and that the length of the utterance is observed.
Each word is then generated from the intention set and lexicon by first choosing whether the word is a referential word or a non-referential word (from a binomial distribution of weight γ), then, for referential words, choosing which object in the intent it refers to (uniformly). This process gives:

    P(W_s | I_s, ℓ) = ∏_{w ∈ W_s} [ (1 − γ) P_NR(w | ℓ) + γ ∑_{x ∈ I_s} (1/|I_s|) P_R(w | x, ℓ) ].

The probability of word w referring to object x is

    P_R(w | x, ℓ) ∝ δ_{x ∈ ℓ(w)},   (2)

and the probability of word w occurring as a non-referring word is

    P_NR(w | ℓ) ∝ 1 if ℓ(w) = ∅, κ otherwise   (3)

(this probability is a distribution over all words in the vocabulary, not just those in lexicon ℓ). The constant κ is a penalty for using a word in the lexicon as a non-referring word—this penalty indirectly enforces a light-weight difference between two different groups of words (parts-of-speech): words that refer and words that do not refer. Because the generative structure of this model exposes the role of speaker's intentions, it is straightforward to add non-linguistic social cues. We assume that social cues such as pointing are generated from the speaker's intent independently of the linguistic aspects (as shown in the dotted arrows of Fig. 1). With the addition of social cues S_s, Eq. 1 becomes:

    P(W | ℓ, O) = ∏_s ∑_{I_s} P(W_s | I_s, ℓ) · P(S_s | I_s) · P(I_s | O_s).   (4)

We assume that the social cues are a set S_i(x) of independent binary (cue present or not) feature values for each object x ∈ O_s, which are generated through a noisy-or process:

    P(S_i(x) = 1 | I_s, r_i, b_i) = 1 − (1 − b_i)(1 − r_i)^{δ_{x ∈ I_s}}.   (5)

Here r_i is the relevance of cue i, while b_i is its base rate.

(Footnote 1: Note that, since we ignore word order, the distribution of words in a sentence should be exchangeable given the lexicon and situation. This implies, by de Finetti's theorem, that they are independent conditioned on a latent state—we assume that the latent state giving rise to words is the intention of the speaker.)
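The noisy-or cue generation of Eq. 5 is simple to simulate. A minimal sketch (function and variable names are ours, not from the paper's implementation):

```python
import random

def cue_prob(x_in_intent, relevance, base_rate):
    """Eq. 5: P(S_i(x)=1) = 1 - (1 - b_i) * (1 - r_i)**[x in I_s].
    If x is not in the intent, the cue fires only at its base rate."""
    exponent = 1 if x_in_intent else 0
    return 1.0 - (1.0 - base_rate) * (1.0 - relevance) ** exponent

def sample_cue(x_in_intent, relevance, base_rate, rng=random):
    """Draw one binary cue value from the noisy-or probability."""
    return 1 if rng.random() < cue_prob(x_in_intent, relevance, base_rate) else 0
```

For example, a highly relevant cue (r_i = 0.8) with base rate b_i = 0.1 fires with probability 0.82 on an intended object but only 0.1 on an unintended one.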
For the model without social cues the posterior probability of a lexicon given a set of situated utterances is:

    P(ℓ | W, O) ∝ P(W | ℓ, O) P(ℓ).   (6)

And for the model with social cues the joint posterior over lexicon and cue parameters is:

    P(ℓ, r, b | W, O) ∝ P(W | ℓ, r, b, O) P(ℓ) P(r, b).   (7)

We take the prior probability of a lexicon to be exponential in its size: P(ℓ) ∝ e^{−α|ℓ|}, and the prior probability of social cue parameters to be uniform. Given the model above and the corpus described below, we found the best lexicon (or lexicon and cue parameters) according to Eqs. 6 and 7 by MAP inference using stochastic search (see Footnote 2).

2 Previous work

While cross-situational word-learning has been widely discussed in the empirical literature, e.g., [2], there have been relatively few attempts to model this process computationally. Siskind [3] created an ambitious model which used deductive rules to make hypotheses about propositional word meanings and their use across situations. This model achieved surprising success in learning word meanings in artificial corpora, but was extremely complex and relied on the availability of fully coded representations of the meaning of each sentence, making it difficult to extend to empirical corpus data. More recently, Yu and Ballard [4] have used a machine translation model (similar to IBM Translation Model I) to learn word-object association probabilities. In their study, they used a pre-existing corpus of mother-infant interactions and coded the objects present during each utterance (an example from this corpus—illustrated with our own coding scheme—is shown in Fig. 2). They applied their translation model to estimate the probability of an object given a word, creating a table of associations between words and objects. Using this table, they extracted a lexicon (a group of word-object mappings) which was relatively accurate in its guesses about the names of objects that were being talked about.
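To make Eqs. 1-3 and 6 concrete, the following sketch scores a candidate lexicon on a toy corpus by brute-force enumeration of intents. The real system searches over lexicons stochastically rather than enumerating; all names and the toy data here are illustrative assumptions, not the authors' code:

```python
import math
from itertools import combinations

def referential_prob(w, x, lexicon, vocab):
    # Eq. 2: P_R(w|x,l) is proportional to 1 when x is in l(w), normalized over words.
    support = [v for v in vocab if x in lexicon.get(v, set())]
    return 1.0 / len(support) if w in support else 0.0

def utterance_loglik(words, intent, lexicon, vocab, gamma, kappa):
    # Per-word mixture of non-referential (Eq. 3) and referential (Eq. 2) generation.
    nr_weight = {v: (kappa if lexicon.get(v) else 1.0) for v in vocab}
    z_nr = sum(nr_weight.values())
    total = 0.0
    for w in words:
        p_nr = nr_weight[w] / z_nr
        p_r = sum(referential_prob(w, x, lexicon, vocab) for x in intent) / len(intent)
        total += math.log((1 - gamma) * p_nr + gamma * p_r)
    return total

def corpus_logpost(corpus, lexicon, vocab, gamma, kappa, alpha):
    # Eq. 1 marginalizes over non-empty intents I_s within O_s (taken uniform);
    # Eq. 6 adds the size prior P(l) proportional to exp(-alpha * |l|).
    lp = -alpha * sum(len(objs) for objs in lexicon.values())
    for words, objects in corpus:
        intents = [list(c) for r in range(1, len(objects) + 1)
                   for c in combinations(objects, r)]
        lp += math.log(sum(math.exp(utterance_loglik(words, I, lexicon, vocab,
                                                     gamma, kappa))
                           for I in intents) / len(intents))
    return lp
```

On a two-utterance toy corpus, a lexicon mapping words to their true referents scores above one with the mappings swapped, which is the gradient the stochastic search exploits.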
They further extended their model to incorporate prosodic emphasis on words (a useful cue which we will not discuss here) and joint attention on objects. Joint attention was coded by hand, isolating a subset of objects which were attended to by both mother and infant. Their results reflected a sizable increase in recall with the use of social cues.

3 Materials and Assessment Methods

To test the performance of our model on natural data, we used the Rollins section of the CHILDES corpus [5]. For comparison with the model by Yu and Ballard [4], we chose the files me03 and di06, each of which consisted of approximately ten minutes of interaction between a mother and a preverbal infant playing with objects found in a box of toys. Because we were not able to obtain the exact corpus Yu and Ballard used, we recoded the objects in the videos and added a coding of social cues co-occurring with each utterance. We annotated each utterance with the set of objects visible to the infant and with a social coding scheme (for an illustrated example, see Figure 2). Our social code included seven features: infant's eyes, infant's hands, infant's mouth, infant touching, mother's hands, mother's eyes, mother touching. For each utterance, this coding created an object-by-social-feature matrix.

(Footnote 2: In order to speed convergence we used a simulated tempering scheme with three temperature chains and a range of data-driven proposals.)

Figure 2: A still frame from our corpus showing the coding of objects and social cues. We coded all mid-sized objects visible to the infant as well as social information including what both mother and infant were touching and looking at.

We evaluated all models based on their coverage of a gold-standard lexicon, computing precision (how many of the word-object mappings in a lexicon were correct relative to the gold-standard), recall (how many of the total correct mappings were found), and their geometric mean, F-score.
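The lexicon scoring just described can be sketched as follows (a simple illustration with hypothetical data; the geometric-mean F is what the text reports, and the harmonic-mean F1 is included only for comparison):

```python
def score_lexicon(lexicon, gold):
    """Score a learned lexicon (a set of (word, object) pairs) against a gold standard."""
    correct = len(lexicon & gold)
    precision = correct / len(lexicon) if lexicon else 0.0
    recall = correct / len(gold) if gold else 0.0
    f_geom = (precision * recall) ** 0.5  # geometric mean, as in the text
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f_geom, f1
```

For instance, a four-pair lexicon containing two of three gold mappings has precision 0.5 and recall 2/3.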
However, the gold-standard lexicon for word-learning is not obvious. For instance, should it include the mapping between the plural “pigs” or the sound “oink” and the object PIG? Should a gold-standard lexicon include word-object pairings that are correct but were not present in the learning situation? In the results we report, we included those pairings which would be useful for a child to learn (e.g., “oink” → PIG) but not those pairings which were not observed to co-occur in the corpus (however, modifying these decisions did not affect the qualitative pattern of results).

4 Results

For the purpose of comparison, we give scores for several other models on the same corpus. We implemented a range of simple associative models based on co-occurrence frequency, conditional probability (both word given object and object given word), and point-wise mutual information. In each of these models, we computed the relevant statistic across the entire corpus and then created a lexicon by including all word-object pairings for which the association statistic met a threshold value. We additionally implemented a translation model (based on Yu and Ballard [4]). Because Yu and Ballard did not include details on how they evaluated their model, we scored it in the same way as the other associative models, by creating an association matrix based on the scores P(O|W) (as given in equation (3) in their paper) and then creating a lexicon based on a threshold value. In order to simulate this type of threshold value for our model, we searched for the MAP lexicon over a range of parameters α in our prior (the larger the prior value, the less probable a larger lexicon; thus this manipulation served to create more or less selective lexicons).

Base model.
In Figure 3, we plot the precision and the recall for lexicons across a range of prior parameter values for our model and the full range of threshold values for the translation model and two of the simple association models (since results for the conditional probability models were very similar but slightly inferior to the performance of mutual information, we did not include them). For our model, we averaged performance at each threshold value across three runs of 5000 search iterations each. Our model performed better than any of the other models on a number of dimensions (best lexicon shown in Table 1), both achieving the highest F-score and showing a better tradeoff between precision and recall at sub-optimal threshold values. The translation model also performed well, increasing precision as the threshold of association was raised. Surprisingly, standard co-occurrence statistics proved to be relatively ineffective at extracting high-scoring lexicons: at any given threshold value, these models included a very large number of incorrect pairs.

Table 1: The best lexicon found by the Bayesian model (α=11, γ=0.2, κ=0.01).
baby → book, bigbird → bird, bird → rattle, birdie → duck, book → book, hand → hand, hat → hat, meow → kitty, moocow → cow, oink → pig, on → ring, ring → ring, sheep → sheep

Figure 3: Comparison of models on corpus data: we plot model precision vs. recall across a range of threshold values for each model (see text). The legend lists co-occurrence frequency, mutual information, the translation model, and the Bayesian model; the plotted best F-scores are 0.54, 0.44, 0.21, and 0.12, with the Bayesian model achieving the highest (F=0.54). Unlike standard ROC curves for classification tasks, the precision and recall of a lexicon depend on the entire lexicon, and irregularities in the curves reflect the small size of the lexicons.

One additional virtue of our model over other associative models is its ability to determine which objects the speaker intended to refer to.
In Table 2, we give some examples of situations in which the model correctly inferred the objects that the speaker was talking about.

Social model.

While the addition of social cues did not increase corpus performance above that found in the base model, the lexicons which were found by the social model did have several properties that were not present in the base model. First, the model effectively and quickly converged on the social cues that we found subjectively important in viewing the corpus videos. The two cues which were consistently found relevant across the model were (1) the target of the infant's gaze and (2) the caregiver's hand. These data are especially interesting in light of the speculation that infants initially believe their own point of gaze is a good cue to reference, and must learn over the second year that the true cue is the caregiver's point of gaze, not their own [6]. Second, while the social model did not outperform the base model on the full corpus (where many words were paired with their referents several times), on a smaller corpus (taking every other utterance), the social cue model did slightly outperform a model without social cues (max F-score = 0.43 vs. 0.37). Third, the addition of social cues allowed the model to infer the intent of a speaker even in the absence of a word being used. In the right-hand column of Table 2, we give an example of a situation in which the caregiver simply says “see that?” but from the direction of the infant's eyes and the location of her hand, the model correctly infers that she is talking about the COW, not either of the other possible referents. This kind of inference might lead the way in allowing infants to learn words like pronouns, which serve to pick out an unambiguous focus of attention (one that is so obvious based on social and contextual cues that it does not need to be named).
Finally, in the next section we show that the addition of social cues to the model allows correct performance in experimental tests of social generalization which only children older than 18 months can pass, suggesting perhaps that the social model is closer to the strategy used by more mature word learners.

Table 2: Intentions inferred by the Bayesian model after having learned a lexicon from the corpus (IE = infant's eyes, CH = caregiver's hands).
Words: “look at the moocow” | Objects: COW, GIRL, BEAR | Social cues: (not shown) | Inferred intention: COW
Words: “see the bear by the rattle?” | Objects: BEAR, RATTLE, COW | Social cues: (not shown) | Inferred intention: BEAR, RATTLE
Words: “see that?” | Objects: BEAR, RATTLE, COW | Social cues: IE & CH → COW | Inferred intention: COW
same-paper 2 0.74186611 48 nips-2007-Collective Inference on Markov Models for Modeling Bird Migration
Author: M.a. S. Elmohamed, Dexter Kozen, Daniel R. Sheldon
Abstract: We investigate a family of inference problems on Markov models, where many sample paths are drawn from a Markov chain and partial information is revealed to an observer who attempts to reconstruct the sample paths. We present algorithms and hardness results for several variants of this problem which arise by revealing different information to the observer and imposing different requirements for the reconstruction of sample paths. Our algorithms are analogous to the classical Viterbi algorithm for Hidden Markov Models, which finds the single most probable sample path given a sequence of observations. Our work is motivated by an important application in ecology: inferring bird migration paths from a large database of observations. 1
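For readers unfamiliar with the baseline this abstract references, a generic Viterbi decoder for a single sample path might look like the sketch below (a standard textbook formulation, not the authors' collective-inference algorithms):

```python
import math

def viterbi(obs, states, log_init, log_trans, log_emit):
    """Most probable hidden-state path for one observation sequence.
    log_init[s], log_trans[s][t], and log_emit[s][o] are log-probabilities."""
    # delta[s]: best log-prob of any path ending in state s at the current step.
    delta = {s: log_init[s] + log_emit[s][obs[0]] for s in states}
    back = []  # back[t][s]: best predecessor of s at step t+1
    for o in obs[1:]:
        prev, delta, ptr = delta, {}, {}
        for t in states:
            best = max(states, key=lambda s: prev[s] + log_trans[s][t])
            ptr[t] = best
            delta[t] = prev[best] + log_trans[best][t] + log_emit[t][o]
        back.append(ptr)
    # Trace back from the best final state.
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

The collective variants studied in the paper generalize this single-path recursion to many simultaneous sample paths constrained by aggregate observations.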
3 0.61184818 73 nips-2007-Distributed Inference for Latent Dirichlet Allocation
Author: David Newman, Padhraic Smyth, Max Welling, Arthur U. Asuncion
Abstract: We investigate the problem of learning a widely-used latent-variable model – the Latent Dirichlet Allocation (LDA) or “topic” model – using distributed computation, where each of P processors only sees 1/P of the total data set. We propose two distributed inference schemes that are motivated from different perspectives. The first scheme uses local Gibbs sampling on each processor with periodic updates—it is simple to implement and can be viewed as an approximation to a single processor implementation of Gibbs sampling. The second scheme relies on a hierarchical Bayesian extension of the standard LDA model to directly account for the fact that data are distributed across processors—it has a theoretical guarantee of convergence but is more complex to implement than the approximate method. Using five real-world text corpora we show that distributed learning works very well for LDA models, i.e., perplexity and precision-recall scores for distributed learning are indistinguishable from those obtained with single-processor learning. Our extensive experimental results include large-scale distributed computation on 1000 virtual processors; and speedup experiments of learning topics in a 100-million word corpus using 16 processors.
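The first scheme (local Gibbs sweeps against stale global counts, with periodic synchronization) can be sketched roughly as below; this is an illustrative single-process simulation of the distributed update, with all names our own:

```python
import random
from collections import defaultdict

def approx_distributed_lda(docs, K, P, iters=10, alpha=0.1, beta=0.01, seed=0):
    """Sketch: partition docs over P workers; each worker runs a collapsed-Gibbs
    sweep on its shard against a frozen snapshot of the global topic-word counts
    (plus its own local deltas), then all deltas are merged at the sync step."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})
    # z[d][i]: topic of token i in doc d; ndk: doc-topic; nkw/nk: global counts.
    z = [[rng.randrange(K) for _ in d] for d in docs]
    ndk = [defaultdict(int) for _ in docs]
    nkw, nk = defaultdict(int), [0] * K
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d][k] += 1; nkw[(k, w)] += 1; nk[k] += 1
    shards = [list(range(p, len(docs), P)) for p in range(P)]
    for _ in range(iters):
        deltas = []
        for shard in shards:  # each "worker" sees a stale global copy
            snap_kw, snap_k = dict(nkw), list(nk)
            d_kw, d_k = defaultdict(int), [0] * K
            for d in shard:
                for i, w in enumerate(docs[d]):
                    k_old = z[d][i]
                    ndk[d][k_old] -= 1; d_kw[(k_old, w)] -= 1; d_k[k_old] -= 1
                    weights = [(ndk[d][k] + alpha)
                               * (snap_kw.get((k, w), 0) + d_kw[(k, w)] + beta)
                               / (snap_k[k] + d_k[k] + V * beta)
                               for k in range(K)]
                    k_new = rng.choices(range(K), weights=weights)[0]
                    z[d][i] = k_new
                    ndk[d][k_new] += 1; d_kw[(k_new, w)] += 1; d_k[k_new] += 1
            deltas.append((d_kw, d_k))
        for d_kw, d_k in deltas:  # synchronize: fold all deltas into global counts
            for key, v in d_kw.items():
                nkw[key] += v
            for k in range(K):
                nk[k] += d_k[k]
    return z, nkw
```

Because the workers process disjoint document shards and only defer the global count updates, the merged counts remain exactly consistent with the assignments after every sync, which is why the scheme behaves as a close approximation to single-processor Gibbs sampling.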
4 0.47317904 63 nips-2007-Convex Relaxations of Latent Variable Training
Author: Yuhong Guo, Dale Schuurmans
Abstract: We investigate a new, convex relaxation of an expectation-maximization (EM) variant that approximates a standard objective while eliminating local minima. First, a cautionary result is presented, showing that any convex relaxation of EM over hidden variables must give trivial results if any dependence on the missing values is retained. Although this appears to be a strong negative outcome, we then demonstrate how the problem can be bypassed by using equivalence relations instead of value assignments over hidden variables. In particular, we develop new algorithms for estimating exponential conditional models that only require equivalence relation information over the variable values. This reformulation leads to an exact expression for EM variants in a wide range of problems. We then develop a semidefinite relaxation that yields global training by eliminating local minima. 1
5 0.46244782 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images
Author: Matthias Bethge, Philipp Berens
Abstract: Maximum entropy analysis of binary variables provides an elegant way for studying the role of pairwise correlations in neural populations. Unfortunately, these approaches suffer from their poor scalability to high dimensions. In sensory coding, however, high-dimensional data is ubiquitous. Here, we introduce a new approach using a near-maximum entropy model, that makes this type of analysis feasible for very high-dimensional data—the model parameters can be derived in closed form and sampling is easy. Therefore, our NearMaxEnt approach can serve as a tool for testing predictions from a pairwise maximum entropy model not only for low-dimensional marginals, but also for high-dimensional measurements of more than a thousand units. We demonstrate its usefulness by studying natural images with dichotomized pixel intensities. Our results indicate that the statistics of such higher-dimensional measurements exhibit additional structure that is not predicted by pairwise correlations, despite the fact that pairwise correlations explain the lower-dimensional marginal statistics surprisingly well up to the limit of dimensionality where estimation of the full joint distribution is feasible.
6 0.4606545 115 nips-2007-Learning the 2-D Topology of Images
7 0.45909473 86 nips-2007-Exponential Family Predictive Representations of State
8 0.45837691 141 nips-2007-New Outer Bounds on the Marginal Polytope
9 0.45793712 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC
10 0.45788693 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning
11 0.45714733 187 nips-2007-Structured Learning with Approximate Inference
12 0.45616499 18 nips-2007-A probabilistic model for generating realistic lip movements from speech
13 0.45542589 105 nips-2007-Infinite State Bayes-Nets for Structured Domains
14 0.45537904 180 nips-2007-Sparse Feature Learning for Deep Belief Networks
15 0.4553501 84 nips-2007-Expectation Maximization and Posterior Constraints
16 0.45514393 112 nips-2007-Learning Monotonic Transformations for Classification
17 0.45499614 134 nips-2007-Multi-Task Learning via Conic Programming
18 0.45370895 2 nips-2007-A Bayesian LDA-based model for semi-supervised part-of-speech tagging
19 0.45357716 49 nips-2007-Colored Maximum Variance Unfolding
20 0.45269698 128 nips-2007-Message Passing for Max-weight Independent Set