nips nips2004 nips2004-41 knowledge-graph by maker-knowledge-mining

41 nips-2004-Comparing Beliefs, Surveys, and Random Walks


Source: pdf

Author: Erik Aurell, Uri Gordon, Scott Kirkpatrick

Abstract: Survey propagation is a powerful technique from statistical physics that has been applied to solve the 3-SAT problem both in principle and in practice. We give, using only probability arguments, a common derivation of survey propagation, belief propagation and several interesting hybrid methods. We then present numerical experiments which use WSAT (a widely used random-walk based SAT solver) to quantify the complexity of the 3-SAT formulae as a function of their parameters, both as randomly generated and after simplification, guided by survey propagation. Some properties of WSAT which have not previously been reported make it an ideal tool for this purpose – its mean cost is proportional to the number of variables in the formula (at a fixed ratio of clauses to variables) in the easy-SAT regime and slightly beyond, and its behavior in the hard-SAT regime appears to reflect the underlying structure of the solution space that has been predicted by replica symmetry-breaking arguments. An analysis of the tradeoffs between the various methods of search for satisfying assignments shows WSAT to be far more powerful than has been appreciated, and suggests some interesting new directions for practical algorithm development.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: Survey propagation is a powerful technique from statistical physics that has been applied to solve the 3-SAT problem both in principle and in practice. [sent-8, score-0.071]

2 We give, using only probability arguments, a common derivation of survey propagation, belief propagation and several interesting hybrid methods. [sent-9, score-0.141]

3 We then present numerical experiments which use WSAT (a widely used random-walk based SAT solver) to quantify the complexity of the 3-SAT formulae as a function of their parameters, both as randomly generated and after simplification, guided by survey propagation. [sent-10, score-0.178]

4 An analysis of the tradeoffs between the various methods of search for satisfying assignments shows WSAT to be far more powerful than has been appreciated, and suggests some interesting new directions for practical algorithm development. [sent-12, score-0.092]

5 It consists of an ensemble of randomly generated logical expressions, each depending on N Boolean variables xi and constructed by taking the AND of M clauses. [sent-14, score-0.138]

6 Each clause a consists of the OR of 3 “literals” yi,a . [sent-15, score-0.269]

7 yi,a is taken to be either xi or ¬xi at random with equal probability, and the three values of the index i in each clause are distinct. [sent-16, score-0.368]

8 Conversely, the neighborhood of a variable xi is Vi, the set of all clauses in which xi or ¬xi appears. [sent-17, score-0.402]
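
As a concrete illustration of the ensemble just described, the following short Python sketch (our construction, not code from the paper) generates one random 3-SAT instance in the common list-of-integer-literals encoding; the function name and the reading of α as the clause-to-variable ratio M/N (as in the abstract) are our assumptions.

import random

def random_3sat(n_vars, alpha, seed=0):
    # alpha is taken to be the clause-to-variable ratio, so M = round(alpha * N)
    rng = random.Random(seed)
    clauses = []
    for _ in range(int(round(alpha * n_vars))):
        variables = rng.sample(range(1, n_vars + 1), 3)  # three distinct indices i
        # each literal y_{i,a} is x_i or not(x_i) with equal probability;
        # x_i is encoded as +i and not(x_i) as -i
        clauses.append([v if rng.random() < 0.5 else -v for v in variables])
    return clauses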

9 For each such random formula, one asks whether there is some set of xi values for which the formula evaluates to TRUE. [sent-18, score-0.138]

10 At small α, solutions are easily found, while for sufficiently large α there are almost certainly no satisfying configurations of the xi, and compact proofs of this fact can be constructed. [sent-20, score-0.173]

11 Between these limits lies a complex, spin-glass-like phase transition, at which the cost of analyzing the problem with either exact or heuristic methods explodes. [sent-21, score-0.124]

12 An iterative “belief propagation” [6] (BP) algorithm for K-SAT can be derived to evaluate the probability, or “belief,” that a variable will take the value TRUE in variable configurations that satisfy the formula considered. [sent-24, score-0.215]

13 The probabilistic interpretation is the following: suppose we have i_{b→i}^{(l)} for all clauses b connected to variable i. [sent-27, score-0.25]

14 Each of these clauses can either be satisfied by another variable (with probability i_{b→i}^{(l)}), or not be satisfied by another variable (with probability 1 − i_{b→i}^{(l)}), and possibly also be satisfied by variable i itself. [sent-28, score-0.426]

15 If we set variable xi to 0, then some clauses are satisfied by xi, and some have to be satisfied by other variables. [sent-29, score-0.326]

16 Similarly, if xi is set to 1 then all these clauses b are satisfied with probability ∏_{b≠a, y_{i,b}=¬x_i} i_{b→i}^{(l)}. [sent-31, score-0.238]

17 Variable xi can be 0 or 1 in a solution if the clauses in which xi appears are either satisfied directly by xi itself, or by other variables. [sent-33, score-0.41]

18 Hence Prob(xi ) = A0 i A0 i + A1 i and Prob(¬xi ) = A0 i A1 i + A1 i (4) A BP-based decimation scheme results from £xing the variables with largest probability to be either true or false. [sent-34, score-0.32]

19 We then recalculate the beliefs for the reduced formula, and repeat. [sent-35, score-0.028]

20 To arrive at SP we introduce a modified system of beliefs: every variable falls into one of three classes: TRUE in all solutions (1); FALSE in all solutions (0); and TRUE in some and FALSE in other solutions (free). [sent-36, score-0.192]

21 The message from a clause to a variable (an influence) is then the same as in BP above. [sent-37, score-0.393]

22 Note that there are now three transports for each directed link i → a (from a variable to a clause) in the graph. [sent-39, score-0.088]

23 As in BP, these numbers will be functions of the influences from clauses to variables in the preceding update step. [sent-40, score-0.203]

24 That gives (compare (3)):

    t_{i→a}^{(l)} = i_{a→i}^{(l-1)} A^1_i / ( i_{a→i}^{(l-1)} A^1_i + A^0_i − A^1_i A^0_i )   if y_{i,a} = ¬x_i
    t_{i→a}^{(l)} = i_{a→i}^{(l-1)} A^0_i / ( i_{a→i}^{(l-1)} A^0_i + A^1_i − A^1_i A^0_i )   if y_{i,a} = x_i      (6)

The update equations for t_{i→a} are the same in SP as in BP. [sent-45, score-0.093]

25 Similarly to (4), decimation now removes the most fixed variable, i.e., the variable most strongly biased toward a single class. [sent-48, score-0.279]

26 Given the complexity of the original derivation of SP [1, 2], it is remarkable that the SP scheme can be interpreted as a type of belief propagation in another belief system. [sent-51, score-0.145]

27 It is even more remarkable that the final iteration formulae differ so little. [sent-52, score-0.097]

28 A modification of SP which we will consider in the following is to interpolate between BP [sent-53, score-0.279]

29 Figure 1: Dependence of decimation depth (fraction of sites remaining after decimation) on the interpolation parameter ρ. [sent-69, score-0.279]

30 (ρ = 0) and SP (ρ = 1) by considering the equations

    t_{i→a}^{(l)} = i_{a→i}^{(l-1)} A^1_i / ( i_{a→i}^{(l-1)} A^1_i + A^0_i − ρ A^1_i A^0_i )   if y_{i,a} = ¬x_i
    t_{i→a}^{(l)} = i_{a→i}^{(l-1)} A^0_i / ( i_{a→i}^{(l-1)} A^0_i + A^1_i − ρ A^1_i A^0_i )   if y_{i,a} = x_i      (7)

We do not have an interpretation of the intermediate cases of ρ as belief systems. [sent-70, score-0.124]
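
A one-function sketch of this interpolated update (ours, and it assumes the reconstruction of (6)-(7) given above): a0 and a1 stand for A^0_i and A^1_i, and infl for i_{a→i} from the previous sweep.

def transport_update(infl, a0, a1, literal_is_negated, rho):
    # rho = 0 reproduces the BP transport update, rho = 1 the SP update (7)
    if literal_is_negated:                    # y_{i,a} = not(x_i)
        return infl * a1 / (infl * a1 + a0 - rho * a1 * a0)
    else:                                     # y_{i,a} = x_i
        return infl * a0 / (infl * a0 + a1 - rho * a1 * a0)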

31 2 The Phase Diagram of 3-SAT. Early work on developing 3-SAT heuristics discovered that as α is increased, the problem changes from being easy to solve to extremely hard, then again relatively easy when the formulae are almost certainly UNSAT. [sent-71, score-0.08]

32 It was natural to expect that a sharp phase boundary between the SAT and UNSAT phases in the limit of large N accompanies this observed “easy-hard-easy” transition, and the finite-size scaling results of [7] confirmed this. [sent-72, score-0.023]

33 Monasson and Zecchina [8] soon showed, using the replica method from statistical mechanics, that the phase transition to be expected had unusual characteristics, including “frozen variables” and a highly nonuniform distribution of solutions, making search difficult. [sent-75, score-0.12]

34 Recent technical advances have made it possible to use simpler cavity mean field methods to pinpoint the SAT/UNSAT boundary at α = 4. [sent-76, score-0.046]

35 These calculations also predicted a specific solution structure (termed 1-RSB for “one-step replica symmetry breaking”) [1, 2] in which the satisfiable configurations occur in large clusters, maximally separated from each other. [sent-79, score-0.049]

36 Two types of frozen variables are predicted: one set which takes the same value in all clusters, and a second set whose value is fixed within a particular cluster. [sent-80, score-0.134]

37 The remaining variables are “paramagnetic” and can take either value in some of the states of a given cluster. [sent-81, score-0.041]

38 “Survey-induced decimation” consists of using SP to determine the variable most likely to be frozen, then setting that variable to the indicated frozen value, simplifying the formula as a result, updating the SP calculation, and repeating the process. [sent-93, score-0.308]
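
Schematically, the loop looks like the sketch below (our paraphrase of the procedure just described; run_sp, pick_most_frozen, simplify, is_paramagnetic and wsat_solve are placeholder callables to be supplied, not functions from the released SP code).

def survey_induced_decimation(formula, run_sp, pick_most_frozen, simplify,
                              is_paramagnetic, wsat_solve):
    while True:
        surveys, converged = run_sp(formula)
        if not converged:
            break                          # SP stops by failing to converge
        if is_paramagnetic(surveys):
            break                          # only free variables remain
        var, value = pick_most_frozen(surveys)   # variable most likely to be frozen
        formula = simplify(formula, var, value)  # fix it and simplify the formula
    return wsat_solve(formula)             # finish the search on the reduced formula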

39 At small enough α we expect SP to discover that all spins are free to take on more than one value in some ground state, so no spins will be decimated. [sent-95, score-0.216]

40 At larger α, SP ideally should identify frozen spins until all that remain are paramagnetic. [sent-97, score-0.201]

41 The depth of decimation, or fraction of spins remaining when SP sees only paramagnetic spins, is thus an important characteristic. [sent-98, score-0.175]

42 Fig. 1 shows the fraction of spins remaining after survey-induced decimation for a range of values of α. [sent-100, score-0.387]

43 On the descending part of the curves, SP reaches a paramagnetic state and halts. [sent-105, score-0.067]

44 On the right, or ascending portion of the curves, SP stops by simply failing to converge. [sent-106, score-0.022]

45 Fig. 1 also shows how different BP and the BP-SP hybrids are from SP in their decimation behavior. [sent-107, score-0.279]

46 BP and underrelaxed SP do not reach a paramagnetic state, but continue until the formula breaks apart into clauses that have no variables shared between them. [sent-111, score-0.355]

47 The underrelaxed SP behaves like BP, but can be used well into the RSB region. [sent-115, score-0.046]

48 On the rising parts of all four curves in Fig 1, the scheme halted as the surveys ceased to converge. [sent-116, score-0.022]

49 This may give reasonable recommendations for simplification even on formulae which are likely to be UNSAT. [sent-118, score-0.08]

50 3 Some Background on WSAT. Next we consider WSAT, the random walk-based search routine used to finish the job of exhibiting a satisfying configuration after SP (or some other decimation advisor) has simplified the formula. [sent-119, score-0.372]

51 The surprising power exhibited by SP has to some extent obscured the fact that WSAT is itself a very powerful tool for solving constraint satisfaction problems, and has been widely used for this. [sent-120, score-0.022]

52 Its running time, expressed as the number of walk steps required for a successful search, is also useful as an informal definition of the complexity of a logical formula. [sent-121, score-0.125]

53 Its history goes back to Papadimitriou's [10] observation that a subtly biased random walk would, with high probability, discover satisfying solutions to the simpler 2-SAT problem after, at worst, O(N^2) steps. [sent-122, score-0.161]

54 Later authors have argued analytically and shown experimentally that RWalkSAT finds satisfying configurations of the variables after a number of steps that is proportional to N for values of α up to roughly 2. [sent-126, score-0.086]

55 after which this cost increases exponentially with N . [sent-128, score-0.101]

56 They also choose an unsatisfied clause at random, but then reverse one of the “best” variables, selected at random, where “best” is defined as causing the fewest satisfied clauses to become unsatisfied. [sent-130, score-0.478]

57 For robustness, they mix this greedy move with random moves as used in RWalkSAT, recommending an equal mixture of the two types of moves. [sent-131, score-0.065]

58 If any variable in the selected unsatisfied clause can be reversed without causing any other clauses to become unsatisfied, this “free” move is immediately accepted and no further exploration is required. [sent-137, score-0.557]
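
The move selection described in the last few sentences can be written compactly. The following is a self-contained sketch in the spirit of WSAT (ours, not the Selman-Kautz code), using the same clause encoding as the generator sketched earlier and the recommended equal mixture of greedy and random moves.

import random

def wsat(clauses, n_vars, max_flips=100_000, p_noise=0.5, seed=0):
    rng = random.Random(seed)
    assign = {v: rng.random() < 0.5 for v in range(1, n_vars + 1)}
    lit_true = lambda lit: assign[abs(lit)] == (lit > 0)

    def break_count(v):
        # clauses that flipping v would newly unsatisfy: those whose only
        # currently-true literal is on variable v
        return sum(1 for c in clauses
                   if sum(map(lit_true, c)) == 1
                   and any(abs(l) == v and lit_true(l) for l in c))

    for step in range(max_flips):
        unsat = [c for c in clauses if not any(map(lit_true, c))]
        if not unsat:
            return assign, step                 # satisfying configuration found
        clause = rng.choice(unsat)
        scores = {abs(l): break_count(abs(l)) for l in clause}
        free = [v for v, s in scores.items() if s == 0]
        if free:                                # "free" move, accepted immediately
            v = rng.choice(free)
        elif rng.random() < p_noise:            # random move, as in RWalkSAT
            v = abs(rng.choice(clause))
        else:                                   # greedy move: break the fewest clauses
            best = min(scores.values())
            v = rng.choice([u for u, s in scores.items() if s == best])
        assign[v] = not assign[v]
    return None, max_flips                      # no solution found within the cutoff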

59 In Fig. 2a we show the median number of random-walk steps per variable taken by the standard version of WSAT to solve 3-SAT formulas at values of α ranging from 0. [sent-142, score-0.324]

60 and for formulae of sizes ranging from N = 1000 to N = 20000. [sent-144, score-0.08]

61 The cost of WSAT remains linear in N well above α = 3. [sent-145, score-0.101]

62 WSAT cost distributions were collected on at least 1000 cases at each point. [sent-147, score-0.101]

63 Since the distributions are asymmetric, with strong tails extending to higher cost, it is not obvious that WSAT cost is, in the statistical mechanics language, self-averaging, or concentrated about a well-defined mean value which dominates the distribution as N → ∞. [sent-148, score-0.147]

64 To test this, we calculated higher moments of the WSAT cost distribution and found that they scale with simple powers of N. [sent-149, score-0.101]

65 In Fig. 2b we show that the variance of the WSAT cost per variable, scaled up by N, is a well-defined function of α up to almost 4. [sent-151, score-0.13]

66 A detailed analysis of the cost distributions which we observed will be published elsewhere, but we conclude that the median cost of solving 3-SAT using the WSAT random-walk search, as well as the mean cost if that is well-defined, remains linear in N up to α = 4. [sent-155, score-0.418]

67 In the 1-RSB regime, the distributions of WSAT cost per variable shift to higher values as N increases, and an exponential increase in cost with N is likely. [sent-157, score-0.319]

68 Is this really the endpoint for WSAT's linearity, or will the search cost per variable converge at still larger values of N which we could not study? [sent-159, score-0.243]

69 We define a rough estimate of N_onset(α) by studying the cumulative distributions of WSAT cost: it is the value of N, for a given α, above which the distributions cross at a fixed percentile. [sent-160, score-0.118]
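
One way such a rough estimate could be computed in practice is sketched below (our construction; the percentile and tolerance factor are arbitrary illustrative choices, not values from the paper).

import numpy as np

def n_onset_estimate(cost_samples, percentile=90, tol=1.05):
    # cost_samples maps N -> array of WSAT costs (walk steps) at a fixed alpha.
    # Below the onset the cost per variable collapses onto one curve; report the
    # first N whose scaled percentile exceeds the previous one by more than tol.
    sizes = sorted(cost_samples)
    scaled = {n: np.percentile(np.asarray(cost_samples[n], dtype=float) / n,
                               percentile)
              for n in sizes}
    for prev, cur in zip(sizes, sizes[1:]):
        if scaled[cur] > tol * scaled[prev]:
            return cur
    return None    # linear scaling holds over the sizes studied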

70 [Figure residue: panel titled “Onset for linear WSAT cost per variable”; legend N = 1000, 2000, 5000, 10000, 20000; axes N_onset vs. median WalkSat cost.] [sent-165, score-0.249]

71 Figure 3: Size N at which WSAT cost becomes linear in N, as a function of α. [sent-172, score-0.101]

72 4 Practical Aspects of SP + WSAT. The power of SP comes from its use to guide decimation by identifying spins which can be frozen while minimally reducing the number of solutions that can be constructed. [sent-175, score-0.549]

73 To assess the complexity of the reduced formulae produced by decimation guided in this way, we compare, in Fig. [sent-176, score-0.396]

74 4, the median number of WSAT steps required to find a satisfying configuration of the variables before and after decimation. [sent-177, score-0.137]

75 To a rough approximation, we can say that SP caps the cost of finding a solution at what it would be at the entry to the critical regime. [sent-178, score-0.135]

76 There are two factors: the reduction in the number of variables that have to be searched, and the reduction of the distance the random walk must traverse when it is restricted to a single cluster of solutions. [sent-179, score-0.105]

77 In Fig. 2c the solid lines show the WSAT costs divided by N, the original number of variables in each formula. [sent-181, score-0.041]

78 If we instead divide the WSAT cost after decimation by the number of variables remaining, the complexity measure that we obtain is only a factor of two larger, as shown by the dotted lines. [sent-182, score-0.438]

79 The relative cost of running WSAT without the benefit of decimation is 3–4 decades larger. [sent-183, score-0.401]

80 We measured the actual compute time consumed in survey propagation and in WSAT. [sent-184, score-0.11]

81 The measurements used the survey propagation code and the copy of WSAT (H. Kautz's distribution). [sent-186, score-0.11]

82 All programs were run on a Pentium IV Xeon 3 GHz dual-processor server with 4 GB of memory, with only one processor busy. [sent-188, score-0.04]

83 We compare timings from runs on the same 100 formulas with N = 10000 and α = 4. [sent-189, score-0.109]

84 2 (the formulas are simply extended slightly for the second case). [sent-191, score-0.092]

85 In the first case, the 100 formulas were solved using WSAT alone in 921 seconds. [sent-192, score-0.092]

86 Using SP to guide decimation one variable at a time, with the survey updates performed locally around each modified variable, the same 100 formulas required 6218 seconds to solve, of which only 31 sec was spent in WSAT. [sent-193, score-0.621]

87 Running WSAT on 100 formulas with N = 10000 required 27771 seconds on the same servers, and would have taken even longer if about half of the runs had not been stopped by a cutoff without producing a satisfying configuration. [sent-196, score-0.181]

88 In contrast, the same 100 formulas were solved by SP followed by WSAT in 10,420 sec, of which only 300 seconds were spent in WSAT. [sent-197, score-0.148]

89 The cost of SP does not scale linearly with N, but appears to scale as N^2 in this regime. [sent-198, score-0.121]

90 We solved 100 formulas with N = 20,000 using SP followed by WSAT in 39643 seconds, of which 608 sec was spent in WSAT. [sent-199, score-0.149]

91 The cost of running SP to decimate roughly half the spins has quadrupled, while the cost of the final WSAT runs remained proportional to N. [sent-200, score-0.348]

92 Decimation must stop short of the paramagnetic state at the highest values of α, to avoid having SP fail to converge. [sent-201, score-0.067]

93 In those cases we found that WSAT could sometimes find satisfying configurations if started slightly before this point. [sent-202, score-0.045]

94 We also explored partial decimation as a means of reducing the cost of WSAT just below the 1-RSB regime, but found that decimation of small fractions of the variables caused the WSAT running times to be highly unpredictable, in many cases increasing strongly. [sent-203, score-0.721]

95 As a result, partial decimation does not seem to be a useful approach. [sent-204, score-0.279]

96 Further directing WSAT's random choices to incorporate the insights gained from BP and SP might make it an effective algorithm even closer to the SAT/UNSAT transition. [sent-207, score-0.023]

97 Acknowledgments. We have enjoyed discussions of this work with members of the replica and cavity theory community, especially Riccardo Zecchina, Alfredo Braunstein, Marc Mezard, Remi Monasson and Andrea Montanari. [sent-208, score-0.095]

98 (2002) The random K-satisfiability problem: from an analytic solution to an efficient algorithm. [sent-227, score-0.023]

99 (2003), On the probabilistic approach to the random satisfiability problem, Proc. [sent-237, score-0.023]

100 (2003) Instability of one-step replica-symmetry-broken phase in satisfiability problems. [sent-258, score-0.023]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('wsat', 0.65), ('sp', 0.34), ('decimation', 0.279), ('clause', 0.269), ('clauses', 0.162), ('zecchina', 0.139), ('ia', 0.134), ('bp', 0.116), ('ti', 0.109), ('ree', 0.108), ('spins', 0.108), ('ib', 0.103), ('cost', 0.101), ('frozen', 0.093), ('formulas', 0.092), ('variable', 0.088), ('unsatis', 0.081), ('formulae', 0.08), ('xi', 0.076), ('satis', 0.071), ('paramagnetic', 0.067), ('kautz', 0.062), ('monasson', 0.062), ('selman', 0.062), ('survey', 0.061), ('solutions', 0.052), ('median', 0.051), ('replica', 0.049), ('propagation', 0.049), ('arxiv', 0.046), ('braunstein', 0.046), ('cavity', 0.046), ('parisi', 0.046), ('rwalksat', 0.046), ('transport', 0.046), ('underrelaxed', 0.046), ('mechanics', 0.046), ('satisfying', 0.045), ('walk', 0.041), ('variables', 0.041), ('vi', 0.04), ('formula', 0.039), ('message', 0.036), ('gurations', 0.035), ('sat', 0.034), ('regime', 0.033), ('ability', 0.032), ('barthel', 0.031), ('kirkpatrick', 0.031), ('mezard', 0.031), ('nonset', 0.031), ('overrelaxed', 0.031), ('swedish', 0.031), ('zard', 0.031), ('belief', 0.031), ('onset', 0.031), ('reverse', 0.029), ('spent', 0.029), ('per', 0.029), ('sec', 0.028), ('beliefs', 0.028), ('sweden', 0.027), ('seconds', 0.027), ('search', 0.025), ('false', 0.025), ('uences', 0.025), ('papadimitriou', 0.025), ('ai', 0.024), ('transition', 0.023), ('phase', 0.023), ('random', 0.023), ('fig', 0.023), ('powerful', 0.022), ('moves', 0.022), ('con', 0.022), ('stops', 0.022), ('surveys', 0.022), ('ed', 0.022), ('running', 0.021), ('trick', 0.021), ('logical', 0.021), ('prob', 0.021), ('move', 0.02), ('appears', 0.02), ('guided', 0.02), ('processor', 0.02), ('jerusalem', 0.019), ('tj', 0.018), ('sent', 0.018), ('causing', 0.018), ('tk', 0.018), ('boolean', 0.018), ('critical', 0.017), ('runs', 0.017), ('complexity', 0.017), ('equations', 0.017), ('begins', 0.017), ('rough', 0.017), ('guide', 0.017), ('remarkable', 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 41 nips-2004-Comparing Beliefs, Surveys, and Random Walks

Author: Erik Aurell, Uri Gordon, Scott Kirkpatrick

Abstract: Survey propagation is a powerful technique from statistical physics that has been applied to solve the 3-SAT problem both in principle and in practice. We give, using only probability arguments, a common derivation of survey propagation, belief propagation and several interesting hybrid methods. We then present numerical experiments which use WSAT (a widely used random-walk based SAT solver) to quantify the complexity of the 3-SAT formulae as a function of their parameters, both as randomly generated and after simplification, guided by survey propagation. Some properties of WSAT which have not previously been reported make it an ideal tool for this purpose – its mean cost is proportional to the number of variables in the formula (at a fixed ratio of clauses to variables) in the easy-SAT regime and slightly beyond, and its behavior in the hard-SAT regime appears to reflect the underlying structure of the solution space that has been predicted by replica symmetry-breaking arguments. An analysis of the tradeoffs between the various methods of search for satisfying assignments shows WSAT to be far more powerful than has been appreciated, and suggests some interesting new directions for practical algorithm development.

2 0.097074911 203 nips-2004-Validity Estimates for Loopy Belief Propagation on Binary Real-world Networks

Author: Joris M. Mooij, Hilbert J. Kappen

Abstract: We introduce a computationally efficient method to estimate the validity of the BP method as a function of graph topology, the connectivity strength, frustration and network size. We present numerical results that demonstrate the correctness of our estimates for the uniform random model and for a real-world network (“C. Elegans”). Although the method is restricted to pair-wise interactions, no local evidence (zero “biases”) and binary variables, we believe that its predictions correctly capture the limitations of BP for inference and MAP estimation on arbitrary graphical models. Using this approach, we find that BP always performs better than MF. Especially for large networks with broad degree distributions (such as scale-free networks) BP turns out to significantly outperform MF. 1

3 0.085162535 105 nips-2004-Log-concavity Results on Gaussian Process Methods for Supervised and Unsupervised Learning

Author: Liam Paninski

Abstract: Log-concavity is an important property in the context of optimization, Laplace approximation, and sampling; Bayesian methods based on Gaussian process priors have become quite popular recently for classification, regression, density estimation, and point process intensity estimation. Here we prove that the predictive densities corresponding to each of these applications are log-concave, given any observed data. We also prove that the likelihood is log-concave in the hyperparameters controlling the mean function of the Gaussian prior in the density and point process intensity estimation cases, and the mean, covariance, and observation noise parameters in the classification and regression cases; this result leads to a useful parameterization of these hyperparameters, indicating a suitably large class of priors for which the corresponding maximum a posteriori problem is log-concave.

4 0.083542898 116 nips-2004-Message Errors in Belief Propagation

Author: Alexander T. Ihler, John W. Fisher, Alan S. Willsky

Abstract: Belief propagation (BP) is an increasingly popular method of performing approximate inference on arbitrary graphical models. At times, even further approximations are required, whether from quantization or other simplified message representations or from stochastic approximation methods. Introducing such errors into the BP message computations has the potential to adversely affect the solution obtained. We analyze this effect with respect to a particular measure of message error, and show bounds on the accumulation of errors in the system. This leads both to convergence conditions and error bounds in traditional and approximate BP message passing. 1

5 0.071406111 63 nips-2004-Expectation Consistent Free Energies for Approximate Inference

Author: Manfred Opper, Ole Winther

Abstract: We propose a novel a framework for deriving approximations for intractable probabilistic models. This framework is based on a free energy (negative log marginal likelihood) and can be seen as a generalization of adaptive TAP [1, 2, 3] and expectation propagation (EP) [4, 5]. The free energy is constructed from two approximating distributions which encode different aspects of the intractable model such a single node constraints and couplings and are by construction consistent on a chosen set of moments. We test the framework on a difficult benchmark problem with binary variables on fully connected graphs and 2D grid graphs. We find good performance using sets of moments which either specify factorized nodes or a spanning tree on the nodes (structured approximation). Surprisingly, the Bethe approximation gives very inferior results even on grids. 1

6 0.068716101 49 nips-2004-Density Level Detection is Classification

7 0.058579125 113 nips-2004-Maximum-Margin Matrix Factorization

8 0.057097886 66 nips-2004-Exponential Family Harmoniums with an Application to Information Retrieval

9 0.056920696 52 nips-2004-Discrete profile alignment via constrained information bottleneck

10 0.053005204 64 nips-2004-Experts in a Markov Decision Process

11 0.052490368 178 nips-2004-Support Vector Classification with Input Data Uncertainty

12 0.048521351 55 nips-2004-Distributed Occlusion Reasoning for Tracking with Nonparametric Belief Propagation

13 0.042404182 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning

14 0.040802747 187 nips-2004-The Entire Regularization Path for the Support Vector Machine

15 0.040752303 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification

16 0.040425636 17 nips-2004-Adaptive Manifold Learning

17 0.03464181 83 nips-2004-Incremental Learning for Visual Tracking

18 0.032469388 71 nips-2004-Generalization Error Bounds for Collaborative Prediction with Low-Rank Matrices

19 0.031405166 126 nips-2004-Nearly Tight Bounds for the Continuum-Armed Bandit Problem

20 0.031259257 44 nips-2004-Conditional Random Fields for Object Recognition


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.113), (1, 0.006), (2, 0.023), (3, 0.014), (4, -0.001), (5, 0.026), (6, -0.044), (7, 0.054), (8, -0.064), (9, -0.138), (10, 0.089), (11, -0.063), (12, 0.103), (13, 0.06), (14, 0.062), (15, 0.004), (16, 0.012), (17, 0.092), (18, -0.022), (19, 0.089), (20, -0.05), (21, 0.036), (22, 0.013), (23, 0.0), (24, 0.07), (25, -0.031), (26, -0.021), (27, 0.035), (28, 0.016), (29, -0.086), (30, -0.013), (31, -0.043), (32, -0.005), (33, -0.003), (34, 0.166), (35, -0.004), (36, 0.131), (37, 0.028), (38, -0.048), (39, 0.065), (40, -0.032), (41, -0.044), (42, 0.021), (43, 0.011), (44, 0.095), (45, -0.146), (46, -0.038), (47, -0.084), (48, -0.067), (49, -0.006)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91391498 41 nips-2004-Comparing Beliefs, Surveys, and Random Walks

Author: Erik Aurell, Uri Gordon, Scott Kirkpatrick

Abstract: Survey propagation is a powerful technique from statistical physics that has been applied to solve the 3-SAT problem both in principle and in practice. We give, using only probability arguments, a common derivation of survey propagation, belief propagation and several interesting hybrid methods. We then present numerical experiments which use WSAT (a widely used random-walk based SAT solver) to quantify the complexity of the 3-SAT formulae as a function of their parameters, both as randomly generated and after simpli£cation, guided by survey propagation. Some properties of WSAT which have not previously been reported make it an ideal tool for this purpose – its mean cost is proportional to the number of variables in the formula (at a £xed ratio of clauses to variables) in the easy-SAT regime and slightly beyond, and its behavior in the hardSAT regime appears to re¤ect the underlying structure of the solution space that has been predicted by replica symmetry-breaking arguments. An analysis of the tradeoffs between the various methods of search for satisfying assignments shows WSAT to be far more powerful than has been appreciated, and suggests some interesting new directions for practical algorithm development. 1

2 0.63437033 63 nips-2004-Expectation Consistent Free Energies for Approximate Inference

Author: Manfred Opper, Ole Winther

Abstract: We propose a novel a framework for deriving approximations for intractable probabilistic models. This framework is based on a free energy (negative log marginal likelihood) and can be seen as a generalization of adaptive TAP [1, 2, 3] and expectation propagation (EP) [4, 5]. The free energy is constructed from two approximating distributions which encode different aspects of the intractable model such a single node constraints and couplings and are by construction consistent on a chosen set of moments. We test the framework on a difficult benchmark problem with binary variables on fully connected graphs and 2D grid graphs. We find good performance using sets of moments which either specify factorized nodes or a spanning tree on the nodes (structured approximation). Surprisingly, the Bethe approximation gives very inferior results even on grids. 1

3 0.60561347 203 nips-2004-Validity Estimates for Loopy Belief Propagation on Binary Real-world Networks

Author: Joris M. Mooij, Hilbert J. Kappen

Abstract: We introduce a computationally efficient method to estimate the validity of the BP method as a function of graph topology, the connectivity strength, frustration and network size. We present numerical results that demonstrate the correctness of our estimates for the uniform random model and for a real-world network (“C. Elegans”). Although the method is restricted to pair-wise interactions, no local evidence (zero “biases”) and binary variables, we believe that its predictions correctly capture the limitations of BP for inference and MAP estimation on arbitrary graphical models. Using this approach, we find that BP always performs better than MF. Especially for large networks with broad degree distributions (such as scale-free networks) BP turns out to significantly outperform MF. 1

4 0.49325386 116 nips-2004-Message Errors in Belief Propagation

Author: Alexander T. Ihler, John W. Fisher, Alan S. Willsky

Abstract: Belief propagation (BP) is an increasingly popular method of performing approximate inference on arbitrary graphical models. At times, even further approximations are required, whether from quantization or other simplified message representations or from stochastic approximation methods. Introducing such errors into the BP message computations has the potential to adversely affect the solution obtained. We analyze this effect with respect to a particular measure of message error, and show bounds on the accumulation of errors in the system. This leads both to convergence conditions and error bounds in traditional and approximate BP message passing. 1

5 0.42649499 66 nips-2004-Exponential Family Harmoniums with an Application to Information Retrieval

Author: Max Welling, Michal Rosen-zvi, Geoffrey E. Hinton

Abstract: Directed graphical models with one layer of observed random variables and one or more layers of hidden random variables have been the dominant modelling paradigm in many research fields. Although this approach has met with considerable success, the causal semantics of these models can make it difficult to infer the posterior distribution over the hidden variables. In this paper we propose an alternative two-layer model based on exponential family distributions and the semantics of undirected models. Inference in these “exponential family harmoniums” is fast while learning is performed by minimizing contrastive divergence. A member of this family is then studied as an alternative probabilistic model for latent semantic indexing. In experiments it is shown that they perform well on document retrieval tasks and provide an elegant solution to searching with keywords.

6 0.36929044 17 nips-2004-Adaptive Manifold Learning

7 0.36809072 105 nips-2004-Log-concavity Results on Gaussian Process Methods for Supervised and Unsupervised Learning

8 0.36506122 44 nips-2004-Conditional Random Fields for Object Recognition

9 0.33568382 122 nips-2004-Modelling Uncertainty in the Game of Go

10 0.33161804 6 nips-2004-A Hidden Markov Model for de Novo Peptide Sequencing

11 0.32508832 55 nips-2004-Distributed Occlusion Reasoning for Tracking with Nonparametric Belief Propagation

12 0.31412503 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification

13 0.30586645 29 nips-2004-Beat Tracking the Graphical Model Way

14 0.28808463 52 nips-2004-Discrete profile alignment via constrained information bottleneck

15 0.27851862 178 nips-2004-Support Vector Classification with Input Data Uncertainty

16 0.25747573 113 nips-2004-Maximum-Margin Matrix Factorization

17 0.25411215 49 nips-2004-Density Level Detection is Classification

18 0.25297311 124 nips-2004-Multiple Alignment of Continuous Time Series

19 0.24087553 74 nips-2004-Harmonising Chorales by Probabilistic Inference

20 0.23857237 71 nips-2004-Generalization Error Bounds for Collaborative Prediction with Low-Rank Matrices


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.061), (15, 0.132), (26, 0.056), (31, 0.029), (32, 0.012), (33, 0.14), (35, 0.018), (39, 0.014), (50, 0.037), (82, 0.013), (88, 0.011), (93, 0.348)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.74399316 41 nips-2004-Comparing Beliefs, Surveys, and Random Walks

Author: Erik Aurell, Uri Gordon, Scott Kirkpatrick

Abstract: Survey propagation is a powerful technique from statistical physics that has been applied to solve the 3-SAT problem both in principle and in practice. We give, using only probability arguments, a common derivation of survey propagation, belief propagation and several interesting hybrid methods. We then present numerical experiments which use WSAT (a widely used random-walk based SAT solver) to quantify the complexity of the 3-SAT formulae as a function of their parameters, both as randomly generated and after simplification, guided by survey propagation. Some properties of WSAT which have not previously been reported make it an ideal tool for this purpose – its mean cost is proportional to the number of variables in the formula (at a fixed ratio of clauses to variables) in the easy-SAT regime and slightly beyond, and its behavior in the hard-SAT regime appears to reflect the underlying structure of the solution space that has been predicted by replica symmetry-breaking arguments. An analysis of the tradeoffs between the various methods of search for satisfying assignments shows WSAT to be far more powerful than has been appreciated, and suggests some interesting new directions for practical algorithm development.

2 0.70067728 91 nips-2004-Joint Tracking of Pose, Expression, and Texture using Conditionally Gaussian Filters

Author: Tim K. Marks, J. C. Roddey, Javier R. Movellan, John R. Hershey

Abstract: We present a generative model and stochastic filtering algorithm for simultaneous tracking of 3D position and orientation, non-rigid motion, object texture, and background texture using a single camera. We show that the solution to this problem is formally equivalent to stochastic filtering of conditionally Gaussian processes, a problem for which well known approaches exist [3, 8]. We propose an approach based on Monte Carlo sampling of the nonlinear component of the process (object motion) and exact filtering of the object and background textures given the sampled motion. The smoothness of image sequences in time and space is exploited by using Laplace’s method to generate proposal distributions for importance sampling [7]. The resulting inference algorithm encompasses both optic flow and template-based tracking as special cases, and elucidates the conditions under which these methods are optimal. We demonstrate an application of the system to 3D non-rigid face tracking. 1 Background Recent algorithms track morphable objects by solving optic flow equations, subject to the constraint that the tracked points belong to an object whose non-rigid deformations are linear combinations of a set of basic shapes [10, 2, 11]. These algorithms require precise initialization of the object pose and tend to drift out of alignment on long video sequences. We present G-flow, a generative model and stochastic filtering formulation of tracking that address the problems of initialization and error recovery in a principled manner. We define a non-rigid object by the 3D locations of n vertices. The object is a linear combination of k fixed morph bases, with coefficients c = [c1 , c2 , · · · , ck ]T . The fixed 3 × k matrix hi contains the position of the ith vertex in all k morph bases. The transformation from object-centered to image coordinates consists of a rotation, weak perspective projection, and translation. Thus xi , the 2D location of the ith vertex on the image plane, is xi = grhi c + l, (1) where r is the 3 × 3 rotation matrix, l is the 2 × 1 translation vector, and g = 1 0 0 is the 010 projection matrix. The object pose, ut , comprises both the rigid motion parameters and the morph parameters at time t: ut = {r(t), l(t), c(t)}. (2) 1.1 Optic flow Let yt represent the current image, and let xi (ut ) index the image pixel that is rendered by the ith object vertex when the object assumes pose ut . Suppose that we know ut−1 , the pose at time t − 1, and we want to find ut , the pose at time t. This problem can be solved by minimizing the following form with respect to ut : ut = argmin ˆ ut 1 2 n 2 [yt (xi (ut )) − yt−1 (xi (ut−1 ))] . (3) i=1 In the special case in which the xi (ut ) are neighboring points that move with the same 2D displacement, this reduces to the standard Lucas-Kanade optic flow algorithm [9, 1]. Recent work [10, 2, 11] has shown that in the general case, this optimization problem can be solved efficiently using the Gauss-Newton method. We will take advantage of this fact to develop an efficient stochastic inference algorithm within the framework of G-flow. Notational conventions Unless otherwise stated, capital letters are used for random variables, small letters for specific values taken by random variables, and Greek letters for fixed model parameters. Subscripted colons indicate sequences: e.g., X1:t = X1 · · · Xt . 
The term In stands for the n × n identity matrix, E for expected value, V ar for the covariance matrix, and V ar−1 for the inverse of the covariance matrix (precision matrix). 2 The Generative Model for G-Flow Figure 1: Left: a(Ut ) determines which texel (color at a vertex of the object model or a pixel of the background model) is responsible for rendering each image pixel. Right: G-flow video generation model: At time t, the object’s 3D pose, Ut , is used to project the object texture, Vt , into 2D. This projection is combined with the background texture, Bt , to generate the observed image, Yt . We model the image sequence Y as a stochastic process generated by three hidden causes, U , V , and B, as shown in the graphical model (Figure 1, right). The m × 1 random vector Yt represents the m-pixel image at time t. The n × 1 random vector Vt and the m × 1 random vector Bt represent the n-texel object texture and the m-texel background texture, respectively. As illustrated in Figure 1, left, the object pose, Ut , determines onto which image pixels the object and background texels project at time t. This is formulated using the projection function a(Ut ). For a given pose, ut , the projection a(ut ) is a block matrix, def a(ut ) = av (ut ) ab (ut ) . Here av (ut ), the object projection function, is an m × n matrix of 0s and 1s that tells onto which image pixel each object vertex projects; e.g., a 1 at row j, column i it means that the ith object point projects onto image pixel j. Matrix ab plays the same role for background pixels. Assuming the foreground mapping is one-toone, we let ab = Im −av (ut )av (ut )T , expressing the simple occlusion constraint that every image pixel is rendered by object or background, but not both. In the G-flow generative model: Vt Yt = a(Ut ) + Wt Wt ∼ N (0, σw Im ), σw > 0 Bt (4) Ut ∼ p(ut | ut−1 ) v v Vt = Vt−1 + Zt−1 Zt−1 ∼ N (0, Ψv ), Ψv is diagonal b b Bt = Bt−1 + Zt−1 Zt−1 ∼ N (0, Ψb ), Ψb is diagonal where p(ut | ut−1 ) is the pose transition distribution, and Z v , Z b , W are independent of each other, of the initial conditions, and over time. The form of the pose distribution is left unspecified since the algorithm proposed here does not require the pose distribution or the pose dynamics to be Gaussian. For the initial conditions, we require that the variance of V1 and the variance of B1 are both diagonal. Non-rigid 3D tracking is a difficult nonlinear filtering problem because changing the pose has a nonlinear effect on the image pixels. Fortunately, the problem has a rich structure that we can exploit: under the G-flow model, video generation is a conditionally Gaussian process [3, 6, 4, 5]. If the specific values taken by the pose sequence, u1:t , were known, then the texture processes, V and B, and the image process, Y , would be jointly Gaussian. This suggests the following scheme: we could use particle filtering to obtain a distribution of pose experts (each expert corresponds to a highly probable sample of pose, u1:t ). For each expert we could then use Kalman filtering equations to infer the posterior distribution of texture given the observed images. This method is known in the statistics community as a Monte Carlo filtering solution for conditionally Gaussian processes [3, 4], and in the machine learning community as Rao-Blackwellized particle filtering [6, 5]. We found that in addition to Rao-Blackwellization, it was also critical to use Laplace’s method to generate the proposal distributions for importance sampling [7]. 
In the context of G-flow, we accomplished this by performing an optic flow-like optimization, using an efficient algorithm similar to those in [10, 2]. 3 Inference Our goal is to find an expression for the filtering distribution, p(ut , vt , bt | y1:t ). Using the law of total probability, we have the following equation for the filtering distribution: p(ut , vt , bt | y1:t ) = p(ut , vt , bt | u1:t−1 , y1:t ) p(u1:t−1 | y1:t ) du1:t−1 Opinion of expert (5) Credibility of expert We can think of the integral in (5) as a sum over a distribution of experts, where each expert corresponds to a single pose history, u1:t−1 . Based on its hypothesis about pose history, each expert has an opinion about the current pose of the object, Ut , and the texture maps of the object and background, Vt and Bt . Each expert also has a credibility, a scalar that measures how well the expert’s opinion matches the observed image yt . Thus, (5) can be interpreted as follows: The filtering distribution at time t is obtained by integrating over the entire ensemble of experts the opinion of each expert weighted by that expert’s credibility. The opinion distribution of expert u1:t−1 can be factorized into the expert’s opinion about the pose Ut times the conditional distribution of texture Vt , Bt given pose: p(ut , vt , bt | u1:t−1 , y1:t ) = p(ut | u1:t−1 , y1:t ) p(vt , bt | u1:t , y1:t ) (6) Opinion of expert Pose Opinion Texture Opinion given pose The rest of this section explains how we evaluate each term in (5) and (6). We cover the distribution of texture given pose in 3.1, pose opinion in 3.2, and credibility in 3.3. 3.1 Texture opinion given pose The distribution of Vt and Bt given the pose history u1:t is Gaussian with mean and covariance that can be obtained using the Kalman filter estimation equations: −1 V ar−1 (Vt , Bt | u1:t , y1:t ) = V ar−1 (Vt , Bt | u1:t−1 , y1:t−1 ) + a(ut )T σw a(ut ) E(Vt , Bt | u1:t , y1:t ) = V ar(Vt , Bt | u1:t , y1:t ) −1 × V ar−1 (Vt , Bt | u1:t−1 , y1:t−1 )E(Vt , Bt | u1:t−1 , y1:t−1 ) + a(ut )T σw yt (7) (8) This requires p(Vt , Bt |u1:t−1 , y1:t−1 ), which we get from the Kalman prediction equations: E(Vt , Bt | u1:t−1 , y1:t−1 ) = E(Vt−1 , Bt−1 | u1:t−1 , y1:t−1 ) V ar(Vt , Bt | u1:t−1 , y1:t−1 ) = V ar(Vt−1 , Bt−1 | u1:t−1 , y1:t−1 ) + (9) Ψv 0 0 Ψb (10) In (9), the expected value E(Vt , Bt | u1:t−1 , y1:t−1 ) consists of texture maps (templates) for the object and background. In (10), V ar(Vt , Bt | u1:t−1 , y1:t−1 ) represents the degree of uncertainty about each texel in these texture maps. Since this is a diagonal matrix, we can refer to the mean and variance of each texel individually. For the ith texel in the object texture map, we use the following notation: µv (i) t v σt (i) def = ith element of E(Vt | u1:t−1 , y1:t−1 ) def = (i, i)th element of V ar(Vt | u1:t−1 , y1:t−1 ) b Similarly, define µb (j) and σt (j) as the mean and variance of the jth texel in the backt ground texture map. (This notation leaves the dependency on u1:t−1 and y1:t−1 implicit.) 3.2 Pose opinion Based on its current texture template (derived from the history of poses and images up to time t−1) and the new image yt , each expert u1:t−1 has a pose opinion, p(ut |u1:t−1 , y1:t ), a probability distribution representing that expert’s beliefs about the pose at time t. Since the effect of ut on the likelihood function is nonlinear, we will not attempt to find an analytical solution for the pose opinion distribution. 
However, due to the spatio-temporal smoothness of video signals, it is possible to estimate the peak and variance of an expert’s pose opinion. 3.2.1 Estimating the peak of an expert’s pose opinion We want to estimate ut (u1:t−1 ), the value of ut that maximizes the pose opinion. Since ˆ p(ut | u1:t−1 , y1:t ) = p(y1:t−1 | u1:t−1 ) p(ut | ut−1 ) p(yt | u1:t , y1:t−1 ), p(y1:t | u1:t−1 ) (11) def ut (u1:t−1 ) = argmax p(ut | u1:t−1 , y1:t ) = argmax p(ut | ut−1 ) p(yt | u1:t , y1:t−1 ). ˆ ut ut (12) We now need an expression for the final term in (12), the predictive distribution p(yt | u1:t , y1:t−1 ). By integrating out the hidden texture variables from p(yt , vt , bt | u1:t , y1:t−1 ), and using the conditional independence relationships defined by the graphical model (Figure 1, right), we can derive: 1 m log p(yt | u1:t , y1:t−1 ) = − log 2π − log |V ar(Yt | u1:t , y1:t−1 )| 2 2 n v 2 1 (yt (xi (ut )) − µt (i)) 1 (yt (j) − µb (j))2 t − − , (13) v (i) + σ b 2 i=1 σt 2 σt (j) + σw w j∈X (ut ) where xi (ut ) is the image pixel rendered by the ith object vertex when the object assumes pose ut , and X (ut ) is the set of all image pixels rendered by the object under pose ut . Combining (12) and (13), we can derive ut (u1:t−1 ) = argmin − log p(ut | ut−1 ) ˆ (14) ut + 1 2 n i=1 [yt (xi (ut )) − µv (i)]2 [yt (xi (ut )) − µb (xi (ut ))]2 t t b − − log[σt (xi (ut )) + σw ] v b σt (i) + σw σt (xi (ut )) + σw Foreground term Background terms Note the similarity between (14) and constrained optic flow (3). For example, focus on the foreground term in (14) and ignore the weights in the denominator. The previous image yt−1 from (3) has been replaced by µv (·), the estimated object texture based on the images t and poses up to time t − 1. As in optic flow, we can find the pose estimate ut (u1:t−1 ) ˆ efficiently using the Gauss-Newton method. 3.2.2 Estimating the distribution of an expert’s pose opinion We estimate the distribution of an expert’s pose opinion using a combination of Laplace’s method and importance sampling. Suppose at time t − 1 we are given a sample of experts (d) (d) indexed by d, each endowed with a pose sequence u1:t−1 , a weight wt−1 , and the means and variances of Gaussian distributions for object and background texture. For each expert (d) (d) u1:t−1 , we use (14) to compute ut , the peak of the pose distribution at time t according ˆ (d) to that expert. Define σt as the inverse Hessian matrix of (14) at this peak, the Laplace ˆ estimate of the covariance matrix of the expert’s opinion. We then generate a set of s (d,e) (d) independent samples {ut : e = 1, · · · , s} from a Gaussian distribution with mean ut ˆ (d) (d) (d) and variance proportional to σt , g(·|ˆt , αˆt ), where the parameter α > 0 determines ˆ u σ the sharpness of the sampling distribution. (Note that letting α → 0 would be equivalent to (d,e) (d) simply setting the new pose equal to the peak of the pose opinion, ut = ut .) To find ˆ the parameters of this Gaussian proposal distribution, we use the Gauss-Newton method, ignoring the second of the two background terms in (14). (This term is not ignored in the importance sampling step.) To refine our estimate of the pose opinion we use importance sampling. 
We assign each sample from the proposal distribution an importance weight wt (d, e) that is proportional to the ratio between the posterior distribution and the proposal distribution: s (d) p(ut | u1:t−1 , y1:t ) = ˆ (d,e) δ(ut − ut ) wt (d, e) s f =1 wt (d, f ) (15) e=1 (d,e) (d) (d) (d,e) p(ut | ut−1 )p(yt | u1:t−1 , ut , y1:t−1 ) wt (d, e) = (16) (d,e) (d) (d) g(ut | ut , αˆt ) ˆ σ (d,e) (d) The numerator of (16) is proportional to p(ut |u1:t−1 , y1:t ) by (12), and the denominator of (16) is the sampling distribution. 3.3 Estimating an expert’s credibility (d) The credibility of the dth expert, p(u1:t−1 | y1:t ), is proportional to the product of a prior term and a likelihood term: (d) (d) p(u1:t−1 | y1:t−1 )p(yt | u1:t−1 , y1:t−1 ) (d) p(u1:t−1 | y1:t ) = . (17) p(yt | y1:t−1 ) Regarding the likelihood, p(yt |u1:t−1 , y1:t−1 ) = p(yt , ut |u1:t−1 , y1:t−1 )dut = p(yt |u1:t , y1:t−1 )p(ut |ut−1 )dut (18) (d,e) We already generated a set of samples {ut : e = 1, · · · , s} that estimate the pose opin(d) ion of the dth expert, p(ut | u1:t−1 , y1:t ). We can now use these samples to estimate the likelihood for the dth expert: (d) p(yt | u1:t−1 , y1:t−1 ) = (d) (d) p(yt | u1:t−1 , ut , y1:t−1 )p(ut | ut−1 )dut (19) (d) (d) (d) (d) = p(yt | u1:t−1 , ut , y1:t−1 )g(ut | ut , αˆt ) ˆ σ 3.4 p(ut | ut−1 ) s e=1 dut ≈ wt (d, e) s Updating the filtering distribution g(ut | (d) (d) ut , αˆt ) ˆ σ Once we have calculated the opinion and credibility of each expert u1:t−1 , we evaluate the integral in (5) as a weighted sum over experts. The credibilities of all of the experts are normalized to sum to 1. New experts u1:t (children) are created from the old experts u1:t−1 (parents) by appending a pose ut to the parent’s history of poses u1:t−1 . Every expert in the new generation is created as follows: One parent is chosen to sire the child. The probability of being chosen is proportional to the parent’s credibility. The child’s value of ut is chosen at random from its parent’s pose opinion (the weighted samples described in Section 3.2.2). 4 Relation to Optic Flow and Template Matching In basic template-matching, the same time-invariant texture map is used to track every frame in the video sequence. Optic flow can be thought of as template-matching with a template that is completely reset at each frame for use in the subsequent frame. In most cases, optimal inference under G-flow involves a combination of optic flow-based and template-based tracking, in which the texture template gradually evolves as new images are presented. Pure optic flow and template-matching emerge as special cases. Optic Flow as a Special Case Suppose that the pose transition probability p(ut | ut−1 ) is uninformative, that the background is uninformative, that every texel in the initial object texture map has equal variance, V ar(V1 ) = κIn , and that the texture transition uncertainty is very high, Ψv → diag(∞). Using (7), (8), and (10), it follows that: µv (i) = [av (ut−1 )]T yt−1 = yt−1 (xi (ut−1 )) , t (20) i.e., the object texture map at time t is determined by the pixels from image yt−1 that according to pose ut−1 were rendered by the object. As a result, (14) reduces to: ut (u1:t−1 ) = argmin ˆ ut 1 2 n yt (xi (ut )) − yt−1 (xi (ut−1 )) 2 (21) i=1 which is identical to (3). Thus constrained optic flow [10, 2, 11] is simply a special case of optimal inference under G-flow, with a single expert and with sampling parameter α → 0. 
The key assumption that Ψv → diag(∞) means that the object’s texture is very different in adjacent frames. However, optic flow is typically applied in situations in which the object’s texture in adjacent frames is similar. The optimal solution in such situations calls not for optic flow, but for a texture map that integrates information across multiple frames. Template Matching as a Special Case Suppose the initial texture map is known precisely, V ar(V1 ) = 0, and the texture transition uncertainty is very low, Ψv → 0. By (7), (8), and (10), it follows that µv (i) = µv (i) = µv (i), i.e., the texture map does not change t t−1 1 over time, but remains fixed at its initial value (it is a texture template). Then (14) becomes: n yt (xi (ut )) − µv (i) 1 ut (u1:t−1 ) = argmin ˆ ut 2 (22) i=1 where µv (i) is the ith texel of the fixed texture template. This is the error function mini1 mized by standard template-matching algorithms. The key assumption that Ψv → 0 means the object’s texture is constant from each frame to the next, which is rarely true in real data. G-flow provides a principled way to relax this unrealistic assumption of template methods. General Case In general, if the background is uninformative, then minimizing (14) results in a weighted combination of optic flow and template matching, with the weight of each approach depending on the current level of certainty about the object template. In addition, when there is useful information in the background, G-flow infers a model of the background which is used to improve tracking. Figure 2: G-flow tracking an outdoor video. Results are shown for frames 1, 81, and 620. 5 Simulations We collected a video (30 frames/sec) of a subject in an outdoor setting who made a variety of facial expressions while moving her head. A later motion-capture session was used to create a 3D morphable model of her face, consisting of a set of 5 morph bases (k = 5). Twenty experts were initialized randomly near the correct pose on frame 1 of the video and propagated using G-flow inference (assuming an uninformative background). See http://mplab.ucsd.edu for video. Figure 2 shows the distribution of experts for three frames. In each frame, every expert has a hypothesis about the pose (translation, rotation, scale, and morph coefficients). The 38 points in the model are projected into the image according to each expert’s pose, yielding 760 red dots in each frame. In each frame, the mean of the experts gives a single hypothesis about the 3D non-rigid deformation of the face (lower right) as well as the rigid pose of the face (rotated 3D axes, lower left). Notice G-flow’s ability to recover from error: bad initial hypotheses are weeded out, leaving only good hypotheses. To compare G-flow’s performance versus deterministic constrained optic flow algorithms such as [10, 2, 11] , we used both G-flow and the method from [2] to track the same video sequence. We ran each tracker several times, introducing small errors in the starting pose. Figure 3: Average error over time for G-flow (green) and for deterministic optic flow [2] (blue). Results were averaged over 16 runs (deterministic algorithm) or 4 runs (G-flow) and smoothed. As ground truth, the 2D locations of 6 points were hand-labeled in every 20th frame. The error at every 20th frame was calculated as the distance from these labeled locations to the inferred (tracked) locations, averaged across several runs. 
Figure 3 compares this tracking error as a function of time for the deterministic constrained optic flow algorithm and for a 20-expert version of the G-flow tracking algorithm. Notice that the deterministic system has a tendency to drift (increase in error) over time, whereas G-flow can recover from drift. Acknowledgments Tim K. Marks was supported by NSF grant IIS-0223052 and NSF grant DGE-0333451 to GWC. John Hershey was supported by the UCDIMI grant D00-10084. J. Cooper Roddey was supported by the Swartz Foundation. Javier R. Movellan was supported by NSF grants IIS-0086107, IIS-0220141, and IIS-0223052, and by the UCDIMI grant D00-10084. References [1] Simon Baker and Iain Matthews. Lucas-kanade 20 years on: A unifying framework. International Journal of Computer Vision, 56(3):221–255, 2002. [2] M. Brand. Flexible flow for 3D nonrigid tracking and shape recovery. In CVPR, volume 1, pages 315–322, 2001. [3] H. Chen, P. Kumar, and J. van Schuppen. On Kalman filtering for conditionally gaussian systems with random matrices. Syst. Contr. Lett., 13:397–404, 1989. [4] R. Chen and J. Liu. Mixture Kalman filters. J. R. Statist. Soc. B, 62:493–508, 2000. [5] A. Doucet and C. Andrieu. Particle filtering for partially observed gaussian state space models. J. R. Statist. Soc. B, 64:827–838, 2002. [6] A. Doucet, N. de Freitas, K. Murphy, and S. Russell. Rao-blackwellised particle filtering for dynamic bayesian networks. In 16th Conference on Uncertainty in AI, pages 176–183, 2000. [7] A. Doucet, S. J. Godsill, and C. Andrieu. On sequential monte carlo sampling methods for bayesian filtering. Statistics and Computing, 10:197–208, 2000. [8] Zoubin Ghahramani and Geoffrey E. Hinton. Variational learning for switching state-space models. Neural Computation, 12(4):831–864, 2000. [9] B. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Proceedings of the International Joint Conference on Artificial Intelligence, 1981. [10] L. Torresani, D. Yang, G. Alexander, and C. Bregler. Tracking and modeling non-rigid objects with rank constraints. In CVPR, pages 493–500, 2001. [11] Lorenzo Torresani, Aaron Hertzmann, and Christoph Bregler. Learning non-rigid 3d shape from 2d motion. In Advances in Neural Information Processing Systems 16. MIT Press, 2004.

3 0.66652215 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection

Author: Koji Tsuda, Gunnar Rätsch, Manfred K. Warmuth

Abstract: We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated with the von Neumann divergence. Rather than treating the most general case, we focus on two key applications that exemplify our methods: On-line learning with a simple square loss and finding a symmetric positive definite matrix subject to symmetric linear constraints. The updates generalize the Exponentiated Gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). The generalized updates use matrix logarithms and exponentials to preserve positive definiteness. Most importantly, we show how the analysis of each algorithm generalizes to the non-diagonal case. We apply both new algorithms, called the Matrix Exponentiated Gradient (MEG) update and DefiniteBoost, to learn a kernel matrix from distance measurements.

4 0.66453886 166 nips-2004-Semi-supervised Learning via Gaussian Processes

Author: Neil D. Lawrence, Michael I. Jordan

Abstract: We present a probabilistic approach to learning a Gaussian Process classifier in the presence of unlabeled data. Our approach involves a “null category noise model” (NCNM) inspired by ordered categorical noise models. The noise model reflects an assumption that the data density is lower between the class-conditional densities. We illustrate our approach on a toy problem and present comparative results for the semi-supervised classification of handwritten digits. 1

5 0.51878244 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

Author: Ofer Dekel, Shai Shalev-shwartz, Yoram Singer

Abstract: Prediction suffix trees (PST) provide a popular and effective tool for tasks such as compression, classification, and language modeling. In this paper we take a decision theoretic view of PSTs for the task of sequence prediction. Generalizing the notion of margin to PSTs, we present an online PST learning algorithm and derive a loss bound for it. The depth of the PST generated by this algorithm scales linearly with the length of the input. We then describe a self-bounded enhancement of our learning algorithm which automatically grows a bounded-depth PST. We also prove an analogous mistake-bound for the self-bounded algorithm. The result is an efficient algorithm that neither relies on a-priori assumptions on the shape or maximal depth of the target PST nor does it require any parameters. To our knowledge, this is the first provably-correct PST learning algorithm which generates a bounded-depth PST while being competitive with any fixed PST determined in hindsight. 1

6 0.51718867 79 nips-2004-Hierarchical Eigensolver for Transition Matrices in Spectral Methods

7 0.5153895 178 nips-2004-Support Vector Classification with Input Data Uncertainty

8 0.5137611 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning

9 0.51252264 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill

10 0.51250947 69 nips-2004-Fast Rates to Bayes for Kernel Machines

11 0.51216674 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning

12 0.51211095 131 nips-2004-Non-Local Manifold Tangent Learning

13 0.51164001 16 nips-2004-Adaptive Discriminative Generative Model and Its Applications

14 0.5112707 167 nips-2004-Semi-supervised Learning with Penalized Probabilistic Clustering

15 0.51047772 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

16 0.51044071 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification

17 0.51031381 68 nips-2004-Face Detection --- Efficient and Rank Deficient

18 0.51010424 168 nips-2004-Semigroup Kernels on Finite Sets

19 0.5095768 187 nips-2004-The Entire Regularization Path for the Support Vector Machine

20 0.5089106 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform