nips nips2004 nips2004-110 knowledge-graph by maker-knowledge-mining

110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection


Source: pdf

Author: Koji Tsuda, Gunnar Rätsch, Manfred K. Warmuth

Abstract: We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated with the von Neumann divergence. Rather than treating the most general case, we focus on two key applications that exemplify our methods: On-line learning with a simple square loss and finding a symmetric positive definite matrix subject to symmetric linear constraints. The updates generalize the Exponentiated Gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). The generalized updates use matrix logarithms and exponentials to preserve positive definiteness. Most importantly, we show how the analysis of each algorithm generalizes to the non-diagonal case. We apply both new algorithms, called the Matrix Exponentiated Gradient (MEG) update and DefiniteBoost, to learn a kernel matrix from distance measurements.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We address the problem of learning a symmetric positive definite matrix. [sent-10, score-0.255]

2 The central issue is to design parameter updates that preserve positive definiteness. [sent-11, score-0.167]

3 Our updates are motivated with the von Neumann divergence. [sent-12, score-0.171]

4 Rather than treating the most general case, we focus on two key applications that exemplify our methods: On-line learning with a simple square loss and finding a symmetric positive definite matrix subject to symmetric linear constraints. [sent-13, score-0.648]

5 The updates generalize the Exponentiated Gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). [sent-14, score-0.86]

6 The generalized updates use matrix logarithms and exponentials to preserve positive definiteness. [sent-15, score-0.237]

7 We apply both new algorithms, called the Matrix Exponentiated Gradient (MEG) update and DefiniteBoost, to learn a kernel matrix from distance measurements. [sent-17, score-0.352]

8 More specifically, when learning a similarity or a distance function among objects, the parameters are defined as a symmetric positive definite matrix that serves as a kernel (e. [sent-20, score-0.501]

9 Learning is typically formulated as a parameter updating procedure to optimize a loss function. [sent-23, score-0.148]

10 The gradient descent update [6] is one of the most commonly used algorithms, but it is not appropriate when the parameters form a positive definite matrix, because the updated parameter is not necessarily positive definite. [sent-24, score-0.421]

11 In this paper, we introduce the Matrix Exponentiated Gradient update which works as follows: First, the matrix logarithm of the current parameter matrix is computed. [sent-28, score-0.333]

12 Finally, the parameter matrix is updated to the exponential of the modified log-matrix. [sent-30, score-0.183]

13 Our update preserves symmetry and positive definiteness because the matrix exponential maps any symmetric matrix to a positive definite matrix. [sent-31, score-0.668]

14 A learning problem is essentially defined by a loss function, and a divergence that measures the discrepancy between parameters. [sent-33, score-0.266]

15 More precisely, the updates are motivated by minimizing the sum of the loss function and the Bregman divergence, where the loss function is multiplied by a positive learning rate. [sent-34, score-0.404]

16 For example, the gradient descent is derived from the squared Euclidean distance, and the exponentiated gradient from the Kullback-Leibler divergence. [sent-36, score-0.334]

17 We use the von Neumann divergence (also called quantum relative entropy) for measuring the discrepancy between two positive definite matrices [8]. [sent-37, score-0.522]

18 We derive a new Matrix Exponentiated Gradient update from this divergence (which is a Bregman divergence for positive definite matrices). [sent-38, score-0.42]

19 Finally we prove relative loss bounds using the von Neumann divergence as a measure of progress. [sent-39, score-0.407]

20 Also the following related key problem has received a lot of attention recently [14, 11, 13]: Find a symmetric positive definite matrix that satisfies a number of symmetric linear inequality constraints. [sent-40, score-0.57]

21 When choosing F(W) = tr(W log W − W), then f(W) = log W and the corresponding Bregman divergence becomes the von Neumann divergence [8]: ∆F(W̃, W) = tr(W̃ log W̃ − W̃ log W − W̃ + W). [sent-45, score-0.567]

22 In this case, the symmetric positive definite matrices are related to the density matrices commonly used in Statistical Physics, and the divergence simplifies to ∆F(W̃, W) = tr(W̃ log W̃ − W̃ log W). [sent-47, score-0.66]

23 If W = Σ_i λ_i v_i v_iᵀ is our notation for the eigenvalue decomposition, then we can rewrite the normalized divergence as ∆F(W̃, W) = Σ_i λ̃_i ln λ̃_i − Σ_{i,j} λ̃_i ln λ_j (ṽ_iᵀ v_j)². [sent-48, score-0.14]

24 So this divergence quantifies the difference in the eigenvalues as well as the eigenvectors. [sent-49, score-0.16]
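As an illustration, the divergence can be evaluated either from matrix logarithms or from the two eigendecompositions; the following numpy sketch (function names are ours, not from the paper) checks that both forms agree.

```python
import numpy as np

def sym_logm(W):
    # matrix logarithm of a symmetric positive definite matrix via eigendecomposition
    lam, V = np.linalg.eigh(W)
    return (V * np.log(lam)) @ V.T

def von_neumann_divergence(W_tilde, W):
    # Delta_F(W~, W) = tr(W~ log W~ - W~ log W) for trace-one PD matrices
    return float(np.trace(W_tilde @ sym_logm(W_tilde) - W_tilde @ sym_logm(W)))

def von_neumann_divergence_eig(W_tilde, W):
    # eigendecomposition form: sum_i l~_i ln l~_i - sum_{ij} l~_i ln l_j (v~_i . v_j)^2
    lt, Vt = np.linalg.eigh(W_tilde)
    l, V = np.linalg.eigh(W)
    overlap = (Vt.T @ V) ** 2  # overlap[i, j] = (v~_i . v_j)^2
    return float(np.sum(lt * np.log(lt)) - np.sum(lt[:, None] * np.log(l)[None, :] * overlap))
```

Both forms are zero when the two matrices coincide and nonnegative otherwise, as expected of a Bregman divergence.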

25 On-line Learning: In this section, we present a natural extension of the Exponentiated Gradient (EG) update [6] to an update for symmetric positive definite matrices. [sent-50, score-0.467]

26 At the t-th trial, the algorithm receives a symmetric instance matrix Xt ∈ Rd×d . [sent-51, score-0.274]

27 It then produces a prediction ŷ_t = tr(W_t X_t) based on the algorithm’s current symmetric positive definite parameter matrix W_t. [sent-52, score-0.581]

28 Finally it incurs a quadratic loss (ŷ_t − y_t)². For the sake of simplicity, we use the simple quadratic loss: L_t(W) = (tr(X_t W) − y_t)². [sent-53, score-0.515]

29 For the general update, the gradient ∇L_t(W_t) is exponentiated in the update (4) and this gradient must be symmetric. [sent-54, score-0.416]

30 In the update we aim to solve the following problem: W_{t+1} = argmin_W ∆F(W, W_t) + η(tr(W X_t) − y_t)², (2) where the convex function F defines the Bregman divergence. [sent-57, score-0.338]

31 Setting the derivative with respect to W to zero, we have f(W_{t+1}) − f(W_t) + η ∇[(tr(W_{t+1} X_t) − y_t)²] = 0. [sent-58, score-0.198]

32 Then, we have the following update: W_{t+1} = f⁻¹(f(W_t) − 2η(ŷ_t − y_t) X_t). [sent-61, score-0.198]

33 In our case, F(W) = tr(W log W − W) and thus f(W) = log W and f⁻¹(W) = exp W. [sent-62, score-0.204]

34 We also augment (2) with the constraint tr(W) = 1, leading to the following Matrix Exponentiated Gradient (MEG) Update: W_{t+1} = (1/Z_t) exp(log W_t − 2η(ŷ_t − y_t) X_t), (4) where the normalization factor Z_t is tr[exp(log W_t − 2η(ŷ_t − y_t) X_t)]. [sent-63, score-0.428]
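A minimal numpy sketch of one such step (our own illustration, with eigendecompositions standing in for the matrix log and exp):

```python
import numpy as np

def sym_funm(W, f):
    # apply a scalar function to a symmetric matrix through its eigendecomposition
    lam, V = np.linalg.eigh(W)
    return (V * f(lam)) @ V.T

def meg_update(W_t, X_t, y_t, eta):
    # one MEG step, Eq. (4): exponentiate the shifted log-parameter,
    # then renormalize the trace to one (division by Z_t)
    y_hat = float(np.trace(W_t @ X_t))  # prediction
    G = sym_funm(W_t, np.log) - 2.0 * eta * (y_hat - y_t) * X_t
    W_new = sym_funm(G, np.exp)
    return W_new / np.trace(W_new)
```

The result is symmetric, positive definite, and trace one by construction, whatever symmetric instance matrix is presented.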

35 Note that in the above update, the exponent log W_t − 2η(ŷ_t − y_t) X_t is an arbitrary symmetric matrix, and the matrix exponential converts this matrix back into a symmetric positive definite matrix. [sent-64, score-1.04]

36 A numerically stable version of the MEG update is given in Section 3. [sent-65, score-0.152]

37 Relative Loss Bounds: We now begin with the definitions needed for the relative loss bounds. [sent-68, score-0.162]

38 Let (X_1, y_1), . . . , (X_T, y_T) denote a sequence of examples, where the instance matrices X_t ∈ R^{d×d} are symmetric and the labels y_t ∈ R. [sent-72, score-0.455]

39 For any symmetric positive semi-definite matrix U with tr(U) = 1, define its total loss as L_U(S) = Σ_{t=1}^T (tr(U X_t) − y_t)². [sent-73, score-0.572]

40 The total loss of the on-line algorithm is L_MEG(S) = Σ_{t=1}^T (tr(W_t X_t) − y_t)². [sent-74, score-0.317]

41 We prove a bound on the relative loss LMEG (S) − LU (S) that holds for any U. [sent-75, score-0.194]

42 The proof generalizes a similar bound for the Exponentiated Gradient update (Lemmas 5. [sent-76, score-0.17]

43 The relative loss bound is derived in two steps: Lemma 3. [sent-79, score-0.194]

44 1 bounds the relative loss for an individual trial and Lemma 3. [sent-80, score-0.205]

45 Let Xt be any symmetric matrix whose smallest and largest eigenvalues satisfy λmax − λmin ≤ r. [sent-84, score-0.317]

46 Assume Wt+1 is produced from Wt by the MEG update and let U be any symmetric positive semi-definite matrix. [sent-85, score-0.361]

47 We use the Golden–Thompson inequality, i.e., tr[exp(A + B)] ≤ tr[exp(A) exp(B)] for symmetric matrices A and B. [sent-88, score-0.257]

48 We also needed to prove the following generalization of Jensen’s inequality to matrices: exp(ρ1 A + ρ2 (I − A)) ≤ exp(ρ1 )A + exp(ρ2 )(I − A) for finite ρ1 , ρ2 ∈ R and any symmetric matrix A with 0 < A ≤ I. [sent-89, score-0.315]

49 Let W_1 and U be arbitrary symmetric positive definite initial and comparison matrices, respectively. [sent-92, score-0.255]

50 Assuming L_U(S) ≤ ℓ_max and ∆F(U, W_1) ≤ d_max, the bound (6) is tightest when c = r√(2 d_max/ℓ_max). [sent-97, score-0.162]

51 Numerically stable MEG update: The MEG update is numerically unstable when the eigenvalues of W_t are around zero. [sent-100, score-0.301]

52 However we can “unwrap” W_{t+1} as follows: W_{t+1} = (1/Z̃_t) exp(c_t I + log W_1 − 2η Σ_{s=1}^t (ŷ_s − y_s) X_s), (7) where the constant Z̃_t normalizes the trace of W_{t+1} to one. [sent-101, score-0.144]

53 Note that the update is independent of the choice of ct ∈ R. [sent-103, score-0.204]

54 We incrementally maintain an eigenvalue decomposition of the matrix in the exponent (O(n³) per iteration): V_t Λ_t V_tᵀ = c_t I + log W_1 − 2η Σ_{s=1}^t (ŷ_s − y_s) X_s, where the constant c_t is chosen so that the maximum eigenvalue of the above is zero. [sent-104, score-0.459]
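The stabilization can be sketched as follows (our naming; the running exponent S plays the role of c_t I + log W_1 − 2η Σ_s (ŷ_s − y_s) X_s):

```python
import numpy as np

def stable_meg_step(S, X_t, y_t, eta):
    # Shifting the largest eigenvalue of the exponent to zero before
    # exponentiating avoids overflow; the shift cancels in the trace
    # normalization, since the update is independent of c_t.
    lam, V = np.linalg.eigh(S)
    W = (V * np.exp(lam - lam.max())) @ V.T
    W /= np.trace(W)
    y_hat = float(np.trace(W @ X_t))          # predict with the current W_t
    S_new = S - 2.0 * eta * (y_hat - y_t) * X_t
    return S_new, W
```

Adding any multiple of the identity to S leaves the produced parameter matrix unchanged, which is exactly the freedom in choosing c_t.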

55 tr(W C_j) ≤ 0 for j = 1, . . . , n, (8) where the symmetric positive definite matrix W_1 of trace one is the initial parameter matrix, and C_1, . . . , C_n are symmetric constraint matrices. [sent-109, score-0.433]

56 Prior knowledge about W is encoded in the constraints, and the matrix closest to W1 is chosen among the matrices satisfying all constraints. [sent-113, score-0.181]

57 Tsuda and Noble [13] employed this approach for learning a kernel matrix among graph nodes, and this method can be potentially applied to learn a kernel matrix in other settings (e. [sent-114, score-0.35]

58 The problem (8) is a projection of W1 to the intersection of convex regions defined by the constraints. [sent-117, score-0.16]

59 It is well known that the Bregman projection into the intersection of convex regions can be solved by sequential projections to each region [1]. [sent-118, score-0.217]

60 We generalize the latter algorithm and its analysis to symmetric positive definite matrices and call the new algorithm DefiniteBoost. [sent-121, score-0.337]

61 Note that if η is large then the on-line update (2) becomes a Bregman projection subject to a single equality constraint tr(W X_t) = y_t. [sent-123, score-0.426]

62 Before presenting the algorithm, let us derive the dual problem of (8) by means of Lagrange multipliers γ: γ* = argmin_γ log tr[exp(log W_1 − Σ_{j=1}^n γ_j C_j)], subject to γ_j ≥ 0. (9) [sent-129, score-0.16]

63 Note that we can use the same numerically stable update as in the previous section. [sent-141, score-0.152]

64 However, one can use the following approximate solution: α_t = (1/(λ_t^max − λ_t^min)) log[(1 + r_t/λ_t^max)/(1 + r_t/λ_t^min)], (13) when the eigenvalues of C_t lie in the interval [λ_t^min, λ_t^max] and r_t = tr(W_t C_t). [sent-144, score-0.549]
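As a sketch (function name is ours), the approximate step size of Eq. (13) only needs the extreme eigenvalues of C_t and the violation r_t:

```python
import numpy as np

def approx_alpha(W_t, C_t):
    # approximate DefiniteBoost step size, Eq. (13):
    # alpha = log[(1 + r/l_max) / (1 + r/l_min)] / (l_max - l_min)
    lam = np.linalg.eigvalsh(C_t)          # ascending eigenvalues
    l_min, l_max = lam[0], lam[-1]
    r = float(np.trace(W_t @ C_t))
    return np.log((1 + r / l_max) / (1 + r / l_min)) / (l_max - l_min)
```

With λ_min = −1 and λ_max = 1 this reduces to the familiar AdaBoost step size (1/2) log((1 + r)/(1 − r)).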

65 Since the most unsatisfied constraint is chosen, r_t ≥ 0 and thus α_t ≥ 0. [sent-145, score-0.172]

66 Although the projection is done only approximately, the convergence of the dual objective (9) can be shown using the following upper bound. [sent-146, score-0.264]

67 The approximate Bregman projection (with α_t as in (13)) can also be motivated as an on-line algorithm based on an entropic loss and learning rate one (following Section 3 and [4]). [sent-147, score-0.261]

68 The dual objective (9) is bounded as tr[exp(log W_1 − Σ_{j=1}^n γ_j C_j)] ≤ Π_{t=1}^T ρ(r_t), (14) where ρ(r_t) = (1 − r_t/λ_t^max)^{λ_t^max/(λ_t^max − λ_t^min)} (1 − r_t/λ_t^min)^{−λ_t^min/(λ_t^max − λ_t^min)}. [sent-149, score-0.84]

69 The dual objective is monotonically decreasing, because ρ(rt ) ≤ 1. [sent-150, score-0.132]

70 Also, since r_t corresponds to the maximum value among all constraint violations {r_j}_{j=1}^n, we have ρ(r_t) = 1 only if r_t = 0. [sent-151, score-0.312]

71 Thus the dual objective continues to decrease until all constraints are satisfied. [sent-152, score-0.132]

72 For the j-th hypothesis h_j(x), let us define C_j = diag(y_1 h_j(x_1), . . . , y_d h_j(x_d)). [sent-159, score-0.196]

73 Setting W_1 = I/d, the dual objective (14) is rewritten as (1/d) Σ_{i=1}^d exp(−y_i Σ_{j=1}^n γ_j h_j(x_i)), which is equivalent to the exponential loss function used in AdaBoost. [sent-164, score-0.446]

74 Since Cj and W1 are diagonal, the matrix Wt stays diagonal after the update. [sent-165, score-0.133]

75 If wti = [Wt ]ii , the updating formula (12) becomes the AdaBoost update: wt+1,i = wti exp(−αt yi ht (xi ))/Zt (αt ). [sent-166, score-0.18]
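In this diagonal special case the update reduces to the familiar AdaBoost reweighting; a minimal sketch (our naming), with margins[i] = y_i h_t(x_i):

```python
import numpy as np

def adaboost_step(w, margins, alpha):
    # w_i <- w_i * exp(-alpha * y_i h_t(x_i)) / Z_t(alpha)
    w_new = w * np.exp(-alpha * margins)
    return w_new / w_new.sum()
```

The normalization by Z_t keeps the diagonal of W_t a probability vector, mirroring the trace-one constraint in the matrix case.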

76 The approximate solution of α_t (13) is described as α_t = (1/2) log((1 + r_t)/(1 − r_t)), where r_t = Σ_{i=1}^d w_{t,i} y_i h_t(x_i) is the weighted training error of the t-th hypothesis. [sent-167, score-0.226]

77 Experiments on Learning Kernels: In this section, our technique is applied to learning a kernel matrix from a set of distance measurements. [sent-170, score-0.246]

78 When K is a d × d kernel matrix among d objects, the entry K_ij characterizes the similarity between objects i and j. [sent-172, score-0.203]

79 In the feature space, Kij corresponds to the inner product between object i and j, and thus the Euclidean distance can be computed from the entries of the kernel matrix [10]. [sent-173, score-0.246]

80 In some cases, the kernel matrix is not given explicitly, but only a set of distance measurements is available. [sent-174, score-0.246]

81 Our task is to obtain a positive definite kernel matrix which fits the given distance data well. [sent-181, score-0.326]

82 On-line kernel learning: In the first experiment, we consider the on-line learning scenario in which only one distance example is shown to the learner at each time step. [sent-182, score-0.147]

83 The distance example at time t is described as {at , bt , yt }, which indicates that the squared Euclidean distance between objects at and bt is yt . [sent-183, score-0.696]

84 Let us define a time-developing sequence of kernel matrices as {Wt }T , and the corresponding points in the feature space as {xti }d (i. [sent-184, score-0.158]

85 Then, the total loss incurred by this sequence is Σ_{t=1}^T (½‖x_{t,a_t} − x_{t,b_t}‖² − ½y_t)² = Σ_{t=1}^T (tr(W_t X_t) − ½y_t)², [sent-187, score-0.515]

86 where X_t is a symmetric matrix whose (a_t, a_t) and (b_t, b_t) elements are 0.5, whose (a_t, b_t) and (b_t, a_t) elements are −0.5, and whose other elements are zero, so that tr(W_t X_t) = ½‖x_{t,a_t} − x_{t,b_t}‖². [sent-215, score-0.339]
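Assuming this reading of the construction (entries ±0.5, which is our reconstruction of the partially garbled text), the instance matrix can be built and checked against an explicit kernel:

```python
import numpy as np

def distance_instance(d, a, b):
    # X_t for a distance example (a, b): (a,a)=(b,b)=0.5, (a,b)=(b,a)=-0.5,
    # so that tr(W X_t) = (W_aa + W_bb)/2 - W_ab = half the squared distance
    X = np.zeros((d, d))
    X[a, a] = X[b, b] = 0.5
    X[a, b] = X[b, a] = -0.5
    return X
```

For a kernel built from explicit points, tr(W X_t) then recovers half the squared Euclidean distance between the two chosen points.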

87 We consider a controlled experiment in which the distance examples are created from a known target kernel matrix. [sent-218, score-0.187]

88 We used a 52 × 52 kernel matrix among gyrB proteins of bacteria (d = 52). [sent-219, score-0.257]

89 When the comparison matrix U is set to the target matrix, L_U(S) = 0 and ℓ_max = 0, because all the distance examples are derived from the target matrix. [sent-223, score-0.299]

90 Therefore we choose learning rate η = 2, which minimizes the relative loss bound of Lemma 3. [sent-224, score-0.194]

91 The total loss of the kernel matrix sequence obtained by the matrix exponentiated gradient update is shown in Figure 2 (left). [sent-226, score-0.528]

92 In the plot, we have also shown the relative loss bound. [sent-227, score-0.162]

93 To evaluate the learned kernel matrix, the prediction accuracy for bacteria species with a nearest neighbor classifier is calculated (Figure 2, right), where the 52 proteins are randomly divided into 50% training and 50% testing data. [sent-229, score-0.245]

94 Kernel learning by Bregman projection: Next, let us consider a batch learning scenario where we have a set of qualitative distance evaluations (i.e., inequality constraints). [sent-245, score-0.186]

95 As in the previous section, distance examples are generated from the target kernel matrix between gyrB proteins. [sent-245, score-0.286]

96 Figure 3 (left) shows the convergence of the dual objective function as proven in Theorem 4. [sent-248, score-0.174]

97 As opposed to the previous experiment, the error rate is higher than that of the target kernel matrix, because a substantial amount of information is lost by the conversion to inequality constraints. [sent-252, score-0.157]

98 Conclusion: We motivated and analyzed a new update for symmetric positive definite matrices using the von Neumann divergence. [sent-264, score-0.556]

99 We showed that the standard bounds for on-line learning and Boosting generalize to the case when the parameters are a symmetric positive definite matrix (of trace one) instead of a probability vector. [sent-265, score-0.447]

100 The em algorithm for kernel matrix completion with auxiliary data. [sent-362, score-0.175]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('wt', 0.501), ('tr', 0.389), ('bregman', 0.313), ('yt', 0.198), ('symmetric', 0.175), ('exponentiated', 0.158), ('rt', 0.14), ('meg', 0.128), ('xt', 0.128), ('loss', 0.119), ('divergence', 0.117), ('lmeg', 0.107), ('niteboost', 0.107), ('update', 0.106), ('matrix', 0.099), ('dual', 0.098), ('ct', 0.098), ('cj', 0.092), ('projection', 0.09), ('neumann', 0.09), ('adaboost', 0.085), ('von', 0.085), ('quantum', 0.085), ('matrices', 0.082), ('nite', 0.08), ('exp', 0.08), ('positive', 0.08), ('lu', 0.078), ('gradient', 0.076), ('kernel', 0.076), ('distance', 0.071), ('bt', 0.065), ('argminw', 0.064), ('uxt', 0.064), ('wti', 0.064), ('log', 0.062), ('updates', 0.058), ('projections', 0.057), ('unsatis', 0.056), ('bacteria', 0.056), ('zt', 0.055), ('hj', 0.054), ('tsuda', 0.054), ('boosting', 0.054), ('kivinen', 0.051), ('trace', 0.05), ('max', 0.049), ('divergences', 0.047), ('vt', 0.047), ('numerically', 0.046), ('eigenvalues', 0.043), ('bounds', 0.043), ('relative', 0.043), ('gyrb', 0.043), ('tightness', 0.043), ('wcj', 0.043), ('convergence', 0.042), ('inequality', 0.041), ('target', 0.04), ('min', 0.039), ('noble', 0.037), ('wxt', 0.037), ('intersection', 0.036), ('lemma', 0.036), ('diagonal', 0.034), ('convex', 0.034), ('manfred', 0.034), ('objective', 0.034), ('inequalities', 0.033), ('neighbor', 0.033), ('bound', 0.032), ('generalizes', 0.032), ('constraint', 0.032), ('dmax', 0.032), ('ys', 0.032), ('rewritten', 0.032), ('nearest', 0.03), ('discrepancy', 0.03), ('exponential', 0.029), ('iterations', 0.029), ('parameter', 0.029), ('motivated', 0.028), ('ht', 0.028), ('objects', 0.028), ('kij', 0.027), ('xs', 0.027), ('classification', 0.027), ('proteins', 0.026), ('argmin', 0.026), ('updated', 0.026), ('evaluations', 0.025), ('species', 0.024), ('lagrange', 0.024), ('yi', 0.024), ('descent', 0.024), ('approximate', 0.024), ('eg', 0.024), ('exponent', 0.024), ('lt', 0.024), ('eigenvalue', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection

Author: Koji Tsuda, Gunnar Rätsch, Manfred K. Warmuth

Abstract: We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated with the von Neumann divergence. Rather than treating the most general case, we focus on two key applications that exemplify our methods: On-line learning with a simple square loss and finding a symmetric positive definite matrix subject to symmetric linear constraints. The updates generalize the Exponentiated Gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). The generalized updates use matrix logarithms and exponentials to preserve positive definiteness. Most importantly, we show how the analysis of each algorithm generalizes to the non-diagonal case. We apply both new algorithms, called the Matrix Exponentiated Gradient (MEG) update and DefiniteBoost, to learn a kernel matrix from distance measurements.

2 0.26750091 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification

Author: Peter L. Bartlett, Michael Collins, Ben Taskar, David A. McAllester

Abstract: We consider the problem of structured classification, where the task is to predict a label y from an input x, and y has meaningful internal structure. Our framework includes supervised training of Markov random fields and weighted context-free grammars as special cases. We describe an algorithm that solves the large-margin optimization problem defined in [12], using an exponential-family (Gibbs distribution) representation of structured objects. The algorithm is efficient—even in cases where the number of labels y is exponential in size—provided that certain expectations under Gibbs distributions can be calculated efficiently. The method for structured labels relies on a more general result, specifically the application of exponentiated gradient updates [7, 8] to quadratic programs. 1

3 0.26448882 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

Author: Ofer Dekel, Shai Shalev-shwartz, Yoram Singer

Abstract: Prediction suffix trees (PST) provide a popular and effective tool for tasks such as compression, classification, and language modeling. In this paper we take a decision theoretic view of PSTs for the task of sequence prediction. Generalizing the notion of margin to PSTs, we present an online PST learning algorithm and derive a loss bound for it. The depth of the PST generated by this algorithm scales linearly with the length of the input. We then describe a self-bounded enhancement of our learning algorithm which automatically grows a bounded-depth PST. We also prove an analogous mistake-bound for the self-bounded algorithm. The result is an efficient algorithm that neither relies on a-priori assumptions on the shape or maximal depth of the target PST nor does it require any parameters. To our knowledge, this is the first provably-correct PST learning algorithm which generates a bounded-depth PST while being competitive with any fixed PST determined in hindsight. 1

4 0.21638048 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms

Author: Nicolò Cesa-bianchi, Claudio Gentile, Luca Zaniboni

Abstract: We provide a worst-case analysis of selective sampling algorithms for learning linear threshold functions. The algorithms considered in this paper are Perceptron-like algorithms, i.e., algorithms which can be efficiently run in any reproducing kernel Hilbert space. Our algorithms exploit a simple margin-based randomized rule to decide whether to query the current label. We obtain selective sampling algorithms achieving on average the same bounds as those proven for their deterministic counterparts, but using much fewer labels. We complement our theoretical findings with an empirical comparison on two text categorization tasks. The outcome of these experiments is largely predicted by our theoretical results: Our selective sampling algorithms tend to perform as good as the algorithms receiving the true label after each classification, while observing in practice substantially fewer labels. 1

5 0.18846902 175 nips-2004-Stable adaptive control with online learning

Author: H. J. Kim, Andrew Y. Ng

Abstract: Learning algorithms have enjoyed numerous successes in robotic control tasks. In problems with time-varying dynamics, online learning methods have also proved to be a powerful tool for automatically tracking and/or adapting to the changing circumstances. However, for safety-critical applications such as airplane flight, the adoption of these algorithms has been significantly hampered by their lack of safety, such as “stability,” guarantees. Rather than trying to show difficult, a priori, stability guarantees for specific learning methods, in this paper we propose a method for “monitoring” the controllers suggested by the learning algorithm online, and rejecting controllers leading to instability. We prove that even if an arbitrary online learning method is used with our algorithm to control a linear dynamical system, the resulting system is stable. 1

6 0.13832247 113 nips-2004-Maximum-Margin Matrix Factorization

7 0.13509969 30 nips-2004-Binet-Cauchy Kernels

8 0.12632416 178 nips-2004-Support Vector Classification with Input Data Uncertainty

9 0.11980105 136 nips-2004-On Semi-Supervised Classification

10 0.11880573 48 nips-2004-Convergence and No-Regret in Multiagent Learning

11 0.11506338 183 nips-2004-Temporal-Difference Networks

12 0.10853978 2 nips-2004-A Direct Formulation for Sparse PCA Using Semidefinite Programming

13 0.10809655 148 nips-2004-Probabilistic Computation in Spiking Populations

14 0.09961427 185 nips-2004-The Convergence of Contrastive Divergences

15 0.097762868 119 nips-2004-Mistake Bounds for Maximum Entropy Discrimination

16 0.089943074 168 nips-2004-Semigroup Kernels on Finite Sets

17 0.089462198 91 nips-2004-Joint Tracking of Pose, Expression, and Texture using Conditionally Gaussian Filters

18 0.080455497 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning

19 0.080299415 64 nips-2004-Experts in a Markov Decision Process

20 0.080194779 71 nips-2004-Generalization Error Bounds for Collaborative Prediction with Low-Rank Matrices


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.259), (1, 0.032), (2, 0.234), (3, 0.16), (4, 0.166), (5, -0.299), (6, -0.063), (7, -0.036), (8, 0.081), (9, 0.06), (10, -0.046), (11, -0.033), (12, -0.145), (13, 0.033), (14, 0.097), (15, 0.075), (16, -0.006), (17, 0.099), (18, -0.04), (19, 0.003), (20, 0.042), (21, 0.096), (22, -0.027), (23, 0.068), (24, -0.102), (25, 0.009), (26, 0.075), (27, -0.15), (28, -0.089), (29, -0.031), (30, -0.036), (31, -0.067), (32, 0.02), (33, 0.083), (34, 0.037), (35, -0.096), (36, -0.035), (37, -0.013), (38, 0.097), (39, 0.05), (40, 0.035), (41, 0.047), (42, -0.067), (43, -0.026), (44, 0.088), (45, -0.042), (46, 0.104), (47, -0.048), (48, 0.031), (49, 0.007)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96437341 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection

Author: Koji Tsuda, Gunnar Rätsch, Manfred K. Warmuth

Abstract: We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated with the von Neumann divergence. Rather than treating the most general case, we focus on two key applications that exemplify our methods: On-line learning with a simple square loss and finding a symmetric positive definite matrix subject to symmetric linear constraints. The updates generalize the Exponentiated Gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). The generalized updates use matrix logarithms and exponentials to preserve positive definiteness. Most importantly, we show how the analysis of each algorithm generalizes to the non-diagonal case. We apply both new algorithms, called the Matrix Exponentiated Gradient (MEG) update and DefiniteBoost, to learn a kernel matrix from distance measurements.

2 0.74026382 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

Author: Ofer Dekel, Shai Shalev-shwartz, Yoram Singer

Abstract: Prediction suffix trees (PST) provide a popular and effective tool for tasks such as compression, classification, and language modeling. In this paper we take a decision theoretic view of PSTs for the task of sequence prediction. Generalizing the notion of margin to PSTs, we present an online PST learning algorithm and derive a loss bound for it. The depth of the PST generated by this algorithm scales linearly with the length of the input. We then describe a self-bounded enhancement of our learning algorithm which automatically grows a bounded-depth PST. We also prove an analogous mistake-bound for the self-bounded algorithm. The result is an efficient algorithm that neither relies on a-priori assumptions on the shape or maximal depth of the target PST nor does it require any parameters. To our knowledge, this is the first provably-correct PST learning algorithm which generates a bounded-depth PST while being competitive with any fixed PST determined in hindsight. 1

3 0.64378411 175 nips-2004-Stable adaptive control with online learning

Author: H. J. Kim, Andrew Y. Ng

Abstract: Learning algorithms have enjoyed numerous successes in robotic control tasks. In problems with time-varying dynamics, online learning methods have also proved to be a powerful tool for automatically tracking and/or adapting to the changing circumstances. However, for safety-critical applications such as airplane flight, the adoption of these algorithms has been significantly hampered by their lack of safety, such as “stability,” guarantees. Rather than trying to show difficult, a priori, stability guarantees for specific learning methods, in this paper we propose a method for “monitoring” the controllers suggested by the learning algorithm online, and rejecting controllers leading to instability. We prove that even if an arbitrary online learning method is used with our algorithm to control a linear dynamical system, the resulting system is stable. 1

4 0.63339043 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms

Author: Nicolò Cesa-bianchi, Claudio Gentile, Luca Zaniboni

Abstract: We provide a worst-case analysis of selective sampling algorithms for learning linear threshold functions. The algorithms considered in this paper are Perceptron-like algorithms, i.e., algorithms which can be efficiently run in any reproducing kernel Hilbert space. Our algorithms exploit a simple margin-based randomized rule to decide whether to query the current label. We obtain selective sampling algorithms achieving on average the same bounds as those proven for their deterministic counterparts, but using much fewer labels. We complement our theoretical findings with an empirical comparison on two text categorization tasks. The outcome of these experiments is largely predicted by our theoretical results: Our selective sampling algorithms tend to perform as good as the algorithms receiving the true label after each classification, while observing in practice substantially fewer labels. 1

5 0.59127963 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification

Author: Peter L. Bartlett, Michael Collins, Ben Taskar, David A. McAllester

Abstract: We consider the problem of structured classification, where the task is to predict a label y from an input x, and y has meaningful internal structure. Our framework includes supervised training of Markov random fields and weighted context-free grammars as special cases. We describe an algorithm that solves the large-margin optimization problem defined in [12], using an exponential-family (Gibbs distribution) representation of structured objects. The algorithm is efficient—even in cases where the number of labels y is exponential in size—provided that certain expectations under Gibbs distributions can be calculated efficiently. The method for structured labels relies on a more general result, specifically the application of exponentiated gradient updates [7, 8] to quadratic programs. 1

6 0.48821983 30 nips-2004-Binet-Cauchy Kernels

7 0.45796227 2 nips-2004-A Direct Formulation for Sparse PCA Using Semidefinite Programming

8 0.44586951 113 nips-2004-Maximum-Margin Matrix Factorization

9 0.41106513 48 nips-2004-Convergence and No-Regret in Multiagent Learning

10 0.38604295 183 nips-2004-Temporal-Difference Networks

11 0.38437802 185 nips-2004-The Convergence of Contrastive Divergences

12 0.38088715 136 nips-2004-On Semi-Supervised Classification

13 0.3579289 94 nips-2004-Kernels for Multi--task Learning

14 0.35641995 196 nips-2004-Triangle Fixing Algorithms for the Metric Nearness Problem

15 0.35337171 190 nips-2004-The Rescorla-Wagner Algorithm and Maximum Likelihood Estimation of Causal Parameters

16 0.3420437 71 nips-2004-Generalization Error Bounds for Collaborative Prediction with Low-Rank Matrices

17 0.33521441 119 nips-2004-Mistake Bounds for Maximum Entropy Discrimination

18 0.32929435 178 nips-2004-Support Vector Classification with Input Data Uncertainty

19 0.32780951 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound

20 0.3205584 19 nips-2004-An Application of Boosting to Graph Classification


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.098), (15, 0.164), (26, 0.075), (31, 0.031), (32, 0.022), (33, 0.185), (35, 0.022), (39, 0.041), (50, 0.04), (93, 0.231)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.90931624 41 nips-2004-Comparing Beliefs, Surveys, and Random Walks

Author: Erik Aurell, Uri Gordon, Scott Kirkpatrick

Abstract: Survey propagation is a powerful technique from statistical physics that has been applied to solve the 3-SAT problem both in principle and in practice. We give, using only probability arguments, a common derivation of survey propagation, belief propagation and several interesting hybrid methods. We then present numerical experiments which use WSAT (a widely used random-walk based SAT solver) to quantify the complexity of the 3-SAT formulae as a function of their parameters, both as randomly generated and after simplification, guided by survey propagation. Some properties of WSAT which have not previously been reported make it an ideal tool for this purpose – its mean cost is proportional to the number of variables in the formula (at a fixed ratio of clauses to variables) in the easy-SAT regime and slightly beyond, and its behavior in the hard-SAT regime appears to reflect the underlying structure of the solution space that has been predicted by replica symmetry-breaking arguments. An analysis of the tradeoffs between the various methods of search for satisfying assignments shows WSAT to be far more powerful than has been appreciated, and suggests some interesting new directions for practical algorithm development.

2 0.88462198 91 nips-2004-Joint Tracking of Pose, Expression, and Texture using Conditionally Gaussian Filters

Author: Tim K. Marks, J. C. Roddey, Javier R. Movellan, John R. Hershey

Abstract: We present a generative model and stochastic filtering algorithm for simultaneous tracking of 3D position and orientation, non-rigid motion, object texture, and background texture using a single camera. We show that the solution to this problem is formally equivalent to stochastic filtering of conditionally Gaussian processes, a problem for which well known approaches exist [3, 8]. We propose an approach based on Monte Carlo sampling of the nonlinear component of the process (object motion) and exact filtering of the object and background textures given the sampled motion. The smoothness of image sequences in time and space is exploited by using Laplace’s method to generate proposal distributions for importance sampling [7]. The resulting inference algorithm encompasses both optic flow and template-based tracking as special cases, and elucidates the conditions under which these methods are optimal. We demonstrate an application of the system to 3D non-rigid face tracking. 1 Background Recent algorithms track morphable objects by solving optic flow equations, subject to the constraint that the tracked points belong to an object whose non-rigid deformations are linear combinations of a set of basic shapes [10, 2, 11]. These algorithms require precise initialization of the object pose and tend to drift out of alignment on long video sequences. We present G-flow, a generative model and stochastic filtering formulation of tracking that address the problems of initialization and error recovery in a principled manner. We define a non-rigid object by the 3D locations of n vertices. The object is a linear combination of k fixed morph bases, with coefficients c = [c1 , c2 , · · · , ck ]T . The fixed 3 × k matrix hi contains the position of the ith vertex in all k morph bases. The transformation from object-centered to image coordinates consists of a rotation, weak perspective projection, and translation. 
Thus $x_i$, the 2D location of the $i$th vertex on the image plane, is $x_i = g r h_i c + l$, (1) where $r$ is the $3 \times 3$ rotation matrix, $l$ is the $2 \times 1$ translation vector, and $g = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}$ is the projection matrix. The object pose, $u_t$, comprises both the rigid motion parameters and the morph parameters at time $t$: $u_t = \{r(t), l(t), c(t)\}$. (2) 1.1 Optic flow Let $y_t$ represent the current image, and let $x_i(u_t)$ index the image pixel that is rendered by the $i$th object vertex when the object assumes pose $u_t$. Suppose that we know $u_{t-1}$, the pose at time $t-1$, and we want to find $u_t$, the pose at time $t$. This problem can be solved by minimizing the following form with respect to $u_t$: $\hat{u}_t = \arg\min_{u_t} \frac{1}{2} \sum_{i=1}^{n} [y_t(x_i(u_t)) - y_{t-1}(x_i(u_{t-1}))]^2$. (3) In the special case in which the $x_i(u_t)$ are neighboring points that move with the same 2D displacement, this reduces to the standard Lucas-Kanade optic flow algorithm [9, 1]. Recent work [10, 2, 11] has shown that in the general case, this optimization problem can be solved efficiently using the Gauss-Newton method. We will take advantage of this fact to develop an efficient stochastic inference algorithm within the framework of G-flow. Notational conventions: Unless otherwise stated, capital letters are used for random variables, small letters for specific values taken by random variables, and Greek letters for fixed model parameters. Subscripted colons indicate sequences: e.g., $X_{1:t} = X_1 \cdots X_t$. The term $I_n$ stands for the $n \times n$ identity matrix, $E$ for expected value, $\mathrm{Var}$ for the covariance matrix, and $\mathrm{Var}^{-1}$ for the inverse of the covariance matrix (precision matrix). 2 The Generative Model for G-Flow Figure 1: Left: $a(U_t)$ determines which texel (color at a vertex of the object model or a pixel of the background model) is responsible for rendering each image pixel. Right: G-flow video generation model: At time $t$, the object's 3D pose, $U_t$, is used to project the object texture, $V_t$, into 2D.
This projection is combined with the background texture, $B_t$, to generate the observed image, $Y_t$. We model the image sequence $Y$ as a stochastic process generated by three hidden causes, $U$, $V$, and $B$, as shown in the graphical model (Figure 1, right). The $m \times 1$ random vector $Y_t$ represents the $m$-pixel image at time $t$. The $n \times 1$ random vector $V_t$ and the $m \times 1$ random vector $B_t$ represent the $n$-texel object texture and the $m$-texel background texture, respectively. As illustrated in Figure 1, left, the object pose, $U_t$, determines onto which image pixels the object and background texels project at time $t$. This is formulated using the projection function $a(U_t)$. For a given pose, $u_t$, the projection $a(u_t)$ is a block matrix, $a(u_t) \stackrel{\mathrm{def}}{=} [\,a^v(u_t) \;\; a^b(u_t)\,]$. Here $a^v(u_t)$, the object projection function, is an $m \times n$ matrix of 0s and 1s that tells onto which image pixel each object vertex projects; e.g., a 1 at row $j$, column $i$ means that the $i$th object point projects onto image pixel $j$. Matrix $a^b$ plays the same role for background pixels. Assuming the foreground mapping is one-to-one, we let $a^b = I_m - a^v(u_t) a^v(u_t)^T$, expressing the simple occlusion constraint that every image pixel is rendered by object or background, but not both. In the G-flow generative model: $Y_t = a(U_t) \begin{pmatrix} V_t \\ B_t \end{pmatrix} + W_t$, $W_t \sim N(0, \sigma_w I_m)$, $\sigma_w > 0$; $U_t \sim p(u_t \mid u_{t-1})$; $V_t = V_{t-1} + Z^v_{t-1}$, $Z^v_{t-1} \sim N(0, \Psi_v)$, $\Psi_v$ diagonal; $B_t = B_{t-1} + Z^b_{t-1}$, $Z^b_{t-1} \sim N(0, \Psi_b)$, $\Psi_b$ diagonal, (4) where $p(u_t \mid u_{t-1})$ is the pose transition distribution, and $Z^v$, $Z^b$, $W$ are independent of each other, of the initial conditions, and over time. The form of the pose distribution is left unspecified since the algorithm proposed here does not require the pose distribution or the pose dynamics to be Gaussian. For the initial conditions, we require that the variance of $V_1$ and the variance of $B_1$ are both diagonal.
Non-rigid 3D tracking is a difficult nonlinear filtering problem because changing the pose has a nonlinear effect on the image pixels. Fortunately, the problem has a rich structure that we can exploit: under the G-flow model, video generation is a conditionally Gaussian process [3, 6, 4, 5]. If the specific values taken by the pose sequence, $u_{1:t}$, were known, then the texture processes, $V$ and $B$, and the image process, $Y$, would be jointly Gaussian. This suggests the following scheme: we could use particle filtering to obtain a distribution of pose experts (each expert corresponds to a highly probable sample of pose, $u_{1:t}$). For each expert we could then use Kalman filtering equations to infer the posterior distribution of texture given the observed images. This method is known in the statistics community as a Monte Carlo filtering solution for conditionally Gaussian processes [3, 4], and in the machine learning community as Rao-Blackwellized particle filtering [6, 5]. We found that in addition to Rao-Blackwellization, it was also critical to use Laplace's method to generate the proposal distributions for importance sampling [7]. In the context of G-flow, we accomplished this by performing an optic flow-like optimization, using an efficient algorithm similar to those in [10, 2]. 3 Inference Our goal is to find an expression for the filtering distribution, $p(u_t, v_t, b_t \mid y_{1:t})$. Using the law of total probability, we have the following equation for the filtering distribution: $p(u_t, v_t, b_t \mid y_{1:t}) = \int \underbrace{p(u_t, v_t, b_t \mid u_{1:t-1}, y_{1:t})}_{\text{opinion of expert}} \, \underbrace{p(u_{1:t-1} \mid y_{1:t})}_{\text{credibility of expert}} \, du_{1:t-1}$. (5) We can think of the integral in (5) as a sum over a distribution of experts, where each expert corresponds to a single pose history, $u_{1:t-1}$. Based on its hypothesis about pose history, each expert has an opinion about the current pose of the object, $U_t$, and the texture maps of the object and background, $V_t$ and $B_t$.
Each expert also has a credibility, a scalar that measures how well the expert's opinion matches the observed image $y_t$. Thus, (5) can be interpreted as follows: The filtering distribution at time $t$ is obtained by integrating over the entire ensemble of experts the opinion of each expert weighted by that expert's credibility. The opinion distribution of expert $u_{1:t-1}$ can be factorized into the expert's opinion about the pose $U_t$ times the conditional distribution of texture $V_t, B_t$ given pose: $p(u_t, v_t, b_t \mid u_{1:t-1}, y_{1:t}) = \underbrace{p(u_t \mid u_{1:t-1}, y_{1:t})}_{\text{pose opinion}} \, \underbrace{p(v_t, b_t \mid u_{1:t}, y_{1:t})}_{\text{texture opinion given pose}}$. (6) The rest of this section explains how we evaluate each term in (5) and (6). We cover the distribution of texture given pose in 3.1, pose opinion in 3.2, and credibility in 3.3. 3.1 Texture opinion given pose The distribution of $V_t$ and $B_t$ given the pose history $u_{1:t}$ is Gaussian with mean and covariance that can be obtained using the Kalman filter estimation equations: $\mathrm{Var}^{-1}(V_t, B_t \mid u_{1:t}, y_{1:t}) = \mathrm{Var}^{-1}(V_t, B_t \mid u_{1:t-1}, y_{1:t-1}) + a(u_t)^T \sigma_w^{-1} a(u_t)$ (7) and $E(V_t, B_t \mid u_{1:t}, y_{1:t}) = \mathrm{Var}(V_t, B_t \mid u_{1:t}, y_{1:t}) \times \big[\mathrm{Var}^{-1}(V_t, B_t \mid u_{1:t-1}, y_{1:t-1}) E(V_t, B_t \mid u_{1:t-1}, y_{1:t-1}) + a(u_t)^T \sigma_w^{-1} y_t\big]$. (8) This requires $p(V_t, B_t \mid u_{1:t-1}, y_{1:t-1})$, which we get from the Kalman prediction equations: $E(V_t, B_t \mid u_{1:t-1}, y_{1:t-1}) = E(V_{t-1}, B_{t-1} \mid u_{1:t-1}, y_{1:t-1})$ (9) and $\mathrm{Var}(V_t, B_t \mid u_{1:t-1}, y_{1:t-1}) = \mathrm{Var}(V_{t-1}, B_{t-1} \mid u_{1:t-1}, y_{1:t-1}) + \begin{pmatrix} \Psi_v & 0 \\ 0 & \Psi_b \end{pmatrix}$. (10) In (9), the expected value $E(V_t, B_t \mid u_{1:t-1}, y_{1:t-1})$ consists of texture maps (templates) for the object and background. In (10), $\mathrm{Var}(V_t, B_t \mid u_{1:t-1}, y_{1:t-1})$ represents the degree of uncertainty about each texel in these texture maps. Since this is a diagonal matrix, we can refer to the mean and variance of each texel individually.
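Because the covariances in (7)-(10) are diagonal, the texture update decomposes into independent scalar predict/correct steps, one per texel. A minimal sketch under that assumption (variable names are ours, not the authors' code):

```python
def texel_kalman_step(mu, var, y_obs, psi, sigma_w):
    """Scalar Kalman predict+correct for one texel.

    mu, var: posterior mean/variance of the texel at time t-1.
    y_obs:   the image pixel the texel projects to at time t, or None
             if the texel is not rendered under the current pose.
    psi:     texture transition variance (a diagonal entry of Psi).
    sigma_w: observation noise variance.
    A per-texel specialization of the diagonal Kalman equations.
    """
    # Prediction (random-walk texture dynamics): mean unchanged, variance grows.
    var_pred = var + psi
    if y_obs is None:           # texel occluded / not rendered: no correction
        return mu, var_pred
    # Correction: the observed pixel adds precision 1/sigma_w.
    var_post = 1.0 / (1.0 / var_pred + 1.0 / sigma_w)
    mu_post = var_post * (mu / var_pred + y_obs / sigma_w)
    return mu_post, var_post
```

Unobserved texels only accumulate uncertainty, while observed texels are pulled toward the new pixel value in proportion to their relative precisions.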
For the $i$th texel in the object texture map, we use the following notation: $\mu^v_t(i) \stackrel{\mathrm{def}}{=}$ the $i$th element of $E(V_t \mid u_{1:t-1}, y_{1:t-1})$, and $\sigma^v_t(i) \stackrel{\mathrm{def}}{=}$ the $(i, i)$th element of $\mathrm{Var}(V_t \mid u_{1:t-1}, y_{1:t-1})$. Similarly, define $\mu^b_t(j)$ and $\sigma^b_t(j)$ as the mean and variance of the $j$th texel in the background texture map. (This notation leaves the dependency on $u_{1:t-1}$ and $y_{1:t-1}$ implicit.) 3.2 Pose opinion Based on its current texture template (derived from the history of poses and images up to time $t-1$) and the new image $y_t$, each expert $u_{1:t-1}$ has a pose opinion, $p(u_t \mid u_{1:t-1}, y_{1:t})$, a probability distribution representing that expert's beliefs about the pose at time $t$. Since the effect of $u_t$ on the likelihood function is nonlinear, we will not attempt to find an analytical solution for the pose opinion distribution. However, due to the spatio-temporal smoothness of video signals, it is possible to estimate the peak and variance of an expert's pose opinion. 3.2.1 Estimating the peak of an expert's pose opinion We want to estimate $\hat{u}_t(u_{1:t-1})$, the value of $u_t$ that maximizes the pose opinion. Since $p(u_t \mid u_{1:t-1}, y_{1:t}) = \frac{p(y_{1:t-1} \mid u_{1:t-1})}{p(y_{1:t} \mid u_{1:t-1})} \, p(u_t \mid u_{t-1}) \, p(y_t \mid u_{1:t}, y_{1:t-1})$, (11) we have $\hat{u}_t(u_{1:t-1}) \stackrel{\mathrm{def}}{=} \arg\max_{u_t} p(u_t \mid u_{1:t-1}, y_{1:t}) = \arg\max_{u_t} p(u_t \mid u_{t-1}) \, p(y_t \mid u_{1:t}, y_{1:t-1})$. (12) We now need an expression for the final term in (12), the predictive distribution $p(y_t \mid u_{1:t}, y_{1:t-1})$.
By integrating out the hidden texture variables from $p(y_t, v_t, b_t \mid u_{1:t}, y_{1:t-1})$, and using the conditional independence relationships defined by the graphical model (Figure 1, right), we can derive: $\log p(y_t \mid u_{1:t}, y_{1:t-1}) = -\frac{m}{2} \log 2\pi - \frac{1}{2} \log |\mathrm{Var}(Y_t \mid u_{1:t}, y_{1:t-1})| - \frac{1}{2} \sum_{i=1}^{n} \frac{(y_t(x_i(u_t)) - \mu^v_t(i))^2}{\sigma^v_t(i) + \sigma_w} - \frac{1}{2} \sum_{j \notin \mathcal{X}(u_t)} \frac{(y_t(j) - \mu^b_t(j))^2}{\sigma^b_t(j) + \sigma_w}$, (13) where $x_i(u_t)$ is the image pixel rendered by the $i$th object vertex when the object assumes pose $u_t$, and $\mathcal{X}(u_t)$ is the set of all image pixels rendered by the object under pose $u_t$. Combining (12) and (13), we can derive $\hat{u}_t(u_{1:t-1}) = \arg\min_{u_t} \Big\{ -\log p(u_t \mid u_{t-1}) + \frac{1}{2} \sum_{i=1}^{n} \Big[ \underbrace{\frac{[y_t(x_i(u_t)) - \mu^v_t(i)]^2}{\sigma^v_t(i) + \sigma_w}}_{\text{foreground term}} - \underbrace{\Big( \frac{[y_t(x_i(u_t)) - \mu^b_t(x_i(u_t))]^2}{\sigma^b_t(x_i(u_t)) + \sigma_w} + \log[\sigma^b_t(x_i(u_t)) + \sigma_w] \Big)}_{\text{background terms}} \Big] \Big\}$. (14) Note the similarity between (14) and constrained optic flow (3). For example, focus on the foreground term in (14) and ignore the weights in the denominator. The previous image $y_{t-1}$ from (3) has been replaced by $\mu^v_t(\cdot)$, the estimated object texture based on the images and poses up to time $t-1$. As in optic flow, we can find the pose estimate $\hat{u}_t(u_{1:t-1})$ efficiently using the Gauss-Newton method. 3.2.2 Estimating the distribution of an expert's pose opinion We estimate the distribution of an expert's pose opinion using a combination of Laplace's method and importance sampling. Suppose at time $t-1$ we are given a sample of experts indexed by $d$, each endowed with a pose sequence $u^{(d)}_{1:t-1}$, a weight $w^{(d)}_{t-1}$, and the means and variances of Gaussian distributions for object and background texture. For each expert $u^{(d)}_{1:t-1}$, we use (14) to compute $\hat{u}^{(d)}_t$, the peak of the pose distribution at time $t$ according to that expert. Define $\hat{\sigma}^{(d)}_t$ as the inverse Hessian matrix of (14) at this peak, the Laplace estimate of the covariance matrix of the expert's opinion.
We then generate a set of $s$ independent samples $\{u^{(d,e)}_t : e = 1, \cdots, s\}$ from a Gaussian distribution with mean $\hat{u}^{(d)}_t$ and variance proportional to $\hat{\sigma}^{(d)}_t$, namely $g(\cdot \mid \hat{u}^{(d)}_t, \alpha \hat{\sigma}^{(d)}_t)$, where the parameter $\alpha > 0$ determines the sharpness of the sampling distribution. (Note that letting $\alpha \to 0$ would be equivalent to simply setting the new pose equal to the peak of the pose opinion, $u^{(d,e)}_t = \hat{u}^{(d)}_t$.) To find the parameters of this Gaussian proposal distribution, we use the Gauss-Newton method, ignoring the second of the two background terms in (14). (This term is not ignored in the importance sampling step.) To refine our estimate of the pose opinion we use importance sampling. We assign each sample from the proposal distribution an importance weight $w_t(d, e)$ that is proportional to the ratio between the posterior distribution and the proposal distribution: $p(u_t \mid u^{(d)}_{1:t-1}, y_{1:t}) = \sum_{e=1}^{s} \delta(u_t - u^{(d,e)}_t) \, \frac{w_t(d, e)}{\sum_{f=1}^{s} w_t(d, f)}$, (15) with $w_t(d, e) = \frac{p(u^{(d,e)}_t \mid u^{(d)}_{t-1}) \, p(y_t \mid u^{(d)}_{1:t-1}, u^{(d,e)}_t, y_{1:t-1})}{g(u^{(d,e)}_t \mid \hat{u}^{(d)}_t, \alpha \hat{\sigma}^{(d)}_t)}$. (16) The numerator of (16) is proportional to $p(u_t \mid u^{(d)}_{1:t-1}, y_{1:t})$ by (12), and the denominator of (16) is the sampling distribution. 3.3 Estimating an expert's credibility The credibility of the $d$th expert, $p(u^{(d)}_{1:t-1} \mid y_{1:t})$, is proportional to the product of a prior term and a likelihood term: $p(u^{(d)}_{1:t-1} \mid y_{1:t}) = \frac{p(u^{(d)}_{1:t-1} \mid y_{1:t-1}) \, p(y_t \mid u^{(d)}_{1:t-1}, y_{1:t-1})}{p(y_t \mid y_{1:t-1})}$. (17) Regarding the likelihood, $p(y_t \mid u_{1:t-1}, y_{1:t-1}) = \int p(y_t, u_t \mid u_{1:t-1}, y_{1:t-1}) \, du_t = \int p(y_t \mid u_{1:t}, y_{1:t-1}) \, p(u_t \mid u_{t-1}) \, du_t$. (18) We already generated a set of samples $\{u^{(d,e)}_t : e = 1, \cdots, s\}$ that estimate the pose opinion of the $d$th expert, $p(u_t \mid u^{(d)}_{1:t-1}, y_{1:t})$.
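The self-normalized weights of (15)-(16) amount to the standard importance-sampling computation over one expert's pose samples; a minimal sketch (argument names are ours):

```python
def normalized_importance_weights(prior, likelihood, proposal):
    """Self-normalized importance weights w_e ~ prior_e * likelihood_e / proposal_e.

    Each argument is a list over the s pose samples of one expert:
    prior_e      = p(u_t sample e | previous pose),
    likelihood_e = p(y_t | pose sample e, history),
    proposal_e   = density of the Gaussian proposal at sample e.
    Returns weights summing to 1 (a generic self-normalized IS step).
    """
    raw = [p * l / q for p, l, q in zip(prior, likelihood, proposal)]
    z = sum(raw)
    return [r / z for r in raw]
```

With a uniform prior and proposal, the weights reduce to the normalized likelihoods, which matches the intuition that samples explaining the image better get more mass.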
We can now use these samples to estimate the likelihood for the $d$th expert: $p(y_t \mid u^{(d)}_{1:t-1}, y_{1:t-1}) = \int p(y_t \mid u^{(d)}_{1:t-1}, u_t, y_{1:t-1}) \, p(u_t \mid u^{(d)}_{t-1}) \, du_t = \int p(y_t \mid u^{(d)}_{1:t-1}, u_t, y_{1:t-1}) \, g(u_t \mid \hat{u}^{(d)}_t, \alpha \hat{\sigma}^{(d)}_t) \, \frac{p(u_t \mid u^{(d)}_{t-1})}{g(u_t \mid \hat{u}^{(d)}_t, \alpha \hat{\sigma}^{(d)}_t)} \, du_t \approx \frac{1}{s} \sum_{e=1}^{s} w_t(d, e)$. (19) 3.4 Updating the filtering distribution Once we have calculated the opinion and credibility of each expert $u_{1:t-1}$, we evaluate the integral in (5) as a weighted sum over experts. The credibilities of all of the experts are normalized to sum to 1. New experts $u_{1:t}$ (children) are created from the old experts $u_{1:t-1}$ (parents) by appending a pose $u_t$ to the parent's history of poses $u_{1:t-1}$. Every expert in the new generation is created as follows: One parent is chosen to sire the child. The probability of being chosen is proportional to the parent's credibility. The child's value of $u_t$ is chosen at random from its parent's pose opinion (the weighted samples described in Section 3.2.2). 4 Relation to Optic Flow and Template Matching In basic template-matching, the same time-invariant texture map is used to track every frame in the video sequence. Optic flow can be thought of as template-matching with a template that is completely reset at each frame for use in the subsequent frame. In most cases, optimal inference under G-flow involves a combination of optic flow-based and template-based tracking, in which the texture template gradually evolves as new images are presented. Pure optic flow and template-matching emerge as special cases. Optic Flow as a Special Case Suppose that the pose transition probability $p(u_t \mid u_{t-1})$ is uninformative, that the background is uninformative, that every texel in the initial object texture map has equal variance, $\mathrm{Var}(V_1) = \kappa I_n$, and that the texture transition uncertainty is very high, $\Psi_v \to \mathrm{diag}(\infty)$.
Using (7), (8), and (10), it follows that $\mu^v_t(i) = \big[ [a^v(u_{t-1})]^T y_{t-1} \big]_i = y_{t-1}(x_i(u_{t-1}))$, (20) i.e., the object texture map at time $t$ is determined by the pixels from image $y_{t-1}$ that according to pose $u_{t-1}$ were rendered by the object. As a result, (14) reduces to: $\hat{u}_t(u_{1:t-1}) = \arg\min_{u_t} \frac{1}{2} \sum_{i=1}^{n} [y_t(x_i(u_t)) - y_{t-1}(x_i(u_{t-1}))]^2$, (21) which is identical to (3). Thus constrained optic flow [10, 2, 11] is simply a special case of optimal inference under G-flow, with a single expert and with sampling parameter $\alpha \to 0$. The key assumption that $\Psi_v \to \mathrm{diag}(\infty)$ means that the object's texture is very different in adjacent frames. However, optic flow is typically applied in situations in which the object's texture in adjacent frames is similar. The optimal solution in such situations calls not for optic flow, but for a texture map that integrates information across multiple frames. Template Matching as a Special Case Suppose the initial texture map is known precisely, $\mathrm{Var}(V_1) = 0$, and the texture transition uncertainty is very low, $\Psi_v \to 0$. By (7), (8), and (10), it follows that $\mu^v_t(i) = \mu^v_{t-1}(i) = \mu^v_1(i)$, i.e., the texture map does not change over time, but remains fixed at its initial value (it is a texture template). Then (14) becomes: $\hat{u}_t(u_{1:t-1}) = \arg\min_{u_t} \frac{1}{2} \sum_{i=1}^{n} [y_t(x_i(u_t)) - \mu^v_1(i)]^2$, (22) where $\mu^v_1(i)$ is the $i$th texel of the fixed texture template. This is the error function minimized by standard template-matching algorithms. The key assumption that $\Psi_v \to 0$ means the object's texture is constant from each frame to the next, which is rarely true in real data. G-flow provides a principled way to relax this unrealistic assumption of template methods. General Case In general, if the background is uninformative, then minimizing (14) results in a weighted combination of optic flow and template matching, with the weight of each approach depending on the current level of certainty about the object template.
In addition, when there is useful information in the background, G-flow infers a model of the background which is used to improve tracking. Figure 2: G-flow tracking an outdoor video. Results are shown for frames 1, 81, and 620. 5 Simulations We collected a video (30 frames/sec) of a subject in an outdoor setting who made a variety of facial expressions while moving her head. A later motion-capture session was used to create a 3D morphable model of her face, consisting of a set of 5 morph bases (k = 5). Twenty experts were initialized randomly near the correct pose on frame 1 of the video and propagated using G-flow inference (assuming an uninformative background). See http://mplab.ucsd.edu for video. Figure 2 shows the distribution of experts for three frames. In each frame, every expert has a hypothesis about the pose (translation, rotation, scale, and morph coefficients). The 38 points in the model are projected into the image according to each expert’s pose, yielding 760 red dots in each frame. In each frame, the mean of the experts gives a single hypothesis about the 3D non-rigid deformation of the face (lower right) as well as the rigid pose of the face (rotated 3D axes, lower left). Notice G-flow’s ability to recover from error: bad initial hypotheses are weeded out, leaving only good hypotheses. To compare G-flow’s performance versus deterministic constrained optic flow algorithms such as [10, 2, 11] , we used both G-flow and the method from [2] to track the same video sequence. We ran each tracker several times, introducing small errors in the starting pose. Figure 3: Average error over time for G-flow (green) and for deterministic optic flow [2] (blue). Results were averaged over 16 runs (deterministic algorithm) or 4 runs (G-flow) and smoothed. As ground truth, the 2D locations of 6 points were hand-labeled in every 20th frame. 
The error at every 20th frame was calculated as the distance from these labeled locations to the inferred (tracked) locations, averaged across several runs. Figure 3 compares this tracking error as a function of time for the deterministic constrained optic flow algorithm and for a 20-expert version of the G-flow tracking algorithm. Notice that the deterministic system has a tendency to drift (increase in error) over time, whereas G-flow can recover from drift. Acknowledgments Tim K. Marks was supported by NSF grant IIS-0223052 and NSF grant DGE-0333451 to GWC. John Hershey was supported by the UCDIMI grant D00-10084. J. Cooper Roddey was supported by the Swartz Foundation. Javier R. Movellan was supported by NSF grants IIS-0086107, IIS-0220141, and IIS-0223052, and by the UCDIMI grant D00-10084. References [1] Simon Baker and Iain Matthews. Lucas-Kanade 20 years on: A unifying framework. International Journal of Computer Vision, 56(3):221–255, 2002. [2] M. Brand. Flexible flow for 3D nonrigid tracking and shape recovery. In CVPR, volume 1, pages 315–322, 2001. [3] H. Chen, P. Kumar, and J. van Schuppen. On Kalman filtering for conditionally Gaussian systems with random matrices. Syst. Contr. Lett., 13:397–404, 1989. [4] R. Chen and J. Liu. Mixture Kalman filters. J. R. Statist. Soc. B, 62:493–508, 2000. [5] A. Doucet and C. Andrieu. Particle filtering for partially observed Gaussian state space models. J. R. Statist. Soc. B, 64:827–838, 2002. [6] A. Doucet, N. de Freitas, K. Murphy, and S. Russell. Rao-Blackwellised particle filtering for dynamic Bayesian networks. In 16th Conference on Uncertainty in AI, pages 176–183, 2000. [7] A. Doucet, S. J. Godsill, and C. Andrieu. On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10:197–208, 2000. [8] Zoubin Ghahramani and Geoffrey E. Hinton. Variational learning for switching state-space models. Neural Computation, 12(4):831–864, 2000. [9] B. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Proceedings of the International Joint Conference on Artificial Intelligence, 1981. [10] L. Torresani, D. Yang, G. Alexander, and C. Bregler. Tracking and modeling non-rigid objects with rank constraints. In CVPR, pages 493–500, 2001. [11] Lorenzo Torresani, Aaron Hertzmann, and Christoph Bregler. Learning non-rigid 3D shape from 2D motion. In Advances in Neural Information Processing Systems 16. MIT Press, 2004.
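The expert-regeneration step of Section 3.4 above (choose a parent in proportion to credibility, then draw the child's pose from the parent's weighted pose opinion) is, in effect, multinomial resampling. A minimal sketch with our own variable names, not the authors' code:

```python
import random

def resample_experts(histories, credibilities, pose_samples, pose_weights,
                     n_children, seed=0):
    """Credibility-weighted creation of a new expert generation.

    histories:     list of pose histories, one per parent expert.
    credibilities: unnormalized credibility of each parent.
    pose_samples / pose_weights: for each parent, its weighted
                   pose-opinion samples from importance sampling.
    Each child picks a parent with probability proportional to
    credibility, then appends a pose drawn from that parent's
    weighted pose opinion.
    """
    rng = random.Random(seed)
    idx = list(range(len(histories)))
    children = []
    for _ in range(n_children):
        d = rng.choices(idx, weights=credibilities)[0]
        u_t = rng.choices(pose_samples[d], weights=pose_weights[d])[0]
        children.append(histories[d] + [u_t])
    return children
```

Low-credibility parents are rarely chosen, which is how the "bad initial hypotheses are weeded out" behavior reported in the simulations arises.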

same-paper 3 0.8692503 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection

Author: Koji Tsuda, Gunnar Rätsch, Manfred K. Warmuth

Abstract: We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated with the von Neumann divergence. Rather than treating the most general case, we focus on two key applications that exemplify our methods: On-line learning with a simple square loss and finding a symmetric positive definite matrix subject to symmetric linear constraints. The updates generalize the Exponentiated Gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). The generalized updates use matrix logarithms and exponentials to preserve positive definiteness. Most importantly, we show how the analysis of each algorithm generalizes to the non-diagonal case. We apply both new algorithms, called the Matrix Exponentiated Gradient (MEG) update and DefiniteBoost, to learn a kernel matrix from distance measurements.
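A sketch of the kind of update this abstract describes: step in matrix-log space, re-exponentiate, and renormalize the trace, so the parameter stays symmetric positive definite with trace one. The loss gradient and step size here are placeholders, not the paper's exact derivation:

```python
import numpy as np

def meg_update(W, grad, eta=0.1):
    """One matrix exponentiated gradient style step (a sketch).

    W:    symmetric positive definite matrix with trace 1.
    grad: symmetric gradient of the loss at W (placeholder).
    Computes exp(log W - eta * grad) via eigendecompositions and
    renormalizes the trace, preserving positive definiteness.
    """
    # Matrix log via eigendecomposition (valid since W is symmetric PD).
    vals, vecs = np.linalg.eigh(W)
    logW = vecs @ np.diag(np.log(vals)) @ vecs.T
    A = logW - eta * grad
    # Matrix exp of the symmetric matrix A.
    avals, avecs = np.linalg.eigh(A)
    expA = avecs @ np.diag(np.exp(avals)) @ avecs.T
    return expA / np.trace(expA)
```

On diagonal matrices this reduces to the exponentiated-gradient update on a probability vector, matching the correspondence the abstract draws between the matrix and vector cases.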

4 0.85383415 166 nips-2004-Semi-supervised Learning via Gaussian Processes

Author: Neil D. Lawrence, Michael I. Jordan

Abstract: We present a probabilistic approach to learning a Gaussian Process classifier in the presence of unlabeled data. Our approach involves a “null category noise model” (NCNM) inspired by ordered categorical noise models. The noise model reflects an assumption that the data density is lower between the class-conditional densities. We illustrate our approach on a toy problem and present comparative results for the semi-supervised classification of handwritten digits. 1

5 0.7644251 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

Author: Ofer Dekel, Shai Shalev-shwartz, Yoram Singer

Abstract: Prediction suffix trees (PST) provide a popular and effective tool for tasks such as compression, classification, and language modeling. In this paper we take a decision theoretic view of PSTs for the task of sequence prediction. Generalizing the notion of margin to PSTs, we present an online PST learning algorithm and derive a loss bound for it. The depth of the PST generated by this algorithm scales linearly with the length of the input. We then describe a self-bounded enhancement of our learning algorithm which automatically grows a bounded-depth PST. We also prove an analogous mistake-bound for the self-bounded algorithm. The result is an efficient algorithm that neither relies on a-priori assumptions on the shape or maximal depth of the target PST nor does it require any parameters. To our knowledge, this is the first provably-correct PST learning algorithm which generates a bounded-depth PST while being competitive with any fixed PST determined in hindsight. 1

6 0.76140875 178 nips-2004-Support Vector Classification with Input Data Uncertainty

7 0.76064068 131 nips-2004-Non-Local Manifold Tangent Learning

8 0.75854772 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification

9 0.75620413 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill

10 0.75551963 187 nips-2004-The Entire Regularization Path for the Support Vector Machine

11 0.75417781 69 nips-2004-Fast Rates to Bayes for Kernel Machines

12 0.75400501 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning

13 0.75388736 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

14 0.75344515 68 nips-2004-Face Detection --- Efficient and Rank Deficient

15 0.75214165 16 nips-2004-Adaptive Discriminative Generative Model and Its Applications

16 0.75178576 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform

17 0.75076014 79 nips-2004-Hierarchical Eigensolver for Transition Matrices in Spectral Methods

18 0.75059283 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms

19 0.75048953 70 nips-2004-Following Curved Regularized Optimization Solution Paths

20 0.74881524 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach