nips2004-93 knowledge-graph by maker-knowledge-mining
Title: Kernel Projection Machine: a New Tool for Pattern Recognition
Source: pdf
Author: Laurent Zwald, Gilles Blanchard, Pascal Massart, Régis Vert
Abstract: This paper investigates the effect of Kernel Principal Component Analysis (KPCA) within the classification framework, essentially the regularization properties of this dimensionality reduction method. KPCA has been previously used as a pre-processing step before applying an SVM but we point out that this method is somewhat redundant from a regularization point of view and we propose a new algorithm called Kernel Projection Machine to avoid this redundancy, based on an analogy with the statistical framework of regression for a Gaussian white noise model. Preliminary experimental results show that this algorithm reaches the same performances as an SVM. 1
Reference: text
Pascal Massart, Département de Mathématiques, Université Paris-Sud, Bât.
Laurent Zwald, Département de Mathématiques, Université Paris-Sud, Bât.
Let (x1, y1), …, (xn, yn) be n given realizations of a random variable (X, Y) living in X × {−1, 1}.
The xi's are often referred to as inputs (or patterns), and the yi's as labels.
Pattern recognition is concerned with finding a classifier, i.e., a function that predicts the label of a new input.
It is often the case with real-world data that the dimension of the patterns is very large, and some of the components carry more noise than information.
In such cases, reducing the dimension of the data before running a classification algorithm on it sounds reasonable.
One of the most famous methods for this kind of pre-processing is PCA, and its kernelized version (KPCA), introduced in the pioneering work of Schölkopf, Smola and Müller [8].
Some experiments have already been carried out to investigate the use of KPCA for classification purposes, and numerical results are reported in [8].
The authors considered the USPS handwritten digit database and reported the test error rates achieved by the linear SVM trained on the data pre-processed with KPCA: the conclusion was that the larger the number of principal components, the better the performance.
In other words, the KPCA step was useless or even counterproductive.
This conclusion might be explained by a redundancy arising in their experiments: there is actually a double regularization, the first corresponding to the dimensionality reduction achieved by KPCA, and the other to the regularization achieved by the SVM.
With that in mind, it does not seem so surprising that KPCA does not help in that case: whatever the dimensionality reduction, the SVM anyway achieves a (possibly strong) regularization.
The aforementioned experiments suggest that KPCA should be used together with a classification algorithm that is not regularized (e.g. a simple empirical risk minimizer): in that case, it should be expected that KPCA is by itself sufficient to achieve regularization, the choice of the dimension being guided by adequate model selection.
In this paper, we propose a new algorithm, called the Kernel Projection Machine (KPM), that implements this idea: an optimal dimension is sought so as to minimize the test error of the resulting classifier.
A nice property is that the training labels are used to select the optimal dimension; optimal means that the resulting D-dimensional representation of the data contains the right amount of information needed to classify the inputs.
To sum up, the KPM can be seen as a dimensionality-reduction-based classification method that takes the labels into account in the dimensionality reduction step.
This paper is organized as follows: Section 2 gives some statistical background on regularized methods vs. projection methods; Section 3 explicitly gives the details of the algorithm; experiments and results, which should be considered preliminary, are reported in Section 4.
2 Motivations for the Kernel Projection Machine

2.1 The Gaussian Intuition: a Statistician's Perspective

Regularization methods have been used for quite a long time in nonparametric statistics, since the pioneering works of Grace Wahba in the eighties (see [10] for a review).
So let us assume that one observes a noisy signal dY(x) = s(x) dx + (1/√n) dW(x), Y(0) = 0, on [0, 1], where dW(x) denotes standard white noise.
To the reader not familiar with this model, it should be considered as nothing more than an idealization of the well-known fixed-design regression problem Yi = s(i/n) + εi for i = 1, …, n, where εi ∼ N(0, 1), and the goal is to recover the regression function s.
(The white noise model is actually simpler to study from a mathematical point of view.)
Given a Mercer kernel k on [0, 1] × [0, 1], the regularized least squares procedure proposes to minimize γn(f) + ζn ‖f‖²_{Hk}  (1), where (ζn) is a conveniently chosen sequence and Hk denotes the RKHS induced by k.
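For concreteness, the least squares version of (1) can be sketched numerically: with γn the empirical squared loss, the minimizer over Hk is the classical kernel ridge regression solution f = Σi αi k(·, xi) with α = (K + n ζn I)⁻¹ y. The data, kernel, and value of ζn below are illustrative choices, not taken from the paper:

```python
import numpy as np

def kernel_ridge(K, y, zeta):
    # Minimizer of (1/n) * sum_i (y_i - f(x_i))^2 + zeta * ||f||^2_{H_k}
    # over the RKHS has the form f = sum_i alpha_i k(., x_i), with alpha solving
    # the linear system (K + n * zeta * I) alpha = y.
    n = len(y)
    return np.linalg.solve(K + n * zeta * np.eye(n), y)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(size=50))
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(50)
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / 0.05)  # Gaussian kernel (illustrative)
alpha = kernel_ridge(K, y, zeta=1e-3)
fitted = K @ alpha  # values of the regularized estimator at the design points
```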
This procedure can indeed be viewed as a model selection procedure, since minimizing γn(f) + ζn ‖f‖²_{Hk} amounts to minimizing inf_{‖f‖_{Hk} ≤ R} γn(f) + ζn R² over R > 0.
In other words, regularization aims at selecting the "best" RKHS ball {f : ‖f‖_{Hk} ≤ R} to represent our data.
At this stage, it is interesting to realize that the balls in the RKHS can be viewed as ellipsoids in the original Hilbert space L2([0, 1]).
Now, due to the approximation properties of the finite-dimensional spaces span{φj, j ≤ D}, D ∈ ℕ*, with respect to these ellipsoids, one can think of penalized finite-dimensional projection as an alternative method to regularization.
More precisely, if ŝ_D denotes the projection estimator on span{φj, j ≤ D}, i.e., ŝ_D = Σ_{j=1}^{D} (∫ φj dY) φj, and one considers the penalized selection criterion D̂ = argmin_D [γn(ŝ_D) + 2D/n], then it is proved in [1] that the selected estimator ŝ_{D̂} obeys the following oracle inequality: E[‖s − ŝ_{D̂}‖²] ≤ C inf_{D≥1} E‖s − ŝ_D‖², where C is some absolute constant.
As a consequence, the estimator ŝ_{D̂} is simultaneously minimax over the collection of all ellipsoids E(c), which in particular includes the collection {E(√λ R), R > 0}.
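The penalized projection criterion is easy to simulate. The sketch below (illustrative signal and cosine basis, not from the paper) computes the projection estimators ŝ_D in the fixed-design regression model and selects the dimension minimizing γn(ŝ_D) + 2D/n:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
x = np.arange(1, n + 1) / n
s = np.sin(4 * np.pi * x)               # illustrative regression function
y = s + rng.standard_normal(n)          # fixed design: Y_i = s(i/n) + eps_i, eps_i ~ N(0,1)

# Orthonormal cosine basis on [0, 1] (an illustrative choice of {phi_j}).
def phi(j, t):
    return np.ones_like(t) if j == 0 else np.sqrt(2.0) * np.cos(np.pi * j * t)

Dmax = 50
coef = np.array([phi(j, x) @ y / n for j in range(Dmax)])   # empirical coefficients
crit = []
for D in range(1, Dmax + 1):
    s_D = sum(coef[j] * phi(j, x) for j in range(D))        # projection estimator
    crit.append((y - s_D) @ (y - s_D) / n + 2.0 * D / n)    # gamma_n(s_D) + 2D/n
D_hat = 1 + int(np.argmin(crit))                            # selected dimension
```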
To conclude and summarize, from a statistical performance point of view, what we can expect from a regularized estimator s̃ (i.e., a minimizer of (1)) is that a convenient choice of ζn ensures that s̃ is simultaneously minimax over the collection of ellipsoids {E(√λ R), R > 0} (at least as far as asymptotic rates of convergence are concerned).
The alternative estimator ŝ_{D̂} actually achieves this goal, and even better, since it is also adaptive over the collection of all ellipsoids and not only over the family {E(√λ R), R > 0}.
First of all, it has been noted by several authors ([6], [9]) that the SVM can be seen as a regularized estimation method, where the regularizer is the squared norm of the function in Hk.
Precisely, the SVM algorithm solves the following unconstrained optimization problem: min_{f ∈ H̃k} (1/n) Σ_{i=1}^{n} (1 − yi f(xi))_+ + λ‖f‖²_{Hk}  (2), where H̃k = {f(·) + b : f ∈ Hk, b ∈ ℝ}.
The above regularization can be viewed as a model selection process over RKHS balls, similarly to the previous section.
Now, the line of ideas developed there suggests that it might actually be a better idea to consider a sequence of finite-dimensional estimators.
Additionally, it has been shown in [4] that the regularization term of the SVM is actually too strong.
Consider a Mercer kernel k defined on X × X and let Tk denote the integral operator associated with k, defined by (Tk f)(x) = ∫ k(x, y) f(y) dP(y).
Let φ1, φ2, … denote the eigenfunctions of Tk, ordered by decreasing associated eigenvalues (λi)_{i≥1}.
For each D, the space FD = span{1, φ1, …, φD} (where 1 denotes the constant function equal to 1) corresponds to a subspace of H̃k associated with kernel k, and H̃k = ∪_{D≥1} FD.
Instead of selecting the "best" ball in the RKHS, as the SVM does, we consider the analogue of the projection estimator ŝ_D: f̂_D = argmin_{f ∈ FD} Σ_{i=1}^{n} (1 − yi f(xi))_+  (3), that is, more explicitly, f̂_D(·) = Σ_{j=1}^{D} βj φj(·) + b.
Unfortunately, since the underlying probability P is unknown, neither are the eigenfunctions φ1, φ2, …, and it is therefore not possible to implement this procedure directly.
We thus resort to empirical quantities, as will be explained in more detail in Section 3.
Essentially, the unknown vector space spanned by the first eigenfunctions of Tk is replaced by the space spanned by the first eigenvectors of the normalized kernel Gram matrix (1/n)(k(xi, xj))_{1≤i,j≤n}.
We next make this relation precise and give an interpretation of the resulting algorithm in terms of dimensionality reduction.
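In code, this empirical substitution amounts to an eigendecomposition of the normalized Gram matrix; a minimal sketch follows, where the RBF kernel and the random data are illustrative assumptions:

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    # Kernel matrix K_ij = k(x_i, x_j) for a Gaussian RBF kernel (illustrative).
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
K = rbf_gram(X)

# Eigenvectors of (1/n) K play the role of the first eigenfunctions of T_k.
n = K.shape[0]
eigvals, eigvecs = np.linalg.eigh(K / n)
order = np.argsort(eigvals)[::-1]            # sort by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
```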
Such methods are often used as a pre-processing of the data in order to reduce the dimensionality or to perform de-noising.
Hence, roughly speaking, in the KPM the SVM penalization is replaced by dimensionality reduction.
Choosing D amounts to selecting the optimal D-dimensional representation of our data for the classification task; in other words, to extracting the information needed for this task by model selection, taking into account the relevance of the directions for classification.
To conclude, the KPM is a method of dimensionality reduction that takes the labels of the training data into account to choose the "best" dimension.
3 The Kernel Projection Machine Algorithm

In this section, the empirical (and computable) version of the KPM algorithm is derived from the previous theoretical arguments.
In practice, the true eigenfunctions of the kernel operator are not computable.
However, since only the values of the eigenfunctions at x1, …, xn are needed for minimizing the empirical risk over FD, the eigenvectors of the kernel matrix K = (k(xi, xj))_{1≤i,j≤n} will be enough for our purpose.
Indeed, it is well known in numerical analysis (see [2]) that the eigenvectors of the kernel matrix approximate the eigenfunctions of the kernel operator.
Let V1, …, VD denote the D first eigenvectors of K, with associated eigenvalues λ1 ≥ λ2 ≥ … ≥ λD.
At this stage, we can expand the solution in terms of the kernel, similarly to the SVM algorithm, in the following way: f̂_D(·) = Σ_{i=1}^{n} αi k(·, xi).
63 32 0 2 4 6 8 10 12 14 16 18 20 Figure 1: Left: KPM risk (solid) and empirical risk (dashed) versus dimension D. [sent-156, score-0.374]
The expansion coefficients are obtained from β*1 V1 + … + β*D VD = Kα*  (9), which has a straightforward solution: α* = Σ_{j=1}^{D} (β*j / λj) Vj (provided the D first eigenvalues λj are all strictly positive).
The KPM algorithm is as follows. Given x1, …, xn ∈ X and a positive kernel k defined on X × X, compute the kernel matrix K and its eigenvectors V1, …, Vn, together with the eigenvalues in decreasing order λ1 ≥ λ2 ≥ … ≥ λn.
For each dimension D such that λD > 0, solve the linear optimization problem (β*, b*) = argmin_{β,b,ξ} Σ_{i=1}^{n} ξi under the constraints ξi ≥ 0 and yi(Σ_{j=1}^{D} βj (Vj)i + b) ≥ 1 − ξi for all i = 1, …, n.
Next, compute α* = Σ_{j=1}^{D} (β*j / λj) Vj and f̂_D(·) = Σ_{i=1}^{n} α*i k(·, xi) + b*.
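As an illustration (not the authors' implementation), the inner step can be written as a linear program over (β, b, ξ) and solved with an off-the-shelf LP solver; the toy data and the linear kernel below are assumptions:

```python
import numpy as np
from scipy.optimize import linprog

def kpm_fit(K, y, D):
    # Sketch of the KPM inner step: hinge-loss minimization over F_D as an LP.
    # K: kernel matrix, y: labels in {-1, +1}, D: dimension.
    n = K.shape[0]
    eigvals, eigvecs = np.linalg.eigh(K)
    order = np.argsort(eigvals)[::-1]
    lam, V = eigvals[order][:D], eigvecs[:, order][:, :D]   # top-D eigenpairs

    # Variables z = [beta (D), b (1), xi (n)]; minimize sum(xi).
    c = np.concatenate([np.zeros(D + 1), np.ones(n)])
    # y_i((V beta)_i + b) >= 1 - xi_i  <=>  -y_i (V beta)_i - y_i b - xi_i <= -1
    A = np.hstack([-y[:, None] * V, -y[:, None], -np.eye(n)])
    b_ub = -np.ones(n)
    bounds = [(None, None)] * (D + 1) + [(0, None)] * n     # beta, b free; xi >= 0
    res = linprog(c, A_ub=A, b_ub=b_ub, bounds=bounds, method="highs")
    beta, b = res.x[:D], res.x[D]
    alpha = V @ (beta / lam)          # alpha* = sum_j (beta*_j / lambda_j) V_j
    return alpha, b

# Toy usage: linearly separable labels with a (linear + constant) kernel.
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 2))
y = np.sign(X[:, 0] + 0.1)
K = X @ X.T + 1.0
alpha, b = kpm_fit(K, y, D=3)
pred = np.sign(K @ alpha + b)         # f_D(x_i) = (K alpha)_i + b
```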
The last step is a model selection problem: choose a dimension D̂ for which f̂_{D̂} performs well.
We do not address this point directly here; one can think of applying cross-validation, or of penalizing the empirical loss by a penalty function depending on the dimension.
4 Experiments

Since the algorithm involves the eigendecomposition of the kernel matrix, only small datasets have been considered for the moment.
In order to assess the performance of the KPM, we carried out experiments on benchmark datasets available on Gunnar Rätsch's web site [3].
To get a valid comparison with the SVM, on each classification task we used the same kernel parameters as those used for the SVM, so as to work with exactly the same geometry.

Table 1: Test errors of the KPM on several benchmark datasets, compared with SVM, using G. Rätsch's methodology.

Table 2: Test errors of the KPM on several benchmark datasets, compared with SVM, using standard 5-fold cross-validation on each realization.
There is a subtle but important point arising here.
In the methodology of G. Rätsch, the regularization parameter C was first determined by cross-validation on the first 5 realizations of each dataset; then the median of these values was taken as a fixed value for the other realizations.
This was done apparently to save computation time, but it might lead to an over-optimistic estimation of the performances, since in some sense some extraneous information is then available to the algorithm and the variation due to the choice of C is reduced to almost zero.
We first tried to mimic this methodology by applying it, in our case, to the choice of D itself (the median of 5 values of D obtained by cross-validation on the first realizations was then used on the other realizations).
One might then argue that in this way we are selecting a parameter of our method instead of a meta-parameter for the SVM, so that the comparison is unfair.
To avoid this kind of debate and obtain fair results, we decided to re-run the SVM tests by systematically selecting the regularization parameter by 5-fold cross-validation on each training set and, for our method, to apply the same procedure to select D.
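The 5-fold cross-validation used to select D can be sketched generically; the `fit`/`predict` interface below is a hypothetical stand-in for the KPM training and prediction steps:

```python
import numpy as np

def cv_select_D(K, y, D_grid, fit, predict, n_folds=5, seed=0):
    # Select the KPM dimension D by n_folds-fold cross-validation.
    # fit(K_tr, y_tr, D) returns a model; predict(model, K_te_tr) returns labels
    # (both are assumed interfaces, not from the paper).
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, n_folds)
    errs = np.zeros(len(D_grid))
    for te in folds:
        tr = np.setdiff1d(idx, te)
        for d, D in enumerate(D_grid):
            model = fit(K[np.ix_(tr, tr)], y[tr], D)
            yhat = predict(model, K[np.ix_(te, tr)])
            errs[d] += np.mean(yhat != y[te])
    return D_grid[int(np.argmin(errs))]

# Sanity check with a trivial always-positive classifier on all-positive labels.
D_best = cv_select_D(np.eye(20), np.ones(20), [1, 2, 3],
                     fit=lambda K_tr, y_tr, D: None,
                     predict=lambda model, K_te: np.ones(K_te.shape[0]))
```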
Note that there is still extraneous information in the choice of the kernel parameters, but at least it is the same for both algorithms.
Results relative to the first methodology are reported in Table 1, and those relative to the second one are reported in Table 2.
The globally worse performances exhibited in the second table show that the first procedure may indeed be too optimistic.
It should be mentioned that the parameter C of the SVM was systematically sought on a grid of only 100 values, ranging from 0 to three times the optimal value given in [3].
Hence, these experimental results are to be considered as preliminary, and in no way should they be used to establish a significant difference between the performances of the KPM and the SVM.
Interestingly, the graph on the left of Figure 1 shows that our procedure is very different from the one of [8]: when D is very large, our risk increases (leading to the existence of a minimum), while the risk of [8] always decreases with D.
5 Conclusion and discussion

To summarize, one can see the KPM as an alternative to the regularization of the SVM: regularization using the RKHS norm can be replaced by finite-dimensional projection.
Moreover, this algorithm performs KPCA geared towards classification, and thus offers a criterion to decide what the right order of expansion for the KPCA is.
Dimensionality reduction can thus be used for classification, but it is important to keep in mind that it behaves like a regularizer.
Hence, it is clearly useless to plug it into a classification algorithm that is already regularized: the effect of the dimensionality reduction may be canceled, as noted by [8].
Our experiments explicitly show the regularizing effect of KPCA: no other smoothness control has been added in our algorithm, and still it gives performances comparable to those of the SVM, provided the dimension D is picked correctly.
We only considered here selection of D by cross-validation; other methods such as penalization will be studied in future work.
Thus the KPM can be seen as a de-noising method that takes the labels into account.
This version of the KPM considers only one kernel and thus one vector space per dimension.
A more advanced version of this algorithm would consider several kernels and thus choose among a bigger family of spaces.
This family then contains more than one space per dimension and allows one to directly compare the performance of different kernels on a given task, thus improving the efficiency of the dimensionality reduction while taking the labels into account.
[8] Schölkopf, Smola, and Müller. Nonlinear component analysis as a kernel eigenvalue problem.
On a kernel-based method for pattern recognition, regression, approximation and operator inversion.
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition
2 0.13924693 168 nips-2004-Semigroup Kernels on Finite Sets
Author: Marco Cuturi, Jean-philippe Vert
Abstract: Complex objects can often be conveniently represented by finite sets of simpler components, such as images by sets of patches or texts by bags of words. We study the class of positive definite (p.d.) kernels for two such objects that can be expressed as a function of the merger of their respective sets of components. We prove a general integral representation of such kernels and present two particular examples. One of them leads to a kernel for sets of points living in a space endowed itself with a positive definite kernel. We provide experimental results on a benchmark experiment of handwritten digits image classification which illustrate the validity of the approach. 1
3 0.13422951 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
Author: Saharon Rosset, Robert Tibshirani, Ji Zhu, Trevor J. Hastie
Abstract: In this paper we argue that the choice of the SVM cost parameter can be critical. We then derive an algorithm that can fit the entire path of SVM solutions for every value of the cost parameter, with essentially the same computational cost as fitting one SVM model. 1
4 0.12635385 114 nips-2004-Maximum Likelihood Estimation of Intrinsic Dimension
Author: Elizaveta Levina, Peter J. Bickel
Abstract: We propose a new method for estimating intrinsic dimension of a dataset derived by applying the principle of maximum likelihood to the distances between close neighbors. We derive the estimator by a Poisson process approximation, assess its bias and variance theoretically and by simulations, and apply it to a number of simulated and real datasets. We also show it has the best overall performance compared with two other intrinsic dimension estimators. 1
5 0.12302044 34 nips-2004-Breaking SVM Complexity with Cross-Training
Author: Léon Bottou, Jason Weston, Gökhan H. Bakir
Abstract: We propose to selectively remove examples from the training set using probabilistic estimates related to editing algorithms (Devijver and Kittler, 1982). This heuristic procedure aims at creating a separable distribution of training examples with minimal impact on the position of the decision boundary. It breaks the linear dependency between the number of SVs and the number of training examples, and sharply reduces the complexity of SVMs during both the training and prediction stages. 1
6 0.11529162 69 nips-2004-Fast Rates to Bayes for Kernel Machines
7 0.11305958 96 nips-2004-Learning, Regularization and Ill-Posed Inverse Problems
8 0.11106063 42 nips-2004-Computing regularization paths for learning multiple kernels
9 0.096849382 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants
10 0.095986769 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound
11 0.095795929 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning
12 0.090374351 70 nips-2004-Following Curved Regularized Optimization Solution Paths
13 0.088588551 178 nips-2004-Support Vector Classification with Input Data Uncertainty
14 0.087563217 49 nips-2004-Density Level Detection is Classification
15 0.086712487 59 nips-2004-Efficient Kernel Discriminant Analysis via QR Decomposition
16 0.083140753 92 nips-2004-Kernel Methods for Implicit Surface Modeling
17 0.083036207 163 nips-2004-Semi-parametric Exponential Family PCA
18 0.081233777 188 nips-2004-The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space
19 0.080007017 144 nips-2004-Parallel Support Vector Machines: The Cascade SVM
20 0.077958241 54 nips-2004-Distributed Information Regularization on Graphs
simIndex simValue paperId paperTitle
same-paper 1 0.94048369 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition
2 0.65025485 96 nips-2004-Learning, Regularization and Ill-Posed Inverse Problems
Author: Lorenzo Rosasco, Andrea Caponnetto, Ernesto D. Vito, Francesca Odone, Umberto D. Giovannini
Abstract: Many works have shown that strong connections relate learning from examples to regularization techniques for ill-posed inverse problems. Nevertheless by now there was no formal evidence neither that learning from examples could be seen as an inverse problem nor that theoretical results in learning theory could be independently derived using tools from regularization theory. In this paper we provide a positive answer to both questions. Indeed, considering the square loss, we translate the learning problem in the language of regularization theory and show that consistency results and optimal regularization parameter choice can be derived by the discretization of the corresponding inverse problem. 1
3 0.63563216 94 nips-2004-Kernels for Multi--task Learning
Author: Charles A. Micchelli, Massimiliano Pontil
Abstract: This paper provides a foundation for multi–task learning using reproducing kernel Hilbert spaces of vector–valued functions. In this setting, the kernel is a matrix–valued function. Some explicit examples will be described which go beyond our earlier results in [7]. In particular, we characterize classes of matrix– valued kernels which are linear and are of the dot product or the translation invariant type. We discuss how these kernels can be used to model relations between the tasks and present linear multi–task learning algorithms. Finally, we present a novel proof of the representer theorem for a minimizer of a regularization functional which is based on the notion of minimal norm interpolation. 1
4 0.61855978 49 nips-2004-Density Level Detection is Classification
Author: Ingo Steinwart, Don Hush, Clint Scovel
Abstract: We show that anomaly detection can be interpreted as a binary classification problem. Using this interpretation we propose a support vector machine (SVM) for anomaly detection. We then present some theoretical results which include consistency and learning rates. Finally, we experimentally compare our SVM with the standard one-class SVM. 1
5 0.6168617 168 nips-2004-Semigroup Kernels on Finite Sets
Author: Marco Cuturi, Jean-philippe Vert
Abstract: Complex objects can often be conveniently represented by finite sets of simpler components, such as images by sets of patches or texts by bags of words. We study the class of positive definite (p.d.) kernels for two such objects that can be expressed as a function of the merger of their respective sets of components. We prove a general integral representation of such kernels and present two particular examples. One of them leads to a kernel for sets of points living in a space endowed itself with a positive definite kernel. We provide experimental results on a benchmark experiment of handwritten digits image classification which illustrate the validity of the approach. 1
6 0.61387801 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform
7 0.60214692 69 nips-2004-Fast Rates to Bayes for Kernel Machines
8 0.59561908 34 nips-2004-Breaking SVM Complexity with Cross-Training
9 0.59097993 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants
10 0.57011706 114 nips-2004-Maximum Likelihood Estimation of Intrinsic Dimension
11 0.54933167 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
12 0.54052579 42 nips-2004-Computing regularization paths for learning multiple kernels
13 0.51263052 14 nips-2004-A Topographic Support Vector Machine: Classification Using Local Label Configurations
14 0.48582333 59 nips-2004-Efficient Kernel Discriminant Analysis via QR Decomposition
15 0.48105994 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound
16 0.48006576 144 nips-2004-Parallel Support Vector Machines: The Cascade SVM
17 0.47676569 188 nips-2004-The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space
18 0.47071549 137 nips-2004-On the Adaptive Properties of Decision Trees
19 0.46704355 92 nips-2004-Kernel Methods for Implicit Surface Modeling
20 0.45407468 8 nips-2004-A Machine Learning Approach to Conjoint Analysis
topicId topicWeight
[(13, 0.15), (15, 0.123), (26, 0.066), (31, 0.061), (33, 0.157), (35, 0.022), (39, 0.024), (50, 0.032), (52, 0.011), (71, 0.252), (76, 0.021), (94, 0.012)]
simIndex simValue paperId paperTitle
1 0.90072548 83 nips-2004-Incremental Learning for Visual Tracking
Author: Jongwoo Lim, David A. Ross, Ruei-sung Lin, Ming-Hsuan Yang
Abstract: Most existing tracking algorithms construct a representation of a target object prior to the tracking task starts, and utilize invariant features to handle appearance variation of the target caused by lighting, pose, and view angle change. In this paper, we present an efficient and effective online algorithm that incrementally learns and adapts a low dimensional eigenspace representation to reflect appearance changes of the target, thereby facilitating the tracking task. Furthermore, our incremental method correctly updates the sample mean and the eigenbasis, whereas existing incremental subspace update methods ignore the fact the sample mean varies over time. The tracking problem is formulated as a state inference problem within a Markov Chain Monte Carlo framework and a particle filter is incorporated for propagating sample distributions over time. Numerous experiments demonstrate the effectiveness of the proposed tracking algorithm in indoor and outdoor environments where the target objects undergo large pose and lighting changes. 1
2 0.88541329 24 nips-2004-Approximately Efficient Online Mechanism Design
Author: David C. Parkes, Dimah Yanovsky, Satinder P. Singh
Abstract: Online mechanism design (OMD) addresses the problem of sequential decision making in a stochastic environment with multiple self-interested agents. The goal in OMD is to make value-maximizing decisions despite this self-interest. In previous work we presented a Markov decision process (MDP)-based approach to OMD in large-scale problem domains. In practice the underlying MDP needed to solve OMD is too large and hence the mechanism must consider approximations. This raises the possibility that agents may be able to exploit the approximation for selfish gain. We adopt sparse-sampling-based MDP algorithms to implement efficient policies, and retain truth-revelation as an approximate BayesianNash equilibrium. Our approach is empirically illustrated in the context of the dynamic allocation of WiFi connectivity to users in a coffeehouse. 1
same-paper 3 0.83202159 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition
Author: Laurent Zwald, Gilles Blanchard, Pascal Massart, Régis Vert
Abstract: This paper investigates the effect of Kernel Principal Component Analysis (KPCA) within the classification framework, essentially the regularization properties of this dimensionality reduction method. KPCA has been previously used as a pre-processing step before applying an SVM but we point out that this method is somewhat redundant from a regularization point of view and we propose a new algorithm called Kernel Projection Machine to avoid this redundancy, based on an analogy with the statistical framework of regression for a Gaussian white noise model. Preliminary experimental results show that this algorithm reaches the same performances as an SVM. 1
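As a rough, self-contained illustration of the KPCA-as-preprocessing pipeline this abstract discusses (a sketch with an RBF kernel and numpy eigendecomposition, not the authors' Kernel Projection Machine; the normalization choice is one common convention):

```python
import numpy as np

def kpca_features(X, D, gamma=1.0):
    """Project data onto the top-D kernel principal components (RBF kernel)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * sq)                  # n x n Gram matrix
    n = len(X)
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                           # center the kernel in feature space
    vals, vecs = np.linalg.eigh(Kc)          # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:D]         # keep the top-D components
    alphas = vecs[:, idx] / np.sqrt(np.maximum(vals[idx], 1e-12))
    return Kc @ alphas                       # n x D projected features

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
Z = kpca_features(X, D=3)
```

A classifier (e.g., a linear SVM) would then be trained on `Z`; the paper's point is that the choice of `D` itself acts as the regularizer, making an additional SVM-style penalty partly redundant.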
4 0.77918303 16 nips-2004-Adaptive Discriminative Generative Model and Its Applications
Author: Ruei-sung Lin, David A. Ross, Jongwoo Lim, Ming-Hsuan Yang
Abstract: This paper presents an adaptive discriminative generative model that generalizes the conventional Fisher Linear Discriminant algorithm and renders a proper probabilistic interpretation. Within the context of object tracking, we aim to find a discriminative generative model that best separates the target from the background. We present a computationally efficient algorithm to constantly update this discriminative model as time progresses. While most tracking algorithms operate on the premise that the object appearance or ambient lighting condition does not significantly change as time progresses, our method adapts a discriminative generative model to reflect appearance variation of the target and background, thereby facilitating the tracking task in ever-changing environments. Numerous experiments show that our method is able to learn a discriminative generative model for tracking target objects undergoing large pose and lighting changes.
5 0.73381156 91 nips-2004-Joint Tracking of Pose, Expression, and Texture using Conditionally Gaussian Filters
Author: Tim K. Marks, J. C. Roddey, Javier R. Movellan, John R. Hershey
Abstract: We present a generative model and stochastic filtering algorithm for simultaneous tracking of 3D position and orientation, non-rigid motion, object texture, and background texture using a single camera. We show that the solution to this problem is formally equivalent to stochastic filtering of conditionally Gaussian processes, a problem for which well-known approaches exist [3, 8]. We propose an approach based on Monte Carlo sampling of the nonlinear component of the process (object motion) and exact filtering of the object and background textures given the sampled motion. The smoothness of image sequences in time and space is exploited by using Laplace's method to generate proposal distributions for importance sampling [7]. The resulting inference algorithm encompasses both optic flow and template-based tracking as special cases, and elucidates the conditions under which these methods are optimal. We demonstrate an application of the system to 3D non-rigid face tracking.

1 Background

Recent algorithms track morphable objects by solving optic flow equations, subject to the constraint that the tracked points belong to an object whose non-rigid deformations are linear combinations of a set of basic shapes [10, 2, 11]. These algorithms require precise initialization of the object pose and tend to drift out of alignment on long video sequences. We present G-flow, a generative model and stochastic filtering formulation of tracking that addresses the problems of initialization and error recovery in a principled manner. We define a non-rigid object by the 3D locations of n vertices. The object is a linear combination of k fixed morph bases, with coefficients c = [c_1, c_2, · · · , c_k]^T. The fixed 3 × k matrix h_i contains the position of the ith vertex in all k morph bases. The transformation from object-centered to image coordinates consists of a rotation, weak perspective projection, and translation.
Thus x_i, the 2D location of the ith vertex on the image plane, is

    x_i = g r h_i c + l,    (1)

where r is the 3 × 3 rotation matrix, l is the 2 × 1 translation vector, and g = [1 0 0; 0 1 0] is the 2 × 3 projection matrix. The object pose, u_t, comprises both the rigid motion parameters and the morph parameters at time t:

    u_t = {r(t), l(t), c(t)}.    (2)

1.1 Optic flow

Let y_t represent the current image, and let x_i(u_t) index the image pixel that is rendered by the ith object vertex when the object assumes pose u_t. Suppose that we know u_{t−1}, the pose at time t−1, and we want to find u_t, the pose at time t. This problem can be solved by minimizing the following form with respect to u_t:

    û_t = argmin_{u_t} (1/2) Σ_{i=1}^{n} [y_t(x_i(u_t)) − y_{t−1}(x_i(u_{t−1}))]^2.    (3)

In the special case in which the x_i(u_t) are neighboring points that move with the same 2D displacement, this reduces to the standard Lucas-Kanade optic flow algorithm [9, 1]. Recent work [10, 2, 11] has shown that in the general case, this optimization problem can be solved efficiently using the Gauss-Newton method. We will take advantage of this fact to develop an efficient stochastic inference algorithm within the framework of G-flow.

Notational conventions: unless otherwise stated, capital letters are used for random variables, small letters for specific values taken by random variables, and Greek letters for fixed model parameters. Subscripted colons indicate sequences: e.g., X_{1:t} = X_1 · · · X_t. The term I_n stands for the n × n identity matrix, E for expected value, Var for the covariance matrix, and Var^{−1} for the inverse of the covariance matrix (precision matrix).

2 The Generative Model for G-Flow

Figure 1: Left: a(U_t) determines which texel (color at a vertex of the object model or a pixel of the background model) is responsible for rendering each image pixel. Right: G-flow video generation model: at time t, the object's 3D pose, U_t, is used to project the object texture, V_t, into 2D.
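For intuition, a toy, translation-only 1D version of objective (3) can be minimized with Gauss-Newton iterations in the spirit of Lucas-Kanade. This is a sketch under simplifying assumptions (one signal, a pure shift as the only pose parameter), not the constrained-flow solver of [10, 2]:

```python
import numpy as np

def lk_shift(prev, curr, iters=20):
    """Estimate the 1D shift d minimizing sum_x (curr(x+d) - prev(x))^2."""
    x = np.arange(len(prev), dtype=float)
    d = 0.0
    for _ in range(iters):
        warped = np.interp(x + d, x, curr)    # curr resampled at shifted grid
        grad = np.gradient(warped)            # signal gradient = Jacobian wrt d
        r = warped - prev                     # residual
        denom = (grad * grad).sum()
        if denom < 1e-12:
            break
        d -= (grad * r).sum() / denom         # Gauss-Newton update
    return d

x = np.arange(64, dtype=float)
prev = np.exp(-0.5 * ((x - 30.0) / 5.0) ** 2)    # Gaussian bump at 30
curr = np.exp(-0.5 * ((x - 28.5) / 5.0) ** 2)    # same bump shifted to 28.5
d = lk_shift(prev, curr)                          # recovers d close to -1.5
```

In the full problem, the scalar shift becomes the pose vector u_t and the same Gauss-Newton machinery applies to the projected vertex locations x_i(u_t).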
This projection is combined with the background texture, B_t, to generate the observed image, Y_t. We model the image sequence Y as a stochastic process generated by three hidden causes, U, V, and B, as shown in the graphical model (Figure 1, right). The m × 1 random vector Y_t represents the m-pixel image at time t. The n × 1 random vector V_t and the m × 1 random vector B_t represent the n-texel object texture and the m-texel background texture, respectively. As illustrated in Figure 1, left, the object pose, U_t, determines onto which image pixels the object and background texels project at time t. This is formulated using the projection function a(U_t). For a given pose, u_t, the projection a(u_t) is a block matrix, a(u_t) ≝ [a_v(u_t)  a_b(u_t)]. Here a_v(u_t), the object projection function, is an m × n matrix of 0s and 1s that tells onto which image pixel each object vertex projects; e.g., a 1 at row j, column i means that the ith object point projects onto image pixel j. Matrix a_b plays the same role for background pixels. Assuming the foreground mapping is one-to-one, we let a_b = I_m − a_v(u_t) a_v(u_t)^T, expressing the simple occlusion constraint that every image pixel is rendered by object or background, but not both. In the G-flow generative model:

    Y_t = a(U_t) [V_t ; B_t] + W_t,    W_t ∼ N(0, σ_w I_m), σ_w > 0    (4)
    U_t ∼ p(u_t | u_{t−1})
    V_t = V_{t−1} + Z^v_{t−1},    Z^v_{t−1} ∼ N(0, Ψ_v), Ψ_v diagonal
    B_t = B_{t−1} + Z^b_{t−1},    Z^b_{t−1} ∼ N(0, Ψ_b), Ψ_b diagonal

where p(u_t | u_{t−1}) is the pose transition distribution, and Z^v, Z^b, W are independent of each other, of the initial conditions, and over time. The form of the pose distribution is left unspecified since the algorithm proposed here does not require the pose distribution or the pose dynamics to be Gaussian. For the initial conditions, we require that the variance of V_1 and the variance of B_1 are both diagonal.
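A minimal simulation of the generative equations (4) with a fixed pose and a hand-built one-to-one projection makes the occlusion constraint concrete. All sizes and noise levels below are illustrative, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, T = 4, 6, 10                          # 4 object texels, 6-pixel image
a_v = np.zeros((m, n))
a_v[[0, 2, 3, 5], [0, 1, 2, 3]] = 1.0       # each vertex owns one image pixel
a_b = np.eye(m) - a_v @ a_v.T               # occlusion: remaining pixels are background
sigma_w, psi_v, psi_b = 0.01, 0.05, 0.05    # observation / transition variances

V = rng.normal(size=n)                      # initial object texture
B = rng.normal(size=m)                      # initial background texture
frames = []
for _ in range(T):
    # Y_t = a(U_t) [V_t; B_t] + W_t with a fixed pose
    Y = a_v @ V + a_b @ B + rng.normal(scale=np.sqrt(sigma_w), size=m)
    frames.append(Y)
    V = V + rng.normal(scale=np.sqrt(psi_v), size=n)   # texture random walks
    B = B + rng.normal(scale=np.sqrt(psi_b), size=m)
```

Because the columns of a_v are orthonormal, a_b annihilates object pixels (a_b a_v = 0), so each pixel is rendered by exactly one of the two texture maps, as the text requires.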
Non-rigid 3D tracking is a difficult nonlinear filtering problem because changing the pose has a nonlinear effect on the image pixels. Fortunately, the problem has a rich structure that we can exploit: under the G-flow model, video generation is a conditionally Gaussian process [3, 6, 4, 5]. If the specific values taken by the pose sequence, u_{1:t}, were known, then the texture processes, V and B, and the image process, Y, would be jointly Gaussian. This suggests the following scheme: we could use particle filtering to obtain a distribution of pose experts (each expert corresponds to a highly probable sample of pose, u_{1:t}). For each expert we could then use Kalman filtering equations to infer the posterior distribution of texture given the observed images. This method is known in the statistics community as a Monte Carlo filtering solution for conditionally Gaussian processes [3, 4], and in the machine learning community as Rao-Blackwellized particle filtering [6, 5]. We found that in addition to Rao-Blackwellization, it was also critical to use Laplace's method to generate the proposal distributions for importance sampling [7]. In the context of G-flow, we accomplished this by performing an optic flow-like optimization, using an efficient algorithm similar to those in [10, 2].

3 Inference

Our goal is to find an expression for the filtering distribution, p(u_t, v_t, b_t | y_{1:t}). Using the law of total probability, we have the following equation for the filtering distribution:

    p(u_t, v_t, b_t | y_{1:t}) = ∫ p(u_t, v_t, b_t | u_{1:t−1}, y_{1:t}) p(u_{1:t−1} | y_{1:t}) du_{1:t−1}    (5)

where the first factor in the integrand is the opinion of expert u_{1:t−1} and the second is the credibility of that expert. We can think of the integral in (5) as a sum over a distribution of experts, where each expert corresponds to a single pose history, u_{1:t−1}. Based on its hypothesis about pose history, each expert has an opinion about the current pose of the object, U_t, and the texture maps of the object and background, V_t and B_t.
Each expert also has a credibility, a scalar that measures how well the expert's opinion matches the observed image y_t. Thus, (5) can be interpreted as follows: the filtering distribution at time t is obtained by integrating over the entire ensemble of experts the opinion of each expert weighted by that expert's credibility. The opinion distribution of expert u_{1:t−1} can be factorized into the expert's opinion about the pose U_t times the conditional distribution of texture V_t, B_t given pose:

    p(u_t, v_t, b_t | u_{1:t−1}, y_{1:t}) = p(u_t | u_{1:t−1}, y_{1:t}) p(v_t, b_t | u_{1:t}, y_{1:t})    (6)

where the first factor is the pose opinion and the second is the texture opinion given pose. The rest of this section explains how we evaluate each term in (5) and (6). We cover the distribution of texture given pose in 3.1, pose opinion in 3.2, and credibility in 3.3.

3.1 Texture opinion given pose

The distribution of V_t and B_t given the pose history u_{1:t} is Gaussian with mean and covariance that can be obtained using the Kalman filter estimation equations:

    Var^{−1}(V_t, B_t | u_{1:t}, y_{1:t}) = Var^{−1}(V_t, B_t | u_{1:t−1}, y_{1:t−1}) + a(u_t)^T σ_w^{−1} a(u_t)    (7)
    E(V_t, B_t | u_{1:t}, y_{1:t}) = Var(V_t, B_t | u_{1:t}, y_{1:t}) × [Var^{−1}(V_t, B_t | u_{1:t−1}, y_{1:t−1}) E(V_t, B_t | u_{1:t−1}, y_{1:t−1}) + a(u_t)^T σ_w^{−1} y_t]    (8)

This requires p(V_t, B_t | u_{1:t−1}, y_{1:t−1}), which we get from the Kalman prediction equations:

    E(V_t, B_t | u_{1:t−1}, y_{1:t−1}) = E(V_{t−1}, B_{t−1} | u_{1:t−1}, y_{1:t−1})    (9)
    Var(V_t, B_t | u_{1:t−1}, y_{1:t−1}) = Var(V_{t−1}, B_{t−1} | u_{1:t−1}, y_{1:t−1}) + [Ψ_v 0; 0 Ψ_b]    (10)

In (9), the expected value E(V_t, B_t | u_{1:t−1}, y_{1:t−1}) consists of texture maps (templates) for the object and background. In (10), Var(V_t, B_t | u_{1:t−1}, y_{1:t−1}) represents the degree of uncertainty about each texel in these texture maps. Since this is a diagonal matrix, we can refer to the mean and variance of each texel individually.
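Because the covariances in (7)–(10) are diagonal, the recursions decouple into independent scalar updates per texel. A hedged sketch for one texel observed directly with projection weight 1 (noise values are illustrative):

```python
import numpy as np

def texel_update(mu, var, y, sigma_w, psi):
    """One predict + correct step for a single observed texel (scalar Kalman)."""
    var_pred = var + psi                        # prediction (10): add transition noise
    prec = 1.0 / var_pred + 1.0 / sigma_w       # correction (7): precisions add
    var_new = 1.0 / prec
    mu_new = var_new * (mu / var_pred + y / sigma_w)   # correction (8)
    return mu_new, var_new

# a texel starting at mean 0, variance 1, repeatedly observing value 1.0
mu, var = 0.0, 1.0
for y in [1.0, 1.0, 1.0]:
    mu, var = texel_update(mu, var, y, sigma_w=0.1, psi=0.0)
```

After a few observations the texel mean converges toward the observed value and its variance shrinks, which is exactly how the texture template sharpens as frames accumulate.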
For the ith texel in the object texture map, we use the following notation:

    μ^v_t(i) ≝ ith element of E(V_t | u_{1:t−1}, y_{1:t−1})
    σ^v_t(i) ≝ (i, i)th element of Var(V_t | u_{1:t−1}, y_{1:t−1})

Similarly, define μ^b_t(j) and σ^b_t(j) as the mean and variance of the jth texel in the background texture map. (This notation leaves the dependency on u_{1:t−1} and y_{1:t−1} implicit.)

3.2 Pose opinion

Based on its current texture template (derived from the history of poses and images up to time t−1) and the new image y_t, each expert u_{1:t−1} has a pose opinion, p(u_t | u_{1:t−1}, y_{1:t}), a probability distribution representing that expert's beliefs about the pose at time t. Since the effect of u_t on the likelihood function is nonlinear, we will not attempt to find an analytical solution for the pose opinion distribution. However, due to the spatio-temporal smoothness of video signals, it is possible to estimate the peak and variance of an expert's pose opinion.

3.2.1 Estimating the peak of an expert's pose opinion

We want to estimate û_t(u_{1:t−1}), the value of u_t that maximizes the pose opinion. Since

    p(u_t | u_{1:t−1}, y_{1:t}) = [p(y_{1:t−1} | u_{1:t−1}) / p(y_{1:t} | u_{1:t−1})] p(u_t | u_{t−1}) p(y_t | u_{1:t}, y_{1:t−1}),    (11)

    û_t(u_{1:t−1}) ≝ argmax_{u_t} p(u_t | u_{1:t−1}, y_{1:t}) = argmax_{u_t} p(u_t | u_{t−1}) p(y_t | u_{1:t}, y_{1:t−1}).    (12)

We now need an expression for the final term in (12), the predictive distribution p(y_t | u_{1:t}, y_{1:t−1}).
By integrating out the hidden texture variables from p(y_t, v_t, b_t | u_{1:t}, y_{1:t−1}), and using the conditional independence relationships defined by the graphical model (Figure 1, right), we can derive:

    log p(y_t | u_{1:t}, y_{1:t−1}) = −(m/2) log 2π − (1/2) log |Var(Y_t | u_{1:t}, y_{1:t−1})|
        − (1/2) Σ_{i=1}^{n} (y_t(x_i(u_t)) − μ^v_t(i))^2 / (σ^v_t(i) + σ_w)
        − (1/2) Σ_{j∉X(u_t)} (y_t(j) − μ^b_t(j))^2 / (σ^b_t(j) + σ_w),    (13)

where x_i(u_t) is the image pixel rendered by the ith object vertex when the object assumes pose u_t, and X(u_t) is the set of all image pixels rendered by the object under pose u_t. Combining (12) and (13), we can derive:

    û_t(u_{1:t−1}) = argmin_{u_t} { −log p(u_t | u_{t−1})
        + (1/2) Σ_{i=1}^{n} [ (y_t(x_i(u_t)) − μ^v_t(i))^2 / (σ^v_t(i) + σ_w)        (foreground term)
        − (y_t(x_i(u_t)) − μ^b_t(x_i(u_t)))^2 / (σ^b_t(x_i(u_t)) + σ_w)
        − log(σ^b_t(x_i(u_t)) + σ_w) ] }        (background terms)    (14)

Note the similarity between (14) and constrained optic flow (3). For example, focus on the foreground term in (14) and ignore the weights in the denominator. The previous image y_{t−1} from (3) has been replaced by μ^v_t(·), the estimated object texture based on the images and poses up to time t−1. As in optic flow, we can find the pose estimate û_t(u_{1:t−1}) efficiently using the Gauss-Newton method.

3.2.2 Estimating the distribution of an expert's pose opinion

We estimate the distribution of an expert's pose opinion using a combination of Laplace's method and importance sampling. Suppose at time t−1 we are given a sample of experts indexed by d, each endowed with a pose sequence u^{(d)}_{1:t−1}, a weight w^{(d)}_{t−1}, and the means and variances of Gaussian distributions for object and background texture. For each expert u^{(d)}_{1:t−1}, we use (14) to compute û^{(d)}_t, the peak of the pose distribution at time t according to that expert. Define σ̂^{(d)}_t as the inverse Hessian matrix of (14) at this peak, the Laplace estimate of the covariance matrix of the expert's opinion.
We then generate a set of s independent samples {u^{(d,e)}_t : e = 1, · · · , s} from a Gaussian distribution with mean û^{(d)}_t and variance proportional to σ̂^{(d)}_t, g(· | û^{(d)}_t, α σ̂^{(d)}_t), where the parameter α > 0 determines the sharpness of the sampling distribution. (Note that letting α → 0 would be equivalent to simply setting the new pose equal to the peak of the pose opinion, u^{(d,e)}_t = û^{(d)}_t.) To find the parameters of this Gaussian proposal distribution, we use the Gauss-Newton method, ignoring the second of the two background terms in (14). (This term is not ignored in the importance sampling step.) To refine our estimate of the pose opinion we use importance sampling. We assign each sample from the proposal distribution an importance weight w_t(d, e) that is proportional to the ratio between the posterior distribution and the proposal distribution:

    p̂(u_t | u^{(d)}_{1:t−1}, y_{1:t}) = Σ_{e=1}^{s} δ(u_t − u^{(d,e)}_t) w_t(d, e) / Σ_{f=1}^{s} w_t(d, f)    (15)

    w_t(d, e) = p(u^{(d,e)}_t | u^{(d)}_{t−1}) p(y_t | u^{(d)}_{1:t−1}, u^{(d,e)}_t, y_{1:t−1}) / g(u^{(d,e)}_t | û^{(d)}_t, α σ̂^{(d)}_t)    (16)

The numerator of (16) is proportional to p(u^{(d,e)}_t | u^{(d)}_{1:t−1}, y_{1:t}) by (12), and the denominator of (16) is the sampling distribution.

3.3 Estimating an expert's credibility

The credibility of the dth expert, p(u^{(d)}_{1:t−1} | y_{1:t}), is proportional to the product of a prior term and a likelihood term:

    p(u^{(d)}_{1:t−1} | y_{1:t}) = p(u^{(d)}_{1:t−1} | y_{1:t−1}) p(y_t | u^{(d)}_{1:t−1}, y_{1:t−1}) / p(y_t | y_{1:t−1}).    (17)

Regarding the likelihood,

    p(y_t | u_{1:t−1}, y_{1:t−1}) = ∫ p(y_t, u_t | u_{1:t−1}, y_{1:t−1}) du_t = ∫ p(y_t | u_{1:t}, y_{1:t−1}) p(u_t | u_{t−1}) du_t    (18)

We already generated a set of samples {u^{(d,e)}_t : e = 1, · · · , s} that estimate the pose opinion of the dth expert, p(u_t | u^{(d)}_{1:t−1}, y_{1:t}).
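The weighting in (15)–(16) is standard self-normalized importance sampling. A generic 1D sketch follows; the target here is a made-up unnormalized density standing in for the product of pose prior and predictive likelihood, and all parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

def target_unnorm(u):
    # stand-in for p(ut | ut-1) * p(yt | ...), known only up to a constant;
    # symmetric about u = 2, so its true mean is 2
    return np.exp(-(u - 2.0) ** 4)

u_hat, alpha_sigma = 2.2, 1.0            # Laplace-style proposal: peak + scaled spread
s = 5000
samples = rng.normal(u_hat, alpha_sigma, size=s)
proposal = np.exp(-0.5 * ((samples - u_hat) / alpha_sigma) ** 2) \
           / (alpha_sigma * np.sqrt(2.0 * np.pi))

w = target_unnorm(samples) / proposal    # analogue of (16): target / proposal
w /= w.sum()                             # self-normalization as in (15)
post_mean = float((w * samples).sum())   # importance-weighted estimate of E[u]
```

Even though the proposal is centered at 2.2 rather than the target's mean of 2, the weights correct the mismatch, which is the role (16) plays for each expert's pose opinion.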
We can now use these samples to estimate the likelihood for the dth expert:

    p(y_t | u^{(d)}_{1:t−1}, y_{1:t−1}) = ∫ p(y_t | u^{(d)}_{1:t−1}, u_t, y_{1:t−1}) p(u_t | u^{(d)}_{t−1}) du_t    (19)
        = ∫ p(y_t | u^{(d)}_{1:t−1}, u_t, y_{1:t−1}) g(u_t | û^{(d)}_t, α σ̂^{(d)}_t) [p(u_t | u^{(d)}_{t−1}) / g(u_t | û^{(d)}_t, α σ̂^{(d)}_t)] du_t ≈ (1/s) Σ_{e=1}^{s} w_t(d, e)

3.4 Updating the filtering distribution

Once we have calculated the opinion and credibility of each expert u_{1:t−1}, we evaluate the integral in (5) as a weighted sum over experts. The credibilities of all of the experts are normalized to sum to 1. New experts u_{1:t} (children) are created from the old experts u_{1:t−1} (parents) by appending a pose u_t to the parent's history of poses u_{1:t−1}. Every expert in the new generation is created as follows: one parent is chosen to sire the child. The probability of being chosen is proportional to the parent's credibility. The child's value of u_t is chosen at random from its parent's pose opinion (the weighted samples described in Section 3.2.2).

4 Relation to Optic Flow and Template Matching

In basic template-matching, the same time-invariant texture map is used to track every frame in the video sequence. Optic flow can be thought of as template-matching with a template that is completely reset at each frame for use in the subsequent frame. In most cases, optimal inference under G-flow involves a combination of optic flow-based and template-based tracking, in which the texture template gradually evolves as new images are presented. Pure optic flow and template-matching emerge as special cases.

Optic Flow as a Special Case: Suppose that the pose transition probability p(u_t | u_{t−1}) is uninformative, that the background is uninformative, that every texel in the initial object texture map has equal variance, Var(V_1) = κ I_n, and that the texture transition uncertainty is very high, Ψ_v → diag(∞).
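The child-generation step just described (parent chosen in proportion to credibility, child pose drawn from the parent's weighted opinion samples) is the usual particle-filter resampling pattern. A generic toy sketch with made-up credibilities and opinion samples:

```python
import numpy as np

rng = np.random.default_rng(4)

# toy ensemble: 3 parent experts, each with a normalized credibility and a
# pose opinion given as (sample values, sample weights), as in Section 3.2.2
cred = np.array([0.1, 0.6, 0.3])
opinions = [
    (np.array([0.0, 0.1]), np.array([0.5, 0.5])),
    (np.array([1.0, 1.2]), np.array([0.8, 0.2])),
    (np.array([2.0, 2.5]), np.array([0.4, 0.6])),
]

n_children = 1000
parents = rng.choice(3, size=n_children, p=cred)      # sire chosen ∝ credibility
children = np.array([
    rng.choice(opinions[d][0], p=opinions[d][1])      # pose drawn from parent opinion
    for d in parents
])
```

High-credibility parents sire many children while low-credibility lineages die out, which is how bad pose hypotheses get weeded out over time.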
Using (7), (8), and (10), it follows that:

    μ^v_t(i) = [a_v(u_{t−1})^T y_{t−1}]_i = y_{t−1}(x_i(u_{t−1})),    (20)

i.e., the object texture map at time t is determined by the pixels from image y_{t−1} that according to pose u_{t−1} were rendered by the object. As a result, (14) reduces to:

    û_t(u_{1:t−1}) = argmin_{u_t} (1/2) Σ_{i=1}^{n} [y_t(x_i(u_t)) − y_{t−1}(x_i(u_{t−1}))]^2    (21)

which is identical to (3). Thus constrained optic flow [10, 2, 11] is simply a special case of optimal inference under G-flow, with a single expert and with sampling parameter α → 0. The key assumption that Ψ_v → diag(∞) means that the object's texture is very different in adjacent frames. However, optic flow is typically applied in situations in which the object's texture in adjacent frames is similar. The optimal solution in such situations calls not for optic flow, but for a texture map that integrates information across multiple frames.

Template Matching as a Special Case: Suppose the initial texture map is known precisely, Var(V_1) = 0, and the texture transition uncertainty is very low, Ψ_v → 0. By (7), (8), and (10), it follows that μ^v_t(i) = μ^v_{t−1}(i) = μ^v_1(i), i.e., the texture map does not change over time, but remains fixed at its initial value (it is a texture template). Then (14) becomes:

    û_t(u_{1:t−1}) = argmin_{u_t} (1/2) Σ_{i=1}^{n} [y_t(x_i(u_t)) − μ^v_1(i)]^2    (22)

where μ^v_1(i) is the ith texel of the fixed texture template. This is the error function minimized by standard template-matching algorithms. The key assumption that Ψ_v → 0 means the object's texture is constant from each frame to the next, which is rarely true in real data. G-flow provides a principled way to relax this unrealistic assumption of template methods.

General Case: In general, if the background is uninformative, then minimizing (14) results in a weighted combination of optic flow and template matching, with the weight of each approach depending on the current level of certainty about the object template.
In addition, when there is useful information in the background, G-flow infers a model of the background which is used to improve tracking.

Figure 2: G-flow tracking an outdoor video. Results are shown for frames 1, 81, and 620.

5 Simulations

We collected a video (30 frames/sec) of a subject in an outdoor setting who made a variety of facial expressions while moving her head. A later motion-capture session was used to create a 3D morphable model of her face, consisting of a set of 5 morph bases (k = 5). Twenty experts were initialized randomly near the correct pose on frame 1 of the video and propagated using G-flow inference (assuming an uninformative background). See http://mplab.ucsd.edu for video. Figure 2 shows the distribution of experts for three frames. In each frame, every expert has a hypothesis about the pose (translation, rotation, scale, and morph coefficients). The 38 points in the model are projected into the image according to each expert's pose, yielding 760 red dots in each frame. In each frame, the mean of the experts gives a single hypothesis about the 3D non-rigid deformation of the face (lower right) as well as the rigid pose of the face (rotated 3D axes, lower left). Notice G-flow's ability to recover from error: bad initial hypotheses are weeded out, leaving only good hypotheses. To compare G-flow's performance versus deterministic constrained optic flow algorithms such as [10, 2, 11], we used both G-flow and the method from [2] to track the same video sequence. We ran each tracker several times, introducing small errors in the starting pose.

Figure 3: Average error over time for G-flow (green) and for deterministic optic flow [2] (blue). Results were averaged over 16 runs (deterministic algorithm) or 4 runs (G-flow) and smoothed.

As ground truth, the 2D locations of 6 points were hand-labeled in every 20th frame.
The error at every 20th frame was calculated as the distance from these labeled locations to the inferred (tracked) locations, averaged across several runs. Figure 3 compares this tracking error as a function of time for the deterministic constrained optic flow algorithm and for a 20-expert version of the G-flow tracking algorithm. Notice that the deterministic system has a tendency to drift (increase in error) over time, whereas G-flow can recover from drift.

Acknowledgments

Tim K. Marks was supported by NSF grant IIS-0223052 and NSF grant DGE-0333451 to GWC. John Hershey was supported by the UCDIMI grant D00-10084. J. Cooper Roddey was supported by the Swartz Foundation. Javier R. Movellan was supported by NSF grants IIS-0086107, IIS-0220141, and IIS-0223052, and by the UCDIMI grant D00-10084.

References

[1] Simon Baker and Iain Matthews. Lucas-Kanade 20 years on: A unifying framework. International Journal of Computer Vision, 56(3):221–255, 2002.
[2] M. Brand. Flexible flow for 3D nonrigid tracking and shape recovery. In CVPR, volume 1, pages 315–322, 2001.
[3] H. Chen, P. Kumar, and J. van Schuppen. On Kalman filtering for conditionally Gaussian systems with random matrices. Syst. Contr. Lett., 13:397–404, 1989.
[4] R. Chen and J. Liu. Mixture Kalman filters. J. R. Statist. Soc. B, 62:493–508, 2000.
[5] A. Doucet and C. Andrieu. Particle filtering for partially observed Gaussian state space models. J. R. Statist. Soc. B, 64:827–838, 2002.
[6] A. Doucet, N. de Freitas, K. Murphy, and S. Russell. Rao-Blackwellised particle filtering for dynamic Bayesian networks. In 16th Conference on Uncertainty in AI, pages 176–183, 2000.
[7] A. Doucet, S. J. Godsill, and C. Andrieu. On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10:197–208, 2000.
[8] Zoubin Ghahramani and Geoffrey E. Hinton. Variational learning for switching state-space models. Neural Computation, 12(4):831–864, 2000.
[9] B. Lucas and T. Kanade.
An iterative image registration technique with an application to stereo vision. In Proceedings of the International Joint Conference on Artificial Intelligence, 1981.
[10] L. Torresani, D. Yang, G. Alexander, and C. Bregler. Tracking and modeling non-rigid objects with rank constraints. In CVPR, pages 493–500, 2001.
[11] Lorenzo Torresani, Aaron Hertzmann, and Christoph Bregler. Learning non-rigid 3D shape from 2D motion. In Advances in Neural Information Processing Systems 16. MIT Press, 2004.
6 0.70958573 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons
7 0.70619297 131 nips-2004-Non-Local Manifold Tangent Learning
8 0.70015168 5 nips-2004-A Harmonic Excitation State-Space Approach to Blind Separation of Speech
9 0.69902033 28 nips-2004-Bayesian inference in spiking neurons
10 0.69838136 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
11 0.6965794 73 nips-2004-Generative Affine Localisation and Tracking
12 0.69595748 102 nips-2004-Learning first-order Markov models for control
13 0.6958378 163 nips-2004-Semi-parametric Exponential Family PCA
14 0.69533211 70 nips-2004-Following Curved Regularized Optimization Solution Paths
15 0.69327855 96 nips-2004-Learning, Regularization and Ill-Posed Inverse Problems
16 0.69284755 64 nips-2004-Experts in a Markov Decision Process
17 0.69217628 36 nips-2004-Class-size Independent Generalization Analsysis of Some Discriminative Multi-Category Classification
18 0.69148427 116 nips-2004-Message Errors in Belief Propagation
19 0.69013196 22 nips-2004-An Investigation of Practical Approximate Nearest Neighbor Algorithms
20 0.68963665 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants