nips nips2004 nips2004-166 knowledge-graph by maker-knowledge-mining

166 nips-2004-Semi-supervised Learning via Gaussian Processes


Source: pdf

Author: Neil D. Lawrence, Michael I. Jordan

Abstract: We present a probabilistic approach to learning a Gaussian Process classifier in the presence of unlabeled data. Our approach involves a “null category noise model” (NCNM) inspired by ordered categorical noise models. The noise model reflects an assumption that the data density is lower between the class-conditional densities. We illustrate our approach on a toy problem and present comparative results for the semi-supervised classification of handwritten digits. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract We present a probabilistic approach to learning a Gaussian Process classifier in the presence of unlabeled data. [sent-13, score-0.207]

2 Our approach involves a “null category noise model” (NCNM) inspired by ordered categorical noise models. [sent-14, score-0.542]

3 The noise model reflects an assumption that the data density is lower between the class-conditional densities. [sent-15, score-0.178]

4 We illustrate our approach on a toy problem and present comparative results for the semi-supervised classification of handwritten digits. [sent-16, score-0.096]

5 It is natural to consider whether such predictive performance can be improved via “semi-supervised learning,” in which a combination of labeled data and unlabeled data are available. [sent-25, score-0.329]

6 In the latter case, if we fail to make any assumptions about the underlying distribution of input data, the unlabeled data will not affect our predictions. [sent-27, score-0.188]

7 Thus, most attempts to make use of unlabeled data within a probabilistic framework focus on incorporating a model of p(xn): for example, by treating it as a mixture, Σ_yn p(xn|yn) p(yn), and inferring p(yn|xn) (e.g. [sent-28, score-0.674]

8 In particular, the cluster assumption asserts that the data density should be reduced in the vicinity of a decision boundary (e. [sent-35, score-0.176]

9 In the current paper we take up this challenge. Figure 1: The ordered categorical noise model. [sent-39, score-0.245]

10 The plot shows p (yn |fn ) for different values of yn . [sent-40, score-0.434]

11 Our approach involves a notion of a “null category region,” a region which acts to exclude unlabeled data points. [sent-43, score-0.474]

12 Such a region is analogous to the traditional notion of a “margin” and indeed our approach is similar in spirit to the transductive SVM [10], which seeks to maximize the margin by allocating labels to the unlabeled data. [sent-44, score-0.405]

13 A major difference, however, is that our approach maintains and updates the process variance (not merely the process mean) and, as we will see, this variance turns out to interact in a significant way with the null category concept. [sent-45, score-0.85]

14 We introduce the basic probabilistic framework in Section 2 and discuss the effect of the null category in Section 3. [sent-47, score-0.517]

15 2 Probabilistic Model In addition to the input vector xn and the label yn, our model includes a latent process variable fn, such that the probability of class membership decomposes as p(yn|xn) = ∫ p(yn|fn) p(fn|xn) dfn. [sent-50, score-1.224]

16 We first focus on the noise model, p (yn |fn ), deferring the discussion of an appropriate process model, p (fn |xn ), to later. [sent-51, score-0.171]

17 1 Ordered categorical models We introduce a novel noise model which we have termed a null category noise model, as it derives from the general class of ordered categorical models [1]. [sent-53, score-0.959]

18 In the specific context of binary classification, our focus in this paper, we consider an ordered categorical model containing three categories. [sent-54, score-0.198]

19 p(yn|fn) = φ(−(fn + w/2)) for yn = −1; φ(fn + w/2) − φ(fn − w/2) for yn = 0; and φ(fn − w/2) for yn = 1, where φ(x) = ∫_{−∞}^{x} N(z|0, 1) dz is the cumulative Gaussian distribution function and w is a parameter giving the width of category yn = 0 (see Figure 1). [sent-55, score-3.826]
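As a concrete illustration, the three category probabilities above can be evaluated directly with the standard normal CDF. The sketch below is not the authors' code; the function name and the default w = 1 (the width the paper later fixes to unity) are illustrative choices.

```python
# Minimal sketch of the ordered categorical (null category) noise model p(y_n | f_n),
# with w the width of the null category. Illustrative reimplementation, not the paper's code.
from scipy.stats import norm

def ncnm_noise_model(y, f, w=1.0):
    """Return p(y | f) for y in {-1, 0, +1} under the NCNM."""
    if y == -1:
        return norm.cdf(-(f + w / 2.0))
    if y == 0:
        return norm.cdf(f + w / 2.0) - norm.cdf(f - w / 2.0)
    if y == 1:
        return norm.cdf(f - w / 2.0)
    raise ValueError("y must be -1, 0 or +1")

# The three probabilities sum to one for any latent value f:
f = 0.3
assert abs(sum(ncnm_noise_model(y, f) for y in (-1, 0, 1)) - 1.0) < 1e-12
```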

20 We can also express this model in an equivalent and simpler form by replacing the cumulative Gaussian with a step function (see also [9], who makes use of a similar noise model in a discussion of Bayesian interpretations of the SVM). [sent-56, score-0.143]

21 Figure 2: Graphical representation of the null category model. [sent-57, score-0.49]

22 The fully-shaded nodes are always observed, whereas the lightly-shaded node is observed when zn = 0. [sent-58, score-0.193]

23 To use this model in an unlabeled setting we introduce a further variable, z n , which is one if a data point is unlabeled and zero otherwise. [sent-60, score-0.401]

24 We first impose p (zn = 1|yn = 0) = 0; (1) in other words, a data point can not be from the category yn = 0 and be unlabeled. [sent-61, score-0.721]

25 We see from the graphical representation in Figure 2 that zn is d-separated from xn . [sent-63, score-0.385]

26 Thus when yn is observed, the posterior process is updated by using p (yn |fn ). [sent-64, score-0.652]

27 On the other hand, when the data point is unlabeled the posterior process must be updated by p(zn|fn), which is easily computed as: p(zn = 1|fn) = Σ_yn p(yn|fn) p(zn = 1|yn). [sent-65, score-0.437]

28 The “effective likelihood function” for a single data point, L(fn), therefore takes one of three forms: L(fn) = H(−(fn + 1/2)) for yn = −1, zn = 0; γ− H(−(fn + 1/2)) + γ+ H(fn − 1/2) for zn = 1; and H(fn − 1/2) for yn = 1, zn = 0, where H denotes the Heaviside step function. [sent-66, score-2.956]
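A direct transcription of this piecewise likelihood is sketched below. Here γ− and γ+ play the role of p(zn = 1 | yn = −1) and p(zn = 1 | yn = +1); the function names and the default value 0.5 for both γ's are assumptions for illustration, not the authors' implementation.

```python
# Sketch of the "effective likelihood" L(f_n) for a single data point under the
# hard (step-function) form of the NCNM. Illustrative only.
import numpy as np

def heaviside(x):
    return np.where(x >= 0.0, 1.0, 0.0)

def effective_likelihood(f, y, z, gamma_minus=0.5, gamma_plus=0.5):
    if z == 0 and y == -1:
        return heaviside(-(f + 0.5))
    if z == 1:  # unlabeled: y = 0 is excluded, so the mass comes from both labels
        return gamma_minus * heaviside(-(f + 0.5)) + gamma_plus * heaviside(f - 0.5)
    if z == 0 and y == 1:
        return heaviside(f - 0.5)
    raise ValueError("invalid (y, z) combination")
```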

29 The constraint imposed by (1) implies that an unlabeled data point never comes from the class yn = 0. [sent-67, score-1.525]

30 Since yn = 0 lies between the labeled classes this is equivalent to a hard assumption that no data comes from the region around the decision boundary. [sent-68, score-0.746]

31 We can also soften this hard assumption if so desired by injection of noise into the process model. [sent-69, score-0.194]

32 If we also assume that our labeled data only comes from the classes yn = 1 and yn = −1 we will never obtain any evidence for data with yn = 0; for this reason we refer to this category as the null category and the overall model as a null category noise model (NCNM). [sent-70, score-2.85]

33 3 Process Model and Effect of the Null Category We work within the Gaussian process framework and assume p (fn |xn ) = N (fn |µ (xn ) , ς (xn )) , where the mean µ (xn ) and the variance ς (xn ) are functions of the input space. [sent-71, score-0.18]
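Given this Gaussian process model, the marginal class probability p(yn|xn) is a one-dimensional integral of the noise model against N(fn | µ(xn), ς(xn)), which can be approximated numerically. The sketch below uses Gauss-Hermite quadrature; mu and varsigma are stand-ins for the GP mean and variance at a given input, and none of this is the paper's inference code.

```python
# Sketch: p(y | x) = ∫ p(y | f) N(f | mu, varsigma) df, approximated with
# Gauss-Hermite quadrature. mu and varsigma stand in for the GP mean and
# variance at x; illustrative, not the paper's implementation.
import numpy as np
from scipy.stats import norm

def ncnm_noise_model(y, f, w=1.0):        # as in the earlier sketch
    if y == -1:
        return norm.cdf(-(f + w / 2.0))
    if y == 0:
        return norm.cdf(f + w / 2.0) - norm.cdf(f - w / 2.0)
    return norm.cdf(f - w / 2.0)           # y == +1

def marginal_class_probability(y, mu, varsigma, n_points=50):
    # E_{N(mu, varsigma)}[g(f)] ≈ (1/√π) Σ_i wts_i g(mu + √(2 varsigma) t_i)
    t, wts = np.polynomial.hermite.hermgauss(n_points)
    f = mu + np.sqrt(2.0 * varsigma) * t
    return float(np.sum(wts * ncnm_noise_model(y, f)) / np.sqrt(np.pi))

# The three class probabilities (including the null category) sum to one:
probs = [marginal_class_probability(y, mu=0.2, varsigma=1.5) for y in (-1, 0, 1)]
assert abs(sum(probs) - 1.0) < 1e-6
```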

34 Diagrams show the prior distribution over fn (long dashes), the effective likelihood function from the noise model when zn = 1 (short dashes), and a schematic of the resulting posterior over fn (solid line). [sent-73, score-1.343]

35 Left: The posterior is bimodal and has a larger variance than the prior. [sent-74, score-0.22]

36 Right: The posterior has one dominant mode and a lower variance than the prior. [sent-75, score-0.178]

37 In both cases the process is pushed away from the null category. [sent-76, score-0.441]

38 We now consider the effect on the posterior distribution over fn of incorporating a new data point. [sent-77, score-0.485]

39 First we note that if yn ∈ {−1, 1} the effect of the likelihood will be similar to that incurred in binary classification, in that the posterior will be a convolution of the step function and a Gaussian distribution. [sent-78, score-0.579]

40 This is comforting: when a data point is labeled, the model will act in a similar manner to a standard binary classification model. [sent-79, score-0.172]

41 The effect will depend on the mean and variance of p (fn |xn ). [sent-81, score-0.081]

42 If this Gaussian has little mass in the null category region, the posterior will be similar to the prior. [sent-82, score-0.643]

43 However, if the Gaussian has significant mass in the null category region, the outcome may be loosely described in two ways: 1. [sent-83, score-0.546]

44 If p (fn |xn ) “spans the likelihood,” Figure 3 (Left), then the mass of the posterior can be apportioned to either side of the null category region, leading to a bimodal posterior. [sent-84, score-0.708]

45 The variance of the posterior will be greater than the variance of the prior, a consequence of the fact that the effective likelihood function is not log-concave (as can be easily verified). [sent-85, score-0.34]

46 If p (fn |xn ) is “rectified by the likelihood,” Figure 3 (Right), then the mass of the posterior will be pushed to one side of the null category and the variance of the posterior will be smaller than the variance of the prior. [sent-87, score-1.002]

47 Note that for all situations when a portion of the mass of the prior distribution falls within the null category region it is pushed out to one side or both sides. [sent-88, score-0.731]

48 The intuition behind the two situations is that in case 1 it is not clear what label the data point has; however, it is clear that it shouldn’t be where it currently is (in the null category). [sent-89, score-0.393]

49 In case 2 the data point is being assigned a label and the decision boundary is pushed to one side of the point so that it is classified according to the assigned label. [sent-91, score-0.33]
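These two regimes can be checked numerically by moment-matching the non-Gaussian posterior for a single unlabeled point. The grid-based sketch below (with γ− = γ+ = 0.5 and a null category of unit width, both illustrative choices) reproduces the behaviour described above; it is not the paper's ADF implementation.

```python
# Moment-match p(f | z = 1) ∝ L(f) N(f | mu, s2) on a dense grid, where
# L(f) = 0.5 H(-(f + 1/2)) + 0.5 H(f - 1/2). Illustrative, not the paper's code.
import numpy as np

def posterior_moments(mu, s2, half_width=10.0, n=20001):
    f = np.linspace(mu - half_width, mu + half_width, n)
    prior = np.exp(-0.5 * (f - mu) ** 2 / s2)
    lik = 0.5 * (f <= -0.5) + 0.5 * (f >= 0.5)   # hard null category region
    post = prior * lik
    post = post / post.sum()                      # normalise the discrete approximation
    mean = np.sum(f * post)
    var = np.sum((f - mean) ** 2 * post)
    return mean, var

# Case 1: the prior "spans the likelihood"; the posterior is bimodal and its variance grows.
_, v1 = posterior_moments(mu=0.0, s2=1.0)
# Case 2: the prior is "rectified by the likelihood"; the posterior variance shrinks.
_, v2 = posterior_moments(mu=1.5, s2=1.0)
print(v1 > 1.0, v2 < 1.0)    # expected: True True
```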

50 4 Posterior Inference and Prediction Broadly speaking the effects discussed above are independent of the process model: the effective likelihood will always force the latent function away from the null category. [sent-92, score-0.412]

51 To implement our model, however, we must choose a process model and an inference method. [sent-93, score-0.124]

52 The nature of the noise model means that it is unlikely that we will find a non-trivial process model for which inference (in terms of marginalizing fn ) will be tractable. [sent-94, score-0.675]

53 The idea in ADF (assumed-density filtering) is to approximate the (generally non-Gaussian) posterior with a Gaussian by matching the moments between the approximation and the true posterior. [sent-98, score-0.097]

54 ADF has also been extended to allow each approximation to be revisited and improved as the posterior distribution evolves [7]. [sent-99, score-0.097]

55 Recall from Section 3 that the noise model is not log-concave. [sent-100, score-0.097]

56 When the variance of the process increases, the best Gaussian approximation to our noise model can have negative variance. [sent-101, score-0.277]

57 In our implementation we followed the simplest suggestion: we set a negative variance to zero. [sent-103, score-0.081]

58 One important advantage of the Gaussian process framework is that hyperparameters in the covariance function (i.e. [sent-104, score-0.123]

59 In practice, however, if the process variance is maximized in an unconstrained manner, the effective width of the null category can be driven to zero, yielding a model that is equivalent to a standard binary classification noise model. [sent-107, score-0.865]

60 To prevent this from happening we regularize with an L1 penalty on the process variances (this is equivalent to placing an exponential prior on those parameters). [sent-108, score-0.142]
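Spelled out, the stated equivalence is the usual correspondence between an L1 penalty and an exponential prior; the rate λ below is an assumed regularization constant, not a symbol taken from the paper:

$$
-\log p(\varsigma_1,\ldots,\varsigma_n) \;=\; -\sum_{i} \log\!\left(\lambda\, e^{-\lambda \varsigma_i}\right) \;=\; \lambda \sum_{i} \varsigma_i \;-\; n\log\lambda, \qquad \varsigma_i \ge 0,
$$

so maximizing the log likelihood minus λ Σ_i ς_i is MAP estimation under independent exponential priors on the (nonnegative) process variances.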

61 4.1 Prediction with the NCNM Once the parameters of the process model have been learned, we wish to make predictions about a new test-point x∗ via the marginal distribution p (y∗ |x∗ ). [sent-110, score-0.147]

62 For the NCNM an issue arises here: this distribution will have a non-zero probability of y∗ = 0, a label that does not exist in either our labeled or unlabeled data. [sent-111, score-0.284]

63 The new point also has z∗ = 1 so in reality the probability that a data point is from the positive class is given by p (y∗ |x∗ , z∗ ) ∝ p (z∗ |y∗ ) p (y∗ |x∗ ) . [sent-113, score-0.093]
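Since p(z∗ = 1 | y∗ = 0) = 0, conditioning on z∗ = 1 simply removes the null category mass and renormalizes over the two real labels. A minimal sketch follows; the function name, the dictionary interface and the default γ values are illustrative assumptions, and the class probabilities could be produced by the quadrature sketch given earlier.

```python
# Sketch: prediction for a test point with z* = 1. The null category receives zero
# weight, so probabilities are renormalized over y* in {-1, +1}. Illustrative only.
def predict_given_unlabeled(p_class_given_x, gamma_minus=0.5, gamma_plus=0.5):
    """p_class_given_x maps y* in {-1, 0, +1} to p(y* | x*); gammas are p(z*=1 | y*)."""
    p_neg = gamma_minus * p_class_given_x[-1]
    p_pos = gamma_plus * p_class_given_x[+1]
    total = p_neg + p_pos        # y* = 0 contributes nothing since p(z*=1 | y*=0) = 0
    return {-1: p_neg / total, +1: p_pos / total}

example = predict_given_unlabeled({-1: 0.25, 0: 0.15, +1: 0.60})
assert abs(sum(example.values()) - 1.0) < 1e-12
```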

64 An interesting consequence is that observing x∗ will have an effect on the process model. [sent-116, score-0.132]

65 This is contrary to the standard Gaussian process setup (see, e.g. [sent-117, score-0.099]

66 , [11]) in which the predictive distribution depends only on the labeled training data and the location of the test point x∗ . [sent-119, score-0.198]

67 In the NCNM the entire process model p (f∗ |x∗ ) should be updated after the observation of x∗ . [sent-120, score-0.146]

68 This is not a particular disadvantage of our approach; rather, it is an inevitable consequence of any method that allows unlabeled data to affect the location of the decision boundary—a consequence that our framework makes explicit. [sent-121, score-0.356]

69 5 Experiments Sparse representations of the data set are essential for speeding up the process of learning. [sent-123, score-0.13]

70 We made use of the informative vector machine (IVM) approach [6]. Recall, as discussed in Section 1, that we fix the width of the null category to unity: changes in the scale of the process model are equivalent to changing this width. [sent-124, score-0.715]

71 There are 400 points, which are labeled with probability 0. [sent-127, score-0.085]

72 Left: Learning on the labeled data only with the IVM algorithm. [sent-132, score-0.116]

73 Right: Learning on the labeled and unlabeled data with the NCNM. [sent-134, score-0.273]

74 5 of the decision boundary (for the NCNM this is the edge of the null category). [sent-137, score-0.36]

75 In all our experiments we used a kernel of the form knm = θ2 exp(−θ1 (xn − xm)ᵀ(xn − xm)) + θ3 δnm, where δnm is the Kronecker delta function. [sent-140, score-0.087]
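For reference, this covariance function is an RBF term with inverse width θ1 and scale θ2 plus a white-noise (jitter) term θ3. The sketch below is a straightforward transcription; the parameter values are placeholders, not those learned in the experiments.

```python
# k_nm = theta2 * exp(-theta1 * (x_n - x_m)^T (x_n - x_m)) + theta3 * delta_nm.
# Placeholder parameter values; not the values learned in the paper.
import numpy as np

def kernel_matrix(X, theta1=1.0, theta2=1.0, theta3=1e-3):
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return theta2 * np.exp(-theta1 * sq_dists) + theta3 * np.eye(X.shape[0])

K = kernel_matrix(np.random.randn(5, 2))
assert K.shape == (5, 5) and np.allclose(K, K.T)
```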

76 The IVM algorithm selects an active set, and the parameters of the kernel were learned by performing type-II maximum likelihood over the active set. [sent-141, score-0.23]

77 Since active set selection causes the marginalized likelihood to fluctuate, it cannot be used to monitor convergence; we therefore simply iterated fifteen times between active set selection and kernel parameter optimisation. [sent-142, score-0.207]

78 The parameters of the noise model, {γ+ , γ− }, can also be optimized, but note that if we constrain γ+ = γ− = γ, then the likelihood is maximized by setting γ to the proportion of the training set that is unlabeled. [sent-143, score-0.17]

79 First a standard IVM classifier was trained on the labeled data only (Figure 4, Left). [sent-149, score-0.14]

80 We then used the null category approach to train a classifier that incorporates the unlabeled data. [sent-150, score-0.672]

81 As shown in Figure 4 (Right), the resulting decision boundary finds a region of low data density and more accurately reflects the underlying data distribution. [sent-151, score-0.245]

82 5.1 High-dimensional example To explore the capabilities of the model when the data set is of a much higher dimensionality, we considered the USPS data set of handwritten digits. [sent-153, score-0.143]

83 To investigate performance across a range of different operating conditions, we varied the proportion of unlabeled data. (The data set contains 658 examples of 5s and 556 examples of 3s.) [sent-155, score-0.241]

84 Mean and standard errors are shown for the IVM (solid line), the NCNM (dotted line), the SVM (dash-dot line) and the transductive SVM (dashed line). [sent-160, score-0.11]

85 We compared four classifiers: a standard IVM trained on the labeled data only, a support vector machine (SVM) trained on the labeled data only, the NCNM trained on the combined labeled-unlabeled data, and an implementation of the transductive SVM trained on the combined labeled-unlabeled data. [sent-164, score-0.438]

86 The SVM and transductive SVM used the SVMlight software [4]. [sent-165, score-0.11]

87 For the SVM, the kernel inverse width hyperparameter θ1 was set to the value learned by the IVM. [sent-166, score-0.113]

88 For the transductive SVM it was set to the higher of the two values learned by the IVM and the NCNM. [sent-167, score-0.133]

89 5 × 10−2 both the SVM and transductive SVM outperform the NCNM. [sent-173, score-0.11]

90 We would not want to read too much into the comparison between the transductive SVM and the NCNM since an exhaustive exploration of the regularisation parameter C was not undertaken. [sent-181, score-0.152]

91 Similar comments also apply to the regularisation of the process variances for the NCNM. [sent-182, score-0.163]

92 6 Discussion We have presented an approach to learning a classifier in the presence of unlabeled data which incorporates the natural assumption that the data density between classes should be low. [sent-190, score-0.339]

93 Our approach implements this qualitative assumption within a probabilistic framework without explicit, expensive and possibly counterproductive modeling of the class-conditional densities. [sent-191, score-0.097]

94 Our approach is similar in spirit to the transductive SVM, but with the major difference that in the SVM the process variance is discarded. [sent-192, score-0.315]

95 In the NCNM, the process variance is a key part of data point selection; in particular, Figure 3 illustrated how inclusion of some data points actually increases the posterior process variance. [sent-193, score-0.469]

96 Discarding process variance has advantages and disadvantages—an advantage is that it leads to an optimisation problem that is naturally sparse, while a disadvantage is that it prevents optimisation of kernel parameters via type-II maximum likelihood. [sent-194, score-0.331]

97 In Section 4.1 we discussed how test data points affect the location of our decision boundary. [sent-196, score-0.101]

98 An important desideratum would be that the location of the decision boundary should converge as the amount of test data goes to infinity. [sent-197, score-0.152]

99 Estimating a kernel Fisher discriminant in the presence of label noise. [sent-225, score-0.106]

100 Fast sparse Gaussian process methods: The informative vector machine. [sent-233, score-0.154]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('fn', 0.454), ('yn', 0.434), ('ncnm', 0.34), ('null', 0.265), ('category', 0.225), ('ivm', 0.194), ('zn', 0.193), ('xn', 0.17), ('unlabeled', 0.157), ('categorical', 0.127), ('transductive', 0.11), ('svm', 0.1), ('process', 0.099), ('posterior', 0.097), ('labeled', 0.085), ('variance', 0.081), ('ective', 0.081), ('pushed', 0.077), ('ect', 0.073), ('noise', 0.072), ('adf', 0.063), ('region', 0.061), ('active', 0.059), ('mass', 0.056), ('di', 0.055), ('gaussian', 0.053), ('boundary', 0.051), ('width', 0.049), ('dashes', 0.049), ('likelihood', 0.048), ('classi', 0.048), ('ordered', 0.046), ('decision', 0.044), ('neil', 0.042), ('regularisation', 0.042), ('bimodal', 0.042), ('svmlight', 0.042), ('label', 0.042), ('kernel', 0.041), ('toy', 0.04), ('optimisation', 0.039), ('erent', 0.038), ('lawrence', 0.038), ('roc', 0.035), ('consequence', 0.033), ('disadvantage', 0.032), ('point', 0.031), ('comparative', 0.031), ('capabilities', 0.031), ('informative', 0.031), ('data', 0.031), ('labels', 0.03), ('line', 0.028), ('maximized', 0.028), ('ma', 0.027), ('density', 0.027), ('never', 0.027), ('probabilistic', 0.027), ('location', 0.026), ('erence', 0.026), ('qualitative', 0.026), ('predictive', 0.025), ('model', 0.025), ('comes', 0.025), ('ects', 0.025), ('spirit', 0.025), ('nm', 0.025), ('incorporates', 0.025), ('handwritten', 0.025), ('trained', 0.024), ('hyperparameters', 0.024), ('situations', 0.024), ('sparse', 0.024), ('learned', 0.023), ('modelling', 0.023), ('xm', 0.023), ('side', 0.023), ('processes', 0.023), ('solid', 0.023), ('assumption', 0.023), ('predictions', 0.023), ('presence', 0.023), ('bayesian', 0.022), ('proportion', 0.022), ('margin', 0.022), ('updated', 0.022), ('graphical', 0.022), ('culties', 0.022), ('sons', 0.022), ('variances', 0.022), ('classes', 0.022), ('equivalent', 0.021), ('disadvantages', 0.021), ('counterproductive', 0.021), ('epsrc', 0.021), ('recreating', 0.021), ('tors', 0.021), ('aston', 0.021), ('disregard', 0.021), ('threes', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999964 166 nips-2004-Semi-supervised Learning via Gaussian Processes

Author: Neil D. Lawrence, Michael I. Jordan

Abstract: We present a probabilistic approach to learning a Gaussian Process classifier in the presence of unlabeled data. Our approach involves a “null category noise model” (NCNM) inspired by ordered categorical noise models. The noise model reflects an assumption that the data density is lower between the class-conditional densities. We illustrate our approach on a toy problem and present comparative results for the semi-supervised classification of handwritten digits. 1

2 0.14474435 164 nips-2004-Semi-supervised Learning by Entropy Minimization

Author: Yves Grandvalet, Yoshua Bengio

Abstract: We consider the semi-supervised learning problem, where a decision rule is to be learned from labeled and unlabeled data. In this framework, we motivate minimum entropy regularization, which enables to incorporate unlabeled data in the standard supervised learning. Our approach includes other approaches to the semi-supervised problem as particular or limiting cases. A series of experiments illustrates that the proposed solution benefits from unlabeled data. The method challenges mixture models when the data are sampled from the distribution class spanned by the generative model. The performances are definitely in favor of minimum entropy regularization when generative models are misspecified, and the weighting of unlabeled data provides robustness to the violation of the “cluster assumption”. Finally, we also illustrate that the method can also be far superior to manifold learning in high dimension spaces. 1

3 0.11507528 49 nips-2004-Density Level Detection is Classification

Author: Ingo Steinwart, Don Hush, Clint Scovel

Abstract: We show that anomaly detection can be interpreted as a binary classification problem. Using this interpretation we propose a support vector machine (SVM) for anomaly detection. We then present some theoretical results which include consistency and learning rates. Finally, we experimentally compare our SVM with the standard one-class SVM. 1

4 0.094446659 115 nips-2004-Maximum Margin Clustering

Author: Linli Xu, James Neufeld, Bryce Larson, Dale Schuurmans

Abstract: We propose a new method for clustering based on finding maximum margin hyperplanes through data. By reformulating the problem in terms of the implied equivalence relation matrix, we can pose the problem as a convex integer program. Although this still yields a difficult computational problem, the hard-clustering constraints can be relaxed to a soft-clustering formulation which can be feasibly solved with a semidefinite program. Since our clustering technique only depends on the data through the kernel matrix, we can easily achieve nonlinear clusterings in the same manner as spectral clustering. Experimental results show that our maximum margin clustering technique often obtains more accurate results than conventional clustering methods. The real benefit of our approach, however, is that it leads naturally to a semi-supervised training method for support vector machines. By maximizing the margin simultaneously on labeled and unlabeled training data, we achieve state of the art performance by using a single, integrated learning principle. 1

5 0.092007264 7 nips-2004-A Large Deviation Bound for the Area Under the ROC Curve

Author: Shivani Agarwal, Thore Graepel, Ralf Herbrich, Dan Roth

Abstract: The area under the ROC curve (AUC) has been advocated as an evaluation criterion for the bipartite ranking problem. We study large deviation properties of the AUC; in particular, we derive a distribution-free large deviation bound for the AUC which serves to bound the expected accuracy of a ranking function in terms of its empirical AUC on an independent test sequence. A comparison of our result with a corresponding large deviation result for the classification error rate suggests that the test sample size required to obtain an -accurate estimate of the expected accuracy of a ranking function with δ-confidence is larger than that required to obtain an -accurate estimate of the expected error rate of a classification function with the same confidence. A simple application of the union bound allows the large deviation bound to be extended to learned ranking functions chosen from finite function classes. 1

6 0.082262099 136 nips-2004-On Semi-Supervised Classification

7 0.082228906 54 nips-2004-Distributed Information Regularization on Graphs

8 0.081510417 98 nips-2004-Learning Gaussian Process Kernels via Hierarchical Bayes

9 0.078860722 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning

10 0.0786881 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning

11 0.078614444 168 nips-2004-Semigroup Kernels on Finite Sets

12 0.074620135 145 nips-2004-Parametric Embedding for Class Visualization

13 0.074229531 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants

14 0.074019246 34 nips-2004-Breaking SVM Complexity with Cross-Training

15 0.072711006 23 nips-2004-Analysis of a greedy active learning strategy

16 0.066996306 137 nips-2004-On the Adaptive Properties of Decision Trees

17 0.06537649 81 nips-2004-Implicit Wiener Series for Higher-Order Image Analysis

18 0.064674616 69 nips-2004-Fast Rates to Bayes for Kernel Machines

19 0.063762404 205 nips-2004-Who's In the Picture

20 0.06271638 198 nips-2004-Unsupervised Variational Bayesian Learning of Nonlinear Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.197), (1, 0.069), (2, -0.055), (3, 0.052), (4, -0.027), (5, 0.062), (6, 0.035), (7, 0.039), (8, 0.085), (9, -0.044), (10, 0.009), (11, 0.124), (12, 0.07), (13, -0.07), (14, -0.047), (15, -0.048), (16, 0.008), (17, -0.061), (18, 0.088), (19, -0.072), (20, -0.117), (21, 0.098), (22, 0.094), (23, 0.079), (24, -0.08), (25, 0.006), (26, 0.037), (27, -0.034), (28, 0.024), (29, 0.013), (30, 0.151), (31, 0.045), (32, 0.073), (33, -0.05), (34, 0.06), (35, -0.004), (36, -0.028), (37, 0.069), (38, 0.048), (39, 0.099), (40, 0.045), (41, 0.099), (42, 0.022), (43, -0.027), (44, -0.027), (45, -0.082), (46, -0.041), (47, -0.116), (48, -0.006), (49, -0.175)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95003796 166 nips-2004-Semi-supervised Learning via Gaussian Processes

Author: Neil D. Lawrence, Michael I. Jordan

Abstract: We present a probabilistic approach to learning a Gaussian Process classifier in the presence of unlabeled data. Our approach involves a “null category noise model” (NCNM) inspired by ordered categorical noise models. The noise model reflects an assumption that the data density is lower between the class-conditional densities. We illustrate our approach on a toy problem and present comparative results for the semi-supervised classification of handwritten digits. 1

2 0.64897746 164 nips-2004-Semi-supervised Learning by Entropy Minimization

Author: Yves Grandvalet, Yoshua Bengio

Abstract: We consider the semi-supervised learning problem, where a decision rule is to be learned from labeled and unlabeled data. In this framework, we motivate minimum entropy regularization, which enables to incorporate unlabeled data in the standard supervised learning. Our approach includes other approaches to the semi-supervised problem as particular or limiting cases. A series of experiments illustrates that the proposed solution benefits from unlabeled data. The method challenges mixture models when the data are sampled from the distribution class spanned by the generative model. The performances are definitely in favor of minimum entropy regularization when generative models are misspecified, and the weighting of unlabeled data provides robustness to the violation of the “cluster assumption”. Finally, we also illustrate that the method can also be far superior to manifold learning in high dimension spaces. 1

3 0.57662278 38 nips-2004-Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms

Author: Omid Madani, David M. Pennock, Gary W. Flake

Abstract: In the context of binary classification, we define disagreement as a measure of how often two independently-trained models differ in their classification of unlabeled data. We explore the use of disagreement for error estimation and model selection. We call the procedure co-validation, since the two models effectively (in)validate one another by comparing results on unlabeled data, which we assume is relatively cheap and plentiful compared to labeled data. We show that per-instance disagreement is an unbiased estimate of the variance of error for that instance. We also show that disagreement provides a lower bound on the prediction (generalization) error, and a tight upper bound on the “variance of prediction error”, or the variance of the average error across instances, where variance is measured across training sets. We present experimental results on several data sets exploring co-validation for error estimation and model selection. The procedure is especially effective in active learning settings, where training sets are not drawn at random and cross validation overestimates error. 1

4 0.53600013 15 nips-2004-Active Learning for Anomaly and Rare-Category Detection

Author: Dan Pelleg, Andrew W. Moore

Abstract: We introduce a novel active-learning scenario in which a user wants to work with a learning algorithm to identify useful anomalies. These are distinguished from the traditional statistical definition of anomalies as outliers or merely ill-modeled points. Our distinction is that the usefulness of anomalies is categorized subjectively by the user. We make two additional assumptions. First, there exist extremely few useful anomalies to be hunted down within a massive dataset. Second, both useful and useless anomalies may sometimes exist within tiny classes of similar anomalies. The challenge is thus to identify “rare category” records in an unlabeled noisy set with help (in the form of class labels) from a human expert who has a small budget of datapoints that they are prepared to categorize. We propose a technique to meet this challenge, which assumes a mixture model fit to the data, but otherwise makes no assumptions on the particular form of the mixture components. This property promises wide applicability in real-life scenarios and for various statistical models. We give an overview of several alternative methods, highlighting their strengths and weaknesses, and conclude with a detailed empirical analysis. We show that our method can quickly zoom in on an anomaly set containing a few tens of points in a dataset of hundreds of thousands. 1

5 0.51117337 49 nips-2004-Density Level Detection is Classification

Author: Ingo Steinwart, Don Hush, Clint Scovel

Abstract: We show that anomaly detection can be interpreted as a binary classification problem. Using this interpretation we propose a support vector machine (SVM) for anomaly detection. We then present some theoretical results which include consistency and learning rates. Finally, we experimentally compare our SVM with the standard one-class SVM. 1

6 0.49943864 54 nips-2004-Distributed Information Regularization on Graphs

7 0.49125993 136 nips-2004-On Semi-Supervised Classification

8 0.4707599 50 nips-2004-Dependent Gaussian Processes

9 0.46377778 81 nips-2004-Implicit Wiener Series for Higher-Order Image Analysis

10 0.44648647 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning

11 0.4439553 14 nips-2004-A Topographic Support Vector Machine: Classification Using Local Label Configurations

12 0.43278062 69 nips-2004-Fast Rates to Bayes for Kernel Machines

13 0.43000817 23 nips-2004-Analysis of a greedy active learning strategy

14 0.41741833 115 nips-2004-Maximum Margin Clustering

15 0.41461754 201 nips-2004-Using the Equivalent Kernel to Understand Gaussian Process Regression

16 0.41350013 145 nips-2004-Parametric Embedding for Class Visualization

17 0.39816484 111 nips-2004-Maximal Margin Labeling for Multi-Topic Text Categorization

18 0.38915703 127 nips-2004-Neighbourhood Components Analysis

19 0.37432051 172 nips-2004-Sparse Coding of Natural Images Using an Overcomplete Set of Limited Capacity Units

20 0.35587782 106 nips-2004-Machine Learning Applied to Perception: Decision Images for Gender Classification


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.081), (15, 0.131), (26, 0.063), (31, 0.053), (33, 0.198), (35, 0.026), (39, 0.028), (50, 0.053), (93, 0.256)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.88484275 41 nips-2004-Comparing Beliefs, Surveys, and Random Walks

Author: Erik Aurell, Uri Gordon, Scott Kirkpatrick

Abstract: Survey propagation is a powerful technique from statistical physics that has been applied to solve the 3-SAT problem both in principle and in practice. We give, using only probability arguments, a common derivation of survey propagation, belief propagation and several interesting hybrid methods. We then present numerical experiments which use WSAT (a widely used random-walk based SAT solver) to quantify the complexity of the 3-SAT formulae as a function of their parameters, both as randomly generated and after simpli£cation, guided by survey propagation. Some properties of WSAT which have not previously been reported make it an ideal tool for this purpose – its mean cost is proportional to the number of variables in the formula (at a £xed ratio of clauses to variables) in the easy-SAT regime and slightly beyond, and its behavior in the hardSAT regime appears to re¤ect the underlying structure of the solution space that has been predicted by replica symmetry-breaking arguments. An analysis of the tradeoffs between the various methods of search for satisfying assignments shows WSAT to be far more powerful than has been appreciated, and suggests some interesting new directions for practical algorithm development. 1

2 0.84922045 91 nips-2004-Joint Tracking of Pose, Expression, and Texture using Conditionally Gaussian Filters

Author: Tim K. Marks, J. C. Roddey, Javier R. Movellan, John R. Hershey

Abstract: We present a generative model and stochastic filtering algorithm for simultaneous tracking of 3D position and orientation, non-rigid motion, object texture, and background texture using a single camera. We show that the solution to this problem is formally equivalent to stochastic filtering of conditionally Gaussian processes, a problem for which well known approaches exist [3, 8]. We propose an approach based on Monte Carlo sampling of the nonlinear component of the process (object motion) and exact filtering of the object and background textures given the sampled motion. The smoothness of image sequences in time and space is exploited by using Laplace’s method to generate proposal distributions for importance sampling [7]. The resulting inference algorithm encompasses both optic flow and template-based tracking as special cases, and elucidates the conditions under which these methods are optimal. We demonstrate an application of the system to 3D non-rigid face tracking. 1 Background Recent algorithms track morphable objects by solving optic flow equations, subject to the constraint that the tracked points belong to an object whose non-rigid deformations are linear combinations of a set of basic shapes [10, 2, 11]. These algorithms require precise initialization of the object pose and tend to drift out of alignment on long video sequences. We present G-flow, a generative model and stochastic filtering formulation of tracking that address the problems of initialization and error recovery in a principled manner. We define a non-rigid object by the 3D locations of n vertices. The object is a linear combination of k fixed morph bases, with coefficients c = [c1 , c2 , · · · , ck ]T . The fixed 3 × k matrix hi contains the position of the ith vertex in all k morph bases. The transformation from object-centered to image coordinates consists of a rotation, weak perspective projection, and translation. Thus xi , the 2D location of the ith vertex on the image plane, is xi = grhi c + l, (1) where r is the 3 × 3 rotation matrix, l is the 2 × 1 translation vector, and g = 1 0 0 is the 010 projection matrix. The object pose, ut , comprises both the rigid motion parameters and the morph parameters at time t: ut = {r(t), l(t), c(t)}. (2) 1.1 Optic flow Let yt represent the current image, and let xi (ut ) index the image pixel that is rendered by the ith object vertex when the object assumes pose ut . Suppose that we know ut−1 , the pose at time t − 1, and we want to find ut , the pose at time t. This problem can be solved by minimizing the following form with respect to ut : ut = argmin ˆ ut 1 2 n 2 [yt (xi (ut )) − yt−1 (xi (ut−1 ))] . (3) i=1 In the special case in which the xi (ut ) are neighboring points that move with the same 2D displacement, this reduces to the standard Lucas-Kanade optic flow algorithm [9, 1]. Recent work [10, 2, 11] has shown that in the general case, this optimization problem can be solved efficiently using the Gauss-Newton method. We will take advantage of this fact to develop an efficient stochastic inference algorithm within the framework of G-flow. Notational conventions Unless otherwise stated, capital letters are used for random variables, small letters for specific values taken by random variables, and Greek letters for fixed model parameters. Subscripted colons indicate sequences: e.g., X1:t = X1 · · · Xt . 
The term In stands for the n × n identity matrix, E for expected value, V ar for the covariance matrix, and V ar−1 for the inverse of the covariance matrix (precision matrix). 2 The Generative Model for G-Flow Figure 1: Left: a(Ut ) determines which texel (color at a vertex of the object model or a pixel of the background model) is responsible for rendering each image pixel. Right: G-flow video generation model: At time t, the object’s 3D pose, Ut , is used to project the object texture, Vt , into 2D. This projection is combined with the background texture, Bt , to generate the observed image, Yt . We model the image sequence Y as a stochastic process generated by three hidden causes, U , V , and B, as shown in the graphical model (Figure 1, right). The m × 1 random vector Yt represents the m-pixel image at time t. The n × 1 random vector Vt and the m × 1 random vector Bt represent the n-texel object texture and the m-texel background texture, respectively. As illustrated in Figure 1, left, the object pose, Ut , determines onto which image pixels the object and background texels project at time t. This is formulated using the projection function a(Ut ). For a given pose, ut , the projection a(ut ) is a block matrix, def a(ut ) = av (ut ) ab (ut ) . Here av (ut ), the object projection function, is an m × n matrix of 0s and 1s that tells onto which image pixel each object vertex projects; e.g., a 1 at row j, column i it means that the ith object point projects onto image pixel j. Matrix ab plays the same role for background pixels. Assuming the foreground mapping is one-toone, we let ab = Im −av (ut )av (ut )T , expressing the simple occlusion constraint that every image pixel is rendered by object or background, but not both. In the G-flow generative model: Vt Yt = a(Ut ) + Wt Wt ∼ N (0, σw Im ), σw > 0 Bt (4) Ut ∼ p(ut | ut−1 ) v v Vt = Vt−1 + Zt−1 Zt−1 ∼ N (0, Ψv ), Ψv is diagonal b b Bt = Bt−1 + Zt−1 Zt−1 ∼ N (0, Ψb ), Ψb is diagonal where p(ut | ut−1 ) is the pose transition distribution, and Z v , Z b , W are independent of each other, of the initial conditions, and over time. The form of the pose distribution is left unspecified since the algorithm proposed here does not require the pose distribution or the pose dynamics to be Gaussian. For the initial conditions, we require that the variance of V1 and the variance of B1 are both diagonal. Non-rigid 3D tracking is a difficult nonlinear filtering problem because changing the pose has a nonlinear effect on the image pixels. Fortunately, the problem has a rich structure that we can exploit: under the G-flow model, video generation is a conditionally Gaussian process [3, 6, 4, 5]. If the specific values taken by the pose sequence, u1:t , were known, then the texture processes, V and B, and the image process, Y , would be jointly Gaussian. This suggests the following scheme: we could use particle filtering to obtain a distribution of pose experts (each expert corresponds to a highly probable sample of pose, u1:t ). For each expert we could then use Kalman filtering equations to infer the posterior distribution of texture given the observed images. This method is known in the statistics community as a Monte Carlo filtering solution for conditionally Gaussian processes [3, 4], and in the machine learning community as Rao-Blackwellized particle filtering [6, 5]. We found that in addition to Rao-Blackwellization, it was also critical to use Laplace’s method to generate the proposal distributions for importance sampling [7]. 
In the context of G-flow, we accomplished this by performing an optic flow-like optimization, using an efficient algorithm similar to those in [10, 2]. 3 Inference Our goal is to find an expression for the filtering distribution, p(ut , vt , bt | y1:t ). Using the law of total probability, we have the following equation for the filtering distribution: p(ut , vt , bt | y1:t ) = p(ut , vt , bt | u1:t−1 , y1:t ) p(u1:t−1 | y1:t ) du1:t−1 Opinion of expert (5) Credibility of expert We can think of the integral in (5) as a sum over a distribution of experts, where each expert corresponds to a single pose history, u1:t−1 . Based on its hypothesis about pose history, each expert has an opinion about the current pose of the object, Ut , and the texture maps of the object and background, Vt and Bt . Each expert also has a credibility, a scalar that measures how well the expert’s opinion matches the observed image yt . Thus, (5) can be interpreted as follows: The filtering distribution at time t is obtained by integrating over the entire ensemble of experts the opinion of each expert weighted by that expert’s credibility. The opinion distribution of expert u1:t−1 can be factorized into the expert’s opinion about the pose Ut times the conditional distribution of texture Vt , Bt given pose: p(ut , vt , bt | u1:t−1 , y1:t ) = p(ut | u1:t−1 , y1:t ) p(vt , bt | u1:t , y1:t ) (6) Opinion of expert Pose Opinion Texture Opinion given pose The rest of this section explains how we evaluate each term in (5) and (6). We cover the distribution of texture given pose in 3.1, pose opinion in 3.2, and credibility in 3.3. 3.1 Texture opinion given pose The distribution of Vt and Bt given the pose history u1:t is Gaussian with mean and covariance that can be obtained using the Kalman filter estimation equations: −1 V ar−1 (Vt , Bt | u1:t , y1:t ) = V ar−1 (Vt , Bt | u1:t−1 , y1:t−1 ) + a(ut )T σw a(ut ) E(Vt , Bt | u1:t , y1:t ) = V ar(Vt , Bt | u1:t , y1:t ) −1 × V ar−1 (Vt , Bt | u1:t−1 , y1:t−1 )E(Vt , Bt | u1:t−1 , y1:t−1 ) + a(ut )T σw yt (7) (8) This requires p(Vt , Bt |u1:t−1 , y1:t−1 ), which we get from the Kalman prediction equations: E(Vt , Bt | u1:t−1 , y1:t−1 ) = E(Vt−1 , Bt−1 | u1:t−1 , y1:t−1 ) V ar(Vt , Bt | u1:t−1 , y1:t−1 ) = V ar(Vt−1 , Bt−1 | u1:t−1 , y1:t−1 ) + (9) Ψv 0 0 Ψb (10) In (9), the expected value E(Vt , Bt | u1:t−1 , y1:t−1 ) consists of texture maps (templates) for the object and background. In (10), V ar(Vt , Bt | u1:t−1 , y1:t−1 ) represents the degree of uncertainty about each texel in these texture maps. Since this is a diagonal matrix, we can refer to the mean and variance of each texel individually. For the ith texel in the object texture map, we use the following notation: µv (i) t v σt (i) def = ith element of E(Vt | u1:t−1 , y1:t−1 ) def = (i, i)th element of V ar(Vt | u1:t−1 , y1:t−1 ) b Similarly, define µb (j) and σt (j) as the mean and variance of the jth texel in the backt ground texture map. (This notation leaves the dependency on u1:t−1 and y1:t−1 implicit.) 3.2 Pose opinion Based on its current texture template (derived from the history of poses and images up to time t−1) and the new image yt , each expert u1:t−1 has a pose opinion, p(ut |u1:t−1 , y1:t ), a probability distribution representing that expert’s beliefs about the pose at time t. Since the effect of ut on the likelihood function is nonlinear, we will not attempt to find an analytical solution for the pose opinion distribution. 
However, due to the spatio-temporal smoothness of video signals, it is possible to estimate the peak and variance of an expert’s pose opinion. 3.2.1 Estimating the peak of an expert’s pose opinion We want to estimate ut (u1:t−1 ), the value of ut that maximizes the pose opinion. Since ˆ p(ut | u1:t−1 , y1:t ) = p(y1:t−1 | u1:t−1 ) p(ut | ut−1 ) p(yt | u1:t , y1:t−1 ), p(y1:t | u1:t−1 ) (11) def ut (u1:t−1 ) = argmax p(ut | u1:t−1 , y1:t ) = argmax p(ut | ut−1 ) p(yt | u1:t , y1:t−1 ). ˆ ut ut (12) We now need an expression for the final term in (12), the predictive distribution p(yt | u1:t , y1:t−1 ). By integrating out the hidden texture variables from p(yt , vt , bt | u1:t , y1:t−1 ), and using the conditional independence relationships defined by the graphical model (Figure 1, right), we can derive: 1 m log p(yt | u1:t , y1:t−1 ) = − log 2π − log |V ar(Yt | u1:t , y1:t−1 )| 2 2 n v 2 1 (yt (xi (ut )) − µt (i)) 1 (yt (j) − µb (j))2 t − − , (13) v (i) + σ b 2 i=1 σt 2 σt (j) + σw w j∈X (ut ) where xi (ut ) is the image pixel rendered by the ith object vertex when the object assumes pose ut , and X (ut ) is the set of all image pixels rendered by the object under pose ut . Combining (12) and (13), we can derive ut (u1:t−1 ) = argmin − log p(ut | ut−1 ) ˆ (14) ut + 1 2 n i=1 [yt (xi (ut )) − µv (i)]2 [yt (xi (ut )) − µb (xi (ut ))]2 t t b − − log[σt (xi (ut )) + σw ] v b σt (i) + σw σt (xi (ut )) + σw Foreground term Background terms Note the similarity between (14) and constrained optic flow (3). For example, focus on the foreground term in (14) and ignore the weights in the denominator. The previous image yt−1 from (3) has been replaced by µv (·), the estimated object texture based on the images t and poses up to time t − 1. As in optic flow, we can find the pose estimate ut (u1:t−1 ) ˆ efficiently using the Gauss-Newton method. 3.2.2 Estimating the distribution of an expert’s pose opinion We estimate the distribution of an expert’s pose opinion using a combination of Laplace’s method and importance sampling. Suppose at time t − 1 we are given a sample of experts (d) (d) indexed by d, each endowed with a pose sequence u1:t−1 , a weight wt−1 , and the means and variances of Gaussian distributions for object and background texture. For each expert (d) (d) u1:t−1 , we use (14) to compute ut , the peak of the pose distribution at time t according ˆ (d) to that expert. Define σt as the inverse Hessian matrix of (14) at this peak, the Laplace ˆ estimate of the covariance matrix of the expert’s opinion. We then generate a set of s (d,e) (d) independent samples {ut : e = 1, · · · , s} from a Gaussian distribution with mean ut ˆ (d) (d) (d) and variance proportional to σt , g(·|ˆt , αˆt ), where the parameter α > 0 determines ˆ u σ the sharpness of the sampling distribution. (Note that letting α → 0 would be equivalent to (d,e) (d) simply setting the new pose equal to the peak of the pose opinion, ut = ut .) To find ˆ the parameters of this Gaussian proposal distribution, we use the Gauss-Newton method, ignoring the second of the two background terms in (14). (This term is not ignored in the importance sampling step.) To refine our estimate of the pose opinion we use importance sampling. 
We assign each sample from the proposal distribution an importance weight wt (d, e) that is proportional to the ratio between the posterior distribution and the proposal distribution: s (d) p(ut | u1:t−1 , y1:t ) = ˆ (d,e) δ(ut − ut ) wt (d, e) s f =1 wt (d, f ) (15) e=1 (d,e) (d) (d) (d,e) p(ut | ut−1 )p(yt | u1:t−1 , ut , y1:t−1 ) wt (d, e) = (16) (d,e) (d) (d) g(ut | ut , αˆt ) ˆ σ (d,e) (d) The numerator of (16) is proportional to p(ut |u1:t−1 , y1:t ) by (12), and the denominator of (16) is the sampling distribution. 3.3 Estimating an expert’s credibility (d) The credibility of the dth expert, p(u1:t−1 | y1:t ), is proportional to the product of a prior term and a likelihood term: (d) (d) p(u1:t−1 | y1:t−1 )p(yt | u1:t−1 , y1:t−1 ) (d) p(u1:t−1 | y1:t ) = . (17) p(yt | y1:t−1 ) Regarding the likelihood, p(yt |u1:t−1 , y1:t−1 ) = p(yt , ut |u1:t−1 , y1:t−1 )dut = p(yt |u1:t , y1:t−1 )p(ut |ut−1 )dut (18) (d,e) We already generated a set of samples {ut : e = 1, · · · , s} that estimate the pose opin(d) ion of the dth expert, p(ut | u1:t−1 , y1:t ). We can now use these samples to estimate the likelihood for the dth expert: (d) p(yt | u1:t−1 , y1:t−1 ) = (d) (d) p(yt | u1:t−1 , ut , y1:t−1 )p(ut | ut−1 )dut (19) (d) (d) (d) (d) = p(yt | u1:t−1 , ut , y1:t−1 )g(ut | ut , αˆt ) ˆ σ 3.4 p(ut | ut−1 ) s e=1 dut ≈ wt (d, e) s Updating the filtering distribution g(ut | (d) (d) ut , αˆt ) ˆ σ Once we have calculated the opinion and credibility of each expert u1:t−1 , we evaluate the integral in (5) as a weighted sum over experts. The credibilities of all of the experts are normalized to sum to 1. New experts u1:t (children) are created from the old experts u1:t−1 (parents) by appending a pose ut to the parent’s history of poses u1:t−1 . Every expert in the new generation is created as follows: One parent is chosen to sire the child. The probability of being chosen is proportional to the parent’s credibility. The child’s value of ut is chosen at random from its parent’s pose opinion (the weighted samples described in Section 3.2.2). 4 Relation to Optic Flow and Template Matching In basic template-matching, the same time-invariant texture map is used to track every frame in the video sequence. Optic flow can be thought of as template-matching with a template that is completely reset at each frame for use in the subsequent frame. In most cases, optimal inference under G-flow involves a combination of optic flow-based and template-based tracking, in which the texture template gradually evolves as new images are presented. Pure optic flow and template-matching emerge as special cases. Optic Flow as a Special Case Suppose that the pose transition probability p(ut | ut−1 ) is uninformative, that the background is uninformative, that every texel in the initial object texture map has equal variance, V ar(V1 ) = κIn , and that the texture transition uncertainty is very high, Ψv → diag(∞). Using (7), (8), and (10), it follows that: µv (i) = [av (ut−1 )]T yt−1 = yt−1 (xi (ut−1 )) , t (20) i.e., the object texture map at time t is determined by the pixels from image yt−1 that according to pose ut−1 were rendered by the object. As a result, (14) reduces to: ut (u1:t−1 ) = argmin ˆ ut 1 2 n yt (xi (ut )) − yt−1 (xi (ut−1 )) 2 (21) i=1 which is identical to (3). Thus constrained optic flow [10, 2, 11] is simply a special case of optimal inference under G-flow, with a single expert and with sampling parameter α → 0. 
The key assumption that Ψv → diag(∞) means that the object’s texture is very different in adjacent frames. However, optic flow is typically applied in situations in which the object’s texture in adjacent frames is similar. The optimal solution in such situations calls not for optic flow, but for a texture map that integrates information across multiple frames. Template Matching as a Special Case Suppose the initial texture map is known precisely, V ar(V1 ) = 0, and the texture transition uncertainty is very low, Ψv → 0. By (7), (8), and (10), it follows that µv (i) = µv (i) = µv (i), i.e., the texture map does not change t t−1 1 over time, but remains fixed at its initial value (it is a texture template). Then (14) becomes: n yt (xi (ut )) − µv (i) 1 ut (u1:t−1 ) = argmin ˆ ut 2 (22) i=1 where µv (i) is the ith texel of the fixed texture template. This is the error function mini1 mized by standard template-matching algorithms. The key assumption that Ψv → 0 means the object’s texture is constant from each frame to the next, which is rarely true in real data. G-flow provides a principled way to relax this unrealistic assumption of template methods. General Case In general, if the background is uninformative, then minimizing (14) results in a weighted combination of optic flow and template matching, with the weight of each approach depending on the current level of certainty about the object template. In addition, when there is useful information in the background, G-flow infers a model of the background which is used to improve tracking. Figure 2: G-flow tracking an outdoor video. Results are shown for frames 1, 81, and 620. 5 Simulations We collected a video (30 frames/sec) of a subject in an outdoor setting who made a variety of facial expressions while moving her head. A later motion-capture session was used to create a 3D morphable model of her face, consisting of a set of 5 morph bases (k = 5). Twenty experts were initialized randomly near the correct pose on frame 1 of the video and propagated using G-flow inference (assuming an uninformative background). See http://mplab.ucsd.edu for video. Figure 2 shows the distribution of experts for three frames. In each frame, every expert has a hypothesis about the pose (translation, rotation, scale, and morph coefficients). The 38 points in the model are projected into the image according to each expert’s pose, yielding 760 red dots in each frame. In each frame, the mean of the experts gives a single hypothesis about the 3D non-rigid deformation of the face (lower right) as well as the rigid pose of the face (rotated 3D axes, lower left). Notice G-flow’s ability to recover from error: bad initial hypotheses are weeded out, leaving only good hypotheses. To compare G-flow’s performance versus deterministic constrained optic flow algorithms such as [10, 2, 11] , we used both G-flow and the method from [2] to track the same video sequence. We ran each tracker several times, introducing small errors in the starting pose. Figure 3: Average error over time for G-flow (green) and for deterministic optic flow [2] (blue). Results were averaged over 16 runs (deterministic algorithm) or 4 runs (G-flow) and smoothed. As ground truth, the 2D locations of 6 points were hand-labeled in every 20th frame. The error at every 20th frame was calculated as the distance from these labeled locations to the inferred (tracked) locations, averaged across several runs. 
Figure 3 compares this tracking error as a function of time for the deterministic constrained optic flow algorithm and for a 20-expert version of the G-flow tracking algorithm. Notice that the deterministic system has a tendency to drift (increase in error) over time, whereas G-flow can recover from drift. Acknowledgments Tim K. Marks was supported by NSF grant IIS-0223052 and NSF grant DGE-0333451 to GWC. John Hershey was supported by the UCDIMI grant D00-10084. J. Cooper Roddey was supported by the Swartz Foundation. Javier R. Movellan was supported by NSF grants IIS-0086107, IIS-0220141, and IIS-0223052, and by the UCDIMI grant D00-10084. References [1] Simon Baker and Iain Matthews. Lucas-kanade 20 years on: A unifying framework. International Journal of Computer Vision, 56(3):221–255, 2002. [2] M. Brand. Flexible flow for 3D nonrigid tracking and shape recovery. In CVPR, volume 1, pages 315–322, 2001. [3] H. Chen, P. Kumar, and J. van Schuppen. On Kalman filtering for conditionally gaussian systems with random matrices. Syst. Contr. Lett., 13:397–404, 1989. [4] R. Chen and J. Liu. Mixture Kalman filters. J. R. Statist. Soc. B, 62:493–508, 2000. [5] A. Doucet and C. Andrieu. Particle filtering for partially observed gaussian state space models. J. R. Statist. Soc. B, 64:827–838, 2002. [6] A. Doucet, N. de Freitas, K. Murphy, and S. Russell. Rao-blackwellised particle filtering for dynamic bayesian networks. In 16th Conference on Uncertainty in AI, pages 176–183, 2000. [7] A. Doucet, S. J. Godsill, and C. Andrieu. On sequential monte carlo sampling methods for bayesian filtering. Statistics and Computing, 10:197–208, 2000. [8] Zoubin Ghahramani and Geoffrey E. Hinton. Variational learning for switching state-space models. Neural Computation, 12(4):831–864, 2000. [9] B. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Proceedings of the International Joint Conference on Artificial Intelligence, 1981. [10] L. Torresani, D. Yang, G. Alexander, and C. Bregler. Tracking and modeling non-rigid objects with rank constraints. In CVPR, pages 493–500, 2001. [11] Lorenzo Torresani, Aaron Hertzmann, and Christoph Bregler. Learning non-rigid 3d shape from 2d motion. In Advances in Neural Information Processing Systems 16. MIT Press, 2004.

same-paper 3 0.83618802 166 nips-2004-Semi-supervised Learning via Gaussian Processes

Author: Neil D. Lawrence, Michael I. Jordan

Abstract: We present a probabilistic approach to learning a Gaussian Process classifier in the presence of unlabeled data. Our approach involves a “null category noise model” (NCNM) inspired by ordered categorical noise models. The noise model reflects an assumption that the data density is lower between the class-conditional densities. We illustrate our approach on a toy problem and present comparative results for the semi-supervised classification of handwritten digits. 1

4 0.82413667 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection

Author: Koji Tsuda, Gunnar Rätsch, Manfred K. Warmuth

Abstract: We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated with the von Neumann divergence. Rather than treating the most general case, we focus on two key applications that exemplify our methods: On-line learning with a simple square loss and finding a symmetric positive definite matrix subject to symmetric linear constraints. The updates generalize the Exponentiated Gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). The generalized updates use matrix logarithms and exponentials to preserve positive definiteness. Most importantly, we show how the analysis of each algorithm generalizes to the non-diagonal case. We apply both new algorithms, called the Matrix Exponentiated Gradient (MEG) update and DefiniteBoost, to learn a kernel matrix from distance measurements.

5 0.7192874 69 nips-2004-Fast Rates to Bayes for Kernel Machines

Author: Ingo Steinwart, Clint Scovel

Abstract: We establish learning rates to the Bayes risk for support vector machines (SVMs) with hinge loss. In particular, for SVMs with Gaussian RBF kernels we propose a geometric condition for distributions which can be used to determine approximation properties of these kernels. Finally, we compare our methods with a recent paper of G. Blanchard et al.. 1

6 0.71533597 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

7 0.712726 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

8 0.70870882 131 nips-2004-Non-Local Manifold Tangent Learning

9 0.70781261 77 nips-2004-Hierarchical Clustering of a Mixture Model

10 0.70625603 45 nips-2004-Confidence Intervals for the Area Under the ROC Curve

11 0.70607817 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning

12 0.7058208 187 nips-2004-The Entire Regularization Path for the Support Vector Machine

13 0.70566374 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms

14 0.70528775 167 nips-2004-Semi-supervised Learning with Penalized Probabilistic Clustering

15 0.7049731 127 nips-2004-Neighbourhood Components Analysis

16 0.7047326 207 nips-2004-ℓ₀-norm Minimization for Basis Selection

17 0.7044301 103 nips-2004-Limits of Spectral Clustering

18 0.70410866 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound

19 0.70399743 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach

20 0.70393103 161 nips-2004-Self-Tuning Spectral Clustering