nips nips2010 nips2010-235 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: M. P. Kumar, Benjamin Packer, Daphne Koller
Abstract: Latent variable models are a powerful tool for addressing several tasks in machine learning. However, the algorithms for learning the parameters of latent variable models are prone to getting stuck in a bad local optimum. To alleviate this problem, we build on the intuition that, rather than considering all samples simultaneously, the algorithm should be presented with the training data in a meaningful order that facilitates learning. The order of the samples is determined by how easy they are. The main challenge is that often we are not provided with a readily computable measure of the easiness of samples. We address this issue by proposing a novel, iterative self-paced learning algorithm where each iteration simultaneously selects easy samples and learns a new parameter vector. The number of samples selected is governed by a weight that is annealed until the entire training data has been considered. We empirically demonstrate that the self-paced learning algorithm outperforms the state of the art method for learning a latent structural SVM on four applications: object localization, noun phrase coreference, motif finding and handwritten digit recognition. 1
Reference: text
sentIndex sentText sentNum sentScore
1 However, the algorithms for learning the parameters of latent variable models are prone to getting stuck in a bad local optimum. [sent-5, score-0.436]
2 The order of the samples is determined by how easy they are. [sent-7, score-0.247]
3 We address this issue by proposing a novel, iterative self-paced learning algorithm where each iteration simultaneously selects easy samples and learns a new parameter vector. [sent-9, score-0.389]
4 We empirically demonstrate that the self-paced learning algorithm outperforms the state of the art method for learning a latent structural SVM on four applications: object localization, noun phrase coreference, motif finding and handwritten digit recognition. [sent-11, score-1.043]
5 In medical diagnosis, learning to diagnose a disease based on symptoms can be improved by treating unknown or unobserved diseases as latent variables (to deal with confounding factors). [sent-15, score-0.268]
6 Learning the parameters of a latent variable model often requires solving a non-convex optimization problem. [sent-16, score-0.229]
7 However, these approaches are prone to getting stuck in a bad local minimum with high training and generalization error. [sent-18, score-0.284]
8 [3] recently proposed an alternative method for training with non-convex objectives, called curriculum learning. [sent-23, score-0.246]
9 Curriculum learning suggests using the easy samples first and gradually introducing the learning algorithm to more complex ones. [sent-25, score-0.369]
10 The main challenge in using the curriculum learning strategy is that it requires the identification of easy and hard samples in a given training dataset. [sent-26, score-0.614]
11 We build on this intuition for learning latent variable models by 1 designing an iterative approach that simultaneously selects easy samples and updates the parameters at each iteration. [sent-30, score-0.577]
12 The number of samples selected at each iteration is determined by a weight that is gradually annealed such that later iterations introduce more samples. [sent-31, score-0.303]
13 The algorithm converges when all samples have been considered and the objective function cannot be improved further. [sent-32, score-0.245]
14 Note that, in self-paced learning, the characterization of what is “easy” applies not to individual samples, but to sets of samples; a set of samples is easy if it admits a good fit in the model space. [sent-33, score-0.247]
15 We empirically demonstrate that our self-paced learning approach outperforms the state of the art algorithm for learning a recently proposed latent variable model, called latent structural SVM, on four standard machine learning applications using publicly available datasets. [sent-34, score-0.679]
16 2 Related Work Self-paced learning is related to curriculum learning in that both regimes suggest processing the samples in a meaningful order. [sent-35, score-0.389]
17 However, in their work, they circumvented the challenge of obtaining such an ordering by using datasets where there is a clear distinction between easy and hard samples (for example, classifying equilateral triangles vs. [sent-38, score-0.335]
18 However, unlike our setting, in active learning the labels of all the samples are not known when the samples are chosen. [sent-47, score-0.331]
19 Our approach differs from co-training in that in our setting the latent variables are simply used to assist in predicting the target labels, which are always observed, whereas co-training deals with a semi-supervised setting in which some labels are missing. [sent-49, score-0.235]
20 3 Preliminaries We will denote the training data as D = {(x_1, y_1), · · · , (x_n, y_n)}, where x_i ∈ X are the observed variables (which we refer to as input) for the ith sample and y_i ∈ Y are the output variables (which we refer to as output), whose values are known during training but unobserved at test time. [sent-50, score-0.510]
21 In addition, latent variable models also contain latent, or hidden, variables that we denote by hi ∈ H. [sent-51, score-0.465]
22 For example, when learning a ‘car’ model using image-level labels, x represents an image, the binary output y indicates the presence or absence of a car in the image, and h represents the car’s bounding box (if present). [sent-52, score-0.274]
23 Given the training data, the parameters w of a latent variable model are learned by optimizing an objective function, for example by maximizing the likelihood of D or minimizing the risk over D. [sent-53, score-0.443]
24 An intuitive objective is to maximize the likelihood: max_w Σ_i log Pr(x_i, y_i; w) = max_w Σ_i log Σ_{h_i} Pr(x_i, y_i, h_i; w). (1) A common approach for this task is to use the EM method [8] or one of its many variants [12]. [sent-58, score-0.588]
25 Outlined in Algorithm 1, EM iterates between computing the expected value of the latent variables h and maximizing the resulting expectation of objective (1). [sent-59, score-0.347]
26 An efficient way to overcome this difficulty is to use the recently proposed latent structural support vector machine (hereafter referred to as latent SSVM) formulation [9, 23] that minimizes a regularized upper bound on the risk. [sent-64, score-0.414]
27 1: t ← 0 2: repeat 3: Obtain the expectation of objective (1) under the distribution Pr(h_i | x_i, y_i; w_t). [sent-67, score-0.358]
28 Specifically, w_{t+1} = argmax_w Σ_i Σ_{h_i} Pr(h_i | x_i, y_i; w_t) log Pr(x_i, y_i, h_i; w). [sent-69, score-0.579]
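To make the E- and M-steps concrete, here is a minimal Python sketch of one EM iteration for a generic latent variable model with a finite hidden-state set; `log_joint` (scoring log Pr(x, y, h; w)) and `m_step` (the maximizer of the expected complete-data log-likelihood) are hypothetical callables, not part of the paper.

```python
import numpy as np

def em_step(w, data, H, log_joint, m_step):
    """One EM iteration (Algorithm 1 sketch) for a latent variable model.

    log_joint(x, y, h, w) returns log Pr(x, y, h; w); H is a finite set of
    hidden states; m_step maximizes the expected complete-data log-likelihood.
    """
    posteriors = []
    for x, y in data:
        # E-step: Pr(h | x, y; w) obtained by normalizing the joint scores.
        scores = np.array([log_joint(x, y, h, w) for h in H])
        p = np.exp(scores - scores.max())   # subtract max for numerical stability
        posteriors.append(p / p.sum())
    # M-step: w <- argmax_w sum_i sum_h p_i(h) log Pr(x_i, y_i, h; w).
    return m_step(w, data, H, posteriors)
```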
29 min_w (1/2)||w||^2 + C Σ_{i=1}^n ξ_i, s.t. max_{h_i ∈ H} w^⊤Φ(x_i, y_i, h_i) − w^⊤Φ(x_i, ŷ_i, ĥ_i) ≥ ∆(y_i, ŷ_i) − ξ_i, ∀ŷ_i ∈ Y, ∀ĥ_i ∈ H, i = 1, · · · , n. (2) [sent-77, score-0.999]
30 For any given w, the value of ξ_i can be shown to be an upper bound on the risk ∆(y_i, ŷ_i(w)) (where ŷ_i(w) is the predicted output given w). [sent-78, score-0.375]
31 The risk function can also depend on ĥ_i(w); that is, it can be of the form ∆(y_i, ŷ_i(w), ĥ_i(w)). [sent-79, score-0.569]
32 The algorithm has two main steps: (i) imputing the hidden variables (step 3), which corresponds to approximating the concave function by a linear upper bound; and (ii) updating the value of the parameter using the values of the hidden variables. [sent-83, score-0.364]
33 Note that updating the parameters requires us to solve a convex SSVM learning problem (where the output y_i is now concatenated with the hidden variable h*_i), for which several efficient algorithms exist in the literature [14, 20, 22]. [sent-84, score-0.370]
34 1: t ← 0 2: repeat 3: Update h*_i = argmax_{h_i ∈ H} w_t^⊤ Φ(x_i, y_i, h_i). [sent-87, score-0.483]
35 4: Update w_{t+1} by fixing the hidden variables for output y_i to h*_i and solving the corresponding SSVM problem. [sent-88, score-0.343]
36 Specifically, w_{t+1} = argmin_w (1/2)||w||^2 + C Σ_i max_{ŷ_i ∈ Y, ĥ_i ∈ H} max{0, ∆(y_i, ŷ_i) + w^⊤(Φ(x_i, ŷ_i, ĥ_i) − Φ(x_i, y_i, h*_i))}. [sent-89, score-0.619]
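A schematic of this CCCP alternation, assuming a black-box supervised SSVM solver `solve_ssvm` and a joint feature map `phi` (both hypothetical stand-ins, not the authors' implementation):

```python
import numpy as np

def cccp_latent_ssvm(w, data, H, phi, solve_ssvm, max_iter=50, eps=1e-4):
    """CCCP for latent SSVM (Algorithm 2 sketch): alternate between imputing
    hidden variables and solving the resulting fully supervised SSVM."""
    prev_obj = np.inf
    for _ in range(max_iter):
        # Step 3: impute h*_i = argmax_h w^T Phi(x_i, y_i, h) for every sample.
        h_star = [max(H, key=lambda h: w @ phi(x, y, h)) for x, y in data]
        # Step 4: solve the convex SSVM with (y_i, h*_i) treated as the label.
        w, obj = solve_ssvm(data, h_star)
        if prev_obj - obj < eps:   # objective no longer decreasing
            break
        prev_obj = obj
    return w
```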
37 4 Self-Paced Learning for Latent Variable Models Our self-paced learning strategy alleviates the main difficulty of curriculum learning, namely the lack of a readily computable measure of the easiness of a sample. [sent-92, score-0.352]
38 In the context of a latent variable model, for a given parameter w, this easiness can be defined in two ways: (i) a sample is easy if we are confident about the value of a hidden variable; or (ii) a sample is easy if it is easy to predict its true output. [sent-93, score-0.744]
39 They are different in that certainty does not imply correctness, and the hidden variables may not be directly relevant to what makes the output of a sample easy to predict. [sent-95, score-0.314]
40 We therefore focus on the second definition: easy samples are ones whose correct output can be predicted easily (their likelihood is high, or they lie far from the margin). [sent-96, score-0.290]
41 However, in order to operationalize self-paced learning, we need a strategy for simultaneously selecting the easy samples and learning the parameter w at each iteration. [sent-98, score-0.440]
42 Here f(·) is the negative log-likelihood for EM or an upper bound on the risk for latent SSVM (or any other criterion for parameter learning). [sent-102, score-0.235]
43 We now modify the above optimization problem by introducing binary variables vi that indicate whether the ith sample is easy or not. [sent-103, score-0.252]
44 Formally, at each iteration we solve the following mixed-integer program: (w_{t+1}, v_{t+1}) = argmin_{w ∈ R^d, v ∈ {0,1}^n} r(w) + Σ_{i=1}^n v_i f(x_i, y_i; w) − (1/K) Σ_{i=1}^n v_i. (4) [sent-105, score-0.368]
45 K is a weight that determines the number of samples to be considered: if K is large, the problem prefers to consider only “easy” samples with a small value of f(·). [sent-106, score-0.266]
46 Importantly, however, the samples are tied together in the objective through the parameter w. [sent-108, score-0.245]
47 Therefore, no sample is considered independently easy; rather, a set of samples is easy if a w can be fit to it such that the corresponding values of f(·) are small. [sent-109, score-0.247]
48 We iteratively decrease the value of K in order to estimate the parameters of a latent variable model via self-paced learning. [sent-111, score-0.229]
49 We thus begin with only a few easy examples, gradually introducing more until the entire training dataset is used. [sent-113, score-0.226]
50 If f(x_i, y_i; w) < 1/K, then v_i = 1 yields the optimal objective function value. [sent-116, score-0.347]
51 Similarly, if f(x_i, y_i; w) > 1/K, then the objective is optimal when v_i = 0. [sent-117, score-0.347]
52 Given parameters w, we can obtain the optimum v as v_i = δ(f(x_i, y_i; w) < 1/K), where δ(·) is the indicator function (1 if its argument holds, 0 otherwise). [sent-131, score-0.235]
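In code, this v-step is a thresholding one-liner; a minimal sketch where `f_vals` stands for the per-sample values f(x_i, y_i; w):

```python
import numpy as np

def select_easy(f_vals, K):
    """Optimal v for fixed w in problem (4): include sample i iff
    f(x_i, y_i; w) < 1/K."""
    return (np.asarray(f_vals) < 1.0 / K).astype(int)
```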
53 As an illustrative example of self-paced learning, Algorithm 3 outlines the overall self-paced learning method for latent SSVM, which involves solving a modified version of problem (2). [sent-136, score-0.253]
54 The algorithm converges when it considers all samples but is unable to decrease the latent SSVM objective function value by more than the tolerance ε. [sent-138, score-0.507]
55 In our experiments, we obtained an estimate of w0 by initially setting vi = 1 for all samples and running the original CCCP algorithm for a fixed, small number of iterations T0 . [sent-142, score-0.262]
56 We show that our approach outperforms the state-of-the-art CCCP algorithm on four standard machine learning applications. Algorithm 3 The self-paced learning algorithm for parameter estimation of latent SSVM. [sent-145, score-0.345]
57 2: repeat 3: Update h*_i = argmax_{h_i ∈ H} w_t^⊤ Φ(x_i, y_i, h_i). [sent-148, score-0.483]
58 6: until v_i = 1 ∀i and the objective function cannot be decreased below tolerance ε. [sent-151, score-0.277]
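Putting the v-step and w-step together, a hedged sketch of the overall loop might look as follows; the annealing factor mu = 1.3, the learner `fit`, and the per-sample loss `loss_fn` are illustrative assumptions rather than values taken from the paper:

```python
import numpy as np

def self_paced_learn(w0, data, fit, loss_fn, K0, mu=1.3, eps=1e-4):
    """Self-paced learning sketch: alternate the v-step and the w-step,
    annealing K so that later iterations admit more samples."""
    w, K, prev_obj = w0, K0, np.inf
    while True:
        f_vals = np.array([loss_fn(w, x, y) for x, y in data])
        v = f_vals < 1.0 / K                                  # v-step: pick easy samples
        w, obj = fit(w, [d for d, vi in zip(data, v) if vi])  # w-step on the selection
        if v.all() and prev_obj - obj < eps:
            break                                             # all samples in, converged
        prev_obj, K = obj, K / mu                             # anneal: smaller K admits more
    return w
```

Decreasing K raises the threshold 1/K, so each pass through the loop can only add samples, matching the annealing schedule described above.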
59 Figure 1: Results for the noun phrase coreference experiment, panels (a)–(c). [sent-152, score-0.440]
60 (a) The relative objective value, computed as (obj_cccp − obj_spl)/obj_cccp, where obj_cccp and obj_spl are the objective values of CCCP and self-paced learning respectively. [sent-155, score-0.351]
61 In all our experiments, the initial weight K0 is set such that the first iteration selects more than half the samples (as there are typically more easy samples than difficult ones). [sent-171, score-0.46]
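One simple way to realize this initialization is to set K0 from a quantile of the initial per-sample losses, so that slightly more than half the samples clear the 1/K0 threshold; the 60th percentile below is an illustrative choice, not the paper's:

```python
import numpy as np

def init_K(f_vals, frac=0.6):
    """Choose K0 so that roughly `frac` of the samples satisfy
    f(x_i, y_i; w0) < 1/K0 in the first v-step."""
    threshold = np.quantile(np.asarray(f_vals), frac)
    return 1.0 / max(threshold, 1e-12)   # guard against a zero threshold
```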
62 Given the occurrence of all the nouns in a document, the goal of noun phrase coreference is to provide a clustering of the nouns such that each cluster refers to a single object. [sent-176, score-0.478]
63 This task was formulated within the SSVM framework in [10] and extended to include latent variables in [23]. [sent-177, score-0.235]
64 A hidden variable h specifies a forest over the nouns such that each tree in the forest consists of all the nouns of one cluster. [sent-180, score-0.363]
65 Imputing the hidden variables involves finding the maximum spanning forest (which can be solved by Kruskal's or Prim's algorithm). [sent-181, score-0.229]
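As an illustration of this imputation step, the maximum spanning tree of each cluster can be recovered by running a minimum-spanning-tree routine on negated edge scores; the dense score matrix and the cluster index lists here are hypothetical inputs:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def impute_forest(scores, clusters):
    """Impute h: for each cluster of nouns, take the maximum spanning tree
    over the pairwise scores of its members; the union of these trees is the
    forest. Assumes strictly positive pairwise scores (zeros are treated as
    missing edges by scipy)."""
    edges = []
    for members in clusters:
        if len(members) < 2:
            continue
        sub = scores[np.ix_(members, members)]
        mst = minimum_spanning_tree(-sub)   # max tree = min tree of negated weights
        for r, c in zip(*mst.nonzero()):
            edges.append((members[r], members[c]))
    return edges
```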
66 We use the publicly available MUC6 noun phrase coreference dataset, which consists of 60 documents. [sent-184, score-0.384]
67 Fig. 1 compares the two methods in terms of the value of the objective function (which is the main focus of this work), the loss over the training data, and the loss over the test data. [sent-189, score-0.288]
68 Here, imputing the hidden variables simply involves a search for the starting position of the motif. [sent-203, score-0.284]
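Concretely, that search is a brute-force scan over start offsets; a sketch assuming a hypothetical feature map `psi` for length-M windows:

```python
def impute_motif_start(w, sequence, M, psi):
    """Impute the hidden motif position: argmax over all start offsets of the
    linear score w^T psi(window), where psi maps a length-M subsequence to a
    numpy feature vector."""
    starts = range(len(sequence) - M + 1)
    return max(starts, key=lambda s: w @ psi(sequence[s:s + M]))
```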
69 The improvement in objective value also translates to an improvement in training and test errors. [sent-274, score-0.23]
70 We used five different folds for each protein, randomly initializing the motif positions for all training samples using four different seed values (fixed for both methods). [sent-276, score-0.547]
71 While improvements for most folds are small, for the fourth protein, CCCP gets stuck in a bad local minimum despite using multiple random initializations (this is indicated by the large mean and standard deviation values). [sent-282, score-0.367]
72 This behavior is to be expected: in many cases, the objective function landscape is such that CCCP avoids local optima; but in some cases, CCCP gets stuck in a poor local optimum. [sent-283, score-0.255]
73 Indeed, over all the 100 runs (5 proteins, 5 folds and 4 seed values) CCCP got stuck in a bad local minimum 18 times (where a bad local minimum is one that gave 50% test error) compared to 1 run where self-paced learning got stuck. [sent-284, score-0.695]
74 Fig. 2 shows the average Hamming distance between the motifs of the selected samples at each iteration of the self-paced learning algorithm. [sent-286, score-0.354]
75 Note that initially the algorithm selects samples whose motifs have a low Hamming distance (which intuitively correspond to the easy samples for this application). [sent-287, score-0.566]
76 Finally, it considers all samples and attempts to find the most discriminative motif across the entire dataset. [sent-289, score-0.3]
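The diagnostic plotted in Fig. 2 reduces to the mean pairwise Hamming distance among the currently selected motifs; a small numpy sketch, assuming motifs are equal-length integer-encoded arrays:

```python
import numpy as np

def mean_pairwise_hamming(motifs):
    """Average Hamming distance over all pairs of selected motifs, each an
    equal-length integer-encoded array."""
    m = np.asarray(motifs)
    n = len(m)
    dists = [(m[i] != m[j]).sum() for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(dists)) if dists else 0.0
```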
77 We specify the joint feature vector as Φ(x, y, h) = (0_{y(m+1)}; θ_h(x), 1; 0_{(9−y)(m+1)}), where θ_h(x) is the vector representation of the image x rotated by the angle h. Figure 2: Average Hamming distance between the motifs found in all selected samples at each iteration. [sent-299, score-0.280]
78 Our approach starts with easy samples (small Hamming distance) and gradually introduces more difficult samples (large Hamming distance) until it starts to consider all samples of the training set. [sent-300, score-0.625]
79 In other words, the joint feature vector is the rotated image of the digit which is padded in the front and back with the appropriate number of zeroes. [sent-311, score-0.24]
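A sketch of this joint feature construction, using scipy's image rotation; the 10-class block layout follows the formula above, but the helper itself is illustrative, not the authors' code:

```python
import numpy as np
from scipy.ndimage import rotate

def joint_feature(x_image, y, h_angle, num_classes=10):
    """Phi(x, y, h): the image rotated by h degrees (plus a bias entry),
    placed in the block for class y and zero everywhere else."""
    theta = rotate(x_image, h_angle, reshape=False).ravel()
    block = np.append(theta, 1.0)        # theta_h(x) followed by the bias 1
    m1 = block.size                      # m + 1 entries per class block
    phi = np.zeros(num_classes * m1)
    phi[y * m1:(y + 1) * m1] = block
    return phi
```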
80 We use the standard MNIST dataset [15], which represents each handwritten digit as a vector of length 784 (that is, an image of size 28 × 28). [sent-315, score-0.273]
81 Modeling rotation as a hidden variable significantly improves classification performance, allowing the images to be better aligned with each other. [sent-323, score-0.252]
82 Across all experiments for both learning methods, using hidden variables achieves better test error; the improvement over using no hidden variables is 12%, 8%, 11%, and 22%, respectively, for the four digit pairs. [sent-324, score-0.558]
83 The above figure compares the training and test errors and objective values between CCCP and self-paced learning. [sent-326, score-0.263]
84 Though training and test errors do not necessarily correlate to objective values, the best test error across C values is better for self-paced learning for one of the digit pairs (1-7), and is the same for the others. [sent-328, score-0.349]
85 In practice, although it is easy to mine such images from free photo-sharing websites such as Flickr, it is burdensome to obtain ground truth annotations of the exact location of the object in each image. [sent-332, score-0.27]
86 Figure 4: The top row shows the imputed bounding boxes of an easy and a hard image using the CCCP algorithm over increasing iterations (left to right). [sent-335, score-0.451]
87 Note that for the hard (deer) image, the bounding box obtained at convergence does not localize the object accurately. [sent-336, score-0.262]
88 In contrast, the self-paced learning approach (bottom row) does not use the hard image during initial iterations (indicated by the red color of the bounding box). [sent-337, score-0.259]
89 In subsequent iterations, it is able to impute accurate bounding boxes for both the easy and hard image. [sent-338, score-0.281]
90 For the above problem, imputing the hidden variables involves a simple search over possible locations in a given image. [sent-339, score-0.284]
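This imputation amounts to a sliding-window argmax; a sketch with a hypothetical window feature extractor `phi_box`, a fixed box size, and an arbitrary stride:

```python
import numpy as np

def impute_box(w, image, box_h, box_w, phi_box, stride=8):
    """Impute the bounding box: scan candidate locations and keep the one
    maximizing the linear score w^T phi_box(window)."""
    H, W = image.shape[:2]
    best, best_score = None, -np.inf
    for top in range(0, H - box_h + 1, stride):
        for left in range(0, W - box_w + 1, stride):
            window = image[top:top + box_h, left:left + box_w]
            score = w @ phi_box(window)
            if score > best_score:
                best, best_score = (top, left), score
    return best
```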
91 We use images of 6 different mammals (approximately 45 images per mammal) that have been previously employed for object localization [13]. [sent-342, score-0.234]
92 The average training time over all folds was 362 seconds for CCCP and 482 seconds for self-paced learning. [sent-349, score-0.265]
93 Table 2 shows the mean and standard deviation of three terms: the objective value, the training loss and the testing loss. [sent-350, score-0.228]
94 The better objective value resulted in a substantial improvement in the training loss (for 4 folds) and testing loss (an improvement of approximately 4% achieved for 2 folds). [sent-352, score-0.290]
95 Fig. 4 shows the imputed bounding boxes for two images during various iterations of the two algorithms. [sent-371, score-0.273]
96 The proposed self-paced learning algorithm does not use the hard image during the initial iterations (as indicated by the red bounding box). [sent-372, score-0.288]
97 Note that self-paced learning provides a more accurate bounding box for the hard image at convergence, thereby illustrating the importance of learning in a meaningful order. [sent-374, score-0.312]
98 6 Discussion We proposed the self-paced learning regime in the context of parameter estimation for latent variable models. [sent-376, score-0.262]
99 Our method works by iteratively solving a biconvex optimization problem that simultaneously selects easy samples and updates the parameters. [sent-377, score-0.457]
100 In the current work, we solve the biconvex optimization problem using an alternate convex search strategy, which only provides us with a local minimum solution. [sent-379, score-0.231]
wordName wordTfidf (topN-words)
[('cccp', 0.511), ('ssvm', 0.261), ('hi', 0.19), ('curriculum', 0.19), ('latent', 0.189), ('motif', 0.167), ('digit', 0.148), ('yi', 0.143), ('biconvex', 0.142), ('coreference', 0.142), ('samples', 0.133), ('mitre', 0.119), ('motifs', 0.115), ('easy', 0.114), ('hamming', 0.113), ('objective', 0.112), ('hidden', 0.111), ('folds', 0.11), ('phrase', 0.104), ('wt', 0.103), ('noun', 0.102), ('imputing', 0.096), ('selfpaced', 0.095), ('vi', 0.092), ('object', 0.078), ('bounding', 0.075), ('tolerance', 0.073), ('stuck', 0.073), ('imputed', 0.071), ('spl', 0.071), ('car', 0.066), ('bad', 0.066), ('nouns', 0.065), ('handwritten', 0.063), ('easiness', 0.062), ('image', 0.062), ('protein', 0.061), ('loss', 0.06), ('em', 0.058), ('art', 0.058), ('box', 0.057), ('dna', 0.056), ('localization', 0.056), ('gradually', 0.056), ('training', 0.056), ('minimum', 0.054), ('hard', 0.052), ('pr', 0.052), ('acs', 0.051), ('rotation', 0.051), ('images', 0.05), ('seed', 0.049), ('runs', 0.048), ('argmaxhi', 0.047), ('objcccp', 0.047), ('objspl', 0.047), ('uniprobe', 0.047), ('variables', 0.046), ('risk', 0.046), ('proteins', 0.045), ('xi', 0.044), ('output', 0.043), ('pairwise', 0.042), ('iteration', 0.041), ('forest', 0.041), ('variable', 0.04), ('boxes', 0.04), ('selects', 0.039), ('deformations', 0.038), ('sequences', 0.038), ('iterations', 0.037), ('structural', 0.036), ('packer', 0.036), ('mammal', 0.036), ('triangles', 0.036), ('annealed', 0.036), ('got', 0.036), ('publicly', 0.036), ('strategy', 0.036), ('local', 0.035), ('classi', 0.035), ('wy', 0.034), ('seconds', 0.033), ('learning', 0.033), ('distance', 0.032), ('bengio', 0.032), ('yn', 0.032), ('active', 0.032), ('four', 0.032), ('update', 0.031), ('cohn', 0.031), ('computable', 0.031), ('involves', 0.031), ('improvement', 0.031), ('finley', 0.03), ('rotated', 0.03), ('angle', 0.03), ('simultaneously', 0.029), ('indicated', 0.029), ('location', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999952 235 nips-2010-Self-Paced Learning for Latent Variable Models
Author: M. P. Kumar, Benjamin Packer, Daphne Koller
Abstract: Latent variable models are a powerful tool for addressing several tasks in machine learning. However, the algorithms for learning the parameters of latent variable models are prone to getting stuck in a bad local optimum. To alleviate this problem, we build on the intuition that, rather than considering all samples simultaneously, the algorithm should be presented with the training data in a meaningful order that facilitates learning. The order of the samples is determined by how easy they are. The main challenge is that often we are not provided with a readily computable measure of the easiness of samples. We address this issue by proposing a novel, iterative self-paced learning algorithm where each iteration simultaneously selects easy samples and learns a new parameter vector. The number of samples selected is governed by a weight that is annealed until the entire training data has been considered. We empirically demonstrate that the self-paced learning algorithm outperforms the state of the art method for learning a latent structural SVM on four applications: object localization, noun phrase coreference, motif finding and handwritten digit recognition. 1
2 0.18770923 240 nips-2010-Simultaneous Object Detection and Ranking with Weak Supervision
Author: Matthew Blaschko, Andrea Vedaldi, Andrew Zisserman
Abstract: A standard approach to learning object category detectors is to provide strong supervision in the form of a region of interest (ROI) specifying each instance of the object in the training images [17]. In this work our goal is to learn from heterogeneous labels, in which some images are only weakly supervised, specifying only the presence or absence of the object or a weak indication of object location, whilst others are fully annotated. To this end we develop a discriminative learning approach and make two contributions: (i) we propose a structured output formulation for weakly annotated images where full annotations are treated as latent variables; and (ii) we propose to optimize a ranking objective function, allowing our method to more effectively use negatively labeled images to improve detection average precision performance. The method is demonstrated on the benchmark INRIA pedestrian detection dataset of Dalal and Triggs [14] and the PASCAL VOC dataset [17], and it is shown that for a significant proportion of weakly supervised images the performance achieved is very similar to the fully supervised (state of the art) results. 1
3 0.17201683 151 nips-2010-Learning from Candidate Labeling Sets
Author: Jie Luo, Francesco Orabona
Abstract: In many real world applications we do not have access to fully-labeled training data, but only to a list of possible labels. This is the case, e.g., when learning visual classifiers from images downloaded from the web, using just their text captions or tags as learning oracles. In general, these problems can be very difficult. However most of the time there exist different implicit sources of information, coming from the relations between instances and labels, which are usually dismissed. In this paper, we propose a semi-supervised framework to model this kind of problems. Each training sample is a bag containing multi-instances, associated with a set of candidate labeling vectors. Each labeling vector encodes the possible labels for the instances in the bag, with only one being fully correct. The use of the labeling vectors provides a principled way not to exclude any information. We propose a large margin discriminative formulation, and an efficient algorithm to solve it. Experiments conducted on artificial datasets and a real-world images and captions dataset show that our approach achieves performance comparable to an SVM trained with the ground-truth labels, and outperforms other baselines.
4 0.12008523 169 nips-2010-More data means less inference: A pseudo-max approach to structured learning
Author: David Sontag, Ofer Meshi, Amir Globerson, Tommi S. Jaakkola
Abstract: The problem of learning to predict structured labels is of key importance in many applications. However, for general graph structure both learning and inference are intractable. Here we show that it is possible to circumvent this difficulty when the distribution of training examples is rich enough, via a method similar in spirit to pseudo-likelihood. We show that our new method achieves consistency, and illustrate empirically that it indeed approaches the performance of exact methods when sufficiently large training sets are used. Many prediction problems in machine learning applications are structured prediction tasks. For example, in protein folding we are given a protein sequence and the goal is to predict the protein’s native structure [14]. In parsing for natural language processing, we are given a sentence and the goal is to predict the most likely parse tree [2]. In these and many other applications, we can formalize the structured prediction problem as taking an input x (e.g., primary sequence, sentence) and predicting ˆ y (e.g., structure, parse) according to y = arg maxy∈Y θ · φ(x, y ), where φ(x, y) is a function that ˆ maps any input and a candidate assignment to a feature vector, Y denotes the space of all possible assignments to the vector y, and θ is a weight vector to be learned. This paper addresses the problem of learning structured prediction models from data. In particular, given a set of labeled examples {(xm , y m )}M , our goal is to find a vector θ such that for each m=1 example m, y m = arg maxy∈Y θ · φ(xm , y), i.e. one which separates the training data. For many structured prediction models, maximization over Y is computationally intractable. This makes it difficult to apply previous algorithms for learning structured prediction models, such as structured perceptron [2], stochastic subgradient [10], and cutting-plane algorithms [5], which require making a prediction at every iteration (equivalent to repeatedly solving an integer linear program). Given training data, we can consider the space of parameters Θ that separate the data. This space can be defined by the intersection of a large number of linear inequalities. A recent approach to getting around the hardness of prediction is to use linear programming (LP) relaxations to approximate the maximization over Y [4, 6, 9]. However, separation with respect to a relaxation places stronger constraints on the parameters. The target solution, an integral vertex in the LP, must now distinguish itself also from possible fractional vertexes that arise due to the relaxation. The relaxations can therefore be understood as optimizing over an inner bound of Θ. This set may be empty even if the training data is separable with exact inference [6]. Another obstacle to using LP relaxations for learning is that solving the LPs can be very slow. In this paper we ask whether it is possible to learn while avoiding inference altogether. We propose a new learning algorithm, inspired by pseudo-likelihood [1], that optimizes over an outer bound of Θ. Learning involves optimizing over only a small number of constraints per data point, and thus can be performed quickly, even for complex structured prediction models. We show that, if the data available for learning is “nice”, this algorithm is consistent, i.e. it will find some θ ∈ Θ. This is an example of how having the right data can circumvent the hardness of learning for structured prediction. 1 We also investigate the limitations of the proposed method. 
We show that the problem of even deciding whether a given data set is separable is NP-hard, and thus learning in a strict sense is no easier than prediction. Thus, we should not expect for our algorithm, or any other polynomial time algorithm, to always succeed at learning from an arbitrary finite data set. To our knowledge, this is the first result characterizing the hardness of exact learning for structured prediction. Finally, we show empirically that our algorithm allows us to successfully learn the parameters for both multi-label prediction and protein side-chain placement. The performance of the algorithm is improved as more data becomes available, as our theoretical results anticipate. 1 Pseudo-Max method We consider the general structured prediction problem. The input space is denoted by X and the set of all possible assignments by Y. Each y ∈ Y corresponds to n variables y1 , . . . , yn , each with k possible states. The classifier uses a (given) function φ(x, y) : X , Y → Rd and (learned) weights θ ∈ Rd , and is defined as y(x; θ) = arg maxy∈Y f (ˆ ; x, θ) where f is the discriminant function y ˆ f (y; x, θ) = θ · φ(x, y). Our analysis will focus on functions φ whose scope is limited to small sets of the yi variables, but for now we keep the discussion general. Given a set of labeled examples {(xm , y m )}M , the goal of the typical learning problem is to find m=1 weights θ that correctly classify the training examples. Consider first the separable case. Define the set of separating weight vectors, Θ = θ | ∀m, y ∈ Y, f (y m ; xm , θ) ≥ f (y; xm , θ)+e(y, y m ) . e is a loss function (e.g., zero-one or Hamming) such that e(y m , y m ) = 0 and e(y, y m ) > 0 for y = y m , which serves to rule out the trivial solution θ = 0.1 The space Θ is defined by exponentially many constraints per example, one for each competing assignment. In this work we consider a much simpler set of constraints where, for each example, we only consider the competing assignments obtained by modifying a single label yi , while fixing the other labels to their value at y m . The pseudo-max set, which is an outer bound on Θ, is given by Here ym −i m Θps = θ | ∀m, i, yi , f (y m ; xm , θ) ≥ f (y m , yi ; xm , θ) + e(yi , yi ) . −i denotes the label y m (1) without the assignment to yi . When the data is not separable, Θ will be the empty set. Instead, we may choose to minimize the hinge loss, (θ) = m maxy f (y; xm , θ) − f (y m ; xm , θ) + e(y, y m ) , which can be shown to be an upper bound on the training error [13]. When the data is separable, minθ (θ) = 0. Note that regularization may be added to this objective. The corresponding pseudo-max objective replaces the maximization over all of y with maximization over a single variable yi while fixing the other labels to their value at y m :2,3 M ps (θ) n = m=1 i=1 m max f (y m , yi ; xm , θ) − f (y m ; xm , θ) + e(yi , yi ) . −i yi Analogous to before, we have minθ ps (θ) (2) = 0 if and only if θ ∈ Θps . The objective in Eq. 2 is similar in spirit to pseudo-likelihood objectives used for maximum likelihood estimation of parameters of Markov random fields (MRFs) [1]. The pseudo-likelihood estimate is provably consistent when the data generating distribution is a MRF of the same structure as used in the pseudo-likelihood objective. However, our setting is different since we only get to view the maximizing assignment of the MRF rather than samples from it. 
Thus, a particular x will always be paired with the same y rather than samples y drawn from the conditional distribution p(y|x; θ). The pseudo-max constraints in Eq. 1 are also related to cutting plane approaches to inference [4, 5]. In the latter, the learning problem is solved by repeatedly looking for assignments that violate the separability constraint (or its hinge version). Our constraints can be viewed as using a very small 1 An alternative formulation, which we use in the next section, is to break the symmetry by having part of the input not be multiplied by any weight. This will also rule out the trivial solution θ = 0. P 2 It is possible to use maxi instead of i , and some of our consistency results will still hold. 3 The pseudo-max approach is markedly different from a learning method which predicts each label yi independently, since the objective considers all i simultaneously (both at learning and test time). 2 x2 0.2 J ∗ + x1 = 0 y = (0, 1) y = (1, 1) g(J12) x2 = 0 x1 J ∗ + x1 + x2 = 0 y = (0, 0) c1=0 c1=1 c1= 1 0.15 0.1 J + x2 = 0 ∗ 0.05 y = (1, 0) x1 = 0 0 1 0.5 0 J 0.5 1 Figure 1: Illustrations for a model with two variables. Left: Partitioning of X induced by configurations y(x) for some J ∗ > 0. Blue lines carve out the exact regions. Red lines denote the pseudo-max constraints that hold with equality. Pseudo-max does not obtain the diagonal constraint coming from comparing configurations y = (1, 1) and (0, 0), since these differ by more than one coordinate. Right: One strictly-convex component of the ps (J ) function (see Eq. 9). The function is shown for different values of c1 , the mean of the x1 variable. subset of assignments for the set of candidate constraint violators. We also note that when exact maximization over the discriminant function f (y; x, θ) is hard, the standard cutting plane algorithm cannot be employed since it is infeasible to find a violated constraint. For the pseudo-max objective, finding a constraint violation is simple and linear in the number of variables.4 It is easy to see (as will be elaborated on next) that the pseudo-max method does not in general yield a consistent estimate of θ, even in the separable case. However, as we show, consistency can be shown to be achieved under particular assumptions on the data generating distribution p(x). 2 Consistency of the Pseudo-Max method In this section we show that if the feature generating distribution p(x) satisfies particular assumptions, then the pseudo-max approach yields a consistent estimate. In other words, if the training data is of the form {(xm , y(xm ; θ ∗ ))}M for some true parameter vector θ ∗ , then as M → ∞ the m=1 minimum of the pseudo-max objective will converge to θ ∗ (up to equivalence transformations). The section is organized as follows. First, we provide intuition for the consistency results by considering a model with only two variables. Then, in Sec. 2.1, we show that any parameter θ ∗ can be identified to within arbitrary accuracy by choosing a particular training set (i.e., choice of xm ). This in itself proves consistency, as long as there is a non-zero probability of sampling this set. In Sec. 2.2 we give a more direct proof of consistency by using strict convexity arguments. For ease of presentation, we shall work with a simplified instance of the structured learning setting. 
We focus on binary variables, yi ∈ {0, 1}, and consider discriminant functions corresponding to Ising models, a special case of pairwise MRFs (J denotes the vector of “interaction” parameters): f (y; x, J ) = ij∈E Jij yi yj + i yi xi (3) The singleton potential for variable yi is yi xi and is not dependent on the model parameters. We could have instead used Ji yi xi , which would be more standard. However, this would make the parameter vector J invariant to scaling, complicating the identifiability analysis. In the consistency analysis we will assume that the data is generated using a true parameter vector J ∗ . We will show that as the data size goes to infinity, minimization of ps (J ) yields J ∗ . We begin with an illustrative analysis of the pseudo-max constraints for a model with only two variables, i.e. f (y; x, J) = Jy1 y2 + y1 x1 + y2 x2 . The purpose of the analysis is to demonstrate general principles for when pseudo-max constraints may succeed or fail. Assume that training samples are generated via y(x) = argmaxy f (y; x, J ∗ ). We can partition the input space X into four regions, ˆ ˆ {x ∈ X : y(x) = y } for each of the four configurations y , shown in Fig. 1 (left). The blue lines outline the exact decision boundaries of f (y; x, J ∗ ), with the lines being given by the constraints 4 The methods differ substantially in the non-separable setting where we minimize ps (θ), using a slack variable for every node and example, rather than just one slack variable per example as in (θ). 3 in Θ that hold with equality. The red lines denote the pseudo-max constraints in Θps that hold with equality. For x such that y(x) = (1, 0) or (0, 1), the pseudo-max and exact constraints are identical. We can identify J ∗ by obtaining samples x = (x1 , x2 ) that explore both sides of one of the decision boundaries that depends on J ∗ . The pseudo-max constraints will fail to identify J ∗ if the samples do not sufficiently explore the transitions between y = (0, 1) and y = (1, 1) or between y = (1, 0) and y = (1, 1). This can happen, for example, when the input samples are dependent, giving only rise to the configurations y = (0, 0) and y = (1, 1). For points labeled (1, 1) around the decision line J ∗ + x1 + x2 = 0, pseudo-max can only tell that they respect J ∗ + x1 ≥ 0 and J ∗ + x2 ≥ 0 (dashed red lines), or x1 ≤ 0 and x2 ≤ 0 for points labeled (0, 0). Only constraints that depend on the parameter are effective for learning. For pseudo-max to be able to identify J ∗ , the input samples must be continuous, densely populating the two parameter dependent decision lines that pseudo-max can use. The two point sets in the figure illustrate good and bad input distributions for pseudo-max. The diagonal set would work well with the exact constraints but badly with pseudo-max, and the difference can be arbitrarily large. However, the input distribution on the right, populating the J ∗ + x2 = 0 decision line, would permit pseudo-max to identify J ∗ . 2.1 Identifiability of True Parameters In this section, we show that it is possible to approximately identify the true model parameters, up to model equivalence, using the pseudo-max constraints and a carefully chosen linear number of data points. Consider the learning problem for structured prediction defined on a fixed graph G = (V, E) where the parameters to be learned are pairwise potential functions θij (yi , yj ) for ij ∈ E and single node fields θi (yi ) for i ∈ V . 
We consider discriminant functions of the form f (y; x, θ) = ij∈E θij (yi , yj ) + i θi (yi ) + i xi (yi ), (4) where the input space X = R|V |k specifies the single node potentials. Without loss of generality, we remove the additional degrees of freedom in θ by restricting it to be in a canonical form: θ ∈ Θcan if for all edges θij (yi , yj ) = 0 whenever yi = 0 or yj = 0, and if for all nodes, θi (yi ) = 0 when yi = 0. As a result, assuming the training set comes from a model in this class, and the input fields xi (yi ) exercise the discriminant function appropriately, we can hope to identify θ ∗ ∈ Θcan . Indeed, we show that, for some data sets, the pseudo-max constraints are sufficient to identify θ ∗ . Let Θps ({y m , xm }) be the set of parameters that satisfy the pseudo-max classification constraints m Θps ({y m , xm }) = θ | ∀m, i, yi = yi , f (y m ; xm , θ) ≥ f (y m , yi ; xm , θ) . −i (5) m e(yi , yi ), For simplicity we omit the margin losses since the input fields xi (yi ) already suffice to rule out the trivial solution θ = 0. Proposition 2.1. For any θ ∗ ∈ Θcan , there is a set of 2|V |(k − 1) + 2|E|(k − 1)2 examples, {xm , y(xm ; θ ∗ )}, such that any pseudo-max consistent θ ∈ Θps ({y m , xm }) ∩ Θcan is arbitrarily close to θ ∗ . The proof is given in the supplementary material. To illustrate the key ideas, we consider the simpler binary discriminant function discussed in Eq. 3. Note that the binary model is already in the canonical form since Jij yi yj = 0 whenever yi = 0 or yj = 0. For any ij ∈ E, we show how to choose two input examples x1 and x2 such that any J consistent with the pseudo-max constraints for these ∗ ∗ two examples will have Jij ∈ [Jij − , Jij + ]. Repeating this for all of the edge parameters then gives the complete set of examples. The input examples we need for this will depend on J ∗ . For the first example, we set the input fields for all neighbors of i (except j) in such a way that ∗ we force the corresponding labels to be zero. More formally, we set x1 < −|N (k)| maxl |Jkl | for k 1 k ∈ N (i)\j, resulting in yk = 0, where y 1 = y(x1 ). In contrast, we set x1 to a large value, e.g. j ∗ 1 ∗ x1 > |N (j)| maxl |Jjl |, so that yj = 1. Finally, for node i, we set x1 = −Jij + so as to obtain a j i 1 slight preference for yi = 1. All other input fields can be set arbitrarily. As a result, the pseudo-max constraints pertaining to node i are f (y 1 ; x1 , J ) ≥ f (y 1 , yi ; x1 , J ) for yi = 0, 1. By taking into −i 1 account the label assignments for yi and its neighbors, and by removing terms that are the same on both sides of the equation, we get Jij + x1 + x1 ≥ Jij yi + yi x1 + x1 , which, for yi = 0, implies i j i j ∗ that Jij + x1 ≥ 0 or Jij − Jij + ≥ 0. The second example x2 differs only in terms of the input i ∗ 2 ∗ field for i. In particular, we set x2 = −Jij − so that yi = 0. This gives Jij ≤ Jij + , as desired. i 4 2.2 Consistency via Strict Convexity In this section we prove the consistency of the pseudo-max approach by showing that it corresponds to minimizing a strictly convex function. Our proof only requires that p(x) be non-zero for all x ∈ Rn (a simple example being a multi-variate Gaussian) and that J ∗ is finite. We use a discriminant function as in Eq. 3. Now, assume the input points xm are distributed according to p(x) and that y m are obtained via y m = arg maxy f (y; xm , J ∗ ). 
We can write the ps (J ) objective for finite data, and its limit when M → ∞, compactly as: 1 m m = max (yi − yi ) xm + Jki yk ps (J ) i M m i yi k∈N (i) p(x) max (yi − yi (x)) xi + → yi i Jki yk (x) dx (6) k∈N (i) ∗ where yi (x) is the label of i for input x when using parameters J . Starting from the above, consider the terms separately for each i. We partition the integral over x ∈ Rn into exclusive regions according to the predicted labels of the neighbors of i (given x). Define Sij = {x : yj (x) = 1 and yk (x) = 0 for k ∈ N (i)\j}. Eq. 6 can then be written as ps (J ) = gi ({Jik }k∈N (i) ) + ˆ i gik (Jik ) , (7) k∈N (i) where gik (Jik ) = x∈Sik p(x) maxyi [(yi −yi (x))(xi +Jik )]dx and gi ({Jik }k∈N (i) ) contains all of ˆ the remaining terms, i.e. where either zero or more than one neighbor is set to one. The function gi ˆ is convex in J since it is a sum of integrals over convex functions. We proceed to show that gik (Jik ) is strictly convex for all choices of i and k ∈ N (i). This will show that ps (J ) is strictly convex since it is a sum over functions strictly convex in each one of the variables in J . For all values xi ∈ (−∞, ∞) there is some x in Sij . This is because for any finite xi and finite J ∗ , the other xj ’s can be chosen so as to give the y configuration corresponding to Sij . Now, since p(x) has full support, we have P (Sij ) > 0 and p(x) > 0 for any x in Sij . As a result, this also holds for the marginal pi (xi |Sij ) over xi within Sij . After some algebra, we obtain: gij (Jij ) = P (Sij ) ∞ p(x)yi (x)(xi + Jij )dx pi (xi |Sij ) max [0, xi + Jij ] dxi − −∞ x∈Sij The integral over the yi (x)(xi + Jij ) expression just adds a linear term to gij (Jij ). The relevant remaining term is (for brevity we drop P (Sij ), a strictly positive constant, and the ij index): h(J) = ∞ pi (xi |Sij ) max [0, xi + J] dxi = −∞ ∞ ˆ pi (xi |Sij )h(xi , J)dxi (8) −∞ ˆ ˆ where we define h(xi , J) = max [0, xi + J]. Note that h(J) is convex since h(xi , J) is convex in J for all xi . We want to show that h(J) is strictly convex. Consider J < J and α ∈ (0, 1) and define ˆ ˆ the interval I = [−J, −αJ − (1 − α)J ]. For xi ∈ I it holds that: αh(xi , J) + (1 − α)h(xi , J ) > ˆ i , αJ + (1 − α)J ) (since the first term is strictly positive and the rest are zero). For all other x, h(x ˆ this inequality holds but is not necessarily strict (since h is always convex in J). We thus have after integrating over x that αh(J) + (1 − α)h(J ) > h(αJ + (1 − α)J ), implying h is strictly convex, as required. Note that we used the fact that p(x) has full support when integrating over I. The function ps (J ) is thus a sum of strictly convex functions in all its variables (namely g(Jik )) plus other convex functions of J , hence strictly convex. We can now proceed to show consistency. By strict convexity, the pseudo-max objective is minimized at a unique point J . Since we know that ps (J ∗ ) = 0 and zero is a lower bound on the value of ps (J ), it follows that J ∗ is the unique minimizer. Thus we have that as M → ∞, the minimizer of the pseudo-max objective is the true parameter vector, and thus we have consistency. As an example, consider the case of two variables y1 , y2 , with x1 and x2 distributed according to ∗ N (c1 , 1), N (0, 1) respectively. Furthermore assume J12 = 0. Then simple direct calculation yields: 2 2 2 c1 + J12 −c1 1 1 √ (9) e−x /2 dx − √ e−c1 /2 + √ e−(J12 +c1 ) /2 2π 2π 2π −J12 −c1 which is indeed a strictly convex function that is minimized at J = 0 (see Fig. 1 for an illustration). 
g(J12 ) = 5 3 Hardness of Structured Learning Most structured prediction learning algorithms use some form of inference as a subroutine. However, the corresponding prediction task is generally NP-hard. For example, maximizing the discriminant function defined in Eq. 3 is equivalent to solving Max-Cut, which is known to be NP-hard. This raises the question of whether it is possible to bypass prediction during learning. Although prediction may be intractable for arbitrary MRFs, what does this say about the difficulty of learning with a polynomial number of data points? In this section, we show that the problem of deciding whether there exists a parameter vector that separates the training data is NP-hard. Put in the context of the positive results in this paper, these hardness results show that, although in some cases the pseudo-max constraints yield a consistent estimate, we cannot hope for a certificate of optimality. Put differently, although the pseudo-max constraints in the separable case always give an outer bound on Θ (and may even be a single point), Θ could be the empty set – and we would never know the difference. Theorem 3.1. Given labeled examples {(xm , y m )}M for a fixed but arbitrary graph G, it is m=1 NP-hard to decide whether there exists parameters θ such that ∀m, y m = arg maxy f (y; xm , θ). Proof. Any parameters θ have an equivalent parameterization in canonical form (see section Sec. 2.1, also supplementary). Thus, the examples will be separable if and only if they are separable by some θ ∈ Θcan . We reduce from unweighted Max-Cut. The Max-Cut problem is to decide, given an undirected graph G, whether there exists a cut of at least K edges. Let G be the same graph as G, with k = 3 states per variable. We construct a small set of examples where a parameter vector will exist that separates the data if and only if there is no cut of K or more edges in G. Let θ be parameters in canonical form equivalent to θij (yi , yj ) = 1 if (yi , yj ) ∈ {(1, 2), (2, 1)}, 0 if yi = yj , and −n2 if (yi , yj ) ∈ {(1, 3), (2, 3), (3, 1), (3, 2)}. We first construct 4n + 8|E| examples, using the technique described in Sec. 2.1 (also supplementary material), which when restricted to the space Θcan , constrain the parameters to equal θ. We then use one more example (xm , y m ) where y m = 3 (every node is in state 3) and, for all i, xm (3) = K−1 and xm (1) = xm (2) = 0. The first i i i n two states encode the original Max-Cut instance, while the third state is used to construct a labeling y m that has value equal to K − 1, and is otherwise not used. Let K ∗ be the value of the maximum cut in G. If in any assignment to the last example there is a variable taking the state 3 and another variable taking the state 1 or 2, then the assignment’s value will be at most K ∗ − n2 , which is less than zero. By construction, the 3 assignment has value K − 1. Thus, the optimal assignment must either be 3 with value K − 1, or some combination of states 1 and 2, which has value at most K ∗ . If K ∗ > K − 1 then 3 is not optimal and the examples are not separable. If K ∗ ≤ K − 1, the examples are separable. This result illustrates the potential difficulty of learning in worst-case graphs. Nonetheless, many problems have a more restricted dependence on the input. For example, in computer vision, edge potentials may depend only on the difference in color between two adjacent pixels. Our results do not preclude positive results of learnability in such restricted settings. 
By establishing hardness of learning, we also close the open problem of relating hardness of inference and learning in structured prediction. If inference problems can be solved in polynomial time, then so can learning (using, e.g., structured perceptron). Thus, when learning is hard, inference must be hard as well. 4 Experiments To evaluate our learning algorithm, we test its performance on both synthetic and real-world datasets. We show that, as the number of training samples grows, the accuracy of the pseudo-max method improves and its speed-up gain over competing algorithms increases. Our learning algorithm corresponds to solving the following, where we add L2 regularization and use a scaled 0-1 loss, m m e(yi , yi ) = 1{yi = yi }/nm (nm is the number of labels in example m): min θ C m nm M nm m=1 i=1 m max f (y m , yi ; xm , θ) − f (y m ; xm , θ) + e(yi , yi ) + θ −i yi 2 . (10) We will compare the pseudo-max method with learning using structural SVMs, both with exact inference and LP relaxations [see, e.g., 4]. We use exact inference for prediction at test time. 6 (a) Synthetic (b) Reuters 0.4 exact LP−relaxation pseudo−max 0.15 Test error Test error 0.2 0.1 0.05 0 1 10 2 10 0.2 0.1 0 1 10 3 10 Train size exact LP−relaxation pseudo−max 0.3 2 10 3 10 4 10 Train size Figure 2: Test error as a function of train size for various algorithms. Subfigure (a) shows results for a synthetic setting, while (b) shows performance on the Reuters data. In the synthetic setting we use the discriminant function f (y; x, θ) = ij∈E θij (yi , yj ) + xi θi (yi ), which is similar to Eq. 4. We take a fully connected graph over n = 10 binary labels. i For a weight vector θ ∗ (sampled once, uniformly in the range [−1, 1], and used for all train/test sets) we generate train and test instances by sampling xm uniformly in the range [−5, 5] and then computing the optimal labels y m = arg maxy∈Y f (y; xm , θ ∗ ). We generate train sets of increasing size (M = {10, 50, 100, 500, 1000, 5000}), run the learning algorithms, and measure the test error for the learned weights (with 1000 test samples). For each train size we average the test error over 10 repeats of sampling and training. Fig. 2(a) shows a comparison of the test error for the three learning algorithms. For small numbers of training examples, the test error of pseudo-max is larger than that of the other algorithms. However, as the train size grows, the error converges to that of exact learning, as our consistency results predict. We also test the performance of our algorithm on a multi-label document classification task from the Reuters dataset [7]. The data consists of M = 23149 training samples, and we use a reduction of the dataset to the 5 most frequent labels. The 5 label variables form a fully connected pairwise graph structure (see [4] for a similar setting). We use random subsamples of increasing size from the train set to learn the parameters, and then measure the test error using 20000 additional samples. For each sample size and learning algorithm, we optimize the trade-off parameter C using 30% of the training data as a hold-out set. Fig. 2(b) shows that for the large data regime the performance of pseudo-max learning gets close to that of the other methods. However, unlike the synthetic setting there is still a small gap, even after seeing the entire train set. 
This could be because the full dataset is not yet large enough to be in the consistent regime (note that exact learning has not flattened either), or because the consistency conditions are not fully satisfied: the data might be non-separable or the support of the input distribution p(x) may be partial. We next apply our method to the problem of learning the energy function for protein side-chain placement, mirroring the learning setup of [14], where the authors train a conditional random field (CRF) using tree-reweighted belief propagation to maximize a lower bound on the likelihood.5 The prediction problem for side-chain placement corresponds to finding the most likely assignment in a pairwise MRF, and fits naturally into our learning framework. There are only 8 parameters to be learned, corresponding to a reweighting of known energy terms. The dataset consists of 275 proteins, where each MRF has several hundred variables (one per residue of the protein) and each variable has on average 20 states. For prediction we use CPLEX’s ILP solver. Fig. 3 shows a comparison of the pseudo-max method and a cutting-plane algorithm which uses an LP relaxation, solved with CPLEX, for finding violated constraints.6 We generate training sets of increasing size (M = {10, 50, 100, 274}), and measure the test error for the learned weights on the remaining examples.7 For M = 10, 50, 100 we average the test error over 3 random train/test splits, whereas for M = 274 we do 1-fold cross validation. We use C = 1 for both algorithms. 5 The authors’ data and results are available from: http://cyanover.fhcrc.org/recomb-2007/ We significantly optimized the cutting-plane algorithm, e.g. including a large number of initial cuttingplanes and restricting the weight vector to be positive (which we know to hold at optimality). 7 Specifically, for each protein we compute the fraction of correctly predicted χ1 and χ2 angles for all residues (except when trivial, e.g. just 1 state). Then, we compute the median of this value across all proteins. 6 7 Time to train (minutes) Test error (χ1 and χ2) 0.27 0.265 pseudo−max LP−relaxation Soft rep 0.26 0.255 0.25 0 50 100 150 200 Train size 250 250 200 pseudo−max LP−relaxation 150 100 50 0 0 50 100 150 200 Train size 250 Figure 3: Training time (for one train/test split) and test error as a function of train size for both the pseudomax method and a cutting-plane algorithm which uses a LP relaxation for inference, applied to the problem of learning the energy function for protein side-chain placement. The pseudo-max method obtains better accuracy than both the LP relaxation and HCRF (given roughly five times more data) for a fraction of the training time. The original weights (“Soft rep” [3]) used for this energy function have 26.7% error across all 275 proteins. The best previously reported parameters, learned in [14] using a Hidden CRF, obtain 25.6% error (their training set included 55 of these 275 proteins, so this is an optimistic estimate). To get a sense of the difficulty of this learning task, we also tried a random positive weight vector, uniformly sampled from the range [0, 1], obtaining an error of 34.9% (results would be much worse if we allowed the weights to be negative). Training using pseudo-max with 50 examples, we learn parameters in under a minute that give better accuracy than the HCRF. The speed-up of training with pseudo-max (using CPLEX’s QP solver) versus cutting-plane is striking. For example, for M = 10, pseudo-max takes only 3 seconds, a 1000-fold speedup. 
Unfortunately the cutting-plane algorithm took a prohibitive amount of time to be able to run on the larger training sets. Since the data used in learning for protein side-chain placement is both highly non-separable and relatively little, these positive results illustrate the potential wide-spread applicability of the pseudo-max method. 5 Discussion The key idea of our method is to find parameters that prefer the true assignment y m over assignments that differ from it in only one variable, in contrast to all other assignments. Perhaps surprisingly, this weak requirement is sufficient to achieve consistency given a rich enough input distribution. One extension of our approach is to add constraints for assignments that differ from y m in more than one variable. This would tighten the outer bound on Θ and possibly result in improved performance, but would also increase computational complexity. We could also add such competing assignments via a cutting-plane scheme so that optimization is performed only over a subset of these constraints. Our work raises a number of important open problems: It would be interesting to derive generalization bounds to understand the convergence rate of our method, as well as understanding the effect of the distribution p(x) on these rates. The distribution p(x) needs to have two key properties. On the one hand, it needs to explore the space Y in the sense that a sufficient number of labels need to be obtained as the correct label for the true parameters (this is indeed used in our consistency proofs). On the other hand, p(x) needs to be sufficiently sensitive close to the decision boundaries so that the true parameters can be inferred. We expect that generalization analysis will depend on these two properties of p(x). Note that [11] studied active learning schemes for structured data and may be relevant in the current context. How should one apply this learning algorithm to non-separable data sets? We suggested one approach, based on using a hinge loss for each of the pseudo constraints. One question in this context is, how resilient is this learning algorithm to label noise? Recent work has analyzed the sensitivity of pseudo-likelihood methods to model mis-specification [8], and it would be interesting to perform a similar analysis here. Also, is it possible to give any guarantees for the empirical and expected risks (with respect to exact inference) obtained by outer bound learning versus exact learning? Finally, our algorithm demonstrates a phenomenon where more data can make computation easier. Such a scenario was recently analyzed in the context of supervised learning [12], and it would be interesting to combine the approaches. Acknowledgments: We thank Chen Yanover for his assistance with the protein data. This work was supported by BSF grant 2008303 and a Google Research Grant. D.S. was supported by a Google PhD Fellowship. 8 References [1] J. Besag. The analysis of non-lattice data. The Statistician, 24:179–195, 1975. [2] M. Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In EMNLP, 2002. [3] G. Dantas, C. Corrent, S. L. Reichow, J. J. Havranek, Z. M. Eletr, N. G. Isern, B. Kuhlman, G. Varani, E. A. Merritt, and D. Baker. High-resolution structural and thermodynamic analysis of extreme stabilization of human procarboxypeptidase by computational protein design. Journal of Molecular Biology, 366(4):1209 – 1221, 2007. [4] T. Finley and T. Joachims. 
Training structural SVMs when exact inference is intractable. In Proceedings of the 25th International Conference on Machine Learning, pages 304–311. ACM, 2008.
[5] T. Joachims, T. Finley, and C.-N. Yu. Cutting-plane training of structural SVMs. Machine Learning, 77(1):27–59, 2009.
[6] A. Kulesza and F. Pereira. Structured learning with approximate inference. In Advances in Neural Information Processing Systems 20, pages 785–792, 2008.
[7] D. Lewis, Y. Yang, T. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. JMLR, 5:361–397, 2004.
[8] P. Liang and M. I. Jordan. An asymptotic analysis of generative, discriminative, and pseudolikelihood estimators. In Proceedings of the 25th International Conference on Machine Learning, pages 584–591. ACM, 2008.
[9] A. F. T. Martins, N. A. Smith, and E. P. Xing. Polyhedral outer approximations with application to natural language parsing. In ICML, pages 713–720, 2009.
[10] N. Ratliff, J. A. D. Bagnell, and M. Zinkevich. (Online) subgradient methods for structured prediction. In AISTATS, 2007.
[11] D. Roth and K. Small. Margin-based active learning for structured output spaces. In Proceedings of the European Conference on Machine Learning (ECML). Springer, September 2006.
[12] S. Shalev-Shwartz and N. Srebro. SVM optimization: Inverse dependence on training set size. In Proceedings of the 25th International Conference on Machine Learning, pages 928–935. ACM, 2008.
[13] B. Taskar, C. Guestrin, and D. Koller. Max margin Markov networks. In Advances in Neural Information Processing Systems 16, pages 25–32, 2004.
[14] C. Yanover, O. Schueler-Furman, and Y. Weiss. Minimizing and learning energy functions for side-chain prediction. Journal of Computational Biology, 15(7):899–911, 2008.
5 0.09972582 89 nips-2010-Factorized Latent Spaces with Structured Sparsity
Author: Yangqing Jia, Mathieu Salzmann, Trevor Darrell
Abstract: Recent approaches to multi-view learning have shown that factorizing the information into parts that are shared across all views and parts that are private to each view could effectively account for the dependencies and independencies between the different input modalities. Unfortunately, these approaches involve minimizing non-convex objective functions. In this paper, we propose an approach to learning such factorized representations inspired by sparse coding techniques. In particular, we show that structured sparsity allows us to address the multiview learning problem by alternately solving two convex optimization problems. Furthermore, the resulting factorized latent spaces generalize over existing approaches in that they allow having latent dimensions shared between any subset of the views instead of between all the views only. We show that our approach outperforms state-of-the-art methods on the task of human pose estimation. 1
6 0.097843915 99 nips-2010-Gated Softmax Classification
7 0.097405195 103 nips-2010-Generating more realistic images using gated MRF's
8 0.096160203 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
9 0.094723649 241 nips-2010-Size Matters: Metric Visual Search Constraints from Monocular Metadata
10 0.094464764 177 nips-2010-Multitask Learning without Label Correspondences
11 0.091736168 272 nips-2010-Towards Holistic Scene Understanding: Feedback Enabled Cascaded Classification Models
12 0.090093948 149 nips-2010-Learning To Count Objects in Images
13 0.089036152 228 nips-2010-Reverse Multi-Label Learning
14 0.088453121 261 nips-2010-Supervised Clustering
15 0.085390396 242 nips-2010-Slice sampling covariance hyperparameters of latent Gaussian models
16 0.082694098 257 nips-2010-Structured Determinantal Point Processes
17 0.081451893 180 nips-2010-Near-Optimal Bayesian Active Learning with Noisy Observations
18 0.079694971 6 nips-2010-A Discriminative Latent Model of Image Region and Object Tag Correspondence
19 0.079322323 86 nips-2010-Exploiting weakly-labeled Web images to improve object classification: a domain adaptation approach
20 0.078381576 186 nips-2010-Object Bank: A High-Level Image Representation for Scene Classification & Semantic Feature Sparsification
topicId topicWeight
[(0, 0.251), (1, 0.078), (2, -0.003), (3, -0.174), (4, -0.044), (5, 0.021), (6, -0.062), (7, 0.0), (8, 0.005), (9, -0.086), (10, -0.033), (11, -0.003), (12, 0.105), (13, -0.009), (14, -0.02), (15, 0.002), (16, -0.032), (17, -0.001), (18, 0.113), (19, -0.017), (20, 0.062), (21, 0.033), (22, 0.046), (23, 0.055), (24, -0.035), (25, 0.012), (26, 0.018), (27, -0.026), (28, 0.059), (29, -0.03), (30, 0.048), (31, 0.071), (32, 0.064), (33, 0.088), (34, 0.039), (35, 0.072), (36, 0.005), (37, -0.051), (38, -0.025), (39, -0.062), (40, -0.056), (41, 0.031), (42, 0.03), (43, 0.052), (44, -0.089), (45, 0.083), (46, -0.011), (47, 0.046), (48, -0.174), (49, 0.066)]
simIndex simValue paperId paperTitle
same-paper 1 0.93590045 235 nips-2010-Self-Paced Learning for Latent Variable Models
Author: M. P. Kumar, Benjamin Packer, Daphne Koller
Abstract: Latent variable models are a powerful tool for addressing several tasks in machine learning. However, the algorithms for learning the parameters of latent variable models are prone to getting stuck in a bad local optimum. To alleviate this problem, we build on the intuition that, rather than considering all samples simultaneously, the algorithm should be presented with the training data in a meaningful order that facilitates learning. The order of the samples is determined by how easy they are. The main challenge is that often we are not provided with a readily computable measure of the easiness of samples. We address this issue by proposing a novel, iterative self-paced learning algorithm where each iteration simultaneously selects easy samples and learns a new parameter vector. The number of samples selected is governed by a weight that is annealed until the entire training data has been considered. We empirically demonstrate that the self-paced learning algorithm outperforms the state of the art method for learning a latent structural SVM on four applications: object localization, noun phrase coreference, motif finding and handwritten digit recognition. 1
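As a concrete illustration of the alternation described in this abstract, here is a minimal sketch of a self-paced training loop. The inner learner `train_on` and its per-sample losses `loss_per_sample` are assumed helpers, and the annealing constants are illustrative, not the paper's settings:

```python
# A hedged sketch of self-paced learning: alternately select easy samples
# (those with loss below 1/K) and retrain, then anneal K so that later
# iterations admit harder samples.
import numpy as np

def self_paced_learning(X, Y, train_on, loss_per_sample, K=1.0, mu=1.3, iters=20):
    v = np.ones(len(X), dtype=bool)        # one possible initialization: all samples
    w = train_on(X, Y, v)
    for _ in range(iters):
        losses = loss_per_sample(w, X, Y)
        v = losses < 1.0 / K               # closed-form easy-sample selection
        w = train_on(X, Y, v)              # retrain on the currently "easy" set
        K /= mu                            # anneal: 1/K grows, admitting more samples
        if v.all():
            break                          # entire training set considered
    return w
```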
2 0.73959816 240 nips-2010-Simultaneous Object Detection and Ranking with Weak Supervision
Author: Matthew Blaschko, Andrea Vedaldi, Andrew Zisserman
Abstract: A standard approach to learning object category detectors is to provide strong supervision in the form of a region of interest (ROI) specifying each instance of the object in the training images [17]. In this work our goal is to learn from heterogeneous labels, in which some images are only weakly supervised, specifying only the presence or absence of the object or a weak indication of object location, whilst others are fully annotated. To this end we develop a discriminative learning approach and make two contributions: (i) we propose a structured output formulation for weakly annotated images where full annotations are treated as latent variables; and (ii) we propose to optimize a ranking objective function, allowing our method to more effectively use negatively labeled images to improve detection average precision performance. The method is demonstrated on the benchmark INRIA pedestrian detection dataset of Dalal and Triggs [14] and the PASCAL VOC dataset [17], and it is shown that for a significant proportion of weakly supervised images the performance achieved is very similar to the fully supervised (state-of-the-art) results. 1
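Contribution (i) hinges on scoring a weakly labeled image by its best latent annotation; a minimal hedged sketch of that latent scoring step (variable names are ours, not the paper's):

```python
# Treating the ROI as a latent variable, an image is scored by its best
# candidate box under a linear model w^T phi(image, roi).
import numpy as np

def latent_score(w, phi, image, candidate_rois):
    """Return the max score over latent ROIs and the maximizing ROI."""
    scores = [float(np.dot(w, phi(image, roi))) for roi in candidate_rois]
    best = int(np.argmax(scores))
    return scores[best], candidate_rois[best]
```

During learning, the best-scoring ROI of a weakly labeled image plays the role of the imputed annotation.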
3 0.69764912 213 nips-2010-Predictive Subspace Learning for Multi-view Data: a Large Margin Approach
Author: Ning Chen, Jun Zhu, Eric P. Xing
Abstract: Learning from multi-view data is important in many applications, such as image classification and annotation. In this paper, we present a large-margin learning framework to discover a predictive latent subspace representation shared by multiple views. Our approach is based on an undirected latent space Markov network that fulfills a weak conditional independence assumption that multi-view observations and response variables are independent given a set of latent variables. We provide efficient inference and parameter estimation methods for the latent subspace model. Finally, we demonstrate the advantages of large-margin learning on real video and web image data for discovering predictive latent representations and improving the performance on image classification, annotation and retrieval.
4 0.6527313 188 nips-2010-On Herding and the Perceptron Cycling Theorem
Author: Andrew Gelfand, Yutian Chen, Laurens Maaten, Max Welling
Abstract: The paper develops a connection between traditional perceptron algorithms and recently introduced herding algorithms. It is shown that both algorithms can be viewed as an application of the perceptron cycling theorem. This connection strengthens some herding results and suggests new (supervised) herding algorithms that, like CRFs or discriminative RBMs, make predictions by conditioning on the input attributes. We develop and investigate variants of conditional herding, and show that conditional herding leads to practical algorithms that perform better than or on par with related classifiers such as the voted perceptron and the discriminative RBM. 1
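The herding update this abstract refers to can be stated in a few lines; this is a generic sketch of (unconditioned) herding over a small discrete state space, with illustrative names:

```python
# A hedged sketch of the basic herding dynamics: maximize, record a sample,
# then move the weights by the moment-matching error.
import numpy as np

def herd(phi, states, phi_bar, w0, n_steps):
    """phi maps a state to features; phi_bar is the target moment vector."""
    w = np.array(w0, dtype=float)
    samples = []
    for _ in range(n_steps):
        s = max(states, key=lambda s: float(np.dot(w, phi(s))))  # maximization step
        samples.append(s)
        w += phi_bar - phi(s)          # weights chase the unmatched moments
    return samples
```

The perceptron cycling theorem guarantees the weights stay bounded, so the sample averages of phi(s) converge to phi_bar.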
5 0.64231354 103 nips-2010-Generating more realistic images using gated MRF's
Author: Marc'aurelio Ranzato, Volodymyr Mnih, Geoffrey E. Hinton
Abstract: Probabilistic models of natural images are usually evaluated by measuring performance on rather indirect tasks, such as denoising and inpainting. A more direct way to evaluate a generative model is to draw samples from it and to check whether statistical properties of the samples match the statistics of natural images. This method is seldom used with high-resolution images, because current models produce samples that are very different from natural images, as assessed by even simple visual inspection. We investigate the reasons for this failure and we show that by augmenting existing models so that there are two sets of latent variables, one set modelling pixel intensities and the other set modelling image-specific pixel covariances, we are able to generate high-resolution images that look much more realistic than before. The overall model can be interpreted as a gated MRF where both pair-wise dependencies and mean intensities of pixels are modulated by the states of latent variables. Finally, we confirm that if we disallow weight-sharing between receptive fields that overlap each other, the gated MRF learns more efficient internal representations, as demonstrated in several recognition tasks.
1 Introduction and Prior Work
The study of the statistical properties of natural images has a long history and has influenced many fields, from image processing to computational neuroscience [1]. In this work we focus on probabilistic models of natural images. These models are useful for extracting representations [2, 3, 4] that can be used for discriminative tasks and they can also provide adaptive priors [5, 6, 7] that can be used in applications like denoising and inpainting. Our main focus, however, will be on improving the quality of the generative model, rather than exploring its possible applications. Markov Random Fields (MRF's) provide a very general framework for modelling natural images. In an MRF, an image is assigned a probability which is a normalized product of potential functions, with each function typically being defined over a subset of the observed variables. In this work we consider a very versatile class of MRF's in which potential functions are defined over both pixels and latent variables, thus allowing the states of the latent variables to modulate or gate the effective interactions between the pixels. This type of MRF, that we dub gated MRF, was proposed as an image model by Geman and Geman [8]. Welling et al. [9] showed how an MRF in this family could be learned for small image patches, and their work was extended to high-resolution images by Roth and Black [6], who also demonstrated its success in some practical applications [7]. (Product of Student's t models, without pooling, may not appear to have latent variables, but each potential can be viewed as an infinite mixture of zero-mean Gaussians where the inverse variance of the Gaussian is the latent variable.) Besides their practical use, these models were specifically designed to match the statistical properties of natural images, and therefore, it seems natural to evaluate them in those terms. Indeed, several authors [10, 7] have proposed that these models should be evaluated by generating images and checking whether the samples match the statistical properties observed in natural images. It is, therefore, very troublesome that none of the existing models can generate good samples, especially for high-resolution images (see for instance fig. 2 in [7], which is one of the best models of high-resolution images reported in the literature so far).
In fact, as our experiments demonstrate, the generated samples from these models are more similar to random images than to natural images! When MRF's with gated interactions are applied to small image patches, they actually seem to work moderately well, as demonstrated by several authors [11, 12, 13]. The generated patches have some coherent and elongated structure and, like natural image patches, they are predominantly very smooth with sudden outbreaks of strong structure. This is unsurprising because these models have a built-in assumption that images are very smooth with occasional strong violations of smoothness [8, 14, 15]. However, the extension of these patch-based models to high-resolution images by replicating filters across the image has proven to be difficult. The receptive fields that are learned no longer resemble Gabor wavelets but look random [6, 16] and the generated images lack any of the long range structure that is so typical of natural images [7]. The success of these methods in applications such as denoising is a poor measure of the quality of the generative model that has been learned: setting the parameters to random values works almost as well for eliminating independent Gaussian noise [17], because this can be done quite well by just using a penalty for high-frequency variation. In this work, we show that the generative quality of these models can be drastically improved by jointly modelling both pixel mean intensities and pixel covariances. This can be achieved by using two sets of latent variables, one that gates pair-wise interactions between pixels and another one that sets the mean intensities of pixels, as we already proposed in some earlier work [4]. Here, we show that this modelling choice is crucial to make the gated MRF work well on high-resolution images. Finally, we show that the most widely used method of sharing weights in MRF's for high-resolution images is overly constrained. Earlier work considered homogeneous MRF's in which each potential is replicated at all image locations. This has the subtle effect of making learning very difficult because of strong correlations at nearby sites. Following Gregor and LeCun [18] and also Tang and Eliasmith [19], we keep the number of parameters under control by using local potentials, but unlike Roth and Black [6] we only share weights between potentials that do not overlap.
2 Augmenting Gated MRF's with Mean Hidden Units
A Product of Student's t (PoT) model [15] is a gated MRF defined on small image patches that can be viewed as modelling image-specific, pair-wise relationships between pixel values by using the states of its latent variables. It is very good at representing the fact that two pixels have very similar intensities, but no good at all at modelling what these intensities are. Failure to model the mean also leads to impoverished modelling of the covariances when the input images have nonzero mean intensity. The covariance RBM (cRBM) [20] is another model that shares the same limitation since it only differs from PoT in the distribution of its latent variables: the posterior over the latent variables is a product of Bernoulli distributions instead of Gamma distributions as in PoT. We explain the fundamental limitation of these models by using a simple toy example: modelling two-pixel images using a cRBM with only one binary hidden unit, see fig. 1.
This cRBM assumes that the conditional distribution over the input is a zero-mean Gaussian with a covariance that is determined by the state of the latent variable. Since the latent variable is binary, the cRBM can be viewed as a mixture of two zero-mean full covariance Gaussians. The latent variable uses the pairwise relationship between pixels to decide which of the two covariance matrices should be used to model each image. When the input data is pre-processed by making each image have zero mean intensity (the empirical histogram is shown in the first row and first column), most images lie near the origin because most of the time nearby pixels are strongly correlated. Less frequently we encounter edge images that exhibit strong anti-correlation between the pixels, as shown by the long tails along the anti-diagonal line. A cRBM could model this data by using two Gaussians (first row and second column): one that is spherical and tight at the origin for smooth images and another one that has a covariance elongated along the anti-diagonal for structured images. If, however, the whole set of images is normalized by subtracting from every pixel the mean value of all pixels over all images (second row and first column), the cRBM fails at modelling structured images (second row and second column). It can fit a Gaussian to the smooth images by discovering the direction of strong correlation along the main diagonal, but it is very likely to fail to discover the direction of anti-correlation, which is crucial to represent discontinuities, because structured images with different mean intensity appear to be evenly spread over the whole input space. If the model has another set of latent variables that can change the means of the Gaussian distributions in the mixture (as explained more formally below and yielding the mPoT and mcRBM models), then the model can represent both changes of mean intensity and the correlational structure of pixels (see last column). The mean latent variables effectively subtract off the relevant mean from each data-point, letting the covariance latent variable capture the covariance structure of the data. As before, the covariance latent variable needs only to select between two covariance matrices. In fact, experiments on real 8x8 image patches confirm these conjectures. Fig. 1 shows samples drawn from PoT and mPoT.
Figure 1: In the first row, each image is zero mean. In the second row, the whole set of data points is centered but each image can have non-zero mean. The first column shows 8x8 images picked at random from natural images. The images in the second column are generated by a model that does not account for mean intensity. The images in the third column are generated by a model that has both "mean" and "covariance" hidden units. The contours in the first column show the negative log of the empirical distribution of (tiny) natural two-pixel images (x-axis being the first pixel and the y-axis the second pixel). The plots in the other columns are toy examples showing how each model could represent the empirical distribution using a mixture of Gaussians with components that have one of two possible covariances (corresponding to the state of a binary "covariance" latent variable). Models that can change the means of the Gaussians (mPoT and mcRBM) can represent better structured images (edge images lie along the anti-diagonal and are fitted by the Gaussians shown in red) while the other models (PoT and cRBM) fail, especially when each image can have non-zero mean.
mPoT (and similarly mcRBM [4]) is not only better at modelling zero mean images but it can also represent images that have non-zero mean intensity well. We now describe mPoT, referring the reader to [4] for a detailed description of mcRBM. In PoT [9] the energy function is:

E^{PoT}(x, h^c) = \sum_i [ h^c_i (1 + \tfrac{1}{2} (C_i^T x)^2) + (1 - \gamma) \log h^c_i ]   (1)

where x is a vectorized image patch, h^c is a vector of Gamma "covariance" latent variables, C is a filter bank matrix and γ is a scalar parameter. The joint probability over input pixels and latent variables is proportional to exp(−E^{PoT}(x, h^c)). Therefore, the conditional distribution over the input pixels is a zero-mean Gaussian with covariance equal to:

\Sigma^c = (C \, \mathrm{diag}(h^c) \, C^T)^{-1}.   (2)

In order to make the mean of the conditional distribution non-zero, we define mPoT as the normalized product of the above zero-mean Gaussian that models the covariance and a spherical covariance Gaussian that models the mean. The overall energy function becomes:

E^{mPoT}(x, h^c, h^m) = E^{PoT}(x, h^c) + E^m(x, h^m)   (3)

where h^m is another set of latent variables that are assumed to be Bernoulli distributed (but other distributions could be used). The new energy term is:

E^m(x, h^m) = \tfrac{1}{2} x^T x − \sum_j h^m_j W_j^T x   (4)

yielding the following conditional distribution over the input pixels:

p(x | h^c, h^m) = N(\Sigma (W h^m), \Sigma),   \Sigma = ((\Sigma^c)^{-1} + I)^{-1}   (5)

with Σ^c defined in eq. 2. As desired, the conditional distribution has non-zero mean (the need to model the means was clearly recognized in [21], but they used conjunctive latent features that simultaneously represented a contribution to the "precision matrix" in a specific direction and the mean along that same direction). Patch-based models like PoT have been extended to high-resolution images by using spatially localized filters [6]. While we can subtract off the mean intensity from independent image patches to successfully train PoT, we cannot do that on a high-resolution image because overlapping patches might have different mean. Unfortunately, replicating potentials over the image while ignoring variations of mean intensity has been the leading strategy to date [6] (the success of PoT-like models in Bayesian denoising is not surprising, since the noisy image effectively replaces the reconstruction term from the mean hidden units (see eq. 5), providing a set of noisy mean intensities that are cleaned up by the patterns of correlation enforced by the covariance latent variables). This is the major reason why generation of high-resolution images is so poor. Sec. 4 shows that generation can be drastically improved by explicitly accounting for variations of mean intensity, as performed by mPoT and mcRBM.
3 Weight-Sharing Schemes
By integrating out the latent variables, we can write the density function of any gated MRF as a normalized product of potential functions (for mPoT refer to eq. 6). In this section we investigate different ways of constraining the parameters of the potentials of a generic MRF; the four schemes we consider are illustrated in fig. 2.
Figure 2: Illustration of different choices of weight-sharing scheme for an RBM. Links converging to one latent variable are filters. Filters with the same color share the same parameters. Kinds of weight-sharing scheme: A) Global, B) Local, C) TConv and D) Conv. E) TConv applied to an image. Cells correspond to neighborhoods to which filters are applied. Cells with the same color share the same parameters. F) 256 filters learned by a Gaussian RBM with the TConv weight-sharing scheme on high-resolution natural images. Each filter has size 16x16 pixels and it is applied every 16 pixels in both the horizontal and vertical directions. Filters in position (i, j) and (1, 1) are applied to neighborhoods that are (i, j) pixels away from each other. Best viewed in color.
Global: The obvious way to extend a patch-based model like PoT to high-resolution images is to define potentials over the whole image; we call this scheme global.
This is not practical because 1) the number of parameters grows about quadratically with the size of the image, making training too slow, 2) we do not need to model interactions between very distant pairs of pixels since their dependence is negligible, and 3) we would not be able to use the model on images of different size. Conv: The most popular way to handle big images is to define potentials on small subsets of variables (e.g., neighborhoods of size 5x5 pixels) and to replicate these potentials across space while sharing their parameters at each image location [23, 24, 6]. This yields a convolutional weight-sharing scheme, also called homogeneous field in the statistics literature. This choice is justified by the stationarity of natural images. This weight-sharing scheme is extremely concise in terms of number of parameters, but also rather inefficient in terms of latent representation. First, if there are N filters at each location and these filters are stepped by one pixel then the internal representation is about N times overcomplete. The internal representation has not only high computational cost, but it is also highly redundant. Since the input is mostly smooth and the parameters are the same across space, the latent variables are strongly correlated as well. This inefficiency turns out to be particularly harmful for a model like PoT, causing the learned filters to become "random" looking (see fig. 3-iii). A simple intuition follows from the equivalence between PoT and square ICA [15]. If the filter matrix C of eq. 1 is square and invertible, we can marginalize out the latent variables and write p(y) = \prod_i S(y_i), where y_i = C_i^T x and S is a Student's t distribution. In other words, there is an underlying assumption that filter outputs are independent. However, if the filters of matrix C are shifted and overlapping versions of each other, this clearly cannot be true. Training PoT with the Conv weight-sharing scheme forces the model to find filters that make filter outputs as independent as possible, which explains the very high-frequency patterns that are usually discovered [6]. Local: The Global and Conv weight-sharing schemes are at the two extremes of a spectrum of possibilities. For instance, we can define potentials on a small subset of input variables but, unlike Conv, each potential can have its own set of parameters, as shown in fig. 2-B. This is called local, or inhomogeneous field. Compared to Conv the number of parameters increases only slightly, but the number of latent variables required and their redundancy is greatly reduced. In fact, the model learns different receptive fields at different locations as a better strategy for representing the input, especially when the number of potentials is limited (see also fig. 2-F). TConv: Local would not allow the model to be trained and tested on images of different resolution, and it might seem wasteful not to exploit the translation invariant property of images.
We therefore advocate the use of a weight-sharing scheme that we call tiled-convolutional (TConv), shown in fig. 2-C and E [18]. Each filter tiles the image without overlaps with copies of itself (i.e. the stride equals the filter diameter). This reduces spatial redundancy of latent variables and allows the input images to have arbitrary size. At the same time, different filters do overlap with each other in order to avoid tiling artifacts. Fig. 2-F shows filters that were (jointly) learned by a Restricted Boltzmann Machine (RBM) [29] with Gaussian input variables using the TConv weight-sharing scheme.
4 Experiments
We train gated MRF's with and without mean hidden units using different weight-sharing schemes. The training procedure is very similar in all cases. We perform approximate maximum likelihood by using Fast Persistent Contrastive Divergence (FPCD) [25] and we draw samples by using Hybrid Monte Carlo (HMC) [26]. Since all latent variables can be exactly marginalized out, we can use HMC on the free energy (the negative logarithm of the marginal distribution over the input pixels). For mPoT this is:

F^{mPoT}(x) = −\log p(x) + \mathrm{const.} = \sum_{k,i} \gamma \log(1 + \tfrac{1}{2}(C_{ik}^T x_k)^2) + \tfrac{1}{2} x^T x − \sum_{k,j} \log(1 + \exp(W_{jk}^T x_k))   (6)

where the index k runs over spatial locations and x_k is the k-th image patch. FPCD keeps samples, called negative particles, that it uses to represent the model distribution. These particles are all updated after each weight update. For each mini-batch of data-points a) we compute the derivative of the free energy w.r.t. the training samples, b) we update the negative particles by running HMC for one HMC step consisting of 20 leapfrog steps (we start at the previous set of negative particles and use as parameters the sum of the regular parameters and a small perturbation vector), c) we compute the derivative of the free energy at the negative particles, and d) we update the regular parameters by using the difference of gradients between step a) and c), while the perturbation vector is updated using the gradient from c) only. The perturbation is also strongly decayed to zero and is subject to a larger learning rate. The aim is to encourage the negative particles to explore the space more quickly by slightly and temporarily raising the energy at their current position. Note that the use of FPCD as opposed to other estimation methods (like Persistent Contrastive Divergence [27]) turns out to be crucial to achieve good mixing of the sampler even after training. We train on mini-batches of 32 samples using gray-scale images of approximate size 160x160 pixels randomly cropped from the Berkeley segmentation dataset [28]. We perform 160,000 weight updates, decreasing the learning rate by a factor of 4 by the end of training. The initial learning rate is set to 0.1 for the covariance filters (matrix C of eq. 1), 0.01 for the mean parameters (matrix W of eq. 4), and 0.001 for the other parameters (γ of eq. 1). During training we condition on the borders and initialize the negative particles at zero in order to avoid artifacts at the border of the image. We learn 8x8 filters and pre-multiply the covariance filters by a whitening transform retaining 99% of the variance; we also normalize the norm of the covariance filters to prevent some of them from decaying to zero during training (the code used in the experiments can be found at the first author's web-page). Whenever we use the TConv weight-sharing scheme the model learns covariance filters that mostly resemble localized and oriented Gabor functions (see fig. 3-i and iv), while the Conv weight-sharing scheme learns structured but poorly localized high-frequency patterns (see fig. 3-iii) [6]. The TConv models re-use the same 8x8 filters every 8 pixels and apply a diagonal offset of 2 pixels between neighboring filters with different weights in order to reduce tiling artifacts. There are 4 sets of filters, each with 64 filters, for a total of 256 covariance filters (see bottom plot of fig. 3). Similarly, we have 4 sets of mean filters, each with 32 filters. These filters usually have non-zero mean and exhibit on-center off-surround and off-center on-surround patterns, see fig. 3-ii.
Figure 3: 160x160 samples drawn by A) mPoT-TConv, B) mHPoT-TConv, C) mcRBM-TConv and D) PoT-TConv. On the side also i) a subset of 8x8 "covariance" filters learned by mPoT-TConv (the plot below shows how the whole set of filters tile a small patch; each bar corresponds to a Gabor fit of a filter and colors identify filters applied at the same 8x8 location; each group is shifted by 2 pixels down the diagonal and a high-resolution image is tiled by replicating this pattern every 8 pixels horizontally and vertically), ii) a subset of 8x8 "mean" filters learned by the same mPoT-TConv, iii) filters learned by PoT-Conv and iv) by PoT-TConv.
In order to draw samples from the learned models, we run HMC for a long time (10,000 iterations, each composed of 20 leap-frog steps). Some samples of size 160x160 pixels are reported in fig. 3 A)-D). Without modelling the mean intensity, samples lack structure and do not seem much different from those that would be generated by a simple Gaussian model merely fitting the second order statistics (see fig. 3 in [1] and also fig. 2 in [7]). By contrast, structure, sharp boundaries and some simple texture emerge only from models that have mean latent variables, namely mcRBM, mPoT and mHPoT, which differs from mPoT by having a second layer pooling matrix on the squared covariance filter outputs [11]. A more quantitative comparison is reported in table 1. We first compute marginal statistics of filter responses using the generated images, natural images from the test set, and random images. The statistics are the normalized histograms of individual filter responses to 24 Gabor filters (8 orientations and 3 scales). We then calculate the KL divergence between the histograms on random images and generated images and the KL divergence between the histograms on natural images and generated images. The table also reports the average difference of energies between random images and natural images. All results demonstrate that models that account for mean intensity generate images that are closer to natural images than to random images, whereas models that do not account for the mean (like the widely used PoT-Conv) produce samples that are actually closer to random images.
Table 1: Comparing MRF's by measuring: difference of energy (negative log ratio of probabilities) between random images (R) and test natural images (T); the KL divergence between statistics of random images (R) and generated images (G); the KL divergence between statistics of test natural images (T) and generated images (G); and the difference of these two KL divergences. Statistics are computed using 24 Gabor filters.

MODEL           F(R) − F(T) (10^4)   KL(R, G)   KL(T, G)   KL(R, G) − KL(T, G)
PoT - Conv            2.9               0.3        0.6            -0.3
PoT - TConv           2.8               0.4        1.0            -0.6
mPoT - TConv          5.2               1.0        0.2             0.8
mHPoT - TConv         4.9               1.7        0.8             0.9
mcRBM - TConv         3.5               1.5        1.0             0.5
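As a sanity check on eqs. (2) and (5), the conditional Gaussian of mPoT can be computed in a few lines. This is an illustrative numpy sketch with our own variable names, not the authors' code:

```python
# Compute the mean and covariance of p(x | h^c, h^m) for mPoT, using the
# precision form: the covariance precision C diag(h^c) C^T plus the identity
# precision contributed by the mean term (1/2) x^T x.
import numpy as np

def mpot_conditional(C, W, h_c, h_m):
    d = C.shape[0]                                # input dimensionality
    prec_c = C @ np.diag(h_c) @ C.T               # inverse of Sigma^c in eq. (2)
    sigma = np.linalg.inv(prec_c + np.eye(d))     # eq. (5)
    mu = sigma @ (W @ h_m)                        # conditional mean
    return mu, sigma
```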
4.1 Discriminative Experiments on Weight-Sharing Schemes
In future work, we intend to use the features discovered by the generative model for recognition. To understand how the different weight-sharing schemes affect recognition performance we have done preliminary tests using the discriminative performance of a simpler model on simpler data. We consider one of the simplest and most versatile models, namely the RBM [29]. Since we also aim to test the Global weight-sharing scheme we are constrained to using fairly low resolution datasets such as the MNIST dataset of handwritten digits [30] and the CIFAR 10 dataset of generic object categories [22]. The MNIST dataset has soft binary images of size 28x28 pixels, while the CIFAR 10 dataset has color images of size 32x32 pixels. CIFAR 10 has 10 classes, 5000 training samples per class and 1000 test samples per class. MNIST also has 10 classes with, on average, 6000 training samples per class and 1000 test samples per class. The energy function of the RBM trained on the CIFAR 10 dataset, modelling input pixels with 3 (R,G,B) Gaussian variables [31], is exactly the one shown in eq. 4; while the RBM trained on MNIST uses logistic units for the pixels and the energy function is again the same as before but without any quadratic term. All models are trained in an unsupervised way to approximately maximize the likelihood in the training set using Contrastive Divergence [32]. They are then used to represent each input image with a feature vector (mean of the posterior over the latent variables) which is fed to a multinomial logistic classifier for discrimination. Models are compared in terms of: 1) recognition accuracy, 2) convergence time and 3) dimensionality of the representation. In general, assuming filters much smaller than the input image and assuming equal number of latent variables, Conv, TConv and Local models process each sample faster than Global by a factor approximately equal to the ratio between the area of the image and the area of the filters, which can be very large in practice. In the first set of experiments reported on the left of fig. 4 we study the internal representation in terms of discrimination and dimensionality using the MNIST dataset. For each choice of dimensionality all models are trained using the same number of operations. This is set to the amount necessary to complete one epoch over the training set using the Global model. This experiment shows that: 1) Local outperforms all other weight-sharing schemes for a wide range of dimensionalities, 2) TConv does not perform as well as Local, probably because the translation invariant assumption is clearly violated for these relatively small, centered, images, 3) Conv performs well only when the internal representation is very high dimensional (10 times overcomplete), otherwise it severely underfits, 4) Global performs well when the representation is compact but its performance degrades rapidly as this increases because it needs more than the allotted training time. The right hand side of fig. 4 shows how the recognition performance evolves as we increase the number of operations (or training time) using models that produce a twice overcomplete internal representation. With only very few filters Conv still underfits and it does not improve its performance by training for longer, but Global does improve and eventually it reaches the performance of Local. If we look at the crossing of the error rate at 2% we can see that Local is about 4 times faster than Global.
To summarize, Local provides more compact representations than Conv, is much faster than Global while achieving similar performance in discrimination. Also, Local can easily scale to larger images while Global cannot.
Figure 4: Experiments on MNIST using RBM's with different weight-sharing schemes. Left: Error rate as a function of the dimensionality of the latent representation. Right: Error rate as a function of the number of operations (normalized to those needed to perform one epoch in the Global model); all models have a twice overcomplete latent representation.
Similar experiments are performed using the CIFAR 10 dataset [22] of natural images. Using the same protocol introduced in earlier work by Krizhevsky [22], the RBM's are trained in an unsupervised way on a subset of the 80 million tiny images dataset [33] and then "fine-tuned" on the CIFAR 10 dataset by supervised back-propagation of the error through the linear classifier and feature extractor. All models produce an approximately 10,000 dimensional internal representation to make a fair comparison. Models using local filters learn 16x16 filters that are stepped every pixel. Again, we do not experiment with the TConv weight-sharing scheme because the image is not large enough to allow enough replicas. Similarly to fig. 3-iii, the Conv weight-sharing scheme was very difficult to train and did not produce Gabor-like features. Indeed, careful injection of sparsity and long training time seem necessary [31] for these RBM's. By contrast, both Local and Global produce Gabor-like filters similar to those shown in fig. 2-F. The model trained with the Conv weight-sharing scheme yields an accuracy of 56.6%, while Local and Global yield much better performance, 63.6% and 64.8% [22], respectively. Although Local and Global have similar performance, training with the Local weight-sharing scheme took under an hour while using the Global weight-sharing scheme required more than a day.
5 Conclusions and Future Work
This work is motivated by the poor generative quality of currently popular MRF models of natural images. These models generate images that are actually more similar to white noise than to natural images. Our contribution is to recognize that current models can benefit from 1) the addition of a simple model of the mean intensities and from 2) the use of a less constrained weight-sharing scheme. By augmenting these models with an extra set of latent variables that model mean intensity we can generate samples that look much more realistic: they are characterized by smooth regions, sharp boundaries and some simple high frequency texture. We validate our approach by comparing the statistics of filter outputs on natural images and generated images. In the future, we plan to integrate these MRF's into deeper hierarchical models and to use their internal representation to perform object recognition in high-resolution images. The hope is to further improve generation by capturing longer range dependencies and to exploit this to better cope with missing values and ambiguous sensory inputs.
References
[1] E.P. Simoncelli. Statistical modeling of photographic images. Handbook of Image and Video Processing, pages 431–441, 2005.
[2] A. Hyvarinen, J. Karhunen, and E. Oja. Independent Component Analysis.
John Wiley & Sons, 2001.
[3] G.E. Hinton and R.R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
[4] M. Ranzato and G.E. Hinton. Modeling pixel means and covariances using factorized third-order Boltzmann machines. In CVPR, 2010.
[5] M.J. Wainwright and E.P. Simoncelli. Scale mixtures of Gaussians and the statistics of natural images. In NIPS, 2000.
[6] S. Roth and M.J. Black. Fields of experts: A framework for learning image priors. In CVPR, 2005.
[7] U. Schmidt, Q. Gao, and S. Roth. A generative perspective on MRFs in low-level vision. In CVPR, 2010.
[8] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. PAMI, 6:721–741, 1984.
[9] M. Welling, G.E. Hinton, and S. Osindero. Learning sparse topographic representations with products of Student-t distributions. In NIPS, 2003.
[10] S.C. Zhu and D. Mumford. Prior learning and Gibbs reaction diffusion. PAMI, pages 1236–1250, 1997.
[11] S. Osindero, M. Welling, and G.E. Hinton. Topographic product models applied to natural scene statistics. Neural Comp., 18:344–381, 2006.
[12] S. Osindero and G.E. Hinton. Modeling image patches with a directed hierarchy of Markov random fields. In NIPS, 2008.
[13] Y. Karklin and M.S. Lewicki. Emergence of complex cell properties by learning to generalize in natural scenes. Nature, 457:83–86, 2009.
[14] B.A. Olshausen and D.J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37:3311–3325, 1997.
[15] Y.W. Teh, M. Welling, S. Osindero, and G.E. Hinton. Energy-based models for sparse overcomplete representations. JMLR, 4:1235–1260, 2003.
[16] Y. Weiss and W.T. Freeman. What makes a good model of natural images? In CVPR, 2007.
[17] S. Roth and M.J. Black. Fields of experts. Int. Journal of Computer Vision, 82:205–229, 2009.
[18] K. Gregor and Y. LeCun. Emergence of complex-like cells in a temporal product network with local receptive fields. arXiv:1006.0448, 2010.
[19] C. Tang and C. Eliasmith. Deep networks for robust visual recognition. In ICML, 2010.
[20] M. Ranzato, A. Krizhevsky, and G.E. Hinton. Factored 3-way restricted Boltzmann machines for modeling natural images. In AISTATS, 2010.
[21] N. Heess, C.K.I. Williams, and G.E. Hinton. Learning generative texture models with extended fields-of-experts. In BMVC, 2009.
[22] A. Krizhevsky. Learning multiple layers of features from tiny images. MSc Thesis, Dept. of Comp. Science, Univ. of Toronto, 2009.
[23] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. Lang. Phoneme recognition using time-delay neural networks. IEEE Acoustics Speech and Signal Proc., 37:328–339, 1989.
[24] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[25] T. Tieleman and G.E. Hinton. Using fast weights to improve persistent contrastive divergence. In ICML, 2009.
[26] R.M. Neal. Bayesian Learning for Neural Networks. Springer-Verlag, 1996.
[27] T. Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In ICML, 2008.
[28] http://www.cs.berkeley.edu/projects/vision/grouping/segbench/.
[29] M. Welling, M. Rosen-Zvi, and G.E. Hinton. Exponential family harmoniums with an application to information retrieval. In NIPS, 2005.
[30] http://yann.lecun.com/exdb/mnist/.
[31] H. Lee, R. Grosse, R. Ranganath, and A.Y. Ng.
Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proc. ICML, 2009.
[32] G.E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14:1771–1800, 2002.
[33] A. Torralba, R. Fergus, and W.T. Freeman. 80 million tiny images: A large dataset for non-parametric object and scene recognition. PAMI, 30:1958–1970, 2008.
6 0.62728238 169 nips-2010-More data means less inference: A pseudo-max approach to structured learning
7 0.6233784 151 nips-2010-Learning from Candidate Labeling Sets
8 0.60896194 6 nips-2010-A Discriminative Latent Model of Image Region and Object Tag Correspondence
9 0.59173286 52 nips-2010-Convex Multiple-Instance Learning by Estimating Likelihood Ratio
10 0.59102684 158 nips-2010-Learning via Gaussian Herding
11 0.58845258 99 nips-2010-Gated Softmax Classification
12 0.54697084 228 nips-2010-Reverse Multi-Label Learning
13 0.53128678 287 nips-2010-Worst-Case Linear Discriminant Analysis
14 0.52914828 149 nips-2010-Learning To Count Objects in Images
15 0.52432072 89 nips-2010-Factorized Latent Spaces with Structured Sparsity
16 0.52086139 267 nips-2010-The Multidimensional Wisdom of Crowds
17 0.51816136 86 nips-2010-Exploiting weakly-labeled Web images to improve object classification: a domain adaptation approach
18 0.51376611 82 nips-2010-Evaluation of Rarity of Fingerprints in Forensics
19 0.51186848 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
20 0.50175452 257 nips-2010-Structured Determinantal Point Processes
topicId topicWeight
[(13, 0.026), (17, 0.012), (27, 0.041), (30, 0.034), (35, 0.018), (45, 0.7), (50, 0.021), (52, 0.015), (60, 0.017), (77, 0.026), (90, 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.99947309 235 nips-2010-Self-Paced Learning for Latent Variable Models
Author: M. P. Kumar, Benjamin Packer, Daphne Koller
Abstract: Latent variable models are a powerful tool for addressing several tasks in machine learning. However, the algorithms for learning the parameters of latent variable models are prone to getting stuck in a bad local optimum. To alleviate this problem, we build on the intuition that, rather than considering all samples simultaneously, the algorithm should be presented with the training data in a meaningful order that facilitates learning. The order of the samples is determined by how easy they are. The main challenge is that often we are not provided with a readily computable measure of the easiness of samples. We address this issue by proposing a novel, iterative self-paced learning algorithm where each iteration simultaneously selects easy samples and learns a new parameter vector. The number of samples selected is governed by a weight that is annealed until the entire training data has been considered. We empirically demonstrate that the self-paced learning algorithm outperforms the state of the art method for learning a latent structural SVM on four applications: object localization, noun phrase coreference, motif finding and handwritten digit recognition. 1
2 0.99898517 187 nips-2010-Occlusion Detection and Motion Estimation with Convex Optimization
Author: Alper Ayvaci, Michalis Raptis, Stefano Soatto
Abstract: We tackle the problem of simultaneously detecting occlusions and estimating optical flow. We show that, under standard assumptions of Lambertian reflection and static illumination, the task can be posed as a convex minimization problem. Therefore, the solution, computed using efficient algorithms, is guaranteed to be globally optimal, for any number of independently moving objects, and any number of occlusion layers. We test the proposed algorithm on benchmark datasets, expanded to enable evaluation of occlusion detection performance. 1
3 0.99833316 255 nips-2010-Static Analysis of Binary Executables Using Structural SVMs
Author: Nikos Karampatziakis
Abstract: We cast the problem of identifying basic blocks of code in a binary executable as learning a mapping from a byte sequence to a segmentation of the sequence. In general, inference in segmentation models, such as semi-CRFs, can be cubic in the length of the sequence. By taking advantage of the structure of our problem, we derive a linear-time inference algorithm which makes our approach practical, given that even small programs are tens or hundreds of thousands of bytes long. Furthermore, we introduce two loss functions which are appropriate for our problem and show how to use structural SVMs to optimize the learned mapping for these losses. Finally, we present experimental results that demonstrate the advantages of our method against a strong baseline. 1
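For orientation, the standard semi-Markov segmentation recursion that the abstract contrasts against looks like this. It is a generic sketch with a user-supplied segment scorer and labels omitted for brevity; it is not the paper's specialized linear-time algorithm:

```python
# Generic semi-Markov DP: best[j] is the score of the best segmentation of
# the first j positions; segments are capped at max_len positions.
def best_segmentation(n, max_len, seg_score):
    """seg_score(i, j) scores the segment [i, j); returns (score, segments)."""
    best = [0.0] + [float("-inf")] * n
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):   # candidate segment starts
            s = best[i] + seg_score(i, j)
            if s > best[j]:
                best[j], back[j] = s, i
    segs, j = [], n
    while j > 0:                                  # recover the split points
        segs.append((back[j], j))
        j = back[j]
    return best[n], segs[::-1]
```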
4 0.9977954 167 nips-2010-Mixture of time-warped trajectory models for movement decoding
Author: Elaine Corbett, Eric Perreault, Konrad Koerding
Abstract: Applications of Brain-Machine-Interfaces typically estimate user intent based on biological signals that are under voluntary control. For example, we might want to estimate how a patient with a paralyzed arm wants to move based on residual muscle activity. To solve such problems it is necessary to integrate obtained information over time. To do so, state of the art approaches typically use a probabilistic model of how the state, e.g. position and velocity of the arm, evolves over time – a so-called trajectory model. We wanted to further develop this approach using two intuitive insights: (1) At any given point of time there may be a small set of likely movement targets, potentially identified by the location of objects in the workspace or by gaze information from the user. (2) The user may want to produce movements at varying speeds. We thus use a generative model with a trajectory model incorporating these insights. Approximate inference on that generative model is implemented using a mixture of extended Kalman filters. We find that the resulting algorithm allows us to decode arm movements dramatically better than when we use a trajectory model with linear dynamics.
1 Introduction
When patients have lost a limb or the ability to communicate with the outside world, brain machine interfaces (BMIs) are often used to enable robotic prostheses or restore communication. To achieve this, the user's intended state of the device must be decoded from biological signals. In the context of Bayesian statistics, two aspects are important for the design of an estimator of a temporally evolving state: the observation model, which describes how measured variables relate to the system's state, and the trajectory model, which describes how the state changes over time in a probabilistic manner. Following this logic, many recent BMI applications have relied on Bayesian estimation for a wide range of problems including the decoding of intended human [1] and animal [2] movements. In the context of BMIs, Bayesian approaches offer a principled way of formalizing the uncertainty about signals and thus often result in improvements over other signal processing techniques [1]-[3]. Most work on state estimation in dynamical systems has assumed linear dynamics and Gaussian noise. Under these circumstances, efficient algorithms result from belief propagation. The most frequent application uses the Kalman filter (KF), which recursively combines noisy state observations with the probabilistic evolution of state defined by the trajectory model to estimate the marginal distribution over states [4]. Such approaches have been used widely for applications including upper [1] and lower [5] extremity prosthetic devices, functional electric stimulation [6] and human computer interactions [7]. As these algorithms are so commonly used, it seems promising to develop extensions to nonlinear trajectory models that may better describe the probabilistic distribution of movements in everyday life. One salient departure from the standard assumptions is that people tend to produce both slow and fast movements, depending on the situation. Models with linear dynamics only allow such deviation through the noise term, which makes these models poor at describing the natural variation of movement speeds during real world tasks. Explicitly incorporating movement speed into the trajectory model should lead to better movement estimates. Knowledge of the target position should also strongly affect trajectory models.
After all, we tend to accelerate our arm early during movement and slow down later on. Target information can be linearly incorporated into the trajectory model, and this has greatly improved predictions [8]-[12]. Alternatively, if there are a small number of potential targets then a mixture of trajectory models approach [13] can be used. Here we are interested in the case where available data provide a prior over potential targets but where movement targets may be anywhere. We want to incorporate target uncertainty and allow generalization to novel targets. Prior information about potential targets could come from a number of sources but would generally be noisy. For example, activity in the dorsal premotor cortex provides information about intended target location prior to movement and may be used where such recordings are available [14]. Target information may also be found noninvasively by tracking eye movements. However, such data will generally provide non-zero priors for a number of possible target locations as the subject saccades over the scene. While subjects almost always look at a target before reaching for it [15], there may be a delay of up to a second between looking at the target and the reach – a time interval over which up to 3 saccades are typically made. Each of these fixations could be the target. Hence, a probabilistic distribution of targets is appropriate when using either neural recordings or eye tracking to estimate potential reach targets. Here we present an algorithm that uses a mixture of extended Kalman Filters (EKFs) to combine our insights related to the variation of movement speed and the availability of probabilistic target knowledge. Each of the mixture components allows the speed of the movement to vary continuously over time. We tested how well we could use EMGs and eye movements to decode hand position of humans performing a three-dimensional large workspace reaching task. We find that using a trajectory model that allows for probabilistic target information and variation of speed leads to dramatic improvements in decoding quality.
2 General Decoding Setting
We wanted to test how well different decoding algorithms can decode human movement, over a wide range of dynamics. While many recent studies have looked at more restrictive, two-dimensional movements, a system to restore arm function should produce a wide range of 3D trajectories. We recorded arm kinematics and EMGs of healthy subjects during unconstrained 3D reaches to targets over a large workspace. Two healthy subjects were asked to reach at slow, normal and fast speeds, as they would in everyday life. Subjects were seated as they reached towards 16 LEDs in blocks of 150s, which were located on two planes positioned such that all targets were just reachable (Fig. 1A). The target LED was lit for one second prior to an auditory go cue, at which time the subject would reach to the target at the appropriate speed. Slow, normal and fast reaches were allotted 3s, 1.5s and 1s respectively; however, subjects determined the speed. Approximately 450 reaches in total were performed per subject. The subjects provided informed consent, and the protocol was approved by the Northwestern University Institutional Review Board. EMG signals were measured from the pectoralis major and the three deltoid muscles of the shoulder.
This represents a small subset of the muscles involved in reaching, and approximates those muscles retaining some voluntary control following mid-level cervical spinal cord injuries. The EMG signals were band-pass filtered between 10 and 1,000 Hz, and subsequently anti-alias filtered. Hand, wrist, shoulder and head positions were tracked using an Optotrak motion analysis system. We simultaneously recorded eye movements with an ASL EYETRAC-6 head mounted eye tracker. Approximately 25% of the reaches were assigned to the test set, and the rest were used for training. Reaches for which either the motion capture data was incomplete, or there was visible motion artifact on the EMG, were removed. As the state we used hand positions and joint angles (3 shoulder, 2 elbow; position, velocity and acceleration; 24 dimensions). Joint angles were calculated from the shoulder and wrist marker data using digitized bony landmarks which defined a coordinate system for the upper limb as detailed by Wu et al. [16]. As the motion data were sampled at 60Hz, the mean absolute value of the EMG in the corresponding 16.7ms windows was used as an observation of the state at each time-step. Algorithm accuracy was quantified by normalizing the root-mean-squared error by the straight line distance between the first and final position of the endpoint for each reach. We compared the algorithms statistically using repeated measures ANOVAs with Tukey post-hoc tests, treating reach and subject as random effects. In the rest of the paper we will ask how well these reaching movements can be decoded from EMG and eye-tracking data.
Figure 1: A Experimental setup and B sample kinematics and processed EMGs for one reach.
3 Kalman Filters with Target Information
All models that we consider in this paper assume linear observations with Gaussian noise:

y_t = C x_t + v_t   (1)

where x is the state, y is the observation, C is the observation matrix, and v is the measurement noise with p(v) ~ N(0, R), where R is the observation covariance matrix. The model fitted the measured EMGs with an average r^2 of 0.55. This highlights the need to integrate information over time. The standard approach also assumes linear dynamics and Gaussian process noise:

x_{t+1} = A x_t + w_t   (2)

where x_t represents the hand and joint angle positions, A is the state transition matrix, and w is the process noise with p(w) ~ N(0, Q), where Q is the state covariance matrix. The Kalman filter does optimal inference for this generative model. This model can effectively capture the dynamics of stereotypical reaches to a single target by appropriately tuning its parameters. However, when used to describe reaches to multiple targets, the model cannot describe target-dependent aspects of reaching but boils down to a random drift model. Fast velocities are underestimated as they are unlikely under the trajectory model and there is excessive drift close to the target (Fig. 2A). In many decoding applications we may know the subject's target. A range of recent studies have addressed the issue of incorporating this information into the trajectory model [8, 13], and we might assume the effect of the target on the dynamics to be linear. This naturally suggests adding the target to the state space, which works well in practice [9, 12]. By appending the target to the state vector (KFT), the simple linear format of the KF may be retained:

[x_{t+1}; x^T_{t+1}] = Ã [x_t; x^T_t] + w̃_t   (3)

where x^T_t is the vector of target positions, with dimensionality less than or equal to that of x_t.
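A generic predict/update step for the linear-Gaussian model of eqs. (1)-(2) looks as follows; this is a textbook sketch for orientation, not the authors' implementation. For the KFT of eq. (3) the same recursion simply runs on the augmented state:

```python
# One Kalman filter step: predict with the trajectory model, then update
# with the observation model. All matrices follow eqs. (1)-(2).
import numpy as np

def kalman_step(x, P, y, A, C, Q, R):
    # Predict: x_{t+1} = A x_t + w,  w ~ N(0, Q)
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q
    # Update with observation y_t = C x_t + v,  v ~ N(0, R)
    S = C @ P_pred @ C.T + R                  # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)       # Kalman gain
    x_new = x_pred + K @ (y - C @ x_pred)
    P_new = (np.eye(len(x)) - K @ C) @ P_pred
    return x_new, P_new
```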
This trajectory model thus allows describing both the rapid acceleration that characterizes the beginning of a reach and the stabilization towards its end. We compared the accuracy of the KF and the KFT to the Single Target Model (STM), a KF trained only on reaches to the target being tested (Fig. 2). The STM represents the best possible prediction that could be obtained with a Kalman filter. Assuming the target is perfectly known, we implemented the KFT by correctly initializing the target state x_T at the beginning of the reach. We will relax this assumption below. The initial hand and joint angle positions were also assumed to be known.

Figure 2: A Sample reach and predictions and B average accuracies with standard errors for KFT, KF and STM.

Consistent with the recent literature, both methods that incorporated target information produced higher prediction accuracy than the standard KF (both p<0.0001). Interestingly, there was no significant difference between the KFT and the STM (p=0.9). It seems that when we have knowledge of the target, we do not lose much by training a single model over the whole workspace rather than modeling the targets individually. This is encouraging, as we desire a BMI system that can generalize to any target within the workspace, not just to those available in the training data. Clearly, adding the target to the state space allows the dynamics of typical movements to be modeled effectively, resulting in dramatic increases in decoding performance.

4 Time Warping

4.1 Implementing a time-warped trajectory model

While the KFT above can capture the general reach trajectory profile, it does not allow for natural variability in the speed of movements. Depending on our task objectives, which would not directly be observed by a BMI, we might lazily reach toward a target or move at maximal speed. We aim to change the trajectory model to explicitly incorporate a warping factor by which the average movement speed is scaled, allowing for such variability. As the movement speed will be positive in all practical cases, we model the logarithm of this factor, x_s, and append it to the state vector:

x̃_t = [x_t; x_s],   x_s = log S,   (4)

We create a time-warped trajectory model by noting that if the average rate of a trajectory is to be scaled by a factor S, the position at time t will equal that of the original trajectory at time St. Differentiating, the velocity will be multiplied by S, and the acceleration by S². For simplicity, the trajectory noise is assumed to be additive and Gaussian, and the model is assumed to be stationary: (5) where I_p is the p-dimensional identity matrix and 0_p is a p × p matrix of zeros. Only the terms used to predict the acceleration states need to be estimated to build the state transition matrix, and they are scaled as a nonlinear function of x_s.

After adding the variable movement speed to the state space, the system is no longer linear, so we need a different solution strategy. Instead of the typical KFT we use an extended Kalman filter (EKFT) to implement the nonlinear trajectory model, linearizing the dynamics around the best estimate at each time-step [17]. With this approach we add only a small computational overhead to the KFT recursions.

4.2 Training the time-warping model

The filter parameters were trained using a variant of the Expectation Maximization (EM) algorithm [18]. For extended Kalman filter learning, the initialization of the variables may matter.
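Before detailing the training procedure, the sketch below illustrates the time-warped dynamics of Section 4.1 for a hypothetical one-dimensional state: the speed S = exp(x_s) scales velocity terms by S and acceleration terms by S², and the EKF linearizes this nonlinear transition at each time-step. The block layout of Eq. (5) is assumed rather than copied, and the coefficient a_damp and all constants are illustrative stand-ins for learned parameters.

```python
# Sketch of a time-warped state transition for a toy state
# [pos, vel, acc, x_s], plus the EKF linearization used to handle it.
import numpy as np

dt = 1.0 / 60.0
a_damp = 0.9       # assumed acceleration-prediction coefficient (learned in the paper)

def f(x):
    """Nonlinear transition: S = exp(x_s) scales vel by S and acc by S^2."""
    pos, vel, acc, xs = x
    S = np.exp(xs)
    return np.array([pos + dt * S * vel,
                     vel + dt * S**2 * acc,
                     a_damp * acc,
                     xs])                    # log-speed follows a random walk

def jacobian(x):
    """Linearization of f around x, recomputed by the EKF at each time-step."""
    pos, vel, acc, xs = x
    S = np.exp(xs)
    return np.array([
        [1.0, dt * S, 0.0,        dt * S * vel],
        [0.0, 1.0,    dt * S**2,  2.0 * dt * S**2 * acc],
        [0.0, 0.0,    a_damp,     0.0],
        [0.0, 0.0,    0.0,        1.0]])

# EKF predict step: propagate the mean through f, the covariance through F.
x = np.array([0.0, 0.5, 0.2, np.log(1.5)])   # a reach running 1.5x the average speed
P = np.eye(4) * 1e-2
Q = np.diag([1e-5, 1e-4, 1e-3, 1e-4])        # x_s allowed to drift during prediction
F = jacobian(x)
x_pred = f(x)
P_pred = F @ P @ F.T + Q
print(x_pred, np.diag(P_pred))
```

Because only the transition is nonlinear, and only through the scalar x_s, the extra cost over the standard KFT recursion is essentially one Jacobian evaluation per step.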
S was initialized with the ground-truth average reach speed for each movement, relative to the average speed across all movements. The state transition parameters were estimated using nonlinear least-squares regression, while C, Q and R were estimated linearly for the new system, using the maximum likelihood solution [18] (M-step). For the E-step we used a standard extended Kalman smoother to find the expected values of the states given the current filter parameters. For this computation, and later when testing the algorithm, x_s was initialized to its average value across all reaches, while the remaining states were initialized to their true values. The smoothed estimate for x_s was then used, along with the true values for the other states, to re-estimate the filter parameters in the M-step as before. We alternated between the E and M steps until the log likelihood converged (which it did in all cases). Following the training procedure, the diagonal entry of the state covariance matrix Q corresponding to x_s was set to the variance of the smoothed x_s over all reaches, according to how much this state should be allowed to change during prediction. This allowed the estimate of x_s to develop over the course of the reach due to the evidence provided by the observations, better capturing the dynamics of reaches at different speeds.

4.3 Performance of the time-warped EKFT

Incorporating time warping explicitly into the trajectory model produced a noticeable increase in decoding performance over the KFT. As the speed state x_s is estimated throughout the course of the reach, based on the evidence provided by the observations, the trajectory model has the flexibility to follow the dynamics of the reach more accurately (Fig. 3). While at the normal, self-selected speed the difference between the algorithms is small, for the slow and fast speeds, where the dynamics deviate from the average, there is a clear advantage to the time-warping model.

Figure 3: Hand positions and predictions of the KFT and EKFT for sample reaches at A slow, B normal and C fast speeds. Note the different time scales between reaches.

The models were first trained using data from all speeds (Fig. 4A). The EKFT was 1.8% more accurate on average (p<0.01), and the effect was significant at the slow (1.9%, p<0.05) and fast (2.8%, p<0.01), but not at the normal (p=0.3), speed. We also trained the models using only reaches at the self-selected normal speed, as we wanted to see if there was enough variation to effectively train the EKFT (Fig. 4B). Interestingly, the performance of the EKFT was reduced by only 0.6%, and that of the KFT by 1.1%. The difference in performance between the EKFT and KFT was even more pronounced on average (2.3%, p<0.001), and for the slow and fast speeds (3.6% and 4.1%, both p<0.0001). At the normal speed, the algorithms again were not statistically different (p=0.6). This result demonstrates that the EKFT is a practical option for a real BMI system: it is not necessary to greatly vary the speeds while collecting training data for the model to be effective over a wide range of intended speeds. Explicitly incorporating speed information into the trajectory model helps decoding by modeling the natural variation in volitional speed.

Figure 4: Mean and standard error of EKFT and KFT accuracy at the different subject-selected speeds. Models were trained on reaches at A all speeds and B just normal-speed reaches. Asterisks indicate statistically significant differences between the algorithms.
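As a schematic complement to the training procedure of Section 4.2, the toy example below shows the linear part of the M-step: re-estimating the observation matrix C and noise covariance R by maximum likelihood, given state estimates from the E-step [18]. For brevity it uses point estimates of the states in place of the full smoothed sufficient statistics (which would also include the smoothed covariances), and the data are simulated, so this is a sketch of the idea rather than the paper's implementation.

```python
# Toy M-step: recover C and R from (stand-in) smoothed states.
import numpy as np

rng = np.random.default_rng(0)
T, n_state, n_obs = 500, 4, 2
X = rng.normal(size=(T, n_state))            # stand-in for smoothed states E[x_t]
C_true = rng.normal(size=(n_obs, n_state))
Y = X @ C_true.T + 0.1 * rng.normal(size=(T, n_obs))   # observations (Eq. 1)

# M-step: C = (sum y_t x_t^T)(sum x_t x_t^T)^{-1}, the least-squares solution.
C_hat = np.linalg.solve(X.T @ X, X.T @ Y).T
resid = Y - X @ C_hat.T
R_hat = resid.T @ resid / T                  # ML estimate of observation covariance

print(np.max(np.abs(C_hat - C_true)))        # small: the M-step recovers C
```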
5 Mixtures of Targets

So far, we have assumed that the targets of our reaches are perfectly known. In a real-world system, there will be uncertainty about the intended target of the reach. However, in typical applications there is a small number of possible objectives. Here we address this situation. Drawing on the recent literature, we use a mixture model to consider each of the possible targets [11, 13]. We condition the posterior probability for the state on the N possible targets T_i:

p(x_t | y_{1:t}) = Σ_{i=1}^{N} p(x_t | T_i, y_{1:t}) p(T_i | y_{1:t}),   (6)

Using Bayes' rule, the target posterior becomes:

p(T_i | y_{1:t}) ∝ p(y_{1:t} | T_i) p(T_i),   (7)

As we are dealing with a mixture model, we perform the Kalman filter recursion for each possible target x_T, and our solution is a weighted sum of the outputs. The weights are proportional to the prior for that target, p(T_i), and the likelihood of the model given that target, p(y_{1:t} | T_i). The evidence p(y_{1:t}) is independent of the target and does not need to be calculated.

We tested mixtures of both algorithms, the mKFT and mEKFT, with real uncertain priors obtained from eye tracking in the one-second period preceding movement. As the targets were situated on two planes, the three-dimensional location of the eye gaze was found by projecting its direction onto those planes. The first, middle and last eye samples were selected, and all other samples were assigned to a group according to which of the three was closest. The mean and variance of these three groups were used to initialize three Kalman filters in the mixture model. The priors of the three groups were assigned proportional to the number of samples in them. If the subject looks at multiple positions prior to reaching, this method ensures with high probability that the correct target is accounted for in one of the filters in the mixture.

We also compared against the MTM approach of Yu et al. [13], where a different KF model is generated for each target, and a mixture is performed over these models. This approach explicitly captures the dynamics of stereotypical reaches to specific targets. Given perfect target information, it would reduce to the STM described above. Priors for the MTM were found by assigning each valid eye sample to its closest two targets, and weighting the models proportional to the number of samples assigned to the corresponding target, divided by its distance from the mean of those samples. We tried other ways of assigning priors, and the one presented gave the best results.

We calculated the reduction in decoding quality when, instead of perfect priors, we provide eye-movement-based noisy priors (Fig. 5). The accuracies of the mEKFT, the mKFT and the MTM were degraded by only 0.8, 1.9 and 2.1% respectively, compared to the perfect-prior situation. The mEKFT was still close to 10% better than the KF. The mixture model framework is thus effective in accounting for uncertain priors.

Figure 5: Mean and standard errors of accuracy for algorithms with perfect priors, and uncertain priors with full and partial training set. The asterisk indicates a statistically significant effect between the two training types, where real priors are used.

Here, only reaches at normal speed were used to train the models, as this is a more realistic training set for a BMI application. This accounts for the degraded performance of the MTM with perfect priors relative to the STM from above (Fig. 2). With even more stereotyped training data for each target, the MTM does not generalize as well to new speeds.
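A minimal sketch of the mixture combination in Eqs. (6)-(7): one filter is run per candidate target, and each filter's output is weighted by prior × likelihood. The priors, state estimates and log-likelihoods below are fabricated placeholders; in the real system they would come from the eye-tracking groups and from the N parallel (E)KF recursions.

```python
# Sketch of mixture-over-targets weighting (Eqs. 6-7).
import numpy as np

# Per-target priors from eye tracking (e.g., proportional to fixation counts).
priors = np.array([0.5, 0.3, 0.2])

# Per-target filter state estimates and cumulative observation
# log-likelihoods (placeholders for the N parallel filter outputs).
x_hat = np.array([[0.9, 0.1],
                  [0.4, 0.5],
                  [0.1, 0.8]])
log_lik = np.array([-10.2, -14.7, -13.1])

# Posterior weights: p(T_i | y) ∝ p(y | T_i) p(T_i); the evidence p(y)
# cancels in the normalization. Log-sum-exp for numerical stability.
log_w = np.log(priors) + log_lik
w = np.exp(log_w - log_w.max())
w /= w.sum()

x_mixed = w @ x_hat        # weighted sum of the per-target outputs (Eq. 6)
print(w, x_mixed)
```

Working in log space is a standard precaution here: cumulative likelihoods over a full reach can underflow if multiplied directly.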
We also wanted to know if the algorithms could generalize to new targets. In a real application, the available training data will generally not span the entire usable workspace. We therefore compared the algorithms when reaches to all targets except the one being tested had been used to train the models. The performance of the MTM was significantly degraded, unsurprisingly, as it was designed for reaches to a set of known targets. Performance of the mKFT and mEKFT degraded by about 1%, but not significantly (both p>0.7), demonstrating that the continuous approach to target information is preferable when the target could be anywhere in space, not just at locations for which training data is available.

6 Discussion and conclusions

The goal of this work was to design a trajectory model that would improve decoding for BMIs, with an application to reaching. We incorporated two features that prominently influence the dynamics of natural reaches: the movement speed and the target location. Our approach is appropriate where uncertain target information is available. The model generalizes well to new regions of the workspace for which there is no training data, and across a broad range of reaching dynamics to widely spaced targets in three dimensions.

The advantages over linear models in decoding precision that we report here could, in principle, be obtained using mixtures over many targets and speeds. While mixture models [11, 13] could allow for slow versus fast movements and any number of potential targets, this strategy will generally require many mixture components. Such an approach would require much more training data, as we have shown that it does not generalize well. It would also be run-time intensive, which is problematic for prosthetic devices that rely on low-power controllers. In contrast, the algorithm introduced here adds only a small amount of run-time in comparison to the standard KF approach: the EKF is only marginally slower than the standard KF, and the algorithm will not generally need to consider more than 3 mixture components, assuming the subject fixates the target within the second preceding the reach.

In this paper we assumed that subjects would always fixate a reach target – along with other non-targets. While this is close to the way humans usually coordinate eyes and reaches [15], there might be cases where people do not fixate a reach target. Our approach could easily be extended to deal with such situations by adding a dummy mixture component that allows the description of movements to any target. As an alternative to mixture approaches, a system can explicitly estimate the target position in the state vector [9]. This approach, however, would not straightforwardly allow for the rich target information available: we look at the target but also at other locations, strongly suggesting mixture distributions. A combination of the two approaches could further improve decoding quality: we could both estimate speed and target position for the EKFT in a continuous manner while retaining the mixture over target priors.

We believe that the issues we have addressed here are almost universal. Virtually all types of movements are executed at varying speed. A probabilistic distribution over a small number of action candidates may also be expected in most BMI applications – after all, there are usually only a small number of actions that make sense in a given environment.
While this work is presented in the context of decoding human reaching, it may be applied to a wide range of BMI applications, including lower-limb prosthetic devices and human-computer interaction, as well as to different signal sources such as electrode grid recordings and electroencephalograms. The increased control users would have in conveying their intended movements would significantly improve the functionality of a neuroprosthetic device.

Acknowledgements

The authors thank T. Haswell, E. Krepkovich, and V. Ravichandran for assistance with experiments. This work was funded by the NSF Program in Cyber-Physical Systems.

References

[1] L.R. Hochberg, M.D. Serruya, G.M. Friehs, J.A. Mukand, M. Saleh, A.H. Caplan, A. Branner, D. Chen, R.D. Penn, and J.P. Donoghue, “Neuronal ensemble control of prosthetic devices by a human with tetraplegia,” Nature, vol. 442, 2006, pp. 164–171.
[2] W. Wu, Y. Gao, E. Bienenstock, J.P. Donoghue, and M.J. Black, “Bayesian population decoding of motor cortical activity using a Kalman filter,” Neural Computation, vol. 18, 2006, pp. 80–118.
[3] W. Wu, M.J. Black, Y. Gao, E. Bienenstock, M. Serruya, A. Shaikhouni, and J.P. Donoghue, “Neural decoding of cursor motion using a Kalman filter,” Advances in Neural Information Processing Systems 15, 2003, p. 133.
[4] R.E. Kalman, “A new approach to linear filtering and prediction problems,” Journal of Basic Engineering, vol. 82, 1960, pp. 35–45.
[5] G.G. Scandaroli, G.A. Borges, J.Y. Ishihara, M.H. Terra, A.F.D. Rocha, and F.A.D.O. Nascimento, “Estimation of foot orientation with respect to ground for an above knee robotic prosthesis,” Proceedings of the 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, St. Louis, MO, USA: IEEE Press, 2009, pp. 1112–1117.
[6] I. Cikajlo, Z. Matjačić, and T. Bajd, “Efficient FES triggering applying Kalman filter during sensory supported treadmill walking,” Journal of Medical Engineering & Technology, vol. 32, 2008, pp. 133–144.
[7] S. Kim, J.D. Simeral, L.R. Hochberg, J.P. Donoghue, and M.J. Black, “Neural control of computer cursor velocity by decoding motor cortical spiking activity in humans with tetraplegia,” Journal of Neural Engineering, vol. 5, 2008, pp. 455–476.
[8] L. Srinivasan, U.T. Eden, A.S. Willsky, and E.N. Brown, “A state-space analysis for reconstruction of goal-directed movements using neural signals,” Neural Computation, vol. 18, 2006, pp. 2465–2494.
[9] G.H. Mulliken, S. Musallam, and R.A. Andersen, “Decoding trajectories from posterior parietal cortex ensembles,” Journal of Neuroscience, vol. 28, 2008, p. 12913.
[10] W. Wu, J.E. Kulkarni, N.G. Hatsopoulos, and L. Paninski, “Neural decoding of hand motion using a linear state-space model with hidden states,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 17, 2009, p. 1.
[11] J.E. Kulkarni and L. Paninski, “State-space decoding of goal-directed movements,” IEEE Signal Processing Magazine, vol. 25, 2008, p. 78.
[12] C. Kemere and T. Meng, “Optimal estimation of feed-forward-controlled linear systems,” IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'05), 2005.
[13] B.M. Yu, C. Kemere, G. Santhanam, A. Afshar, S.I. Ryu, T.H. Meng, M. Sahani, and K.V. Shenoy, “Mixture of trajectory models for neural decoding of goal-directed movements,” Journal of Neurophysiology, vol. 97, 2007, p. 3763.
[14] N. Hatsopoulos, J.
Joshi, and J.G. O'Leary, “Decoding continuous and discrete motor behaviors using motor and premotor cortical ensembles,” Journal of Neurophysiology, vol. 92, 2004, p. 1165.
[15] R.S. Johansson, G. Westling, A. Backstrom, and J.R. Flanagan, “Eye-hand coordination in object manipulation,” Journal of Neuroscience, vol. 21, 2001, p. 6917.
[16] G. Wu, F.C. van der Helm, H.E.J. Veeger, M. Makhsous, P. Van Roy, C. Anglin, J. Nagels, A.R. Karduna, and K. McQuade, “ISB recommendation on definitions of joint coordinate systems of various joints for the reporting of human joint motion – Part II: shoulder, elbow, wrist and hand,” Journal of Biomechanics, vol. 38, 2005, pp. 981–992.
[17] D. Simon, Optimal State Estimation: Kalman, H-infinity, and Nonlinear Approaches, John Wiley and Sons, 2006.
[18] Z. Ghahramani and G.E. Hinton, “Parameter estimation for linear dynamical systems,” University of Toronto Technical Report CRG-TR-96-2, 1996.