nips-2005-184: Structured Prediction via the Extragradient Method

Author: Ben Taskar, Simon Lacoste-Julien, Michael I. Jordan

Abstract: We present a simple and scalable algorithm for large-margin estimation of structured models, including an important class of Markov networks and combinatorial models. We formulate the estimation problem as a convex-concave saddle-point problem and apply the extragradient method, yielding an algorithm with linear convergence using simple gradient and projection calculations. The projection step can be solved using combinatorial algorithms for min-cost quadratic flow. This makes the approach an efficient alternative to formulations based on reductions to a quadratic program (QP). We present experiments on two very different structured prediction tasks: 3D image segmentation and word alignment, illustrating the favorable scaling properties of our algorithm.
1 Introduction

The scope of discriminative learning methods has been expanding to encompass prediction tasks with increasingly complex structure. Much of this recent development builds upon graphical models to capture sequential, spatial, recursive or relational structure, but as we will discuss in this paper, the structured prediction problem is broader still. For the broader class of models that we consider here, the conditional likelihood approach is intractable, but the large-margin formulation yields tractable convex problems. We interpret the term structured output model very broadly, as a compact scoring scheme over a (possibly very large) set of combinatorial structures and a method for finding the highest-scoring structure. In graphical models, the scoring scheme is embodied in a probability distribution over possible assignments of the prediction variables as a function of input variables. In models based on combinatorial problems, the scoring scheme is usually a simple sum of weights associated with vertices, edges, or other components of a structure; these weights are often represented as parametric functions of a set of features.
Given training instances labeled by desired structured outputs (e.g., matchings) and a set of features that parameterize the scoring function, the learning problem is to find parameters such that the highest-scoring outputs are as close as possible to the desired outputs. Examples of prediction tasks solved via combinatorial optimization problems include bipartite and non-bipartite matching in alignment of 2D shapes [5], word alignment in natural language translation [14], and disulfide connectivity prediction for proteins [3].
There are also interesting subfamilies of graphical models for which large-margin methods are tractable whereas likelihood-based methods are not; an example is the class of Markov random fields with restricted potentials used for object segmentation in vision [12, 2]. Examples of this approach in the structured prediction setting include the Structured Sequential Minimal Optimization algorithm [20, 18] and the Structured Exponentiated Gradient algorithm [4]. Such methods do not extend to models in which dynamic programming is not applicable, for example, to problems such as matchings and min-cuts. In this paper, we present an estimation methodology for structured prediction problems that does not require a general-purpose QP solver. Moreover, we show that the key computational step in these methods, a certain projection operation, inherits the favorable computational complexity of the underlying optimization problem. In particular, for matchings and min-cuts, projection involves a min-cost quadratic flow computation, a problem for which efficient, highly specialized algorithms are available. We illustrate the effectiveness of this approach on two very different large-scale structured prediction tasks: 3D image segmentation and word alignment in translation.
2 Structured models

We begin by discussing two special cases of the general framework that we subsequently present: (1) a class of Markov networks used for segmentation, and (2) a bipartite matching model for word alignment. Given binary variables y = {y1, . . . , yN} and pairwise potentials, we define a joint distribution over {0, 1}^N via P(y) ∝ ∏_{j∈V} φj(yj) ∏_{jk∈E} φjk(yj, yk), where (V, E) is an undirected graph, {φj(yj) : j ∈ V} are the node potentials, and {φjk(yj, yk) : jk ∈ E} are the edge potentials. In image segmentation (Fig. 1(a)), the node potentials capture local evidence about the label of a pixel or laser scan point.
Figure 1: Examples of structured prediction applications: (a) articulated object segmentation and (b) word alignment in machine translation (the aligned example sentence in panel (b) begins "What is the anticipated cost of collecting fees under the new proposal?").

Assuming that such correlations tend to be positive (connected nodes tend to have the same label), we restrict the form of edge potentials to φjk(yj, yk) = exp{−sjk 1(yj ≠ yk)}, where sjk is a non-negative penalty for assigning yj and yk different labels. Expressing node potentials as φj(yj) = exp{sj yj}, we have P(y) ∝ exp{ Σ_{j∈V} sj yj − Σ_{jk∈E} sjk 1(yj ≠ yk) }.
Under this restriction of the potentials, it is known that the problem of computing the maximizing assignment, y* = arg max P(y | x), has a tractable formulation as a min-cut problem [7]. In particular, we obtain the following LP:

max_{0≤z≤1} Σ_{j∈V} sj zj − Σ_{jk∈E} sjk zjk s.t. zj − zk ≤ zjk, zk − zj ≤ zjk, ∀jk ∈ E. (1)

In this LP, a continuous variable zj is a relaxation of the binary variable yj. Note that the constraints are equivalent to |zj − zk| ≤ zjk. Because sjk is positive, zjk = |zk − zj| at the maximum, which equals 1(zj ≠ zk) if the zj, zk variables are binary.
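To make the objective in Eq. (1) concrete, the sketch below enumerates every binary labeling of a tiny graph and maximizes Σ_j sj yj − Σ_{jk} sjk 1(yj ≠ yk) directly. This is only an illustration: for real instances the arg max is computed by min-cut or the LP relaxation, not by enumeration, and the function name and data layout here are invented for the example.

```python
from itertools import product

def best_labeling(s_node, s_edge):
    """Exhaustively maximize sum_j s_j y_j - sum_{jk} s_jk 1[y_j != y_k]
    over binary labelings y of a tiny graph.  s_edge maps an edge (j, k)
    to its non-negative disagreement penalty s_jk."""
    n = len(s_node)
    best_val, best_y = float("-inf"), None
    for y in product([0, 1], repeat=n):
        val = sum(s * yj for s, yj in zip(s_node, y))
        # subtract the penalty of every edge whose endpoints disagree
        val -= sum(s for (j, k), s in s_edge.items() if y[j] != y[k])
        if val > best_val:
            best_val, best_y = val, y
    return best_val, best_y
```

With a weak disagreement penalty the nodes can take different labels; a strong positive coupling forces connected nodes to agree, mirroring the smoothing role of sjk.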
We can parametrize the node and edge weights sj and sjk in terms of user-provided features xj and xjk associated with the nodes and edges. In particular, in 3D range data, xj might be spin-image features or spatial occupancy histograms of a point j, while xjk might include the distance between points j and k, the dot product of their normals, etc. The simplest model of dependence is a linear combination of features: sj = wn · fn(xj) and sjk = we · fe(xjk), where wn and we are node and edge parameters, and fn and fe are node and edge feature mappings, of dimension dn and de, respectively. To ensure non-negativity of sjk, we assume the edge features fe to be non-negative and restrict we ≥ 0. We abbreviate the score assigned to a labeling y for an input x as w · f(x, y) = Σ_{j∈V} yj wn · fn(xj) − Σ_{jk∈E} yjk we · fe(xjk), where yjk = 1(yj ≠ yk).
Consider modeling the task of word alignment of parallel bilingual sentences (see Fig. 1(b)) as a maximum-weight bipartite matching problem, where the nodes V = V^s ∪ V^t correspond to the words in the "source" sentence (V^s) and the "target" sentence (V^t), and the edges E = {jk : j ∈ V^s, k ∈ V^t} correspond to possible alignments between them. For simplicity, assume that each word aligns to one or zero words in the other sentence. The edge weight sjk represents the degree to which word j in one sentence can translate into word k in the other sentence. Our objective is to find an alignment that maximizes the sum of edge scores. We represent a matching using a set of binary variables yjk that are set to 1 if word j is assigned to word k in the other sentence, and 0 otherwise. The score of an assignment is the sum of edge scores: s(y) = Σ_{jk∈E} sjk yjk.
The maximum-weight bipartite matching problem, arg max_{y∈Y} s(y), can be found by solving the following LP:

max_{0≤z≤1} Σ_{jk∈E} sjk zjk s.t. Σ_{j∈V^s} zjk ≤ 1, ∀k ∈ V^t; Σ_{k∈V^t} zjk ≤ 1, ∀j ∈ V^s, (2)

where again the continuous variables zjk correspond to the relaxation of the binary variables yjk.
For word alignment, the scores sjk can be defined in terms of the word pair jk and input features associated with xjk. We let sjk = w · f(xjk) for some user-provided feature mapping f and abbreviate w · f(x, y) = Σ_{jk∈E} yjk w · f(xjk).
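As a sanity check of the matching model, the sketch below enumerates all partial one-to-one assignments of source words to target words and returns the arg max of s(y) = Σ_{jk} sjk yjk. In practice the LP (2) or a combinatorial matching algorithm replaces this enumeration; the function name and representation are invented for the example.

```python
from itertools import product

def best_matching(s):
    """Brute-force arg max of s(y) = sum_{jk} s_jk y_jk over matchings
    in which each source word aligns to at most one target word and
    vice versa; s[j][k] is the score of aligning source j to target k."""
    n_src, n_tgt = len(s), len(s[0])
    best_score, best_assign = float("-inf"), None
    # assign[j] is the target index for source word j, or None (unaligned)
    for assign in product([None] + list(range(n_tgt)), repeat=n_src):
        used = [k for k in assign if k is not None]
        if len(used) != len(set(used)):
            continue  # some target word was matched twice
        score = sum(s[j][k] for j, k in enumerate(assign) if k is not None)
        if score > best_score:
            best_score, best_assign = score, assign
    return best_score, best_assign
```

Because words may stay unaligned, a score matrix with only negative entries yields the empty matching with score 0.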
More generally, we consider prediction problems in which the input x ∈ X is an arbitrary structured object and the output is a vector of values y = (y1, . . . ). In our word alignment example, the output space is defined by the length of the two sentences. Consider the class of structured prediction models H defined by the linear family: h_w(x) = arg max_{y∈Y(x)} w · f(x, y), where f(x, y) is a vector of functions f : X × Y → R^n. Below, we specialize to the class of models in which the arg max problem can be solved in polynomial time using linear programming (and, more generally, convex optimization); this is still a very large class of models.
3 Max-margin estimation

We assume a set of training instances S = {(xi, yi)}_{i=1..m}, where each instance consists of a structured object xi (such as a graph) and a target solution yi (such as a matching). However, computing the partition function Z_w(x) is #P-complete [23, 10] for the two structured prediction problems we presented above, matchings and min-cuts. Instead, we adopt the max-margin formulation of [20], which directly seeks to find parameters w such that: yi = arg max_{y'i ∈ Yi} w · f(xi, y'i), ∀i, where Yi = Y(xi) and y'i denotes the appropriate vector of variables for example i. The solution space Yi depends on the structured object xi; for example, the space of possible matchings depends on the precise set of nodes and edges in the graph. As in univariate prediction, we measure the error of prediction using a loss function ℓ(yi, y'i). To obtain a convex formulation, we upper bound the loss ℓ(yi, h_w(xi)) using the hinge function: max_{y'i ∈ Yi} [w · fi(y'i) + ℓi(y'i)] − w · fi(yi), where ℓi(y'i) = ℓ(yi, y'i) and fi(y'i) = f(xi, y'i). Minimizing this upper bound will force the true structure yi to be optimal with respect to w for each instance i.
We add a standard L2 weight penalty ||w||²/(2C):

min_{w∈W} ||w||²/(2C) + Σ_i ( max_{y'i ∈ Yi} [w · fi(y'i) + ℓi(y'i)] − w · fi(yi) ), (3)

where C is a regularization parameter and W is the space of allowed weights (for example, W = R^n or W = R^n_+). Note that this formulation is equivalent to the standard formulation using slack variables ξ and slack penalty C presented in [20, 19].
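For intuition about the hinge upper bound in Eq. (3), it can be evaluated directly when the output set is small enough to list explicitly; in real structured problems the inner max is exactly what loss-augmented inference computes combinatorially. The helper below is a hypothetical sketch, not the paper's implementation.

```python
def structured_hinge(w, feats, losses, true_idx):
    """max_y [w.f(y) + loss(y)] - w.f(y_true) over an explicitly listed
    output set; feats[i] is the feature vector of output i and losses[i]
    its loss against the true output (losses[true_idx] == 0)."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    # loss-augmented score of the best competing output
    augmented = max(dot(w, f) + l for f, l in zip(feats, losses))
    return augmented - dot(w, feats[true_idx])
```

The hinge is zero only when the true output beats every competitor by a margin at least as large as that competitor's loss, which is the margin condition the objective enforces.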
The key to solving Eq. (3) efficiently is the loss-augmented inference problem, max_{y'i ∈ Yi} [w · fi(y'i) + ℓi(y'i)]. This optimization problem has precisely the same form as the prediction problem whose parameters we are trying to learn, max_{y'i ∈ Yi} w · fi(y'i), but with an additional term corresponding to the loss function. Tractability of the loss-augmented inference thus depends not only on the tractability of max_{y'i ∈ Yi} w · fi(y'i), but also on the form of the loss term ℓi(y'i). A natural choice in this regard is the Hamming distance, which simply counts the number of variables in which a candidate solution y'i differs from the target output yi. In general, we need only assume that the loss function decomposes over the variables in yi.
For example, in the case of bipartite matchings the Hamming loss counts the number of edges that differ between the matchings yi and y'i and can be written as: ℓ_i^H(y'i) = Σ_{jk} yi,jk + Σ_{jk} (1 − 2 yi,jk) y'i,jk. The loss-augmented inference problem can then be written as an LP similar to Eq. (2) (without the constant term Σ_{jk} yi,jk):

max_{0≤z≤1} Σ_{jk} zi,jk [w · f(xi,jk) + 1 − 2 yi,jk] s.t. Σ_{j∈V^s} zi,jk ≤ 1, ∀k; Σ_{k∈V^t} zi,jk ≤ 1, ∀j.
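The decomposition above works because Σ_{jk} [yi,jk + (1 − 2 yi,jk) y'i,jk] counts exactly the edges on which yi and y'i disagree, so the loss contributes a per-edge constant to the augmented edge scores. A quick sketch checking the identity:

```python
def decomposed_hamming(y_true, y_pred):
    """sum_jk y_jk + (1 - 2 y_jk) y'_jk, the edge-decomposable form of
    the Hamming loss between two 0/1 edge-indicator vectors."""
    return sum(t + (1 - 2 * t) * p for t, p in zip(y_true, y_pred))

def direct_hamming(y_true, y_pred):
    """Number of edges on which the two matchings disagree."""
    return sum(t != p for t, p in zip(y_true, y_pred))
```

The two functions agree on every pair of 0/1 vectors, which is what lets the loss term fold into the linear objective of the augmented LP.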
Instead we take a different tack here, posing the problem in its natural saddle-point form:

min_{w∈W} max_{z∈Z} ||w||²/(2C) + Σ_i ( w · Fi zi + ci · zi − w · fi(yi) ).

We let L(w, z) = ||w||²/(2C) + Σ_i ( w · Fi zi + ci · zi − w · fi(yi) ), with gradients given by: ∇_w L(w, z) = w/C + Σ_i ( Fi zi − fi(yi) ) and ∇_{zi} L(w, z) = Fi⊤ w + ci. We denote the projection of a vector z'i onto Zi as π_{Zi}(z'i) = arg min_{zi∈Zi} ||z'i − zi|| and, similarly, the projection onto W as π_W(w') = arg min_{w∈W} ||w' − w||.
A well-known solution strategy for saddle-point optimization is provided by the extragradient method [11]. The algorithm starts with a feasible point w = 0, zi's that correspond to the assignments yi's, and step size β = 1. If r is greater than a threshold ν, the step size is decreased using an Armijo-type rule, β = (2/3) β min(1, 1/r), and a new prediction step is computed until r ≤ ν, where ν ∈ (0, 1) is a parameter of the algorithm. The key step influencing the efficiency of the algorithm is the Euclidean projection onto the feasible sets W and Zi.
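The prediction/correction structure of the extragradient method can be illustrated on a small bilinear saddle-point problem, min_x max_y x·Ay over probability simplices (a matrix game, not the learning objective above, and with a fixed step size β in place of the Armijo-type rule). Each iteration takes a gradient step from the current point, then repeats the step using the gradient evaluated at the predicted point, projecting back onto the feasible set after each step; all names here are illustrative.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    cssv = np.cumsum(u) - 1.0
    rho = np.nonzero(u - cssv / np.arange(1, len(v) + 1) > 0)[0][-1]
    return np.maximum(v - cssv[rho] / (rho + 1.0), 0.0)

def extragradient_game(A, steps=2000, beta=0.1):
    """Extragradient for min_x max_y x'Ay over two simplices."""
    n, m = A.shape
    x, y = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    x = project_simplex(x + 0.3 * np.eye(n)[0])  # perturbed start
    for _ in range(steps):
        # prediction step using gradients at the current point
        xp = project_simplex(x - beta * A @ y)
        yp = project_simplex(y + beta * A.T @ x)
        # correction step using gradients at the predicted point
        x = project_simplex(x - beta * A @ yp)
        y = project_simplex(y + beta * A.T @ xp)
    return x, y
```

On the rock-paper-scissors payoff matrix the iterates approach the uniform equilibrium, whereas plain simultaneous gradient steps would cycle; the correction step is what damps the rotation.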
In the case of word alignment, Zi is the convex hull of bipartite matchings and the problem reduces to the much-studied minimum-cost quadratic flow problem. The projection ẑi = π_{Zi}(z'i) is given by:

min_{0≤z≤1} Σ_{jk} (1/2)(zi,jk − z'i,jk)² s.t. Σ_{j∈V^s} zi,jk ≤ 1, ∀k; Σ_{k∈V^t} zi,jk ≤ 1, ∀j.

We use a standard reduction of bipartite matching to min-cost flow by introducing a source node s linked to all the nodes in Vi^s (words in the "source" sentence), and a sink node t linked from all the nodes in Vi^t (words in the "target" sentence), using edges of capacity 1 and cost 0. The original edges jk have a quadratic cost (1/2)(zi,jk − z'i,jk)² and capacity 1. The minimum (quadratic) cost flow from s to t is the projection of z'i onto Zi.
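As a generic stand-in for the min-cost quadratic flow routine (which is what gives the stated running time), the Euclidean projection onto the relaxed matching polytope {z : 0 ≤ z ≤ 1, Σ_j zjk ≤ 1, Σ_k zjk ≤ 1} can also be sketched with Dykstra's alternating-projection algorithm, since each constraint group admits a closed-form projection. This is an assumption-laden illustration, not the algorithm used in the paper.

```python
import numpy as np

def project_matching_polytope(z0, cycles=500):
    """Dykstra's algorithm for the Euclidean projection of z0 onto
    {z : 0 <= z <= 1, each row sum <= 1, each column sum <= 1}."""
    def proj_box(z):
        return np.clip(z, 0.0, 1.0)

    def proj_rows(z):  # project each row onto its halfspace {sum <= 1}
        z = z.copy()
        for j in range(z.shape[0]):
            s = z[j].sum()
            if s > 1.0:
                z[j] -= (s - 1.0) / z.shape[1]
        return z

    def proj_cols(z):  # same for columns
        return proj_rows(z.T).T

    projections = [proj_box, proj_rows, proj_cols]
    increments = [np.zeros_like(z0, dtype=float) for _ in projections]
    z = z0.astype(float)
    for _ in range(cycles):
        for i, proj in enumerate(projections):
            y = proj(z + increments[i])
            increments[i] = z + increments[i] - y
            z = y
    return z
```

Unlike plain alternating projections, Dykstra's increments make the limit the exact Euclidean projection onto the intersection, which is what the extragradient step requires.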
The reduction of the projection to minimum quadratic cost flow for the min-cut polytope Zi is shown in the longer version of the paper. In the case of word alignment, the running time scales with the cube of the sentence length.
5 Experiments

We investigate the two structured models described above: bipartite matchings for word alignments and restricted-potential Markov nets for 3D segmentation. We compared the extragradient method with the averaged perceptron algorithm [6]. We use the same training regime for the extragradient by running it with C = ∞. We test our algorithm on a 3D scan segmentation problem using the class of Markov networks with potentials that were described above.
Figure 2: Both plots show test error for the averaged perceptron and the extragradient (left y-axis) and training loss per node or edge for the extragradient (right y-axis) versus number of iterations for (a) the object segmentation task and (b) the word alignment task.
Fig. 2(a) shows that the extragradient has a consistently lower error rate (about 3% for the extragradient, 4% for the averaged perceptron), using only slightly more expensive computations per iteration. Also shown is the corresponding decrease in the hinge-loss upper bound on the training data as the extragradient progresses.
We also tested our learning algorithm on word-level alignment using a data set from the 2003 NAACL shared task [15], the English-French Hansards task. The corpus consists of 1.1M automatically aligned sentences, and comes with a validation set of 39 sentence pairs and a test set of 447 sentences. Using these alignments, the alignment error rate (AER) is calculated as: AER(A, S, P) = 1 − (|A∩S| + |A∩P|) / (|A| + |S|).
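Concretely, with A the set of predicted alignment links, S the "sure" gold links, and P ⊇ S the "possible" gold links (so that the denominator is |A| + |S|), AER can be computed as:

```python
def aer(predicted, sure, possible):
    """Alignment error rate: 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|),
    where links are (source index, target index) pairs."""
    A, S, P = set(predicted), set(sure), set(possible)
    return 1.0 - (len(A & S) + len(A & P)) / (len(A) + len(S))
```

A prediction whose links are all possible and that recovers every sure link attains AER 0, while a prediction sharing no links with the gold standard attains AER 1.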
We used the intersection of the predictions of the English-to-French and French-to-English IBM Model 4 alignments (using GIZA++ [16]) on the first 5000 sentence pairs from the 1.1M sentences. The number of edges for 5000 sentences was about 555,000. The features on the word pair (ej, fk) include measures of association, orthography, relative position, and predictions of generative models (see [22] for details). Fig. 2(b) shows the extragradient performing slightly better than the averaged perceptron (by a fraction of a percent in AER).
6 Conclusion

We have presented a general solution strategy for large-scale structured prediction problems. We have shown that these problems can be formulated as saddle-point optimization problems that are amenable to solution by the extragradient algorithm. Key to our approach is the recognition that the projection step in the extragradient algorithm can be solved by network flow algorithms.
[11] G. M. Korpelevich. The extragradient method for finding saddle points and other problems. Matecon, 12:747–756, 1976.
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 184 nips-2005-Structured Prediction via the Extragradient Method
Author: Ben Taskar, Simon Lacoste-Julien, Michael I. Jordan
2 0.19622844 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables
Author: Y. Altun, D. McAllester, M. Belkin
Abstract: Many real-world classification problems involve the prediction of multiple inter-dependent variables forming some structural dependency. Recent progress in machine learning has mainly focused on supervised classification of such structured variables. In this paper, we investigate structured classification in a semi-supervised setting. We present a discriminative approach that utilizes the intrinsic geometry of input patterns revealed by unlabeled data points and we derive a maximum-margin formulation of semi-supervised learning for structured variables. Unlike transductive algorithms, our formulation naturally extends to new test points.
3 0.15115528 161 nips-2005-Radial Basis Function Network for Multi-task Learning
Author: Xuejun Liao, Lawrence Carin
Abstract: We extend radial basis function (RBF) networks to the scenario in which multiple correlated tasks are learned simultaneously, and present the corresponding learning algorithms. We develop the algorithms for learning the network structure, in either a supervised or unsupervised manner. Training data may also be actively selected to improve the network's generalization to test data. Experimental results based on real data demonstrate the advantage of the proposed algorithms and support our conclusions.
4 0.1407674 100 nips-2005-Interpolating between types and tokens by estimating power-law generators
Author: Sharon Goldwater, Mark Johnson, Thomas L. Griffiths
Abstract: Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that generically produce power-laws, augmenting standard generative models with an adaptor that produces the appropriate pattern of token frequencies. We show that taking a particular stochastic process – the Pitman-Yor process – as an adaptor justifies the appearance of type frequencies in formal analyses of natural language, and improves the performance of a model for unsupervised learning of morphology.
5 0.11487181 83 nips-2005-Generalization error bounds for classifiers trained with interdependent data
Author: Nicolas Usunier, Massih-reza Amini, Patrick Gallinari
Abstract: In this paper we propose a general framework to study the generalization properties of binary classifiers trained with data which may be dependent, but are deterministically generated upon a sample of independent examples. It provides generalization bounds for binary classification and some cases of ranking problems, and clarifies the relationship between these learning tasks.
6 0.094731219 170 nips-2005-Scaling Laws in Natural Scenes and the Inference of 3D Shape
7 0.089054614 114 nips-2005-Learning Rankings via Convex Hull Separation
8 0.087044194 185 nips-2005-Subsequence Kernels for Relation Extraction
9 0.086989082 121 nips-2005-Location-based activity recognition
10 0.085902892 108 nips-2005-Layered Dynamic Textures
11 0.08477211 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm
12 0.082799703 50 nips-2005-Convex Neural Networks
13 0.080597147 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
14 0.07905519 44 nips-2005-Computing the Solution Path for the Regularized Support Vector Regression
15 0.078883164 171 nips-2005-Searching for Character Models
16 0.074920654 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning
17 0.073838919 79 nips-2005-Fusion of Similarity Data in Clustering
18 0.0724658 48 nips-2005-Context as Filtering
19 0.072434559 139 nips-2005-Non-iterative Estimation with Perturbed Gaussian Markov Processes
20 0.068345852 14 nips-2005-A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification
simIndex simValue paperId paperTitle
same-paper 1 0.94790977 184 nips-2005-Structured Prediction via the Extragradient Method
Author: Ben Taskar, Simon Lacoste-Julian, Michael I. Jordan
Abstract: We present a simple and scalable algorithm for large-margin estimation of structured models, including an important class of Markov networks and combinatorial models. We formulate the estimation problem as a convex-concave saddle-point problem and apply the extragradient method, yielding an algorithm with linear convergence using simple gradient and projection calculations. The projection step can be solved using combinatorial algorithms for min-cost quadratic flow. This makes the approach an efficient alternative to formulations based on reductions to a quadratic program (QP). We present experiments on two very different structured prediction tasks: 3D image segmentation and word alignment, illustrating the favorable scaling properties of our algorithm. 1
2 0.69057482 161 nips-2005-Radial Basis Function Network for Multi-task Learning
Author: Xuejun Liao, Lawrence Carin
Abstract: We extend radial basis function (RBF) networks to the scenario in which multiple correlated tasks are learned simultaneously, and present the corresponding learning algorithms. We develop the algorithms for learning the network structure, in either a supervised or unsupervised manner. Training data may also be actively selected to improve the network’s generalization to test data. Experimental results based on real data demonstrate the advantage of the proposed algorithms and support our conclusions. 1
3 0.51009202 100 nips-2005-Interpolating between types and tokens by estimating power-law generators
Author: Sharon Goldwater, Mark Johnson, Thomas L. Griffiths
Abstract: Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that generically produce power-laws, augmenting standard generative models with an adaptor that produces the appropriate pattern of token frequencies. We show that taking a particular stochastic process – the Pitman-Yor process – as an adaptor justifies the appearance of type frequencies in formal analyses of natural language, and improves the performance of a model for unsupervised learning of morphology.
4 0.50117928 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables
Author: Y. Altun, D. McAllester, M. Belkin
Abstract: Many real-world classification problems involve the prediction of multiple inter-dependent variables forming some structural dependency. Recent progress in machine learning has mainly focused on supervised classification of such structured variables. In this paper, we investigate structured classification in a semi-supervised setting. We present a discriminative approach that utilizes the intrinsic geometry of input patterns revealed by unlabeled data points and we derive a maximum-margin formulation of semi-supervised learning for structured variables. Unlike transductive algorithms, our formulation naturally extends to new test points. 1
5 0.47791484 121 nips-2005-Location-based activity recognition
Author: Lin Liao, Dieter Fox, Henry Kautz
Abstract: Learning patterns of human behavior from sensor data is extremely important for high-level activity inference. We show how to extract and label a person’s activities and significant places from traces of GPS data. In contrast to existing techniques, our approach simultaneously detects and classifies the significant locations of a person and takes the highlevel context into account. Our system uses relational Markov networks to represent the hierarchical activity model that encodes the complex relations among GPS readings, activities and significant places. We apply FFT-based message passing to perform efficient summation over large numbers of nodes in the networks. We present experiments that show significant improvements over existing techniques. 1
6 0.44885007 171 nips-2005-Searching for Character Models
7 0.38664564 114 nips-2005-Learning Rankings via Convex Hull Separation
8 0.38623515 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm
9 0.36534685 44 nips-2005-Computing the Solution Path for the Regularized Support Vector Regression
10 0.3588317 185 nips-2005-Subsequence Kernels for Relation Extraction
11 0.35698128 108 nips-2005-Layered Dynamic Textures
12 0.35604838 83 nips-2005-Generalization error bounds for classifiers trained with interdependent data
13 0.34211916 165 nips-2005-Response Analysis of Neuronal Population with Synaptic Depression
14 0.33281994 170 nips-2005-Scaling Laws in Natural Scenes and the Inference of 3D Shape
15 0.32623741 158 nips-2005-Products of ``Edge-perts
16 0.32411256 48 nips-2005-Context as Filtering
17 0.32095459 139 nips-2005-Non-iterative Estimation with Perturbed Gaussian Markov Processes
18 0.31980771 50 nips-2005-Convex Neural Networks
19 0.30780566 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach
20 0.30054235 45 nips-2005-Conditional Visual Tracking in Kernel Space
topicId topicWeight
[(2, 0.23), (3, 0.085), (10, 0.038), (27, 0.046), (31, 0.076), (34, 0.112), (39, 0.016), (41, 0.01), (50, 0.037), (55, 0.031), (57, 0.019), (69, 0.051), (73, 0.054), (88, 0.083), (91, 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.7913686 184 nips-2005-Structured Prediction via the Extragradient Method
Author: Ben Taskar, Simon Lacoste-Julian, Michael I. Jordan
Abstract: We present a simple and scalable algorithm for large-margin estimation of structured models, including an important class of Markov networks and combinatorial models. We formulate the estimation problem as a convex-concave saddle-point problem and apply the extragradient method, yielding an algorithm with linear convergence using simple gradient and projection calculations. The projection step can be solved using combinatorial algorithms for min-cost quadratic flow. This makes the approach an efficient alternative to formulations based on reductions to a quadratic program (QP). We present experiments on two very different structured prediction tasks: 3D image segmentation and word alignment, illustrating the favorable scaling properties of our algorithm. 1
2 0.64476335 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
Author: John D. Lafferty, Pradeep K. Ravikumar
Abstract: We present a family of approximation techniques for probabilistic graphical models, based on the use of graphical preconditioners developed in the scientific computing literature. Our framework yields rigorous upper and lower bounds on event probabilities and the log partition function of undirected graphical models, using non-iterative procedures that have low time complexity. As in mean field approaches, the approximations are built upon tractable subgraphs; however, we recast the problem of optimizing the tractable distribution parameters and approximate inference in terms of the well-studied linear systems problem of obtaining a good matrix preconditioner. Experiments are presented that compare the new approximation schemes to variational methods. 1
3 0.62954837 78 nips-2005-From Weighted Classification to Policy Search
Author: Doron Blatt, Alfred O. Hero
Abstract: This paper proposes an algorithm to convert a T -stage stochastic decision problem with a continuous state space to a sequence of supervised learning problems. The optimization problem associated with the trajectory tree and random trajectory methods of Kearns, Mansour, and Ng, 2000, is solved using the Gauss-Seidel method. The algorithm breaks a multistage reinforcement learning problem into a sequence of single-stage reinforcement learning subproblems, each of which is solved via an exact reduction to a weighted-classification problem that can be solved using off-the-shelf methods. Thus the algorithm converts a reinforcement learning problem into simpler supervised learning subproblems. It is shown that the method converges in a finite number of steps to a solution that cannot be further improved by componentwise optimization. The implication of the proposed algorithm is that a plethora of classification methods can be applied to find policies in the reinforcement learning problem. 1
4 0.62432182 177 nips-2005-Size Regularized Cut for Data Clustering
Author: Yixin Chen, Ya Zhang, Xiang Ji
Abstract: We present a novel spectral clustering method that enables users to incorporate prior knowledge of the size of clusters into the clustering process. The cost function, which is named size regularized cut (SRcut), is defined as the sum of the inter-cluster similarity and a regularization term measuring the relative size of two clusters. Finding a partition of the data set to minimize SRcut is proved to be NP-complete. An approximation algorithm is proposed to solve a relaxed version of the optimization problem as an eigenvalue problem. Evaluations over different data sets demonstrate that the method is not sensitive to outliers and performs better than normalized cut. 1
5 0.6228925 144 nips-2005-Off-policy Learning with Options and Recognizers
Author: Doina Precup, Cosmin Paduraru, Anna Koop, Richard S. Sutton, Satinder P. Singh
Abstract: We introduce a new algorithm for off-policy temporal-difference learning with function approximation that has lower variance and requires less knowledge of the behavior policy than prior methods. We develop the notion of a recognizer, a filter on actions that distorts the behavior policy to produce a related target policy with low-variance importance-sampling corrections. We also consider target policies that are deviations from the state distribution of the behavior policy, such as potential temporally abstract options, which further reduces variance. This paper introduces recognizers and their potential advantages, then develops a full algorithm for linear function approximation and proves that its updates are in the same direction as on-policy TD updates, which implies asymptotic convergence. Even though our algorithm is based on importance sampling, we prove that it requires absolutely no knowledge of the behavior policy for the case of state-aggregation function approximators. Off-policy learning is learning about one way of behaving while actually behaving in another way. For example, Q-learning is an off- policy learning method because it learns about the optimal policy while taking actions in a more exploratory fashion, e.g., according to an ε-greedy policy. Off-policy learning is of interest because only one way of selecting actions can be used at any time, but we would like to learn about many different ways of behaving from the single resultant stream of experience. For example, the options framework for temporal abstraction involves considering a variety of different ways of selecting actions. For each such option one would like to learn a model of its possible outcomes suitable for planning and other uses. Such option models have been proposed as fundamental building blocks of grounded world knowledge (Sutton, Precup & Singh, 1999; Sutton, Rafols & Koop, 2005). 
Using off-policy learning, one would be able to learn predictive models for many options at the same time from a single stream of experience. Unfortunately, off-policy learning using temporal-difference methods has proven problematic when used in conjunction with function approximation. Function approximation is essential in order to handle the large state spaces that are inherent in many problem domains. Q-learning, for example, has been proven to converge to an optimal policy in the tabular case, but is unsound and may diverge in the case of linear function approximation (Baird, 1995). Precup, Sutton, and Dasgupta (2001) introduced and proved convergence for the first off-policy learning algorithm with linear function approximation. They addressed the problem of learning the expected value of a target policy based on experience generated using a different behavior policy. They used importance sampling techniques to reduce the off-policy case to the on-policy case, where existing convergence theorems apply (Tsitsiklis & Van Roy, 1997; Tadic, 2001). There are two important difficulties with that approach. First, the behavior policy needs to be stationary and known, because it is needed to compute the importance sampling corrections. Second, the importance sampling weights are often ill-conditioned. In the worst case, the variance could be infinite and convergence would not occur. The conditions required to prevent this were somewhat awkward and, even when they applied and asymptotic convergence was assured, the variance could still be high and convergence could be slow. In this paper we address both of these problems in the context of off-policy learning for options. We introduce the notion of a recognizer. Rather than specifying an explicit target policy (for instance, the policy of an option), about which we want to make predictions, a recognizer specifies a condition on the actions that are selected. 
For example, a recognizer for the temporally extended action of picking up a cup would not specify which hand is to be used, or what the motion should be at all different positions of the cup. The recognizer would recognize a whole variety of directions of motion and poses as part of picking the cup. The advantage of this strategy is not that one might prefer a multitude of different behaviors, but that the behavior may be based on a variety of different strategies, all of which are relevant, and we would like to learn from any of them. In general, a recognizer is a function that recognizes or accepts a space of different ways of behaving and thus, can learn from a wider range of data. Recognizers have two advantages over direct specification of a target policy: 1) they are a natural and easy way to specify a target policy for which importance sampling will be well conditioned, and 2) they do not require the behavior policy to be known. The latter is important because in many cases we may have little knowledge of the behavior policy, or a stationary behavior policy may not even exist. We show that for the case of state aggregation, even if the behavior policy is unknown, convergence to a good model is achieved. 1 Non-sequential example The benefits of using recognizers in off-policy learning can be most easily seen in a nonsequential context with a single continuous action. Suppose you are given a sequence of sample actions a_i ∈ [0, 1], selected i.i.d. according to probability density b : [0, 1] → ℜ+ (the behavior density). For example, suppose the behavior density is of the oscillatory form shown as a red line in Figure 1. For each action, a_i, we observe a corresponding outcome, z_i ∈ ℜ, a random variable whose distribution depends only on a_i. Thus the behavior density induces an outcome density. The on-policy problem is to estimate the mean m_b of the outcome density. This problem can be solved simply by averaging the sample outcomes: m̂_b = (1/n) ∑_{i=1}^n z_i. 
The off-policy problem is to use this same data to learn what the mean would be if actions were selected in some way other than b, for example, if the actions were restricted to a designated range, such as between 0.7 and 0.9. There are two natural ways to pose this off-policy problem. The most straightforward way is to be equally interested in all actions within the designated region. One professes to be interested in actions selected according to a target density π : [0, 1] → ℜ+, which in the example would be 5.0 between 0.7 and 0.9, and zero elsewhere, as in the dashed line in Figure 1 (left).

[Figure 1: The left panel shows the behavior policy and the target policies for the formulations of the problem with and without recognizers. The right panel shows empirical estimates of the variances for the two formulations as a function of the number of sample actions. The lowest line is for the formulation using empirically-estimated recognition probabilities.]

The importance-sampling estimate of the mean outcome is

  m̂_π = (1/n) ∑_{i=1}^n z_i π(a_i)/b(a_i).   (1)

This approach is problematic if there are parts of the region of interest where the behavior density is zero or very nearly so, such as near 0.72 and 0.85 in the example. Here the importance sampling ratios are exceedingly large and the estimate is poorly conditioned (large variance). The upper curve in Figure 1 (right) shows the empirical variance of this estimate as a function of the number of samples. The spikes and uncertain decline of the empirical variance indicate that the distribution is very skewed and that the estimates are very poorly conditioned. The second way to pose the problem uses recognizers. 
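Before turning to recognizers, estimator (1) and its conditioning problem can be illustrated in code. The sketch below is not from the paper; the oscillatory density, the outcome model, and all numbers are hypothetical stand-ins for the setup of Figure 1:

```python
import numpy as np

rng = np.random.default_rng(0)

def behavior_density(a):
    # Hypothetical oscillatory density on [0, 1] standing in for the red
    # curve of Figure 1; it dips close to zero inside [0.7, 0.9].
    return 1.0 + 0.95 * np.sin(12 * np.pi * a)

def sample_behavior(n):
    # Rejection sampling from the behavior density (bound: b(a) <= 1.95 < 2).
    out = []
    while len(out) < n:
        a = rng.uniform(0.0, 1.0)
        if rng.uniform(0.0, 2.0) < behavior_density(a):
            out.append(a)
    return np.array(out)

def is_estimate(actions, z, lo=0.7, hi=0.9):
    # Estimator (1) with a uniform target density pi = 1/(hi - lo) on
    # [lo, hi] and zero elsewhere.
    pi = np.where((actions >= lo) & (actions <= hi), 1.0 / (hi - lo), 0.0)
    return np.mean(pi / behavior_density(actions) * z)

actions = sample_behavior(5000)
z = 2.0 * actions + rng.normal(0.0, 0.1, size=actions.size)  # hypothetical outcomes
m_hat = is_estimate(actions, z)      # true target mean is 1.6 in this setup
print(m_hat)  # noisy: the ratios blow up where b(a) is nearly zero
```

The importance weights reach about 100 near the dip of the density, which is exactly the ill-conditioning the text describes.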
One professes to be interested in actions to the extent that they are both selected by b and within the designated region. This leads to the target policy shown in blue in the left panel of Figure 1 (it is taller because it still must sum to 1). For this problem, the variance of (1) is much smaller, as shown in the lower two lines of Figure 1 (right). To make this way of posing the problem clear, we introduce the notion of a recognizer function c : A → ℜ+. The action space in the example is A = [0, 1] and the recognizer is c(a) = 1 for a between 0.7 and 0.9 and is zero elsewhere. The target policy is defined in general by

  π(a) = c(a)b(a)/µ = c(a)b(a) / ∑_x c(x)b(x),   (2)

where µ = ∑_x c(x)b(x) is a constant, equal to the probability of recognizing an action from the behavior policy. Given π, m̂_π from (1) can be rewritten in terms of the recognizer as

  m̂_π = (1/n) ∑_{i=1}^n z_i π(a_i)/b(a_i) = (1/n) ∑_{i=1}^n z_i c(a_i)b(a_i)/(µ b(a_i)) = (1/n) ∑_{i=1}^n z_i c(a_i)/µ.   (3)

Note that the target density does not appear at all in the last expression and that the behavior distribution appears only in µ, which is independent of the sample action. If this constant is known, then this estimator can be computed with no knowledge of π or b. The constant µ can easily be estimated as the fraction of recognized actions in the sample. The lowest line in Figure 1 (right) shows the variance of the estimator using this fraction in place of the recognition probability. Its variance is low, no worse than that of the exact algorithm, and apparently slightly lower. Because this algorithm does not use the behavior density, it can be applied when the behavior density is unknown or does not even exist. For example, suppose actions were selected in some deterministic, systematic way that in the long run produced an empirical distribution like b. This would be problematic for the other algorithms but would require no modification of the recognition-fraction algorithm. 
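The recognition-fraction estimator (3) is equally short in code. The sketch below is illustrative only; the behavior sampler and the outcome model are hypothetical, and µ is estimated as the recognized fraction, as suggested above:

```python
import numpy as np

rng = np.random.default_rng(1)

def recognizer(a, lo=0.7, hi=0.9):
    # Binary recognizer: accepts exactly the actions in [lo, hi].
    return ((a >= lo) & (a <= hi)).astype(float)

def recognizer_estimate(actions, z):
    # Estimator (3): (1/n) sum_i z_i c(a_i) / mu_hat, with mu_hat the
    # empirical fraction of recognized actions. Uses neither pi nor b.
    c = recognizer(actions)
    mu_hat = c.mean()
    return np.mean(z * c) / mu_hat

# The behavior density never appears: actions here come from an arbitrary
# (hypothetical) sampler that the estimator knows nothing about.
actions = rng.beta(2.0, 2.0, size=20000)
z = 2.0 * actions + rng.normal(0.0, 0.1, size=actions.size)
m_hat = recognizer_estimate(actions, z)
print(m_hat)  # close to the mean outcome over recognized actions (about 1.57)
```

Because the weights c(a_i)/µ̂ are bounded, the estimate stays well conditioned even where the (unknown) behavior density is small.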
2 Recognizers improve conditioning of off-policy learning

The main use of recognizers is in formulating a target density π about which we can successfully learn predictions, based on the current behavior being followed. Here we formalize this intuition.

Theorem 1 Let A = {a_1, . . ., a_k} ⊆ 𝒜 be a subset of all the possible actions. Consider a fixed behavior policy b and let Π_A be the class of policies that only choose actions from A, i.e., if π(a) > 0 then a ∈ A. Then the policy induced by b and the binary recognizer c_A is the policy with minimum-variance one-step importance sampling corrections, among those in Π_A:

  π as given by (2) = arg min_{π ∈ Π_A} E_b[(π(a_i)/b(a_i))²]   (4)

Proof: Denote π(a_i) = π_i, b(a_i) = b_i. Then the expected variance of the one-step importance sampling corrections is:

  E_b[(π_i/b_i)²] − (E_b[π_i/b_i])² = ∑_i b_i (π_i/b_i)² − 1 = ∑_i π_i²/b_i − 1,

where the summation (here and everywhere below) is such that the action a_i ∈ A. We want to find π_i that minimizes this expression, subject to the constraint that ∑_i π_i = 1. This is a constrained optimization problem. To solve it, we write down the corresponding Lagrangian:

  L(π_i, β) = ∑_i π_i²/b_i − 1 + β(∑_i π_i − 1)

We take the partial derivatives wrt π_i and β and set them to 0:

  ∂L/∂π_i = 2π_i/b_i + β = 0  ⇒  π_i = −β b_i/2   (5)

  ∂L/∂β = ∑_i π_i − 1 = 0   (6)

By taking (5) and plugging into (6), we get the following expression for β:

  −(β/2) ∑_i b_i = 1  ⇒  β = −2 / ∑_i b_i

By substituting β into (5) we obtain:

  π_i = b_i / ∑_i b_i

This is exactly the policy induced by the recognizer defined by c(a_i) = 1 iff a_i ∈ A. We also note that it is advantageous, from the point of view of minimizing the variance of the updates, to have recognizers that accept a broad range of actions:

Theorem 2 Consider two binary recognizers c_1 and c_2, such that µ_1 > µ_2. Then the importance sampling corrections for c_1 have lower variance than the importance sampling corrections for c_2. 
Proof: From the previous theorem, we have the variance of a recognizer c_A:

  Var = ∑_i π_i²/b_i − 1 = ∑_i (b_i / ∑_{j∈A} b_j)² (1/b_i) − 1 = 1/∑_{j∈A} b_j − 1 = 1/µ − 1.

Since 1/µ − 1 is strictly decreasing in µ and µ_1 > µ_2, the corrections for c_1 have the lower variance.

3 Formal framework for sequential problems

We turn now to the full case of learning about sequential decision processes with function approximation. We use the standard framework in which an agent interacts with a stochastic environment. At each time step t, the agent receives a state s_t and chooses an action a_t. We assume for the moment that actions are selected according to a fixed behavior policy, b : S × A → [0, 1], where b(s, a) is the probability of selecting action a in state s. The behavior policy is used to generate a sequence of experience (observations, actions and rewards). The goal is to learn, from this data, predictions about different ways of behaving. In this paper we focus on learning predictions about expected returns, but other predictions can be tackled as well (for instance, predictions of transition models for options (Sutton, Precup & Singh, 1999), or predictions specified by a TD-network (Sutton & Tanner, 2005; Sutton, Rafols & Koop, 2006)). We assume that the state space is large or continuous, and function approximation must be used to compute any values of interest. In particular, we assume a space of feature vectors Φ and a mapping φ : S → Φ. We denote by φ_s the feature vector associated with s. An option is defined as a triple o = ⟨I, π, β⟩ where I ⊆ S is the set of states in which the option can be initiated, π is the internal policy of the option and β : S → [0, 1] is a stochastic termination condition. In the option work (Sutton, Precup & Singh, 1999), each of these elements has to be explicitly specified and fixed in order for an option to be well defined. Here, we will instead define options implicitly, using the notion of a recognizer. A recognizer is defined as a function c : S × A → [0, 1], where c(s, a) indicates to what extent the recognizer allows action a in state s. 
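As a quick numerical aside (all numbers hypothetical), both the closed form Var = 1/µ − 1 derived just above and the optimality claim of Theorem 1 can be checked directly:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical behavior probabilities b_i restricted to the recognized set A.
b = np.array([0.1, 0.3, 0.2])            # need not sum to 1 (other actions exist)

def is_variance(pi, b):
    # Variance of the one-step importance-sampling corrections:
    # sum_i pi_i^2 / b_i - 1, summing over the actions in A.
    return np.sum(pi ** 2 / b) - 1.0

pi_star = b / b.sum()                    # policy induced by the recognizer c_A
mu = b.sum()                             # recognition probability

# Closed form from the proof of Theorem 2: Var(pi_star) = 1/mu - 1.
print(abs(is_variance(pi_star, b) - (1.0 / mu - 1.0)))   # ~0

# Theorem 1: no other policy supported on A does better.
others = [rng.dirichlet(np.ones(b.size)) for _ in range(10000)]
print(all(is_variance(p, b) >= is_variance(pi_star, b) for p in others))  # True
```

The gap is exactly ∑_i (π_i − π*_i)²/b_i ≥ 0, which is why the random search never beats π*.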
An important special case, which we treat in this paper, is that of binary recognizers. In this case, c is an indicator function, specifying a subset of actions that are allowed, or recognized, given a particular state. Note that recognizers do not specify policies; instead, they merely give restrictions on the policies that are allowed or recognized. A recognizer c together with a behavior policy b generates a target policy π, where:

  π(s, a) = b(s, a)c(s, a)/µ(s) = b(s, a)c(s, a) / ∑_x b(s, x)c(s, x).   (7)

The denominator of this fraction, µ(s) = ∑_x b(s, x)c(s, x), is the recognition probability at s, i.e., the probability that an action will be accepted at s when behavior is generated according to b. The policy π is only defined at states for which µ(s) > 0. The numerator gives the probability that action a is produced by the behavior and recognized in s. Note that if the recognizer accepts all state-action pairs, i.e. c(s, a) = 1, ∀s, a, then π is the same as b. Since a recognizer and a behavior policy can specify together a target policy, we can use recognizers as a way to specify policies for options, using (7). An option can only be initiated at a state for which at least one action is recognized, so µ(s) > 0, ∀s ∈ I. Similarly, the termination condition of such an option, β, is defined as β(s) = 1 if µ(s) = 0. In other words, the option must terminate if no actions are recognized at a given state. At all other states, β can be defined between 0 and 1 as desired. We will focus on computing the reward model of an option o, which represents the expected total return. The expected values of different features at the end of the option can be estimated similarly. The quantity that we want to compute is

  E_o{R(s)} = E{r_1 + r_2 + . . . + r_T | s_0 = s, π, β},

where s ∈ I, experience is generated according to the policy of the option, π, and T denotes the random variable representing the time step at which the option terminates according to β. 
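A sketch of how (7) might look in code; the four-action state and its probabilities are hypothetical:

```python
import numpy as np

def target_policy(b_row, c_row):
    # Equation (7): pi(s, .) from the behavior probabilities b(s, .) and the
    # recognizer c(s, .); also returns the recognition probability mu(s).
    mu = float(np.dot(b_row, c_row))
    if mu == 0.0:
        return None, 0.0        # pi undefined: the option must terminate here
    return b_row * c_row / mu, mu

# Hypothetical state with 4 actions; the recognizer accepts actions 1 and 2.
b_row = np.array([0.4, 0.3, 0.2, 0.1])
c_row = np.array([0.0, 1.0, 1.0, 0.0])
pi, mu = target_policy(b_row, c_row)
print(pi, mu)   # pi = [0, 0.6, 0.4, 0], mu = 0.5
```

The µ(s) = 0 branch mirrors the termination rule β(s) = 1 when no action is recognized.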
We assume that linear function approximation is used to represent these values, i.e.,

  E_o{R(s)} ≈ θᵀφ_s,

where θ is a vector of parameters.

4 Off-policy learning algorithm

In this section we present an adaptation of the off-policy learning algorithm of Precup, Sutton & Dasgupta (2001) to the case of learning about options. Suppose that an option’s policy π was used to generate behavior. In this case, learning the reward model of the option is a special case of temporal-difference learning of value functions. The forward view of this algorithm is as follows. Let R̄_t^{(n)} denote the truncated n-step return starting at time step t and let y_t denote the 0-step truncated return, R̄_t^{(0)}. By the definition of the n-step truncated return, we have:

  R̄_t^{(n)} = r_{t+1} + (1 − β_{t+1}) R̄_{t+1}^{(n−1)}.

This is similar to the case of value functions, but it accounts for the possibility of terminating the option at time step t + 1. The λ-return is defined in the usual way:

  R̄_t^λ = (1 − λ) ∑_{n=1}^∞ λ^{n−1} R̄_t^{(n)}.

The parameters of the linear function approximator are updated on every time step proportionally to:

  ∆θ̄_t = (R̄_t^λ − y_t) ∇_θ y_t (1 − β_1) · · · (1 − β_t).

In our case, however, trajectories are generated according to the behavior policy b. The main idea of the algorithm is to use importance sampling corrections in order to account for the difference in the state distribution of the two policies. Let ρ_t = π(s_t, a_t)/b(s_t, a_t) be the importance sampling ratio at time step t. The truncated n-step return, R_t^{(n)}, satisfies:

  R_t^{(n)} = ρ_t [r_{t+1} + (1 − β_{t+1}) R_{t+1}^{(n−1)}].

The update to the parameter vector is proportional to:

  ∆θ_t = (R_t^λ − y_t) ∇_θ y_t ρ_0 (1 − β_1) · · · ρ_{t−1} (1 − β_t).

The following result shows that the expected updates of the on-policy and off-policy algorithms are the same.

Theorem 3 For every time step t ≥ 0 and any initial state s,

  E_b[∆θ_t | s] = E_π[∆θ̄_t | s]. 
Proof: First we will show by induction that E_b{R_t^{(n)} | s} = E_π{R̄_t^{(n)} | s}, ∀n (which implies that E_b{R_t^λ | s} = E_π{R̄_t^λ | s}). For n = 0, the statement is trivial. Assuming that it is true for n − 1, we have

  E_b[R_t^{(n)} | s] = ∑_a b(s, a) ρ(s, a) ∑_{s′} P^a_{ss′} [r^a_{ss′} + (1 − β(s′)) E_b[R_{t+1}^{(n−1)} | s′]]
  = ∑_a b(s, a) (π(s, a)/b(s, a)) ∑_{s′} P^a_{ss′} [r^a_{ss′} + (1 − β(s′)) E_π[R̄_{t+1}^{(n−1)} | s′]]
  = ∑_a π(s, a) ∑_{s′} P^a_{ss′} [r^a_{ss′} + (1 − β(s′)) E_π[R̄_{t+1}^{(n−1)} | s′]]
  = E_π[R̄_t^{(n)} | s].

Now we are ready to prove the theorem’s main statement. Defining Ω_t to be the set of all trajectory components up to state s_t, we have:

  E_b{∆θ_t | s} = ∑_{ω∈Ω_t} P_b(ω | s) E_b[(R_t^λ − y_t) ∇_θ y_t | ω] ∏_{i=0}^{t−1} ρ_i (1 − β_{i+1})
  = ∑_{ω∈Ω_t} (∏_{i=0}^{t−1} b_i P^{a_i}_{s_i s_{i+1}}) (E_b[R_t^λ | s_t] − y_t) ∇_θ y_t ∏_{i=0}^{t−1} (π_i/b_i)(1 − β_{i+1})
  = ∑_{ω∈Ω_t} (∏_{i=0}^{t−1} π_i P^{a_i}_{s_i s_{i+1}}) (E_π[R̄_t^λ | s_t] − y_t) ∇_θ y_t (1 − β_1) . . . (1 − β_t)
  = ∑_{ω∈Ω_t} P_π(ω | s) E_π[(R̄_t^λ − y_t) ∇_θ y_t | ω] (1 − β_1) . . . (1 − β_t)
  = E_π[∆θ̄_t | s].

Note that we are able to use s_t and ω interchangeably because of the Markov property. Since we have shown that E_b[∆θ_t | s] = E_π[∆θ̄_t | s] for any state s, it follows that the expected updates will also be equal for any distribution of the initial state s. When learning the model of options with data generated from the behavior policy b, the starting state distribution with respect to which the learning is performed, I_0, is determined by the stationary distribution of the behavior policy, as well as the initiation set of the option I. We note also that the importance sampling corrections only have to be performed for the trajectory since the initiation of the updates for the option. No corrections are required for the experience prior to this point. This should generate updates that have significantly lower variance than in the case of learning values of policies (Precup, Sutton & Dasgupta, 2001). Because of the termination condition of the option, β, ∆θ can quickly decay to zero. 
To avoid this problem, we can use a restart function g : S → [0, 1], such that g(s_t) specifies the extent to which the updating episode is considered to start at time t. Adding restarts generates a new forward update:

  ∆θ_t = (R_t^λ − y_t) ∇_θ y_t ∑_{i=0}^t g_i ρ_i . . . ρ_{t−1} (1 − β_{i+1}) . . . (1 − β_t),   (8)

where R_t^λ is the same as above. With an adaptation of the proof in Precup, Sutton & Dasgupta (2001), we can show that we get the same expected value of updates by applying this algorithm from the original starting distribution as we would by applying the algorithm without restarts from a starting distribution defined by I_0 and g. We can turn this forward algorithm into an incremental, backward view algorithm in the following way:

• Initialize k_0 = g_0, e_0 = k_0 ∇_θ y_0
• At every time step t:
    δ_t = ρ_t (r_{t+1} + (1 − β_{t+1}) y_{t+1}) − y_t
    θ_{t+1} = θ_t + α δ_t e_t
    k_{t+1} = ρ_t k_t (1 − β_{t+1}) + g_{t+1}
    e_{t+1} = λ ρ_t (1 − β_{t+1}) e_t + k_{t+1} ∇_θ y_{t+1}

Using a similar technique to that of Precup, Sutton & Dasgupta (2001) and Sutton & Barto (1998), we can prove that the forward and backward algorithms are equivalent (omitted due to lack of space). This algorithm is guaranteed to converge if the variance of the updates is finite (Precup, Sutton & Dasgupta, 2001). In the case of options, the termination condition β can be used to ensure that this is the case.

5 Learning when the behavior policy is unknown

In this section, we consider the case in which the behavior policy is unknown. This case is generally problematic for importance sampling algorithms, but the use of recognizers will allow us to define importance sampling corrections, as well as a convergent algorithm. Recall that when using a recognizer, the target policy of the option is defined as:

  π(s, a) = c(s, a) b(s, a) / µ(s)

and the importance sampling correction becomes:

  ρ(s, a) = π(s, a)/b(s, a) = c(s, a)/µ(s).

Of course, µ(s) depends on b. If b is unknown, instead of µ(s), we will use a maximum likelihood estimate µ̂ : S → [0, 1]. 
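The backward-view updates transcribe almost line-for-line into code. The following is a minimal sketch, not the authors' implementation; the linear features, trajectory, importance ratios, and restart weights are all hypothetical placeholders:

```python
import numpy as np

def backward_view(phi, r, beta, rho, g, alpha=0.1, lam=0.9):
    """Incremental backward-view updates:
        delta_t   = rho_t * (r_{t+1} + (1 - beta_{t+1}) * y_{t+1}) - y_t
        theta    += alpha * delta_t * e_t
        k_{t+1}   = rho_t * k_t * (1 - beta_{t+1}) + g_{t+1}
        e_{t+1}   = lam * rho_t * (1 - beta_{t+1}) * e_t + k_{t+1} * grad y_{t+1}
    Linear approximation: y_t = theta . phi[t], so grad_theta y_t = phi[t].
    Index convention: r[t] is r_{t+1}, beta[t] is beta_{t+1}, rho[t] is rho_t,
    g[t] is g_t; phi has one extra row for the final state."""
    theta = np.zeros(phi.shape[1])
    k = g[0]
    e = k * phi[0]
    for t in range(len(r)):
        y_t = theta @ phi[t]
        y_next = theta @ phi[t + 1]
        delta = rho[t] * (r[t] + (1 - beta[t]) * y_next) - y_t
        theta = theta + alpha * delta * e
        k = rho[t] * k * (1 - beta[t]) + g[t + 1]
        e = lam * rho[t] * (1 - beta[t]) * e + k * phi[t + 1]
    return theta

# Hypothetical 4-step trajectory over 3 features.
rng = np.random.default_rng(3)
phi = rng.random((5, 3))
r, beta = np.array([1.0, 0.0, 1.0, 2.0]), np.array([0.0, 0.2, 0.0, 1.0])
rho = np.array([1.0, 1.2, 0.8, 1.0])
g = np.ones(5)
theta = backward_view(phi, r, beta, rho, g)
print(theta)
```

Note that δ_t is computed with the eligibility trace e_t from before the update, matching the order of the four update equations.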
The structure used to compute µ̂ will have to be compatible with the feature space used to represent the reward model. We will make this more precise below. Likewise, the recognizer c(s, a) will have to be defined in terms of the features used to represent the model. We will then define the importance sampling corrections as:

  ρ̂(s, a) = c(s, a)/µ̂(s).

We consider the case in which the function approximator used to model the option is actually a state aggregator. In this case, we will define recognizers which behave consistently in each partition, i.e., c(s, a) = c(p, a), ∀s ∈ p. This means that an action is either recognized or not recognized in all states of the partition. The recognition probability µ̂ will have one entry for every partition p of the state space. Its value will be:

  µ̂(p) = N(p, c = 1) / N(p),

where N(p) is the number of times partition p was visited, and N(p, c = 1) is the number of times the action taken in p was recognized. In the limit, w.p.1, µ̂ converges to ∑_s d^b(s|p) ∑_a c(p, a) b(s, a), where d^b(s|p) is the probability of visiting state s from partition p under the stationary distribution of b. At this limit, π̂(s, a) = ρ̂(s, a) b(s, a) will be a well-defined policy (i.e., ∑_a π̂(s, a) = 1). Using Theorem 3, off-policy updates using importance sampling corrections ρ̂ will have the same expected value as on-policy updates using π̂. Note though that the learning algorithm never uses π̂; the only quantities needed are ρ̂, which are learned incrementally from data. For the case of general linear function approximation, we conjecture that a similar idea can be used, where the recognition probability is learned using logistic regression. The development of this part is left for future work.

Acknowledgements

The authors gratefully acknowledge the ideas and encouragement they have received in this work from Eddie Rafols, Mark Ring, Lihong Li and other members of the rlai.net group. 
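The counting estimate of µ̂ for a state aggregator might be sketched as follows; the partitioning, the recognizer, and the data stream below are hypothetical:

```python
from collections import defaultdict

class RecognitionEstimator:
    """Maximum-likelihood estimate mu_hat(p) = N(p, c=1) / N(p) for a
    state-aggregation approximator, learned incrementally from data;
    no knowledge of the behavior policy b is needed."""
    def __init__(self, recognizer):
        self.c = recognizer                  # c(p, a) in {0, 1}
        self.n = defaultdict(int)            # N(p): visits to partition p
        self.n_rec = defaultdict(int)        # N(p, c=1): recognized visits

    def update(self, p, a):
        self.n[p] += 1
        self.n_rec[p] += int(self.c(p, a))

    def mu_hat(self, p):
        return self.n_rec[p] / self.n[p] if self.n[p] else 0.0

    def rho_hat(self, p, a):
        # Importance-sampling correction c(p, a) / mu_hat(p).
        return self.c(p, a) / self.mu_hat(p)

# Hypothetical recognizer over 2 partitions and actions {0, 1, 2}:
est = RecognitionEstimator(lambda p, a: a in (0, 1))
for p, a in [(0, 0), (0, 2), (0, 1), (0, 1), (1, 2), (1, 2)]:
    est.update(p, a)
print(est.mu_hat(0), est.mu_hat(1))  # 0.75 0.0
```

Only ρ̂ is ever consumed by the learning updates, exactly as the text notes; π̂ never needs to be formed explicitly.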
We thank Csaba Szepesvari and the reviewers of the paper for constructive comments. This research was supported in part by iCore, NSERC, Alberta Ingenuity, and CFI.

References

Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In Proceedings of ICML.
Precup, D., Sutton, R. S. and Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. In Proceedings of ICML.
Sutton, R. S., Precup, D. and Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, vol. 112, pp. 181–211.
Sutton, R. S. and Tanner, B. (2005). Temporal-difference networks. In Proceedings of NIPS-17.
Sutton, R. S., Rafols, E. and Koop, A. (2006). Temporal abstraction in temporal-difference networks. In Proceedings of NIPS-18.
Tadic, V. (2001). On the convergence of temporal-difference learning with linear function approximation. Machine Learning, vol. 42, pp. 241–267.
Tsitsiklis, J. N. and Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42:674–690.
6 0.62222993 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity
7 0.6153388 47 nips-2005-Consistency of one-class SVM and related algorithms
8 0.61442769 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines
9 0.61303031 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning
10 0.61281323 50 nips-2005-Convex Neural Networks
11 0.61196959 178 nips-2005-Soft Clustering on Graphs
12 0.61179858 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations
13 0.61112732 102 nips-2005-Kernelized Infomax Clustering
14 0.60852164 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction
15 0.60692102 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification
16 0.6066972 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables
17 0.60598558 175 nips-2005-Sequence and Tree Kernels with Statistical Feature Mining
18 0.60588306 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification
19 0.60585356 142 nips-2005-Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games
20 0.60526776 112 nips-2005-Learning Minimum Volume Sets