nips nips2006 nips2006-47 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: J. A. Bagnell, Joel Chestnutt, David M. Bradley, Nathan D. Ratliff
Abstract: The Maximum Margin Planning (MMP) (Ratliff et al., 2006) algorithm solves imitation learning problems by learning linear mappings from features to cost functions in a planning domain. The learned policy is the result of minimum-cost planning using these cost functions. These mappings are chosen so that example policies (or trajectories) given by a teacher appear to be lower cost (with a loss-scaled margin) than any other policy for a given planning domain. We provide a novel approach, MMPBoost, based on the functional gradient descent view of boosting (Mason et al., 1999; Friedman, 1999a) that extends MMP by “boosting” in new features. This approach uses simple binary classification or regression to improve performance of MMP imitation learning, and naturally extends to the class of structured maximum margin prediction problems (Taskar et al., 2005). Our technique is applied to navigation and planning problems for outdoor mobile robots and robotic legged locomotion. 1
Reference: text
sentIndex sentText sentNum sentScore
1 The Maximum Margin Planning (MMP) (Ratliff et al., 2006) algorithm solves imitation learning problems by learning linear mappings from features to cost functions in a planning domain. [sent-6, score-0.429]
2 The learned policy is the result of minimum-cost planning using these cost functions. [sent-7, score-0.399]
3 These mappings are chosen so that example policies (or trajectories) given by a teacher appear to be lower cost (with a loss-scaled margin) than any other policy for a given planning domain. [sent-8, score-0.471]
4 We provide a novel approach, MMPBoost, based on the functional gradient descent view of boosting (Mason et al., 1999; Friedman, 1999a). [sent-9, score-0.225]
5 This approach uses simple binary classification or regression to improve performance of MMP imitation learning, and naturally extends to the class of structured maximum margin prediction problems (Taskar et al., 2005). [sent-11, score-0.255]
6 Our technique is applied to navigation and planning problems for outdoor mobile robots and robotic legged locomotion. [sent-13, score-0.492]
7 (Ratliff et al., 2006) demonstrated that imitation learning of long-horizon and goal-directed behavior can be naturally formulated as a structured prediction problem over a space of policies or system trajectories. [sent-16, score-0.265]
8 In this work, the authors demonstrate that efficient planning algorithms (e.g., A*) can be used to solve the resulting inference problem. [sent-17, score-0.192]
9 In essence, the algorithm attempts to linearly combine features into costs so that the resulting cost functions make demonstrated example policies appear optimal by a margin over all other policies. [sent-20, score-0.354]
10 Unfortunately, this Maximum Margin Planning (MMP) approach, as well as related techniques for maximum margin structured learning developed in (Taskar et al., 2005), is limited to cost functions that are linear in a set of prespecified features. [sent-22, score-0.194]
11 Following the functional gradient view of boosting (Mason et al., 1999) or similarly (Friedman, 1999a), we propose an alternate extension to Maximum Margin Planning specifically, and maximum margin structured learning generally, in which we perform subgradient descent in the space of cost functions rather than within any fixed parameterization. [sent-26, score-0.363]
12 In this way, we show that we can “boost in” new features using simple classification, which helps us solve a more difficult structured prediction problem. [sent-27, score-0.156]
13 The application of boosting to structured learning techniques was first explored in (Dietterich et al., 2004). [sent-29, score-0.227]
14 First, using only smoothed versions of an overhead image as input, we show that MMPBoost is able to closely match human navigation performance on the task of navigating outdoor terrain. [sent-33, score-0.237]
15 Next we demonstrate that we can develop a local obstacle detection/avoidance control system for an autonomous outdoor robot by observing an expert teleoperator drive. [sent-35, score-0.335]
16 Finally, we demonstrate on legged locomotion problems the use of a slow but highly accurate planner to train a fast, approximate planner using MMPBoost. [sent-36, score-0.846]
17 Our cost (negative reward) functions are learned from supervised trajectories to produce policies that mimic the demonstrated behavior. [sent-41, score-0.309]
18 A cost function over an MDP M is defined through this space as c(f_M), where f_M : M → X denotes a mapping from state-action pairs to points in the base feature space, and c is a cost function over X. [sent-46, score-0.325]
19 Intuitively, each state-action pair in the MDP has an associated feature vector, and the cost of that state-action pair is a function of that vector. [sent-47, score-0.155]
20 Each training instance consists of an MDP with transition probabilities p_i and state-action pairs (s_i, a_i) ∈ M_i, over which d-dimensional feature vectors drawn from the base feature space X are arranged as a d × |M_i| feature matrix F_i. [sent-49, score-0.19]
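To make that data layout concrete, here is a minimal sketch, assuming a NumPy representation of our own choosing (the function names are hypothetical, not the paper's):

```python
import numpy as np

def linear_cost_map(w, F):
    # w: (d,) weight vector over base features.
    # F: (d, m) feature matrix, one column per state-action pair.
    # Returns the (m,) vector of costs the planner searches over.
    return w @ F

def loss_augment(costs, loss):
    # Subtracting the per-state loss makes paths that disagree with
    # the example look cheaper to the planner, which is how the
    # loss-scaled margin is enforced during training.
    return costs - loss
```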
21 It is useful for some problems, such as robot path planning, to imagine representing the features as a set of maps and the examples as paths through those maps. [sent-58, score-0.443]
22 We then present an intuitive and algorithmic exposition of the boosted version of this algorithm, which we use to learn a nonlinear cost function. [sent-62, score-0.18]
23 2 Structured boosting of MMP Maximum margin planning in its original formulation assumed the cost map is a linear function of a set of prespecified features. [sent-75, score-0.549]
24 In this section, we describe at an intuitive and algorithmic level a boosting procedure for learning a nonlinear function of our base features. [sent-78, score-0.169]
25 For clarity of exposition, a full derivation in terms of the functional gradient descent view of boosting (Mason et al., 1999) is deferred to Section 3. [sent-79, score-0.225]
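As a hedged sketch of that derivation, in our own notation rather than the paper's, the risk being descended is roughly:

```latex
% Per-example MMP risk: example-path cost minus the cheapest
% loss-augmented competing path (notation is ours).
\mathcal{R}[c] = \sum_i \left(
  \sum_{(s,a) \in \xi_i^{\mathrm{ex}}} c\!\left(f(s,a)\right)
  - \min_{\xi} \sum_{(s,a) \in \xi}
      \left[ c\!\left(f(s,a)\right) - \ell_i(s,a) \right]
\right)
```

On this reading, the negative functional gradient places weight +1 on feature points along the loss-augmented planned path and −1 on points along the example path, which matches the labeling used in the procedure below.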
26 This gradient boosting framework serves as a reduction (Beygelzimer et al., 2005) [sent-82, score-0.176]
27 from the problem of finding good features for structured prediction to a problem of simple classification. [sent-83, score-0.156]
28 • Run the planner over this loss-augmented cost map to get the best loss-augmented path. [sent-86, score-0.527]
29 Presumably, when the current feature set is not yet expressive enough, this path will differ significantly from the example path. [sent-87, score-0.179]
30 • Form positive examples by gathering the feature vectors encountered along this loss-augmented path, {(x_planned^(i), +1)}, and negative examples by gathering the feature vectors encountered along the example path, {(x_example^(j), −1)} (see the sketch after this list). [sent-88, score-0.408]
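A minimal sketch of one such boosting round follows; the planner, loss map, and base-learner interfaces are assumptions of ours, not the paper's API:

```python
import numpy as np

def mmpboost_round(examples, plan, fit_base_learner, cost_fn):
    # examples: iterable of (F, example_path, loss) with F a (d, m)
    # feature matrix, example_path the teacher's state indices, and
    # loss an (m,) per-state loss map.
    X, y = [], []
    for F, example_path, loss in examples:
        costs = np.apply_along_axis(cost_fn, 0, F)  # current cost map
        planned_path = plan(costs - loss)           # loss-augmented plan
        for s in planned_path:                      # planned path: +1
            X.append(F[:, s]); y.append(+1.0)
        for s in example_path:                      # example path: -1
            X.append(F[:, s]); y.append(-1.0)
    # The fitted learner becomes a new feature; on our reading, linear
    # MMP is then re-run over the augmented feature set to weight it.
    return fit_base_learner(np.asarray(X), np.asarray(y))
```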
31 The plot on the right compares the boosting objective function value (red) and the loss on a held-out set (blue) per boosting iteration for linear MMP (dashed) and boosted MMP (solid). [sent-92, score-0.325]
32 4 Applications In this section we demonstrate on three diverse problems how MMPBoost improves performance in navigation and planning tasks. [sent-97, score-0.241]
33 The planner being trained was an 8-connected implementation of A*. [sent-103, score-0.363]
34 The upper right panel on the left side of Figure 1 shows the grayscale overhead image of the holdout region used for testing. [sent-105, score-0.223]
35 The features are particularly difficult for MMP since the space of cost maps it considers for this problem consists of only linear combinations of the same image at different resolutions. [sent-107, score-0.22]
36 The lower left panel on the left side of Figure 1 shows that the best cost map MMP was able to find within this space was largely just a map with uniformly high cost everywhere. [sent-111, score-0.362]
37 The learned cost map was largely uninformative, causing the planner to choose the straight-line path between endpoints. [sent-112, score-0.755]
38 In this instance, we used regression trees with 10 terminal nodes as our dictionary H, and trained them on the base features to match the functional gradient as described in Section 3. [sent-114, score-0.208]
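One way to realize this dictionary H is sketched below with scikit-learn; the library choice and wrapper are ours, not the paper's:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_base_learner(X, y):
    # Regression trees capped at 10 terminal (leaf) nodes, trained to
    # fit the +/-1 functional-gradient targets gathered above.
    tree = DecisionTreeRegressor(max_leaf_nodes=10).fit(X, y)
    # Return a scalar-valued feature usable on a single base-feature
    # vector x.
    return lambda x: float(tree.predict(np.atleast_2d(x))[0])
```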
39 In the learned cost map, several boosted features combine to make the lowest-cost path (red line) match the human’s preference of staying on the path. [sent-117, score-0.5]
40 The robot is currently located in the center of the cost map. [sent-118, score-0.317]
41 Right: A graph of the average A* path loss over the examples in each round of boosting. [sent-119, score-0.171]
42 In just a few rounds the learned system exceeds the performance of the carefully engineered system. [sent-120, score-0.235]
43 The first feature shown in Figure 1 is interesting in that it largely represents the result of a path detector. [sent-124, score-0.213]
44 (e.g., satellite) image directly to a cost map with no human intervention or feature engineering. [sent-129, score-0.261]
45 2 Learning from Human Driving Demonstration We next consider the problem of learning to mimic human driving of an autonomous robot in complex outdoor and off-road terrain. [sent-131, score-0.336]
46 We assume that a coarse global planning problem has been solved using overhead imagery and the MMPBoost application presented above. [sent-132, score-0.293]
47 Our goal is to use the vehicle’s onboard sensors to detect obstacles which were not visible in the overhead imagery or were not present when the imagery was collected. [sent-135, score-0.248]
48 From these sensors we compute a set of base features for each cell in a discretized 2-D map of the local area. [sent-137, score-0.236]
49 These base features include quantities such as the estimated elevation and slope of the ground plane in the cell, and the average color and density of ladar points in the cell for various height ranges above the estimated ground plane. [sent-138, score-0.227]
50 As training data we use logged sensor data from several kilometers of teleoperation of a large mobile robot through a challenging outdoor environment by an experienced operator. [sent-139, score-0.349]
51 In the previous example, the algorithm had access to both the complete path demonstrated by the teacher and the same input data (overhead image) the teacher used while generating the path. [sent-140, score-0.258]
52 For this experiment we assume that the next 10 m of the path driven by the vehicle after time t matches the operator’s intended path at time t, and only compute loss over that section of the path. [sent-142, score-0.361]
53 In practice this means that we create a set of local examples from each teleoperated path by sampling the internal state of the robot at discrete points in time. [sent-143, score-0.368]
54 At each time t we record the feature map generated by the robot's onboard sensors over the local 10 m-radius area surrounding it, as well as the path the robot followed to the boundary of that area. [sent-144, score-0.613]
55 Additionally, we model the operator’s prior knowledge of the environment and their sensing of obstacles beyond the 10 m range by using our global planning solution to generate the minimum path costs from a set of points on the boundary of each local map to the global goal. [sent-145, score-0.558]
56 The operator also attempted to match the range at which he reacted to obstacles not visible in the overhead data (such as vehicles that were placed in the robot’s path) with the 10 m radius of the local map. [sent-146, score-0.152]
57 An 8-connected variant of A* then chooses a path to one of the points on the boundary of the local map, minimizing the sum of the cost accumulated along the path to the boundary point and the cost-to-goal from that boundary point to the global goal. [sent-147, score-0.47]
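A sketch of that selection rule, with hypothetical names for the local planner and the precomputed global cost-to-goal table:

```python
def choose_local_path(local_plan, local_costs, boundary_cells, cost_to_goal):
    # For each boundary cell b, add the best local path cost to b and
    # the global cost-to-goal from b; keep the cheapest combination.
    best_path, best_total = None, float("inf")
    for b in boundary_cells:
        path, path_cost = local_plan(local_costs, goal=b)
        total = path_cost + cost_to_goal[b]
        if total < best_total:
            best_path, best_total = path, total
    return best_path
```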
58 The results of running MMPBoost on the 301 examples in our data set are compared in Figure 2 to the results given by the current human-engineered cost production system used on the robot. [sent-149, score-0.526]
59 The engineered system is the result of many man-hours of parameter tuning over weeks of field testing. [sent-150, score-0.179]
60 The learned system started with the engineered feature maps, and then boosted in additional features as necessary. [sent-151, score-0.394]
61 After just a few iterations of boosting, the learned system displays significantly lower average loss than the engineered system, and corrects important navigational errors such as the one shown. [sent-152, score-0.5]
62 Algorithms have been developed which plan for foot placement in these environments, and have been successfully used on several biped robots (Chestnutt et al., 2005). [sent-156, score-0.277]
63 In these cases, the planner evaluates various steps the robot can execute to find a sequence of steps that is safe and within the robot’s capabilities. [sent-158, score-0.566]
64 Another approach to legged robot navigation uses local techniques to reactively adjust foot placement while following a predefined path (Yagi & Lumelsky, 1999). [sent-159, score-0.522]
65 This approach can fall into local minima or become stuck if the predefined path does not have valid footholds along its entire length. [sent-160, score-0.165]
66 Footstep planners have been shown to produce very good footstep sequences, allowing legged robots to efficiently traverse a wide variety of terrain. [sent-161, score-0.342]
67 This approach uses much of the robot’s unique abilities, but is more computationally expensive than traditional mobile robot planners. [sent-162, score-0.258]
68 Footstep planning occurs in a high-dimensional state space and therefore is often too computationally burdensome to be used for real-time replanning, limiting its scope of application to largely static environments. [sent-163, score-0.226]
69 For most applications, the footstep planner implicitly solves a low-dimensional navigational problem simultaneously with the footstep placement problem. [sent-164, score-0.87]
70 Using MMPBoost, we use body trajectories produced by the footstep planner to learn the nuances of this navigational problem in the form of a 2-dimensional cost map. [sent-165, score-0.712]
71 We are training a simple navigational planner to effectively reproduce the body trajectories that typically result from a sophisticated footstep planner. [sent-167, score-0.712]
72 We could use the resulting navigation planner in combination with a reactive solution (as in (Yagi & Lumelsky, 1999)). [sent-168, score-0.412]
73 Instead, we pursue a hybrid approach of using the resulting simple planner as a heuristic to guide the footstep planner. [sent-169, score-0.678]
74 Using a 2-dimensional robot planner as a heuristic has been shown previously (Chestnutt et al., 2005) [sent-170, score-0.738]
75 to dramatically improve planning performance, but the planner must be manually tuned to provide costs that serve as reasonable approximations of the true cost. [sent-171, score-0.597]
76 Poorly informed admissible heuristics can cause the planner to erroneously attempt numerous dead ends before happening upon the optimal solution. [sent-173, score-0.483]
77 On the other hand, well-informed inadmissible heuristics can pull the planner quickly toward a solution whose cost, though suboptimal, is very close to the minimum. [sent-174, score-0.459]
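To illustrate the trade-off, here is a minimal weighted-A* sketch in which eps > 1 inflates the heuristic; this is our illustration, not the paper's planner:

```python
import heapq
import itertools

def weighted_astar(start, goal, neighbors, step_cost, h, eps=2.0):
    # Expands nodes by f = g + eps * h. With eps = 1 and admissible h
    # the result is optimal; eps > 1 pulls the search toward the goal
    # faster at the price of bounded suboptimality.
    tie = itertools.count()  # tie-breaker so the heap never compares states
    frontier = [(eps * h(start), 0.0, next(tie), start, [start])]
    best_g = {}
    while frontier:
        _, g, _, s, path = heapq.heappop(frontier)
        if s == goal:
            return path, g
        if best_g.get(s, float("inf")) <= g:
            continue
        best_g[s] = g
        for t in neighbors(s):
            g2 = g + step_cost(s, t)
            heapq.heappush(frontier,
                           (g2 + eps * h(t), g2, next(tie), t, path + [t]))
    return None, float("inf")
```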
78 Figure 3: Left is an image of the robot used for the quadruped experiments. [sent-175, score-0.817]
79 The center pair of images shows a typical height map (top), and the corresponding learned cost map (bottom) from a holdout set of the biped planning experiments. [sent-176, score-0.716]
80 Notice how platform-like regions are given low costs toward the center but higher costs toward the edges, and how the learned features interact to form low-cost chutes that direct the planner through complicated regions. [sent-177, score-0.731]
81 Right: two histograms showing the distribution of the speed ratio of the admissible Euclidean heuristic (top) and the engineered heuristic (bottom) relative to an uninflated MMPBoost heuristic, on a holdout set of 90 examples from the biped experiment. [sent-178, score-0.706]
82 [Figure 4 table: mean and std of the cost difference and speedup for the biped admissible, biped inflated, and quadruped inflated MMPBoost heuristics vs. the Euclidean heuristic; the numeric entries are not recoverable from the extraction.]
84 Figure 4: Statistics comparing the MMPBoost heuristic to both a Euclidean and a discrete navigational heuristic. [sent-212, score-0.255]
85 This lower-dimensional planner is then used in the heuristic to efficiently and intelligently guide the footstep planner toward the goal, effectively displacing a large portion of the computational burden. [sent-214, score-0.573]
86 Our procedure is to run a footstep planner over a series of randomly drawn two-dimensional terrain height maps that describe the world the robot is to traverse. [sent-216, score-0.842]
87 The footstep planner produces trajectories of the robot from start to goal over the terrain map. [sent-217, score-0.835]
88 We then apply MMPBoost again, using regression trees with 10 terminal nodes as the base classifier, to learn cost features and weights that turn height maps into cost functions, so that a 2-dimensional planner over the cost map mimics the body trajectory. [sent-218, score-0.97]
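A sketch of how the learned 2-dimensional cost map could then back the footstep planner's heuristic; the cost_to_goal interface and cell() accessor are hypothetical:

```python
def learned_heuristic(grid_planner, cost_map, goal_cell):
    # Run the cheap 2-D planner once to get the cost-to-goal for every
    # cell of the learned cost map, then score footstep states by the
    # cell the robot's body occupies.
    cost_to_goal = grid_planner.cost_to_goal(cost_map, goal_cell)
    return lambda footstep_state: cost_to_goal[footstep_state.cell()]
```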
89 We apply the planner to two robots: first the HRP-2 biped robot and second the LittleDog quadruped robot. [sent-219, score-0.804]
90 The cost diff column reports, on average, the extent to which the cost of planning under the MMPBoost heuristic is above or below that of the opposing heuristic. [sent-224, score-0.601]
91 Boston Dynamics designed the robot and provided the motion capture system used in the tests. [sent-230, score-0.227]
92 A video demonstrating the robot walking across a terrain board is provided with this paper. [sent-231, score-0.242]
93 In this case, both the biped and quadruped planners using the learned heuristic significantly outperform their counterparts under a Euclidean heuristic. [sent-233, score-0.793]
94 While the Euclidean heuristic often gets stuck for long periods of time in local minima, both the learned heuristic and, to a lesser extent, the engineered heuristic are able to navigate efficiently around these pitfalls. [sent-234, score-0.51]
95 We note that A* biped performance gains were considerably higher: we believe this is because orientation plays a large role in planning for the quadruped. [sent-235, score-0.341]
96 5 Conclusions and Future Work MMPBoost combines the powerful ideas of structured prediction and functional gradient descent, enabling learning by demonstration for a wide variety of applications. [sent-236, score-0.18]
97 Future work will include extending the learning of mobile robot path planning to more complex configuration spaces that allow for modeling of vehicle dynamics. [sent-237, score-0.64]
98 Further, we will pursue applications of the gradient boosting approach to other problems of structured prediction. [sent-238, score-0.218]
99 Large margin methods for structured and interdependent output variables. [sent-307, score-0.158]
100 4 The best-first quadruped planner under the MMPBoost heuristic is on average approximately 1100 times faster than under the Euclidean heuristic in terms of the number of nodes searched. [sent-316, score-0.724]
wordName wordTfidf (topN-words)
[('oost', 0.441), ('planner', 0.363), ('mp', 0.343), ('mmp', 0.298), ('robot', 0.203), ('planning', 0.192), ('footstep', 0.179), ('engineered', 0.155), ('biped', 0.149), ('path', 0.138), ('heuristic', 0.136), ('navigational', 0.119), ('cost', 0.114), ('boosting', 0.113), ('ratliff', 0.104), ('teacher', 0.091), ('quadruped', 0.089), ('margin', 0.08), ('structured', 0.078), ('holdout', 0.075), ('legged', 0.075), ('overhead', 0.073), ('imitation', 0.071), ('boosted', 0.066), ('subgradient', 0.066), ('robots', 0.062), ('chestnutt', 0.06), ('outdoor', 0.059), ('ated', 0.059), ('fi', 0.059), ('base', 0.056), ('learned', 0.056), ('admissible', 0.055), ('mobile', 0.055), ('vehicle', 0.052), ('obstacles', 0.052), ('features', 0.052), ('trajectories', 0.051), ('euclidean', 0.05), ('wt', 0.05), ('map', 0.05), ('navigation', 0.049), ('mason', 0.047), ('diff', 0.045), ('locomotion', 0.045), ('lumelsky', 0.045), ('onboard', 0.045), ('yagi', 0.045), ('std', 0.044), ('taskar', 0.042), ('costs', 0.042), ('heuristics', 0.041), ('mdp', 0.041), ('feature', 0.041), ('terrain', 0.039), ('prespeci', 0.039), ('policy', 0.037), ('policies', 0.037), ('et', 0.036), ('largely', 0.034), ('loss', 0.033), ('environment', 0.032), ('toward', 0.031), ('height', 0.03), ('human', 0.03), ('elevation', 0.03), ('ladar', 0.03), ('placement', 0.03), ('li', 0.03), ('demonstrated', 0.029), ('cell', 0.029), ('imagery', 0.028), ('maps', 0.028), ('speedup', 0.027), ('terminal', 0.027), ('vs', 0.027), ('gradient', 0.027), ('local', 0.027), ('image', 0.026), ('robotics', 0.026), ('prediction', 0.026), ('beygelzimer', 0.026), ('imitate', 0.026), ('planners', 0.026), ('encountered', 0.025), ('descent', 0.025), ('boundary', 0.025), ('region', 0.025), ('system', 0.024), ('functional', 0.024), ('grayscale', 0.024), ('searched', 0.024), ('staying', 0.024), ('informed', 0.024), ('friedman', 0.022), ('sensors', 0.022), ('paths', 0.022), ('trees', 0.022), ('autonomous', 0.022), ('mimic', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 47 nips-2006-Boosting Structured Prediction for Imitation Learning
Author: J. A. Bagnell, Joel Chestnutt, David M. Bradley, Nathan D. Ratliff
Abstract: The Maximum Margin Planning (MMP) (Ratliff et al., 2006) algorithm solves imitation learning problems by learning linear mappings from features to cost functions in a planning domain. The learned policy is the result of minimum-cost planning using these cost functions. These mappings are chosen so that example policies (or trajectories) given by a teacher appear to be lower cost (with a loss-scaled margin) than any other policy for a given planning domain. We provide a novel approach, MMPBoost, based on the functional gradient descent view of boosting (Mason et al., 1999; Friedman, 1999a) that extends MMP by “boosting” in new features. This approach uses simple binary classification or regression to improve performance of MMP imitation learning, and naturally extends to the class of structured maximum margin prediction problems (Taskar et al., 2005). Our technique is applied to navigation and planning problems for outdoor mobile robots and robotic legged locomotion. 1
2 0.21573989 43 nips-2006-Bayesian Model Scoring in Markov Random Fields
Author: Sridevi Parise, Max Welling
Abstract: Scoring structures of undirected graphical models by means of evaluating the marginal likelihood is very hard. The main reason is the presence of the partition function which is intractable to evaluate, let alone integrate over. We propose to approximate the marginal likelihood by employing two levels of approximation: we assume normality of the posterior (the Laplace approximation) and approximate all remaining intractable quantities using belief propagation and the linear response approximation. This results in a fast procedure for model scoring. Empirically, we find that our procedure has about two orders of magnitude better accuracy than standard BIC methods for small datasets, but deteriorates when the size of the dataset grows. 1
3 0.13008653 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation
Author: David B. Grimes, Daniel R. Rashid, Rajesh P. Rao
Abstract: Learning by imitation represents an important mechanism for rapid acquisition of new behaviors in humans and robots. A critical requirement for learning by imitation is the ability to handle uncertainty arising from the observation process as well as the imitator’s own dynamics and interactions with the environment. In this paper, we present a new probabilistic method for inferring imitative actions that takes into account both the observations of the teacher as well as the imitator’s dynamics. Our key contribution is a nonparametric learning method which generalizes to systems with very different dynamics. Rather than relying on a known forward model of the dynamics, our approach learns a nonparametric forward model via exploration. Leveraging advances in approximate inference in graphical models, we show how the learned forward model can be directly used to plan an imitating sequence. We provide experimental results for two systems: a biomechanical model of the human arm and a 25-degrees-of-freedom humanoid robot. We demonstrate that the proposed method can be used to learn appropriate motor inputs to the model arm which imitates the desired movements. A second set of results demonstrates dynamically stable full-body imitation of a human teacher by the humanoid robot. 1
4 0.11663319 136 nips-2006-Multi-Instance Multi-Label Learning with Application to Scene Classification
Author: Zhi-hua Zhou, Min-ling Zhang
Abstract: In this paper, we formalize multi-instance multi-label learning, where each training example is associated with not only multiple instances but also multiple class labels. Such a problem can occur in many real-world tasks, e.g. an image usually contains multiple patches each of which can be described by a feature vector, and the image can belong to multiple categories since its semantics can be recognized in different ways. We analyze the relationship between multi-instance multi-label learning and the learning frameworks of traditional supervised learning, multi-instance learning and multi-label learning. Then, we propose the MIMLBoost and MIMLSVM algorithms, which achieve good performance in an application to scene classification. 1
5 0.07525035 50 nips-2006-Chained Boosting
Author: Christian R. Shelton, Wesley Huie, Kin F. Kan
Abstract: We describe a method to learn to make sequential stopping decisions, such as those made along a processing pipeline. We envision a scenario in which a series of decisions must be made as to whether to continue to process. Further processing costs time and resources, but may add value. Our goal is to create, based on historic data, a series of decision rules (one at each stage in the pipeline) that decide, based on information gathered up to that point, whether to continue processing the part. We demonstrate how our framework encompasses problems from manufacturing to vision processing. We derive a quadratic (in the number of decisions) bound on testing performance and provide empirical results on object detection. 1
6 0.069934413 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization
7 0.06347052 61 nips-2006-Convex Repeated Games and Fenchel Duality
8 0.060557898 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics
9 0.054654673 106 nips-2006-Large Margin Hidden Markov Models for Automatic Speech Recognition
10 0.053971305 137 nips-2006-Multi-Robot Negotiation: Approximating the Set of Subgame Perfect Equilibria in General-Sum Stochastic Games
11 0.052526191 38 nips-2006-Automated Hierarchy Discovery for Planning in Partially Observable Environments
12 0.051812612 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning
13 0.048837356 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming
14 0.047734872 172 nips-2006-Scalable Discriminative Learning for Natural Language Parsing and Translation
15 0.047205452 154 nips-2006-Optimal Change-Detection and Spiking Neurons
16 0.045891304 125 nips-2006-Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning
17 0.045332365 124 nips-2006-Linearly-solvable Markov decision problems
18 0.045022476 119 nips-2006-Learning to Rank with Nonsmooth Cost Functions
19 0.044962611 130 nips-2006-Max-margin classification of incomplete data
20 0.044879552 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy
topicId topicWeight
[(0, -0.161), (1, 0.018), (2, -0.044), (3, -0.106), (4, 0.033), (5, -0.003), (6, -0.038), (7, -0.022), (8, 0.033), (9, -0.016), (10, -0.01), (11, -0.036), (12, -0.024), (13, 0.06), (14, 0.031), (15, 0.02), (16, -0.027), (17, 0.034), (18, 0.043), (19, 0.058), (20, 0.191), (21, 0.173), (22, -0.029), (23, 0.134), (24, -0.02), (25, 0.198), (26, 0.007), (27, -0.079), (28, 0.02), (29, -0.076), (30, -0.167), (31, -0.01), (32, 0.046), (33, 0.286), (34, -0.016), (35, -0.22), (36, 0.108), (37, 0.247), (38, 0.011), (39, -0.215), (40, 0.088), (41, -0.039), (42, -0.037), (43, 0.038), (44, -0.081), (45, 0.077), (46, 0.022), (47, -0.113), (48, -0.079), (49, 0.068)]
simIndex simValue paperId paperTitle
same-paper 1 0.93147731 47 nips-2006-Boosting Structured Prediction for Imitation Learning
Author: J. A. Bagnell, Joel Chestnutt, David M. Bradley, Nathan D. Ratliff
Abstract: The Maximum Margin Planning (MMP) (Ratliff et al., 2006) algorithm solves imitation learning problems by learning linear mappings from features to cost functions in a planning domain. The learned policy is the result of minimum-cost planning using these cost functions. These mappings are chosen so that example policies (or trajectories) given by a teacher appear to be lower cost (with a loss-scaled margin) than any other policy for a given planning domain. We provide a novel approach, MMPBoost, based on the functional gradient descent view of boosting (Mason et al., 1999; Friedman, 1999a) that extends MMP by “boosting” in new features. This approach uses simple binary classification or regression to improve performance of MMP imitation learning, and naturally extends to the class of structured maximum margin prediction problems (Taskar et al., 2005). Our technique is applied to navigation and planning problems for outdoor mobile robots and robotic legged locomotion. 1
2 0.58268362 43 nips-2006-Bayesian Model Scoring in Markov Random Fields
Author: Sridevi Parise, Max Welling
Abstract: Scoring structures of undirected graphical models by means of evaluating the marginal likelihood is very hard. The main reason is the presence of the partition function which is intractable to evaluate, let alone integrate over. We propose to approximate the marginal likelihood by employing two levels of approximation: we assume normality of the posterior (the Laplace approximation) and approximate all remaining intractable quantities using belief propagation and the linear response approximation. This results in a fast procedure for model scoring. Empirically, we find that our procedure has about two orders of magnitude better accuracy than standard BIC methods for small datasets, but deteriorates when the size of the dataset grows. 1
3 0.57406002 136 nips-2006-Multi-Instance Multi-Label Learning with Application to Scene Classification
Author: Zhi-hua Zhou, Min-ling Zhang
Abstract: In this paper, we formalize multi-instance multi-label learning, where each training example is associated with not only multiple instances but also multiple class labels. Such a problem can occur in many real-world tasks, e.g. an image usually contains multiple patches each of which can be described by a feature vector, and the image can belong to multiple categories since its semantics can be recognized in different ways. We analyze the relationship between multi-instance multi-label learning and the learning frameworks of traditional supervised learning, multi-instance learning and multi-label learning. Then, we propose the MIMLBoost and MIMLSVM algorithms, which achieve good performance in an application to scene classification. 1
4 0.40736398 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation
Author: David B. Grimes, Daniel R. Rashid, Rajesh P. Rao
Abstract: Learning by imitation represents an important mechanism for rapid acquisition of new behaviors in humans and robots. A critical requirement for learning by imitation is the ability to handle uncertainty arising from the observation process as well as from the imitator’s own dynamics and interactions with the environment. In this paper, we present a new probabilistic method for inferring imitative actions that takes into account both the observations of the teacher and the imitator’s dynamics. Our key contribution is a nonparametric learning method which generalizes to systems with very different dynamics. Rather than relying on a known forward model of the dynamics, our approach learns a nonparametric forward model via exploration. Leveraging advances in approximate inference in graphical models, we show how the learned forward model can be used directly to plan an imitating sequence. We provide experimental results for two systems: a biomechanical model of the human arm and a 25-degree-of-freedom humanoid robot. We demonstrate that the proposed method can be used to learn appropriate motor inputs to the model arm, which then imitates the desired movements. A second set of results demonstrates dynamically stable full-body imitation of a human teacher by the humanoid robot.
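As a rough illustration of the explore-then-plan recipe, and emphatically not the authors' inference-based planner, one can learn the forward model with off-the-shelf Gaussian-process regression and roll candidate action sequences through it:

```python
# A toy stand-in: learn x_{t+1} ~ f(x_t, u_t) from exploration data with GP
# regression, then simulate action sequences. The paper instead plans by
# approximate inference in a graphical model; that step is not shown here.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def learn_forward_model(states, actions, next_states):
    X = np.hstack([states, actions])        # inputs are (x_t, u_t) pairs
    return GaussianProcessRegressor().fit(X, next_states)

def rollout(model, x0, action_seq):
    x, traj = np.asarray(x0, float), [np.asarray(x0, float)]
    for u in action_seq:
        x = model.predict(np.hstack([x, u]).reshape(1, -1))[0]
        traj.append(x)
    return np.array(traj)
```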
5 0.3729192 21 nips-2006-AdaBoost is Consistent
Author: Peter L. Bartlett, Mikhail Traskin
Abstract: The risk, or probability of error, of the classifier produced by the AdaBoost algorithm is investigated. In particular, we consider the stopping strategy to be used in AdaBoost to achieve universal consistency. We show that, provided AdaBoost is stopped after $n^{\nu}$ iterations (for sample size $n$ and $\nu < 1$), the sequence of risks of the classifiers it produces approaches the Bayes risk whenever the Bayes risk satisfies $L^* > 0$.
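The stopping strategy is directly usable in practice. A minimal sketch with scikit-learn's AdaBoostClassifier, assuming only that the theorem's $n^{\nu}$ cap is what we want to enforce:

```python
# Cap the number of boosting rounds at n**nu for some nu < 1.
from sklearn.ensemble import AdaBoostClassifier

def consistent_adaboost(X, y, nu=0.5):
    n_rounds = max(1, int(len(X) ** nu))    # stop after n^nu iterations
    return AdaBoostClassifier(n_estimators=n_rounds).fit(X, y)
```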
6 0.33171332 74 nips-2006-Efficient Structure Learning of Markov Networks using $L_1$-Regularization
7 0.32807347 25 nips-2006-An Application of Reinforcement Learning to Aerobatic Helicopter Flight
8 0.32145858 119 nips-2006-Learning to Rank with Nonsmooth Cost Functions
9 0.29009783 170 nips-2006-Robotic Grasping of Novel Objects
10 0.27907702 140 nips-2006-Multiple Instance Learning for Computer Aided Diagnosis
11 0.23914659 106 nips-2006-Large Margin Hidden Markov Models for Automatic Speech Recognition
12 0.23702812 156 nips-2006-Ordinal Regression by Extended Binary Classification
13 0.23403721 138 nips-2006-Multi-Task Feature Learning
14 0.23358874 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy
15 0.2321181 50 nips-2006-Chained Boosting
16 0.22746749 201 nips-2006-Using Combinatorial Optimization within Max-Product Belief Propagation
17 0.22411679 129 nips-2006-Map-Reduce for Machine Learning on Multicore
18 0.22187628 105 nips-2006-Large Margin Component Analysis
19 0.21630888 143 nips-2006-Natural Actor-Critic for Road Traffic Optimisation
20 0.21562192 107 nips-2006-Large Margin Multi-channel Analog-to-Digital Conversion with Applications to Neural Prosthesis
topicId  topicWeight
1        0.073
2        0.013
3        0.036
7        0.068
9        0.024
12       0.013
16       0.292
20       0.011
22       0.051
44       0.066
57       0.121
65       0.067
69       0.041
71       0.015
92       0.016
simIndex simValue paperId paperTitle
same-paper 1 0.78873158 47 nips-2006-Boosting Structured Prediction for Imitation Learning
Author: J. A. Bagnell, Joel Chestnutt, David M. Bradley, Nathan D. Ratliff
Abstract: The Maximum Margin Planning (MMP) (Ratliff et al., 2006) algorithm solves imitation learning problems by learning linear mappings from features to cost functions in a planning domain. The learned policy is the result of minimum-cost planning using these cost functions. These mappings are chosen so that example policies (or trajectories) given by a teacher appear to be lower cost (with a loss-scaled margin) than any other policy for a given planning domain. We provide a novel approach, MMPBoost, based on the functional gradient descent view of boosting (Mason et al., 1999; Friedman, 1999a) that extends MMP by "boosting" in new features. This approach uses simple binary classification or regression to improve performance of MMP imitation learning, and naturally extends to the class of structured maximum margin prediction problems (Taskar et al., 2005). Our technique is applied to navigation and planning problems for outdoor mobile robots and robotic legged locomotion.
2 0.54427946 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation
Author: David B. Grimes, Daniel R. Rashid, Rajesh P. Rao
Abstract: Learning by imitation represents an important mechanism for rapid acquisition of new behaviors in humans and robots. A critical requirement for learning by imitation is the ability to handle uncertainty arising from the observation process as well as from the imitator’s own dynamics and interactions with the environment. In this paper, we present a new probabilistic method for inferring imitative actions that takes into account both the observations of the teacher and the imitator’s dynamics. Our key contribution is a nonparametric learning method which generalizes to systems with very different dynamics. Rather than relying on a known forward model of the dynamics, our approach learns a nonparametric forward model via exploration. Leveraging advances in approximate inference in graphical models, we show how the learned forward model can be used directly to plan an imitating sequence. We provide experimental results for two systems: a biomechanical model of the human arm and a 25-degree-of-freedom humanoid robot. We demonstrate that the proposed method can be used to learn appropriate motor inputs to the model arm, which then imitates the desired movements. A second set of results demonstrates dynamically stable full-body imitation of a human teacher by the humanoid robot.
3 0.543944 34 nips-2006-Approximate Correspondences in High Dimensions
Author: Kristen Grauman, Trevor Darrell
Abstract: Pyramid intersection is an efficient method for computing an approximate partial matching between two sets of feature vectors. We introduce a novel pyramid embedding based on a hierarchy of non-uniformly shaped bins that takes advantage of the underlying structure of the feature space and remains accurate even for sets with high-dimensional feature vectors. The matching similarity is computed in linear time and forms a Mercer kernel. Whereas previous matching approximation algorithms suffer from distortion factors that increase linearly with the feature dimension, we demonstrate that our approach can maintain constant accuracy even as the feature dimension increases. When used as a kernel in a discriminative classifier, our approach achieves improved object recognition results over a state-of-the-art set kernel.
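For intuition, the sketch below implements the earlier uniform-grid pyramid match that this paper builds on; the paper's contribution is to replace the uniform bins with hierarchical, data-dependent ones, which this toy version does not do. Features are assumed to lie in [0, 1)^d.

```python
# Uniform-grid pyramid match via multi-resolution histogram intersection.
import numpy as np

def pyramid_match(A, B, levels=4):
    """A, B: (n, d) feature arrays; returns an approximate partial-match score."""
    score, prev = 0.0, 0.0
    for lvl in range(levels - 1, -1, -1):   # finest grid first
        bins = 2 ** lvl
        ha, hb = {}, {}
        for h, P in ((ha, A), (hb, B)):
            for key in map(tuple, np.floor(P * bins).astype(int)):
                h[key] = h.get(key, 0) + 1
        inter = sum(min(c, hb.get(k, 0)) for k, c in ha.items())
        # Weight matches that first appear at this level; coarser = cheaper.
        score += (inter - prev) / 2 ** (levels - 1 - lvl)
        prev = inter
    return score
```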
4 0.53519022 8 nips-2006-A Nonparametric Approach to Bottom-Up Visual Saliency
Author: Wolf Kienzle, Felix A. Wichmann, Matthias O. Franz, Bernhard Schölkopf
Abstract: This paper addresses the bottom-up influence of local image information on human eye movements. Most existing computational models use a set of biologically plausible linear filters, e.g., Gabor or Difference-of-Gaussians filters, as a front-end, the outputs of which are nonlinearly combined into a real number that indicates visual saliency. Unfortunately, this requires many design parameters such as the number, type, and size of the front-end filters, as well as the choice of nonlinearities, weighting and normalization schemes, etc., for which biological plausibility cannot always be justified. As a result, these parameters have to be chosen in a more or less ad hoc way. Here, we propose to learn a visual saliency model directly from human eye movement data. The model is rather simplistic and essentially parameter-free, and therefore contrasts recent developments in the field that usually aim at higher prediction rates at the cost of additional parameters and increasing model complexity. Experimental results show that, despite the lack of any biological prior knowledge, our model performs comparably to existing approaches, and in fact learns image features that resemble findings from several previous studies. In particular, its maximally excitatory stimuli have center-surround structure, similar to receptive fields in the early human visual system.
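A hedged reconstruction of the general recipe, not the authors' exact model: treat image patches at fixated locations as positives, patches at random locations as negatives, and learn saliency as the discriminant between them. All names and parameter values below are illustrative.

```python
# Learn a saliency discriminant from fixated vs. random image patches.
import numpy as np
from sklearn.svm import SVC

def learn_saliency(image, fixations, n_neg=500, patch=13, seed=0):
    rng = np.random.default_rng(seed)
    r, (H, W) = patch // 2, image.shape

    def grab(yy, xx):
        return image[yy - r:yy + r + 1, xx - r:xx + r + 1].ravel()

    pos = [grab(yy, xx) for yy, xx in fixations
           if r <= yy < H - r and r <= xx < W - r]
    neg = [grab(rng.integers(r, H - r), rng.integers(r, W - r))
           for _ in range(n_neg)]
    X = np.vstack(pos + neg)
    y = np.r_[np.ones(len(pos)), np.zeros(len(neg))]
    return SVC(probability=True).fit(X, y)  # saliency ~ P(fixated | patch)
```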
5 0.53130186 110 nips-2006-Learning Dense 3D Correspondence
Author: Florian Steinke, Volker Blanz, Bernhard Schölkopf
Abstract: Establishing correspondence between distinct objects is an important and nontrivial task: correctness of the correspondence hinges on properties which are difficult to capture in an a priori criterion. While previous work has used a priori criteria which in some cases led to very good results, the present paper explores whether it is possible to learn a combination of features that, for a given training set of aligned human heads, characterizes the notion of correct correspondence. By optimizing this criterion, we are then able to compute correspondence and morphs for novel heads.
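Illustrative only, with hypothetical names throughout: one simple way to learn a correspondence criterion from aligned training data is to score candidate matches with a weighted combination of per-feature differences, fit by logistic regression on match versus non-match pairs.

```python
# Learn a match-scoring criterion from ground-truth correspondences.
import numpy as np
from sklearn.linear_model import LogisticRegression

def learn_match_score(feats_a, feats_b, true_pairs, seed=0):
    """feats_a, feats_b: (n, d) feature arrays; true_pairs: list of (i, j)."""
    rng = np.random.default_rng(seed)
    pos = [np.abs(feats_a[i] - feats_b[j]) for i, j in true_pairs]
    neg = [np.abs(feats_a[i] - feats_b[rng.integers(len(feats_b))])
           for i, _ in true_pairs]
    X = np.vstack(pos + neg)
    y = np.r_[np.ones(len(pos)), np.zeros(len(neg))]
    return LogisticRegression().fit(X, y)   # higher P(match) = better pairing
```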
6 0.52782154 118 nips-2006-Learning to Model Spatial Dependency: Semi-Supervised Discriminative Random Fields
7 0.52463412 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment
8 0.52455044 42 nips-2006-Bayesian Image Super-resolution, Continued
9 0.52333719 119 nips-2006-Learning to Rank with Nonsmooth Cost Functions
10 0.52073693 160 nips-2006-Part-based Probabilistic Point Matching using Equivalence Constraints
11 0.51897824 43 nips-2006-Bayesian Model Scoring in Markov Random Fields
12 0.51689899 175 nips-2006-Simplifying Mixture Models through Function Approximation
13 0.51475614 79 nips-2006-Fast Iterative Kernel PCA
14 0.51394403 174 nips-2006-Similarity by Composition
15 0.51294982 188 nips-2006-Temporal and Cross-Subject Probabilistic Models for fMRI Prediction Tasks
16 0.512824 72 nips-2006-Efficient Learning of Sparse Representations with an Energy-Based Model
17 0.51263267 154 nips-2006-Optimal Change-Detection and Spiking Neurons
18 0.51193058 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization
19 0.51136398 115 nips-2006-Learning annotated hierarchies from relational data
20 0.51072621 185 nips-2006-Subordinate class recognition using relational object models