nips nips2012 nips2012-232 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jeremy Weiss, Sriraam Natarajan, David Page
Abstract: Learning temporal dependencies between variables over continuous time is an important and challenging task. Continuous-time Bayesian networks effectively model such processes but are limited by the number of conditional intensity matrices, which grows exponentially in the number of parents per variable. We develop a partition-based representation using regression trees and forests whose parameter spaces grow linearly in the number of node splits. Using a multiplicative assumption we show how to update the forest likelihood in closed form, producing efficient model updates. Our results show multiplicative forests can be learned from few temporal trajectories with large gains in performance and scalability.
Reference: text
sentIndex sentText sentNum sentScore
1 Continuous-time Bayesian networks effectively model such processes but are limited by the number of conditional intensity matrices, which grows exponentially in the number of parents per variable. [sent-8, score-0.403]
2 We develop a partition-based representation using regression trees and forests whose parameter spaces grow linearly in the number of node splits. [sent-9, score-0.205]
3 Using a multiplicative assumption we show how to update the forest likelihood in closed form, producing efficient model updates. [sent-10, score-0.402]
4 Our results show multiplicative forests can be learned from few temporal trajectories with large gains in performance and scalability. [sent-11, score-0.275]
5 However, this model does not scale for the case where a CTMP state is a joint state over many variable states. [sent-17, score-0.198]
6 Consider a trajectory, i.e., a timeline where the state of each variable is known for all times t, for a CTMP with four joint states (a, b), (a, B), (A, b), and (A, B) factorized into two binary CTBN variables α and β (with states a and A, and b and B, respectively). [sent-22, score-0.22]
7 Learning involves parameter estimation from sufficient statistics (the numbers of transitions M and durations T in Figure 1) and structure learning over a directed (possibly cyclic) graph over the variables to maximize a penalized likelihood score. [sent-28, score-0.189]
8 Partition-based CTBNs remove the restriction used in CTBNs of storing one rate matrix per parents setting for every variable. [sent-31, score-0.179]
9 Instead, partition-based CTBNs define partitions over the joint state space and make the transition rate of each variable depend on the membership of the current joint state in an element (part) of a partition. [sent-32, score-0.38]
10 Then the transition into $s_i$ from joint state (A, B) in Figure 1 would be parameterized by transition rate $q_{a|p_2}$. [sent-34, score-0.294]
11 Partition-based CTBNs store one transition rate per part, as opposed to one transition rate matrix per parents setting. [sent-35, score-0.358]
12 Partition-based CTBNs avoid one limitation of CTBNs: that the model size is necessarily exponential in the maximum number of parents per variable. [sent-38, score-0.154]
13 However, tree learning induces recursive subpartitions, which limits a tree's ability to partition the joint state space. [sent-44, score-0.236]
14 We therefore introduce multiplicative forests for CTBNs, which allow the model to represent up to an exponential number of transition rates with parameters still linear in the number of splits. [sent-45, score-0.28]
15 Following canonical tree learning methods, we perform greedy tree and forest learning using iterative structure modifications. [sent-49, score-0.438]
16 We show that the partition-based change in log likelihood can be calculated efficiently in closed form using a multiplicative assumption. [sent-50, score-0.198]
17 Thus, we can calculate the maximum change in log likelihood for a forest modification proposal, which gives us the best iterative update to the forest model. [sent-52, score-0.623]
18 Finally, we conduct experiments to compare CTBNs, regression tree CTBNs (treeCTBNs) and multiplicative forest CTBNs (mfCTBNs) on three data sets. [sent-53, score-0.434]
19 In Section 3 we present partition-based CTBNs, show that they subsume CTBNs and define the partitions that tree and forest structures induce. [sent-56, score-0.412]
20 We present results in Section 4 showing that forest CTBNs are scalable to large state spaces and learn better than CTBNs, from fewer examples and in less time. [sent-58, score-0.326]
21 Each variable X takes one of the states x1, . . . , xk, and there is an intensity $q_{x|u}$ for every state x ∈ X given an instantiation over its parents u ∈ UX. [sent-65, score-0.396]
22 The intensity corresponds to the rate of transitioning out of state x; the probability density function for staying in state x given an instantiation of parents u is $q_{x|u} e^{-q_{x|u} t}$. [sent-66, score-0.546]
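As a small illustration, this dwell-time density can be evaluated or sampled directly; the sketch below is a minimal Python illustration with hypothetical function names, not code from the paper.

```python
import math
import random

def dwell_density(q, t):
    """Density q * exp(-q t) of remaining in the current state for exactly time t."""
    return q * math.exp(-q * t)

def sample_dwell(q, rng=random):
    """Draw a dwell time from the exponential distribution with rate (intensity) q."""
    return rng.expovariate(q)

print(dwell_density(0.5, 2.0))   # ~0.184
print(sample_dwell(0.5))         # random dwell time with mean 1/0.5 = 2
```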
23 Next, we introduce regression trees and multiplicative forests and describe the partitions they induce, which are then used in the partition-based CTBN framework. [sent-72, score-0.329]
24 Finally, we discuss the advantages of using trees and forests in terms of learning compact models efficiently. [sent-73, score-0.177]
25 Each variable X transitions among its states with rate parameter $q_{x'|s}$ for entering state x' given the joint state s. [sent-85, score-0.71]
26 This rate parameter (called an intensity) parameterizes the exponential distribution for transitioning into x', given by the pdf $p(x', s, t) = q_{x'|s} e^{-q_{x'|s} t}$ for time t ∈ [0, ∞). [sent-86, score-0.434]
27 A partition-based CTBN has a collection of set partitions over the joint state space P, one partition $P_{x'}$ for every variable state x'. [sent-87, score-0.165]
28 We define the intensity parameter as $q_{x'|s} = q_{x'|p}$ for all s ∈ p. [sent-89, score-0.906]
29 Note that this fixes the intensity to be the same for every s ∈ p, and also note that the set of parts p covers P. [sent-90, score-0.239]
30 The pdf for transitioning is given by $p(x', s, t) = p(x', P_{x'}(s), t) = q_{x'|p} e^{-q_{x'|p} t}$ for all s in p. [sent-91, score-0.409]
31 A partition-based CTBN model M is composed of a distribution over the initial state of our variables, defined by a Bayesian network B, and a set of partitions $P_{x'}$ for every variable state x' with corresponding sets of intensities $q_{x'|p}$. [sent-93, score-0.717]
32 Given an initial state $s_0$, transition times are sampled for each variable state x' according to $p(x', P_{x'}(s_0), t)$. [sent-95, score-0.234]
33 The next state is selected based on the transition to the x' with the shortest time, after which the transition times are resampled according to $p(x', s_i, t)$. [sent-96, score-0.241]
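The sampling scheme just described is an exponential race among candidate transitions. Below is a minimal Python sketch, assuming the current intensities are supplied as a dictionary; the function name and data layout are illustrative assumptions, not the authors' code.

```python
import math
import random

def sample_next_transition(intensities, rng=random):
    """Race exponential clocks: each candidate recipient state x' has rate q_{x'|s}.

    Returns the winning state and its waiting time; the joint state then changes
    and the clocks are resampled, as described above.
    """
    best_state, best_time = None, math.inf
    for x_prime, q in intensities.items():
        if q <= 0.0:
            continue                      # a zero rate never fires
        t = rng.expovariate(q)            # exponential waiting time with rate q
        if t < best_time:
            best_state, best_time = x_prime, t
    return best_state, best_time

# Example: in joint state (A, B), alpha may enter 'a' at rate 0.4 and beta 'b' at 1.5.
winner, dwell = sample_next_transition({"a": 0.4, "b": 1.5})
print(winner, round(dwell, 3))
```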
34 Of note, partition-based CTBNs model the intensity of transitioning to the recipient state x', rather than from the donor state x, because we are more often interested in the causes of entering a state. [sent-103, score-0.455]
35 For each interval $t_i$, the joint state remains unchanged, and then one variable transitions into x'. [sent-105, score-0.245]
36 The likelihood given the interval is $q_{x'|s_{i-1}} \prod_{x} e^{-q_{x|s_{i-1}} t_i}$. [sent-106, score-0.452]
37 Taking the product over all intervals in z, we get the model likelihood: $\prod_{X \in \mathbf{X}} \prod_{x' \in X} \prod_{s} q_{x'|s}^{M_{x'|s}} e^{-q_{x'|s} T_s}$ (1), where $M_{x'|s}$ is the number of transitions into x' from state s, and $T_s$ is the total duration spent in s. [sent-109, score-0.552]
38 Combining terms based on the membership of s in p and defining $M_{x'|p} = \sum_{s \in p} M_{x'|s}$ and $T_p = \sum_{s \in p} T_s$, we get $\prod_{X \in \mathbf{X}} \prod_{x' \in X} \prod_{p \in P_{x'}} q_{x'|p}^{M_{x'|p}} e^{-q_{x'|p} T_p}$. [sent-110, score-0.354]
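Each part therefore contributes $M \log q - qT$ to the log likelihood, which makes evaluation from sufficient statistics straightforward. Here is a minimal Python sketch; the data layout is an assumption for illustration, not the authors' implementation.

```python
import math

def partition_log_likelihood(parts):
    """Log likelihood of a partition-based CTBN from per-part sufficient statistics.

    `parts` is an iterable of (q, M, T) triples: the intensity q_{x'|p}, the
    transition count M_{x'|p}, and the total duration T_p.  Each part
    contributes M * log(q) - q * T, following the product form of Eq. (1).
    """
    ll = 0.0
    for q, M, T in parts:
        if M > 0:
            ll += M * math.log(q)
        ll -= q * T
    return ll

# Two parts: 3 transitions observed over 10 time units at rate 0.2,
# and none observed over 5 time units at rate 0.05.
print(partition_log_likelihood([(0.2, 3, 10.0), (0.05, 0, 5.0)]))
```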
39 Each variable X is given a parent set UX , and the transition intensities qx|u are recorded for leaving donor states x given the current setting of the parents u ∈ UX . [sent-113, score-0.402]
40 The CTBN likelihood can be shown to be $\prod_{X \in \mathbf{X}} \prod_{x \in X} \prod_{u \in U_X} e^{-q_{x|u} T_{x|u}} \prod_{x' \neq x} q_{xx'|u}^{M_{xx'|u}}$ (2), as in [5], where $q_{xx'|u}$ and $M_{xx'|u}$ denote the intensity and number of transitions from state x to state x' given parents setting u, and $\sum_{x' \neq x} q_{xx'|u} = q_{x|u}$. [sent-114, score-0.738]
41 Eq. (2) can then be rewritten as $\prod_{X \in \mathbf{X}} \prod_{x \in X} \prod_{u \in U_X} \prod_{x' \neq x} q_{x'|p}^{M_{x'|p}} e^{-q_{x'|p} T_p} = \prod_{X \in \mathbf{X}} \prod_{x' \in X} \prod_{p \in P_{x'}} q_{x'|p}^{M_{x'|p}} e^{-q_{x'|p} T_p}$ (3), where we define p as $\{x\} \times \{u\} \times (\mathbf{X} \setminus (X \times U_X))$ in each partition $P_{x'}$, and likewise $q_{x'|p} = q_{xx'|u}$, $M_{x'|p} = M_{xx'|u}$, and $T_p = T_{x|u}$. [sent-116, score-0.809]
42 Thus, CTBNs are one instance of partition-based CTBNs, with partitions corresponding to a specified donor state x and parents setting u. [sent-117, score-0.304]
43 Tree and forest partitions: Trees and forests induce partitions over a space defined by the set of possible split criteria [11]. [sent-119, score-0.56]
44 Here we will define Conditional Intensity Trees (CITs): regression trees that determine the intensities $q_{x'|p}$ by inducing a partition over P. [sent-120, score-0.554]
45 Similarly, we will define Conditional Intensity Forests (CIFs), where the tree intensities are named intensity factors and their product determines $q_{x'|p}$. [sent-121, score-0.779]
46 Formally, a Conditional Intensity Tree (CIT) $f_{x'}$ is a directed tree structure on a graph G(V, E) with nodes V and edges $E(V_i, V_j)$. [sent-123, score-0.199]
47 Internal nodes $V_i$ of the tree hold splits $\sigma_{V_i} = (\pi_{V_i}, \{E(V_i, \cdot)\})$ composed of surjective maps $\pi_{V_i}: s \to E(V_i, V_j)$ and lists of the outgoing edges. [sent-124, score-0.163]
48 External nodes $l^{CIT}$, or leaves, hold non-negative real values $q_{x'|p^{CIT}}$ called intensities. [sent-126, score-0.373]
49 The parts corresponding to paths of a CIT form a partition over P, which can be shown easily using induction and the fact that the maps $\pi_{V_i}$ induce disjoint parts $p_{V_j}$ that cover P. [sent-128, score-0.151]
50 Finally, a CIF produces intensities from joint states by taking the product over the intensity factors from each CIT: $q_{x'|p^{CIF}} = \prod_{f_{x'}} q_{x'|p^{CIT}}$. [sent-131, score-1.174]
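To make the CIT/CIF lookup concrete, here is a minimal Python sketch: each tree routes the joint state to a leaf, and the forest multiplies the matching leaf factors. The class, helper names, and the example variables (smoker, age) are hypothetical illustrations, not the authors' code.

```python
class CITNode:
    """A node of a Conditional Intensity Tree (minimal hypothetical form).

    Internal nodes hold a test on the joint state s that selects a child;
    leaves hold a non-negative intensity factor.
    """
    def __init__(self, factor=1.0, test=None, children=None):
        self.factor = factor          # used only at leaves
        self.test = test              # maps a joint state to a child index
        self.children = children or []

def match_leaf(node, s):
    """Route joint state s down a CIT and return the matching leaf."""
    while node.test is not None:
        node = node.children[node.test(s)]
    return node

def cif_intensity(forest, s):
    """q_{x'|s} under a CIF: the product of each tree's matching leaf factor."""
    q = 1.0
    for tree in forest:
        q *= match_leaf(tree, s).factor
    return q

# Two one-split trees combine multiplicatively: a base rate and an age multiplier.
t1 = CITNode(test=lambda s: int(s["smoker"]), children=[CITNode(0.01), CITNode(0.05)])
t2 = CITNode(test=lambda s: int(s["age"] > 60), children=[CITNode(1.0), CITNode(3.0)])
print(cif_intensity([t1, t2], {"smoker": True, "age": 70}))  # 0.05 * 3.0 = 0.15
```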
51 Using regression trees and forests can greatly reduce the number of model parameters. [sent-132, score-0.177]
52 In CTBNs, the number of parameters grows exponentially in the number of parents per node. [sent-133, score-0.154]
53 In tree and forest CTBNs, the number of parameters may be linear in the number of parents per node, exploiting the efficiency of using partitions. [sent-134, score-0.501]
54 Notably, however, tree CTBNs are limited to having one intensity per parameter. [sent-135, score-0.315]
55 In forest CTBNs, the number of intensities can be exponential in the number of parameters. [sent-136, score-0.361]
56 Thus, the forest model has much greater potential expressivity per parameter than the other models. [sent-137, score-0.282]
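To see the expressivity claim concretely, a few one-split trees already encode exponentially many distinct rates. A tiny Python check follows; the factor values are made up for illustration.

```python
from itertools import product as cartesian

# Three one-split trees, two factors each: 6 parameters but up to 2**3 = 8 distinct rates.
factor_pairs = [(0.5, 2.0), (0.8, 1.5), (1.0, 3.0)]
rates = {round(a * b * c, 4) for a, b, c in cartesian(*factor_pairs)}
print(len(rates), sorted(rates))   # 8 distinct intensities
```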
57 Forest CTBN learning: Here we discuss the reasoning for using the multiplicative assumption and derive the changes in likelihood given modifications to the forest structure. [sent-140, score-0.421]
58 However, if we take the sum over the intensity factors from each tree, there are no direct methods for calculating the change in likelihood aside from calculating the likelihood before and after a forest modification, which would require scanning the full data once per modification proposal. [sent-144, score-0.653]
59 Furthermore, summing intensity factors could lead to intensities outside the valid domain [0, ∞). [sent-145, score-0.334]
60 As we show below, using the multiplicative assumption also has the advantage that it is easy to compute the change in log likelihood with changes in forest structure. [sent-147, score-0.454]
61 Consider a partition-based CTBN $M = (B, \{F_{x'}\})$ where the partitions $P_{x'}$ and intensities $q_{x'|p}$ are given by the CIFs $\{F_{x'}\}$. [sent-148, score-0.524]
62 We focus on the change in forest structure for one state x' ∈ X and remove x' from the subscript notation for simplicity. [sent-149, score-0.35]
63 Given a current forest structure F and its partition P, we formulate the change in likelihood by adding a new CIT f' and its partition P'. [sent-150, score-0.433]
64 Another example of f' is a tree copied to have the same structure as a CIT f in F with all intensity factors set to one, except at one leaf node where a split is added. [sent-152, score-0.418]
65 The first and third terms are easy to compute given the old intensities and new intensity factors. [sent-156, score-0.336]
66 The second term is slightly more complicated: $\sum_{\hat{p}} q_{\hat{p}} T_{\hat{p}} = \sum_{\hat{p}} q_{p'} q_{p} T_{\hat{p}} = \sum_{p'} q_{p'} \sum_{\hat{p} \sim p'} q_{p} T_{\hat{p}}$. We introduce the notation $\hat{p} \sim p'$ to denote the parts $\hat{p}$ that correspond to the part $p'$. [sent-157, score-1.261]
67 The number of parts in the joint partition set $\hat{P}$ can be exponentially large, but the only remaining dependency on the joint partition space in the change in log likelihood is the term $\sum_{\hat{p} \sim p'} q_{p} T_{\hat{p}}$. [sent-159, score-0.546]
68 Thinking of intensities q as rates, and given durations T, we observe that the second and third terms in equation 4 are expected numbers of transitions: $E_{p'} = \sum_{\hat{p} \sim p'} q_{p} T_{\hat{p}}$ and $\hat{E}_{p'} = \sum_{\hat{p} \sim p'} q_{\hat{p}} T_{\hat{p}}$. [sent-161, score-0.626]
69 Specifically, the expectations $E_{p}$ and $E_{p'}$ are the expected numbers of transitions in part p and p' using the old model intensities, respectively, whereas $\hat{E}_{p'}$ is the expected number of transitions using the new intensities. [sent-163, score-0.189]
70 Maximum-likelihood parameters: The change in log likelihood is dependent on the intensity factor values $\{q_{p'}\}$ we choose for the new partition. [sent-165, score-0.309]
71 We calculate the maximum likelihood parameters by setting the derivative with respect to these factors to zero, which gives $q_{p'} = M_{p'} / \sum_{\hat{p} \sim p'} q_{p} T_{\hat{p}} = M_{p'} / E_{p'}$. [sent-166, score-0.334]
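Under the multiplicative assumption, the maximum-likelihood factor and the resulting change in log likelihood take a simple closed form. The sketch below derives the gain directly from the per-part term $M \log q - qT$; the function name and bookkeeping are illustrative assumptions, not the paper's equation 4 verbatim.

```python
import math

def ml_factor_and_gain(M_new, E_old):
    """Closed-form update for a proposed new part p'.

    M_new is M_{p'}, the observed transitions falling in p'; E_old is
    E_{p'} = sum over matching parts of q_p * T, the transitions expected
    under the current forest.  The maximizing factor is q_{p'} = M_{p'}/E_{p'},
    and substituting it back into M log q - (q - 1) E gives the gain
        delta = M log(M/E) - M + E.
    """
    if E_old <= 0.0:
        raise ValueError("expected transitions must be positive (priors keep them so)")
    q_new = M_new / E_old
    if M_new == 0:
        return q_new, E_old            # limit of M log(M/E) - M + E as M -> 0
    delta = M_new * math.log(q_new) - M_new + E_old
    return q_new, delta

print(ml_factor_and_gain(M_new=12, E_old=4.0))   # factor 3.0, strictly positive gain
```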
72 Following the derivation in [2], we assign priors to the sufficient statistics calculations. [sent-167, score-0.244]
73 Note, however, that the priors affect the multiplicative intensity factors, so a tree may split on the same partition set twice to get a stronger effect on the intensity, with the possible risk of undesirable overfitting. [sent-168, score-0.446]
74 Forest implementation: We use greedy likelihood maximization steps to learn multiplicative forests (mfCTBNs). [sent-170, score-0.275]
75 Initially we are given a blank forest $F_{x'}$ per state x' containing a blank tree $f_{x'}$, that is, a single root node acting as a leaf with an intensity factor of one. [sent-172, score-0.866]
76 First, for every leaf l in M, we (re)initialize the sufficient statistics $M_l$ and $E_l$ in M, as well as sufficient statistics for potential forest modifications: $M_{l,\sigma}$, $E_{l,\sigma}$, $\forall l, \sigma$. [sent-174, score-0.303]
77 For every (state, duration) pair $(s_i, t_i)$, where $t_i$ is the time spent in state $s_{i-1}$ before the transition to $s_i$, we update the sufficient statistics that compose equation 4. [sent-176, score-0.276]
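A sketch of that bookkeeping pass, reusing the CITNode/match_leaf/cif_intensity sketch above; the interval layout and dictionary-based counters are assumptions made for illustration, not the authors' data structures.

```python
def accumulate_stats(intervals, forests, M, E):
    """One pass over a trajectory, accumulating per-leaf sufficient statistics.

    `intervals` is a list of (state, duration, entered) triples: the joint state
    held, the time spent in it, and the variable state entered when it ends.
    `forests[x]` is the list of CITs for variable state x.  Each interval touches
    one leaf per tree: E accumulates expected transitions q * t under the current
    model, and M counts the transition that was actually observed.
    """
    for state, duration, entered in intervals:
        for x, forest in forests.items():
            q = cif_intensity(forest, state)           # current model intensity for x
            for tree in forest:
                leaf = match_leaf(tree, state)
                E[leaf] = E.get(leaf, 0.0) + q * duration
                if x == entered:
                    M[leaf] = M.get(leaf, 0) + 1
```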
78 Finally, we compute the change in likelihood for possible forest modifications, and choose the modification with the greatest score. [sent-177, score-0.339]
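Scoring a candidate modification then just sums the closed-form gains of the parts it creates; a small sketch reusing ml_factor_and_gain from above (the two-branch layout is a simplifying assumption for a binary split).

```python
def score_binary_split(M_branches, E_branches):
    """Change in log likelihood for splitting a leaf into two branches.

    Each branch receives its own multiplicative intensity factor, so the
    per-branch gains from ml_factor_and_gain simply add.
    """
    return sum(ml_factor_and_gain(m, e)[1] for m, e in zip(M_branches, E_branches))

# A split separating a busy region (10 observed vs. 2 expected transitions)
# from a quiet one (1 observed vs. 9 expected) scores a large positive gain.
print(score_binary_split([10, 1], [2.0, 9.0]))
```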
79 The new leaf intensity factors are the product of the old intensity factor $q_l$ and the intensity factor $q_{p'}$. [sent-180, score-0.949]
80 Unlike most forest learning algorithms, mfCTBNs learn trees neither in series nor in parallel. [sent-181, score-0.304]
81 Notably, the best split is determined solely by the change in log likelihood, regardless of the tree to which it belongs. [sent-182, score-0.166]
82 If it belongs to the blank tree at the end of the forest, that tree produces non-trivial factors and a new blank tree is appended to the forest. [sent-183, score-0.384]
83 In this way, as mfCTBN learns, it automatically determines the forest size and tree depth according to the evidence in the data. [sent-184, score-0.347]
84 Experiments: We evaluate our tree learning and forest learning algorithms on samples from three models. [sent-186, score-0.347]
85 The second is a simplified cardiovascular health model we call “CV health”, shown in Figure 2 (its variables include gender, age, BMI, HDL, blood pressure, glucose level, atherosclerosis, arrhythmia, stroke, MI, and abnormal heart electrophysiology). [sent-188, score-0.183]
86 For example, it has been well-established that independent positive risk factors for atherosclerosis include being male, a smoker, in old age, having high glucose, high BMI, and high blood pressure. [sent-190, score-0.153]
87 05, 200) over variable states, with intensity factor ratios of 1 : 0. [sent-195, score-0.228]
88 In our experiments we set the potential splits {σ} to be the set of binary splits determined by indicators for each variable state x'. [sent-198, score-0.16]
89 The expected number of parents per node in the S100 model is approximately 20. [sent-219, score-0.182]
90 In order to exactly reconstruct the S100 model, a traditional CTBN would then need to estimate 2^21 intensity values. [sent-220, score-0.198]
91 Figure 4: Ground truth (left) and mfCTBN forest learnt from 1000 trajectories (right) for the intensity/rate of developing severe atherosclerosis. [sent-260, score-0.341]
92 Figure 4 shows the ground truth forest and the mfCTBN forest learned for the “severe atherosclerosis” state in the CV health model. [sent-261, score-0.679]
93 To calculate the intensity of transitioning into this state, we identify the leaf in each tree that matches the current state and take the product of their intensity factors. [sent-262, score-0.824]
94 Full forest models can be found in the Supplementary Materials. [sent-264, score-0.256]
95 Related Work: We discuss the relationships between mfCTBNs and related work in two areas: forest learning and continuous-time processes. [sent-265, score-0.256]
96 Forest learning with a multiplicative assumption is equivalent to forest learning in the log space with an additive assumption and exponentiating the result. [sent-266, score-0.371]
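As a tiny numeric check of that equivalence, multiplying factors matches exponentiating the sum of their logs (plain Python, illustrative values only).

```python
import math

factors = [0.05, 3.0, 1.2]
product = math.prod(factors)                              # multiplicative combination
log_space = math.exp(sum(math.log(f) for f in factors))   # additive in log space
print(product, log_space)                                 # both ~0.18
```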
97 However, our method is different in its direct use of a likelihood-based objective function and in its ability to modify any tree in the forest at any iteration. [sent-268, score-0.347]
98 Perhaps the most closely related work, piecewise-constant conditional intensity models (PCIMs), reframes the concept of a factored CTMP to allow learning over arbitrary basis state functions with trees, possibly piecewise over time [10]. [sent-272, score-0.268]
99 Our models grow linearly in the number of forest node splits, while CTBNs grow exponentially in the number of parent nodes per variable. [sent-281, score-0.329]
100 Motivated by the valid domain of intensities, [0, ∞), we introduced multiplicative forests and showed that CTBN likelihood updates can be efficiently computed using changes in log likelihood. [sent-282, score-0.303]
wordName wordTfidf (topN-words)
[('ctbns', 0.516), ('qx', 0.354), ('ctbn', 0.312), ('forest', 0.256), ('qp', 0.244), ('intensity', 0.198), ('mfctbns', 0.177), ('tp', 0.147), ('mfctbn', 0.136), ('forests', 0.129), ('parents', 0.128), ('cit', 0.109), ('intensities', 0.105), ('tree', 0.091), ('mp', 0.091), ('nodelman', 0.088), ('multiplicative', 0.087), ('ux', 0.083), ('treectbns', 0.081), ('px', 0.079), ('transitions', 0.078), ('false', 0.075), ('health', 0.071), ('state', 0.07), ('fx', 0.07), ('cif', 0.068), ('ctmp', 0.068), ('partitions', 0.065), ('transition', 0.064), ('likelihood', 0.059), ('trajectories', 0.059), ('mx', 0.055), ('transitioning', 0.055), ('qxx', 0.054), ('treectbn', 0.054), ('cv', 0.051), ('trees', 0.048), ('partition', 0.047), ('leaf', 0.047), ('si', 0.043), ('parts', 0.041), ('cardiovascular', 0.041), ('donor', 0.041), ('mxx', 0.041), ('pvj', 0.041), ('blank', 0.04), ('shelton', 0.04), ('ti', 0.039), ('vi', 0.038), ('trajectory', 0.037), ('dependencies', 0.037), ('glucose', 0.036), ('smoker', 0.036), ('tx', 0.035), ('states', 0.034), ('old', 0.033), ('durations', 0.033), ('cits', 0.033), ('ep', 0.032), ('factors', 0.031), ('variable', 0.03), ('splits', 0.03), ('intervals', 0.029), ('joint', 0.028), ('log', 0.028), ('node', 0.028), ('uai', 0.027), ('atherosclerosis', 0.027), ('atherosclerotic', 0.027), ('cifs', 0.027), ('ctmps', 0.027), ('hypertensive', 0.027), ('multifactorial', 0.027), ('qxxxx', 0.027), ('modi', 0.027), ('truth', 0.026), ('blood', 0.026), ('per', 0.026), ('bayesian', 0.026), ('networks', 0.026), ('continuous', 0.026), ('processes', 0.025), ('rate', 0.025), ('boosting', 0.024), ('change', 0.024), ('bmi', 0.024), ('youth', 0.024), ('timeline', 0.024), ('composed', 0.023), ('split', 0.023), ('induce', 0.022), ('runs', 0.021), ('spent', 0.021), ('entering', 0.021), ('pressure', 0.021), ('reasoning', 0.019), ('directed', 0.019), ('nodes', 0.019), ('true', 0.019), ('reversing', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 232 nips-2012-Multiplicative Forests for Continuous-Time Processes
Author: Jeremy Weiss, Sriraam Natarajan, David Page
Abstract: Learning temporal dependencies between variables over continuous time is an important and challenging task. Continuous-time Bayesian networks effectively model such processes but are limited by the number of conditional intensity matrices, which grows exponentially in the number of parents per variable. We develop a partition-based representation using regression trees and forests whose parameter spaces grow linearly in the number of node splits. Using a multiplicative assumption we show how to update the forest likelihood in closed form, producing efficient model updates. Our results show multiplicative forests can be learned from few temporal trajectories with large gains in performance and scalability.
2 0.16161849 81 nips-2012-Context-Sensitive Decision Forests for Object Detection
Author: Peter Kontschieder, Samuel R. Bulò, Antonio Criminisi, Pushmeet Kohli, Marcello Pelillo, Horst Bischof
Abstract: In this paper we introduce Context-Sensitive Decision Forests - A new perspective to exploit contextual information in the popular decision forest framework for the object detection problem. They are tree-structured classifiers with the ability to access intermediate prediction (here: classification and regression) information during training and inference time. This intermediate prediction is available for each sample and allows us to develop context-based decision criteria, used for refining the prediction process. In addition, we introduce a novel split criterion which in combination with a priority based way of constructing the trees, allows more accurate regression mode selection and hence improves the current context information. In our experiments, we demonstrate improved results for the task of pedestrian detection on the challenging TUD data set when compared to state-of-the-art methods.
3 0.081139877 254 nips-2012-On the Sample Complexity of Robust PCA
Author: Matthew Coudron, Gilad Lerman
Abstract: We estimate the rate of convergence and sample complexity of a recent robust estimator for a generalized version of the inverse covariance matrix. This estimator is used in a convex algorithm for robust subspace recovery (i.e., robust PCA). Our model assumes a sub-Gaussian underlying distribution and an i.i.d. sample from it. Our main result shows with high probability that the norm of the difference between the generalized inverse covariance of the underlying distribution and its estimator from an i.i.d. sample of size N is of order O(N^{-0.5+ε}) for arbitrarily small ε > 0 (affecting the probabilistic estimate); this rate of convergence is close to the one of direct covariance estimation, i.e., O(N^{-0.5}). Our precise probabilistic estimate implies for some natural settings that the sample complexity of the generalized inverse covariance estimation when using the Frobenius norm is O(D^{2+δ}) for arbitrarily small δ > 0 (whereas the sample complexity of direct covariance estimation with Frobenius norm is O(D^2)). These results provide similar rates of convergence and sample complexity for the corresponding robust subspace recovery algorithm. To the best of our knowledge, this is the only work analyzing the sample complexity of any robust PCA algorithm.
4 0.071430892 339 nips-2012-The Time-Marginalized Coalescent Prior for Hierarchical Clustering
Author: Levi Boyles, Max Welling
Abstract: We introduce a new prior for use in Nonparametric Bayesian Hierarchical Clustering. The prior is constructed by marginalizing out the time information of Kingman’s coalescent, providing a prior over tree structures which we call the Time-Marginalized Coalescent (TMC). This allows for models which factorize the tree structure and times, providing two benefits: more flexible priors may be constructed and more efficient Gibbs type inference can be used. We demonstrate this on an example model for density estimation and show the TMC achieves competitive experimental results. 1
5 0.068754949 110 nips-2012-Efficient Reinforcement Learning for High Dimensional Linear Quadratic Systems
Author: Morteza Ibrahimi, Adel Javanmard, Benjamin V. Roy
Abstract: We study the problem of adaptive control of a high dimensional linear quadratic (LQ) system. Previous work established the asymptotic convergence to an optimal controller for various adaptive control schemes. More recently, for the average cost LQ problem, a regret bound of O(√T) was shown, apart from logarithmic factors. However, this bound scales exponentially with p, the dimension of the state space. In this work we consider the case where the matrices describing the dynamic of the LQ system are sparse and their dimensions are large. We present an adaptive control scheme that achieves a regret bound of O(p√T), apart from logarithmic factors. In particular, our algorithm has an average cost of (1 + ε) times the optimum cost after T = polylog(p) O(1/ε²). This is in comparison to previous work on the dense dynamics where the algorithm requires time that scales exponentially with dimension in order to achieve regret of ε times the optimal cost. We believe that our result has prominent applications in the emerging area of computational advertising, in particular targeted online advertising and advertising in social networks.
6 0.061985269 123 nips-2012-Exponential Concentration for Mutual Information Estimation with Application to Forests
7 0.060507871 306 nips-2012-Semantic Kernel Forests from Multiple Taxonomies
8 0.058254868 205 nips-2012-MCMC for continuous-time discrete-state systems
9 0.057877444 183 nips-2012-Learning Partially Observable Models Using Temporally Abstract Decision Trees
10 0.054138735 260 nips-2012-Online Sum-Product Computation Over Trees
11 0.051529408 34 nips-2012-Active Learning of Multi-Index Function Models
12 0.049480751 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions
13 0.048736542 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs
14 0.047870267 346 nips-2012-Topology Constraints in Graphical Models
15 0.046917785 96 nips-2012-Density Propagation and Improved Bounds on the Partition Function
16 0.045798771 29 nips-2012-Accelerated Training for Matrix-norm Regularization: A Boosting Approach
17 0.045574285 233 nips-2012-Multiresolution Gaussian Processes
18 0.044885881 180 nips-2012-Learning Mixtures of Tree Graphical Models
19 0.044693679 206 nips-2012-Majorization for CRFs and Latent Likelihoods
20 0.044558451 182 nips-2012-Learning Networks of Heterogeneous Influence
topicId topicWeight
[(0, 0.128), (1, -0.013), (2, -0.013), (3, -0.022), (4, -0.055), (5, -0.008), (6, 0.0), (7, -0.021), (8, -0.106), (9, 0.033), (10, -0.041), (11, 0.014), (12, 0.038), (13, -0.042), (14, -0.082), (15, 0.0), (16, -0.015), (17, -0.012), (18, 0.031), (19, -0.038), (20, -0.029), (21, 0.066), (22, 0.034), (23, -0.019), (24, -0.028), (25, -0.009), (26, -0.021), (27, -0.04), (28, 0.021), (29, 0.022), (30, 0.076), (31, -0.091), (32, -0.013), (33, -0.082), (34, -0.034), (35, 0.071), (36, -0.041), (37, 0.017), (38, -0.018), (39, 0.028), (40, 0.017), (41, -0.005), (42, 0.026), (43, -0.043), (44, 0.036), (45, 0.009), (46, -0.054), (47, -0.082), (48, -0.015), (49, 0.009)]
simIndex simValue paperId paperTitle
same-paper 1 0.89522153 232 nips-2012-Multiplicative Forests for Continuous-Time Processes
Author: Jeremy Weiss, Sriraam Natarajan, David Page
Abstract: Learning temporal dependencies between variables over continuous time is an important and challenging task. Continuous-time Bayesian networks effectively model such processes but are limited by the number of conditional intensity matrices, which grows exponentially in the number of parents per variable. We develop a partition-based representation using regression trees and forests whose parameter spaces grow linearly in the number of node splits. Using a multiplicative assumption we show how to update the forest likelihood in closed form, producing efficient model updates. Our results show multiplicative forests can be learned from few temporal trajectories with large gains in performance and scalability.
2 0.72570133 339 nips-2012-The Time-Marginalized Coalescent Prior for Hierarchical Clustering
Author: Levi Boyles, Max Welling
Abstract: We introduce a new prior for use in Nonparametric Bayesian Hierarchical Clustering. The prior is constructed by marginalizing out the time information of Kingman’s coalescent, providing a prior over tree structures which we call the Time-Marginalized Coalescent (TMC). This allows for models which factorize the tree structure and times, providing two benefits: more flexible priors may be constructed and more efficient Gibbs type inference can be used. We demonstrate this on an example model for density estimation and show the TMC achieves competitive experimental results. 1
3 0.69649494 81 nips-2012-Context-Sensitive Decision Forests for Object Detection
Author: Peter Kontschieder, Samuel R. Bulò, Antonio Criminisi, Pushmeet Kohli, Marcello Pelillo, Horst Bischof
Abstract: In this paper we introduce Context-Sensitive Decision Forests, a new perspective for exploiting contextual information in the popular decision forest framework for the object detection problem. They are tree-structured classifiers with the ability to access intermediate prediction (here: classification and regression) information during training and inference time. This intermediate prediction is available for each sample and allows us to develop context-based decision criteria, used for refining the prediction process. In addition, we introduce a novel split criterion which, in combination with a priority-based way of constructing the trees, allows more accurate regression mode selection and hence improves the current context information. In our experiments, we demonstrate improved results for the task of pedestrian detection on the challenging TUD data set when compared to state-of-the-art methods.
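The entry above mentions a priority-based way of constructing trees (growing the parts of the tree with the highest training error first, per that paper). The sketch below illustrates only that scheduling idea with a priority queue; the impurity measure, split search, and data format are placeholders, not the split criterion or features used in the paper.

```python
import heapq
from collections import Counter


def gini(labels):
    """Placeholder node error: Gini impurity of the labels at a node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(data):
    """Toy exhaustive split search over single-feature thresholds.
    data is a list of (feature_tuple, label) pairs."""
    best = None
    for f in range(len(data[0][0])):
        for x, _ in data:
            thr = x[f]
            left = [d for d in data if d[0][f] <= thr]
            right = [d for d in data if d[0][f] > thr]
            if not left or not right:
                continue
            score = gini([y for _, y in left]) + gini([y for _, y in right])
            if best is None or score < best[0]:
                best = (score, f, thr, left, right)
    return best


def grow_priority(data, max_nodes=15):
    """Grow a tree by always expanding the pending node whose
    training error (here: Gini impurity) is currently largest."""
    root = {"data": data, "split": None, "children": None}
    heap = [(-gini([y for _, y in data]), 0, root)]
    tick, n_nodes = 1, 1
    while heap and n_nodes < max_nodes:
        _, _, node = heapq.heappop(heap)
        found = best_split(node["data"])
        if found is None:
            continue
        _, f, thr, left, right = found
        node["split"] = (f, thr)
        node["children"] = tuple(
            {"data": part, "split": None, "children": None} for part in (left, right)
        )
        for child in node["children"]:
            heapq.heappush(heap, (-gini([y for _, y in child["data"]]), tick, child))
            tick += 1
        n_nodes += 2
    return root


# Tiny 1-d demo: labels flip around x = 0.6.
demo = [((0.1,), 0), ((0.3,), 0), ((0.5,), 0), ((0.7,), 1), ((0.9,), 1)]
tree = grow_priority(demo, max_nodes=7)
print(tree["split"])  # first split chosen: (feature 0, threshold 0.5)
```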
4 0.65452552 260 nips-2012-Online Sum-Product Computation Over Trees
Author: Mark Herbster, Stephen Pasteris, Fabio Vitale
Abstract: We consider the problem of performing efficient sum-product computations in an online setting over a tree. A natural application of our methods is to compute the marginal distribution at a vertex in a tree-structured Markov random field. Belief propagation can be used to solve this problem, but requires time linear in the size of the tree, and is therefore too slow in an online setting where we are continuously receiving new data and computing individual marginals. With our method we aim to update the data and compute marginals in time that is no more than logarithmic in the size of the tree, and is often significantly less. We accomplish this via a hierarchical covering structure that caches previous local sum-product computations. Our contribution is three-fold: we i) give a linear time algorithm to find an optimal hierarchical cover of a tree; ii) give a sum-product-like algorithm to efficiently compute marginals with respect to this cover; and iii) apply “i” and “ii” to find an efficient algorithm with a regret bound for the online allocation problem in a multi-task setting.
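For context, the linear-time baseline referred to in this abstract, belief propagation on a tree-structured Markov random field, can be sketched as a single upward sum-product pass. The hierarchical covering structure that yields logarithmic-time updates is not reproduced here, and the potentials in the toy example are invented.

```python
import numpy as np


def marginal_at_root(root, adj, node_pot, edge_pot):
    """Exact marginal at `root` of a tree MRF via one upward sum-product pass.
    adj: {v: [neighbours]}, node_pot: {v: (k,) array},
    edge_pot: {(u, v): (k, k) array indexed as [state_u, state_v]}."""

    def message(child, parent):
        # m_{child -> parent}(x_parent) = sum_{x_child} psi(x_child, x_parent)
        #                                 * phi(x_child) * prod of incoming messages
        belief = node_pot[child].copy()
        for nb in adj[child]:
            if nb != parent:
                belief *= message(nb, child)
        if (child, parent) in edge_pot:
            pot = edge_pot[(child, parent)]
        else:
            pot = edge_pot[(parent, child)].T
        return pot.T @ belief

    b = node_pot[root].copy()
    for nb in adj[root]:
        b *= message(nb, root)
    return b / b.sum()


# Toy 4-node path 0-1-2-3 with binary states.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
node_pot = {v: np.array([1.0, 1.0]) for v in adj}
node_pot[3] = np.array([0.9, 0.1])            # evidence-like potential at leaf 3
attract = np.array([[2.0, 1.0], [1.0, 2.0]])  # pairwise potential favouring agreement
edge_pot = {(0, 1): attract, (1, 2): attract, (2, 3): attract}
print(marginal_at_root(0, adj, node_pot, edge_pot))
```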
5 0.64123303 183 nips-2012-Learning Partially Observable Models Using Temporally Abstract Decision Trees
Author: Erik Talvitie
Abstract: This paper introduces timeline trees, which are partial models of partially observable environments. Timeline trees are given some specific predictions to make and learn a decision tree over history. The main idea of timeline trees is to use temporally abstract features to identify and split on features of key events, spread arbitrarily far apart in the past (whereas previous decision-tree-based methods have been limited to a finite suffix of history). Experiments demonstrate that timeline trees can learn to make high quality predictions in complex, partially observable environments with high-dimensional observations (e.g. an arcade game).
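As a rough illustration of what a temporally abstract split over history might look like, the feature below references events arbitrarily far in the past rather than a fixed suffix. The feature definition and event names are invented for illustration; they are not the feature language used in the paper.

```python
# A history is a list of (timestamp, observation) pairs, most recent last.
def occurred_since(history, event, anchor):
    """Temporally abstract feature: has `event` appeared at any time since
    the most recent occurrence of `anchor` (or ever, if `anchor` never occurred)?"""
    start = 0
    for i, (_, obs) in enumerate(history):
        if obs == anchor:
            start = i
    return any(obs == event for _, obs in history[start:])


def predict(history):
    """A hand-built two-split 'tree over history' using the feature above."""
    if occurred_since(history, event="door_opened", anchor="room_entered"):
        if occurred_since(history, event="switch_pressed", anchor="door_opened"):
            return "light_on"
        return "light_off"
    return "light_off"


h = [(0.0, "room_entered"), (1.2, "door_opened"), (3.5, "switch_pressed")]
print(predict(h))  # light_on
```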
6 0.61850387 215 nips-2012-Minimizing Uncertainty in Pipelines
7 0.56746376 207 nips-2012-Mandatory Leaf Node Prediction in Hierarchical Multilabel Classification
8 0.53609324 182 nips-2012-Learning Networks of Heterogeneous Influence
9 0.51867664 206 nips-2012-Majorization for CRFs and Latent Likelihoods
10 0.51088011 334 nips-2012-Tensor Decomposition for Fast Parsing with Latent-Variable PCFGs
11 0.50065398 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search
12 0.49968258 156 nips-2012-Identifiability and Unmixing of Latent Parse Trees
13 0.49808928 53 nips-2012-Bayesian Pedigree Analysis using Measure Factorization
14 0.48715118 96 nips-2012-Density Propagation and Improved Bounds on the Partition Function
15 0.48403084 78 nips-2012-Compressive Sensing MRI with Wavelet Tree Sparsity
16 0.47710285 58 nips-2012-Bayesian models for Large-scale Hierarchical Classification
17 0.46015987 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs
18 0.45494401 118 nips-2012-Entangled Monte Carlo
19 0.45181799 250 nips-2012-On-line Reinforcement Learning Using Incremental Kernel-Based Stochastic Factorization
20 0.45069358 267 nips-2012-Perceptron Learning of SAT
topicId topicWeight
[(0, 0.043), (10, 0.045), (16, 0.185), (21, 0.05), (38, 0.094), (39, 0.025), (42, 0.026), (44, 0.015), (54, 0.025), (55, 0.044), (74, 0.072), (76, 0.132), (77, 0.011), (80, 0.084), (92, 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 0.80786836 232 nips-2012-Multiplicative Forests for Continuous-Time Processes
Author: Jeremy Weiss, Sriraam Natarajan, David Page
Abstract: Learning temporal dependencies between variables over continuous time is an important and challenging task. Continuous-time Bayesian networks effectively model such processes but are limited by the number of conditional intensity matrices, which grows exponentially in the number of parents per variable. We develop a partition-based representation using regression trees and forests whose parameter spaces grow linearly in the number of node splits. Using a multiplicative assumption we show how to update the forest likelihood in closed form, producing efficient model updates. Our results show multiplicative forests can be learned from few temporal trajectories with large gains in performance and scalability.
2 0.80418354 329 nips-2012-Super-Bit Locality-Sensitive Hashing
Author: Jianqiu Ji, Jianmin Li, Shuicheng Yan, Bo Zhang, Qi Tian
Abstract: Sign-random-projection locality-sensitive hashing (SRP-LSH) is a probabilistic dimension reduction method which provides an unbiased estimate of angular similarity, yet suffers from the large variance of its estimation. In this work, we propose Super-Bit locality-sensitive hashing (SBLSH). It is easy to implement: it orthogonalizes the random projection vectors in batches, and it is theoretically guaranteed that SBLSH also provides an unbiased estimate of angular similarity, yet with a smaller variance when the angle to estimate is within (0, π/2]. Extensive experiments on real data validate that, given the same length of binary code, SBLSH may achieve significant mean squared error reduction in estimating pairwise angular similarity. Moreover, SBLSH shows superiority over SRP-LSH in approximate nearest neighbor (ANN) retrieval experiments.
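A rough numpy sketch of the batched orthogonalization described in this abstract: random Gaussian directions are orthogonalized in blocks via QR, samples are hashed by sign, and the angle is estimated from the fraction of disagreeing bits scaled by π (the standard sign-random-projection estimator). The parameter choices and block construction are assumptions for illustration, not the paper's reference code.

```python
import numpy as np


def super_bit_projections(d, n_bits, batch_size, rng):
    """Draw n_bits Gaussian directions and orthogonalize them in
    batches of `batch_size` (batch_size <= d) via reduced QR."""
    assert batch_size <= d and n_bits % batch_size == 0
    blocks = []
    for _ in range(n_bits // batch_size):
        g = rng.standard_normal((d, batch_size))
        q, _ = np.linalg.qr(g)   # orthonormal columns within the batch
        blocks.append(q)
    return np.hstack(blocks)     # shape (d, n_bits)


def hash_codes(x, W):
    """Sign hashing: one bit per projection direction."""
    return (x @ W >= 0).astype(np.uint8)


def estimated_angle(code_a, code_b):
    """Fraction of differing bits, scaled by pi, estimates the angle."""
    return np.pi * np.mean(code_a != code_b)


rng = np.random.default_rng(0)
d, n_bits, batch = 64, 256, 32
W = super_bit_projections(d, n_bits, batch, rng)
x = rng.standard_normal(d)
y = x + 0.3 * rng.standard_normal(d)
true_angle = np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
print(true_angle, estimated_angle(hash_codes(x, W), hash_codes(y, W)))
```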
3 0.77037877 222 nips-2012-Multi-Task Averaging
Author: Sergey Feldman, Maya Gupta, Bela Frigyik
Abstract: We present a multi-task learning approach to jointly estimate the means of multiple independent data sets. The proposed multi-task averaging (MTA) algorithm results in a convex combination of the single-task averages. We derive the optimal amount of regularization, and show that it can be effectively estimated. Simulations and real data experiments demonstrate that MTA outperforms both maximum likelihood and James-Stein estimators, and that our approach to estimating the amount of regularization rivals cross-validation in performance but is more computationally efficient.
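A minimal sketch of the general shape of such an estimator: each task's own sample average is shrunk toward the pooled average, so the result is a convex combination of the single-task averages. The hand-set weight gamma is an arbitrary illustrative value; the paper's derivation of the optimal amount of regularization is not reproduced here.

```python
import numpy as np


def multi_task_averages(task_samples, gamma=0.5):
    """Shrink each task's sample mean toward the (unweighted) pooled mean.
    gamma = 0 recovers the single-task averages; gamma = 1 pools everything.
    (The data-driven choice of gamma from the paper is not reproduced.)"""
    task_means = np.array([np.mean(s) for s in task_samples])
    pooled = task_means.mean()
    return (1 - gamma) * task_means + gamma * pooled


rng = np.random.default_rng(1)
tasks = [rng.normal(mu, 1.0, size=10) for mu in (0.0, 0.5, 1.0)]
print(multi_task_averages(tasks, gamma=0.3))
```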
4 0.74055034 76 nips-2012-Communication-Efficient Algorithms for Statistical Optimization
Author: Yuchen Zhang, Martin J. Wainwright, John C. Duchi
Abstract: We study two communication-efficient algorithms for distributed statistical optimization on large-scale data. The first algorithm is an averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves mean-squared error that decays as O(N^{-1} + (N/m)^{-2}). Whenever m ≤ √N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of the bootstrap. Requiring only a single round of communication, it has mean-squared error that decays as O(N^{-1} + (N/m)^{-3}), and so is more robust to the amount of parallelization. We complement our theoretical results with experiments on large-scale problems from the internet search domain. In particular, we show that our methods efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which consists of N ≈ 2.4 × 10^8 samples and d ≥ 700,000 dimensions.
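A minimal simulation of the first (average mixture) procedure described above: the data are split evenly across m simulated machines, each shard is solved locally, and the local estimates are averaged. Ordinary least squares stands in for the local minimization purely for illustration.

```python
import numpy as np


def average_mixture(X, y, m):
    """Split (X, y) into m shards, fit a local least-squares estimate
    on each shard, and return the average of the local estimates."""
    shards = zip(np.array_split(X, m), np.array_split(y, m))
    thetas = [np.linalg.lstsq(Xi, yi, rcond=None)[0] for Xi, yi in shards]
    return np.mean(thetas, axis=0)


rng = np.random.default_rng(2)
n, d, m = 10_000, 5, 8
theta_star = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = X @ theta_star + 0.1 * rng.standard_normal(n)
print(np.linalg.norm(average_mixture(X, y, m) - theta_star))
```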
5 0.72567594 193 nips-2012-Learning to Align from Scratch
Author: Gary Huang, Marwan Mattar, Honglak Lee, Erik G. Learned-miller
Abstract: Unsupervised joint alignment of images has been demonstrated to improve performance on recognition tasks such as face verification. Such alignment reduces undesired variability due to factors such as pose, while only requiring weak supervision in the form of poorly aligned examples. However, prior work on unsupervised alignment of complex, real-world images has required the careful selection of feature representation based on hand-crafted image descriptors, in order to achieve an appropriate, smooth optimization landscape. In this paper, we instead propose a novel combination of unsupervised joint alignment with unsupervised feature learning. Specifically, we incorporate deep learning into the congealing alignment framework. Through deep learning, we obtain features that can represent the image at differing resolutions based on network depth, and that are tuned to the statistics of the specific data being aligned. In addition, we modify the learning algorithm for the restricted Boltzmann machine by incorporating a group sparsity penalty, leading to a topographic organization of the learned filters and improving subsequent alignment results. We apply our method to the Labeled Faces in the Wild database (LFW). Using the aligned images produced by our proposed unsupervised algorithm, we achieve higher accuracy in face verification compared to prior work in both unsupervised and supervised alignment. We also match the accuracy for the best available commercial method. 1
6 0.72117895 210 nips-2012-Memorability of Image Regions
7 0.71561486 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed
8 0.71447426 168 nips-2012-Kernel Latent SVM for Visual Recognition
9 0.71358967 274 nips-2012-Priors for Diversity in Generative Latent Variable Models
10 0.71202868 188 nips-2012-Learning from Distributions via Support Measure Machines
11 0.70991868 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration
12 0.7092244 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes
13 0.70922029 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models
14 0.70820814 81 nips-2012-Context-Sensitive Decision Forests for Object Detection
15 0.70791715 197 nips-2012-Learning with Recursive Perceptual Representations
16 0.7078163 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs
17 0.70660257 112 nips-2012-Efficient Spike-Coding with Multiplicative Adaptation in a Spike Response Model
18 0.70625734 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models
19 0.70591223 303 nips-2012-Searching for objects driven by context
20 0.70446748 209 nips-2012-Max-Margin Structured Output Regression for Spatio-Temporal Action Localization