nips nips2012 nips2012-339 knowledge-graph by maker-knowledge-mining

339 nips-2012-The Time-Marginalized Coalescent Prior for Hierarchical Clustering


Source: pdf

Author: Levi Boyles, Max Welling

Abstract: We introduce a new prior for use in Nonparametric Bayesian Hierarchical Clustering. The prior is constructed by marginalizing out the time information of Kingman’s coalescent, providing a prior over tree structures which we call the Time-Marginalized Coalescent (TMC). This allows for models which factorize the tree structure and times, providing two benefits: more flexible priors may be constructed and more efficient Gibbs type inference can be used. We demonstrate this on an example model for density estimation and show the TMC achieves competitive experimental results. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We introduce a new prior for use in Nonparametric Bayesian Hierarchical Clustering. [sent-3, score-0.103]

2 The prior is constructed by marginalizing out the time information of Kingman’s coalescent, providing a prior over tree structures which we call the Time-Marginalized Coalescent (TMC). [sent-4, score-0.684]

3 This allows for models which factorize the tree structure and times, providing two benefits: more flexible priors may be constructed and more efficient Gibbs type inference can be used. [sent-5, score-0.356]

4 Introduction: Hierarchical clustering models aim to fit hierarchies to data, and enjoy the property that clusterings of varying size can be obtained by “pruning” the tree at particular levels. [sent-7, score-0.388]

5 In contrast, standard clustering models must specify the number of clusters beforehand, while Nonparametric Bayesian (NPB) clustering models such as the Dirichlet Process Mixture (DPM) [5, 13] directly infer the (effective) number of clusters. [sent-8, score-0.146]

6 NPB hierarchical clustering models are an important regime of such models, and have been shown to have superior performance to alternative models in many domains [8]. [sent-12, score-0.128]

7 Dirichlet Diffusion Trees (DDT) [16], Kingman’s Coalescent [9, 10, 4, 20], and Pitman-Yor Diffusion Trees (PYDT) [11] all provide models in which data is generated from a Continuous-Time Markov Chain (CTMC) that lives on a tree that splits (or coalesces) according to some continuous-time process. [sent-15, score-0.315]

8 The nested CRP and DP [3, 17] and Tree-Structured Stick Breaking (TSSB) [1] define priors over tree structures from which data is directly generated. [sent-16, score-0.471]

9 Although there is extensive and impressive literature on the subject demonstrating its useful clustering properties, NPB hierarchical clustering has yet to see widespread use. [sent-17, score-0.201]

10 The direct generation models are typically faster, but usually come at some cost or limitation. Figure 1: Coalescent tree construction. (left) A pair is uniformly drawn from N = 5 points to coalesce. [sent-21, score-0.366]

11 (middle) The coalescence time $t_5$ is drawn from $\mathrm{Exp}\!\left(\binom{5}{2}\right)$, and another pair on the remaining 4 points is drawn uniformly. [sent-22, score-0.082]

12 Figure 2: Consider the trees one might construct by uniformly picking pairs of points to join, starting with four leaves {a, b, c, d}. [sent-24, score-0.172]

13 One can join a and b first, and then c and d (and then the two parents), or c and d and then a and b to construct the tree on the left. [sent-25, score-0.379]

14 By defining a uniform prior over φn , and then marginalizing out the order of the internal nodes ρ (equivalently, the order in which pairs are joined), we then have a prior over ψn that puts more mass on balanced trees than unbalanced ones. [sent-26, score-0.707]

15 For example, the tree on the right can only be constructed in one way by node joining. [sent-27, score-0.412]

16 Such models come at some cost or limitation; for example, the TSSB allows (and requires) that data live at some of its internal nodes. [sent-28, score-0.126]

17 Our contribution is a new prior over tree structures that is simpler than many of the priors described above, yet still retains the exchangeability and consistency properties of a NPB model. [sent-29, score-0.539]

18 The prior is derived by marginalizing out the times and ordering of internal nodes of the coalescent. [sent-30, score-0.464]

19 The remaining distribution is an exchangeable and consistent prior over tree structures. [sent-31, score-0.552]

20 This prior may be used directly with a data generation process, or a notion of time may be reintroduced, providing a prior with a factorization between the tree structures and times. [sent-32, score-0.652]

21 The simplicity of the prior allows for great flexibility and the potential for more efficient inference algorithms. [sent-33, score-0.103]

22 For the purposes of this paper, we focus on one such possible model, wherein we generate branch lengths according to a process similar to stick-breaking. [sent-34, score-0.314]

23 We introduce the proposed prior on tree structures in Section 2 and the distribution over times conditioned on tree structure in Section 3. [sent-35, score-0.856]

24 Coalescent Prior on Tree Structures (Kingman’s Coalescent): Kingman’s Coalescent provides a prior over balanced, edge-weighted trees, wherein the weights are often interpreted as representing some notion of time. [sent-40, score-0.103]

25 A coalescent tree can be sampled as follows: start with n points and n dangling edges hanging from them, and all weights set to 0. [sent-42, score-0.771]

26 Repeat this process on the remaining n − 1 points until a full weighted tree is constructed. [sent-45, score-0.349]
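Not part of the paper, but the sampling procedure just described is short enough to sketch directly; all names below (e.g. `sample_coalescent_tree`) are illustrative, and the parent-array representation is an assumption of the sketch.

```python
import random

def sample_coalescent_tree(n, seed=0):
    """Sample a coalescent tree over n leaves.

    Leaves are nodes 0..n-1 and internal nodes n..2n-2. Returns (parent, time),
    where parent[v] is the parent of node v (-1 for the root) and time[v] is
    the coalescence time of v (0.0 for leaves).
    """
    rng = random.Random(seed)
    parent = [-1] * (2 * n - 1)
    time = [0.0] * (2 * n - 1)
    active = list(range(n))                       # the n "dangling" edges
    t, nxt = 0.0, n
    while len(active) > 1:
        k = len(active)
        t += rng.expovariate(k * (k - 1) / 2.0)   # waiting time ~ Exp(C(k, 2))
        a, b = rng.sample(active, 2)              # uniformly random pair
        parent[a] = parent[b] = nxt
        time[nxt] = t
        active = [v for v in active if v not in (a, b)] + [nxt]
        nxt += 1
    return parent, time

parent, times = sample_coalescent_tree(5)
print(parent, [round(t, 3) for t in times])
```

Discarding `time` and keeping only `parent` is exactly the marginalization that yields the prior over tree structures discussed next.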

27 Note however, that the weights do not influence the choice of tree structures sampled, which suggests we can marginalize out the times and still retain an exchangeable and consistent distribution over trees. [sent-46, score-0.538]

28 Coalescent Distribution over Trees: We consider two types of tree-like structures: generic (rooted) unweighted tree graphs, which we denote ψn, living in Ψn, and trees of the previous type but with a specified ordering ρ on the internal (non-leaf) nodes of the tree, denoted (ψn, ρ) = φn ∈ Φn. [sent-49, score-0.695]

29 Marginalizing out the times of the coalescent gives a uniform prior over ordered tree structures φn . [sent-50, score-0.977]

30 The order information is ... Figure 3: (left) A sample from the described prior with stick-breaking parameter B(1, 1) (uniform). [sent-51, score-0.103]

31 Let y_i denote the i-th internal node of φn (see Footnote 2), i ∈ {1, ..., n − 1}. [sent-56, score-0.126]

32 There are n ways of attaching the new internal node below y_1, n − 1 ways of attaching below y_2, and so on, giving $n(n+1)/2 = \binom{n+1}{2}$ ways of attaching y* into φn. [sent-60, score-0.67]

33 Thus if we make this choice uniformly at random, the probability of the new tree is $p(\phi_{n+1}) = p(\phi_n)\,\binom{n+1}{2}^{-1} = \prod_{i=2}^{n+1} \binom{i}{2}^{-1}$. [sent-61, score-0.315]
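A quick arithmetic check of this product (not in the extracted text, but it follows immediately from the formula): for n = 4 leaves,

```latex
p(\phi_4) \;=\; \prod_{i=2}^{4} \binom{i}{2}^{-1}
        \;=\; \binom{2}{2}^{-1} \binom{3}{2}^{-1} \binom{4}{2}^{-1}
        \;=\; 1 \cdot \tfrac{1}{3} \cdot \tfrac{1}{6}
        \;=\; \tfrac{1}{18},
```

so there are 18 equally likely ordered tree structures (labelled histories) on four leaves.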

34 i=2 2 It is possible to marginalize out the ordering information in the coalescent tree structures φn to derive exchangeable, consistent priors on “unordered” tree structures ψn . [sent-62, score-1.251]

35 We can perform this marginalization by counting how many ordered tree structures φn ∈ Φn are consistent with a particular unordered tree structure ψn . [sent-63, score-0.809]

36 Each ψn admits $T(\psi_n) = \frac{(n-1)!}{\prod_{i=1}^{n-1} m_i}$ possible orderings on its internal nodes, where m_i is the number of internal nodes in the subtree rooted at node i. [sent-66, score-0.615]

37 This is in agreement with what we would expect: for an unbalanced tree, the m_i are {1, 2, ..., n − 1}. [sent-68, score-0.119]

38 Since an unbalanced tree imposes a full ordering on the internal nodes, there can only be one unbalanced ordered tree that maps to the corresponding unbalanced unordered tree. [sent-72, score-1.125]

39 As the tree becomes more balanced, the m_i decrease, increasing T. [sent-73, score-0.356]

40 Thus the probability of a particular ψn is T(ψn) times the probability of an ordered tree φn under the coalescent (Footnote 3): $p(\psi_n) = \frac{(n-1)!}{\prod_{i=1}^{n-1} m_i} \prod_{i=2}^{n} \binom{i}{2}^{-1}$. [sent-74, score-0.41]
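A minimal sketch, not from the paper, that evaluates this prior for a tree given as a children map; the function and variable names are made up for illustration. On the two four-leaf shapes of Figure 2 it returns 1/9 for the balanced tree and 1/18 for the unbalanced one, consistent with the claim that more mass falls on balanced trees.

```python
from math import comb, factorial, prod

def tmc_prior(children, root, n_leaves):
    """p(psi_n) = (n-1)!/prod_i m_i * prod_{i=2}^{n} C(i,2)^{-1}, where m_i is
    the number of internal nodes in the subtree rooted at internal node i."""
    m = {}
    def count_internal(v):
        kids = children.get(v, [])
        if not kids:                                 # leaf
            return 0
        m[v] = 1 + sum(count_internal(c) for c in kids)
        return m[v]
    count_internal(root)
    T = factorial(n_leaves - 1) // prod(m.values())  # number of valid orderings
    p_ordered = prod(1.0 / comb(i, 2) for i in range(2, n_leaves + 1))
    return T * p_ordered

# Figure 2: balanced ((a,b),(c,d)) vs. unbalanced (((a,b),c),d)
balanced   = {"r": ["u", "v"], "u": ["a", "b"], "v": ["c", "d"]}
unbalanced = {"r": ["w", "d"], "w": ["u", "c"], "u": ["a", "b"]}
print(tmc_prior(balanced, "r", 4), tmc_prior(unbalanced, "r", 4))  # 1/9, 1/18
```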

41 p(ψn) defines an exchangeable and consistent prior over Ψn: p(ψn) is clearly still exchangeable as it does not depend on any order of the data, and it was defined by marginalizing out a consistent process, so its conditional priors are still well defined and thus p(ψn) is consistent. [sent-76, score-0.427]

42 Footnote 1: The sequential sampling scheme often associated with NPB models; for example, the conditional prior for the CRP is the prior probability of adding the (n+1)-st point to one of the existing clusters (or a new cluster) given the clustering on the first n points. [sent-78, score-0.317]

43 Footnote 2: When times are given, we index the internal nodes from most recent to root. [sent-79, score-0.242]

44 Otherwise, nodes are ordered such that parents always succeed children. [sent-80, score-0.125]

45 Footnote 3: It has been brought to our attention that this prior and its connection to the coalescent have been studied before in [2] as the beta-splitting model with parameter β = 0, and later in [14] under the framework of Gibbs-Fragmentation trees. [sent-81, score-0.487]

46 Figure 4: The subtree Sl rooted at the red node l is pruned in preparation for an MCMC move. [sent-82, score-0.268]

47 We perform slice sampling to determine where the pruned subtree’s parent should be placed next in the remaining tree πl . [sent-83, score-0.613]

48 Figure 5: We compute the posterior pdf for each branch that we might attach to. [sent-84, score-0.328]

49 If α or β is greater than one, the Beta prior on branch lengths can cause these pdfs to go to zero at the limits of their domain. [sent-85, score-0.454]

50 Thus to enable moves across branches we compute the extrema of each pdf so that all valid intervals are found. [sent-86, score-0.121]

51 Examples of other potential data generation models include those in [1], such as the “Generalized Gaussian Diffusions,” and the multinomial likelihood often used with the Coalescent. [sent-88, score-0.089]

52 Branch Lengths: Given a tree structure ψ, we can sample branch lengths s_i = t_{p_i} − t_i, where t_i is the time of coalescent event i (with t = 0 at the leaves) and p_i is the parent of i. [sent-90, score-1.485]

53 Consider the following construction similar to a stick-breaking process: Start with a stick of unit length. [sent-91, score-0.097]

54 Starting at the root, travel down the given ψ, and at each split point duplicate the current stick into two sticks, assigning one to each child. [sent-92, score-0.097]

55 B will be the proportion of the remaining stick attributed to that branch of the tree until the next split point (sticks afterwards will be of length proportional to (1 − B)). [sent-94, score-0.693]

56 The total prior over branch lengths can thus be written as $p(\{B_i\} \mid \psi) = \prod_{i=1}^{N-2} \mathrm{B}(B_i \mid \alpha, \beta)$ (2). See Figure 3 for samples from this prior. [sent-96, score-0.417]
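A sketch of the stick-breaking construction behind Equation (2), under the reading that each non-root internal node i draws B_i ~ Beta(α, β), its branch keeps that fraction of the parent's remaining stick (so t_i = t_{p_i}(1 − B_i)), and leaf branches absorb whatever is left (t = 0 at the leaves). This is an assumed, illustrative implementation, not the authors' code.

```python
import random

def sample_node_times(children, root, alpha=1.0, beta=1.0, seed=0):
    """Assign times to the nodes of a tree structure psi: root at t = 1,
    leaves at t = 0, and t_child = t_parent * (1 - B) for internal children,
    with B ~ Beta(alpha, beta)."""
    rng = random.Random(seed)
    times = {root: 1.0}
    def descend(v):
        for c in children.get(v, []):
            if children.get(c):                     # internal child
                B = rng.betavariate(alpha, beta)
                times[c] = times[v] * (1.0 - B)
            else:                                   # leaf
                times[c] = 0.0
            descend(c)
    descend(root)
    return times

tree = {"r": ["u", "v"], "u": ["a", "b"], "v": ["c", "d"]}
times = sample_node_times(tree, "r", alpha=1.0, beta=1.0)
branch_lengths = {v: times[p] - times[v] for p, kids in tree.items() for v in kids}
print({k: round(v, 3) for k, v in times.items()}, branch_lengths)
```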

57 Note that any distribution whose support is the unit interval may be used, and in fact more innovative schemes for sampling the times may be used as well; one of the major advantages of the TMC over the Coalescent and DDT is that the times may be defined in a variety of ways. [sent-97, score-0.124]

58 There is a single Beta random variable attributed to each internal node of the tree (except the root, which has B set to 1). [sent-98, score-0.538]

59 Since the order in which we see the data does not affect the way in which we sample these stick proportions, the process remains exchangeable. [sent-99, score-0.097]

60 Brownian Motion: Given a tree π we can define a likelihood for continuous data x_i ∈ R^p using Brownian motion. [sent-104, score-0.353]

61 We denote the length of each branch segment of the tree si . [sent-105, score-0.563]

62 Data is generated as follows: we start at some unknown location in Rp at time t = 1 and immediately split into two independent Wiener processes (with parameter Λ), each denoted yi , where i is the index of the relevant branch in π. [sent-106, score-0.214]

63 Figure 7: Posterior sample from our model applied to the leukemia dataset. [sent-113, score-0.127]

64 Each of the processes evolves for times s_i; then a new independent Wiener process is instantiated at the time of each split, and this continues until all processes reach the leaves of π (i.e., t = 0). [sent-118, score-0.107]
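A small, assumed sketch of this generative step: diffuse a latent location down a fixed tree, adding Gaussian noise with variance proportional to each branch length. For simplicity the diffusion parameter is taken to be isotropic (Λ = λI); the toy tree, times and names are illustrative only.

```python
import math
import random

def simulate_brownian(children, root, times, dim=2, lam=1.0, seed=0):
    """Each branch of length s = t_parent - t_child adds N(0, lam * s) noise
    per dimension; leaf locations play the role of the observed data x."""
    rng = random.Random(seed)
    loc = {root: [0.0] * dim}                 # latent root location (origin here)
    def descend(v):
        for c in children.get(v, []):
            s = times[v] - times[c]
            loc[c] = [mu + rng.gauss(0.0, math.sqrt(lam * s)) for mu in loc[v]]
            descend(c)
    descend(root)
    return loc

tree  = {"r": ["u", "v"], "u": ["a", "b"], "v": ["c", "d"]}
times = {"r": 1.0, "u": 0.5, "v": 0.4, "a": 0.0, "b": 0.0, "c": 0.0, "d": 0.0}
locations = simulate_brownian(tree, "r", times)
print({v: [round(x, 2) for x in locations[v]] for v in "abcd"})
```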

65 Likelihood Computation: The likelihood p(x|π) can be calculated using a single bottom-up sweep of message passing. [sent-124, score-0.087]

66 As in [20], by marginalizing out from the leaves towards the root, the message at an internal node i is a Gaussian with mean yi and variance Λνi . [sent-125, score-0.419]

67 The ν and ŷ messages can be written for any number of incoming nodes in a single form: $\nu_i^{-1} = \sum_{j \in c(i)} (\nu_j + s_j)^{-1}; \quad \hat{y}_i = \nu_i \sum_{j \in c(i)} \frac{\hat{y}_j}{\nu_j + s_j}$, where c(i) are the nodes sending incoming messages to i. [sent-126, score-0.278]
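The recursion above is easy to implement; the sketch below handles the scalar case with Λ = 1 and observed leaves (ν = 0 at a leaf), which is a simplifying assumption made for brevity rather than the paper's full implementation.

```python
def collect_messages(children, root, times, x):
    """Bottom-up Gaussian message passing: at internal node i,
    1/nu_i = sum_j 1/(nu_j + s_j) and yhat_i = nu_i * sum_j yhat_j/(nu_j + s_j),
    where j ranges over the children of i and s_j is the branch length to j."""
    nu, yhat = {}, {}
    def up(v):
        kids = children.get(v, [])
        if not kids:                         # leaf: observed value, zero variance
            nu[v], yhat[v] = 0.0, x[v]
            return
        prec, weighted = 0.0, 0.0
        for c in kids:
            up(c)
            s = times[v] - times[c]          # branch length to child c
            prec += 1.0 / (nu[c] + s)
            weighted += yhat[c] / (nu[c] + s)
        nu[v] = 1.0 / prec
        yhat[v] = nu[v] * weighted
    up(root)
    return nu, yhat

tree  = {"r": ["u", "v"], "u": ["a", "b"], "v": ["c", "d"]}
times = {"r": 1.0, "u": 0.5, "v": 0.4, "a": 0.0, "b": 0.0, "c": 0.0, "d": 0.0}
x     = {"a": -1.0, "b": -0.8, "c": 1.1, "d": 0.9}
nu, yhat = collect_messages(tree, "r", times, x)
print(round(nu["r"], 3), round(yhat["r"], 3))
```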

68 We can compute the likelihood using any arbitrary node as the root for message passing. [sent-127, score-0.232]

69 Fixing a particular node as root, we can write the total likelihood of the tree as $p(x \mid \pi) = \prod_{i=1}^{n-1} Z_{c(i)}(x, \pi)$ (3). When |c(i)| = 1 (e.g. ...). [sent-128, score-0.45]

70 These messages are derived by using the product of Gaussian pdf identities. [sent-133, score-0.123]

71 First, a random node l is pruned from the tree (so that its parent pl has no parent and only one child), giving the pruned subtree Sl and remaining tree πl . [sent-136, score-1.193]

72 For each branch indexed by the node i “below” it, we compute the posterior density function of where pl should be placed on that branch. [sent-139, score-0.476]

73 We then slice sample on this collection of density functions. [sent-140, score-0.151]

74 Denote π(S, i, t) as the tree formed by attaching S above node i at time t in π. [sent-145, score-0.497]
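For concreteness, one possible (assumed) implementation of the prune and attach operations implied by this notation, using the same children-map representation as the earlier sketches; pruning at the root and attaching above the root are deliberately not handled here.

```python
def detach(children, l):
    """Prune node l together with its parent p_l, leaving the remaining tree
    pi_l; assumes p_l is not the root. Returns p_l for later re-attachment."""
    pl = next(p for p, kids in children.items() if l in kids)
    sibling = next(c for c in children[pl] if c != l)
    gp = next(p for p, kids in children.items() if pl in kids)
    children[gp][children[gp].index(pl)] = sibling   # bypass p_l
    del children[pl]
    return pl

def attach(children, times, l, pl, i, t):
    """pi_l(S_l, i, t): splice p_l (carrying the subtree rooted at l) into the
    branch above node i, at a time t with times[i] < t < times[parent(i)]."""
    p = next(q for q, kids in children.items() if i in kids)
    assert times[i] < t < times[p]
    children[p][children[p].index(i)] = pl
    children[pl] = [i, l]
    times[pl] = t

tree  = {"r": ["u", "v"], "u": ["a", "b"], "v": ["c", "d"]}
times = {"r": 1.0, "u": 0.5, "v": 0.4, "a": 0.0, "b": 0.0, "c": 0.0, "d": 0.0}
pl = detach(tree, "a")                   # prune leaf "a" with its parent "u"
attach(tree, times, "a", pl, "c", 0.2)   # re-attach above "c" at time 0.2
print(tree)                              # {'r': ['b', 'v'], 'v': ['u', 'd'], 'u': ['c', 'a']}
```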

75 For the new tree we imagine collecting messages to node pl resulting in a new factor Zi,l,pi (x, πl (Sl , i, t)). [sent-146, score-0.607]

76 If we now imagine that the original likelihood was computed by collecting to node pi , then we see that the first factor should replace the factor Zlpi ,rpi ,ppi (x, πl ) at node pi while the latter factor was already included in Zlpi ,rpi ,ppi (x, πl ). [sent-148, score-0.581]

77 By taking the product of (6) and (7) we get the joint posterior of (i, t): $p(\pi_l(S_l, i, t) \mid X) \propto \Delta Z(\pi_l(S_l, i, t))\, p(\pi_l(S_l, i, t))$ (8), where the prior term $p(\pi_l(S_l, i, t))$ from (7) involves Beta densities of the form $\mathrm{B}(1 - t/t_{p_i}; \alpha, \beta)$; $p(\pi_l(S_l, i, t) \mid X)$ defines the distribution from which we would like to sample. [sent-152, score-0.279]

78 We propose a slice sampler that can propose adding Sl to any segment in πl . [sent-153, score-0.127]

79 If we can find all of the extrema of the posterior, we can easily find the intervals that contain positive probability density for slice sampling moves (see Figure 5). [sent-155, score-0.253]

80 Thus this slice sampling procedure will mix as quickly as slice sampling on a single unimodal distribution. [sent-156, score-0.262]
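The primitive being relied on here is ordinary univariate slice sampling within a bracketed interval. Below is a generic shrinkage slice sampler in that spirit (not the paper's multi-branch procedure); it assumes the valid interval [lo, hi] is already known, e.g. from the extrema computation of Figure 5, and the names are illustrative.

```python
import math
import random

def slice_sample_interval(logpdf, x0, lo, hi, n_steps=10, seed=0):
    """Slice sampling on a fixed bracket [lo, hi] using the shrinkage rule."""
    rng = random.Random(seed)
    x = x0
    for _ in range(n_steps):
        # slice height: log y = log p(x) + log U, with U ~ Uniform(0, 1]
        log_y = logpdf(x) + math.log(1.0 - rng.random())
        a, b = lo, hi
        while True:
            x_new = rng.uniform(a, b)
            if logpdf(x_new) >= log_y:
                x = x_new
                break
            if x_new < x:                 # shrink the bracket towards x
                a = x_new
            else:
                b = x_new
    return x

# toy usage: a Beta(2, 3)-shaped log-density on (0, 1)
logpdf = lambda t: math.log(t) + 2.0 * math.log(1.0 - t)
print(round(slice_sample_interval(logpdf, 0.5, 1e-9, 1.0 - 1e-9), 3))
```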

81 The overall sampling procedure is then to sample a new location for each node (both leaves and internal nodes) of the tree using the Gibbs sampling scheme explained above. [sent-158, score-0.678]

82 We use an Inverse-Gamma prior on k, so that k −1 ∼ G(κ, θ). [sent-163, score-0.103]

83 The posterior is $k^{-1} \mid X \sim \mathcal{G}\!\left(\tfrac{(N-1)p}{2} + \kappa,\; \tfrac{1}{2}\sum_{i=1}^{N-1} d_i + \theta\right)$, where N is the number of datapoints, p is the dimension, and d_i is the corresponding squared distance (||·|| is the Euclidean norm). (Footnote 4: if either l or i is a leaf, the prior term will be simpler than the one listed here.) [sent-164, score-0.103]
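A sketch of the corresponding conditional draw, assuming G(·, ·) denotes a shape/rate Gamma; that parameterization, the names, and the toy d_i values are assumptions of the sketch, not taken from the authors' code.

```python
import random

def sample_k(d, p, kappa=1.0, theta=1.0, seed=0):
    """Draw the diffusion scale k from k^{-1} | X ~ Gamma(shape, rate) with
    shape = (N - 1) p / 2 + kappa and rate = sum(d) / 2 + theta."""
    rng = random.Random(seed)
    N = len(d) + 1                                  # one d_i per internal node
    shape = (N - 1) * p / 2.0 + kappa
    rate = sum(d) / 2.0 + theta
    k_inv = rng.gammavariate(shape, 1.0 / rate)     # gammavariate takes a scale
    return 1.0 / k_inv

print(round(sample_k(d=[0.8, 1.3, 0.5], p=2), 3))
```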

84 By putting a G(ξ, λ) prior on α − 1 and β − 1, we achieve a posterior for these parameters: $p(\alpha, \beta \mid \xi, \lambda, X) \propto (\alpha-1)^{\xi} (\beta-1)^{\xi} e^{-\lambda(\alpha-1+\beta-1)} \prod_{i=1}^{N-1} \frac{1}{\mathrm{B}(\alpha,\beta)} \left(1 - \frac{t_i}{t_{p_i}}\right)^{\alpha-1} \left(\frac{t_i}{t_{p_i}}\right)^{\beta-1}$. This posterior is log-concave and thus unimodal. [sent-166, score-0.661]

85 By integrating out y_{l_i} in (5), we get a modification of (8) that is proportional to p(π′|π, X). [sent-170, score-0.105]

86 Slice sampling from this gives us several new trees π′_j for each π_i, where one of the leaves is not observed. [sent-171, score-0.21]

87 p(yt |πj , X) is then available by message passing to the leaf associated with yt (denoted l), which results in a Gaussian over yt . [sent-172, score-0.308]

88 Performing the aforementioned integration (after replacing $y_{l_i}$ with $y_t$), we get: $Z_{\mathrm{pred}}(i, t) \propto \int \mathrm{d}y_t\, Z_{i,p_i,l}(x, \pi'(y_t)) = |2\pi\Lambda|^{-\frac{k}{2}} (\nu_i^* + \nu_{p_i}^*)^{-\frac{1}{2}} \exp\!\left(-\tfrac{1}{2}\big(\nu_l^* + (\nu_i^{*-1} + \nu_{p_i}^{*-1})^{-1}\big)\, d_{i,p_i}\right)$, where $d_{i,p_i} = \|\hat{y}_i - \hat{y}_{p_i}\|^2_{\Lambda\nu^*}$. [sent-174, score-0.226]

89 This gives the posterior density for the location of the unobserved point: $p(\pi' = \pi(\{l\}, i, t) \mid \pi, X) \propto Z_{\mathrm{pred}}(i, t)\, \frac{Z_{l_i, r_i}(x, \pi_l)}{Z_{l_i, r_i, p_i}(x, \pi_l)}\, p(\pi(\{l\}, i, t))$, where p(π({l}, i, t)) is as in (7). [sent-175, score-0.115]

90 This is a result of the fact that the divergence function of the DDT strongly encourages the branch lengths to be small, and thus a larger variance is required to explain the data. [sent-184, score-0.314]

91 Figure 7 shows the posterior tree sampled after about 28 Gibbs passes (about 10 minutes). [sent-198, score-0.372]

92 We attribute this difference in performance to our model’s weaker prior on the branch lengths, which causes our model to overfit slightly; if we preset the diffusion variance of our model to a value somewhat larger than the data variance, our performance improves. [sent-201, score-0.443]

93 Conclusion: We introduced a new prior for use in NPB hierarchical clustering, one that can be used in a variety of ways to define a generative model for data. [sent-208, score-0.207]

94 By marginalizing out the time of the coalescent, we achieve a prior from which data can be either generated directly via a graphical model living on trees, or by a CTMC lying on a distribution over times for the branch lengths – in the style of the coalescent and DDT. [sent-209, score-0.964]

95 However, unlike the coalescent and DDT, in our model the times are generated conditioned on the tree structure, giving potential for more interesting models or more efficient inference. [sent-210, score-0.787]

96 The simplicity of the prior allows for efficient Gibbs style inference, and we provide an example model and demonstrate that it can achieve similar performance to that of the DDT. [sent-211, score-0.103]

97 However, to achieve that performance the diffusion variance must be set in advance, suggesting that alternative distributions over the branch lengths may provide better performance than the one explored here. [sent-212, score-0.44]

98 Scalable inference on Kingman’s coalescent using pair similarity. [sent-251, score-0.384]

99 An efficient sequential Monte Carlo algorithm for coalescent clustering. [sent-257, score-0.384]

100 Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. [sent-370, score-0.127]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('coalescent', 0.384), ('tree', 0.315), ('ddt', 0.264), ('sl', 0.228), ('branch', 0.214), ('npb', 0.192), ('dpm', 0.145), ('tmc', 0.144), ('tpi', 0.144), ('pi', 0.135), ('leukemia', 0.127), ('internal', 0.126), ('diffusion', 0.126), ('yt', 0.112), ('trees', 0.108), ('prior', 0.103), ('exchangeable', 0.1), ('lengths', 0.1), ('node', 0.097), ('stick', 0.097), ('slice', 0.093), ('ri', 0.092), ('kingman', 0.085), ('attaching', 0.085), ('dirichlet', 0.085), ('marginalizing', 0.083), ('structures', 0.08), ('ti', 0.078), ('unbalanced', 0.078), ('nodes', 0.073), ('clustering', 0.073), ('parent', 0.073), ('ctmc', 0.072), ('dangling', 0.072), ('yli', 0.072), ('zli', 0.072), ('zlpi', 0.072), ('subtree', 0.071), ('brownian', 0.067), ('messages', 0.066), ('leaves', 0.064), ('join', 0.064), ('sli', 0.064), ('extrema', 0.064), ('sticks', 0.064), ('pruned', 0.06), ('density', 0.058), ('pdf', 0.057), ('posterior', 0.057), ('irvine', 0.056), ('hierarchical', 0.055), ('birds', 0.055), ('beta', 0.053), ('datapoints', 0.052), ('zc', 0.052), ('ordered', 0.052), ('generation', 0.051), ('pl', 0.05), ('message', 0.049), ('ways', 0.049), ('root', 0.048), ('boyles', 0.048), ('coalescence', 0.048), ('hyperpriors', 0.048), ('patel', 0.048), ('sri', 0.048), ('tssb', 0.048), ('zppi', 0.048), ('zpred', 0.048), ('unordered', 0.047), ('giving', 0.045), ('gibbs', 0.044), ('times', 0.043), ('dyt', 0.042), ('ntest', 0.042), ('jude', 0.042), ('priors', 0.041), ('mi', 0.041), ('rooted', 0.04), ('collecting', 0.04), ('bi', 0.039), ('bayesian', 0.039), ('joined', 0.039), ('imagine', 0.039), ('likelihood', 0.038), ('sampling', 0.038), ('pdfs', 0.037), ('living', 0.037), ('ordering', 0.036), ('li', 0.036), ('leaf', 0.035), ('knowles', 0.035), ('nested', 0.035), ('remaining', 0.034), ('segment', 0.034), ('crp', 0.033), ('proportional', 0.033), ('balanced', 0.033), ('flexible', 0.032), ('wiener', 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999946 339 nips-2012-The Time-Marginalized Coalescent Prior for Hierarchical Clustering

Author: Levi Boyles, Max Welling

Abstract: We introduce a new prior for use in Nonparametric Bayesian Hierarchical Clustering. The prior is constructed by marginalizing out the time information of Kingman’s coalescent, providing a prior over tree structures which we call the Time-Marginalized Coalescent (TMC). This allows for models which factorize the tree structure and times, providing two benefits: more flexible priors may be constructed and more efficient Gibbs type inference can be used. We demonstrate this on an example model for density estimation and show the TMC achieves competitive experimental results. 1

2 0.16891059 260 nips-2012-Online Sum-Product Computation Over Trees

Author: Mark Herbster, Stephen Pasteris, Fabio Vitale

Abstract: We consider the problem of performing efficient sum-product computations in an online setting over a tree. A natural application of our methods is to compute the marginal distribution at a vertex in a tree-structured Markov random field. Belief propagation can be used to solve this problem, but requires time linear in the size of the tree, and is therefore too slow in an online setting where we are continuously receiving new data and computing individual marginals. With our method we aim to update the data and compute marginals in time that is no more than logarithmic in the size of the tree, and is often significantly less. We accomplish this via a hierarchical covering structure that caches previous local sum-product computations. Our contribution is three-fold: we i) give a linear time algorithm to find an optimal hierarchical cover of a tree; ii) give a sum-productlike algorithm to efficiently compute marginals with respect to this cover; and iii) apply “i” and “ii” to find an efficient algorithm with a regret bound for the online allocation problem in a multi-task setting. 1

3 0.13383266 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs

Author: Anima Anandkumar, Ragupathyraj Valluvan

Abstract: Graphical model selection refers to the problem of estimating the unknown graph structure given observations at the nodes in the model. We consider a challenging instance of this problem when some of the nodes are latent or hidden. We characterize conditions for tractable graph estimation and develop efficient methods with provable guarantees. We consider the class of Ising models Markov on locally tree-like graphs, which are in the regime of correlation decay. We propose an efficient method for graph estimation, and establish its structural consistency −δη(η+1)−2 when the number of samples n scales as n = Ω(θmin log p), where θmin is the minimum edge potential, δ is the depth (i.e., distance from a hidden node to the nearest observed nodes), and η is a parameter which depends on the minimum and maximum node and edge potentials in the Ising model. The proposed method is practical to implement and provides flexibility to control the number of latent variables and the cycle lengths in the output graph. We also present necessary conditions for graph estimation by any method and show that our method nearly matches the lower bound on sample requirements. Keywords: Graphical model selection, latent variables, quartet methods, locally tree-like graphs. 1

4 0.13377896 180 nips-2012-Learning Mixtures of Tree Graphical Models

Author: Anima Anandkumar, Furong Huang, Daniel J. Hsu, Sham M. Kakade

Abstract: We consider unsupervised estimation of mixtures of discrete graphical models, where the class variable is hidden and each mixture component can have a potentially different Markov graph structure and parameters over the observed variables. We propose a novel method for estimating the mixture components with provable guarantees. Our output is a tree-mixture model which serves as a good approximation to the underlying graphical model mixture. The sample and computational requirements for our method scale as poly(p, r), for an r-component mixture of pvariate graphical models, for a wide class of models which includes tree mixtures and mixtures over bounded degree graphs. Keywords: Graphical models, mixture models, spectral methods, tree approximation.

5 0.12440757 81 nips-2012-Context-Sensitive Decision Forests for Object Detection

Author: Peter Kontschieder, Samuel R. Bulò, Antonio Criminisi, Pushmeet Kohli, Marcello Pelillo, Horst Bischof

Abstract: In this paper we introduce Context-Sensitive Decision Forests - A new perspective to exploit contextual information in the popular decision forest framework for the object detection problem. They are tree-structured classifiers with the ability to access intermediate prediction (here: classification and regression) information during training and inference time. This intermediate prediction is available for each sample and allows us to develop context-based decision criteria, used for refining the prediction process. In addition, we introduce a novel split criterion which in combination with a priority based way of constructing the trees, allows more accurate regression mode selection and hence improves the current context information. In our experiments, we demonstrate improved results for the task of pedestrian detection on the challenging TUD data set when compared to state-of-the-art methods.

6 0.11100668 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function

7 0.10896727 336 nips-2012-The Coloured Noise Expansion and Parameter Estimation of Diffusion Processes

8 0.10088141 178 nips-2012-Learning Label Trees for Probabilistic Modelling of Implicit Feedback

9 0.099827737 182 nips-2012-Learning Networks of Heterogeneous Influence

10 0.098372765 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions

11 0.093874231 233 nips-2012-Multiresolution Gaussian Processes

12 0.093716562 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

13 0.091965377 96 nips-2012-Density Propagation and Improved Bounds on the Partition Function

14 0.091094919 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

15 0.091088317 183 nips-2012-Learning Partially Observable Models Using Temporally Abstract Decision Trees

16 0.088722408 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

17 0.087937914 126 nips-2012-FastEx: Hash Clustering with Exponential Families

18 0.086848818 215 nips-2012-Minimizing Uncertainty in Pipelines

19 0.08636082 58 nips-2012-Bayesian models for Large-scale Hierarchical Classification

20 0.086220093 314 nips-2012-Slice Normalized Dynamic Markov Logic Networks


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.213), (1, 0.036), (2, 0.0), (3, 0.064), (4, -0.157), (5, -0.157), (6, -0.018), (7, -0.091), (8, -0.144), (9, 0.092), (10, -0.005), (11, -0.035), (12, 0.062), (13, -0.105), (14, -0.076), (15, -0.02), (16, -0.088), (17, -0.002), (18, 0.06), (19, -0.066), (20, -0.051), (21, 0.161), (22, 0.097), (23, 0.038), (24, -0.031), (25, 0.01), (26, -0.036), (27, -0.148), (28, 0.023), (29, 0.021), (30, 0.065), (31, -0.102), (32, -0.071), (33, -0.104), (34, -0.013), (35, 0.067), (36, -0.038), (37, 0.006), (38, -0.061), (39, 0.038), (40, 0.022), (41, -0.016), (42, 0.057), (43, 0.04), (44, -0.06), (45, -0.047), (46, -0.049), (47, 0.006), (48, 0.068), (49, 0.087)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95707166 339 nips-2012-The Time-Marginalized Coalescent Prior for Hierarchical Clustering

Author: Levi Boyles, Max Welling

Abstract: We introduce a new prior for use in Nonparametric Bayesian Hierarchical Clustering. The prior is constructed by marginalizing out the time information of Kingman’s coalescent, providing a prior over tree structures which we call the Time-Marginalized Coalescent (TMC). This allows for models which factorize the tree structure and times, providing two benefits: more flexible priors may be constructed and more efficient Gibbs type inference can be used. We demonstrate this on an example model for density estimation and show the TMC achieves competitive experimental results. 1
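
For readers who want to see the generative structure in code, the sketch below samples a tree by repeatedly merging a uniformly chosen pair of lineages, which is the coalescent topology with its times marginalized out, and separately draws the standard coalescent waiting times Exp(k(k-1)/2). This is a minimal illustrative sketch of the standard coalescent construction under that factorization, not the authors' released code; the function names are ours.

```python
import random

def sample_tmc_topology(n, rng=None):
    """Sample a tree structure from the time-marginalized coalescent:
    repeatedly merge a uniformly chosen pair of the remaining lineages."""
    rng = rng or random.Random(0)
    lineages = [("leaf", i) for i in range(n)]
    while len(lineages) > 1:
        i, j = rng.sample(range(len(lineages)), 2)        # uniform pair
        a, b = lineages[i], lineages[j]
        lineages = [x for k, x in enumerate(lineages) if k not in (i, j)]
        lineages.append(("node", a, b))                   # coalesce the pair
    return lineages[0]

def sample_coalescent_times(n, rng=None):
    """Times can be drawn independently of the structure: while k lineages
    remain, the waiting time is Exponential with rate k(k-1)/2."""
    rng = rng or random.Random(0)
    times, t = [], 0.0
    for k in range(n, 1, -1):
        t += rng.expovariate(k * (k - 1) / 2.0)
        times.append(t)
    return times

print(sample_tmc_topology(5))
print(sample_coalescent_times(5))
```

Because the topology and the times are drawn by independent routines, a prior on one can be swapped out without touching the other, which is the factorization benefit the abstract points to.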

2 0.736193 232 nips-2012-Multiplicative Forests for Continuous-Time Processes

Author: Jeremy Weiss, Sriraam Natarajan, David Page

Abstract: Learning temporal dependencies between variables over continuous time is an important and challenging task. Continuous-time Bayesian networks effectively model such processes but are limited by the number of conditional intensity matrices, which grows exponentially in the number of parents per variable. We develop a partition-based representation using regression trees and forests whose parameter spaces grow linearly in the number of node splits. Using a multiplicative assumption we show how to update the forest likelihood in closed form, producing efficient model updates. Our results show multiplicative forests can be learned from few temporal trajectories with large gains in performance and scalability.
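
A toy sketch of the multiplicative idea in the abstract above: each tree in a forest maps the parents' current configuration to a positive factor, and the conditional intensity is the product of those factors, so the parameter count grows with the number of splits rather than exponentially in the number of parents. This is our own illustrative reading, not the paper's CTBN implementation; the names and example trees are hypothetical.

```python
from typing import Callable, Dict, List

# Each "tree" is represented abstractly as a function from the parents'
# current state to a positive rate factor (the value stored at a leaf).
RateTree = Callable[[Dict[str, int]], float]

def intensity(forest: List[RateTree], parent_state: Dict[str, int]) -> float:
    """Multiplicative assumption: the transition intensity is the product
    of the factors contributed by each tree in the forest."""
    rate = 1.0
    for tree in forest:
        rate *= tree(parent_state)
    return rate

# Two tiny stump-like trees, each splitting on a single parent variable.
forest = [
    lambda s: 2.0 if s["alarm"] == 1 else 0.5,
    lambda s: 3.0 if s["hour"] >= 18 else 1.0,
]

print(intensity(forest, {"alarm": 1, "hour": 20}))  # 2.0 * 3.0 = 6.0
print(intensity(forest, {"alarm": 0, "hour": 9}))   # 0.5 * 1.0 = 0.5
```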

3 0.71073705 260 nips-2012-Online Sum-Product Computation Over Trees

Author: Mark Herbster, Stephen Pasteris, Fabio Vitale

Abstract: We consider the problem of performing efficient sum-product computations in an online setting over a tree. A natural application of our methods is to compute the marginal distribution at a vertex in a tree-structured Markov random field. Belief propagation can be used to solve this problem, but requires time linear in the size of the tree, and is therefore too slow in an online setting where we are continuously receiving new data and computing individual marginals. With our method we aim to update the data and compute marginals in time that is no more than logarithmic in the size of the tree, and is often significantly less. We accomplish this via a hierarchical covering structure that caches previous local sum-product computations. Our contribution is three-fold: we i) give a linear time algorithm to find an optimal hierarchical cover of a tree; ii) give a sum-product-like algorithm to efficiently compute marginals with respect to this cover; and iii) apply “i” and “ii” to find an efficient algorithm with a regret bound for the online allocation problem in a multi-task setting. 1
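
For reference, the baseline the abstract contrasts against is ordinary belief propagation on a tree, which computes a marginal in time linear in the tree size. Below is a minimal sketch of that baseline (the variable names, potentials, and toy chain are ours; the paper's contribution, the logarithmic-time online update via a hierarchical cover, is not shown).

```python
import numpy as np

def marginal_at_root(root, adj, unary, pairwise):
    """Sum-product on a tree: the message from child c to parent p is
    m[c->p](x_p) = sum_{x_c} unary[c][x_c] * pairwise[(c, p)][x_c, x_p] * prod(messages into c)."""
    def upward(node, parent):
        belief = unary[node].copy()
        for child in adj[node]:
            if child != parent:
                belief *= upward(child, node)
        if parent is None:
            return belief / belief.sum()                # normalized marginal at root
        return pairwise[(node, parent)].T @ belief      # message to parent
    return upward(root, None)

# A 3-node chain 0 - 1 - 2 with binary variables.
adj = {0: [1], 1: [0, 2], 2: [1]}
unary = {0: np.array([0.6, 0.4]), 1: np.array([0.5, 0.5]), 2: np.array([0.1, 0.9])}
same = np.array([[0.8, 0.2], [0.2, 0.8]])               # "prefer agreement" potential
pairwise = {(0, 1): same, (1, 0): same, (1, 2): same, (2, 1): same}

print(marginal_at_root(1, adj, unary, pairwise))        # node 2's evidence pulls node 1 toward state 1
```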

4 0.687002 215 nips-2012-Minimizing Uncertainty in Pipelines

Author: Nilesh Dalvi, Aditya Parameswaran, Vibhor Rastogi

Abstract: In this paper, we consider the problem of debugging large pipelines by human labeling. We represent the execution of a pipeline using a directed acyclic graph of AND and OR nodes, where each node represents a data item produced by some operator in the pipeline. We assume that each operator assigns a confidence to each of its output data. We want to reduce the uncertainty in the output by issuing queries to a human, where a query consists of checking if a given data item is correct. In this paper, we consider the problem of asking the optimal set of queries to minimize the resulting output uncertainty. We perform a detailed evaluation of the complexity of the problem for various classes of graphs. We give efficient algorithms for the problem for trees, and show that, for a general dag, the problem is intractable. 1
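
To make the setting concrete, here is one simple way to propagate per-item confidences through an AND/OR graph: an AND node is correct only if all of its inputs are correct, an OR node if at least one is. The independence assumption and the node encoding below are ours, not necessarily the paper's model, and the query-selection strategy itself is not sketched.

```python
# Node spec: leaves carry a confidence; internal nodes are ("AND" | "OR", children).
graph = {
    "a": ("LEAF", 0.9),
    "b": ("LEAF", 0.6),
    "c": ("LEAF", 0.7),
    "x": ("AND", ["a", "b"]),
    "y": ("OR", ["x", "c"]),
}

def prob_correct(node, graph, cache=None):
    """P(node is correct), assuming its inputs are independent."""
    cache = {} if cache is None else cache
    if node in cache:
        return cache[node]
    kind, payload = graph[node]
    if kind == "LEAF":
        p = payload
    elif kind == "AND":
        p = 1.0
        for child in payload:
            p *= prob_correct(child, graph, cache)
    else:  # OR
        q = 1.0
        for child in payload:
            q *= 1.0 - prob_correct(child, graph, cache)
        p = 1.0 - q
    cache[node] = p
    return p

print(prob_correct("y", graph))  # 1 - (1 - 0.9*0.6) * (1 - 0.7) = 0.862
```

A human query that confirms an item pins its probability to 1 (or 0), and recomputing the output probability then quantifies how much uncertainty that query removed.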

5 0.62247843 183 nips-2012-Learning Partially Observable Models Using Temporally Abstract Decision Trees

Author: Erik Talvitie

Abstract: This paper introduces timeline trees, which are partial models of partially observable environments. Timeline trees are given some specific predictions to make and learn a decision tree over history. The main idea of timeline trees is to use temporally abstract features to identify and split on features of key events, spread arbitrarily far apart in the past (whereas previous decision-tree-based methods have been limited to a finite suffix of history). Experiments demonstrate that timeline trees can learn to make high quality predictions in complex, partially observable environments with high-dimensional observations (e.g. an arcade game). 1
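
The key ingredient described above, features of events arbitrarily far back in history, can be made concrete with simple history predicates like the ones below; a decision tree would then split on their values. These are our own toy examples of "temporally abstract" features, not the paper's feature language.

```python
def ever_observed(history, event):
    """1 if the event occurred at any point in the past, else 0."""
    return int(event in history)

def steps_since(history, event):
    """Steps since the most recent occurrence of the event (None if never seen)."""
    for lag, obs in enumerate(reversed(history), start=1):
        if obs == event:
            return lag
    return None

history = ["door_open", "key_taken", "walk", "walk", "walk"]
print(ever_observed(history, "key_taken"))   # 1, even though the event is 4 steps back
print(steps_since(history, "key_taken"))     # 4
```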

6 0.62232131 81 nips-2012-Context-Sensitive Decision Forests for Object Detection

7 0.61630893 207 nips-2012-Mandatory Leaf Node Prediction in Hierarchical Multilabel Classification

8 0.57280463 180 nips-2012-Learning Mixtures of Tree Graphical Models

9 0.53181183 182 nips-2012-Learning Networks of Heterogeneous Influence

10 0.52786905 156 nips-2012-Identifiability and Unmixing of Latent Parse Trees

11 0.5277608 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs

12 0.52700472 58 nips-2012-Bayesian models for Large-scale Hierarchical Classification

13 0.50471658 334 nips-2012-Tensor Decomposition for Fast Parsing with Latent-Variable PCFGs

14 0.50145972 118 nips-2012-Entangled Monte Carlo

15 0.49826699 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

16 0.49632642 233 nips-2012-Multiresolution Gaussian Processes

17 0.47726503 78 nips-2012-Compressive Sensing MRI with Wavelet Tree Sparsity

18 0.47652704 267 nips-2012-Perceptron Learning of SAT

19 0.46988297 283 nips-2012-Putting Bayes to sleep

20 0.46355498 294 nips-2012-Repulsive Mixtures


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.065), (21, 0.011), (38, 0.07), (42, 0.013), (54, 0.019), (55, 0.012), (74, 0.483), (76, 0.109), (80, 0.085), (92, 0.038)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.92433023 202 nips-2012-Locally Uniform Comparison Image Descriptor

Author: Andrew Ziegler, Eric Christiansen, David Kriegman, Serge J. Belongie

Abstract: Keypoint matching between pairs of images using popular descriptors like SIFT or a faster variant called SURF is at the heart of many computer vision algorithms including recognition, mosaicing, and structure from motion. However, SIFT and SURF do not perform well for real-time or mobile applications. As an alternative very fast binary descriptors like BRIEF and related methods use pairwise comparisons of pixel intensities in an image patch. We present an analysis of BRIEF and related approaches revealing that they are hashing schemes on the ordinal correlation metric Kendall’s tau. Here, we introduce Locally Uniform Comparison Image Descriptor (LUCID), a simple description method based on linear time permutation distances between the ordering of RGB values of two image patches. LUCID is computable in linear time with respect to the number of pixels and does not require floating point computation. 1
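
A minimal sketch of the core computation the abstract describes: compare the rank orderings of the pixel values in two patches, where each ordering is obtained in linear time by a counting sort over the 8-bit range. Details such as the exact distance between orderings are our assumption (L1 is used here), so treat this as an illustration rather than the reference LUCID implementation.

```python
import numpy as np

def counting_argsort(values, num_bins=256):
    """Stable argsort of 8-bit values in linear time via counting sort."""
    values = np.asarray(values, dtype=np.int64)
    order = np.empty(len(values), dtype=np.int64)
    counts = np.zeros(num_bins + 1, dtype=np.int64)
    for v in values:
        counts[v + 1] += 1
    starts = np.cumsum(counts)           # starts[v] = first slot for value v
    for idx, v in enumerate(values):
        order[starts[v]] = idx
        starts[v] += 1
    return order

def lucid_distance(patch_a, patch_b):
    """L1 distance between the orderings (permutations) of two flattened patches."""
    a, b = patch_a.ravel(), patch_b.ravel()
    return int(np.abs(counting_argsort(a) - counting_argsort(b)).sum())

rng = np.random.default_rng(0)
p1 = rng.integers(0, 256, size=(8, 8), dtype=np.uint8)
p2 = rng.integers(0, 256, size=(8, 8), dtype=np.uint8)
print(lucid_distance(p1, p1), lucid_distance(p1, p2))   # identical patches give 0
```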

2 0.91924691 360 nips-2012-Visual Recognition using Embedded Feature Selection for Curvature Self-Similarity

Author: Angela Eigenstetter, Bjorn Ommer

Abstract: Category-level object detection has a crucial need for informative object representations. This demand has led to feature descriptors of ever increasing dimensionality like co-occurrence statistics and self-similarity. In this paper we propose a new object representation based on curvature self-similarity that goes beyond the currently popular approximation of objects using straight lines. However, like all descriptors using second order statistics, ours also exhibits a high dimensionality. Although improving discriminability, the high dimensionality becomes a critical issue due to lack of generalization ability and curse of dimensionality. Given only a limited amount of training data, even sophisticated learning algorithms such as the popular kernel methods are not able to suppress noisy or superfluous dimensions of such high-dimensional data. Consequently, there is a natural need for feature selection when using present-day informative features and, particularly, curvature self-similarity. We therefore suggest an embedded feature selection method for SVMs that reduces complexity and improves generalization capability of object models. By successfully integrating the proposed curvature self-similarity representation together with the embedded feature selection in a widely used state-of-the-art object detection framework we show the general pertinence of the approach. 1

3 0.9158982 40 nips-2012-Analyzing 3D Objects in Cluttered Images

Author: Mohsen Hejrati, Deva Ramanan

Abstract: We present an approach to detecting and analyzing the 3D configuration of objects in real-world images with heavy occlusion and clutter. We focus on the application of finding and analyzing cars. We do so with a two-stage model; the first stage reasons about 2D shape and appearance variation due to within-class variation (station wagons look different than sedans) and changes in viewpoint. Rather than using a view-based model, we describe a compositional representation that models a large number of effective views and shapes using a small number of local view-based templates. We use this model to propose candidate detections and 2D estimates of shape. These estimates are then refined by our second stage, using an explicit 3D model of shape and viewpoint. We use a morphable model to capture 3D within-class variation, and use a weak-perspective camera model to capture viewpoint. We learn all model parameters from 2D annotations. We demonstrate state-of-the-art accuracy for detection, viewpoint estimation, and 3D shape reconstruction on challenging images from the PASCAL VOC 2011 dataset. 1

4 0.85781151 337 nips-2012-The Lovász ϑ function, SVMs and finding large dense subgraphs

Author: Vinay Jethava, Anders Martinsson, Chiranjib Bhattacharyya, Devdatt Dubhashi

Abstract: The Lovász ϑ function of a graph, a fundamental tool in combinatorial optimization and approximation algorithms, is computed by solving a SDP. In this paper we establish that the Lovász ϑ function is equivalent to a kernel learning problem related to one class SVM. This interesting connection opens up many opportunities bridging graph theoretic algorithms and machine learning. We show that there exist graphs, which we call SVM−ϑ graphs, on which the Lovász ϑ function can be approximated well by a one-class SVM. This leads to novel use of SVM techniques for solving algorithmic problems in large graphs e.g. identifying a planted clique of size Θ(√n) in a random graph G(n, 1/2). A classic approach for this problem involves computing the ϑ function, however it is not scalable due to SDP computation. We show that the random graph with a planted clique is an example of SVM−ϑ graph. As a consequence a SVM based approach easily identifies the clique in large graphs and is competitive with the state-of-the-art. We introduce the notion of common orthogonal labelling and show that it can be computed by solving a Multiple Kernel learning problem. It is further shown that such a labelling is extremely useful in identifying a large common dense subgraph in multiple graphs, which is known to be a computationally difficult problem. The proposed algorithm achieves an order of magnitude scalability compared to state of the art methods. 1
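
The planted-clique benchmark mentioned in the abstract is easy to instantiate: draw an Erdős–Rényi graph G(n, 1/2) and overlay a clique on roughly √n randomly chosen vertices. The generator below is a small sketch of that instance (the ϑ-function and SVM machinery themselves are not reproduced here); all names are ours.

```python
import numpy as np

def planted_clique(n, clique_size, seed=0):
    """Adjacency matrix of G(n, 1/2) with a clique planted on `clique_size` vertices."""
    rng = np.random.default_rng(seed)
    upper = rng.random((n, n)) < 0.5
    adj = np.triu(upper, 1)
    adj = adj | adj.T                        # symmetric Erdos-Renyi G(n, 1/2)
    clique = rng.choice(n, size=clique_size, replace=False)
    adj[np.ix_(clique, clique)] = True       # plant the clique
    np.fill_diagonal(adj, False)
    return adj, clique

n = 400
adj, clique = planted_clique(n, clique_size=int(2 * np.sqrt(n)))
print(adj.shape, len(clique), int(adj.sum()) // 2)   # graph size, clique size, edge count
```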

same-paper 5 0.84758019 339 nips-2012-The Time-Marginalized Coalescent Prior for Hierarchical Clustering

Author: Levi Boyles, Max Welling

Abstract: We introduce a new prior for use in Nonparametric Bayesian Hierarchical Clustering. The prior is constructed by marginalizing out the time information of Kingman’s coalescent, providing a prior over tree structures which we call the Time-Marginalized Coalescent (TMC). This allows for models which factorize the tree structure and times, providing two benefits: more flexible priors may be constructed and more efficient Gibbs type inference can be used. We demonstrate this on an example model for density estimation and show the TMC achieves competitive experimental results. 1

6 0.79488373 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries

7 0.73616332 274 nips-2012-Priors for Diversity in Generative Latent Variable Models

8 0.72400951 357 nips-2012-Unsupervised Template Learning for Fine-Grained Object Recognition

9 0.71281475 185 nips-2012-Learning about Canonical Views from Internet Image Collections

10 0.70413619 201 nips-2012-Localizing 3D cuboids in single-view images

11 0.68562144 176 nips-2012-Learning Image Descriptors with the Boosting-Trick

12 0.67793411 8 nips-2012-A Generative Model for Parts-based Object Segmentation

13 0.66724825 210 nips-2012-Memorability of Image Regions

14 0.65796775 101 nips-2012-Discriminatively Trained Sparse Code Gradients for Contour Detection

15 0.63797605 235 nips-2012-Natural Images, Gaussian Mixtures and Dead Leaves

16 0.63157254 303 nips-2012-Searching for objects driven by context

17 0.63031751 106 nips-2012-Dynamical And-Or Graph Learning for Object Shape Modeling and Detection

18 0.62271845 168 nips-2012-Kernel Latent SVM for Visual Recognition

19 0.61415678 146 nips-2012-Graphical Gaussian Vector for Image Categorization

20 0.5951798 81 nips-2012-Context-Sensitive Decision Forests for Object Detection