nips nips2012 nips2012-123 knowledge-graph by maker-knowledge-mining

123 nips-2012-Exponential Concentration for Mutual Information Estimation with Application to Forests


Source: pdf

Author: Han Liu, Larry Wasserman, John D. Lafferty

Abstract: We prove a new exponential concentration inequality for a plug-in estimator of the Shannon mutual information. Previous results on mutual information estimation only bounded expected error. The advantage of having the exponential inequality is that, combined with the union bound, we can guarantee accurate estimators of the mutual information for many pairs of random variables simultaneously. As an application, we show how to use such a result to optimally estimate the density function and graph of a distribution which is Markov to a forest graph. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We prove a new exponential concentration inequality for a plug-in estimator of the Shannon mutual information. [sent-6, score-0.381]

2 Previous results on mutual information estimation only bounded expected error. [sent-7, score-0.226]

3 The advantage of having the exponential inequality is that, combined with the union bound, we can guarantee accurate estimators of the mutual information for many pairs of random variables simultaneously. [sent-8, score-0.275]

4 As an application, we show how to use such a result to optimally estimate the density function and graph of a distribution which is Markov to a forest graph. [sent-9, score-0.463]

5 1 Introduction We consider the problem of nonparametrically estimating the Shannon mutual information between two random variables. [sent-10, score-0.196]

6 Let $X_1 \in \mathcal{X}_1$ and $X_2 \in \mathcal{X}_2$ be two random variables with domains $\mathcal{X}_1$ and $\mathcal{X}_2$ and joint density $p(x_1, x_2)$. [sent-11, score-0.168]

7 The mutual information between $X_1$ and $X_2$ is $I(X_1; X_2) := \int_{\mathcal{X}_1}\int_{\mathcal{X}_2} p(x_1, x_2) \log \frac{p(x_1, x_2)}{p(x_1)p(x_2)}\, dx_1\, dx_2 = H(X_1) + H(X_2) - H(X_1, X_2)$, where $H(X_1, X_2) = -\int p(x_1, x_2) \log p(x_1, x_2)\, dx_1\, dx_2$ (and similarly for $H(X_1)$ and $H(X_2)$) are the corresponding Shannon entropies [4]. [sent-12, score-0.241]
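
To make the identity above concrete, here is a small numpy sketch (an illustration added here, not code from the paper) that discretizes a bivariate Gaussian with correlation rho on a grid and checks $I = H(X_1) + H(X_2) - H(X_1, X_2)$ against the closed form $-\tfrac{1}{2}\log(1-\rho^2)$; the grid range and resolution are arbitrary choices.

```python
import numpy as np

# Discretize a bivariate Gaussian with correlation rho on a grid and check
# I(X1; X2) = H(X1) + H(X2) - H(X1, X2) against the closed form -0.5*log(1 - rho^2).
rho = 0.6
grid = np.linspace(-5.0, 5.0, 400)
dx = grid[1] - grid[0]
x1, x2 = np.meshgrid(grid, grid, indexing="ij")
p12 = np.exp(-(x1**2 - 2*rho*x1*x2 + x2**2) / (2*(1 - rho**2)))
p12 /= 2*np.pi*np.sqrt(1 - rho**2)

def entropy(p, cell):
    """Riemann-sum approximation of -integral p log p."""
    p = np.clip(p, 1e-300, None)          # avoid log(0) on the grid tails
    return -np.sum(p * np.log(p)) * cell

p1 = p12.sum(axis=1) * dx                 # marginal density of X1
p2 = p12.sum(axis=0) * dx                 # marginal density of X2
mi = entropy(p1, dx) + entropy(p2, dx) - entropy(p12, dx*dx)
print(mi, -0.5*np.log(1 - rho**2))        # the two values should agree closely
```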

8 The mutual information is a measure of dependence between X1 and X2 . [sent-13, score-0.153]

9 A simple way to estimate the Shannon entropy is to use a kernel density estimator (KDE) [22, 1, 9, 5, 20, 7]. [sent-15, score-0.438]

10 That is, the densities p(x, y), p(x), and p(y) are separately estimated from samples and the estimated densities are used to calculate the entropy. [sent-17, score-0.11]
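
A hedged sketch of this separate-estimation idea follows (it is not the paper's estimator, which additionally uses boundary mirroring and clipping as described below): Gaussian product-kernel KDEs for $p(x, y)$, $p(x)$, and $p(y)$ are evaluated on a grid and plugged into the entropy formulas; the bandwidth, grid, and sampling distribution are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h = 2000, 0.25                          # sample size and a hand-picked bandwidth
samples = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=n)

grid = np.linspace(-4.0, 4.0, 200)
dx = grid[1] - grid[0]

def kde_1d(data, pts, h):
    # Gaussian KDE of a univariate sample, evaluated at the grid points.
    z = (pts[:, None] - data[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (n * h * np.sqrt(2*np.pi))

def kde_2d(data, pts, h):
    # Product Gaussian kernel: p_hat(a, b) = (1/(n h^2 2 pi)) * sum_i K1 * K2.
    k1 = np.exp(-0.5 * ((pts[:, None] - data[None, :, 0]) / h)**2)
    k2 = np.exp(-0.5 * ((pts[:, None] - data[None, :, 1]) / h)**2)
    return (k1 @ k2.T) / (n * h**2 * 2*np.pi)

def entropy(p, cell):
    p = np.clip(p, 1e-300, None)
    return -np.sum(p * np.log(p)) * cell

p12 = kde_2d(samples, grid, h)
p1 = kde_1d(samples[:, 0], grid, h)
p2 = kde_1d(samples[:, 1], grid, h)
print("plug-in MI estimate:", entropy(p1, dx) + entropy(p2, dx) - entropy(p12, dx*dx))
```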

11 There have been many recent developments in the problem of estimating Shannon entropy and related quantities as well as application of these results to machine learning problems [18, 21, 8, 6]. [sent-20, score-0.121]

12 In this paper, we construct an estimator with this rate, but we also prove an exponential concentration inequality for the estimator. [sent-22, score-0.228]

13 More specifically, we show that our estimator $\hat H$ of $H(p)$ satisfies $\sup_{p \in \Sigma} P\big(|\hat H - H(p)| > \epsilon\big) \le 2\exp\!\big(-\tfrac{n\epsilon^2}{36\kappa^2}\big)$. [sent-23, score-0.213]

14 To the best of our knowledge, this is the first such exponential inequality for nonparametric Shannon entropy and mutual information estimation. [sent-25, score-0.357]

15 The advantage of this result, over the usual results which state that $E|\hat H - H(p)|^2 = O(n^{-1})$, is that we can apply the union bound and thus guarantee accurate mutual information estimation for many pairs of random variables simultaneously. [sent-26, score-0.244]
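
The following small sketch (illustrative only; the value of kappa below is a placeholder, since in the paper it is determined by the bounds $[\kappa_1, \kappa_2]$ on the density) shows how the stated tail bound combines with the union bound over all pairs, and how the sample size needed for a simultaneous guarantee grows only logarithmically in the number of pairs.

```python
import numpy as np

# Union bound over all d(d+1)/2 pairwise mutual informations: if each pair fails
# with probability at most 2*exp(-n*eps**2/(36*kappa**2)), the probability that
# *any* pair errs by more than eps is at most the sum of those tails.
# kappa = 2.0 is a placeholder; in the paper it is fixed by the bounds [kappa1, kappa2].
d, n, eps, kappa = 1000, 2_000_000, 0.05, 2.0
pairs = d * (d + 1) // 2
fail_prob = pairs * 2 * np.exp(-n * eps**2 / (36 * kappa**2))
print(f"simultaneous failure probability <= {fail_prob:.3g}")

# Equivalently, the sample size making the bound at most delta grows only
# logarithmically in the number of pairs:
delta = 0.01
n_needed = 36 * kappa**2 / eps**2 * np.log(2 * pairs / delta)
print(f"n needed for delta={delta}: {n_needed:.3g}")
```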

16 As an application, we consider forest density estimation [15], which, in a d-dimensional problem, requires estimating d(d+1)/2 mutual informations in order to apply the Chow-Liu algorithm. [sent-27, score-0.619]

17 As long as $\log d / n \to 0$ as $n \to \infty$, we can estimate the forest graph well, even if $d = d(n)$ increases with $n$ exponentially fast. [sent-28, score-0.304]

18 The assumptions and estimator are given in Section 2. [sent-30, score-0.128]

19 In Section 4 we show how to apply the result to forest density estimation. [sent-32, score-0.366]

20 2 Estimator and Main Result Let $X = (X_1, X_2) \in \mathbb{R}^2$ be a random vector with density $p(x) := p(x_1, x_2)$, and let $x_1, \ldots, x_n \in \mathbb{R}^2$ denote the observed sample. [sent-34, score-0.168]

21 We want to estimate the Shannon entropy $H(p) = -\int_{\mathcal{X}} p(x) \log p(x)\, dx$ (2.1). [sent-39, score-0.142]

22 We start with some assumptions on the density function $p(x_1, x_2)$. [sent-41, score-0.189]

23 We assume the density $p(x_1, x_2)$ belongs to a 2nd-order Hölder class $\Sigma_\kappa(2, L)$ and is bounded away from zero and infinity. [sent-44, score-0.241]

24 If $\{x_n\} \subset \mathcal{X}$ is any sequence converging to a boundary point $x^*$, we require the density $p(x)$ to have vanishing first-order partial derivatives: $\lim_{n\to\infty} \partial p(x_n)/\partial x_1 = \lim_{n\to\infty} \partial p(x_n)/\partial x_2 = 0$. [sent-50, score-0.238]

25 Here $h$ is the bandwidth and $K(\cdot)$ is a univariate kernel function. [sent-57, score-0.14]

26 This estimator has nine terms; one corresponds to the original data in the unit square [0, 1]2 , and each of the remaining terms corresponds to reflecting the data across one of the four sides or four corners of the square. [sent-59, score-0.107]
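
A hedged sketch of such a reflection ("mirror-image") KDE follows; it uses a Gaussian kernel for simplicity (the paper's kernel and bandwidth choices may differ), and the sampling distribution is just an arbitrary example with mass near the boundary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, h = 1000, 0.1
data = rng.beta(2.0, 1.2, size=(n, 2))        # an arbitrary sample on [0,1]^2 with boundary mass

def mirror_kde(data, x, h):
    """Mirror-image (reflection) KDE on [0,1]^2: the original points plus the eight
    copies obtained by reflecting across the four sides and four corners of the square."""
    n = data.shape[0]
    reflections = lambda t: (t, -t, 2.0 - t)  # identity plus the two one-dimensional reflections
    total = np.zeros(len(x))
    for r1 in reflections(data[:, 0]):
        for r2 in reflections(data[:, 1]):    # 3 x 3 = 9 copies in total
            z1 = (x[:, 0:1] - r1[None, :]) / h
            z2 = (x[:, 1:2] - r2[None, :]) / h
            total += np.exp(-0.5 * (z1**2 + z2**2)).sum(axis=1)
    return total / (n * h**2 * 2*np.pi)        # Gaussian kernel normalization

g = np.linspace(0.0, 1.0, 101)
xx, yy = np.meshgrid(g, g, indexing="ij")
pts = np.column_stack([xx.ravel(), yy.ravel()])
vals = mirror_kde(data, pts, h)
print("integral over [0,1]^2 approx", vals.mean())  # should be close to 1 despite the boundary
```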

27 By assumption, the values of the true density lie in the interval $[\kappa_1, \kappa_2]$. [sent-64, score-0.168]

28 We propose a clipped KDE estimator, obtained by applying the truncation operator $T_{\kappa_1,\kappa_2}$ to the mirror-image KDE: $T_{\kappa_1,\kappa_2}(\hat p_h(x))$, (2.6) [sent-65, score-0.918]

29 where $T_{\kappa_1,\kappa_2}(a) = \kappa_1 \cdot I(a < \kappa_1) + a \cdot I(\kappa_1 \le a \le \kappa_2) + \kappa_2 \cdot I(a > \kappa_2)$, so that the estimated density also has this property. [sent-66, score-0.187]

30 Letting $g(u) = u \log u$, we propose the following plug-in entropy estimator: $\hat H(\hat p_h) := -\int_{\mathcal{X}} g(\hat p_h(x))\, dx = -\int_{\mathcal{X}} \hat p_h(x) \log \hat p_h(x)\, dx$, where $\hat p_h$ denotes the clipped estimator. [sent-67, score-1.841]
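
A hedged numerical sketch of the clipping and plug-in steps (illustration only: a plain, uncorrected Gaussian KDE stands in for the paper's mirror-image estimator, and the values of $\kappa_1$, $\kappa_2$, the bandwidth, and the grid are arbitrary assumptions):

```python
import numpy as np

# Clipping operator T_{kappa1,kappa2} followed by the plug-in entropy -int p log p,
# approximated by a Riemann sum over a grid on [0,1]^2.  kappa1 and kappa2 are
# assumed known, as in the paper's setting; all numbers here are illustrative.
kappa1, kappa2 = 0.2, 5.0
rng = np.random.default_rng(2)
n, h = 1000, 0.1
data = rng.uniform(0.0, 1.0, size=(n, 2))

g = np.linspace(0.0, 1.0, 81)
cell = (g[1] - g[0])**2
xx, yy = np.meshgrid(g, g, indexing="ij")

# A plain (uncorrected) Gaussian KDE on the grid, standing in for the density estimate.
z1 = (xx.ravel()[:, None] - data[None, :, 0]) / h
z2 = (yy.ravel()[:, None] - data[None, :, 1]) / h
p_hat = np.exp(-0.5 * (z1**2 + z2**2)).sum(axis=1) / (n * h**2 * 2*np.pi)

p_clipped = np.clip(p_hat, kappa1, kappa2)     # T_{kappa1,kappa2}(p_hat)
H_plugin = -np.sum(p_clipped * np.log(p_clipped)) * cell
print("plug-in entropy estimate:", H_plugin)   # the true entropy of Uniform([0,1]^2) is 0
```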

31 The clipped estimator ph requires the knowledge of κ1 and κ2 . [sent-71, score-0.918]

32 Our main technical result is the following exponential concentration inequality on H(ph ) around the population quantity H(p). [sent-73, score-0.121]

33 Under the stated assumptions, if we choose the bandwidth according to $h \asymp n^{-1/4}$, then there exists a constant $N_0$ such that for all $n > N_0$, $\sup_{p \in \Sigma_\kappa(2,L)} P\big(|\hat H(\hat p_h) - H(p)| > \epsilon\big) \le 2\exp\!\big(-\tfrac{n\epsilon^2}{36\kappa^2}\big)$. (2.8) [sent-80, score-0.152]

34 This establishes (2.8) for Shannon entropy estimation over the Hölder class. [sent-83, score-0.21]

35 The bandwidth $h \asymp n^{-1/4}$ in the above theorem is different from the usual choice for optimal bivariate density estimation, which is $h \asymp n^{-1/6}$ for the 2nd-order Hölder class. [sent-86, score-0.378]

36 By using $h \asymp n^{-1/4}$, we undersmooth the density estimate. [sent-87, score-0.168]

37 As we show in the next section, such a bandwidth choice is important for achieving the optimal rate for entropy estimation. [sent-88, score-0.157]

38 Let $I(p) := I(X_1; X_2)$ be the Shannon mutual information, and define $I(\hat p_h) := \int_{\mathcal{X}_1}\int_{\mathcal{X}_2} \hat p_h(x_1, x_2) \log \frac{\hat p_h(x_1, x_2)}{\hat p_h(x_1)\,\hat p_h(x_2)}\, dx_1\, dx_2$. (2.9) [sent-89, score-1.745]

39 The next corollary provides an exponential inequality for Shannon mutual information estimation. [sent-91, score-1.022]

40 If we choose $h \asymp n^{-1/4}$, then there exists a constant $N_1$ such that for all $n > N_1$, $\sup_{p \in \Sigma_\kappa(2,L)} P\big(|I(\hat p_h) - I(p)| > \epsilon\big) \le 6\exp\!\big(-\tfrac{n\epsilon^2}{C\kappa^2}\big)$ for a constant $C$. (2.10) [sent-95, score-0.106]

41 Inequality (2.8) also holds for estimating the univariate entropies $H(X_1)$ and $H(X_2)$. [sent-100, score-0.116]
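
The factor 6 in the corollary can be read off from the identity $I = H(X_1) + H(X_2) - H(X_1, X_2)$ together with a union bound over the three entropy estimates; the following is a schematic sketch with unspecified constants (the paper's statement fixes them), not a reproduction of the proof.

```latex
% Schematic: three entropy errors, each controlled at level eps/3, union-bounded.
\begin{align*}
|I(\hat p_h) - I(p)|
  &\le |\hat H(X_1) - H(X_1)| + |\hat H(X_2) - H(X_2)| + |\hat H(X_1, X_2) - H(X_1, X_2)|, \\
P\big(|I(\hat p_h) - I(p)| > \epsilon\big)
  &\le \sum_{j=1}^{3} P\big(\text{$j$-th entropy error} > \epsilon/3\big)
   \;\le\; 3 \cdot 2\exp\!\big(-c\, n \epsilon^2\big) \;=\; 6\exp\!\big(-c\, n \epsilon^2\big).
\end{align*}
```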

42 We use the same bandwidth $h \asymp n^{-1/4}$ to estimate the bivariate density $p(x_1, x_2)$ and the univariate densities $p(x_1)$, $p(x_2)$. [sent-104, score-0.4]

43 They consider the same problem setting as ours and also use a KDE based plug-in estimator to estimate the mutual information. [sent-106, score-0.284]

44 However, unlike our proposal, they advocate the use of different bandwidths for bivariate and univariate entropy estimation. [sent-107, score-0.246]

45 For the bivariate case they use $h_2 \asymp n^{-1/6}$; for the univariate case they use $h_1 \asymp n^{-1/5}$. [sent-108, score-0.126]

46 Such bandwidths h1 and h2 are useful for optimally estimating the density functions. [sent-109, score-0.236]

47 However, such a choice achieves a suboptimal rate in terms of mutual information estimation: $\sup_{p \in \Sigma_\kappa(2,L)} P\big(|I(\hat p_h) - I(p)| > \epsilon\big) \le c_1 \exp\big(-c_2\, n^{2/3} \epsilon^2\big)$, where $c_1$ and $c_2$ are two constants. [sent-110, score-0.17]

48 To analyze the error $|\hat H(\hat p_h) - H(p)|$, we first decompose it into a bias or approximation error term, and a “variance” or estimation error term: $|\hat H(\hat p_h) - H(p)| \le |\hat H(\hat p_h) - E\hat H(\hat p_h)| + |E\hat H(\hat p_h) - H(p)|$. [sent-114, score-0.107]

49 We are going to show the variance-type bound $\sup_{p \in \Sigma_\kappa(2,L)} P\big(|\hat H(\hat p_h) - E\hat H(\hat p_h)| > \epsilon\big) \le 2\exp\!\big(-\tfrac{n\epsilon^2}{32\kappa^2}\big)$ (3.2) [sent-116, score-0.106]

50 and the bias-type bound $\sup_{p \in \Sigma_\kappa(2,L)} |E\hat H(\hat p_h) - H(p)| \le c_1 h^2 + \frac{c_3}{nh^2}$. (3.3) [sent-117, score-0.106]
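
The bandwidth choice $h \asymp n^{-1/4}$ can be read off from the bias-type bound above by balancing its two terms; a short side calculation (using the constants as stated there, added here for clarity) is:

```latex
% Balancing c_1 h^2 against c_3/(n h^2):
%   c_1 h^2 = c_3/(n h^2)  =>  h^4 \asymp 1/n  =>  h \asymp n^{-1/4},
% and then both terms are of order n^{-1/2}, matching the C^*/sqrt(n) bound stated below.
\[
  \sup_{p \in \Sigma_\kappa(2,L)} \big| E\hat H(\hat p_h) - H(p) \big|
  \;\le\; c_1 h^2 + \frac{c_3}{n h^2}
  \;\asymp\; n^{-1/2}
  \qquad \text{when } h \asymp n^{-1/4}.
\]
```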

51 The next lemma bounds the integrated squared bias of the kernel density estimator over the support $\mathcal{X} := [0, 1]^2$. [sent-129, score-0.389]

52 There exists a constant $c > 0$ such that $\int \big(E\hat p_h(x) - p(x)\big)^2\, dx \le c h^4$. [sent-135, score-0.131]

53 We have the following decomposition: $\int_{\mathcal{X}} \big(E\hat p_h(x) - p(x)\big)^2\, dx = \int_{I} \big(E\hat p_h(x) - p(x)\big)^2\, dx + \int_{C} \big(E\hat p_h(x) - p(x)\big)^2\, dx + \int_{B} \big(E\hat p_h(x) - p(x)\big)^2\, dx = T_I + T_C + T_B$. [sent-142, score-0.262]

54 From standard results on kernel density estimation, we know that $\sup_{p \in \Sigma(2,L)} T_I \le c h^4$. [sent-143, score-0.213]

55 In the next two subsections, we bound $T_B := \int_{B} \big(E\hat p_h(x) - p(x)\big)^2\, dx$ and $T_C := \int_{C} \big(E\hat p_h(x) - p(x)\big)^2\, dx$. [sent-146, score-0.145]

56 We have $T_B = \int_{B} \big(E\hat p_h(x) - p(x)\big)^2\, dx \le c \int_{A} \big(E\hat p_h(x) - p(x)\big)^2\, dx$. [sent-148, score-0.131]

57 For $x \in A$, we have $\hat p_h(x) = \frac{1}{nh^2} \sum_{i=1}^{n} \Big[ K\big(\tfrac{x_1 - x_1^i}{h}\big) + K\big(\tfrac{x_1 + x_1^i}{h}\big) \Big] K\big(\tfrac{x_2 - x_2^i}{h}\big)$, and therefore for $x \in A$ we have $E\hat p_h(x) = \frac{1}{h^2} \int_0^1\!\!\int_0^1 \Big[ K\big(\tfrac{x_1 - t_1}{h}\big) + K\big(\tfrac{x_1 + t_1}{h}\big) \Big] K\big(\tfrac{x_2 - t_2}{h}\big)\, p(t_1, t_2)\, dt_1\, dt_2$. [sent-150, score-0.057]

58 For $x \in A$ this gives $\big|E\hat p_h(x) - p(x)\big| \le 10\big|\tfrac{\partial p(x)}{\partial x_1}\big| h + 2\big|\tfrac{\partial p(x)}{\partial x_2}\big| h + 12Lh^2 \le 12Lh^2 + 12Lh^2 = 24Lh^2$, where the last inequality follows from the fact that $\big|\tfrac{\partial p(x)}{\partial x_1}\big|, \big|\tfrac{\partial p(x)}{\partial x_2}\big| \le Lh$, by the Hölder condition and the assumption that the density $p(x)$ has vanishing partial derivatives on the boundary points. [sent-161, score-0.374]

59 We now analyze the term $T_C$: $T_C = \int_{C} \big(E\hat p_h(x) - p(x)\big)^2\, dx \le c \int_{A_1} \big(E\hat p_h(x) - p(x)\big)^2\, dx$. [sent-166, score-0.147]

60 For $x \in A_1$, we have $\hat p_h(x) = \frac{1}{nh^2} \sum_{i=1}^{n} \big[ U_{x,h}(x_1^i, x_2^i) + U_{x,h}(-x_1^i, x_2^i) + U_{x,h}(x_1^i, -x_2^i) + U_{x,h}(-x_1^i, -x_2^i) \big]$. [sent-170, score-0.898]

61 Combining the analysis of TB , TC , and TI , we show that the mirror image kernel density estimator is free of boundary bias. [sent-195, score-0.388]

62 There exists a universal constant $C^*$ that does not depend on the true density $p$, such that $\sup_{p \in \Sigma_\kappa(2,L)} \big| E\hat H(\hat p_h) - H(p) \big| \le \frac{C^*}{\sqrt{n}}$. [sent-205, score-0.274]

63 Recalling that $g(u) = u \log u$, by Taylor's theorem we have $g(\hat p_h(x)) - g(p(x)) = \big(\log p(x) + 1\big) \cdot \big(\hat p_h(x) - p(x)\big) + \frac{1}{2\xi(x)} \big(\hat p_h(x) - p(x)\big)^2$, (3.36) [sent-208, score-0.837]

64 36) 2ξ(x) where ξ(x) lies in between ph (x) and p(x). [sent-209, score-0.784]

65 Let $\kappa$ be as defined in the statement of the theorem. [sent-211, score-0.784]

66 Integrating the expansion over $\mathcal{X}$ and taking expectations gives $\big|E\hat H(\hat p_h) - H(p)\big| \le \int_{\mathcal{X}} \big|\log p(x) + 1\big| \cdot \big|E\hat p_h(x) - p(x)\big|\, dx + \int_{\mathcal{X}} \frac{1}{2\xi(x)}\, E\big(\hat p_h(x) - p(x)\big)^2\, dx \le \kappa \int_{\mathcal{X}} \big|E\hat p_h(x) - p(x)\big|\, dx + \frac{1}{2\kappa_1} \int_{\mathcal{X}} E\big(\hat p_h(x) - p(x)\big)^2\, dx$. [sent-215, score-0.057]

67 The last inequality follows from standard results on kernel density estimation and the lemma above. [sent-220, score-0.329]

68 We have $\sup_{p \in \Sigma_\kappa(2,L)} P\big(|\hat H(\hat p_h) - E\hat H(\hat p_h)| > \epsilon\big) \le 2\exp\!\big(-\tfrac{n\epsilon^2}{32\kappa^2}\big)$. [sent-229, score-0.106]

69 Let $\hat p_h(x)$ be the kernel density estimator defined above. [sent-232, score-1.104]

70 Using the mean-value theorem and the fact that $T_{\kappa_1,\kappa_2}(\cdot)$ is a contraction, we have (writing $\hat p_h'$ for the estimator computed after a single data point $x^j$ is replaced by $(x^j)'$) [sent-237, score-0.135]

71 $\sup_{x_1, \ldots, x_n, (x^j)'} \big|\hat H(\hat p_h) - \hat H(\hat p_h')\big| \le \kappa \sup_{x_1, \ldots, x_n, (x^j)'} \int_{\mathcal{X}} \big| T_{\kappa_1,\kappa_2}[\hat p_h(x)] - T_{\kappa_1,\kappa_2}[\hat p_h'(x)] \big|\, dx$ [sent-241, score-0.474]

72 $\le \kappa \sup_{x_1, \ldots, x_n, (x^j)'} \int_{\mathcal{X}} \frac{1}{nh^2} \Big| K_2\big(\tfrac{x^j - x}{h}\big) - K_2\big(\tfrac{(x^j)' - x}{h}\big) \Big|\, dx \le \frac{8\kappa}{n}$. [sent-256, score-0.202]

73 Therefore, using McDiarmid's inequality [16], we get the desired inequality. [sent-260, score-0.134]
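
For reference, here is a consistency check added by the editor (not quoted from the paper): McDiarmid's inequality for a function with bounded coordinate-wise differences, applied with the difference bound $8\kappa/n$ obtained above, recovers exactly the $32\kappa^2$ constant in the variance-type bound.

```latex
% McDiarmid: if |f(x_1,...,x_j,...,x_n) - f(x_1,...,x_j',...,x_n)| <= c_j for all j, then
%   P(|f - E f| > eps) <= 2 exp( -2 eps^2 / sum_j c_j^2 ).
% With f = \hat H(\hat p_h) and c_j = 8\kappa/n for every j:
\[
  P\big(|\hat H(\hat p_h) - E\hat H(\hat p_h)| > \epsilon\big)
  \;\le\; 2\exp\!\Big(\frac{-2\epsilon^2}{n\,(8\kappa/n)^2}\Big)
  \;=\; 2\exp\!\Big(-\frac{n\epsilon^2}{32\kappa^2}\Big).
\]
```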

74 The uniformity result holds since the constant does not depend on the true density p. [sent-262, score-0.168]

75 4 Application to Forest Density Estimation We apply the concentration inequality (2.10) [sent-263, score-0.258]

76 to analyze an algorithm for learning high-dimensional forest graph models [15]. [sent-264, score-0.256]

77 In a forest density estimation problem, we observe $n$ data points $x_1, \ldots, x_n$. [sent-265, score-0.424]

78 [15] show that the graph estimation problem can be recast as the problem of finding the maximum weight spanning forest for a weighted graph, where the weight of the edge connecting nodes j and k is I(Xj ; Xk ), the mutual information between these two variables. [sent-271, score-0.529]

79 The forest graph can be obtained by the Chow-Liu algorithm [3, 13], which is an iterative algorithm. [sent-274, score-0.256]

80 At each iteration the algorithm adds an edge connecting that pair of variables with maximum mutual information among all pairs not yet visited by the algorithm, if doing so does not form a cycle. [sent-275, score-0.19]
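
A hedged sketch of this greedy step follows (a Kruskal-style maximum-weight spanning forest with a union-find cycle check; the pairwise mutual-information weights are random placeholders here, whereas in the method they would come from the plug-in estimator $I(\hat p_h)$ above, and the stopping rule is an illustrative choice).

```python
import numpy as np

def max_weight_spanning_forest(weights, max_edges=None):
    """Greedy Chow-Liu-style selection: repeatedly add the highest-weight edge that
    does not create a cycle, optionally stopping after max_edges edges."""
    d = weights.shape[0]
    parent = list(range(d))

    def find(a):                              # union-find with path halving
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    edges = [(weights[j, k], j, k) for j in range(d) for k in range(j + 1, d)]
    edges.sort(reverse=True)                  # visit pairs in decreasing weight
    forest = []
    for w, j, k in edges:
        rj, rk = find(j), find(k)
        if rj != rk:                          # adding (j, k) does not form a cycle
            parent[rj] = rk
            forest.append((j, k))
            if max_edges is not None and len(forest) >= max_edges:
                break
    return forest

# Random symmetric placeholder weights standing in for the estimated pairwise MI.
rng = np.random.default_rng(3)
d = 6
mi = rng.random((d, d))
mi = (mi + mi.T) / 2.0
np.fill_diagonal(mi, 0.0)
print(max_weight_spanning_forest(mi, max_edges=4))
```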

81 Once a forest graph $F = (V, E)$ is estimated, we propose to estimate the forest density as $\hat p_F(x) = \prod_{(j,k) \in E} \frac{\hat p_{h_2}(x_j, x_k)}{\hat p_{h_2}(x_j)\, \hat p_{h_2}(x_k)} \cdot \prod_{\ell \in V \setminus U} \hat p_{h_2}(x_\ell) \cdot \prod_{u \in U} \hat p_{h_1}(x_u)$, (4.1) [sent-277, score-0.698]

82 where $U$ is the set of isolated vertices in the estimated forest $F$. [sent-278, score-0.237]
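
As a sketch of how the resulting estimate factorizes, the snippet below evaluates the log of a forest-factorized density; the fitted bivariate and univariate KDEs are replaced by standard-normal stand-ins, which are assumptions of this illustration rather than the paper's estimators (in the method, isolated vertices would use bandwidth $h_1$ and connected vertices $h_2$).

```python
import numpy as np

# Evaluate the log of a forest-factorized density estimate
#   p_F(x) = prod_{(j,k) in E} p2(x_j, x_k) / (p1(x_j) p1(x_k))
#            * prod_{connected l} p1(x_l) * prod_{isolated u} p1(x_u).
# Standard-normal densities stand in for the fitted KDEs.
log_p1 = lambda v: -0.5 * v**2 - 0.5 * np.log(2 * np.pi)   # univariate stand-in
log_p2 = lambda a, b: log_p1(a) + log_p1(b)                # independent bivariate stand-in

def log_forest_density(x, edges, d):
    connected = {v for e in edges for v in e}
    isolated = set(range(d)) - connected
    out = sum(log_p2(x[j], x[k]) - log_p1(x[j]) - log_p1(x[k]) for j, k in edges)
    out += sum(log_p1(x[v]) for v in connected)
    out += sum(log_p1(x[u]) for u in isolated)
    return out

x = np.zeros(5)
print(log_forest_density(x, edges=[(0, 1), (1, 2)], d=5))   # equals -5 * 0.5*log(2*pi) here
```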

83 Our estimator is different from the estimator proposed by [15]—once the graph F is given, we treat the isolated variables differently than the connected variables. [sent-279, score-0.292]

84 Such a choice leads to minimax optimal forest density estimation, while the rate obtained in [15] is suboptimal. [sent-281, score-0.407]

85 Let $\mathcal{F}_d^{\,s}$ denote the set of forest graphs with $d$ nodes and no more than $s$ edges. [sent-282, score-0.198]

86 We define the $s$-oracle forest $F_s^* := (V, E^*)$ and its corresponding oracle density estimator $p_{F^*}$ to be $F_s^* = \arg\min_{F \in \mathcal{F}_d^{\,s}} D(p \,\|\, p_F)$ and $p_{F^*} := \prod_{(j,k) \in E^*} \frac{p(x_j, x_k)}{p(x_j)\, p(x_k)} \cdot \prod_{\ell \in V} p(x_\ell)$. [sent-284, score-0.525]

87 We define a density class $\mathcal{P}_\kappa$ as $\mathcal{P}_\kappa := \{ p : p$ is a $d$-dimensional density with $p(x_j, x_k) \in \Sigma_\kappa(2, L)$ for any $j \ne k \}$. [sent-288, score-0.388]

88 The next two theorems show that the above forest density estimation procedure is minimax optimal for both graph recovery and density estimation. [sent-290, score-0.692]

89 Let F be the estimated s-edge forest graph using the Chow-Liu algorithm. [sent-294, score-0.275]

90 Under the same conditions as Theorem 12 in [15], if we choose $h \asymp n^{-1/4}$ for the mutual information estimator defined in Section 2. [sent-295, score-0.26]

91 Once the estimated forest graph has been obtained, we calculate the density estimator. [sent-301, score-0.275]

92 Then, $\sup_{p \in \mathcal{P}_\kappa} E \int_{\mathcal{X}} \big| \hat p_F(x) - p_{F^*}(x) \big|\, dx \le C \cdot \Big( \frac{s}{n^{2/3}} + \frac{d - s}{n^{4/5}} \Big)$. [sent-303, score-0.237]

93 The term $s n^{-2/3}$ corresponds to the price paid to estimate the bivariate densities, while the term $(d - s) n^{-4/5}$ corresponds to the price paid to estimate the univariate densities. [sent-310, score-0.29]

94 In this way, we see that the exponential concentration inequality for Shannon mutual information leads to a significantly improved theoretical analysis of forest density estimation, in terms of both graph estimation and density estimation. [sent-311, score-0.924]

95 A nonparametric estimation of the entropy for absolutely continuous distributions (Corresp.). [sent-315, score-0.177]

96 Best asymptotic normality of the kernel density entropy estimator for smooth densities. [sent-335, score-0.439]

97 Estimation of entropy and other functionals of a multivariate density. [sent-362, score-0.127]

98 Relative performance of mutual information estimation methods for quantifying the dependence among short and noisy data. [sent-376, score-0.211]

99 Estimation of Rényi entropy and mutual information based on generalized nearest-neighbor graphs. [sent-410, score-0.247]

100 Estimating functionals related to a density by a class of statistics based on spacings. [sent-444, score-0.201]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ph', 0.784), ('eph', 0.315), ('forest', 0.198), ('density', 0.168), ('mutual', 0.153), ('dx', 0.131), ('shannon', 0.113), ('estimator', 0.107), ('sup', 0.106), ('tc', 0.099), ('entropy', 0.094), ('eh', 0.09), ('tb', 0.081), ('bivariate', 0.077), ('pf', 0.074), ('xj', 0.071), ('larry', 0.06), ('lder', 0.058), ('graph', 0.058), ('estimation', 0.058), ('inequality', 0.058), ('xi', 0.057), ('xk', 0.052), ('kde', 0.05), ('univariate', 0.049), ('bias', 0.049), ('boundary', 0.049), ('bandwidth', 0.046), ('kernel', 0.045), ('entropies', 0.04), ('concentration', 0.036), ('fs', 0.036), ('densities', 0.036), ('functionals', 0.033), ('czos', 0.033), ('edgeworth', 0.03), ('lafferty', 0.029), ('analyzing', 0.029), ('theorem', 0.029), ('han', 0.028), ('exponential', 0.027), ('clipped', 0.027), ('estimating', 0.027), ('paid', 0.026), ('bandwidths', 0.026), ('spanning', 0.025), ('nonparametric', 0.025), ('normality', 0.025), ('du', 0.025), ('estimate', 0.024), ('minimax', 0.024), ('log', 0.024), ('edge', 0.023), ('ti', 0.023), ('fd', 0.023), ('xn', 0.021), ('vanishing', 0.021), ('supp', 0.021), ('assumptions', 0.021), ('isolated', 0.02), ('lemma', 0.02), ('assumption', 0.02), ('liu', 0.02), ('union', 0.019), ('estimated', 0.019), ('mirror', 0.019), ('variance', 0.019), ('desired', 0.018), ('recovery', 0.018), ('estimators', 0.018), ('rate', 0.017), ('term', 0.016), ('ibrahim', 0.016), ('saigal', 0.016), ('ven', 0.016), ('righthand', 0.016), ('ahmad', 0.016), ('beirlant', 0.016), ('dudewicz', 0.016), ('eggermont', 0.016), ('bandyopadhyay', 0.016), ('harry', 0.016), ('kraskov', 0.016), ('nonparametrically', 0.016), ('universite', 0.016), ('price', 0.016), ('parametric', 0.016), ('john', 0.016), ('discussions', 0.016), ('notational', 0.015), ('bounded', 0.015), ('optimally', 0.015), ('joy', 0.015), ('informations', 0.015), ('harald', 0.015), ('lh', 0.015), ('area', 0.015), ('bound', 0.014), ('connecting', 0.014), ('van', 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 123 nips-2012-Exponential Concentration for Mutual Information Estimation with Application to Forests

Author: Han Liu, Larry Wasserman, John D. Lafferty

Abstract: We prove a new exponential concentration inequality for a plug-in estimator of the Shannon mutual information. Previous results on mutual information estimation only bounded expected error. The advantage of having the exponential inequality is that, combined with the union bound, we can guarantee accurate estimators of the mutual information for many pairs of random variables simultaneously. As an application, we show how to use such a result to optimally estimate the density function and graph of a distribution which is Markov to a forest graph. 1

2 0.14622642 117 nips-2012-Ensemble weighted kernel estimators for multivariate entropy estimation

Author: Kumar Sricharan, Alfred O. Hero

Abstract: The problem of estimation of entropy functionals of probability densities has received much attention in the information theory, machine learning and statistics communities. Kernel density plug-in estimators are simple, easy to implement and widely used for estimation of entropy. However, for large feature dimension d, kernel plug-in estimators suffer from the curse of dimensionality: the MSE rate of convergence is glacially slow - of order O(T −γ/d ), where T is the number of samples, and γ > 0 is a rate parameter. In this paper, it is shown that for sufficiently smooth densities, an ensemble of kernel plug-in estimators can be combined via a weighted convex combination, such that the resulting weighted estimator has a superior parametric MSE rate of convergence of order O(T −1 ). Furthermore, it is shown that these optimal weights can be determined by solving a convex optimization problem which does not require training data or knowledge of the underlying density, and therefore can be performed offline. This novel result is remarkable in that, while each of the individual kernel plug-in estimators belonging to the ensemble suffer from the curse of dimensionality, by appropriate ensemble averaging we can achieve parametric convergence rates. 1

3 0.10745091 119 nips-2012-Entropy Estimations Using Correlated Symmetric Stable Random Projections

Author: Ping Li, Cun-hui Zhang

Abstract: Methods for efficiently estimating Shannon entropy of data streams have important applications in learning, data mining, and network anomaly detections (e.g., the DDoS attacks). For nonnegative data streams, the method of Compressed Counting (CC) [11, 13] based on maximally-skewed stable random projections can provide accurate estimates of the Shannon entropy using small storage. However, CC is no longer applicable when entries of data streams can be below zero, which is a common scenario when comparing two streams. In this paper, we propose an algorithm for entropy estimation in general data streams which allow negative entries. In our method, the Shannon entropy is approximated by the finite difference of two correlated frequency moments estimated from correlated samples of symmetric stable random variables. Interestingly, the estimator for the moment we recommend for entropy estimation barely has bounded variance itself, whereas the common geometric mean estimator (which has bounded higher-order moments) is not sufficient for entropy estimation. Our experiments confirm that this method is able to well approximate the Shannon entropy using small storage.

4 0.088645861 298 nips-2012-Scalable Inference of Overlapping Communities

Author: Prem Gopalan, Sean Gerrish, Michael Freedman, David M. Blei, David M. Mimno

Abstract: We develop a scalable algorithm for posterior inference of overlapping communities in large networks. Our algorithm is based on stochastic variational inference in the mixed-membership stochastic blockmodel (MMSB). It naturally interleaves subsampling the network with estimating its community structure. We apply our algorithm on ten large, real-world networks with up to 60,000 nodes. It converges several orders of magnitude faster than the state-of-the-art algorithm for MMSB, finds hundreds of communities in large real-world networks, and detects the true communities in 280 benchmark networks with equal or better accuracy compared to other scalable algorithms. 1

5 0.071555123 81 nips-2012-Context-Sensitive Decision Forests for Object Detection

Author: Peter Kontschieder, Samuel R. Bulò, Antonio Criminisi, Pushmeet Kohli, Marcello Pelillo, Horst Bischof

Abstract: In this paper we introduce Context-Sensitive Decision Forests - A new perspective to exploit contextual information in the popular decision forest framework for the object detection problem. They are tree-structured classifiers with the ability to access intermediate prediction (here: classification and regression) information during training and inference time. This intermediate prediction is available for each sample and allows us to develop context-based decision criteria, used for refining the prediction process. In addition, we introduce a novel split criterion which in combination with a priority based way of constructing the trees, allows more accurate regression mode selection and hence improves the current context information. In our experiments, we demonstrate improved results for the task of pedestrian detection on the challenging TUD data set when compared to state-of-the-art methods. 1

6 0.064718321 96 nips-2012-Density Propagation and Improved Bounds on the Partition Function

7 0.06337864 95 nips-2012-Density-Difference Estimation

8 0.061985269 232 nips-2012-Multiplicative Forests for Continuous-Time Processes

9 0.061803002 269 nips-2012-Persistent Homology for Learning Densities with Bounded Support

10 0.060469139 57 nips-2012-Bayesian estimation of discrete entropy with mixtures of stick-breaking priors

11 0.059525538 261 nips-2012-Online allocation and homogeneous partitioning for piecewise constant mean-approximation

12 0.055001825 294 nips-2012-Repulsive Mixtures

13 0.052215472 16 nips-2012-A Polynomial-time Form of Robust Regression

14 0.051381316 300 nips-2012-Scalable nonconvex inexact proximal splitting

15 0.051120404 326 nips-2012-Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses

16 0.050295427 335 nips-2012-The Bethe Partition Function of Log-supermodular Graphical Models

17 0.049571007 275 nips-2012-Privacy Aware Learning

18 0.049232204 262 nips-2012-Optimal Neural Tuning Curves for Arbitrary Stimulus Distributions: Discrimax, Infomax and Minimum $L p$ Loss

19 0.047506016 264 nips-2012-Optimal kernel choice for large-scale two-sample tests

20 0.047155961 248 nips-2012-Nonparanormal Belief Propagation (NPNBP)


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.119), (1, 0.029), (2, 0.062), (3, -0.038), (4, 0.004), (5, 0.029), (6, 0.002), (7, 0.006), (8, -0.044), (9, -0.022), (10, -0.041), (11, 0.002), (12, 0.033), (13, -0.133), (14, -0.049), (15, -0.116), (16, 0.103), (17, -0.029), (18, -0.049), (19, 0.026), (20, 0.014), (21, -0.032), (22, 0.135), (23, -0.028), (24, -0.024), (25, -0.007), (26, 0.048), (27, 0.024), (28, 0.128), (29, -0.065), (30, 0.089), (31, -0.095), (32, -0.007), (33, 0.026), (34, -0.027), (35, -0.005), (36, 0.076), (37, 0.017), (38, -0.081), (39, 0.023), (40, -0.002), (41, -0.028), (42, 0.019), (43, 0.005), (44, 0.052), (45, 0.01), (46, 0.011), (47, -0.039), (48, 0.039), (49, -0.075)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93866879 123 nips-2012-Exponential Concentration for Mutual Information Estimation with Application to Forests

Author: Han Liu, Larry Wasserman, John D. Lafferty

Abstract: We prove a new exponential concentration inequality for a plug-in estimator of the Shannon mutual information. Previous results on mutual information estimation only bounded expected error. The advantage of having the exponential inequality is that, combined with the union bound, we can guarantee accurate estimators of the mutual information for many pairs of random variables simultaneously. As an application, we show how to use such a result to optimally estimate the density function and graph of a distribution which is Markov to a forest graph. 1

2 0.82672 117 nips-2012-Ensemble weighted kernel estimators for multivariate entropy estimation

Author: Kumar Sricharan, Alfred O. Hero

Abstract: The problem of estimation of entropy functionals of probability densities has received much attention in the information theory, machine learning and statistics communities. Kernel density plug-in estimators are simple, easy to implement and widely used for estimation of entropy. However, for large feature dimension d, kernel plug-in estimators suffer from the curse of dimensionality: the MSE rate of convergence is glacially slow - of order O(T −γ/d ), where T is the number of samples, and γ > 0 is a rate parameter. In this paper, it is shown that for sufficiently smooth densities, an ensemble of kernel plug-in estimators can be combined via a weighted convex combination, such that the resulting weighted estimator has a superior parametric MSE rate of convergence of order O(T −1 ). Furthermore, it is shown that these optimal weights can be determined by solving a convex optimization problem which does not require training data or knowledge of the underlying density, and therefore can be performed offline. This novel result is remarkable in that, while each of the individual kernel plug-in estimators belonging to the ensemble suffer from the curse of dimensionality, by appropriate ensemble averaging we can achieve parametric convergence rates. 1

3 0.78155553 119 nips-2012-Entropy Estimations Using Correlated Symmetric Stable Random Projections

Author: Ping Li, Cun-hui Zhang

Abstract: Methods for efficiently estimating Shannon entropy of data streams have important applications in learning, data mining, and network anomaly detections (e.g., the DDoS attacks). For nonnegative data streams, the method of Compressed Counting (CC) [11, 13] based on maximally-skewed stable random projections can provide accurate estimates of the Shannon entropy using small storage. However, CC is no longer applicable when entries of data streams can be below zero, which is a common scenario when comparing two streams. In this paper, we propose an algorithm for entropy estimation in general data streams which allow negative entries. In our method, the Shannon entropy is approximated by the finite difference of two correlated frequency moments estimated from correlated samples of symmetric stable random variables. Interestingly, the estimator for the moment we recommend for entropy estimation barely has bounded variance itself, whereas the common geometric mean estimator (which has bounded higher-order moments) is not sufficient for entropy estimation. Our experiments confirm that this method is able to well approximate the Shannon entropy using small storage.

4 0.68234688 95 nips-2012-Density-Difference Estimation

Author: Masashi Sugiyama, Takafumi Kanamori, Taiji Suzuki, Marthinus D. Plessis, Song Liu, Ichiro Takeuchi

Abstract: We address the problem of estimating the difference between two probability densities. A naive approach is a two-step procedure of first estimating two densities separately and then computing their difference. However, such a two-step procedure does not necessarily work well because the first step is performed without regard to the second step and thus a small estimation error incurred in the first stage can cause a big error in the second stage. In this paper, we propose a single-shot procedure for directly estimating the density difference without separately estimating two densities. We derive a non-parametric finite-sample error bound for the proposed single-shot density-difference estimator and show that it achieves the optimal convergence rate. We then show how the proposed density-difference estimator can be utilized in L2-distance approximation. Finally, we experimentally demonstrate the usefulness of the proposed method in robust distribution comparison such as class-prior estimation and change-point detection.
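One way to realize a single-shot estimator of this kind is a regularized least-squares fit of the difference f ≈ p - q in a Gaussian kernel basis, which admits a closed-form solution. The sketch below follows that recipe; the basis centers, kernel width, and ridge parameter are chosen arbitrarily here rather than by the paper's model-selection procedure.

```python
import numpy as np

def lsdd_fit(xp, xq, centers, sigma, lam=1e-3):
    """Fit f(x) = sum_l theta_l * exp(-||x - c_l||^2 / (2 sigma^2)) ~ p(x) - q(x).

    Minimizing  int f^2 dx - 2 int f (p - q) dx + lam ||theta||^2  gives
    theta = (U + lam I)^{-1} (h_p - h_q), where
    U_{ll'} = (pi sigma^2)^{d/2} exp(-||c_l - c_l'||^2 / (4 sigma^2))
    is the Gaussian-Gaussian convolution and h_p, h_q are empirical
    kernel means over the two samples.
    """
    d = centers.shape[1]
    sqd_cc = np.sum((centers[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    U = (np.pi * sigma ** 2) ** (d / 2.0) * np.exp(-sqd_cc / (4.0 * sigma ** 2))

    def kernel_mean(x):
        sqd = np.sum((x[:, None, :] - centers[None, :, :]) ** 2, axis=2)
        return np.exp(-sqd / (2.0 * sigma ** 2)).mean(axis=0)

    h = kernel_mean(xp) - kernel_mean(xq)
    theta = np.linalg.solve(U + lam * np.eye(len(centers)), h)

    def f(x):
        sqd = np.sum((x[:, None, :] - centers[None, :, :]) ** 2, axis=2)
        return np.exp(-sqd / (2.0 * sigma ** 2)) @ theta

    return f, theta

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    xp = rng.normal(0.0, 1.0, size=(500, 1))    # samples from p
    xq = rng.normal(0.5, 1.0, size=(500, 1))    # samples from q
    centers = np.linspace(-4, 4, 40)[:, None]
    f, _ = lsdd_fit(xp, xq, centers, sigma=0.5)
    print(f(np.array([[0.0], [0.5], [2.0]])))   # estimated p - q at test points
```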

5 0.60310972 57 nips-2012-Bayesian estimation of discrete entropy with mixtures of stick-breaking priors

Author: Evan Archer, Il M. Park, Jonathan W. Pillow

Abstract: We consider the problem of estimating Shannon’s entropy H in the under-sampled regime, where the number of possible symbols may be unknown or countably infinite. Dirichlet and Pitman-Yor processes provide tractable prior distributions over the space of countably infinite discrete distributions, and have found major applications in Bayesian non-parametric statistics and machine learning. Here we show that they provide natural priors for Bayesian entropy estimation, due to the analytic tractability of the moments of the induced posterior distribution over entropy H. We derive formulas for the posterior mean and variance of H given data. However, we show that a fixed Dirichlet or Pitman-Yor process prior implies a narrow prior on H, meaning the prior strongly determines the estimate in the under-sampled regime. We therefore define a family of continuous mixing measures such that the resulting mixture of Dirichlet or Pitman-Yor processes produces an approximately flat prior over H. We explore the theoretical properties of the resulting estimators and show that they perform well on data sampled from both exponential and power-law tailed distributions. 1
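The analytic tractability referred to above can be seen in the fixed symmetric Dirichlet case: given counts n_1, ..., n_A with total N and a Dirichlet(α) prior, the posterior mean of H is ψ(N + Aα + 1) - Σ_i (n_i + α)/(N + Aα) ψ(n_i + α + 1), the classical Wolpert-Wolf formula. The sketch below implements only this building block; the paper's mixture over α, which flattens the induced prior on H, is not reproduced.

```python
import numpy as np
from scipy.special import digamma

def dirichlet_entropy_posterior_mean(counts, alpha, num_symbols=None):
    """Posterior mean of Shannon entropy H (nats) under a symmetric Dirichlet(alpha) prior.

    With counts n_1..n_A (total N), the posterior is Dirichlet(n_i + alpha) and
    E[H | data] = psi(N + A*alpha + 1)
                  - sum_i (n_i + alpha)/(N + A*alpha) * psi(n_i + alpha + 1).
    The paper replaces the single fixed alpha with a mixture over alpha so that
    the induced prior on H is approximately flat; that mixing step is omitted.
    """
    counts = np.asarray(counts, dtype=float)
    A = num_symbols if num_symbols is not None else len(counts)
    if A > len(counts):                       # pad unobserved symbols with zero counts
        counts = np.concatenate([counts, np.zeros(A - len(counts))])
    post = counts + alpha
    total = post.sum()
    return float(digamma(total + 1.0) - np.sum(post / total * digamma(post + 1.0)))

if __name__ == "__main__":
    counts = [30, 12, 5, 2, 1, 0, 0, 0]        # under-sampled counts over 8 symbols
    for alpha in (0.01, 0.1, 1.0):
        print(alpha, dirichlet_entropy_posterior_mean(counts, alpha))
    # The strong dependence on alpha is exactly the issue the mixture prior addresses.
```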

6 0.5721125 175 nips-2012-Learning High-Density Regions for a Generalized Kolmogorov-Smirnov Test in High-Dimensional Data

7 0.54749078 269 nips-2012-Persistent Homology for Learning Densities with Bounded Support

8 0.5262 222 nips-2012-Multi-Task Averaging

9 0.46055338 223 nips-2012-Multi-criteria Anomaly Detection using Pareto Depth Analysis

10 0.42532986 96 nips-2012-Density Propagation and Improved Bounds on the Partition Function

11 0.41537493 43 nips-2012-Approximate Message Passing with Consistent Parameter Estimation and Applications to Sparse Learning

12 0.4095405 264 nips-2012-Optimal kernel choice for large-scale two-sample tests

13 0.39636663 335 nips-2012-The Bethe Partition Function of Log-supermodular Graphical Models

14 0.39156201 145 nips-2012-Gradient Weights help Nonparametric Regressors

15 0.38651022 343 nips-2012-Tight Bounds on Profile Redundancy and Distinguishability

16 0.38312051 340 nips-2012-The representer theorem for Hilbert spaces: a necessary and sufficient condition

17 0.38145584 76 nips-2012-Communication-Efficient Algorithms for Statistical Optimization

18 0.37775701 317 nips-2012-Smooth-projected Neighborhood Pursuit for High-dimensional Nonparanormal Graph Estimation

19 0.36859715 364 nips-2012-Weighted Likelihood Policy Search with Model Selection

20 0.36709353 189 nips-2012-Learning from the Wisdom of Crowds by Minimax Entropy


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.016), (21, 0.336), (38, 0.118), (39, 0.032), (42, 0.021), (54, 0.011), (55, 0.032), (74, 0.049), (76, 0.169), (80, 0.057), (92, 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.91735613 195 nips-2012-Learning visual motion in recurrent neural networks

Author: Marius Pachitariu, Maneesh Sahani

Abstract: We present a dynamic nonlinear generative model for visual motion based on a latent representation of binary-gated Gaussian variables. Trained on sequences of images, the model learns to represent different movement directions in different variables. We use an online approximate inference scheme that can be mapped to the dynamics of networks of neurons. Probed with drifting grating stimuli and moving bars of light, neurons in the model show patterns of responses analogous to those of direction-selective simple cells in primary visual cortex. Most model neurons also show speed tuning and respond equally well to a range of motion directions and speeds aligned to the constraint line of their respective preferred speed. We show how these computations are enabled by a specific pattern of recurrent connections learned by the model. 1

2 0.90721172 1 nips-2012-3D Object Detection and Viewpoint Estimation with a Deformable 3D Cuboid Model

Author: Sanja Fidler, Sven Dickinson, Raquel Urtasun

Abstract: This paper addresses the problem of category-level 3D object detection. Given a monocular image, our aim is to localize the objects in 3D by enclosing them with tight oriented 3D bounding boxes. We propose a novel approach that extends the well-acclaimed deformable part-based model [1] to reason in 3D. Our model represents an object class as a deformable 3D cuboid composed of faces and parts, which are both allowed to deform with respect to their anchors on the 3D box. We model the appearance of each face in fronto-parallel coordinates, thus effectively factoring out the appearance variation induced by viewpoint. Our model reasons about face visibility patterns called aspects. We train the cuboid model jointly and discriminatively and share weights across all aspects to attain efficiency. Inference then entails sliding and rotating the box in 3D and scoring object hypotheses. While for inference we discretize the search space, the variables are continuous in our model. We demonstrate the effectiveness of our approach in indoor and outdoor scenarios, and show that our approach significantly outperforms the state-of-the-art in both 2D [1] and 3D object detection [2]. 1

3 0.84826887 300 nips-2012-Scalable nonconvex inexact proximal splitting

Author: Suvrit Sra

Abstract: We study a class of large-scale, nonsmooth, and nonconvex optimization problems. In particular, we focus on nonconvex problems with composite objectives. This class includes the extensively studied class of convex composite objective problems as a subclass. To solve composite nonconvex problems we introduce a powerful new framework based on asymptotically nonvanishing errors, avoiding the common stronger assumption of vanishing errors. Within our new framework we derive both batch and incremental proximal splitting algorithms. To our knowledge, our work is the first to develop and analyze incremental nonconvex proximal-splitting algorithms, even if we were to disregard the ability to handle nonvanishing errors. We illustrate one instance of our general framework by showing an application to large-scale nonsmooth matrix factorization. 1
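As a generic illustration of the setting (not the paper's batch or incremental algorithms), the sketch below runs proximal-gradient iterations on a composite objective f(x) + λ||x||_1 in which every gradient evaluation is corrupted by an error whose magnitude does not vanish, the kind of inexactness the framework is designed to tolerate. The step size, penalty, and noise level are arbitrary illustrative choices.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def inexact_proximal_splitting(grad_f, x0, step, lam, noise_level, iters, rng):
    """Inexact proximal-gradient iterations for  f(x) + lam * ||x||_1.

    Only the gradient of f is used (f may be nonconvex); each gradient
    evaluation is corrupted by an error whose norm need not vanish,
    mimicking the asymptotically nonvanishing-error regime of the paper.
    """
    x = x0.copy()
    for _ in range(iters):
        e = noise_level * rng.normal(size=x.shape)      # nonvanishing error
        x = soft_threshold(x - step * (grad_f(x) + e), step * lam)
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    A = rng.normal(size=(50, 100))
    b = rng.normal(size=50)
    # Smooth part: least squares (convex here; the analysis allows nonconvex f).
    grad_f = lambda x: A.T @ (A @ x - b)
    x_hat = inexact_proximal_splitting(grad_f, np.zeros(100), step=1e-3,
                                       lam=0.1, noise_level=0.5,
                                       iters=2000, rng=rng)
    print(np.count_nonzero(x_hat), "nonzero coordinates")
```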

4 0.84086376 60 nips-2012-Bayesian nonparametric models for ranked data

Author: Francois Caron, Yee W. Teh

Abstract: We develop a Bayesian nonparametric extension of the popular Plackett-Luce choice model that can handle an infinite number of choice items. Our framework is based on the theory of random atomic measures, with the prior specified by a gamma process. We derive a posterior characterization and a simple and effective Gibbs sampler for posterior simulation. We develop a time-varying extension of our model, and apply it to the New York Times lists of weekly bestselling books. 1
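For background, the finite Plackett-Luce model being extended assigns a ranking the probability obtained by choosing items one at a time with probability proportional to their positive weights. The sketch below computes this likelihood and samples rankings; the gamma-process prior over infinitely many items, the posterior characterization, and the Gibbs sampler are not reproduced.

```python
import numpy as np

def plackett_luce_log_likelihood(ranking, weights):
    """Log-probability of a ranking (best to worst, as item indices) under Plackett-Luce.

    P(ranking) = prod_j  w[ranking[j]] / sum_{k >= j} w[ranking[k]],
    i.e. items are chosen one at a time in proportion to their weights.
    """
    w = np.asarray(weights, dtype=float)[list(ranking)]
    remaining = np.cumsum(w[::-1])[::-1]        # suffix sums of not-yet-chosen weights
    return float(np.sum(np.log(w) - np.log(remaining)))

def sample_ranking(weights, rng):
    """Sample a full ranking by sequential weighted choice without replacement."""
    w = np.asarray(weights, dtype=float)
    items = list(range(len(w)))
    ranking = []
    while items:
        probs = w[items] / w[items].sum()
        pick = rng.choice(len(items), p=probs)
        ranking.append(items.pop(pick))
    return ranking

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    weights = [5.0, 3.0, 1.0, 0.5]              # higher weight -> tends to rank higher
    r = sample_ranking(weights, rng)
    print(r, plackett_luce_log_likelihood(r, weights))
```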

5 0.81164449 105 nips-2012-Dynamic Pruning of Factor Graphs for Maximum Marginal Prediction

Author: Christoph H. Lampert

Abstract: We study the problem of maximum marginal prediction (MMP) in probabilistic graphical models, a task that occurs, for example, as the Bayes optimal decision rule under a Hamming loss. MMP is typically performed as a two-stage procedure: one estimates each variable’s marginal probability and then forms a prediction from the states of maximal probability. In this work we propose a simple yet effective technique for accelerating MMP when inference is sampling-based: instead of the above two-stage procedure we directly estimate the posterior probability of each decision variable. This allows us to identify the point of time when we are sufficiently certain about any individual decision. Whenever this is the case, we dynamically prune the variables we are confident about from the underlying factor graph. Consequently, at any time only samples of variables whose decision is still uncertain need to be created. Experiments in two prototypical scenarios, multi-label classification and image inpainting, show that adaptive sampling can drastically accelerate MMP without sacrificing prediction accuracy. 1
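The adaptive scheme can be sketched generically: draw joint samples, keep per-variable vote counts for the leading state, and freeze a variable once its lead over the runner-up is large enough. In the paper the frozen variables are actually pruned from the factor graph so that subsequent sampling becomes cheaper, and the stopping rule uses a posterior over the decision; the simple vote-margin rule and the toy sampler below are illustrative stand-ins.

```python
import numpy as np

def adaptive_mmp(sample_fn, num_vars, num_states, max_samples, margin, rng):
    """Maximum marginal prediction with adaptive per-variable stopping.

    sample_fn(rng) must return one joint sample as an integer array of length
    num_vars (e.g. one sweep of a Gibbs sampler). A variable is frozen once the
    lead of its most frequent state over the runner-up exceeds `margin` votes;
    the paper instead prunes it from the factor graph and uses a proper
    posterior probability of the decision.
    """
    votes = np.zeros((num_vars, num_states), dtype=int)
    decided = np.full(num_vars, -1)
    for _ in range(max_samples):
        x = sample_fn(rng)
        active = decided < 0
        votes[active, x[active]] += 1
        top2 = np.sort(votes, axis=1)[:, -2:]           # runner-up and leader counts
        confident = active & (top2[:, 1] - top2[:, 0] > margin)
        decided[confident] = np.argmax(votes[confident], axis=1)
        if np.all(decided >= 0):
            break
    still_open = decided < 0
    decided[still_open] = np.argmax(votes[still_open], axis=1)
    return decided

if __name__ == "__main__":
    rng = np.random.default_rng(6)
    marginals = rng.dirichlet(np.ones(3), size=10)      # toy independent "model"
    sample_fn = lambda rng: np.array([rng.choice(3, p=m) for m in marginals])
    print(adaptive_mmp(sample_fn, num_vars=10, num_states=3,
                       max_samples=2000, margin=30, rng=rng))
```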

same-paper 6 0.81000179 123 nips-2012-Exponential Concentration for Mutual Information Estimation with Application to Forests

7 0.78081602 302 nips-2012-Scaling MPE Inference for Constrained Continuous Markov Random Fields with Consensus Optimization

8 0.73207849 18 nips-2012-A Simple and Practical Algorithm for Differentially Private Data Release

9 0.70018369 113 nips-2012-Efficient and direct estimation of a neural subunit model for sensory coding

10 0.69557422 23 nips-2012-A lattice filter model of the visual pathway

11 0.67199135 137 nips-2012-From Deformations to Parts: Motion-based Segmentation of 3D Objects

12 0.66973704 112 nips-2012-Efficient Spike-Coding with Multiplicative Adaptation in a Spike Response Model

13 0.66129845 114 nips-2012-Efficient coding provides a direct link between prior and likelihood in perceptual Bayesian inference

14 0.65795368 190 nips-2012-Learning optimal spike-based representations

15 0.65501112 94 nips-2012-Delay Compensation with Dynamical Synapses

16 0.64711678 256 nips-2012-On the connections between saliency and tracking

17 0.64622527 152 nips-2012-Homeostatic plasticity in Bayesian spiking networks as Expectation Maximization with posterior constraints

18 0.64559829 24 nips-2012-A mechanistic model of early sensory processing based on subtracting sparse representations

19 0.64545333 341 nips-2012-The topographic unsupervised learning of natural sounds in the auditory cortex

20 0.64293867 120 nips-2012-Exact and Stable Recovery of Sequences of Signals with Sparse Increments via Differential 1-Minimization