nips nips2008 nips2008-82 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Liang Zhang, Deepak Agarwal
Abstract: Multi-level hierarchical models provide an attractive framework for incorporating correlations induced in a response variable that is organized hierarchically. Model fitting is challenging, especially for a hierarchy with a large number of nodes. We provide a novel algorithm based on a multi-scale Kalman filter that is both scalable and easy to implement. For Gaussian response, we show our method provides the maximum a-posteriori (MAP) parameter estimates; for non-Gaussian response, parameter estimation is performed through a Laplace approximation. However, the Laplace approximation provides biased parameter estimates that are corrected through a parametric bootstrap procedure. We illustrate through simulation studies and analyses of real-world data sets in health care and online advertising.
Reference: text
sentIndex sentText sentNum sentScore
1 com Abstract Multi-level hierarchical models provide an attractive framework for incorporating correlations induced in a response variable that is organized hierarchically. [sent-6, score-0.181]
2 Model fitting is challenging, especially for a hierarchy with a large number of nodes. [sent-7, score-0.193]
3 However, the Laplace approximation provides biased parameter estimates that are corrected through a parametric bootstrap procedure. [sent-10, score-0.256]
4 1 Introduction. In many real-world prediction problems, the response variable of interest is clustered hierarchically. [sent-12, score-0.108]
5 For instance, in studying the immunization status of a set of children in a particular geographic location, the children are naturally clustered by families, which in turn are clustered into communities. [sent-13, score-0.224]
6 The clustering often induces correlations in the response variable; models that exploit this provide significant improvements in predictive performance. [sent-14, score-0.103]
7 Multi-level hierarchical models provide an attractive framework for modeling such correlations. [sent-15, score-0.107]
8 Although routinely applied to moderate-sized data (a few thousand nodes) in fields such as epidemiology, the social sciences, and biology [3], model fitting is computationally expensive and is usually performed through a Cholesky decomposition of a q-dimensional matrix, where q is the number of nodes in the hierarchy. [sent-16, score-0.21]
9 Recently, such models have shown promise in a novel application to internet advertising [1], where the goal is to select the top-k advertisements to show on a webpage to maximize click-through rates. [sent-17, score-0.119]
10 To capture the semantic meaning of content in a parsimonious way, it is commonplace to classify webpages and ads into large pre-defined hierarchies. [sent-18, score-0.124]
11 The hierarchy in such applications consists of several levels, and the total number of nodes may run into millions. [sent-19, score-0.384]
12 Moreover, the main goal is to exploit the hierarchy for obtaining better predictions; computing the full posterior predictive distribution is of secondary importance. [sent-20, score-0.294]
13 In this paper, we provide a novel, fast and easy to implement algorithm to compute the posterior mode of parameters for such models on datasets organized hierarchically into millions of nodes with several levels. [sent-22, score-0.26]
14 The key component of our algorithm is a multi-scale Kalman filter that expedites the computation of an otherwise expensive conditional posterior. [sent-23, score-0.06]
15 The central idea in multi-level hierarchical (MLH hereafter) models is “shrinkage” across the nodes in the hierarchy. [sent-24, score-0.207]
16 More specifically, these models assume a multi-level prior wherein the parameters of child nodes are drawn from a distribution centered around the parameter of the parent. [sent-25, score-0.222]
17 This bottom-up, recursive assumption provides a posterior whose estimates at the finest resolution are smoothed using data on the lineage path of the node in the hierarchy. [sent-26, score-0.322]
18 Although MLH models are intuitive, parameter estimation presents a formidable challenge, especially for large hierarchies. [sent-29, score-0.06]
19 For Gaussian response, the main computational bottleneck is the Cholesky factorization of a dense covariance matrix whose order depends on the number of nodes; this is expensive for large problems. [sent-30, score-0.102]
20 For non-Gaussian response (e.g. binary data), the non-quadratic nature of the log-likelihood adds the additional challenge of approximating an integral whose dimension depends on the number of nodes in the hierarchy. [sent-32, score-0.161]
21 Our contributions are as follows: we provide a novel fitting procedure based on a multi-scale Kalman filter algorithm that directly exploits the hierarchical structure of the problem and computes the posterior mode of the MLH parameters. [sent-38, score-0.135]
22 The complexity of our method is almost linear in the number of nodes in the hierarchy. [sent-39, score-0.137]
23 Moreover, fitting such models to non-Gaussian data presents formidable challenges, as we illustrate in the paper. [sent-42, score-0.094]
24 We provide strategies to overcome these through a bootstrap correction and compare with the commonly used cross-validation approach. [sent-43, score-0.172]
25 Our methods are illustrated on simulated data, benchmark data and data obtained from an internet advertising application. [sent-44, score-0.09]
26 2 MLH for Gaussian Responses. Assume we have a hierarchy $T$ consisting of $L$ levels (the root is level 0), where $m_j$, $j = 0, \dots, L$, denotes the number of nodes at level $j$. [sent-45, score-0.586]
27 Denote the set of nodes at level $j$ in the hierarchy $T$ as $T_j$. [sent-46, score-0.401]
28 For node $r$ in $T$, denote the parent of $r$ as $pa(r)$ and the $i$th child of node $r$ as $c_i(r)$. [sent-47, score-0.544]
29 If a node $r'$ is a descendant of $r$, we write $r' \prec r$. [sent-48, score-0.138]
30 Since the hierarchy has $L$ levels, $T_L$ denotes the set of leaf nodes in the hierarchy. [sent-49, score-0.466]
31 Let $y_{ir}$, $i = 1, \dots, n_r$, denote the $i$th observation at leaf node $r$, and let $x_{ir}$ denote the $p$-dimensional covariate vector associated with $y_{ir}$. [sent-50, score-1.701]
32 For simplicity, we assume all observations are available at leaf nodes (a more general case where each node in the hierarchy can have observations is easily obtained from our algorithm). [sent-51, score-0.604]
33 An equivalent way of specifying MLH is obtained by associating independent random variables $b^j_r \sim N(0, \gamma_j)$ with the nodes and replacing $\phi^L_r$ in (1) by the sum of the $b^j_r$ parameters along the lineage path from the root to leaf node $r$ in the hierarchy. [sent-59, score-0.766]
34 We denote this compactly as $z'_r b$, where $b$ is the vector of $b^j_r$ for all the nodes in the hierarchy, and $z_r$ is a vector of 0/1's turned on for the nodes on the path of node $r$. [sent-60, score-0.684]
35 More compactly, let $y = \{y_{ir}, i = 1, \dots, n_r, r \in T\}$, and let $X$ and $Z$ be the corresponding matrices of vectors $x'_{ir}$ and $z'_r$ for $i = 1, \dots, n_r$ and $r \in T$; then $y \sim N(X\beta + Zb, V I)$ with $b \sim N(0, \Omega(\gamma))$. [sent-61, score-0.734]
36 The problem is to compute the posterior mode of $(\beta_{p \times 1}, b_{q \times 1}, \gamma_{L \times 1}, V)$ where $q = \sum_{j=1}^{L} m_j$. [sent-62, score-0.154]
37 The main computational bottleneck is computing the Cholesky factor of the $q \times q$ matrix $(Z'Z + \Omega^{-1})$; this is expensive for large values of $q$. [sent-63, score-0.061]
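As a concrete illustration of the $z_r$ construction above, here is a minimal Python sketch; the toy hierarchy and the helper name `z_vector` are our own, not from the paper:

```python
import numpy as np

# Hypothetical toy hierarchy: node -> parent (None for children of the root).
# Level 1: "a", "b"; level 2 (leaves): "a1", "a2", "b1".
parent = {"a": None, "b": None, "a1": "a", "a2": "a", "b1": "b"}
nodes = ["a", "b", "a1", "a2", "b1"]       # all non-root nodes, fixed order
col = {n: j for j, n in enumerate(nodes)}  # column index of each b_r

def z_vector(leaf):
    """0/1 indicator of the lineage path from the root down to `leaf`."""
    z = np.zeros(len(nodes))
    r = leaf
    while r is not None:
        z[col[r]] = 1.0
        r = parent[r]
    return z

# One row of Z per observation, taken at that observation's leaf:
Z = np.vstack([z_vector("a1"), z_vector("a1"), z_vector("b1")])
print(Z)  # each row has ones exactly on the leaf-to-root path
```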
38 For the non-Gaussian case, we provide an approximation to the MAP through the Laplace method coupled with a bootstrap correction. [sent-66, score-0.111]
39 We also note that our method applies if the random effects are vectors and enter into equation (2) as a linear combination with some covariate vector. [sent-67, score-0.086]
40 The main component of our fitting algorithm is computing the conditional posterior distribution of $\phi = \{\phi^j_r, r \in T, j = 1, \dots, L\}$ given $(\beta, V, \gamma)$. [sent-71, score-0.084]
41 The multi-scale Kalman filter (described next) computes the conditional posterior of φ mentioned above and is used in the inner loop of the EM. [sent-73, score-0.084]
42 As in temporal state-space models, the Kalman filter consists of two steps: (a) filtering, where one propagates information from the leaves to the root, and (b) smoothing, where information is propagated from the root all the way down to the leaves. [sent-74, score-0.07]
43 Filtering: Denote the current estimates of $\beta$, $\gamma$ and $V$ as $\hat\beta$, $\hat\gamma$, and $\hat V$ respectively. [sent-75, score-0.092]
44 Then $e_{ir} = y_{ir} - x'_{ir}\hat\beta$ are the residuals, and $\mathrm{Var}(\phi^j_r) = \Sigma_j = \sum_{i=1}^{j} \hat\gamma_i$, $r \in T_j$, are the marginal variances of the random effects. [sent-76, score-0.75]
45 Writing the conditional posterior distribution as $\phi^L_r \mid \{y_{ir}, i = 1, \dots, n_r\} \sim N(\phi^L_{r|r}, \sigma^L_{r|r})$, the first step is to update $\phi^L_{r|r}$ and $\sigma^L_{r|r}$ for all leaf random effects $\phi^L_r$ using the standard Bayesian update formulas for Gaussian models: $$\phi^L_{r|r} = \frac{\Sigma_L \sum_{i=1}^{n_r} e_{ir}}{\hat V + n_r \Sigma_L}, \qquad (3) \qquad \sigma^L_{r|r} = \frac{\Sigma_L \hat V}{\hat V + n_r \Sigma_L}. \qquad (4)$$ [sent-77, score-0.932]
46 Next, the posteriors $\phi^j_r \mid \{y_{ir'}, i = 1, \dots, n_{r'}, \forall r' \prec r\} \sim N(\phi^j_{r|r}, \sigma^j_{r|r})$ are recursively updated from $j = L-1$ to $j = 1$, by regressing the parent node effect towards each child and combining information from all the children. [sent-78, score-0.265]
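To make equations (3)-(4) concrete, a minimal Python sketch of the leaf update follows; the function name `filter_leaf` and the toy numbers are illustrative assumptions, not from the paper:

```python
import numpy as np

def filter_leaf(e_r, Sigma_L, V_hat):
    """Leaf update, eqs. (3)-(4): posterior mean and variance of phi^L_r
    given the n_r residuals e_r at leaf r."""
    n_r = len(e_r)
    denom = V_hat + n_r * Sigma_L
    phi = Sigma_L * np.sum(e_r) / denom   # eq. (3)
    sigma = Sigma_L * V_hat / denom       # eq. (4)
    return phi, sigma

phi, sigma = filter_leaf(np.array([0.5, -0.2, 0.1]), Sigma_L=2.0, V_hat=1.0)
```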
47 Note that $$\phi^{j-1}_{pa(r)} = E(\phi^{j-1}_{pa(r)} \mid \phi^j_r) + \left(\phi^{j-1}_{pa(r)} - E(\phi^{j-1}_{pa(r)} \mid \phi^j_r)\right). \qquad (5)$$ Simple algebra provides the conditional expectation and variance of $\phi^{j-1}_{pa(r)} \mid \phi^j_r$ as $$\phi^{j-1}_{pa(r)} = B_j \phi^j_r + \psi^j_r, \qquad (6)$$ where $B_j = \sum_{i=1}^{j-1} \hat\gamma_i \big/ \sum_{i=1}^{j} \hat\gamma_i$ is the correlation between any two siblings at level $j$ and $\psi^j_r \sim N(0, B_j \hat\gamma_j)$. [sent-80, score-0.083]
48 First, a new prior is obtained for the parent node based on the current estimate of each child, by plugging the current estimate of the child into equation (6). [sent-81, score-0.498]
49 For the $i$th child of node $r$ (here we assume that $r$ is at level $j-1$ and $c_i(r)$ is at level $j$), $$\phi^{j-1}_{r|c_i(r)} = B_j \phi^j_{c_i(r)|c_i(r)}, \qquad (7) \qquad \sigma^{j-1}_{r|c_i(r)} = B_j^2 \sigma^j_{c_i(r)|c_i(r)} + B_j \hat\gamma_j. \qquad (8)$$ Next, we combine the information obtained by the parent from all its children: [sent-82, score-0.548]
50 $$\phi^{j-1}_{r|r} = \sigma^{j-1}_{r|r} \sum_{i=1}^{k_r} \left(\phi^{j-1}_{r|c_i(r)} \big/ \sigma^{j-1}_{r|c_i(r)}\right), \qquad (9) \qquad 1/\sigma^{j-1}_{r|r} = \Sigma_{j-1}^{-1} + \sum_{i=1}^{k_r} \left(\left(1/\sigma^{j-1}_{r|c_i(r)}\right) - \Sigma_{j-1}^{-1}\right), \qquad (10)$$ [sent-83, score-0.1]
51 where $k_r$ is the number of children of node $r$ at level $j-1$. [sent-84, score-0.315]
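A hedged sketch of the upward (parent) update in equations (7)-(10); `filter_parent` and its argument names are our own:

```python
import numpy as np

def filter_parent(child_phis, child_sigmas, B_j, gamma_j, Sigma_prev):
    """Combine the children of a level-(j-1) node into its posterior,
    eqs. (7)-(10). Sigma_prev is the marginal variance Sigma_{j-1}."""
    # Eqs. (7)-(8): each child's evidence about its parent.
    phi_from_child = B_j * np.asarray(child_phis)
    sig_from_child = B_j**2 * np.asarray(child_sigmas) + B_j * gamma_j
    # Eq. (10): add child precisions, discounting the shared prior per child.
    prec = 1.0 / Sigma_prev + np.sum(1.0 / sig_from_child - 1.0 / Sigma_prev)
    sigma_parent = 1.0 / prec
    # Eq. (9): precision-weighted combination of the children's messages.
    phi_parent = sigma_parent * np.sum(phi_from_child / sig_from_child)
    return phi_parent, sigma_parent
```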
52 Smoothing: In the smoothing step, parents propagate information recursively from the root to the leaves to provide the posterior of each $\phi^j_r$ based on the entire data. [sent-85, score-0.128]
53 Denoting the posterior mean and variance of $\phi^j_r$ given all the observations by $\hat\phi^j_r$ and $\hat\sigma^j_r$ respectively, the update equations are given below. [sent-86, score-0.087]
54 For level 1 nodes, set $\hat\phi^1_r = \phi^1_{r|r}$ and $\hat\sigma^1_r = \sigma^1_{r|r}$. [sent-87, score-0.071]
55 For node $r$ at other levels, $$\hat\phi^j_r = \phi^j_{r|r} + \sigma^j_{r|r} B_j \left(\hat\phi^{j-1}_{pa(r)} - \phi^{j-1}_{pa(r)|r}\right) \big/ \sigma^{j-1}_{pa(r)|r}, \qquad (11)$$ $$\hat\sigma^j_r = \sigma^j_{r|r} + (\sigma^j_{r|r})^2 B_j^2 \left(\hat\sigma^{j-1}_{pa(r)} - \sigma^{j-1}_{pa(r)|r}\right) \big/ (\sigma^{j-1}_{pa(r)|r})^2, \qquad (12)$$ $$\hat\sigma^{j,j-1}_{r,pa(r)} = \sigma^j_{r|r} B_j \hat\sigma^{j-1}_{pa(r)} \big/ \sigma^{j-1}_{pa(r)|r}. \qquad (13)$$ [sent-88, score-0.138]
56 The computational complexity of the algorithm is linear in the number of nodes in the hierarchy; for each parent node, we perform an operation that is cubic in its number of children. [sent-89, score-0.387]
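The downward pass can be sketched the same way; `smooth_node` below is a minimal illustration of equations (11)-(13), with argument names of our choosing:

```python
def smooth_node(phi_f, sig_f, B_j, phi_pa_hat, sig_pa_hat,
                phi_pa_given_r, sig_pa_given_r):
    """Smoothing update, eqs. (11)-(13): combine node r's filtered moments
    (phi_f, sig_f) with the already-smoothed parent (phi_pa_hat, sig_pa_hat)
    and the parent's filtered moments given r's subtree."""
    gain = sig_f * B_j / sig_pa_given_r
    phi_hat = phi_f + gain * (phi_pa_hat - phi_pa_given_r)       # eq. (11)
    sig_hat = sig_f + gain**2 * (sig_pa_hat - sig_pa_given_r)    # eq. (12)
    cov_r_pa = gain * sig_pa_hat                                 # eq. (13)
    return phi_hat, sig_hat, cov_r_pa
```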
57 The maximization step obtains revised estimates of other parameters by maximizing the expected complete log-posterior. [sent-93, score-0.092]
58 $$\hat V = \frac{\sum_{r \in T_L} \left[ \sum_{i=1}^{n_r} (e_{ir} - \hat\phi^L_r)^2 + n_r \hat\sigma^L_r \right]}{\sum_{r \in T_L} n_r}, \qquad (14)$$ and for $j = 1, \dots, L$, $$\hat\gamma_j = \frac{\sum_{r \in T_j} \left( \hat\sigma^j_r + \hat\sigma^{j-1}_{pa(r)} - 2\hat\sigma^{j,j-1}_{r,pa(r)} + (\hat\phi^j_r - \hat\phi^{j-1}_{pa(r)})^2 \right)}{|m_j|}. \qquad (15)$$ [sent-94, score-0.765]
59 Updating $\beta$: We use the posterior mean of $\phi$ obtained from the Kalman filtering step to compute the posterior mean of $\beta$ as given in equation (16): [sent-95, score-0.112]
60 $$\hat\beta = (X'X)^{-1} X' (Y - \hat\phi^L), \qquad (16)$$ where $\hat\phi^L$ is the vector of $\hat\phi^L_r$ corresponding to each observation $y_{ir}$ at its leaf node $r$. [sent-96, score-0.739]
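A sketch of the M-step updates (14)-(15), assuming the smoothed moments are stored in plain dictionaries keyed by node; the function and container names are ours:

```python
import numpy as np

def m_step_V(e, phi_hat_L, sigma_hat_L, n):
    """Eq. (14): e[r] is the residual array at leaf r; phi_hat_L / sigma_hat_L
    hold the smoothed leaf moments; n[r] = n_r."""
    num = sum(np.sum((e[r] - phi_hat_L[r])**2) + n[r] * sigma_hat_L[r]
              for r in e)
    return num / sum(n.values())

def m_step_gamma(phi_hat, sigma_hat, cov_r_pa, pa, level_nodes_j):
    """Eq. (15): revise gamma_j from the smoothed moments of level-j nodes."""
    terms = [sigma_hat[r] + sigma_hat[pa[r]] - 2.0 * cov_r_pa[r]
             + (phi_hat[r] - phi_hat[pa[r]])**2 for r in level_nodes_j]
    return sum(terms) / len(level_nodes_j)
```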
61 2.2 Simulation Performance. We first perform a simulation study with a hierarchy described in [7, 8]. [sent-98, score-0.217]
62 The response variable of interest is binary with a positive label assigned to a child if he/she received a full set of immunizations. [sent-100, score-0.168]
63 The actual data contains 15 covariates capturing individual-, family- and community-level characteristics, as shown in Table 2. [sent-101, score-0.168]
64 We simulated a Gaussian response as follows: $y_{ir} \mid b \sim N(x'_{ir}\beta + b^1_r + b^2_r, 10)$, where $b^1_r \sim N(0, 4)$ and $b^2_r \sim N(0, 1)$. [sent-103, score-0.539]
65 We simulated 100 data sets and compared the estimates from the Kalman filter to those obtained from the standard routine lme4 in the statistical software R. [sent-104, score-0.092]
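For reference, a minimal generator for the simulated Gaussian response above; the sizes are toy values of our own, not the study's:

```python
import numpy as np

rng = np.random.default_rng(0)
n_groups, n_sub, n_obs, p = 20, 5, 10, 3   # toy sizes, not the paper's
beta = rng.normal(size=p)
b1 = rng.normal(0.0, 2.0, size=n_groups)            # sd 2 -> variance 4
b2 = rng.normal(0.0, 1.0, size=(n_groups, n_sub))   # variance 1

data = []
for g in range(n_groups):
    for s in range(n_sub):
        x = rng.normal(size=(n_obs, p))
        y = x @ beta + b1[g] + b2[g, s] + rng.normal(0.0, np.sqrt(10.0), n_obs)
        data.append((g, s, x, y))
```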
66 3 MLH for Non-Gaussian Responses. We discuss model fitting for a Bernoulli response, but note that other distributions in the generalized linear model family can be fitted easily using the same procedure. [sent-107, score-0.074]
67 The MLH logistic regression is defined as $$\theta_{ir} = x'_{ir}\beta + \phi^L_r, \qquad (17)$$ with the same multi-level prior as described in equation (2). [sent-112, score-0.312]
68 The estimates obtained are biased; we recommend cross-validation and a parametric bootstrap (adapted from [4]) to correct for the bias. [sent-115, score-0.256]
69 The bootstrap procedure, though expensive, is easily parallelizable and accurate. [sent-116, score-0.143]
70 3.1 Approximation Methods. Let $\eta_{ir} = x'_{ir}\hat\beta + \hat\phi^L_r$, where $\hat\beta$ and $\hat\phi^L_r$ are the current estimates of the parameters in our algorithm. [sent-118, score-0.266]
71 This enables us to do the calculations as in the Gaussian case, with the response $y_{ir}$ replaced by $Z_{ir}$, where $$Z_{ir} = \eta_{ir} + \frac{2y_{ir} - 1}{g((2y_{ir} - 1)\eta_{ir})}. \qquad (18)$$ [Algorithm 1: The bootstrap procedure; let $\theta = (\beta, \gamma)$.] [sent-120, score-0.65]
72 The variance of $Z_{ir}$ is approximated as $$V_{ir} = \frac{1}{g(\eta_{ir})\, g(-\eta_{ir})}. \qquad (19)$$ Now denote $e_{ir} = Z_{ir} - x'_{ir}\hat\beta$. [sent-129, score-0.142]
73 Analogous to equations (3) and (4), the resulting filtering step for the leaf nodes becomes $$\phi^L_{r|r} = \sigma^L_{r|r} \sum_{i=1}^{n_r} \frac{e_{ir}}{V_{ir}}, \qquad (20) \qquad \sigma^L_{r|r} = \left( \frac{1}{\Sigma_L} + \sum_{i=1}^{n_r} \frac{1}{V_{ir}} \right)^{-1}. \qquad (21)$$ [sent-130, score-0.639]
74 With $W$ the diagonal matrix of weights $1/V_{ir}$, the regression update becomes $$\hat\beta = (X'WX)^{-1} X'W(Z - \hat\phi^L). \qquad (22)$$ All the other computational steps remain the same as in the Gaussian case. [sent-132, score-0.138]
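A minimal sketch of the working-response construction in equations (18)-(19), taking $g$ to be the logistic function; `working_response` is our own name:

```python
import numpy as np

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

def working_response(y, eta):
    """Pseudo-data Z_ir and its approximate variance V_ir, eqs. (18)-(19),
    for binary y in {0, 1} at linear predictor eta."""
    s = 2.0 * y - 1.0                            # map {0, 1} -> {-1, +1}
    Z = eta + s / logistic(s * eta)              # eq. (18)
    V = 1.0 / (logistic(eta) * logistic(-eta))   # eq. (19)
    return Z, V
```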
75 3.2 Bias correction. Table 2 shows estimates of the parameters obtained from our approximation method in the column titled KF. [sent-134, score-0.153]
76 Compared to the unbiased estimates obtained from the slow Gibbs sampler, it is clear that our estimates are biased. [sent-135, score-0.184]
77 Our bias correction procedure is described in Algorithm 1. [sent-136, score-0.088]
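Since the body of Algorithm 1 is not visible here, the following is only a generic parametric-bootstrap bias correction in its spirit; `simulate` and `fit` are assumed user-supplied callables:

```python
import numpy as np

def bootstrap_correct(theta_hat, simulate, fit, B=50, seed=0):
    """Generic parametric bootstrap bias correction: refit on B data sets
    drawn from the model at theta_hat, estimate the bias as the average
    refit minus theta_hat, and subtract it."""
    rng = np.random.default_rng(seed)
    boots = np.array([fit(simulate(theta_hat, rng)) for _ in range(B)])
    bias = boots.mean(axis=0) - theta_hat
    return theta_hat - bias
```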
78 The bias-corrected estimates are reported under KF-B in Table 2. [sent-138, score-0.147]
79 The estimates after bootstrap correction are closer to the estimates obtained from Gibbs sampling. [sent-139, score-0.356]
80 The estimates at the optimal value of γ are shown in Table 2 under KF-C. [sent-145, score-0.092]
81 The estimates are better than KF but worse than KF-B. [sent-146, score-0.092]
82 Based on our findings, we recommend KF-B when computing resources are available (especially multiple processors) and running time is not a big constraint; if runtime is an issue we recommend grid search using a small number of points around the initial estimate. [sent-147, score-0.106]
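One plausible reading of the grid-search recommendation, sketched for a scalar variance component; the helper and its defaults are our own assumptions:

```python
import numpy as np

def grid_search_gamma(gamma0, objective, width=0.5, n_points=5):
    """Score a few candidates around the initial estimate gamma0 on a
    tuning set and keep the best; `objective` returns the tuning score."""
    grid = np.linspace((1.0 - width) * gamma0, (1.0 + width) * gamma0, n_points)
    return max(grid, key=objective)
```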
83 4 Content Match Data Analysis. We analyze data from an internet advertising application where every showing of an ad on a web page (called an impression) constitutes an event. [sent-148, score-0.156]
84 The goal is to rank ads on a given page based on click-through rates. [sent-149, score-0.13]
85 [Table 2: Estimates for the binary MLH model of complete immunization (Kalman filtering results); the table reports, among other rows, the random-effect standard deviations γ at the family and community levels.] ... ads is an attractive approach. [sent-218, score-0.261]
86 In our case, semantic features are obtained by classifying pages and ads into a large seven-level content hierarchy that is constructed manually. [sent-219, score-0.317]
87 We form a new hierarchy (a pyramid) by taking the cross product of the two hierarchies. [sent-220, score-0.193]
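One way to picture the cross-product pyramid is as paired nodes that climb both trees in lockstep; the toy maps below are our own illustration, not the paper's construction:

```python
# Hypothetical toy page and ad hierarchies given as parent maps
# (None marks children of the root).
page_pa = {"p1": None, "p11": "p1", "p12": "p1"}
ad_pa = {"a1": None, "a11": "a1"}

def pyramid_parent(node):
    """Parent of a pyramid node: pair the two parents, so a path in the
    pyramid moves one level up in both hierarchies at once."""
    pg, ad = node
    if page_pa[pg] is None or ad_pa[ad] is None:
        return None  # reached the pyramid root
    return (page_pa[pg], ad_pa[ad])

print(pyramid_parent(("p11", "a11")))  # -> ('p1', 'a1')
```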
88 4.1 Training and Test Data. Although the page and ad hierarchies consist of 7 levels, classification is often done at coarser levels by the classifier. [sent-223, score-0.217]
89 In fact, the average level at which classification took place is 3. [sent-224, score-0.071]
90 Pages and ads that are classified at coarser levels are randomly assigned to the child nodes. [sent-227, score-0.264]
91 Overall, the pyramid has 441, 25,751, and 241,292 nodes at the top three levels. [sent-228, score-0.18]
92 The MLH model that uses only 2 levels is inferior to the 3-level MLH, while the general model that uses both covariates and the hierarchy is the best. [sent-240, score-0.387]
93 5 Discussion. In applications where data are aggregated at multiple resolutions with sparsity at finer resolutions, multi-level hierarchical models provide an attractive class of models for reducing variance by smoothing estimates at finer resolutions using data at coarser resolutions. [sent-245, score-0.433]
94 However, the smoothing provides a better bias-variance tradeoff only when the hierarchy provides a natural clustering for the response variable and captures some latent characteristics of the process; this is often true in practice. [sent-246, score-0.304]
95 For the non-Gaussian case, the estimates are biased but performance can be improved by using a bootstrap correction or estimation through a tuning set. [sent-248, score-0.289]
96 In future work, we will report on models that generalize our approach to an arbitrary number of hierarchies, each possibly with a different structure. [sent-249, score-0.066]
97 This is a challenging problem since, in general, the cross product of trees is not a hierarchy but a graph. [sent-250, score-0.193]
98 Maximum likelihood for generalized linear models with nested random effects via high-order, multivariate Laplace approximation. [sent-292, score-0.091]
99 An assessment of estimation procedures for multilevel models with binary responses. [sent-297, score-0.086]
100 Improved estimation procedures for multilevel models with binary response: A case-study. [sent-302, score-0.086]
wordName wordTfidf (topN-words)
[('yir', 0.465), ('mlh', 0.399), ('nr', 0.255), ('pa', 0.22), ('hierarchy', 0.193), ('kalman', 0.18), ('xir', 0.174), ('bj', 0.142), ('ir', 0.138), ('node', 0.138), ('nodes', 0.137), ('leaf', 0.136), ('eir', 0.111), ('zir', 0.111), ('bootstrap', 0.111), ('ci', 0.097), ('ads', 0.094), ('estimates', 0.092), ('pir', 0.089), ('positional', 0.089), ('tl', 0.079), ('education', 0.079), ('mother', 0.078), ('response', 0.074), ('level', 0.071), ('child', 0.07), ('cholesky', 0.069), ('covariates', 0.069), ('indigenous', 0.066), ('vir', 0.066), ('filter', 0.064), ('effects', 0.062), ('correction', 0.061), ('laplace', 0.061), ('mj', 0.06), ('coarser', 0.06), ('birth', 0.058), ('husband', 0.058), ('kf', 0.058), ('parent', 0.057), ('children', 0.056), ('posterior', 0.056), ('levels', 0.054), ('resolutions', 0.053), ('filtering', 0.053), ('recommend', 0.053), ('fitting', 0.052), ('tj', 0.05), ('zr', 0.05), ('kr', 0.05), ('advertising', 0.05), ('secondary', 0.045), ('guatemalan', 0.044), ('immunization', 0.044), ('ith', 0.044), ('finer', 0.043), ('pyramid', 0.043), ('sized', 0.041), ('factorization', 0.041), ('hierarchical', 0.041), ('internet', 0.04), ('mode', 0.038), ('smoothing', 0.037), ('attractive', 0.037), ('gibbs', 0.037), ('hierarchies', 0.037), ('page', 0.036), ('lineage', 0.036), ('root', 0.035), ('clustered', 0.034), ('illustrate', 0.034), ('rodriguez', 0.033), ('spanish', 0.033), ('multilevel', 0.033), ('expensive', 0.032), ('formidable', 0.031), ('scalable', 0.031), ('variance', 0.031), ('ad', 0.03), ('content', 0.03), ('compactly', 0.03), ('em', 0.029), ('models', 0.029), ('bottleneck', 0.029), ('community', 0.028), ('corrected', 0.028), ('parametrization', 0.028), ('conditional', 0.028), ('taylor', 0.027), ('bias', 0.027), ('missing', 0.026), ('agarwal', 0.026), ('age', 0.026), ('royal', 0.026), ('worked', 0.025), ('biased', 0.025), ('covariate', 0.024), ('binary', 0.024), ('simulation', 0.024), ('algebra', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 82 nips-2008-Fast Computation of Posterior Mode in Multi-Level Hierarchical Models
Author: Liang Zhang, Deepak Agarwal
Abstract: Multi-level hierarchical models provide an attractive framework for incorporating correlations induced in a response variable that is organized hierarchically. Model fitting is challenging, especially for a hierarchy with a large number of nodes. We provide a novel algorithm based on a multi-scale Kalman filter that is both scalable and easy to implement. For Gaussian response, we show our method provides the maximum a-posteriori (MAP) parameter estimates; for non-Gaussian response, parameter estimation is performed through a Laplace approximation. However, the Laplace approximation provides biased parameter estimates that are corrected through a parametric bootstrap procedure. We illustrate through simulation studies and analyses of real-world data sets in health care and online advertising.
2 0.081258066 36 nips-2008-Beyond Novelty Detection: Incongruent Events, when General and Specific Classifiers Disagree
Author: Daphna Weinshall, Hynek Hermansky, Alon Zweig, Jie Luo, Holly Jimison, Frank Ohl, Misha Pavel
Abstract: Unexpected stimuli are a challenge to any machine learning algorithm. Here we identify distinct types of unexpected events, focusing on ’incongruent events’ when ’general level’ and ’specific level’ classifiers give conflicting predictions. We define a formal framework for the representation and processing of incongruent events: starting from the notion of label hierarchy, we show how partial order on labels can be deduced from such hierarchies. For each event, we compute its probability in different ways, based on adjacent levels (according to the partial order) in the label hierarchy. An incongruent event is an event where the probability computed based on some more specific level (in accordance with the partial order) is much smaller than the probability computed based on some more general level, leading to conflicting predictions. We derive algorithms to detect incongruent events from different types of hierarchies, corresponding to class membership or part membership. Respectively, we show promising results with real data on two specific problems: Out Of Vocabulary words in speech recognition, and the identification of a new sub-class (e.g., the face of a new individual) in audio-visual facial object recognition.
3 0.078018904 191 nips-2008-Recursive Segmentation and Recognition Templates for 2D Parsing
Author: Leo Zhu, Yuanhao Chen, Yuan Lin, Chenxi Lin, Alan L. Yuille
Abstract: Language and image understanding are two major goals of artificial intelligence which can both be conceptually formulated in terms of parsing the input signal into a hierarchical representation. Natural language researchers have made great progress by exploiting the 1D structure of language to design efficient polynomialtime parsing algorithms. By contrast, the two-dimensional nature of images makes it much harder to design efficient image parsers and the form of the hierarchical representations is also unclear. Attempts to adapt representations and algorithms from natural language have only been partially successful. In this paper, we propose a Hierarchical Image Model (HIM) for 2D image parsing which outputs image segmentation and object recognition. This HIM is represented by recursive segmentation and recognition templates in multiple layers and has advantages for representation, inference, and learning. Firstly, the HIM has a coarse-to-fine representation which is capable of capturing long-range dependency and exploiting different levels of contextual information. Secondly, the structure of the HIM allows us to design a rapid inference algorithm, based on dynamic programming, which enables us to parse the image rapidly in polynomial time. Thirdly, we can learn the HIM efficiently in a discriminative manner from a labeled dataset. We demonstrate that HIM outperforms other state-of-the-art methods by evaluation on the challenging public MSRC image dataset. Finally, we sketch how the HIM architecture can be extended to model more complex image phenomena. 1
4 0.069950901 40 nips-2008-Bounds on marginal probability distributions
Author: Joris M. Mooij, Hilbert J. Kappen
Abstract: We propose a novel bound on single-variable marginal probability distributions in factor graphs with discrete variables. The bound is obtained by propagating local bounds (convex sets of probability distributions) over a subtree of the factor graph, rooted in the variable of interest. By construction, the method not only bounds the exact marginal probability distribution of a variable, but also its approximate Belief Propagation marginal (“belief”). Thus, apart from providing a practical means to calculate bounds on marginals, our contribution also lies in providing a better understanding of the error made by Belief Propagation. We show that our bound outperforms the state-of-the-art on some inference problems arising in medical diagnosis. 1
5 0.069154121 159 nips-2008-On Bootstrapping the ROC Curve
Author: Patrice Bertail, Stéphan J. Clémençcon, Nicolas Vayatis
Abstract: This paper is devoted to thoroughly investigating how to bootstrap the ROC curve, a widely used visual tool for evaluating the accuracy of test/scoring statistics in the bipartite setup. The issue of confidence bands for the ROC curve is considered and a resampling procedure based on a smooth version of the empirical distribution called the ”smoothed bootstrap” is introduced. Theoretical arguments and simulation results are presented to show that the ”smoothed bootstrap” is preferable to a ”naive” bootstrap in order to construct accurate confidence bands. 1
6 0.062179215 4 nips-2008-A Scalable Hierarchical Distributed Language Model
7 0.057124138 235 nips-2008-The Infinite Hierarchical Factor Regression Model
8 0.05666345 184 nips-2008-Predictive Indexing for Fast Search
9 0.056555539 231 nips-2008-Temporal Dynamics of Cognitive Control
10 0.055793878 98 nips-2008-Hierarchical Semi-Markov Conditional Random Fields for Recursive Sequential Data
11 0.054881614 15 nips-2008-Adaptive Martingale Boosting
12 0.054385435 34 nips-2008-Bayesian Network Score Approximation using a Metagraph Kernel
13 0.051600955 213 nips-2008-Sparse Convolved Gaussian Processes for Multi-output Regression
14 0.050737415 170 nips-2008-Online Optimization in X-Armed Bandits
15 0.050405614 176 nips-2008-Partially Observed Maximum Entropy Discrimination Markov Networks
16 0.047007862 12 nips-2008-Accelerating Bayesian Inference over Nonlinear Differential Equations with Gaussian Processes
17 0.046675682 78 nips-2008-Exact Convex Confidence-Weighted Learning
18 0.046133503 70 nips-2008-Efficient Inference in Phylogenetic InDel Trees
19 0.045818802 224 nips-2008-Structured ranking learning using cumulative distribution networks
20 0.045741189 24 nips-2008-An improved estimator of Variance Explained in the presence of noise
topicId topicWeight
[(0, -0.146), (1, -0.007), (2, 0.032), (3, -0.008), (4, 0.019), (5, -0.077), (6, 0.062), (7, 0.075), (8, 0.061), (9, 0.004), (10, -0.049), (11, 0.014), (12, 0.014), (13, -0.028), (14, 0.004), (15, 0.029), (16, 0.003), (17, -0.075), (18, 0.041), (19, 0.015), (20, 0.072), (21, 0.068), (22, 0.013), (23, -0.023), (24, -0.053), (25, -0.007), (26, 0.01), (27, -0.011), (28, -0.05), (29, -0.03), (30, -0.013), (31, -0.042), (32, 0.041), (33, -0.012), (34, 0.024), (35, -0.071), (36, 0.079), (37, 0.131), (38, 0.022), (39, 0.038), (40, -0.063), (41, -0.02), (42, -0.018), (43, -0.017), (44, -0.086), (45, -0.046), (46, 0.185), (47, 0.102), (48, -0.014), (49, -0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.91936868 82 nips-2008-Fast Computation of Posterior Mode in Multi-Level Hierarchical Models
Author: Liang Zhang, Deepak Agarwal
Abstract: Multi-level hierarchical models provide an attractive framework for incorporating correlations induced in a response variable that is organized hierarchically. Model fitting is challenging, especially for a hierarchy with a large number of nodes. We provide a novel algorithm based on a multi-scale Kalman filter that is both scalable and easy to implement. For Gaussian response, we show our method provides the maximum a-posteriori (MAP) parameter estimates; for non-Gaussian response, parameter estimation is performed through a Laplace approximation. However, the Laplace approximation provides biased parameter estimates that are corrected through a parametric bootstrap procedure. We illustrate through simulation studies and analyses of real-world data sets in health care and online advertising.
2 0.607467 186 nips-2008-Probabilistic detection of short events, with application to critical care monitoring
Author: Norm Aleks, Stuart Russell, Michael G. Madden, Diane Morabito, Kristan Staudenmayer, Mitchell Cohen, Geoffrey T. Manley
Abstract: We describe an application of probabilistic modeling and inference technology to the problem of analyzing sensor data in the setting of an intensive care unit (ICU). In particular, we consider the arterial-line blood pressure sensor, which is subject to frequent data artifacts that cause false alarms in the ICU and make the raw data almost useless for automated decision making. The problem is complicated by the fact that the sensor data are averaged over fixed intervals whereas the events causing data artifacts may occur at any time and often have durations significantly shorter than the data collection interval. We show that careful modeling of the sensor, combined with a general technique for detecting sub-interval events and estimating their duration, enables detection of artifacts and accurate estimation of the underlying blood pressure values. Our model’s performance identifying artifacts is superior to two other classifiers’ and about as good as a physician’s. 1
3 0.50505894 98 nips-2008-Hierarchical Semi-Markov Conditional Random Fields for Recursive Sequential Data
Author: Tran T. Truyen, Dinh Phung, Hung Bui, Svetha Venkatesh
Abstract: Inspired by the hierarchical hidden Markov models (HHMM), we present the hierarchical semi-Markov conditional random field (HSCRF), a generalisation of embedded undirected Markov chains to model complex hierarchical, nested Markov processes. It is parameterised in a discriminative framework and has polynomial time algorithms for learning and inference. Importantly, we develop efficient algorithms for learning and constrained inference in a partially-supervised setting, which is an important issue in practice where labels can only be obtained sparsely. We demonstrate the HSCRF in two applications: (i) recognising human activities of daily living (ADLs) from indoor surveillance cameras, and (ii) noun-phrase chunking. We show that the HSCRF is capable of learning rich hierarchical models with reasonable accuracy in both fully and partially observed data cases. 1
4 0.48601606 167 nips-2008-One sketch for all: Theory and Application of Conditional Random Sampling
Author: Ping Li, Kenneth W. Church, Trevor J. Hastie
Abstract: Conditional Random Sampling (CRS) was originally proposed for efficiently computing pairwise (l2 , l1 ) distances, in static, large-scale, and sparse data. This study modifies the original CRS and extends CRS to handle dynamic or streaming data, which much better reflect the real-world situation than assuming static data. Compared with many other sketching algorithms for dimension reductions such as stable random projections, CRS exhibits a significant advantage in that it is “one-sketch-for-all.” In particular, we demonstrate the effectiveness of CRS in efficiently computing the Hamming norm, the Hamming distance, the lp distance, and the χ2 distance. A generic estimator and an approximate variance formula are also provided, for approximating any type of distances. We recommend CRS as a promising tool for building highly scalable systems, in machine learning, data mining, recommender systems, and information retrieval. 1
5 0.48019299 36 nips-2008-Beyond Novelty Detection: Incongruent Events, when General and Specific Classifiers Disagree
Author: Daphna Weinshall, Hynek Hermansky, Alon Zweig, Jie Luo, Holly Jimison, Frank Ohl, Misha Pavel
Abstract: Unexpected stimuli are a challenge to any machine learning algorithm. Here we identify distinct types of unexpected events, focusing on ’incongruent events’ when ’general level’ and ’specific level’ classifiers give conflicting predictions. We define a formal framework for the representation and processing of incongruent events: starting from the notion of label hierarchy, we show how partial order on labels can be deduced from such hierarchies. For each event, we compute its probability in different ways, based on adjacent levels (according to the partial order) in the label hierarchy. An incongruent event is an event where the probability computed based on some more specific level (in accordance with the partial order) is much smaller than the probability computed based on some more general level, leading to conflicting predictions. We derive algorithms to detect incongruent events from different types of hierarchies, corresponding to class membership or part membership. Respectively, we show promising results with real data on two specific problems: Out Of Vocabulary words in speech recognition, and the identification of a new sub-class (e.g., the face of a new individual) in audio-visual facial object recognition.
6 0.4797653 4 nips-2008-A Scalable Hierarchical Distributed Language Model
7 0.46881667 191 nips-2008-Recursive Segmentation and Recognition Templates for 2D Parsing
8 0.46076342 11 nips-2008-A spatially varying two-sample recombinant coalescent, with applications to HIV escape response
9 0.45511624 221 nips-2008-Stochastic Relational Models for Large-scale Dyadic Data using MCMC
10 0.44313157 50 nips-2008-Continuously-adaptive discretization for message-passing algorithms
11 0.442215 105 nips-2008-Improving on Expectation Propagation
12 0.44143569 24 nips-2008-An improved estimator of Variance Explained in the presence of noise
13 0.43234709 71 nips-2008-Efficient Sampling for Gaussian Process Inference using Control Variables
14 0.42847848 70 nips-2008-Efficient Inference in Phylogenetic InDel Trees
15 0.42461297 188 nips-2008-QUIC-SVD: Fast SVD Using Cosine Trees
16 0.42347911 30 nips-2008-Bayesian Experimental Design of Magnetic Resonance Imaging Sequences
17 0.4229506 169 nips-2008-Online Models for Content Optimization
18 0.41682056 176 nips-2008-Partially Observed Maximum Entropy Discrimination Markov Networks
19 0.40702069 134 nips-2008-Mixed Membership Stochastic Blockmodels
20 0.39640504 139 nips-2008-Modeling the effects of memory on human online sentence processing with particle filters
topicId topicWeight
[(6, 0.043), (7, 0.079), (12, 0.043), (15, 0.011), (28, 0.166), (57, 0.057), (63, 0.02), (77, 0.033), (78, 0.412), (83, 0.038)]
simIndex simValue paperId paperTitle
1 0.88374704 52 nips-2008-Correlated Bigram LSA for Unsupervised Language Model Adaptation
Author: Yik-cheung Tam, Tanja Schultz
Abstract: We present a correlated bigram LSA approach for unsupervised LM adaptation for automatic speech recognition. The model is trained using efficient variational EM and smoothed using the proposed fractional Kneser-Ney smoothing which handles fractional counts. We address the scalability issue to large training corpora via bootstrapping of bigram LSA from unigram LSA. For LM adaptation, unigram and bigram LSA are integrated into the background N-gram LM via marginal adaptation and linear interpolation respectively. Experimental results on the Mandarin RT04 test set show that applying unigram and bigram LSA together yields 6%–8% relative perplexity reduction and 2.5% relative character error rate reduction which is statistically significant compared to applying only unigram LSA. On the large-scale evaluation on Arabic, 3% relative word error rate reduction is achieved which is also statistically significant. 1
2 0.86971861 111 nips-2008-Kernel Change-point Analysis
Author: Zaïd Harchaoui, Eric Moulines, Francis R. Bach
Abstract: We introduce a kernel-based method for change-point analysis within a sequence of temporal observations. Change-point analysis of an unlabelled sample of observations consists in, first, testing whether a change in the distribution occurs within the sample, and second, if a change occurs, estimating the change-point instant after which the distribution of the observations switches from one distribution to another different distribution. We propose a test statistic based upon the maximum kernel Fisher discriminant ratio as a measure of homogeneity between segments. We derive its limiting distribution under the null hypothesis (no change occurs), and establish the consistency under the alternative hypothesis (a change occurs). This allows to build a statistical hypothesis testing procedure for testing the presence of a change-point, with a prescribed false-alarm probability and detection probability tending to one in the large-sample setting. If a change actually occurs, the test statistic also yields an estimator of the change-point location. Promising experimental results in temporal segmentation of mental tasks from BCI data and pop song indexation are presented. 1
3 0.85594571 124 nips-2008-Load and Attentional Bayes
Author: Peter Dayan
Abstract: Selective attention is a most intensively studied psychological phenomenon, rife with theoretical suggestions and schisms. A critical idea is that of limited capacity, the allocation of which has produced continual conflict about such phenomena as early and late selection. An influential resolution of this debate is based on the notion of perceptual load (Lavie, 2005), which suggests that low-load, easy tasks, because they underuse the total capacity of attention, mandatorily lead to the processing of stimuli that are irrelevant to the current attentional set; whereas high-load, difficult tasks grab all resources for themselves, leaving distractors high and dry. We argue that this theory presents a challenge to Bayesian theories of attention, and suggest an alternative, statistical, account of key supporting data. 1
same-paper 4 0.80489194 82 nips-2008-Fast Computation of Posterior Mode in Multi-Level Hierarchical Models
Author: Liang Zhang, Deepak Agarwal
Abstract: Multi-level hierarchical models provide an attractive framework for incorporating correlations induced in a response variable that is organized hierarchically. Model fitting is challenging, especially for a hierarchy with a large number of nodes. We provide a novel algorithm based on a multi-scale Kalman filter that is both scalable and easy to implement. For Gaussian response, we show our method provides the maximum a-posteriori (MAP) parameter estimates; for non-Gaussian response, parameter estimation is performed through a Laplace approximation. However, the Laplace approximation provides biased parameter estimates that are corrected through a parametric bootstrap procedure. We illustrate through simulation studies and analyses of real-world data sets in health care and online advertising.
5 0.51511234 216 nips-2008-Sparse probabilistic projections
Author: Cédric Archambeau, Francis R. Bach
Abstract: We present a generative model for performing sparse probabilistic projections, which includes sparse principal component analysis and sparse canonical correlation analysis as special cases. Sparsity is enforced by means of automatic relevance determination or by imposing appropriate prior distributions, such as generalised hyperbolic distributions. We derive a variational Expectation-Maximisation algorithm for the estimation of the hyperparameters and show that our novel probabilistic approach compares favourably to existing techniques. We illustrate how the proposed method can be applied in the context of cryptoanalysis as a preprocessing tool for the construction of template attacks. 1
6 0.50501031 147 nips-2008-Multiscale Random Fields with Application to Contour Grouping
7 0.50389814 36 nips-2008-Beyond Novelty Detection: Incongruent Events, when General and Specific Classifiers Disagree
8 0.50366426 20 nips-2008-An Extended Level Method for Efficient Multiple Kernel Learning
9 0.5000838 98 nips-2008-Hierarchical Semi-Markov Conditional Random Fields for Recursive Sequential Data
10 0.49161783 169 nips-2008-Online Models for Content Optimization
11 0.4860225 4 nips-2008-A Scalable Hierarchical Distributed Language Model
12 0.48378131 66 nips-2008-Dynamic visual attention: searching for coding length increments
13 0.47612974 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization
14 0.47385728 231 nips-2008-Temporal Dynamics of Cognitive Control
15 0.47378847 15 nips-2008-Adaptive Martingale Boosting
16 0.47092655 10 nips-2008-A rational model of preference learning and choice prediction by children
17 0.47059208 167 nips-2008-One sketch for all: Theory and Application of Conditional Random Sampling
18 0.46978483 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
19 0.46935949 176 nips-2008-Partially Observed Maximum Entropy Discrimination Markov Networks
20 0.46920282 142 nips-2008-Multi-Level Active Prediction of Useful Image Annotations for Recognition