nips nips2012 nips2012-58 knowledge-graph by maker-knowledge-mining

58 nips-2012-Bayesian models for Large-scale Hierarchical Classification


Source: pdf

Author: Siddharth Gopal, Yiming Yang, Bing Bai, Alexandru Niculescu-mizil

Abstract: A challenging problem in hierarchical classification is to leverage the hierarchical relations among classes for improving classification performance. An even greater challenge is to do so in a manner that is computationally feasible for large scale problems. This paper proposes a set of Bayesian methods to model hierarchical dependencies among class labels using multivariate logistic regression. Specifically, the parent-child relationships are modeled by placing a hierarchical prior over the children nodes centered around the parameters of their parents; thereby encouraging classes nearby in the hierarchy to share similar model parameters. We present variational algorithms for tractable posterior inference in these models, and provide a parallel implementation that can comfortably handle largescale problems with hundreds of thousands of dimensions and tens of thousands of classes. We run a comparative evaluation on multiple large-scale benchmark datasets that highlights the scalability of our approach and shows improved performance over the other state-of-the-art hierarchical methods. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Carnegie Mellon University; NEC Laboratories America, Princeton. Abstract: A challenging problem in hierarchical classification is to leverage the hierarchical relations among classes for improving classification performance. [sent-6, score-0.508]

2 This paper proposes a set of Bayesian methods to model hierarchical dependencies among class labels using multivariate logistic regression. [sent-8, score-0.349]

3 Specifically, the parent-child relationships are modeled by placing a hierarchical prior over the children nodes centered around the parameters of their parents; thereby encouraging classes nearby in the hierarchy to share similar model parameters. [sent-9, score-0.777]

4 We present variational algorithms for tractable posterior inference in these models, and provide a parallel implementation that can comfortably handle largescale problems with hundreds of thousands of dimensions and tens of thousands of classes. [sent-10, score-0.51]

5 We run a comparative evaluation on multiple large-scale benchmark datasets that highlights the scalability of our approach and shows improved performance over the other state-of-the-art hierarchical methods. [sent-11, score-0.249]

6 1 Introduction With the tremendous growth of data, providing a multi-granularity conceptual view using hierarchical classification (HC) has become increasingly important. [sent-12, score-0.201]

7 The large hierarchical structures present both challenges and opportunities for statistical classification research. [sent-15, score-0.201]

8 Instead of focusing on individual classes in isolation, we need to address joint training and inference based on the hierarchical dependencies among the classes. [sent-16, score-0.367]

9 In this paper, we investigate a Bayesian framework for leveraging the hierarchical class structure. [sent-18, score-0.235]

10 The Bayesian framework is a natural fit for this problem as it can seamlessly capture the idea that the models at the lower levels of the hierarchy are specialization of models at the ancestor nodes. [sent-19, score-0.268]

11 We define a hierarchical Bayesian model where the prior distribution for the parameters at a node is a Gaussian centered at the parameters of the parent node. [sent-20, score-0.599]

12 This prior encourages the parameters of nodes that are close in the hierarchy to be similar thereby enabling propagation of information across the hierarchical structure and leading to inductive transfer (sharing statistical strength) among the models corresponding to the different nodes. [sent-21, score-0.621]

13 Modelling the covariance structures gives us the flexibility to incorporate different ways of sharing information in the hierarchy. [sent-23, score-0.142]

14 For example, consider a hierarchical organization of all animals with two sub-topics mammals and birds. [sent-24, score-0.288]

15 As another example, the model can incorporate children-specific covariances that allow some sub-topic parameters to be less similar to their parent and some to be more similar; e.g. [sent-26, score-0.182]

16 the sub-topic whales is quite distinct from its parent mammals compared to its siblings felines and primates. [sent-28, score-0.354]

17 Formulating such constraints in non-Bayesian large-margin approaches is not as easy, and to our knowledge has not been done before in the context of hierarchical classification. [sent-29, score-0.201]

18 Our approach shares similarity to the correlated Multinomial logit [18] (corrMNL) in taking a Bayesian approach to model the hierarchical class structure, but improves over it in two significant aspects - scalability and setting hyperparameters. [sent-31, score-0.283]

19 The approximate variational inference (1 plus 2) reduces the computation time by several orders of magnitude (750x) over MCMC, and the parallel implementation in a Hadoop cluster [4] further improves the time almost linearly in the number of processors. [sent-34, score-0.291]

20 These enabled us to comfortably conduct joint posterior inference for hierarchical logistic regression models with tens of thousands of categories and hundreds of thousands of features. [sent-35, score-0.576]

21 To evaluate the proposed techniques we run a comprehensive empirical study on several large scale hierarchical classification problems. [sent-41, score-0.201]

22 The results show that our approach is able to leverage the class hierarchy and obtain a significant performance boost over leading non-Bayesian hierarchical classification methods, as well as consistently outperform flat methods that do not use the hierarchy information. [sent-42, score-0.767]

23 Some of the early works in HC [10, 14] use the hierarchical structure to decompose the classification problem into sub-problems recursively along the hierarchy and allocate a classifier at each node. [sent-44, score-0.429]

24 The hierarchy is used to partition the training data into node-specific subsets and classifiers at each node are trained independently without using the hierarchy any further. [sent-45, score-0.595]

25 Many approaches have been proposed to better utilize the hierarchical structure. [sent-46, score-0.201]

26 Smoothing the estimated parameters in naive Bayes classifiers along each path from the root to a leaf node has been tried in [17]. [sent-48, score-0.226]

27 [20, 6] proposed large-margin discriminative methods where the discriminant function at each node takes the contributions from all nodes along the path to the root node, and the parameters are jointly learned to minimize a global loss over the hierarchy. [sent-49, score-0.27]

28 Recently, enforcing orthogonality constraints between parent and children classifiers was shown to achieve state-of-the-art performance [23]. [sent-50, score-0.267]

29 2 The Hierarchical Bayesian Logistic Regression (HBLR) Framework: Define a hierarchy as a set of nodes Y = {1, 2, ...} [sent-51, score-0.313]

30 with the parent relationship π : Y → Y, where π(y) is the parent of node y ∈ Y. [sent-54, score-0.381]

31 Let D = {(x_i, t_i)}_{i=1}^N denote the training data, where x_i ∈ R^d is an instance and t_i ∈ T is a label, with T ⊂ Y the set of leaf nodes in the hierarchy, labeled from 1 to |T|. [sent-55, score-0.414]
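As a concrete illustration of this setup, here is a minimal sketch in Python; the toy hierarchy, dimensions, and variable names are illustrative assumptions, not the authors' code:

import numpy as np

# Hypothetical 5-node hierarchy: node 0 is the root, 1 and 2 are internal, 3 and 4 are leaves.
# pi[y] gives the parent of node y; the root has no entry.
pi = {1: 0, 2: 0, 3: 1, 4: 2}
nodes = [0, 1, 2, 3, 4]
leaves = [y for y in nodes if y not in pi.values()]   # T: nodes that are nobody's parent

d, N = 10, 100                        # feature dimension and number of training instances
rng = np.random.default_rng(0)
X = rng.normal(size=(N, d))           # instances x_i in R^d
t = rng.choice(leaves, size=N)        # each instance is assigned to exactly one leaf node
print("leaf nodes T:", leaves)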

32 We assume that each instance is assigned to one of the leaf nodes in the hierarchy. [sent-56, score-0.191]

33 For each node y ∈ Y, we associate a parameter vector wy which has a Gaussian prior. [sent-58, score-0.58]

34 We set the mean of the prior to the parameter of the parent node, wπ(y) . [sent-59, score-0.197]

35 Different constraints on the covariance matrix of the prior correspond to different ways of propagating information across the hierarchy. [sent-60, score-0.166]

36 In what follows, we consider three alternate ways to model the covariance matrix which we call M1, M2 and M3 variants of HBLR. [sent-61, score-0.133]

37 In the M1 variant all the siblings share the same spherical covariance matrix. [sent-62, score-0.136]

38 Formally, the generative model for M1 is: w_root ∼ N(w_0, Σ_0), α_root ∼ Γ(a_0, b_0); w_y | w_{π(y)}, Σ_{π(y)} ∼ N(w_{π(y)}, Σ_{π(y)}) ∀y, α_y ∼ Γ(a_y, b_y) ∀y ∉ T; t | x ∼ Multinomial(p_1(x), p_2(x), ..., p_{|T|}(x)) ∀(x, t) ∈ D, with p_i(x) = exp(w_i^T x) / Σ_{t'∈T} exp(w_{t'}^T x).   (1) [sent-63, score-0.471]

39 The parameters of the root node are drawn using user-specified parameters w_0, Σ_0, a_0, b_0. [sent-65, score-0.201]
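To see how the generative story in (1) unrolls, the following sketch performs ancestral sampling down a toy hierarchy; the hierarchy, hyperparameter values, and variable names are assumptions for illustration only, not the authors' implementation:

import numpy as np

rng = np.random.default_rng(0)
d = 10
pi = {1: 0, 2: 0, 3: 1, 4: 2}         # parent map; node 0 is the root, 3 and 4 are leaves
leaves = [3, 4]
w0, Sigma0 = np.zeros(d), np.eye(d)   # illustrative root hyperparameters
a0 = b0 = 1.0                         # illustrative Gamma hyperparameters (shape, inverse scale)

w = {0: rng.multivariate_normal(w0, Sigma0)}                          # w_root ~ N(w_0, Sigma_0)
alpha = {y: rng.gamma(shape=a0, scale=1.0 / b0) for y in [0, 1, 2]}   # one alpha per non-leaf node

# Draw w_y ~ N(w_{pi(y)}, alpha_{pi(y)}^{-1} I), visiting parents before children.
for y in sorted(pi):
    w[y] = rng.multivariate_normal(w[pi[y]], np.eye(d) / alpha[pi[y]])

# A label is drawn from a soft-max over the leaf-node scores w_y^T x.
x = rng.normal(size=d)
scores = np.array([w[y] @ x for y in leaves])
p = np.exp(scores - scores.max()); p /= p.sum()
print("p(t|x) over leaves:", dict(zip(leaves, p.round(3))), " sampled t:", rng.choice(leaves, p=p))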

40 Each non-leaf node y ∉ T has its own α_y drawn from a Gamma with the shape and inverse-scale parameters specified by a_y and b_y. [sent-66, score-0.311]

41 Each w_y is drawn from the Normal with mean w_{π(y)} and covariance matrix Σ_{π(y)} = α_{π(y)}^{-1} I. [sent-67, score-0.548]

42 The class-labels are drawn from a Multinomial whose parameters are a soft-max transformation of the wy s from the leaf nodes. [sent-68, score-0.588]

43 This model leverages the class hierarchy information by encouraging the parameters of closely related nodes (parents, children and siblings) to be more similar to each other than those of distant ones in the hierarchy. [sent-69, score-0.487]

44 This allows the degree of similarity between parameters (of parent and children nodes) to be adapted on a per-family basis. [sent-72, score-0.23]

45 For instance, it can learn that sibling nodes which are higher in the hierarchy (e.g. [sent-73, score-0.387]

46 mammals and birds) are generally less similar compared to sibling nodes lower in the hierarchy. [sent-75, score-0.439]

47 Although this model is equivalent to the corrMNL proposed in [18], the hierarchical logistic regression formulation is different from corrMNL and has a distinct advantage that the parameters can be decoupled. [sent-78, score-0.329]

48 As we shall see in Section 3, this enables the use of scalable and parallelizable variational inference algorithms. [sent-79, score-0.337]

49 In contrast, in corrMNL the soft-max parameters are modeled as a sum of contributions along the path from a leaf to the root-node. [sent-80, score-0.117]

50 This introduces two layers of dependencies between the parameters in the corrMNL model (inside the normalization constant as well as along the path from leaves to root-node), which makes it less amenable to efficient variational inference. [sent-81, score-0.246]

51 Even if one were to develop a variational approach for the corrMNL parameterization, it would be slower and not efficient for parallelization. [sent-82, score-0.168]

52 In our previous example with sub-topics mammals and birds, we may want wmammals, wbirds to be commonly close to their parent in some dimensions but not necessarily in others. [sent-84, score-0.223]

53 For the HC problem, we define the M2 variant of the HBLR approach as: w_y | w_{π(y)}, Σ_{π(y)} ∼ N(w_{π(y)}, Σ_{π(y)}) ∀y; α_y^{(i)} ∼ Γ(a^{(i)}, b^{(i)}), i = 1, ..., d, ∀y ∉ T; Σ_{π(y)}^{-1} = diag(α_{π(y)}^{(1)}, α_{π(y)}^{(2)}, ..., α_{π(y)}^{(d)}). [sent-91]

54 Yet another extension of the M1 model would be to allow each node to have its own covariance matrix for the Gaussian prior over w_y, not shared with its siblings. [sent-96, score-0.718]

55 This enables the model to learn how much the individual children nodes differ from the parent node. [sent-97, score-0.315]

56 For example, consider topic mammals and its two sub-topics whales and carnivores; the sub-topic whales is very distinct from a typical mammal and is more of an ‘outlier’ topic. [sent-98, score-0.231]

57 M3: w_y | w_{π(y)}, Σ_y ∼ N(w_{π(y)}, Σ_y) ∀y; α_y ∼ Γ(a_y, b_y) ∀y ∉ T. Note that the only difference between M3 and M1 is that M3 uses Σ_y = α_y^{-1} I instead of Σ_{π(y)} in the prior for w_y. [sent-101, score-1.003]
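Schematically, the three variants differ only in how this prior covariance is constructed; the precision values in the sketch below are arbitrary placeholders used purely for illustration:

import numpy as np

d = 4
alpha_parent = 2.0                                   # M1: one scalar precision per family
alpha_parent_vec = np.array([2.0, 0.5, 4.0, 1.0])    # M2: one precision per dimension, per family
alpha_child = 8.0                                    # M3: one scalar precision owned by the child

Sigma_M1 = np.eye(d) / alpha_parent            # spherical, shared by all siblings
Sigma_M2 = np.diag(1.0 / alpha_parent_vec)     # diagonal, shared by all siblings
Sigma_M3 = np.eye(d) / alpha_child             # spherical, but specific to the child node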

58 However, variational methods are themselves computationally intractable in high-dimensional scenarios due to the requirement of a matrix inversion, which is computationally intensive. [sent-108, score-0.196]

59 Therefore, we explore much faster approximation schemes such as partial MAP inference which are highly scalable. [sent-109, score-0.126]

60 Finally, we show the resulting approximate variational inference procedure can be parallelized in a map-reduce framework to tackle large-scale problems that would be impossible to solve on a single processor. [sent-110, score-0.275]

61 Variational Inference: Starting with a simple factored form for the posterior, we seek a distribution q that is closest in KL divergence to the true posterior p. [sent-112, score-0.135]

62 We use independent Gaussian q(w_y) and Gamma q(α_y) posterior distributions for w_y and α_y per node as the factored representation: q(W, α) = ∏_{y∈Y} q(w_y) ∏_{y∈Y\T} ∏_{i=1}^{d} q(α_y^{(i)}), with q(w_y) = N(μ_y, Ψ_y) and q(α_y^{(i)}) = Γ(τ_y^{(i)}, υ_y^{(i)}). [sent-113]
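One convenient way to organize these per-node variational parameters in code is a small container such as the sketch below; the attribute names mirror μ_y, Ψ_y, τ_y, υ_y, but the structure itself is an assumption, not the authors' implementation:

from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class NodePosterior:
    mu: np.ndarray                        # mean of q(w_y)
    psi: np.ndarray                       # covariance of q(w_y) (kept diagonal under partial MAP)
    tau: Optional[np.ndarray] = None      # Gamma shape parameters of q(alpha_y); non-leaf nodes only
    upsilon: Optional[np.ndarray] = None  # Gamma inverse-scale parameters of q(alpha_y); non-leaf only

d = 10
q = {y: NodePosterior(mu=np.zeros(d), psi=np.eye(d)) for y in range(5)}   # one factor per node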

63 For every (x, y) we introduce variational parameters βx and ξxy . [sent-116, score-0.214]

64 We now derive an EM algorithm that computes the posterior in the E-step and maximizes the variational parameters in the M-step. [sent-117, score-0.309]

65 Variational E-Step: The local variational parameters are fixed, and the posterior for a parameter is computed by matching the log-likelihood of the posterior with the expectation of log-likelihood under the rest of the parameters. [sent-118, score-0.404]

66 The variational M-step gives the closed-form update β_x = (0.5(0.5|T| − 1) + Σ_{y∈T} λ(ξ_xy) μ_y^T x) / Σ_{y∈T} λ(ξ_xy). Class-label Prediction: After computing the posterior, one way to compute the probability of a target class-label given a test instance is to simply plug in the posterior mean for prediction. [sent-122, score-0.13]
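A minimal sketch of this plug-in rule, assuming the posterior means mu[y] and a test vector x are already available (the names are illustrative, not the authors' API):

import numpy as np

def predict_plugin(mu, x, leaves):
    """Plug-in prediction: soft-max of mu_y^T x over the leaf nodes y in T."""
    scores = np.array([mu[y] @ x for y in leaves])
    p = np.exp(scores - scores.max())          # subtract max for numerical stability
    return dict(zip(leaves, p / p.sum()))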

67 We take an alternative route by calculating the joint posterior p(l, W|x) by variational approximations. [sent-128, score-0.291]

68 q̃(l, W) ≡ ∏_{y∈T} q̃(w_y) p̃(l_y | μ̃_y). The posterior can be calculated as before, by introducing variational parameters ξ̃_xy, β̃_x and matching the log likelihoods. [sent-131]

69 In such scenarios, we split the inference into two stages, first calculating the posterior of wy using MAP solution, and second calculating the posterior of αy . [sent-135, score-0.791]

70 In the first stage, we find the MAP estimate w_y^{map} and then use a Laplace approximation to approximate the posterior using a separate Normal distribution for each dimension, thereby leading to a diagonal covariance matrix. [sent-136]

71 Note that due to the Laplace approximation, w_y^{map} and the posterior mean μ_y coincide. [sent-137, score-0.63]

72 μ_y = w_y^{map} = arg max_W [ −(1/2) Σ_y (w_y − w_{π(y)})^T diag(τ_{π(y)}/υ_{π(y)}) (w_y − w_{π(y)}) + log p(D|W, α) ]   (6), and the diagonal Laplace precisions are (Ψ_y^{(i,i)})^{-1} = Σ_{(x,t)∈D_y} x^{(i)} p_xy (1 − p_xy) x^{(i)}, where p_xy is the probability that training instance x is labeled as y. [sent-138]
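In code, the partial MAP step for a single node reduces to a logistic-regression-style fit with a quadratic penalty pulling w_y toward its parent, followed by a diagonal Laplace approximation. The sketch below uses a binary-logistic simplification (as in the parallel implementation described later) and a generic prior precision vector prec standing in for diag(τ_{π(y)}/υ_{π(y)}); it is an illustrative approximation, not the authors' exact update:

import numpy as np
from scipy.optimize import minimize

def partial_map_node(X, y01, w_parent, prec):
    """MAP estimate of w_y under a Gaussian prior centred at the parent,
    plus a diagonal Laplace approximation of its posterior covariance."""
    def neg_log_post(w):
        z = X @ w
        log_lik = np.sum(y01 * z - np.logaddexp(0.0, z))       # binary logistic log-likelihood
        log_prior = -0.5 * np.sum(prec * (w - w_parent) ** 2)  # Gaussian prior centred at parent
        return -(log_lik + log_prior)

    w_map = minimize(neg_log_post, x0=w_parent, method="L-BFGS-B").x
    p = 1.0 / (1.0 + np.exp(-(X @ w_map)))
    # Curvature of the negative log posterior, kept diagonal as in the Laplace step above.
    precision_diag = prec + np.sum((X ** 2) * (p * (1 - p))[:, None], axis=0)
    return w_map, 1.0 / precision_diag       # posterior mean mu_y and diagonal variances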

73 Full MAP inference is also possible by performing an alternating maximization between wy , αy but we do not recommend it as there is no gain in scalability compared to partial MAP Inference and it loses the posterior distribution of αy . [sent-141, score-0.74]

74 The Ψy , τy , νy can be updated in parallel for each node using (3),(4). [sent-146, score-0.186]

75 To make it parallelizable we replace the soft-max function in (1) with multiple binary logistic functions (one for each terminal node), which removes the coupling of parameters inside the log-normalization constant. [sent-148, score-0.212]

76 Secondly, note that the interactions between the wy ’s are only through the parent and child nodes. [sent-150, score-0.607]

77 By fixing the parameters of the parent and children, the parameter wy of a node can be optimized independently of the rest of the hierarchy. [sent-151, score-0.762]

78 One simple way to parallelize is to traverse the hierarchy level by level, optimize the parameters at each level in parallel, and iterate until convergence. [sent-152, score-0.306]
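A sketch of that level-by-level scheme follows; update_node is a stand-in for the per-node optimization above, and the toy hierarchy and pool configuration are assumptions for illustration:

from concurrent.futures import ProcessPoolExecutor

def depth(y, pi):
    d = 0
    while y in pi:             # walk up the parent map until the root is reached
        y = pi[y]
        d += 1
    return d

def group_by_level(nodes, pi):
    by_level = {}
    for y in nodes:
        by_level.setdefault(depth(y, pi), []).append(y)
    return [by_level[k] for k in sorted(by_level)]

def update_node(y):
    # Stand-in for optimizing w_y with its parent and children held fixed
    # (e.g. the partial-MAP step sketched earlier).
    return y

if __name__ == "__main__":
    pi = {1: 0, 2: 0, 3: 1, 4: 2}
    nodes = [0, 1, 2, 3, 4]
    for _ in range(3):                                # in practice, iterate until convergence
        for level in group_by_level(nodes, pi):
            with ProcessPoolExecutor() as pool:
                list(pool.map(update_node, level))    # nodes within a level run in parallel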

79 The parallel implementation uses a key-value store for fast retrieve-update of the w_y's. [sent-161]

80 The a_y, b_y are variance components such that b_y^{(i)}/a_y^{(i)} represents the expected variance of the w_y^{(i)}. [sent-166, score-0.783]
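Since α_y^{(i)} ∼ Γ(a^{(i)}, b^{(i)}) with b an inverse scale, the mean precision is a/b, so b/a acts as the expected variance scale; one can therefore pick (a, b) to match a target variance, as in this toy illustration (the target value and shape are arbitrary assumptions):

import numpy as np

target_var = 0.5                  # desired variance scale for each dimension of w_y
a = 2.0                           # shape: larger values concentrate alpha around its mean
b = a * target_var                # inverse scale chosen so that b / a equals the target

rng = np.random.default_rng(0)
alphas = rng.gamma(shape=a, scale=1.0 / b, size=100_000)
print("1 / E[alpha] ~", 1.0 / alphas.mean())      # close to target_var = b / a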

81 Now, the prior on the covariance of w_y can be set such that the expected covariance is Î^{-1}. [sent-179, score-0.686]

82 We also tried other popular strategies such as setting improper Gamma priors Γ(ε, ε) with ε → 0, widely used in many ARD works (which is equivalent to using type-2 ML for the α's if one uses variational methods [2]), and Empirical Bayes using a single a and b (as well as other Empirical Bayes variants). [sent-182]

83 First, to evaluate the speed advantage of the variational inference, we compare the full variational ({M1,M2,M3}-var) and partial MAP ({M1,M2,M3}-map) inference for the three variants of HBLR to the MCMC sampling-based inference of CorrMNL [18]. [sent-186]

84 Re-starts with different initialization values gave the same results for both MCMC and variational methods. [sent-241, score-0.168]

85 With regards to scalability, partial MAP inference is the most scalable method being orders of magnitude faster (750x) than CorrMNL. [sent-246, score-0.171]

86 Full variational inference, although less scalable as it requires O(d3 ) matrix inversions in the feature space, is still orders of magnitude faster (20x) than CorrMNL. [sent-247, score-0.213]

87 In terms of performance, we see that the partial MAP inference for the HBLR has only small loss in performance compared to the full variational inference while having similar training time to the flat approach that does not model the hierarchy ({M1,M2,M3}-flat). [sent-248, score-0.626]

88 Hierarchical Baselines: We selected 3 representative hierarchical methods that have been shown to have state-of-the-art performance: Hierarchical SVM [6] (HSVM), a large-margin discriminative method with a path-dependent discriminant function. [sent-250]

89 Orthogonal Transfer [23] (OT), a method enforcing orthogonality constraints between the parent node and its children; and Top-down Classification [14] (TD), top-down decision making using binary SVMs trained at each node. [sent-251]

90 For the HBLR models, we used partial MAP Inference because full variational is not scalable to high dimensions. [sent-256, score-0.265]

91 Compared to the other hierarchical baselines, M3 achieves significantly higher performance on all datasets, showing that the Bayesian approach is able to leverage the information provided in the class hierarchy. [sent-267]

92 Surprisingly, the hierarchical baselines (HSVM, TD and OT) experience a very large drop in performance on LSHTC-small when compared to the flat baselines, indicating that the hierarchy information actually misled these methods rather than helping them. [sent-355, score-0.498]

93 In particular, M3 performs significantly better on the largest datasets, especially in Macro-F1 , showing that even very large class hierarchies can convey very useful information, and highlighting the importance of having a scalable, parallelizable hierarchical classification algorithm. [sent-357, score-0.337]

94 We expect the hierarchy to be most useful in such cases as it enables sharing of information between class parameters. [sent-359, score-0.299]

95 We repeated the same experiments on the NEWS20 dataset but did not find improved performance even with limited training examples, suggesting that the hierarchical methods are not able to leverage the hierarchical structure of NEWS20. [sent-364, score-0.508]

96 6 Conclusion In this paper, we presented the HBLR approach to hierarchical classification, focusing on scalable ways to leverage hierarchical dependencies among classes in a joint framework. [sent-365, score-0.584]

97 Using a Gaussian prior with informative mean and covariance matrices, along with fast variational methods, and a practical way to set hyperparameters, HBLR significantly outperformed other popular HC methods on multiple benchmark datasets. [sent-366, score-0.306]

98 We hope this study provides useful insights into how hierarchical relationships can be successfully leveraged in large-scale HC. [sent-367, score-0.201]

99 Improving text classification by shrinkage in a hierarchy of classes. [sent-472, score-0.228]

100 Improving classification when a class hierarchy is available using a hierarchy-based prior. [sent-478, score-0.262]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('wy', 0.471), ('corrmnl', 0.288), ('hblr', 0.264), ('hierarchy', 0.228), ('hierarchical', 0.201), ('variational', 0.168), ('clef', 0.168), ('ay', 0.156), ('bsvm', 0.144), ('parent', 0.136), ('blr', 0.12), ('mlr', 0.12), ('xy', 0.117), ('hc', 0.117), ('node', 0.109), ('msvm', 0.106), ('hadoop', 0.096), ('hsvm', 0.096), ('posterior', 0.095), ('children', 0.094), ('ot', 0.093), ('cy', 0.093), ('mammals', 0.087), ('nodes', 0.085), ('logistic', 0.082), ('covariance', 0.077), ('td', 0.075), ('inference', 0.074), ('whales', 0.072), ('leaf', 0.071), ('classi', 0.07), ('nec', 0.07), ('baselines', 0.069), ('map', 0.064), ('ipc', 0.063), ('prior', 0.061), ('pxy', 0.059), ('siblings', 0.059), ('bayesian', 0.057), ('diag', 0.056), ('fisher', 0.054), ('hierarchies', 0.052), ('partial', 0.052), ('ard', 0.05), ('parallelizable', 0.05), ('parallel', 0.049), ('scalability', 0.048), ('claw', 0.048), ('feathers', 0.048), ('leverage', 0.047), ('parallelization', 0.047), ('parameters', 0.046), ('scalable', 0.045), ('xx', 0.043), ('yiming', 0.042), ('comfortably', 0.042), ('odd', 0.041), ('thousands', 0.041), ('bayes', 0.041), ('factored', 0.04), ('sigir', 0.04), ('parents', 0.04), ('levels', 0.04), ('sibling', 0.039), ('taxonomies', 0.039), ('sharing', 0.037), ('orthogonality', 0.037), ('birds', 0.037), ('directory', 0.037), ('instance', 0.035), ('flat', 0.035), ('inside', 0.034), ('class', 0.034), ('eyes', 0.033), ('parallelized', 0.033), ('parallelizing', 0.033), ('secondly', 0.033), ('dependencies', 0.032), ('cation', 0.032), ('placing', 0.032), ('laboratories', 0.032), ('parallelize', 0.032), ('mcmc', 0.032), ('gamma', 0.031), ('discriminative', 0.03), ('training', 0.03), ('losing', 0.03), ('classes', 0.03), ('rstly', 0.029), ('dw', 0.029), ('dataset', 0.029), ('improving', 0.029), ('consistently', 0.029), ('labs', 0.028), ('requirement', 0.028), ('calculating', 0.028), ('variants', 0.028), ('updated', 0.028), ('ways', 0.028), ('gk', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 58 nips-2012-Bayesian models for Large-scale Hierarchical Classification

Author: Siddharth Gopal, Yiming Yang, Bing Bai, Alexandru Niculescu-mizil

Abstract: A challenging problem in hierarchical classification is to leverage the hierarchical relations among classes for improving classification performance. An even greater challenge is to do so in a manner that is computationally feasible for large scale problems. This paper proposes a set of Bayesian methods to model hierarchical dependencies among class labels using multivariate logistic regression. Specifically, the parent-child relationships are modeled by placing a hierarchical prior over the children nodes centered around the parameters of their parents; thereby encouraging classes nearby in the hierarchy to share similar model parameters. We present variational algorithms for tractable posterior inference in these models, and provide a parallel implementation that can comfortably handle largescale problems with hundreds of thousands of dimensions and tens of thousands of classes. We run a comparative evaluation on multiple large-scale benchmark datasets that highlights the scalability of our approach and shows improved performance over the other state-of-the-art hierarchical methods. 1

2 0.15490973 71 nips-2012-Co-Regularized Hashing for Multimodal Data

Author: Yi Zhen, Dit-Yan Yeung

Abstract: Hashing-based methods provide a very promising approach to large-scale similarity search. To obtain compact hash codes, a recent trend seeks to learn the hash functions from data automatically. In this paper, we study hash function learning in the context of multimodal data. We propose a novel multimodal hash function learning method, called Co-Regularized Hashing (CRH), based on a boosted coregularization framework. The hash functions for each bit of the hash codes are learned by solving DC (difference of convex functions) programs, while the learning for multiple bits proceeds via a boosting procedure so that the bias introduced by the hash functions can be sequentially minimized. We empirically compare CRH with two state-of-the-art multimodal hash function learning methods on two publicly available data sets. 1

3 0.12968615 207 nips-2012-Mandatory Leaf Node Prediction in Hierarchical Multilabel Classification

Author: Wei Bi, James T. Kwok

Abstract: In hierarchical classification, the prediction paths may be required to always end at leaf nodes. This is called mandatory leaf node prediction (MLNP) and is particularly useful when the leaf nodes have much stronger semantic meaning than the internal nodes. However, while there have been a lot of MLNP methods in hierarchical multiclass classification, performing MLNP in hierarchical multilabel classification is much more difficult. In this paper, we propose a novel MLNP algorithm that (i) considers the global hierarchy structure; and (ii) can be used on hierarchies of both trees and DAGs. We show that one can efficiently maximize the joint posterior probability of all the node labels by a simple greedy algorithm. Moreover, this can be further extended to the minimization of the expected symmetric loss. Experiments are performed on a number of real-world data sets with tree- and DAG-structured label hierarchies. The proposed method consistently outperforms other hierarchical and flat multilabel classification methods. 1

4 0.10230425 127 nips-2012-Fast Bayesian Inference for Non-Conjugate Gaussian Process Regression

Author: Emtiyaz Khan, Shakir Mohamed, Kevin P. Murphy

Abstract: We present a new variational inference algorithm for Gaussian process regression with non-conjugate likelihood functions, with application to a wide array of problems including binary and multi-class classification, and ordinal regression. Our method constructs a concave lower bound that is optimized using an efficient fixed-point updating algorithm. We show that the new algorithm has highly competitive computational complexity, matching that of alternative approximate inference methods. We also prove that the use of concave variational bounds provides stable and guaranteed convergence – a property not available to other approaches. We show empirically for both binary and multi-class classification that our new algorithm converges much faster than existing variational methods, and without any degradation in performance. 1

5 0.10127519 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

Author: Chong Wang, David M. Blei

Abstract: We present a truncation-free stochastic variational inference algorithm for Bayesian nonparametric models. While traditional variational inference algorithms require truncations for the model or the variational distribution, our method adapts model complexity on the fly. We studied our method with Dirichlet process mixture models and hierarchical Dirichlet process topic models on two large data sets. Our method performs better than previous stochastic variational inference algorithms. 1

6 0.10065551 306 nips-2012-Semantic Kernel Forests from Multiple Taxonomies

7 0.098027356 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes

8 0.093555257 227 nips-2012-Multiclass Learning with Simplex Coding

9 0.087025478 226 nips-2012-Multiclass Learning Approaches: A Theoretical Comparison with Implications

10 0.08636082 339 nips-2012-The Time-Marginalized Coalescent Prior for Hierarchical Clustering

11 0.081598327 200 nips-2012-Local Supervised Learning through Space Partitioning

12 0.08139877 298 nips-2012-Scalable Inference of Overlapping Communities

13 0.079150543 197 nips-2012-Learning with Recursive Perceptual Representations

14 0.075869188 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

15 0.075310066 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing

16 0.074748553 233 nips-2012-Multiresolution Gaussian Processes

17 0.074192844 37 nips-2012-Affine Independent Variational Inference

18 0.073815972 278 nips-2012-Probabilistic n-Choose-k Models for Classification and Ranking

19 0.068657681 191 nips-2012-Learning the Architecture of Sum-Product Networks Using Clustering on Variables

20 0.067835204 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.21), (1, 0.052), (2, -0.044), (3, -0.022), (4, -0.097), (5, -0.061), (6, -0.001), (7, 0.032), (8, -0.053), (9, -0.043), (10, -0.026), (11, 0.034), (12, 0.06), (13, 0.085), (14, -0.052), (15, 0.008), (16, -0.074), (17, 0.03), (18, -0.021), (19, 0.003), (20, -0.077), (21, 0.111), (22, 0.093), (23, 0.036), (24, -0.001), (25, -0.075), (26, -0.007), (27, -0.058), (28, -0.071), (29, -0.015), (30, -0.068), (31, -0.057), (32, -0.014), (33, -0.088), (34, 0.014), (35, 0.009), (36, 0.045), (37, -0.082), (38, -0.07), (39, 0.029), (40, 0.124), (41, 0.054), (42, 0.082), (43, -0.079), (44, 0.03), (45, -0.066), (46, -0.016), (47, 0.03), (48, 0.005), (49, 0.036)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93187952 58 nips-2012-Bayesian models for Large-scale Hierarchical Classification

Author: Siddharth Gopal, Yiming Yang, Bing Bai, Alexandru Niculescu-mizil

Abstract: A challenging problem in hierarchical classification is to leverage the hierarchical relations among classes for improving classification performance. An even greater challenge is to do so in a manner that is computationally feasible for large scale problems. This paper proposes a set of Bayesian methods to model hierarchical dependencies among class labels using multivariate logistic regression. Specifically, the parent-child relationships are modeled by placing a hierarchical prior over the children nodes centered around the parameters of their parents; thereby encouraging classes nearby in the hierarchy to share similar model parameters. We present variational algorithms for tractable posterior inference in these models, and provide a parallel implementation that can comfortably handle largescale problems with hundreds of thousands of dimensions and tens of thousands of classes. We run a comparative evaluation on multiple large-scale benchmark datasets that highlights the scalability of our approach and shows improved performance over the other state-of-the-art hierarchical methods. 1

2 0.78053427 207 nips-2012-Mandatory Leaf Node Prediction in Hierarchical Multilabel Classification

Author: Wei Bi, James T. Kwok

Abstract: In hierarchical classification, the prediction paths may be required to always end at leaf nodes. This is called mandatory leaf node prediction (MLNP) and is particularly useful when the leaf nodes have much stronger semantic meaning than the internal nodes. However, while there have been a lot of MLNP methods in hierarchical multiclass classification, performing MLNP in hierarchical multilabel classification is much more difficult. In this paper, we propose a novel MLNP algorithm that (i) considers the global hierarchy structure; and (ii) can be used on hierarchies of both trees and DAGs. We show that one can efficiently maximize the joint posterior probability of all the node labels by a simple greedy algorithm. Moreover, this can be further extended to the minimization of the expected symmetric loss. Experiments are performed on a number of real-world data sets with tree- and DAG-structured label hierarchies. The proposed method consistently outperforms other hierarchical and flat multilabel classification methods. 1

3 0.68273461 127 nips-2012-Fast Bayesian Inference for Non-Conjugate Gaussian Process Regression

Author: Emtiyaz Khan, Shakir Mohamed, Kevin P. Murphy

Abstract: We present a new variational inference algorithm for Gaussian process regression with non-conjugate likelihood functions, with application to a wide array of problems including binary and multi-class classification, and ordinal regression. Our method constructs a concave lower bound that is optimized using an efficient fixed-point updating algorithm. We show that the new algorithm has highly competitive computational complexity, matching that of alternative approximate inference methods. We also prove that the use of concave variational bounds provides stable and guaranteed convergence – a property not available to other approaches. We show empirically for both binary and multi-class classification that our new algorithm converges much faster than existing variational methods, and without any degradation in performance. 1

4 0.64552569 129 nips-2012-Fast Variational Inference in the Conjugate Exponential Family

Author: James Hensman, Magnus Rattray, Neil D. Lawrence

Abstract: We present a general method for deriving collapsed variational inference algorithms for probabilistic models in the conjugate exponential family. Our method unifies many existing approaches to collapsed variational inference. Our collapsed variational inference leads to a new lower bound on the marginal likelihood. We exploit the information geometry of the bound to derive much faster optimization methods based on conjugate gradients for these models. Our approach is very general and is easily applied to any model where the mean field update equations have been derived. Empirically we show significant speed-ups for probabilistic inference using our bound. 1

5 0.62336409 37 nips-2012-Affine Independent Variational Inference

Author: Edward Challis, David Barber

Abstract: We consider inference in a broad class of non-conjugate probabilistic models based on minimising the Kullback-Leibler divergence between the given target density and an approximating ‘variational’ density. In particular, for generalised linear models we describe approximating densities formed from an affine transformation of independently distributed latent variables, this class including many well known densities as special cases. We show how all relevant quantities can be efficiently computed using the fast Fourier transform. This extends the known class of tractable variational approximations and enables the fitting for example of skew variational densities to the target density. 1

6 0.61271924 339 nips-2012-The Time-Marginalized Coalescent Prior for Hierarchical Clustering

7 0.60848302 215 nips-2012-Minimizing Uncertainty in Pipelines

8 0.5913738 298 nips-2012-Scalable Inference of Overlapping Communities

9 0.58230746 232 nips-2012-Multiplicative Forests for Continuous-Time Processes

10 0.55951232 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

11 0.55573446 81 nips-2012-Context-Sensitive Decision Forests for Object Detection

12 0.54655111 306 nips-2012-Semantic Kernel Forests from Multiple Taxonomies

13 0.54071695 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes

14 0.52854306 54 nips-2012-Bayesian Probabilistic Co-Subspace Addition

15 0.5125739 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing

16 0.50952011 130 nips-2012-Feature-aware Label Space Dimension Reduction for Multi-label Classification

17 0.5087145 182 nips-2012-Learning Networks of Heterogeneous Influence

18 0.50838083 278 nips-2012-Probabilistic n-Choose-k Models for Classification and Ranking

19 0.49014997 270 nips-2012-Phoneme Classification using Constrained Variational Gaussian Process Dynamical System

20 0.48868379 26 nips-2012-A nonparametric variable clustering model


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.084), (17, 0.018), (21, 0.023), (38, 0.124), (39, 0.02), (42, 0.019), (54, 0.03), (55, 0.018), (59, 0.01), (70, 0.208), (74, 0.054), (76, 0.148), (80, 0.109), (92, 0.041)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.83169061 296 nips-2012-Risk Aversion in Markov Decision Processes via Near Optimal Chernoff Bounds

Author: Teodor M. Moldovan, Pieter Abbeel

Abstract: The expected return is a widely used objective in decision making under uncertainty. Many algorithms, such as value iteration, have been proposed to optimize it. In risk-aware settings, however, the expected return is often not an appropriate objective to optimize. We propose a new optimization objective for risk-aware planning and show that it has desirable theoretical properties. We also draw connections to previously proposed objectives for risk-aware planing: minmax, exponential utility, percentile and mean minus variance. Our method applies to an extended class of Markov decision processes: we allow costs to be stochastic as long as they are bounded. Additionally, we present an efficient algorithm for optimizing the proposed objective. Synthetic and real-world experiments illustrate the effectiveness of our method, at scale. 1

same-paper 2 0.82917732 58 nips-2012-Bayesian models for Large-scale Hierarchical Classification

Author: Siddharth Gopal, Yiming Yang, Bing Bai, Alexandru Niculescu-mizil

Abstract: A challenging problem in hierarchical classification is to leverage the hierarchical relations among classes for improving classification performance. An even greater challenge is to do so in a manner that is computationally feasible for large scale problems. This paper proposes a set of Bayesian methods to model hierarchical dependencies among class labels using multivariate logistic regression. Specifically, the parent-child relationships are modeled by placing a hierarchical prior over the children nodes centered around the parameters of their parents; thereby encouraging classes nearby in the hierarchy to share similar model parameters. We present variational algorithms for tractable posterior inference in these models, and provide a parallel implementation that can comfortably handle largescale problems with hundreds of thousands of dimensions and tens of thousands of classes. We run a comparative evaluation on multiple large-scale benchmark datasets that highlights the scalability of our approach and shows improved performance over the other state-of-the-art hierarchical methods. 1

3 0.75025845 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

Author: Chong Wang, David M. Blei

Abstract: We present a truncation-free stochastic variational inference algorithm for Bayesian nonparametric models. While traditional variational inference algorithms require truncations for the model or the variational distribution, our method adapts model complexity on the fly. We studied our method with Dirichlet process mixture models and hierarchical Dirichlet process topic models on two large data sets. Our method performs better than previous stochastic variational inference algorithms. 1

4 0.74976784 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs

Author: Anima Anandkumar, Ragupathyraj Valluvan

Abstract: Graphical model selection refers to the problem of estimating the unknown graph structure given observations at the nodes in the model. We consider a challenging instance of this problem when some of the nodes are latent or hidden. We characterize conditions for tractable graph estimation and develop efficient methods with provable guarantees. We consider the class of Ising models Markov on locally tree-like graphs, which are in the regime of correlation decay. We propose an efficient method for graph estimation, and establish its structural consistency −δη(η+1)−2 when the number of samples n scales as n = Ω(θmin log p), where θmin is the minimum edge potential, δ is the depth (i.e., distance from a hidden node to the nearest observed nodes), and η is a parameter which depends on the minimum and maximum node and edge potentials in the Ising model. The proposed method is practical to implement and provides flexibility to control the number of latent variables and the cycle lengths in the output graph. We also present necessary conditions for graph estimation by any method and show that our method nearly matches the lower bound on sample requirements. Keywords: Graphical model selection, latent variables, quartet methods, locally tree-like graphs. 1

5 0.7493673 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes

Author: Michael Bryant, Erik B. Sudderth

Abstract: Variational methods provide a computationally scalable alternative to Monte Carlo methods for large-scale, Bayesian nonparametric learning. In practice, however, conventional batch and online variational methods quickly become trapped in local optima. In this paper, we consider a nonparametric topic model based on the hierarchical Dirichlet process (HDP), and develop a novel online variational inference algorithm based on split-merge topic updates. We derive a simpler and faster variational approximation of the HDP, and show that by intelligently splitting and merging components of the variational posterior, we can achieve substantially better predictions of test data than conventional online and batch variational algorithms. For streaming analysis of large datasets where batch analysis is infeasible, we show that our split-merge updates better capture the nonparametric properties of the underlying model, allowing continual learning of new topics.

6 0.74768019 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models

7 0.74734873 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

8 0.74381781 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

9 0.74172193 192 nips-2012-Learning the Dependency Structure of Latent Factors

10 0.74171734 19 nips-2012-A Spectral Algorithm for Latent Dirichlet Allocation

11 0.73969203 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

12 0.7381385 197 nips-2012-Learning with Recursive Perceptual Representations

13 0.73778141 294 nips-2012-Repulsive Mixtures

14 0.73690569 277 nips-2012-Probabilistic Low-Rank Subspace Clustering

15 0.73535603 200 nips-2012-Local Supervised Learning through Space Partitioning

16 0.73517722 220 nips-2012-Monte Carlo Methods for Maximum Margin Supervised Topic Models

17 0.73447257 232 nips-2012-Multiplicative Forests for Continuous-Time Processes

18 0.7344175 47 nips-2012-Augment-and-Conquer Negative Binomial Processes

19 0.73391676 272 nips-2012-Practical Bayesian Optimization of Machine Learning Algorithms

20 0.73311651 180 nips-2012-Learning Mixtures of Tree Graphical Models