jmlr jmlr2012 jmlr2012-9 knowledge-graph by maker-knowledge-mining

9 jmlr-2012-A Topic Modeling Toolbox Using Belief Propagation


Source: pdf

Author: Jia Zeng

Abstract: Latent Dirichlet allocation (LDA) is an important hierarchical Bayesian model for probabilistic topic modeling, which attracts worldwide interest and touches on many important applications in text mining, computer vision and computational biology. This paper introduces a topic modeling toolbox (TMBP) based on the belief propagation (BP) algorithms. The TMBP toolbox is implemented in MEX C++/Matlab/Octave for either Windows 7 or Linux. Compared with existing topic modeling packages, the novelty of this toolbox lies in the BP algorithms for learning LDA-based topic models. The current version includes BP algorithms for latent Dirichlet allocation (LDA), author-topic models (ATM), relational topic models (RTM), and labeled LDA (LaLDA). This toolbox is an ongoing project and more BP-based algorithms for various topic models will be added in the near future. Interested users may also extend BP algorithms for learning more complicated topic models. The source code is freely available under the GNU General Public Licence, Version 1.0 at https://mloss.org/software/view/399/. Keywords: topic models, belief propagation, variational Bayes, Gibbs sampling

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 This paper introduces a topic modeling toolbox (TMBP) based on the belief propagation (BP) algorithms. [sent-4, score-0.82]

2 The TMBP toolbox is implemented in MEX C++/Matlab/Octave for either Windows 7 or Linux. [sent-5, score-0.124]

3 Compared with existing topic modeling packages, the novelty of this toolbox lies in the BP algorithms for learning LDA-based topic models. [sent-6, score-1.124]

4 The current version includes BP algorithms for latent Dirichlet allocation (LDA), author-topic models (ATM), relational topic models (RTM), and labeled LDA (LaLDA). [sent-7, score-0.68]

5 This toolbox is an ongoing project and more BP-based algorithms for various topic models will be added in the near future. [sent-8, score-0.606]

6 Interested users may also extend BP algorithms for learning more complicated topic models. [sent-9, score-0.451]

7 The source code is freely available under the GNU General Public Licence, Version 1.0 at https://mloss.org/software/view/399/. [sent-10, score-0.032]

8 Keywords: topic models, belief propagation, variational Bayes, Gibbs sampling. [sent-13, score-0.56]

9–11 Introduction: The past decade has seen rapid development of latent Dirichlet allocation (LDA) (Blei et al., 2003) for solving topic modeling problems because of its elegant three-layer graphical representation as well as two efficient approximate inference methods, variational Bayes (VB) (Blei et al., 2003) and collapsed Gibbs sampling (GS) (Griffiths and Steyvers, 2004). [sent-14–16]

12–13 Both VB and GS have been widely used to learn variants of LDA-based topic models until our recent work (Zeng et al., 2011) revealed that there is yet another learning algorithm for LDA based on loopy belief propagation (BP). [sent-17–18]

14 The basic idea of BP is inspired by the collapsed GS algorithm, in which the three-layer LDA can be interpreted as being collapsed into a two-layer factor graph (Kschischang et al., 2001). [sent-19–20]

15 The sum-product BP algorithm operates on the factor graph (Bishop, 2006). [sent-21, score-0.018]

16 Extensive experiments confirm that BP is faster and more accurate than both VB and GS, and thus is a strong candidate for becoming the standard topic modeling algorithm. [sent-22, score-0.529]

17–19 For example, we show how to learn three typical variants of LDA-based topic models, namely author-topic models (ATM) (Rosen-Zvi et al., 2004), relational topic models (RTM) (Chang and Blei, 2010), and labeled LDA (LaLDA) (Ramage et al., 2009), using BP based on the novel factor graph representations (Zeng et al., 2011). [sent-23–26]

20 We have implemented the topic modeling toolbox TMBP in MEX C++ with a Matlab/Octave interface, based on the VB, GS and BP algorithms. [sent-27, score-0.635]

21–22 Compared with other topic modeling packages, the novelty of this toolbox lies in the BP algorithms for topic modeling. [sent-28–29]

23 This paper describes how to use this toolbox for basic topic modeling tasks. [sent-30, score-0.652]

24 Co-occurrence: different word indices w in the same document d tend to have the same topic label. [sent-33, score-0.672]

25 Smoothness: the same word index w in different documents d tends to have the same topic label. [sent-35, score-0.631]

26 Clustering: all word indices w do not tend to be associated with the same topic label. [sent-37, score-0.631]

27 Based on the above rules, recent approximate inference methods compute the marginal distribution of the topic label, µ_{w,d}(k) = p(z_{w,d}^k = 1), called the message, and estimate parameters using the iterative EM algorithm (Bishop, 2006) according to the maximum-likelihood criterion. [sent-38, score-0.471]

28 The major difference among these inference methods lies in the message update equation. [sent-39, score-0.211]

29 VB updates messages by complicated digamma functions, which cause bias and slow down message updating (Zeng et al., 2011). [sent-40–41]

30 GS updates messages by topic labels randomly sampled from the message in the previous iteration. [sent-42, score-0.603]

31 The sampling process does not keep all uncertainty encoded in the previous message. [sent-43, score-0.023]

32 In contrast, BP directly uses the previous message to update the current message without sampling. [sent-44, score-0.273]

33 Similar ideas have also been proposed within the approximate mean-field framework (Asuncion, 2010) as the zero-order approximation of the collapsed VB (CVB0) algorithm (Asuncion et al., 2009). [sent-45–46]

34–35 While proper settings of hyperparameters can make the topic modeling performance comparable among different inference methods (Asuncion et al., 2009), we still advocate the BP algorithms because of their ease of use and fast speed. [sent-47–48]

36 Table 1 compares the message update equations among VB, GS and BP. [sent-49, score-0.152]

37 Compared with BP, VB uses the digamma function Ψ in the message update, and GS uses the discrete count of sampled topic labels n_{w,d}^{-i}, based on word tokens rather than word indices, in the message update. [sent-50, score-0.972]

38 The Dirichlet hyperparameters α and β can be viewed as pseudo-messages. [sent-51, score-0.031]

39 The notations −w and −d denote all word indices except w and all document indices except d, and −i denotes all word tokens except the current word token i. [sent-52, score-0.575]
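
Table 1 itself does not survive the extraction above, so the following is a hedged LaTeX reconstruction of the three message updates, following the forms in Zeng et al. (2011); the exact notation of the paper's Table 1 may differ. Normalization over k is implied and W denotes the vocabulary size:

\mu_{w,d}(k) \propto
\begin{cases}
\exp\{\Psi(\mu_{-w,d}(k)+\alpha)\}\,\exp\{\Psi(\mu_{w,-d}(k)+\beta)\}\,/\,\exp\{\Psi(\sum_{w}\mu_{w,-d}(k)+W\beta)\} & \text{(VB)}\\
(n^{-i}_{k,d}+\alpha)\,(n^{-i}_{k,w}+\beta)\,/\,(\sum_{w} n^{-i}_{k,w}+W\beta) & \text{(GS)}\\
(\mu_{-w,d}(k)+\alpha)\,(\mu_{w,-d}(k)+\beta)\,/\,(\sum_{w}\mu_{w,-d}(k)+W\beta) & \text{(BP)}
\end{cases}

where \mu_{-w,d}(k)=\sum_{w'\neq w} x_{w',d}\,\mu_{w',d}(k), \mu_{w,-d}(k)=\sum_{d'\neq d} x_{w,d'}\,\mu_{w,d'}(k), x_{w,d} is the word count, and the n^{-i} are topic-label counts excluding the current token i.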

40–41 Because VB and GS have been widely used for learning different LDA-based topic models, it is easy to develop the corresponding BP algorithms for learning these models, either by removing the digamma function from the VB update or by not sampling from the posterior probability in the GS algorithm. [sent-55, sent-91]
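
To make this concrete, here is a minimal Matlab/Octave sketch of one synchronous BP iteration over a word-document count matrix, implementing the BP update reconstructed above with dense arrays; the function name, variable names and loop structure are illustrative assumptions, not the TMBP source:

% One synchronous BP iteration for LDA (illustrative sketch, not TMBP code).
% X:  W-by-D word-document count matrix
% mu: K-by-W-by-D messages; mu(:,w,d) sums to one wherever X(w,d) > 0
function mu_new = sbp_step(X, mu, alpha, beta)
  [K, W, D] = size(mu);
  theta = zeros(K, D);                         % per-document message sums
  phi   = zeros(K, W);                         % per-word message sums
  for d = 1:D
    for w = 1:W
      theta(:,d) = theta(:,d) + X(w,d) * mu(:,w,d);
      phi(:,w)   = phi(:,w)   + X(w,d) * mu(:,w,d);
    end
  end
  phisum = sum(phi, 2);                        % K-by-1 totals per topic
  mu_new = zeros(size(mu));
  for d = 1:D
    for w = 1:W
      if X(w,d) == 0, continue; end
      th = theta(:,d) - X(w,d) * mu(:,w,d);    % mu_{-w,d}: exclude word w in d
      ph = phi(:,w)   - X(w,d) * mu(:,w,d);    % mu_{w,-d}: exclude document d
      m  = (th + alpha) .* (ph + beta) ./ (phisum - theta(:,d) + W * beta);
      mu_new(:,w,d) = m / sum(m);              % normalize the message over topics
    end
  end
end

A practical implementation would store messages only for the nonzero entries of a sparse X, as a MEX C++ routine can; the dense form above merely keeps the update readable.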

42 For example, we show how to develop the corresponding BP algorithms for two typical LDA-based topic models, ATM and RTM (Zeng et al., 2011). [sent-92–94]

43–45 An Example of Using TMBP: The TMBP toolbox contains source code for learning LDA based on VB, GS, and BP (Zeng et al., 2011, 2012a,b,c), learning author-topic models (ATM) (Rosen-Zvi et al., 2004) based on GS and BP, and learning relational topic models (RTM) (Chang and Blei, 2010) and labeled LDA (Ramage et al., 2009) based on BP. [sent-95–100]

46 Here, we present a demo for the synchronous BP algorithm. [sent-101, score-0.045]

47 The results (the training perplexity at every 10 iterations and the top five words in each of ten topics) are printed on the screen: [sent-104, score-0.026]

∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗ The sBP Algorithm ∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
Iteration 10 of 500: 1041.
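
For orientation, a hypothetical Matlab/Octave session producing output of this shape is sketched below; the function name sBP, its argument order, and the data file are assumptions for illustration, not the documented TMBP interface:

% Hypothetical demo session; sBP and its signature are assumptions,
% not the documented TMBP interface -- consult the toolbox README.
load wd.mat                  % assumed to contain a W-by-D sparse count matrix WD
K     = 10;                  % number of topics
N     = 500;                 % number of training iterations
ALPHA = 1e-2;                % document-topic Dirichlet hyperparameter
BETA  = 1e-2;                % topic-word Dirichlet hyperparameter
[phi, theta] = sBP(WD, K, N, ALPHA, BETA);    % synchronous BP training
% phi:  K-by-W topic-word distributions; theta: K-by-D document-topic proportions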

48 Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. [sent-180, score-0.483]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('bp', 0.435), ('topic', 0.433), ('zeng', 0.399), ('gs', 0.264), ('vb', 0.244), ('lda', 0.218), ('tmbp', 0.153), ('blei', 0.131), ('toolbox', 0.124), ('atm', 0.123), ('rtm', 0.123), ('message', 0.121), ('word', 0.121), ('xw', 0.105), ('digamma', 0.092), ('dirichlet', 0.091), ('propagation', 0.088), ('collapsed', 0.087), ('belief', 0.08), ('ramage', 0.079), ('modeling', 0.078), ('zk', 0.073), ('ths', 0.071), ('relational', 0.069), ('asuncion', 0.064), ('kschischang', 0.061), ('lalda', 0.061), ('sbp', 0.061), ('allocation', 0.061), ('grif', 0.061), ('indices', 0.052), ('steyvers', 0.052), ('china', 0.048), ('eng', 0.047), ('jia', 0.047), ('mex', 0.044), ('chang', 0.041), ('document', 0.041), ('tokens', 0.041), ('inference', 0.038), ('arxiv', 0.037), ('novelty', 0.035), ('bishop', 0.032), ('codes', 0.032), ('messages', 0.032), ('latent', 0.031), ('update', 0.031), ('hyperparameters', 0.031), ('liu', 0.03), ('grant', 0.03), ('gibbs', 0.029), ('uai', 0.029), ('models', 0.029), ('labeled', 0.028), ('attracts', 0.026), ('attribution', 0.026), ('cheung', 0.026), ('demo', 0.026), ('institutions', 0.026), ('licence', 0.026), ('odeling', 0.026), ('oolbox', 0.026), ('opic', 0.026), ('printed', 0.026), ('sage', 0.026), ('screen', 0.026), ('token', 0.026), ('worldwide', 0.026), ('tend', 0.025), ('variational', 0.024), ('nallapati', 0.024), ('credit', 0.024), ('elapsed', 0.024), ('shanghai', 0.024), ('sing', 0.024), ('topics', 0.024), ('sampling', 0.023), ('frey', 0.022), ('installation', 0.022), ('loopy', 0.022), ('smyth', 0.022), ('speeding', 0.022), ('touches', 0.022), ('lies', 0.021), ('ongoing', 0.02), ('advocate', 0.02), ('allocate', 0.02), ('dw', 0.02), ('welling', 0.02), ('synchronous', 0.019), ('graph', 0.018), ('org', 0.018), ('becoming', 0.018), ('interests', 0.018), ('hierarchical', 0.018), ('complicated', 0.018), ('introduces', 0.017), ('labels', 0.017), ('elegant', 0.017), ('decade', 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 9 jmlr-2012-A Topic Modeling Toolbox Using Belief Propagation

Author: Jia Zeng

Abstract: Latent Dirichlet allocation (LDA) is an important hierarchical Bayesian model for probabilistic topic modeling, which attracts worldwide interest and touches on many important applications in text mining, computer vision and computational biology. This paper introduces a topic modeling toolbox (TMBP) based on the belief propagation (BP) algorithms. The TMBP toolbox is implemented in MEX C++/Matlab/Octave for either Windows 7 or Linux. Compared with existing topic modeling packages, the novelty of this toolbox lies in the BP algorithms for learning LDA-based topic models. The current version includes BP algorithms for latent Dirichlet allocation (LDA), author-topic models (ATM), relational topic models (RTM), and labeled LDA (LaLDA). This toolbox is an ongoing project and more BP-based algorithms for various topic models will be added in the near future. Interested users may also extend BP algorithms for learning more complicated topic models. The source code is freely available under the GNU General Public Licence, Version 1.0 at https://mloss.org/software/view/399/. Keywords: topic models, belief propagation, variational Bayes, Gibbs sampling

2 0.24635428 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models

Author: Jun Zhu, Amr Ahmed, Eric P. Xing

Abstract: A supervised topic model can use side information such as ratings or labels associated with documents or images to discover more predictive low dimensional topical representations of the data. However, existing supervised topic models predominantly employ likelihood-driven objective functions for learning and inference, leaving the popular and potentially powerful max-margin principle unexploited for seeking predictive representations of data and more discriminative topic bases for the corpus. In this paper, we propose the maximum entropy discrimination latent Dirichlet allocation (MedLDA) model, which integrates the mechanism behind the max-margin prediction models (e.g., SVMs) with the mechanism behind the hierarchical Bayesian topic models (e.g., LDA) under a unified constrained optimization framework, and yields latent topical representations that are more discriminative and more suitable for prediction tasks such as document classification or regression. The principle underlying the MedLDA formalism is quite general and can be applied for jointly max-margin and maximum likelihood learning of directed or undirected topic models when supervising side information is available. Efficient variational methods for posterior inference and parameter estimation are derived and extensive empirical studies on several real data sets are also provided. Our experimental results demonstrate qualitatively and quantitatively that MedLDA could: 1) discover sparse and highly discriminative topical representations; 2) achieve state of the art prediction performance; and 3) be more efficient than existing supervised topic models, especially for classification. Keywords: supervised topic models, max-margin learning, maximum entropy discrimination, latent Dirichlet allocation, support vector machines

3 0.062950827 119 jmlr-2012-glm-ie: Generalised Linear Models Inference & Estimation Toolbox

Author: Hannes Nickisch

Abstract: The glm-ie toolbox contains functionality for estimation and inference in generalised linear models over continuous-valued variables. Besides a variety of penalised least squares solvers for estimation, it offers inference based on (convex) variational bounds, on expectation propagation and on factorial mean field. Scalable and efficient inference in fully-connected undirected graphical models or Markov random fields with Gaussian and non-Gaussian potentials is achieved by casting all the computations as matrix vector multiplications. We provide a wide choice of penalty functions for estimation, potential functions for inference and matrix classes with lazy evaluation for convenient modelling. We designed the glm-ie package to be simple, generic and easily expansible. Most of the code is written in Matlab including some MEX files to be fully compatible to both Matlab 7.x and GNU Octave 3.3.x. Large scale probabilistic classification as well as sparse linear modelling can be performed in a common algorithmical framework by the glm-ie toolkit. Keywords: sparse linear models, generalised linear models, Bayesian inference, approximate inference, probabilistic regression and classification, penalised least squares estimation, lazy evaluation matrix class

4 0.051349428 41 jmlr-2012-Exploration in Relational Domains for Model-based Reinforcement Learning

Author: Tobias Lang, Marc Toussaint, Kristian Kersting

Abstract: A fundamental problem in reinforcement learning is balancing exploration and exploitation. We address this problem in the context of model-based reinforcement learning in large stochastic relational domains by developing relational extensions of the concepts of the E 3 and R- MAX algorithms. Efficient exploration in exponentially large state spaces needs to exploit the generalization of the learned model: what in a propositional setting would be considered a novel situation and worth exploration may in the relational setting be a well-known context in which exploitation is promising. To address this we introduce relational count functions which generalize the classical notion of state and action visitation counts. We provide guarantees on the exploration efficiency of our framework using count functions under the assumption that we had a relational KWIK learner and a near-optimal planner. We propose a concrete exploration algorithm which integrates a practically efficient probabilistic rule learner and a relational planner (for which there are no guarantees, however) and employs the contexts of learned relational rules as features to model the novelty of states and actions. Our results in noisy 3D simulated robot manipulation problems and in domains of the international planning competition demonstrate that our approach is more effective than existing propositional and factored exploration techniques. Keywords: reinforcement learning, statistical relational learning, exploration, relational transition models, robotics

5 0.04712303 106 jmlr-2012-Sign Language Recognition using Sub-Units

Author: Helen Cooper, Eng-Jon Ong, Nicolas Pugeault, Richard Bowden

Abstract: This paper discusses sign language recognition using linguistic sub-units. It presents three types of sub-units for consideration; those learnt from appearance data as well as those inferred from both 2D or 3D tracking data. These sub-units are then combined using a sign level classifier; here, two options are presented. The first uses Markov Models to encode the temporal changes between sub-units. The second makes use of Sequential Pattern Boosting to apply discriminative feature selection at the same time as encoding temporal information. This approach is more robust to noise and performs well in signer independent tests, improving results from the 54% achieved by the Markov Chains to 76%. Keywords: sign language recognition, sequential pattern boosting, depth cameras, sub-units, signer independence, data set

6 0.037396919 22 jmlr-2012-Bounding the Probability of Error for High Precision Optical Character Recognition

7 0.037380721 30 jmlr-2012-DARWIN: A Framework for Machine Learning and Computer Vision Research and Development

8 0.029067747 48 jmlr-2012-High-Dimensional Gaussian Graphical Model Selection: Walk Summability and Local Separation Criterion

9 0.025452554 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization

10 0.023856808 21 jmlr-2012-Bayesian Mixed-Effects Inference on Classification Performance in Hierarchical Data Sets

11 0.021862155 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine

12 0.021477558 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming

13 0.020722648 112 jmlr-2012-Structured Sparsity via Alternating Direction Methods

14 0.02068644 61 jmlr-2012-ML-Flex: A Flexible Toolbox for Performing Classification Analyses In Parallel

15 0.020371012 45 jmlr-2012-Finding Recurrent Patterns from Continuous Sign Language Sentences for Automated Extraction of Signs

16 0.018821653 118 jmlr-2012-Variational Multinomial Logit Gaussian Process

17 0.018113537 31 jmlr-2012-DEAP: Evolutionary Algorithms Made Easy

18 0.017127242 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms

19 0.016352603 90 jmlr-2012-Pattern for Python

20 0.016189944 108 jmlr-2012-Sparse and Unique Nonnegative Matrix Factorization Through Data Preprocessing


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.076), (1, 0.035), (2, 0.173), (3, -0.06), (4, 0.052), (5, 0.02), (6, 0.148), (7, -0.016), (8, -0.167), (9, 0.049), (10, 0.135), (11, 0.125), (12, 0.153), (13, 0.006), (14, 0.285), (15, -0.267), (16, -0.258), (17, -0.348), (18, 0.231), (19, -0.094), (20, 0.061), (21, -0.084), (22, -0.028), (23, -0.042), (24, 0.026), (25, 0.022), (26, 0.033), (27, 0.022), (28, -0.095), (29, 0.038), (30, -0.074), (31, -0.065), (32, 0.053), (33, -0.003), (34, 0.092), (35, 0.005), (36, -0.017), (37, 0.064), (38, -0.037), (39, -0.061), (40, -0.024), (41, 0.023), (42, -0.012), (43, -0.103), (44, -0.014), (45, -0.032), (46, 0.016), (47, 0.033), (48, 0.03), (49, 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99299145 9 jmlr-2012-A Topic Modeling Toolbox Using Belief Propagation

Author: Jia Zeng

Abstract: Latent Dirichlet allocation (LDA) is an important hierarchical Bayesian model for probabilistic topic modeling, which attracts worldwide interest and touches on many important applications in text mining, computer vision and computational biology. This paper introduces a topic modeling toolbox (TMBP) based on the belief propagation (BP) algorithms. The TMBP toolbox is implemented in MEX C++/Matlab/Octave for either Windows 7 or Linux. Compared with existing topic modeling packages, the novelty of this toolbox lies in the BP algorithms for learning LDA-based topic models. The current version includes BP algorithms for latent Dirichlet allocation (LDA), author-topic models (ATM), relational topic models (RTM), and labeled LDA (LaLDA). This toolbox is an ongoing project and more BP-based algorithms for various topic models will be added in the near future. Interested users may also extend BP algorithms for learning more complicated topic models. The source code is freely available under the GNU General Public Licence, Version 1.0 at https://mloss.org/software/view/399/. Keywords: topic models, belief propagation, variational Bayes, Gibbs sampling

2 0.84283578 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models

Author: Jun Zhu, Amr Ahmed, Eric P. Xing

Abstract: A supervised topic model can use side information such as ratings or labels associated with documents or images to discover more predictive low dimensional topical representations of the data. However, existing supervised topic models predominantly employ likelihood-driven objective functions for learning and inference, leaving the popular and potentially powerful max-margin principle unexploited for seeking predictive representations of data and more discriminative topic bases for the corpus. In this paper, we propose the maximum entropy discrimination latent Dirichlet allocation (MedLDA) model, which integrates the mechanism behind the max-margin prediction models (e.g., SVMs) with the mechanism behind the hierarchical Bayesian topic models (e.g., LDA) under a unified constrained optimization framework, and yields latent topical representations that are more discriminative and more suitable for prediction tasks such as document classification or regression. The principle underlying the MedLDA formalism is quite general and can be applied for jointly max-margin and maximum likelihood learning of directed or undirected topic models when supervising side information is available. Efficient variational methods for posterior inference and parameter estimation are derived and extensive empirical studies on several real data sets are also provided. Our experimental results demonstrate qualitatively and quantitatively that MedLDA could: 1) discover sparse and highly discriminative topical representations; 2) achieve state of the art prediction performance; and 3) be more efficient than existing supervised topic models, especially for classification. Keywords: supervised topic models, max-margin learning, maximum entropy discrimination, latent Dirichlet allocation, support vector machines

3 0.26284057 119 jmlr-2012-glm-ie: Generalised Linear Models Inference & Estimation Toolbox

Author: Hannes Nickisch

Abstract: The glm-ie toolbox contains functionality for estimation and inference in generalised linear models over continuous-valued variables. Besides a variety of penalised least squares solvers for estimation, it offers inference based on (convex) variational bounds, on expectation propagation and on factorial mean field. Scalable and efficient inference in fully-connected undirected graphical models or Markov random fields with Gaussian and non-Gaussian potentials is achieved by casting all the computations as matrix vector multiplications. We provide a wide choice of penalty functions for estimation, potential functions for inference and matrix classes with lazy evaluation for convenient modelling. We designed the glm-ie package to be simple, generic and easily expansible. Most of the code is written in Matlab including some MEX files to be fully compatible to both Matlab 7.x and GNU Octave 3.3.x. Large scale probabilistic classification as well as sparse linear modelling can be performed in a common algorithmical framework by the glm-ie toolkit. Keywords: sparse linear models, generalised linear models, Bayesian inference, approximate inference, probabilistic regression and classification, penalised least squares estimation, lazy evaluation matrix class

4 0.20726013 30 jmlr-2012-DARWIN: A Framework for Machine Learning and Computer Vision Research and Development

Author: Stephen Gould

Abstract: We present an open-source platform-independent C++ framework for machine learning and computer vision research. The framework includes a wide range of standard machine learning and graphical models algorithms as well as reference implementations for many machine learning and computer vision applications. The framework contains Matlab wrappers for core components of the library and an experimental graphical user interface for developing and visualizing machine learning data flows. Keywords: machine learning, graphical models, computer vision, open-source software

5 0.1837797 41 jmlr-2012-Exploration in Relational Domains for Model-based Reinforcement Learning

Author: Tobias Lang, Marc Toussaint, Kristian Kersting

Abstract: A fundamental problem in reinforcement learning is balancing exploration and exploitation. We address this problem in the context of model-based reinforcement learning in large stochastic relational domains by developing relational extensions of the concepts of the E 3 and R- MAX algorithms. Efficient exploration in exponentially large state spaces needs to exploit the generalization of the learned model: what in a propositional setting would be considered a novel situation and worth exploration may in the relational setting be a well-known context in which exploitation is promising. To address this we introduce relational count functions which generalize the classical notion of state and action visitation counts. We provide guarantees on the exploration efficiency of our framework using count functions under the assumption that we had a relational KWIK learner and a near-optimal planner. We propose a concrete exploration algorithm which integrates a practically efficient probabilistic rule learner and a relational planner (for which there are no guarantees, however) and employs the contexts of learned relational rules as features to model the novelty of states and actions. Our results in noisy 3D simulated robot manipulation problems and in domains of the international planning competition demonstrate that our approach is more effective than existing propositional and factored exploration techniques. Keywords: reinforcement learning, statistical relational learning, exploration, relational transition models, robotics

6 0.17014715 22 jmlr-2012-Bounding the Probability of Error for High Precision Optical Character Recognition

7 0.10305005 45 jmlr-2012-Finding Recurrent Patterns from Continuous Sign Language Sentences for Automated Extraction of Signs

8 0.09776967 112 jmlr-2012-Structured Sparsity via Alternating Direction Methods

9 0.096874379 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization

10 0.096104085 21 jmlr-2012-Bayesian Mixed-Effects Inference on Classification Performance in Hierarchical Data Sets

11 0.091943018 106 jmlr-2012-Sign Language Recognition using Sub-Units

12 0.090387732 118 jmlr-2012-Variational Multinomial Logit Gaussian Process

13 0.086969756 78 jmlr-2012-Nonparametric Guidance of Autoencoder Representations using Label Information

14 0.085855357 48 jmlr-2012-High-Dimensional Gaussian Graphical Model Selection: Walk Summability and Local Separation Criterion

15 0.07113272 101 jmlr-2012-SVDFeature: A Toolkit for Feature-based Collaborative Filtering

16 0.069502875 57 jmlr-2012-Learning Symbolic Representations of Hybrid Dynamical Systems

17 0.068087026 70 jmlr-2012-Multi-Assignment Clustering for Boolean Data

18 0.06606178 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine

19 0.056750655 113 jmlr-2012-The huge Package for High-dimensional Undirected Graph Estimation in R

20 0.056516513 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.53), (7, 0.014), (21, 0.034), (26, 0.034), (27, 0.013), (29, 0.037), (49, 0.02), (56, 0.017), (57, 0.036), (75, 0.016), (77, 0.032), (92, 0.02), (96, 0.06)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.81297016 9 jmlr-2012-A Topic Modeling Toolbox Using Belief Propagation

Author: Jia Zeng

Abstract: Latent Dirichlet allocation (LDA) is an important hierarchical Bayesian model for probabilistic topic modeling, which attracts worldwide interest and touches on many important applications in text mining, computer vision and computational biology. This paper introduces a topic modeling toolbox (TMBP) based on the belief propagation (BP) algorithms. The TMBP toolbox is implemented in MEX C++/Matlab/Octave for either Windows 7 or Linux. Compared with existing topic modeling packages, the novelty of this toolbox lies in the BP algorithms for learning LDA-based topic models. The current version includes BP algorithms for latent Dirichlet allocation (LDA), author-topic models (ATM), relational topic models (RTM), and labeled LDA (LaLDA). This toolbox is an ongoing project and more BP-based algorithms for various topic models will be added in the near future. Interested users may also extend BP algorithms for learning more complicated topic models. The source code is freely available under the GNU General Public Licence, Version 1.0 at https://mloss.org/software/view/399/. Keywords: topic models, belief propagation, variational Bayes, Gibbs sampling

2 0.45543063 29 jmlr-2012-Consistent Model Selection Criteria on High Dimensions

Author: Yongdai Kim, Sunghoon Kwon, Hosik Choi

Abstract: Asymptotic properties of model selection criteria for high-dimensional regression models are studied where the dimension of covariates is much larger than the sample size. Several sufficient conditions for model selection consistency are provided. Non-Gaussian error distributions are considered and it is shown that the maximal number of covariates for model selection consistency depends on the tail behavior of the error distribution. Also, sufficient conditions for model selection consistency are given when the variance of the noise is neither known nor estimated consistently. Results of simulation studies as well as real data analysis are given to illustrate that finite sample performances of consistent model selection criteria can be quite different. Keywords: model selection consistency, general information criteria, high dimension, regression

3 0.18968606 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models

Author: Jun Zhu, Amr Ahmed, Eric P. Xing

Abstract: A supervised topic model can use side information such as ratings or labels associated with documents or images to discover more predictive low dimensional topical representations of the data. However, existing supervised topic models predominantly employ likelihood-driven objective functions for learning and inference, leaving the popular and potentially powerful max-margin principle unexploited for seeking predictive representations of data and more discriminative topic bases for the corpus. In this paper, we propose the maximum entropy discrimination latent Dirichlet allocation (MedLDA) model, which integrates the mechanism behind the max-margin prediction models (e.g., SVMs) with the mechanism behind the hierarchical Bayesian topic models (e.g., LDA) under a unified constrained optimization framework, and yields latent topical representations that are more discriminative and more suitable for prediction tasks such as document classification or regression. The principle underlying the MedLDA formalism is quite general and can be applied for jointly max-margin and maximum likelihood learning of directed or undirected topic models when supervising side information is available. Efficient variational methods for posterior inference and parameter estimation are derived and extensive empirical studies on several real data sets are also provided. Our experimental results demonstrate qualitatively and quantitatively that MedLDA could: 1) discover sparse and highly discriminative topical representations; 2) achieve state of the art prediction performance; and 3) be more efficient than existing supervised topic models, especially for classification. Keywords: supervised topic models, max-margin learning, maximum entropy discrimination, latent Dirichlet allocation, support vector machines

4 0.16090865 54 jmlr-2012-Large-scale Linear Support Vector Regression

Author: Chia-Hua Ho, Chih-Jen Lin

Abstract: Support vector regression (SVR) and support vector classification (SVC) are popular learning techniques, but their use with kernels is often time consuming. Recently, linear SVC without kernels has been shown to give competitive accuracy for some applications, but enjoys much faster training/testing. However, few studies have focused on linear SVR. In this paper, we extend state-of-theart training methods for linear SVC to linear SVR. We show that the extension is straightforward for some methods, but is not trivial for some others. Our experiments demonstrate that for some problems, the proposed linear-SVR training methods can very efficiently produce models that are as good as kernel SVR. Keywords: support vector regression, Newton methods, coordinate descent methods

5 0.15795289 113 jmlr-2012-The huge Package for High-dimensional Undirected Graph Estimation in R

Author: Tuo Zhao, Han Liu, Kathryn Roeder, John Lafferty, Larry Wasserman

Abstract: We describe an R package named huge which provides easy-to-use functions for estimating high dimensional undirected graphs from data. This package implements recent results in the literature, including Friedman et al. (2007), Liu et al. (2009, 2012) and Liu et al. (2010). Compared with the existing graph estimation package glasso, the huge package provides extra features: (1) instead of using Fortran, it is written in C, which makes the code more portable and easier to modify; (2) besides fitting Gaussian graphical models, it also provides functions for fitting high dimensional semiparametric Gaussian copula models; (3) more functions like data-dependent model selection, data generation and graph visualization; (4) a minor convergence problem of the graphical lasso algorithm is corrected; (5) the package allows the user to apply both lossless and lossy screening rules to scale up large-scale problems, making a tradeoff between computational and statistical efficiency. Keywords: high-dimensional undirected graph estimation, glasso, huge, semiparametric graph estimation, data-dependent model selection, lossless screening, lossy screening. 1. Overview: Undirected graphs are a natural approach to describe the conditional independence among many variables. Each node of the graph represents a single variable and no edge between two variables implies that they are conditionally independent given all other variables. In the past decade, significant progress has been made on designing efficient algorithms to learn undirected graphs from high-dimensional observational data sets. Most of these methods are based on either penalized maximum-likelihood estimation (Friedman et al., 2007) or penalized regression methods (Meinshausen and Bühlmann, 2006). Existing packages include glasso, Covpath and CLIME. In particular, the glasso package has been widely adopted by statisticians and computer scientists due to its friendly user interface and efficiency. In this paper (a summary of the package huge; for more details please refer to the online vignette) we describe a newly developed R package named huge (High-dimensional Undirected Graph Estimation) coded in C. The package includes a wide range of functional modules and addresses some drawbacks of the graphical lasso algorithm. To gain more scalability, the package supports two modes of screening, lossless (Witten et al., 2011) and lossy screening. When using lossy screening, the user can select the desired screening level to scale up for high-dimensional problems, but this introduces some estimation bias. 2. Software Design and Implementation: The package huge aims to provide a general framework for high-dimensional undirected graph estimation. The package includes six functional modules (M1-M6) that facilitate a flexible pipeline for analysis (Figure 1: the graph estimation pipeline). M1. Data Generator: The function huge.generator() can simulate multivariate Gaussian data with different undirected graphs, including hub, cluster, band, scale-free, and Erdős-Rényi random graphs. The sparsity level of the obtained graph and the signal-to-noise ratio can also be set by users. M2. Semiparametric Transformation: The function huge.npn() implements the nonparanormal method (Liu et al., 2009, 2012) for estimating a semiparametric Gaussian copula model. The nonparanormal family extends the Gaussian distribution by marginally transforming the variables. Computationally, the nonparanormal transformation only requires one pass through the data matrix. M3. Graph Screening: The scr argument in the main function huge() controls the use of large-scale correlation screening before graph estimation. The function supports the lossless screening (Witten et al., 2011) and the lossy screening. Such screening procedures can greatly reduce the computational cost and achieve equal or even better estimation by reducing the variance at the expense of increased bias. M4. Graph Estimation: Similar to the glasso package, the method argument in the huge() function supports two estimation methods: (i) the neighborhood pursuit algorithm (Meinshausen and Bühlmann, 2006) and (ii) the graphical lasso algorithm (Friedman et al., 2007). We apply the coordinate descent with active set and covariance update, as well as other tricks suggested in Friedman et al. (2010). We modified the warm start trick to address the potential divergence problem of the graphical lasso algorithm (Mazumder and Hastie, 2011). The code is also memory-optimized using the sparse matrix data structure when estimating and storing full regularization paths for large data sets. We also provide a complementary graph estimation method based on thresholding the sample correlation matrix, which is computationally efficient and widely applied in biomedical research. M5. Model Selection: The function huge.select() provides two regularization parameter selection methods: the stability approach for regularization selection (StARS) (Liu et al., 2010) and a rotation information criterion (RIC). We also provide a likelihood-based extended Bayesian information criterion. M6. Graph Visualization: The plotting functions huge.plot() and plot() provide visualizations of the simulated data sets, estimated graphs and paths. The implementation is based on the igraph package. 3. User Interface by Example: We illustrate the user interface by analyzing a stock market data set which we contribute to the huge package. We acquired closing prices of all stocks in the S&P 500 for all days the market was open between Jan 1, 2003 and Jan 1, 2008. This gave us 1258 samples for the 452 stocks that remained in the S&P 500 during the entire time period.

> library(huge)
> data(stockdata)  # Load the data
> x = log(stockdata$data[2:1258,]/stockdata$data[1:1257,])  # Preprocessing
> x.npn = huge.npn(x, npn.func=

6 0.15719159 70 jmlr-2012-Multi-Assignment Clustering for Boolean Data

7 0.15520898 41 jmlr-2012-Exploration in Relational Domains for Model-based Reinforcement Learning

8 0.14977789 2 jmlr-2012-A Comparison of the Lasso and Marginal Regression

9 0.14974113 18 jmlr-2012-An Improved GLMNET for L1-regularized Logistic Regression

10 0.14962095 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms

11 0.14950427 72 jmlr-2012-Multi-Target Regression with Rule Ensembles

12 0.14892364 22 jmlr-2012-Bounding the Probability of Error for High Precision Optical Character Recognition

13 0.14774503 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches

14 0.14714506 42 jmlr-2012-Facilitating Score and Causal Inference Trees for Large Observational Studies

15 0.14673889 21 jmlr-2012-Bayesian Mixed-Effects Inference on Classification Performance in Hierarchical Data Sets

16 0.14626011 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models

17 0.1458002 98 jmlr-2012-Regularized Bundle Methods for Convex and Non-Convex Risks

18 0.14533746 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality

19 0.14356786 116 jmlr-2012-Transfer in Reinforcement Learning via Shared Features

20 0.14288045 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting