nips nips2012 nips2012-66 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Zhitang Chen, Kun Zhang, Laiwan Chan
Abstract: In conventional causal discovery, structural equation models (SEM) are directly applied to the observed variables, meaning that the causal effect can be represented as a function of the direct causes themselves. However, in many real world problems, there are significant dependencies in the variances or energies, which indicates that causality may possibly take place at the level of variances or energies. In this paper, we propose a probabilistic causal scale-mixture model with spatiotemporal variance dependencies to represent a specific type of generating mechanism of the observations. In particular, the causal mechanism including contemporaneous and temporal causal relations in variances or energies is represented by a Structural Vector AutoRegressive model (SVAR). We prove the identifiability of this model under the non-Gaussian assumption on the innovation processes. We also propose algorithms to estimate the involved parameters and discover the contemporaneous causal structure. Experiments on synthetic and real world data are conducted to show the applicability of the proposed model and algorithms.
Reference: text
sentIndex sentText sentNum sentScore
1 Causal discovery with scale-mixture model for spatiotemporal variance dependencies Zhitang Chen* , Kun Zhang† , and Laiwan Chan* * Department of Computer Science and Engineering, Chinese University of Hong Kong, Hong Kong {ztchen,lwchan}@cse. [sent-1, score-0.354]
2 Abstract In conventional causal discovery, structural equation models (SEM) are directly applied to the observed variables, meaning that the causal effect can be represented as a function of the direct causes themselves. [sent-6, score-1.383]
3 However, in many real world problems, there are significant dependencies in the variances or energies, which indicates that causality may possibly take place at the level of variances or energies. [sent-7, score-0.511]
4 In this paper, we propose a probabilistic causal scale-mixture model with spatiotemporal variance dependencies to represent a specific type of generating mechanism of the observations. [sent-8, score-0.991]
5 In particular, the causal mechanism including contemporaneous and temporal causal relations in variances or energies is represented by a Structural Vector AutoRegressive model (SVAR). [sent-9, score-1.992]
6 We also propose algorithms to estimate the involved parameters and discover the contemporaneous causal structure. [sent-11, score-0.975]
7 1 Introduction Causal discovery aims to discover the underlying generating mechanism of the observed data, and consequently, the causal relations allow us to predict the effects of interventions on the system [15, 19]. [sent-13, score-0.866]
8 For example, if we know the causal structure of a stock market, we are able to predict the reactions of other stocks against the sudden collapse of one share price in the market. [sent-14, score-0.785]
9 A traditional way to infer the causal structure is by controlled experiments. [sent-15, score-0.665]
10 However, controlled experiments are often costly or even infeasible in practice; thus, causal discovery from non-experimental data is of great importance and has drawn considerable attention in the past decades [15, 19, 16, 17, 12, 22, 2]. [sent-17, score-0.719]
11 Conventional models such as LiNGAM assume that the causal relations are of a linear form, i.e., [sent-19, score-0.739]
12 if the observed variable x is the cause of another observed variable y, we model the causal relation as y = αx + e, where α is a constant coefficient and e is the additive noise independent of x. [sent-21, score-0.732]
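To make this linear non-Gaussian setup concrete, the following is a minimal simulation sketch (all values are hypothetical, not taken from the paper): the cause and noise are drawn from Laplace distributions and y is generated by the structural equation above. With non-Gaussian data, the regression residual is independent of the regressor only in the true causal direction, which is what LiNGAM-type methods exploit.

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha = 5000, 0.8                      # sample size and causal strength (assumed)

x = rng.laplace(size=n)                   # non-Gaussian cause
e = rng.laplace(scale=0.5, size=n)        # non-Gaussian noise, independent of x
y = alpha * x + e                         # structural equation y = alpha * x + e

# Regress in both directions; only the causal direction yields a residual
# whose energy is (approximately) uncorrelated with the regressor's energy.
res_xy = y - (np.cov(x, y)[0, 1] / np.var(x)) * x   # residual of y on x
res_yx = x - (np.cov(x, y)[0, 1] / np.var(y)) * y   # residual of x on y
print(np.corrcoef(x**2, res_xy**2)[0, 1])           # close to 0
print(np.corrcoef(y**2, res_yx**2)[0, 1])           # noticeably larger
```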
13 However, in many types of natural signals or time series such as MEG/EEG data [23] and financial data [20], a common form of nonlinear dependency, as seen from the correlation in variances or energies, is found [5]. [sent-22, score-0.133]
14 This observation indicates that causality may take place at the level of variances or energies instead of the observed variables themselves. [sent-23, score-0.314]
15 Generally speaking, traditional methods cannot detect this type of causal relation. [sent-24, score-0.665]
16 Another restriction of conventional causal models is that they assume constant variances of the observations; this assumption is unrealistic for data with strong heteroscedasticity [1]. [sent-25, score-0.798]
17 In this paper, we propose a new probabilistic model called the Causal Scale-Mixture model with SpatioTemporal Variance Dependencies (CSM-STVD), incorporating the spatial and temporal variance or energy dependencies among the observed data. [sent-26, score-0.292]
18 The main feature of the new model is that we model the spatiotemporal variance dependencies based on the Structural Vector AutoRegressive (SVAR) model, in particular the Non-Gaussian SVAR [11]. [sent-27, score-0.3]
19 First, we provide an alternative way to model the causal relations among the observations, i.e., causal relations that operate on their variances or energies. [sent-29, score-0.766]
20 In this model, causality takes place at the level of variances or energies, i.e., [sent-32, score-0.221]
21 the variance or energy of one observed series at time instant t0 is influenced by the variances or energies of other variables at time instants t ≤ t0 and by its own past values at time instants t < t0. [sent-34, score-0.371]
22 Thus, both contemporaneous and temporal causal relations in variances are considered. [sent-35, score-1.231]
23 Furthermore, we propose a method which directly estimates such causal structures without explicitly estimating the variances. [sent-37, score-0.665]
24 2 Related work To model the variance or energy dependencies of the observations, a classic method is to use a scale-mixture model [5, 23, 9, 8]. [sent-38, score-0.224]
25 Mathematically, we can represent a signal as si = ui σi , where ui is a signal with zero mean and constant variance, and σi is a positive factor which is independent of ui and modulates the variance or energy of si [5]. [sent-39, score-0.311]
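As an illustration of why such variance (energy) dependencies matter, the sketch below (all values assumed for illustration) builds two scale-mixture signals si = ui σi whose driving signals ui are uncorrelated but whose modulating factors σi share a common slowly varying component: the raw signals are nearly uncorrelated while their energies are strongly correlated.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 10000

# A shared, slowly varying log-variance makes the sigmas (and hence the
# energies) of the two signals dependent, even though u1 and u2 are not.
common = np.zeros(T)
for t in range(1, T):
    common[t] = 0.98 * common[t - 1] + rng.normal(scale=0.1)
sigma1 = np.exp(0.5 * common)
sigma2 = np.exp(0.5 * common + rng.normal(scale=0.1, size=T))

u1, u2 = rng.standard_normal(T), rng.standard_normal(T)   # zero-mean driving signals
s1, s2 = u1 * sigma1, u2 * sigma2                          # scale-mixture signals s_i = u_i * sigma_i

print(np.corrcoef(s1, s2)[0, 1])        # roughly 0: no linear correlation
print(np.corrcoef(s1**2, s2**2)[0, 1])  # clearly positive: energies are dependent
```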
26 In the second stage, the variance dependencies are modeled by applying a linear Non-Gaussian (LiN) SEM to the log-energies of the sources. [sent-52, score-0.168]
27 However, many time series, such as financial time series and brain signals, show strong heteroscedasticity and temporal variance dependencies. [sent-60, score-0.217]
28 Taking temporal variance dependencies into account would improve the quality of the estimated underlying structure of the observed data. [sent-61, score-0.263]
29 The first stage also performs linear separation; they proposed a blind source separation algorithm by exploiting the autocorrelations and time-varying variances of the sources. [sent-63, score-0.193]
30 (2) Two different methods are used to model the dependencies among the variances of the innovations. [sent-68, score-0.27]
31 The advantage of this model is that we are able to estimate the temporal causal structure in variances. [sent-71, score-0.714]
32 However, this model provides no information about the contemporaneous causal relations among the sources if such causal relations indeed exist. [sent-72, score-1.765]
33 The second method to model the variance dependencies is to apply a factor model to the log-energies (log σ²it) of the sources. [sent-73, score-0.168]
34 The disadvantage of this method is that we cannot model the causal relations among the sources which are more interesting to us. [sent-74, score-0.79]
35 In many real world observations, there are causal influences in variances among the observed variables. [sent-75, score-0.895]
36 For instance, there are significant mutual influences among the volatilities of the observed stock prices. [sent-76, score-0.17]
37 We are more interested in investigating the underlying causal structure among the variances of the observed data. [sent-77, score-0.848]
38 Consequently, in this paper, we consider the situation where the correlation in the variances of the observed data is of interest. [sent-78, score-0.156]
39 That is, we focus on modeling the spatiotemporal variance dependencies and the causal mechanism among the observations. [sent-81, score-1.018]
40 In the following sections, we propose our probabilistic model based on SVAR to describe the spatiotemporal variance dependencies among the observations. [sent-82, score-0.327]
41 Our model is, as shown in later sections, closely related to the models introduced in [5, 23], but has significant advantages: (1) both contemporaneous and temporal causal relations can be modeled; (2) this model is fully identifiable under certain assumptions. [sent-83, score-1.098]
42 3 Causal scale-mixture model with spatiotemporal variance dependencies We propose the causal scale-mixture model with spatiotemporal variance dependencies as follows. [sent-84, score-1.34]
43 (3) Here we assume that the ui(t) are temporally independent processes, i.e., [sent-86, score-0.138]
44 ui(tτ1) ⊥ uj(tτ2) for all tτ1 ≠ tτ2; but unlike the basic scale-mixture model, the ui(t) here may be contemporaneously dependent, i.e., ui(t) and uj(t) need not be independent. [sent-88, score-0.138]
45 (4) Here σit > 0 are related to the variances or energies of the observations zt and are assumed to be spatiotemporally dependent. [sent-93, score-0.33]
46 In this paper, we model the spatiotemporal variance dependencies by a Structural Vector AutoRegressive (SVAR) model, i.e., [sent-95, score-0.3]
47 yt = A0 yt + ∑_{τ=1}^{L} Bτ y_{t−τ} + ϵt, (5) where A0 contains the contemporaneous causal strengths among the variances of the observations, i.e., [sent-97, score-1.301]
48 if [A0]ij ≠ 0, we say that yit is contemporaneously affected by yjt; Bτ contains the temporal (time-lag) causal relations, i.e., if [Bτ]ij ≠ 0, then yit is affected by yj,t−τ. [sent-99, score-0.786]
49 Let xt = log |zt| (in this model, we assume that ui(t) never takes the value zero) and ηt = log |ut|. [sent-106, score-0.16]
50 Taking the log of the absolute values of both sides of equation (4), we obtain the following model: xt = yt + ηt, yt = A0 yt + ∑_{τ=1}^{L} Bτ y_{t−τ} + ϵt. (6) [sent-107, score-0.34]
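A minimal simulation of this generative process for L = 1 might look as follows; the dimensions, coefficient values, and noise scales are assumptions for illustration, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(2)
m, T = 3, 5000                                   # number of series and sample size (assumed)

A0 = np.array([[0.0, 0.0, 0.0],                  # strictly lower-triangular contemporaneous effects
               [0.6, 0.0, 0.0],
               [0.3, 0.4, 0.0]])
B1 = 0.4 * np.eye(m)                             # single time-lag matrix (L = 1)
A = np.linalg.inv(np.eye(m) - A0)                # reduced-form mixing (I - A0)^{-1}

y = np.zeros((T, m))                             # y_t = log sigma_t
for t in range(1, T):
    eps = rng.laplace(scale=0.2, size=m)         # non-Gaussian innovations
    y[t] = A @ (B1 @ y[t - 1] + eps)             # y_t = A0 y_t + B1 y_{t-1} + eps_t

u = rng.standard_normal((T, m))                  # temporally independent driving signals
z = u * np.exp(y)                                # observations z_t = u_t * sigma_t
x = np.log(np.abs(z))                            # log-transformed observations x_t = y_t + eta_t
```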
51 4 Parameter learning and causal discovery In this section, we propose an effective algorithm to estimate the contemporaneous causal structure matrix A0 and temporal causal structure matrices Bτ , τ = 1, · · · , L (see (6)). [sent-132, score-2.408]
52 Given the log-transformed observations xt = log |zt|, denote by Rx(k) the autocovariance function of xt at lag k, i.e., Rx(k) = E(xt x_{t+k}^T). [sent-139, score-0.255]
53 Based on the model assumptions A1 and A2, we have the following linear equations in the autocovariances of xt. [sent-140, score-0.131]
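The paper's full equation system (7) is not reproduced in this extract, so the sketch below only illustrates the idea for L = 1: because ηt is temporally independent, the lagged autocovariances of xt coincide with those of yt for k ≥ 1, and AB1 = (I − A0)⁻¹B1 can be recovered from Rx(1) and Rx(2) in Yule–Walker fashion. It reuses the arrays x, A and B1 from the simulation sketch above.

```python
import numpy as np

def autocov(x, k):
    """Empirical lagged autocovariance R_x(k) = E[x_t x_{t+k}^T] of centered data."""
    xc = x - x.mean(axis=0)
    T = xc.shape[0]
    return xc[:T - k].T @ xc[k:] / (T - k)

# For k >= 1 the temporally white component eta_t drops out, so
# R_x(2) = R_x(1) (AB_1)^T and AB_1 = (I - A0)^{-1} B_1 can be solved for directly.
R1, R2 = autocov(x, 1), autocov(x, 2)
AB1_hat = np.linalg.solve(R1, R2).T
print(np.round(AB1_hat, 2))
print(np.round(A @ B1, 2))   # ground truth from the simulation sketch; roughly comparable
```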
54 2 Estimation of A0 The estimates of ABτ (τ = 1, · · · , L) still contain mixed information about the causal structures A0 and Bτ. [sent-164, score-0.665]
55 In order to further obtain the contemporaneous and temporal causal relations, we need to estimate both A0 and Bτ (τ = 1, · · · , L). [sent-165, score-1.024]
56 Substituting yt = xt − ηt into Equations (6), we have xt − ηt = ∑_{τ=1}^{L} ABτ (x_{t−τ} − η_{t−τ}) + Aϵt. [sent-167, score-0.265]
57 The Gaussian assumption on ft can be interpreted hierarchically as a consequence of the central limit theorem, because these factors themselves represent the ensemble effects of numerous factors from the whole environment. [sent-175, score-0.165]
58 To tackle the causal discovery problem of LiNGAM-GC, it was first shown that, if x and y are standardized to unit absolute kurtosis, then |ρ| < 1, based on the assumption that e1 and e2 are both super-Gaussian or both sub-Gaussian. [sent-182, score-0.757]
59 It was shown that the causal direction can then be identified simply by examining the sign of R̃xy. [sent-186, score-0.665]
60 Once the causal direction has been identified, the estimation of causal strength is straightforward. [sent-189, score-1.33]
61 The work can be extended to multivariate causal network discovery following DirectLiNGAM framework [17]. [sent-190, score-0.719]
62 Here we adopt LiNGAM-GC-UK, the algorithm proposed in [2], to find the contemporaneous causal structure matrix Â0. [sent-191, score-0.31]
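The exact statistic R̃xy used by LiNGAM-GC-UK is defined in [2] and is not reproduced here; as an illustrative stand-in, the sketch below uses a simple cumulant-based pairwise statistic in the spirit of the pairwise measures in [7], which prefers the direction x → y when it is positive, assuming super-Gaussian disturbances.

```python
import numpy as np

def pairwise_direction(x, y):
    """Cumulant-based pairwise direction statistic (illustrative, not the exact R_xy of [2]).

    For standardized data and a linear model with super-Gaussian disturbances,
    the value is positive when x -> y and negative when y -> x.
    """
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    rho = np.mean(x * y)
    return np.sign(rho) * (np.mean(x**3 * y) - np.mean(x * y**3))

# Sanity check on a toy pair with known direction x -> y.
rng = np.random.default_rng(3)
x = rng.laplace(size=20000)
y = 0.7 * x + rng.laplace(scale=0.5, size=20000)
print(pairwise_direction(x, y) > 0)   # True: the statistic favours x -> y
```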
63 Algorithm 1 Causal discovery with scale-mixture model for spatiotemporal variance dependencies. 1: Given the observations zt, compute xt = log |zt|. [sent-194, score-0.549]
64 2: Remove the mean of xt, i.e., xt ← xt − x̄t. 3: Choose an appropriate lag L for the SVAR and then estimate ABτ, where A = (I − A0)−1 and τ = 1, · · · , L, using Equations (7). [sent-197, score-0.3]
65 5: Apply the LiNGAM-GC algorithms to ξt and obtain the estimates of A0 and Bτ (τ = 1, · · · , L) and the corresponding contemporaneous and temporal causal orderings. [sent-199, score-0.714]
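Putting the steps together, a rough end-to-end sketch of Algorithm 1 for lag L = 1 could look as follows; it reuses the autocov and pairwise_direction helpers from the sketches above and is only a simplified stand-in for the full LiNGAM-GC-UK procedure.

```python
import numpy as np

def csm_stvd_discover(z):
    """Rough sketch of Algorithm 1 for lag L = 1 (reuses autocov and pairwise_direction).

    z : (T, m) array of observations.  Returns the estimated lagged matrix
    AB1 = (I - A0)^{-1} B1 and a pairwise score matrix from which a
    contemporaneous causal ordering can be read off.
    """
    x = np.log(np.abs(z))                      # step 1: log-transform the observations
    x = x - x.mean(axis=0)                     # step 2: remove the mean
    R1, R2 = autocov(x, 1), autocov(x, 2)      # step 3: estimate AB1 from autocovariances
    AB1 = np.linalg.solve(R1, R2).T
    xi = x[1:] - x[:-1] @ AB1.T                # step 4: residual series carrying the A0 structure
    m = z.shape[1]
    scores = np.zeros((m, m))                  # step 5: pairwise contemporaneous direction scores
    for i in range(m):
        for j in range(i + 1, m):
            s = pairwise_direction(xi[:, i], xi[:, j])
            scores[i, j], scores[j, i] = s, -s
    return AB1, scores

AB1_hat, scores = csm_stvd_discover(z)         # z from the simulation sketch above
```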
66 1 Synthetic data We generate the observations according to the following model: zt = r ⊙ ut ⊙ σt, where r is an m×1 scale vector whose elements are randomly selected from the interval [1. [sent-202, score-0.142]
67 Denoting yt = log σt, we model the spatiotemporal variance dependencies of the observations xt by an SVAR(1): yt = A0 yt + B1 yt−1 + ϵt, where A0 is an m × m strictly lower-triangular matrix whose elements are randomly selected from [0. [sent-212, score-0.708]
68 The task of this experiment is to investigate the performance of our algorithms in estimating the coefficient matrix (I − A0 )−1 B1 and also the contemporaneous causal ordering induced by A0 . [sent-225, score-1.064]
69 We use different algorithms: LiNGAM-GC-UK proposed in [2], C-M proposed in [7] and LiNGAM [16] to estimate the contemporaneous causal structure. [sent-228, score-0.975]
70 We calculate the accuracies of LiNGAM-GC-UK, C-M and LiNGAM in finding (1) the whole causal ordering and (2) the exogenous variable (root) of the causal network. [sent-231, score-1.458]
71 We also calculate the sum of squared errors Err of the causal strength matrix estimated by each algorithm with respect to the true one. [sent-232, score-0.688]
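This evaluation can be phrased as a small helper (hypothetical interface, not the paper's code); "accuracy" here means an exact match of the estimated ordering (or its root) with the true one, and Err is the sum of squared differences between the estimated and true strength matrices.

```python
import numpy as np

def evaluate(A0_true, A0_est, order_true, order_est):
    """Toy evaluation in the spirit of the experiments (hypothetical interface)."""
    whole_ordering_correct = list(order_est) == list(order_true)   # (1) whole causal ordering
    root_correct = order_est[0] == order_true[0]                   # (2) exogenous variable (root)
    err = float(np.sum((A0_est - A0_true) ** 2))                   # sum of squared errors Err
    return whole_ordering_correct, root_correct, err
```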
72 In contrast, LiNGAM-GC-UK still successfully finds the exogenous variable (root), and even the whole contemporaneous causal ordering among the variances of the observations, provided the sample size is sufficiently large. [sent-243, score-1.309]
73 The confounding effects of Dft − (I − A)−1 B1 Dft−1 become more problematic, such that the performance of C-M and LiNGAM is strongly affected by the confounding. [sent-324, score-0.147]
74 In order to investigate the robustness of our methods to violations of the Gaussian assumption on the external factors ft, we conduct the following experiment. [sent-327, score-0.143]
75 The experimental setting is the same as that in the above experiment but here the external factors ft are non-Gaussian, and more specifically fit = sign(nit )|nit |p , where nit ∼ N (0, 0. [sent-328, score-0.252]
76 We investigate the performance of LiNGAM-GC-UK, LiNGAM and C-M in finding the whole causal ordering in different scenarios where p = {0. [sent-331, score-0.807]
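The family fit = sign(nit)|nit|^p conveniently interpolates between sub-Gaussian (p < 1), Gaussian (p = 1) and super-Gaussian (p > 1) factors; the sketch below checks this via the empirical excess kurtosis. The noise scale and the exponents are placeholders, since the paper's exact values are truncated in this extract.

```python
import numpy as np

rng = np.random.default_rng(4)

def external_factor(p, size, scale=0.3):
    """f = sign(n) |n|^p with n ~ N(0, scale^2); the scale is a placeholder value."""
    n = rng.normal(scale=scale, size=size)
    return np.sign(n) * np.abs(n) ** p

for p in (0.5, 1.0, 1.5, 2.0):                       # illustrative exponents, not the paper's grid
    f = external_factor(p, 200000)
    excess_kurtosis = np.mean(f**4) / np.mean(f**2) ** 2 - 3.0
    print(p, round(excess_kurtosis, 2))              # negative, ~0, then increasingly positive
```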
77 This suggests that although LiNGAM-GC is developed based on the assumption that the latent confounders are Gaussian distributed, it is still robust in the scenarios where the latent confounders are mildly non-Gaussian with mild causal strengths. [sent-340, score-0.871]
78 [Figure: accuracy (%) of recovering the whole causal ordering.] [sent-341, score-0.752]
79 Note that because of the time difference, we believe that the causal relations among these stock indices are mainly acyclic, as we assumed in this paper. [sent-357, score-0.968]
80 We apply our proposed model with SVAR(1) to model the spatiotemporal variance dependencies of the data. [sent-359, score-0.3]
81 For the contemporaneous causal structure discovery, we use LiNGAM-GC-UK, C-M, LiNGAM2 and Direct-LiNGAM3 to estimate the causal ordering. [sent-360, score-1.64]
82 The discovered causal orderings of different algorithms are shown in Table 2. [sent-361, score-0.692]
83 That is, DJI and NDX are contemporaneously affected by the indices in Europe. [sent-364, score-0.131]
84 Note that each stock index is given in local time. [sent-365, score-0.12]
85 Because of the time difference between Europe and America and the efficient market hypothesis (the market is quick to absorb new information and adjust stock prices accordingly), the contemporaneous causal relations should be from Europe to America, if they exist. [sent-366, score-1.249]
86 By contrast, the other algorithms suggest that the stock indices in the US are contemporaneously the cause of the indices in Europe, which is difficult to interpret. [sent-371, score-0.285]
87 The contemporaneous causal network of the stock indices is shown in Figure 2. [sent-372, score-1.154]
88 Further interpretation of the discovered causal strengths requires expert knowledge. [sent-373, score-0.692]
89 6 Conclusion In this paper, we investigate the causal discovery problem where causality takes place at the level of variances or energies instead of the observed variables themselves. [sent-374, score-1.058]
90 We propose a causal scale-mixture model with spatiotemporal variance dependencies to describe this type of causal mechanism. [sent-375, score-1.661]
91 In addition, we propose algorithms to estimate the parameters, especially the contemporaneous causal structure of this model. [sent-377, score-0.975]
92 Results using real world data further suggest that our new model can possibly explain the underlying interaction mechanism of major world stock markets. [sent-379, score-0.24]
93 We only show one of the discovered causal orderings here. [sent-382, score-0.756]
94 Causal discovery for linear non-Gaussian acyclic models in the presence of latent Gaussian confounders. [sent-399, score-0.148]
95 Estimation of causal effects using linear non-Gaussian causal models with hidden variables. [sent-423, score-1.354]
96 Pairwise measures of causal direction in linear non-Gaussian acyclic models. [sent-427, score-0.734]
97 Blind separation of sources that have spatiotemporal variance dependencies. [sent-440, score-0.244]
98 A linear non-Gaussian acyclic model for causal discovery. [sent-488, score-0.734]
99 Extensions of ICA for causality discovery in the Hong Kong stock market. [sent-527, score-0.336]
100 Source separation and higher-order causal analysis of MEG and EEG. [sent-539, score-0.717]
wordName wordTfidf (topN-words)
[('causal', 0.665), ('contemporaneous', 0.31), ('lingam', 0.31), ('rx', 0.141), ('svar', 0.14), ('variances', 0.133), ('spatiotemporal', 0.132), ('hyv', 0.131), ('stock', 0.12), ('dependencies', 0.11), ('ab', 0.107), ('ftse', 0.093), ('nit', 0.093), ('xt', 0.091), ('dft', 0.088), ('causality', 0.088), ('yt', 0.083), ('shimizu', 0.082), ('rinen', 0.08), ('confounders', 0.078), ('rxy', 0.078), ('relations', 0.074), ('energies', 0.07), ('ui', 0.069), ('acyclic', 0.069), ('ordering', 0.064), ('ndx', 0.062), ('ft', 0.06), ('indices', 0.059), ('variance', 0.058), ('zt', 0.058), ('dji', 0.055), ('discovery', 0.054), ('hoyer', 0.054), ('identi', 0.053), ('autoregressive', 0.052), ('temporal', 0.049), ('temporally', 0.048), ('world', 0.047), ('contemporaneously', 0.047), ('cyx', 0.047), ('directlingam', 0.047), ('disturbances', 0.047), ('fchi', 0.047), ('garch', 0.047), ('gdaxi', 0.047), ('observations', 0.046), ('europe', 0.045), ('exogenous', 0.041), ('fit', 0.041), ('equations', 0.04), ('ut', 0.038), ('kurtosis', 0.038), ('cxy', 0.038), ('kong', 0.038), ('hong', 0.036), ('nancial', 0.036), ('ei', 0.035), ('zhang', 0.034), ('confounding', 0.034), ('innovations', 0.031), ('hirayama', 0.031), ('instants', 0.031), ('scalemixture', 0.031), ('sogawa', 0.031), ('separation', 0.03), ('structural', 0.03), ('prices', 0.03), ('performances', 0.03), ('stage', 0.03), ('external', 0.029), ('factors', 0.029), ('xy', 0.028), ('lag', 0.027), ('sem', 0.027), ('washio', 0.027), ('among', 0.027), ('discovered', 0.027), ('mechanism', 0.026), ('asian', 0.025), ('kawahara', 0.025), ('market', 0.025), ('latent', 0.025), ('investigate', 0.025), ('affected', 0.025), ('mutually', 0.025), ('energy', 0.025), ('effects', 0.024), ('sources', 0.024), ('ar', 0.024), ('err', 0.024), ('observed', 0.023), ('spatially', 0.023), ('assumed', 0.023), ('estimated', 0.023), ('whole', 0.023), ('hij', 0.022), ('triangular', 0.022), ('meg', 0.022), ('independent', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999946 66 nips-2012-Causal discovery with scale-mixture model for spatiotemporal variance dependencies
Author: Zhitang Chen, Kun Zhang, Laiwan Chan
Abstract: In conventional causal discovery, structural equation models (SEM) are directly applied to the observed variables, meaning that the causal effect can be represented as a function of the direct causes themselves. However, in many real world problems, there are significant dependencies in the variances or energies, which indicates that causality may possibly take place at the level of variances or energies. In this paper, we propose a probabilistic causal scale-mixture model with spatiotemporal variance dependencies to represent a specific type of generating mechanism of the observations. In particular, the causal mechanism including contemporaneous and temporal causal relations in variances or energies is represented by a Structural Vector AutoRegressive model (SVAR). We prove the identifiability of this model under the non-Gaussian assumption on the innovation processes. We also propose algorithms to estimate the involved parameters and discover the contemporaneous causal structure. Experiments on synthetic and real world data are conducted to show the applicability of the proposed model and algorithms.
2 0.10984046 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions
Author: Mathieu Sinn, Bei Chen
Abstract: Conditional Markov Chains (also known as Linear-Chain Conditional Random Fields in the literature) are a versatile class of discriminative models for the distribution of a sequence of hidden states conditional on a sequence of observable variables. Large-sample properties of Conditional Markov Chains have been first studied in [1]. The paper extends this work in two directions: first, mixing properties of models with unbounded feature functions are being established; second, necessary conditions for model identifiability and the uniqueness of maximum likelihood estimates are being given. 1
3 0.090508491 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback
Author: Claudio Gentile, Francesco Orabona
Abstract: We present a novel multilabel/ranking algorithm working in partial information settings. The algorithm is based on 2nd-order descent methods, and relies on upper-confidence bounds to trade-off exploration and exploitation. We analyze this algorithm in a partial adversarial setting, where covariates can be adversarial, but multilabel probabilities are ruled by (generalized) linear models. We show O(T 1/2 log T ) regret bounds, which improve in several ways on the existing results. We test the effectiveness of our upper-confidence scheme by contrasting against full-information baselines on real-world multilabel datasets, often obtaining comparable performance. 1
4 0.078251123 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function
Author: Pedro Ortega, Jordi Grau-moya, Tim Genewein, David Balduzzi, Daniel Braun
Abstract: We propose a novel Bayesian approach to solve stochastic optimization problems that involve finding extrema of noisy, nonlinear functions. Previous work has focused on representing possible functions explicitly, which leads to a two-step procedure of first, doing inference over the function space and second, finding the extrema of these functions. Here we skip the representation step and directly model the distribution over extrema. To this end, we devise a non-parametric conjugate prior based on a kernel regressor. The resulting posterior distribution directly captures the uncertainty over the maximum of the unknown function. Given t observations of the function, the posterior can be evaluated efficiently in time O(t2 ) up to a multiplicative constant. Finally, we show how to apply our model to optimize a noisy, non-convex, high-dimensional objective function.
5 0.076409876 138 nips-2012-Fully Bayesian inference for neural models with negative-binomial spiking
Author: James Scott, Jonathan W. Pillow
Abstract: Characterizing the information carried by neural populations in the brain requires accurate statistical models of neural spike responses. The negative-binomial distribution provides a convenient model for over-dispersed spike counts, that is, responses with greater-than-Poisson variability. Here we describe a powerful data-augmentation framework for fully Bayesian inference in neural models with negative-binomial spiking. Our approach relies on a recently described latentvariable representation of the negative-binomial distribution, which equates it to a Polya-gamma mixture of normals. This framework provides a tractable, conditionally Gaussian representation of the posterior that can be used to design efficient EM and Gibbs sampling based algorithms for inference in regression and dynamic factor models. We apply the model to neural data from primate retina and show that it substantially outperforms Poisson regression on held-out data, and reveals latent structure underlying spike count correlations in simultaneously recorded spike trains. 1
6 0.070822738 324 nips-2012-Stochastic Gradient Descent with Only One Projection
7 0.068860337 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems
8 0.068206504 314 nips-2012-Slice Normalized Dynamic Markov Logic Networks
9 0.063595086 199 nips-2012-Link Prediction in Graphs with Autoregressive Features
10 0.063034773 293 nips-2012-Relax and Randomize : From Value to Algorithms
11 0.059803635 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes
12 0.054692809 195 nips-2012-Learning visual motion in recurrent neural networks
13 0.050750166 292 nips-2012-Regularized Off-Policy TD-Learning
14 0.04734569 277 nips-2012-Probabilistic Low-Rank Subspace Clustering
15 0.0428541 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration
16 0.042735443 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders
17 0.042705193 80 nips-2012-Confusion-Based Online Learning and a Passive-Aggressive Scheme
18 0.042381864 321 nips-2012-Spectral learning of linear dynamics from generalised-linear observations with application to neural population data
19 0.040704351 22 nips-2012-A latent factor model for highly multi-relational data
20 0.040282756 287 nips-2012-Random function priors for exchangeable arrays with applications to graphs and relational data
topicId topicWeight
[(0, 0.117), (1, 0.011), (2, 0.054), (3, 0.121), (4, 0.012), (5, -0.037), (6, -0.001), (7, -0.037), (8, -0.012), (9, 0.002), (10, -0.01), (11, -0.029), (12, -0.017), (13, 0.005), (14, 0.019), (15, 0.025), (16, 0.06), (17, 0.029), (18, 0.018), (19, 0.002), (20, 0.004), (21, -0.005), (22, -0.026), (23, -0.042), (24, 0.01), (25, -0.023), (26, -0.008), (27, -0.038), (28, -0.015), (29, 0.005), (30, -0.002), (31, 0.062), (32, 0.004), (33, 0.046), (34, -0.025), (35, -0.021), (36, -0.015), (37, -0.034), (38, 0.027), (39, 0.041), (40, -0.013), (41, 0.019), (42, -0.016), (43, -0.045), (44, 0.047), (45, 0.06), (46, -0.0), (47, -0.021), (48, -0.033), (49, -0.012)]
simIndex simValue paperId paperTitle
same-paper 1 0.919236 66 nips-2012-Causal discovery with scale-mixture model for spatiotemporal variance dependencies
Author: Zhitang Chen, Kun Zhang, Laiwan Chan
Abstract: In conventional causal discovery, structural equation models (SEM) are directly applied to the observed variables, meaning that the causal effect can be represented as a function of the direct causes themselves. However, in many real world problems, there are significant dependencies in the variances or energies, which indicates that causality may possibly take place at the level of variances or energies. In this paper, we propose a probabilistic causal scale-mixture model with spatiotemporal variance dependencies to represent a specific type of generating mechanism of the observations. In particular, the causal mechanism including contemporaneous and temporal causal relations in variances or energies is represented by a Structural Vector AutoRegressive model (SVAR). We prove the identifiability of this model under the non-Gaussian assumption on the innovation processes. We also propose algorithms to estimate the involved parameters and discover the contemporaneous causal structure. Experiments on synthetic and real world data are conducted to show the applicability of the proposed model and algorithms.
Author: Lars Buesing, Maneesh Sahani, Jakob H. Macke
Abstract: Latent linear dynamical systems with generalised-linear observation models arise in a variety of applications, for instance when modelling the spiking activity of populations of neurons. Here, we show how spectral learning methods (usually called subspace identification in this context) for linear systems with linear-Gaussian observations can be extended to estimate the parameters of a generalised-linear dynamical system model despite a non-linear and non-Gaussian observation process. We use this approach to obtain estimates of parameters for a dynamical model of neural population data, where the observed spike-counts are Poisson-distributed with log-rates determined by the latent dynamical process, possibly driven by external inputs. We show that the extended subspace identification algorithm is consistent and accurately recovers the correct parameters on large simulated data sets with a single calculation, avoiding the costly iterative computation of approximate expectation-maximisation (EM). Even on smaller data sets, it provides an effective initialisation for EM, avoiding local optima and speeding convergence. These benefits are shown to extend to real neural data.
3 0.69072253 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions
Author: Mathieu Sinn, Bei Chen
Abstract: Conditional Markov Chains (also known as Linear-Chain Conditional Random Fields in the literature) are a versatile class of discriminative models for the distribution of a sequence of hidden states conditional on a sequence of observable variables. Large-sample properties of Conditional Markov Chains have been first studied in [1]. The paper extends this work in two directions: first, mixing properties of models with unbounded feature functions are being established; second, necessary conditions for model identifiability and the uniqueness of maximum likelihood estimates are being given. 1
4 0.66599238 138 nips-2012-Fully Bayesian inference for neural models with negative-binomial spiking
Author: James Scott, Jonathan W. Pillow
Abstract: Characterizing the information carried by neural populations in the brain requires accurate statistical models of neural spike responses. The negative-binomial distribution provides a convenient model for over-dispersed spike counts, that is, responses with greater-than-Poisson variability. Here we describe a powerful data-augmentation framework for fully Bayesian inference in neural models with negative-binomial spiking. Our approach relies on a recently described latentvariable representation of the negative-binomial distribution, which equates it to a Polya-gamma mixture of normals. This framework provides a tractable, conditionally Gaussian representation of the posterior that can be used to design efficient EM and Gibbs sampling based algorithms for inference in regression and dynamic factor models. We apply the model to neural data from primate retina and show that it substantially outperforms Poisson regression on held-out data, and reveals latent structure underlying spike count correlations in simultaneously recorded spike trains. 1
5 0.6328488 314 nips-2012-Slice Normalized Dynamic Markov Logic Networks
Author: Tivadar Papai, Henry Kautz, Daniel Stefankovic
Abstract: Markov logic is a widely used tool in statistical relational learning, which uses a weighted first-order logic knowledge base to specify a Markov random field (MRF) or a conditional random field (CRF). In many applications, a Markov logic network (MLN) is trained in one domain, but used in a different one. This paper focuses on dynamic Markov logic networks, where the size of the discretized time-domain typically varies between training and testing. It has been previously pointed out that the marginal probabilities of truth assignments to ground atoms can change if one extends or reduces the domains of predicates in an MLN. We show that in addition to this problem, the standard way of unrolling a Markov logic theory into a MRF may result in time-inhomogeneity of the underlying Markov chain. Furthermore, even if these representational problems are not significant for a given domain, we show that the more practical problem of generating samples in a sequential conditional random field for the next slice relying on the samples from the previous slice has high computational cost in the general case, due to the need to estimate a normalization factor for each sample. We propose a new discriminative model, slice normalized dynamic Markov logic networks (SN-DMLN), that suffers from none of these issues. It supports efficient online inference, and can directly model influences between variables within a time slice that do not have a causal direction, in contrast with fully directed models (e.g., DBNs). Experimental results show an improvement in accuracy over previous approaches to online inference in dynamic Markov logic networks. 1
6 0.60963851 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback
7 0.60735846 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function
8 0.59544182 80 nips-2012-Confusion-Based Online Learning and a Passive-Aggressive Scheme
9 0.59201711 293 nips-2012-Relax and Randomize : From Value to Algorithms
10 0.58873445 11 nips-2012-A Marginalized Particle Gaussian Process Regression
11 0.5763135 292 nips-2012-Regularized Off-Policy TD-Learning
12 0.54548228 72 nips-2012-Cocktail Party Processing via Structured Prediction
13 0.53629577 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems
14 0.49221537 199 nips-2012-Link Prediction in Graphs with Autoregressive Features
15 0.48976472 309 nips-2012-Semi-supervised Eigenvectors for Locally-biased Learning
16 0.4876405 23 nips-2012-A lattice filter model of the visual pathway
17 0.48443115 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes
18 0.47907066 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions
19 0.47766867 52 nips-2012-Bayesian Nonparametric Modeling of Suicide Attempts
20 0.4752689 195 nips-2012-Learning visual motion in recurrent neural networks
topicId topicWeight
[(0, 0.067), (21, 0.021), (38, 0.082), (39, 0.021), (42, 0.019), (44, 0.016), (54, 0.025), (55, 0.028), (74, 0.027), (76, 0.127), (80, 0.092), (92, 0.053), (96, 0.327)]
simIndex simValue paperId paperTitle
same-paper 1 0.74585903 66 nips-2012-Causal discovery with scale-mixture model for spatiotemporal variance dependencies
Author: Zhitang Chen, Kun Zhang, Laiwan Chan
Abstract: In conventional causal discovery, structural equation models (SEM) are directly applied to the observed variables, meaning that the causal effect can be represented as a function of the direct causes themselves. However, in many real world problems, there are significant dependencies in the variances or energies, which indicates that causality may possibly take place at the level of variances or energies. In this paper, we propose a probabilistic causal scale-mixture model with spatiotemporal variance dependencies to represent a specific type of generating mechanism of the observations. In particular, the causal mechanism including contemporaneous and temporal causal relations in variances or energies is represented by a Structural Vector AutoRegressive model (SVAR). We prove the identifiability of this model under the non-Gaussian assumption on the innovation processes. We also propose algorithms to estimate the involved parameters and discover the contemporaneous causal structure. Experiments on synthetic and real world data are conducted to show the applicability of the proposed model and algorithms.
2 0.70176125 157 nips-2012-Identification of Recurrent Patterns in the Activation of Brain Networks
Author: Firdaus Janoos, Weichang Li, Niranjan Subrahmanya, Istvan Morocz, William Wells
Abstract: Identifying patterns from the neuroimaging recordings of brain activity related to the unobservable psychological or mental state of an individual can be treated as an unsupervised pattern recognition problem. The main challenges, however, for such an analysis of fMRI data are: a) defining a physiologically meaningful feature-space for representing the spatial patterns across time; b) dealing with the high-dimensionality of the data; and c) robustness to the various artifacts and confounds in the fMRI time-series. In this paper, we present a network-aware feature-space to represent the states of a general network, that enables comparing and clustering such states in a manner that is a) meaningful in terms of the network connectivity structure; b) computationally efficient; c) low-dimensional; and d) relatively robust to structured and random noise artifacts. This feature-space is obtained from a spherical relaxation of the transportation distance metric which measures the cost of transporting “mass” over the network to transform one function into another. Through theoretical and empirical assessments, we demonstrate the accuracy and efficiency of the approximation, especially for large problems.
3 0.58653224 206 nips-2012-Majorization for CRFs and Latent Likelihoods
Author: Tony Jebara, Anna Choromanska
Abstract: The partition function plays a key role in probabilistic modeling including conditional random fields, graphical models, and maximum likelihood estimation. To optimize partition functions, this article introduces a quadratic variational upper bound. This inequality facilitates majorization methods: optimization of complicated functions through the iterative solution of simpler sub-problems. Such bounds remain efficient to compute even when the partition function involves a graphical model (with small tree-width) or in latent likelihood settings. For large-scale problems, low-rank versions of the bound are provided and outperform LBFGS as well as first-order methods. Several learning applications are shown and reduce to fast and convergent update rules. Experimental results show advantages over state-of-the-art optimization methods. This supplement presents additional details in support of the full article. These include the application of the majorization method to maximum entropy problems. It also contains proofs of the various theorems, in particular, a guarantee that the bound majorizes the partition function. In addition, a proof is provided guaranteeing convergence on (non-latent) maximum conditional likelihood problems. The supplement also contains supporting lemmas that show the bound remains applicable in constrained optimization problems. The supplement then proves the soundness of the junction tree implementation of the bound for graphical models with large n. It also proves the soundness of the low-rank implementation of the bound for problems with large d. Finally, the supplement contains additional experiments and figures to provide further empirical support for the majorization methodology. Supplement for Section 2 Proof of Theorem 1 Rewrite the partition function as a sum over the integer index j = 1, . . . , n under the random ordering π : Ω → {1, . . . , n}. This defines j∑ π(y) and associates h and f with = n hj = h(π −1 (j)) and fj = f (π −1 (j)). Next, write Z(θ) = j=1 αj exp(λ⊤ fj ) by introducing ˜ ˜ λ = θ − θ and αj = hj exp(θ ⊤ fj ). Define the partition function over only the first i components ∑i as Zi (θ) = j=1 αj exp(λ⊤ fj ). When i = 0, a trivial quadratic upper bound holds ( ) Z0 (θ) ≤ z0 exp 1 λ⊤ Σ0 λ + λ⊤ µ0 2 with the parameters z0 → 0+ , µ0 = 0, and Σ0 = z0 I. Next, add one term to the current partition function Z1 (θ) = Z0 (θ) + α1 exp(λ⊤ f1 ). Apply the current bound Z0 (θ) to obtain 1 Z1 (θ) ≤ z0 exp( 2 λ⊤ Σ0 λ + λ⊤ µ0 ) + α1 exp(λ⊤ f1 ). Consider the following change of variables u γ 1/2 −1/2 = Σ0 λ − Σ0 = α1 z0 exp( 1 (f1 2 (f1 − µ0 )) − µ0 )⊤ Σ−1 (f1 − µ0 )) 0 and rewrite the logarithm of the bound as log Z1 (θ) ( ) 1 ≤ log z0 − 1 (f1 − µ0 )⊤ Σ−1 (f1 − µ0 ) + λ⊤ f1 + log exp( 2 ∥u∥2 ) + γ . 0 2 Apply Lemma 1 (cf. [32] p. 100) to the last term to get ( (1 ) ) log Z1 (θ) ≤ log z0 − 1 (f1 − µ0 )⊤ Σ−1 (f1 − µ0 ) + λ⊤ f1 + log exp 2 ∥v∥2 +γ 0 2 ( ) v⊤ (u − v) 1 + + (u − v)⊤ I + Γvv⊤ (u − v) 1+γ exp(− 1 ∥v∥2 ) 2 2 10 where Γ = 1 1 tanh( 2 log(γ exp(− 2 ∥v∥2 ))) . 1 2 log(γ exp(− 2 ∥v∥2 )) The bound in [32] is tight when u = v. To achieve tightness −1/2 ˜ when θ = θ or, equivalently, λ = 0, we choose v = Σ0 (µ0 − f1 ) yielding ( ) Z1 (θ) ≤ z1 exp 1 λ⊤ Σ1 λ + λ⊤ µ1 2 where we have z1 = z0 + α1 z0 α1 = µ0 + f1 z0 + α1 z0 + α1 1 tanh( 2 log(α1 /z0 )) = Σ0 + (µ0 − f1 )(µ0 − f1 )⊤ . 2 log(α1 /z0 ) µ1 Σ1 This rule updates the bound parameters z0 , µ0 , Σ0 to incorporate an extra term in the sum over i in Z(θ). 
The process is iterated n times (replacing 1 with i and 0 with i − 1) to produce an overall bound on all terms. Lemma 1 (See [32] p. 100) ( ( ) ) For all u ∈ Rd , any v ∈ Rd and any γ ≥ 0, the bound log exp 1 ∥u∥2 + γ ≤ 2 ( ( ) ) log exp 1 ∥v∥2 + γ + 2 ( ) v⊤ (u − v) 1 + (u − v)⊤ I + Γvv⊤ (u − v) 1 1 + γ exp(− 2 ∥v∥2 ) 2 holds when the scalar term Γ = 1 tanh( 2 log(γ exp(−∥v∥2 /2))) . 2 log(γ exp(−∥v∥2 /2)) Equality is achieved when u = v. Proof of Lemma 1 The proof is provided in [32]. Supplement for Section 3 Maximum entropy problem We show here that partition functions arise naturally in maximum ∑ p(y) entropy estimation or minimum relative entropy RE(p∥h) = y p(y) log h(y) estimation. Consider the following problem: ∑ ∑ min RE(p∥h) s.t. p(y)f (y) = 0, p(y)g(y) ≥ 0. p y y d′ Here, assume that f : Ω → R and g : Ω → R are arbitrary (non-constant) vector-valued functions ( ) over the sample space. The solution distribution p(y) = h(y) exp θ ⊤ f (y) + ϑ⊤ g(y) /Z(θ, ϑ) is recovered by the dual optimization ∑ ( ) arg θ, ϑ = max − log h(y) exp θ ⊤ f (y) + ϑ⊤ g(y) d ϑ≥0,θ y ′ where θ ∈ Rd and ϑ ∈ Rd . These are obtained by minimizing Z(θ, ϑ) or equivalently by maximizing its negative logarithm. Algorithm 1 permits variational maximization of the dual via the quadratic program min 1 (β ϑ≥0,θ 2 ˜ ˜ − β)⊤ Σ(β − β) + β ⊤ µ ′ where β ⊤ = [θ ⊤ ϑ⊤ ]. Note that any general convex hull of constraints β ∈ Λ ⊆ Rd+d could be imposed without loss of generality. Proof of Theorem 2 We begin by proving a lemma that will be useful later. Lemma 2 If κΨ ≽ Φ ≻ 0 for Φ, Ψ ∈ Rd×d , then 1 ˜ ˜ ˜ L(θ) = − 2 (θ − θ)⊤ Φ(θ − θ) − (θ − θ)⊤ µ ˜ ˜ ˜ U (θ) = − 1 (θ − θ)⊤ Ψ(θ − θ) − (θ − θ)⊤ µ 2 satisfy supθ∈Λ L(θ) ≥ 1 κ ˜ supθ∈Λ U (θ) for any convex Λ ⊆ Rd , θ ∈ Λ, µ ∈ Rd and κ ∈ R+ . 11 Proof of Lemma 2 Define the primal problems of interest as PL = supθ∈Λ L(θ) and PU = supθ∈Λ U (θ). The constraints θ ∈ Λ can be summarized by a set of linear inequalities Aθ ≤ b where A ∈ Rk×d and b ∈ Rk for some (possibly infinite) k ∈ Z. Apply the change of variables ˜ ˜ ˜ ˜ ˜ ˜ z = θ− θ. The constraint A(z+ θ) ≤ b simplifies into Az ≤ b where b = b−Aθ. Since θ ∈ Λ, it 1 ⊤ ˜ ≥ 0. We obtain the equivalent primal problems PL = sup is easy to show that b ˜ − z Φz − Az≤b z⊤ µ and PU = supAz≤b − 1 z⊤ Ψz − z⊤ µ. The corresponding dual problems are ˜ 2 2 ⊤ −1 y⊤AΦ−1A⊤y ˜ µ Φ µ +y⊤AΦ−1µ+y⊤b+ y≥0 2 2 y⊤AΨ−1 A⊤y µ⊤Ψ−1µ ˜ DU = inf +y⊤AΨ−1µ+y⊤b+ . y≥0 2 2 DL = inf ˜ Due to strong duality, PL = DL and PU = DU . Apply the inequalities Φ ≼ κΨ and y⊤ b > 0 as ⊤ −1 y⊤AΨ−1 A⊤y y⊤AΨ−1 µ κ ˜ µΨ µ + + y⊤b + PL ≥ sup − z⊤ Ψz − z⊤ µ = inf y≥0 2 2κ κ 2κ ˜ Az≤b 1 1 ≥ DU = PU . κ κ 1 This proves that PL ≥ κ PU . We will use the above to prove Theorem 2. First, we will upper-bound (in the Loewner ordering sense) the matrices Σj in Algorithm 2. Since ∥fxj (y)∥2 ≤ r for all y ∈ Ωj and since µj in Algorithm 1 is a convex combination of fxj (y), the outer-product terms in the update for Σj satisfy (fxj (y) − µ)(fxj (y) − µ)⊤ ≼ 4r2 I. Thus, Σj ≼ F(α1 , . . . , αn )4r2 I holds where α 1 F(α1 , . . . , αn ) = i n ∑ tanh( 2 log( ∑i−1 αk )) k=1 αi 2 log( ∑i−1 α ) i=2 k k=1 using the definition of α1 , . . . , αn in the proof of Theorem 1. The formula for F starts at i = 2 since z0 → 0+ . Assume permutation π is sampled uniformly at random. The expected value of F is then α π(i) 1 n 1 ∑ ∑ tanh( 2 log( ∑i−1 απ(k) )) k=1 . Eπ [F(α1 , . . . , αn )] = απ(i) n! π i=2 ) 2 log( ∑i−1 k=1 απ(k) We claim that the expectation is maximized when all αi = 1 or any positive constant. 
Also, F is invariant under uniform scaling of its arguments. Write the expected value of F as E for short. ∂E ∂E ∂E Next, consider ∂αl at the setting αi = 1, ∀i. Due to the expectation over π, we have ∂αl = ∂αo for any l, o. Therefore, the gradient vector is constant when all αi = 1. Since F(α1 , . . . , αn ) is invariant to scaling, the gradient vector must therefore be the all zeros vector. Thus, the point ∂ ∂E when all αi = 1 is an extremum or a saddle. Next, consider ∂αo ∂αl for any l, o. At the setting 2 ∂ ∂E αi = 1, ∂ E = −c(n) and, ∂αo ∂αl = c(n)/(n − 1) for some non-negative constant function ∂α2 l c(n). Thus, the αi = 1 extremum is locally concave and is a maximum. This establishes that Eπ [F(α1 , . . . , αn )] ≤ Eπ [F(1, . . . , 1)] and yields the Loewner bound ) ( n−1 ∑ tanh(log(i)/2) 2 I = ωI. Σj ≼ 2r log(i) i=1 Apply this bound to each Σj in the lower bound on J(θ) and also note a corresponding upper bound ∑ ˜ ˜ ˜ J(θ) ≥ J(θ)−tω+tλ ∥θ − θ∥2− (θ − θ)⊤(µj −fxj (yj )) 2 j ∑ ˜ ˜ ˜ J(θ) ≤ J(θ)−tλ ∥θ − θ∥2− (θ − θ)⊤(µj −fxj (yj )) 2 j 12 ˜ which follows from Jensen’s inequality. Define the current θ at time τ as θτ and denote by Lτ (θ) the above lower bound and by Uτ (θ) the above upper bound at time τ . Clearly, Lτ (θ) ≤ J(θ) ≤ Uτ (θ) with equality when θ = θτ . Algorithm 2 maximizes J(θ) after initializing at θ0 and performing an update by maximizing a lower bound based on Σj . Since Lτ (θ) replaces the definition of Σj with ωI ≽ Σj , Lτ (θ) is a looser bound than the one used by Algorithm 2. Thus, performing θτ +1 = arg maxθ∈Λ Lτ (θ) makes less progress than a step of Algorithm 1. Consider computing the slower update at each iteration τ and returning θτ +1 = arg maxθ∈Λ Lτ (θ). Setting Φ = (tω +tλ)I, Ψ = tλI and κ = ω+λ allows us to apply Lemma 2 as follows λ sup Lτ (θ) − Lτ (θτ ) = θ∈Λ 1 sup Uτ (θ) − Uτ (θτ ). κ θ∈Λ Since Lτ (θτ ) = J(θτ ) = Uτ (θτ ), J(θτ +1 ) ≥ supθ∈Λ Lτ (θ) and supθ∈Λ Uτ (θ) ≥ J(θ ∗ ), we obtain ( ) 1 J(θτ +1 ) − J(θ ∗ ) ≥ 1− (J(θτ ) − J(θ ∗ )) . κ Iterate the above inequality starting at t = 0 to obtain ( )τ 1 ∗ J(θτ ) − J(θ ) ≥ 1− (J(θ0 ) − J(θ ∗ )) . κ ( ) 1 τ κ A solution within a multiplicative factor of ϵ implies that ϵ = 1 − κ or log(1/ϵ) = τ log κ−1 . ⌉ ⌈ Inserting the definition for κ shows that the number of iterations τ is at most log(1/ϵ) or κ log κ−1 ⌉ ⌈ log(1/ϵ) log(1+λ/ω) . Inserting the definition for ω gives the bound. Y12,0 Y11,1 Y21,1 Y31,1 ··· 1,1 Ym1,1 Figure 3: Junction tree of depth 2. Algorithm 5 SmallJunctionTree ˜ Input Parameters θ and h(u), f (u) ∀u ∈ Y12,0 and zi , Σi , µi ∀i = 1, . . . , m1,1 + Initialize z → 0 , µ = 0, Σ = zI For each configuration u ∈ Y12,0 { ∏m1,1 ∑m1,1 ∏m1,1 ˜ ˜ ˜ α = h(u)( ∑ zi exp(−θ ⊤ µi )) exp(θ ⊤ (f (u) + i=1 µi )) = h(u) exp(θ ⊤ f (u)) i=1 zi i=1 m1,1 l = f (u) + i=1 µi − µ 1 ∑m1,1 tanh( 2 log(α/z)) ⊤ Σ + = i=1 Σi + ll 2 log(α/z) α µ + = z+α l z += α } Output z, µ, Σ Supplement for Section 5 Proof of correctness for Algorithm 3 Consider a simple junction tree of depth 2 shown on Figure 3. The notation Yca,b refers to the cth tree node located at tree level a (first level is considered as the one with∑ leaves) whose parent is the bth from the higher tree level (the root has no parent so b = 0). tree ∑ Let Y a1 ,b1 refer to the sum over all configurations of variables in Yca1 ,b1 and Y a1 ,b1 \Y a2 ,b2 1 c1 c1 c2 refers to the sum over all configurations of variables that are in Yca1 ,b1 but not in Yca2 ,b2 . Let ma,b 1 2 denote the number of children of the bth node located at tree level a + 1. 
For short-hand, use ψ(Y ) = h(Y ) exp(θ ⊤ f (Y )). The partition function can be expressed as: 13 Y13,0 Y12,1 ··· Y11,1 Y21,1 ··· Y22,1 1,1 Ym1,1 ··· Y11,2 Y21,2 1,2 Ym1,2 2,1 Ym2,1 1,m2,1 Y1 ··· 1,m2,1 Y2 1,m 2,1 Ym1,m2,1 Figure 4: Junction tree of depth 3. Z(θ) = ∑ 2,0 u∈Y1 ≤ ∑ 2,0 u∈Y1 = ∑ ψ(u) ∏ m1,1 i=1 [ ∏ m1,1 ψ(u) i=1 [ ∑ ψ(v) 2,0 v∈Yi1,1 \Y1 ) 1( ˜ ˜ ˜ zi exp( θ − θ)⊤ Σi (θ − θ) + (θ − θ)⊤ µi 2 ∏ ( m1,1 ⊤ h(u) exp(θ f (u)) 2,0 u∈Y1 zi exp i=1 ] 1 ˜ ˜ ˜ (θ − θ)⊤ Σi (θ − θ) + (θ − θ)⊤ µi 2 )] ∑ where the upper-bound is obtained by applying Theorem 1 to each of the terms v∈Y 1,1 \Y 2,0 ψ(v). 1 i By simply rearranging terms we get: ) ( ( [ (m1,1 )) m1,1 ∏ ∑ ∑ ˜ zi exp(−θ ⊤ µi ) exp θ ⊤ f (u) + µi h(u) Z(θ) ≤ 2,0 u∈Y1 i=1 ( exp 1 ˜ (θ − θ)⊤ 2 (m1,1 ∑ ) Σi i=1 ˜ (θ − θ) )] . i=1 One ( can prove that this ) expression can be upper-bounded by 1 ˆ ⊤ Σ(θ − θ) + (θ − θ)⊤ µ where z, Σ and µ can be computed using Algoˆ ˆ z exp 2 (θ − θ) rithm 5 (a simplification of Algorithm 3). We will call this result Lemma A. The proof is similar to the proof of Theorem 1 so is not repeated here. Consider enlarging the tree to a depth 3 as shown on Figure 4. The partition function is now m2,1 m1,i ∑ ∏ ∑ ∏ ∑ Z(θ) = ψ(w) . ψ(u) ψ(v) 3,0 u∈Y1 i=1 3,0 v∈Yi2,1 \Y1 j=1 w∈Yj1,i \Yi2,1 ( )) ∏m1,i (∑ ∑ term By Lemma A we can upper bound each v∈Y 2,1 \Y 3,0 ψ(v) j=1 w∈Yj1,i \Yi2,1 ψ(w) 1 i ( ) ˆ ˆ ˆ by the expression zi exp 1 (θ − θ)⊤ Σi (θ − θ) + (θ − θ)⊤ µi . This yields 2 [ )] ( m2,1 ∑ ∏ 1 ˜ ˜ ˜ Z(θ) ≤ ψ(u) (θ − θ)⊤ Σi (θ − θ) + (θ − θ)⊤ µi . zi exp 2 3,0 i=1 u∈Y1 2,1 2,1 2,1 This process can be viewed as collapsing the sub-trees S1 , S2 , . . ., Sm2,1 to super-nodes that are represented by bound parameters, zi , Σi and µi , i = {1, 2, · · · , m2,1 }, where the sub-trees are 14 defined as: 2,1 S1 = 1,1 {Y12,1 , Y11,1 , Y21,1 , Y31,1 , . . . , Ym1,1 } 2,1 S2 = 1,2 {Y22,1 , Y11,2 , Y21,2 , Y31,2 , . . . , Ym1,2 } = 2,1 {Ym2,1 , Y1 . . . 2,1 Sm2,1 1,m2,1 1,m2,1 , Y2 1,m2,1 , Y3 1,m2,1 , . . . , Ym1,m2,1 }. Notice that the obtained expression can be further upper bounded using again Lemma A (induction) ( ) ˆ ˆ ˆ yielding a bound of the form: z exp 1 (θ − θ)⊤ Σ(θ − θ) + (θ − θ)⊤ µ . 2 Finally, for a general tree, follow the same steps described above, starting from leaves and collapsing nodes to super-nodes, each represented by bound parameters. This procedure effectively yields Algorithm 3 for the junction tree under consideration. Supplement for Section 6 Proof of correctness for Algorithm 4 We begin by proving a lemma that will be useful later. Lemma 3 For all x ∈ Rd and for all l ∈ Rd , 2 d d 2 ∑ ∑ l(i) . x(i)2 l(i)2 ≥ x(i) √∑ d l(j)2 i=1 i=1 j=1 Proof of Lemma 3 By Jensen’s inequality, 2 ( d )2 d d ∑ x(i)l(i)2 ∑ ∑ x(i)l(i)2 . √∑ x(i)2 ∑d ⇐⇒ x(i)2 l(i)2 ≥ ≥ ∑d d l(j)2 l(j)2 l(j)2 j=1 j=1 i=1 i=1 i=1 i=1 d ∑ l(i)2 j=1 Now we prove the correctness of Algorithm 4. At the ith iteration, the algorithm stores Σi using ⊤ a low-rank representation Vi Si Vi + Di where Vi ∈ Rk×d is orthonormal, Si ∈ Rk×k positive d×d semi-definite and Di ∈ R is non-negative diagonal. The diagonal terms D are initialized to tλI where λ is the regularization term. To mimic Algorithm 1 we must increment the Σ matrix by a rank one update of the form Σi = Σi−1 + ri r⊤ . By projecting ri onto each eigenvector in V, we i ∑k ⊤ can decompose it as ri = j=1 Vi−1 (j, ·)ri Vi−1 (j, ·)⊤ + g = Vi−1 Vi−1 ri + g where g is the remaining residue. 
Thus the update rule can be rewritten as: Σi ⊤ ⊤ ⊤ = Σi−1 + ri r⊤ = Vi−1 Si−1 Vi−1 + Di−1 + (Vi−1 Vi−1 ri + g)(Vi−1 Vi−1 ri + g)⊤ i ′ ′ ′ ⊤ ⊤ ⊤ = Vi−1 (Si−1 + Vi−1 ri r⊤ Vi−1 )Vi−1 + Di−1 + gg⊤ = Vi−1 Si−1 Vi−1 + gg⊤ + Di−1 i ′ where we define Vi−1 = Qi−1 Vi−1 and defined Qi−1 in terms of the singular value decomposition, ′ ′ ⊤ Q⊤ Si−1 Qi−1 = svd(Si−1 + Vi−1 ri r⊤ Vi−1 ). Note that Si−1 is diagonal and nonnegative by i−1 i construction. The current formula for Σi shows that we have a rank (k + 1) system (plus diagonal term) which needs to be converted back to a rank k system (plus diagonal term) which we denote by ′ Σi . We have two options as follows. Case 1) Remove g from Σi to obtain ′ ′ ′ ′ ⊤ Σi = Vi−1 Si−1 Vi−1 + Di−1 = Σi − gg⊤ = Σi − cvv⊤ 1 where c = ∥g∥2 and v = ∥g∥ g. th Case 2) Remove the m (smallest) eigenvalue in S′ and its corresponding eigenvector: i−1 ′ Σi ′ ′ ′ ′ ′ ′ ⊤ = Vi−1 Si−1 Vi−1 + Di−1 + gg⊤ − S (m, m)V (m, ·)⊤ V (m, ·) = Σi − cvv⊤ ′ ′ where c = S (m, m) and v = V(m, ·) . 15 ′ Clearly, both cases can be written as an update of the form Σi = Σi + cvv⊤ where c ≥ 0 and v⊤ v = 1. We choose the case with smaller c value to minimize the change as we drop from a system of order (k + 1) to order k. Discarding the smallest singular value and its corresponding eigenvector would violate the bound. Instead, consider absorbing this term into the diagonal component to ′′ ′ preserve the bound. Formally, we look for a diagonal matrix F such that Σi = Σi + F which also ′′ maintains x⊤ Σi x ≥ x⊤ Σi x for all x ∈ Rd . Thus, we want to satisfy: ( d )2 d ∑ ∑ ⊤ ′′ ⊤ ⊤ ⊤ ⊤ x Σi x ≥ x Σi x ⇐⇒ x cvv x ≤ x Fx ⇐⇒ c x(i)v(i) ≤ x(i)2 F(i) i=1 i=1 where, for ease of notation, we take F(i) = F(i, i). ′ where w = v⊤ 1. Consider the case where v ≥ 0 though we will soon get rid of (∑ )2 ∑d d this assumption. We need an F such that i=1 x(i)2 F(i) ≥ c i=1 x(i)v(i) . Equivalently, we (∑ ) ∑d ′ 2 ′ d need i=1 x(i)2 F(i) ≥ . Define F(i) = F(i) for all i = 1, . . . , d. So, we need 2 i=1 x(i)v(i) cw cw2 (∑ ) ∑d ′ ′ ′ 2 d an F such that i=1 x(i)2 F(i) ≥ . Using Lemma 3 it is easy to show that we i=1 x(i)v(i) Define v = 1 wv ′ ′ ′ may choose F (i) = v(i) . Thus, we obtain F(i) = cw2 F(i) = cwv(i). Therefore, for all x ∈ Rd , ∑d all v ≥ 0, and for F(i) = cv(i) j=1 v(j) we have d ∑ ( x(i) F(i) ≥ c 2 i=1 d ∑ )2 x(i)v(i) . (3) i=1 To generalize the inequality to hold for all vectors v ∈ Rd with potentially negative entries, it is ∑d sufficient to set F(i) = c|v(i)| j=1 |v(j)|. To verify this, consider flipping the sign of any v(i). The left side of the Inequality 3 does not change. For the right side of this inequality, flipping the sign of v(i) is equivalent to flipping the sign of x(i) and not changing the sign of v(i). However, in this case the inequality holds as shown before (it holds for any x ∈ Rd ). Thus for all x, v ∈ Rd and ∑d for F(i) = c|v(i)| j=1 |v(j)|, Inequality 3 holds. Supplement for Section 7 Small scale experiments In additional small-scale experiments, we compared Algorithm 2 with steepest descent (SD), conjugate gradient (CG), BFGS and Newton-Raphson. Small-scale problems may be interesting in real-time learning settings, for example, when a website has to learn from a user’s uploaded labeled data in a split second to perform real-time retrieval. We considered logistic regression on five UCI data sets where missing values were handled via mean-imputation. 
A range of regularization settings λ ∈ {100 , 102 , 104 } was explored and all algorithms were initialized from the same ten random start-points. Table 3 shows the average number of seconds each algorithm needed to achieve the same solution that BFGS converged to (all algorithms achieve the same solution due to concavity). The bound is the fastest algorithm as indicated in bold. data|λ BFGS SD CG Newton Bound a|100 1.90 1.74 0.78 0.31 0.01 a|102 0.89 0.92 0.83 0.25 0.01 a|104 2.45 1.60 0.85 0.22 0.01 b|100 3.14 2.18 0.70 0.43 0.07 b|102 2.00 6.17 0.67 0.37 0.04 b|104 1.60 5.83 0.83 0.35 0.04 c|100 4.09 1.92 0.65 0.39 0.07 c|102 1.03 0.64 0.64 0.34 0.02 c|104 1.90 0.56 0.72 0.32 0.02 d|100 5.62 12.04 1.36 0.92 0.16 d|102 2.88 1.27 1.21 0.63 0.09 d|104 3.28 1.94 1.23 0.60 0.07 e|100 2.63 2.68 0.48 0.35 0.03 e|102 2.01 2.49 0.55 0.26 0.03 e|104 1.49 1.54 0.43 0.20 0.03 Table 3: Convergence time in seconds under various regularization levels for a) Bupa (t = 345, dim = 7), b) Wine (t = 178, dim = 14), c) Heart (t = 187, dim = 23), d) Ion (t = 351, dim = 34), and e) Hepatitis (t = 155, dim = 20) data sets. Influence of rank k on bound performance in large scale experiments We also examined the influence of k on bound performance and compared it with LBFGS, SD and CG. Several choices 16 of k were explored. Table 4 shows results for the SRBCT data-set. In general, the bound performs best but slows down for superfluously large values of k. Steepest descent and conjugate gradient are slow yet obviously do not vary with k. Note that each iteration takes less time with smaller k for the bound. However, we are reporting overall runtime which is also a function of the number of iterations. Therefore, total runtime (a function of both) may not always decrease/increase with k. k LBFGS SD CG Bound 1 1.37 8.80 4.39 0.56 2 1.32 8.80 4.39 0.56 4 1.39 8.80 4.39 0.67 8 1.35 8.80 4.39 0.96 16 1.46 8.80 4.39 1.34 32 1.40 8.80 4.39 2.11 64 1.54 8.80 4.39 4.57 Table 4: Convergence time in seconds as a function of k. Additional latent-likelihood results For completeness, Figure 5 depicts two additional data-sets to complement Figure 2. Similarly, Table 5 shows all experimental settings explored in order to provide the summary Table 2 in the main article. bupa wine −19 0 −5 −log(J(θ)) −log(J(θ)) −20 −21 −22 Bound Newton BFGS Conjugate gradient Steepest descent −15 −23 −24 −5 −10 0 5 log(Time) [sec] 10 −20 −4 −2 0 2 4 log(Time) [sec] 6 8 Figure 5: Convergence of test latent log-likelihood for bupa and wine data-sets. Data-set Algorithm BFGS SD CG Newton Bound ion m=1 m=2m=3m=4 -4.96 -5.55 -5.88 -5.79 -11.80 -9.92 -5.56 -8.59 -5.47 -5.81 -5.57 -5.22 -5.95 -5.95 -5.95 -5.95 -6.08 -4.84 -4.18 -5.17 Data-set Algorithm BFGS SD CG Newton Bound bupa m=1 m=2 m=3 m=4 -22.07 -21.78 -21.92 -21.87 -21.76 -21.74 -21.73 -21.83 -21.81 -21.81 -21.81 -21.81 -21.85 -21.85 -21.85 -21.85 -21.85 -19.95 -20.01 -19.97 wine m=1m=2m=3m=4 -0.90 -0.91 -1.79 -1.35 -1.61 -1.60 -1.37 -1.63 -0.51 -0.78 -0.95 -0.51 -0.71 -0.71 -0.71 -0.71 -0.51 -0.51 -0.48 -0.51 hepatitis m=1m=2m=3m=4 -4.42 -5.28 -4.95 -4.93 -4.93 -5.14 -5.01 -5.20 -4.84 -4.84 -4.84 -4.84 -5.50 -5.50 -5.50 -4.50 -5.47 -4.40 -4.75 -4.92 SRBCT m=1m=2m=3m=4 -5.99 -6.17 -6.09 -6.06 -5.61 -5.62 -5.62 -5.61 -5.62 -5.49 -5.36 -5.76 -5.54 -5.54 -5.54 -5.54 -5.31 -5.31 -4.90 -0.11 Table 5: Test latent log-likelihood at convergence for different values of m ∈ {1, 2, 3, 4} on ion, bupa, hepatitis, wine and SRBCT data-sets. 17
4 0.57226044 197 nips-2012-Learning with Recursive Perceptual Representations
Author: Oriol Vinyals, Yangqing Jia, Li Deng, Trevor Darrell
Abstract: Linear Support Vector Machines (SVMs) have become very popular in vision as part of state-of-the-art object recognition and other classification tasks but require high dimensional feature spaces for good performance. Deep learning methods can find more compact representations but current methods employ multilayer perceptrons that require solving a difficult, non-convex optimization problem. We propose a deep non-linear classifier whose layers are SVMs and which incorporates random projection as its core stacking element. Our method learns layers of linear SVMs recursively transforming the original data manifold through a random projection of the weak prediction computed from each layer. Our method scales as linear SVMs, does not rely on any kernel computations or nonconvex optimization, and exhibits better generalization ability than kernel-based SVMs. This is especially true when the number of training samples is smaller than the dimensionality of data, a common scenario in many real-world applications. The use of random projections is key to our method, as we show in the experiments section, in which we observe a consistent improvement over previous –often more complicated– methods on several vision and speech benchmarks. 1
5 0.50918096 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model
Author: Yi Wu, David P. Wipf
Abstract: Sparse linear (or generalized linear) models combine a standard likelihood function with a sparse prior on the unknown coefficients. These priors can conveniently be expressed as a maximization over zero-mean Gaussians with different variance hyperparameters. Standard MAP estimation (Type I) involves maximizing over both the hyperparameters and coefficients, while an empirical Bayesian alternative (Type II) first marginalizes the coefficients and then maximizes over the hyperparameters, leading to a tractable posterior approximation. The underlying cost functions can be related via a dual-space framework from [22], which allows both the Type I or Type II objectives to be expressed in either coefficient or hyperparmeter space. This perspective is useful because some analyses or extensions are more conducive to development in one space or the other. Herein we consider the estimation of a trade-off parameter balancing sparsity and data fit. As this parameter is effectively a variance, natural estimators exist by assessing the problem in hyperparameter (variance) space, transitioning natural ideas from Type II to solve what is much less intuitive for Type I. In contrast, for analyses of update rules and sparsity properties of local and global solutions, as well as extensions to more general likelihood models, we can leverage coefficient-space techniques developed for Type I and apply them to Type II. For example, this allows us to prove that Type II-inspired techniques can be successful recovering sparse coefficients when unfavorable restricted isometry properties (RIP) lead to failure of popular ℓ1 reconstructions. It also facilitates the analysis of Type II when non-Gaussian likelihood models lead to intractable integrations. 1
6 0.50833678 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes
7 0.50512731 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models
9 0.50359243 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs
10 0.5033645 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines
11 0.50281614 192 nips-2012-Learning the Dependency Structure of Latent Factors
12 0.50277817 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models
13 0.50236315 47 nips-2012-Augment-and-Conquer Negative Binomial Processes
14 0.50231999 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models
15 0.50191855 200 nips-2012-Local Supervised Learning through Space Partitioning
16 0.50079042 188 nips-2012-Learning from Distributions via Support Measure Machines
17 0.50034952 65 nips-2012-Cardinality Restricted Boltzmann Machines
18 0.50008601 279 nips-2012-Projection Retrieval for Classification
19 0.49960569 193 nips-2012-Learning to Align from Scratch
20 0.4995223 220 nips-2012-Monte Carlo Methods for Maximum Margin Supervised Topic Models