nips nips2011 nips2011-306 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Nan Ding, Yuan Qi, S.v.n. Vishwanathan
Abstract: Approximate inference is an important technique for dealing with large, intractable graphical models based on the exponential family of distributions. We extend the idea of approximate inference to the t-exponential family by defining a new t-divergence. This divergence measure is obtained via convex duality between the log-partition function of the t-exponential family and a new t-entropy. We illustrate our approach on the Bayes Point Machine with a Student’s t-prior. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Approximate inference is an important technique for dealing with large, intractable graphical models based on the exponential family of distributions. [sent-9, score-0.371]
2 We extend the idea of approximate inference to the t-exponential family by defining a new t-divergence. [sent-10, score-0.333]
3 This divergence measure is obtained via convex duality between the log-partition function of the t-exponential family and a new t-entropy. [sent-11, score-0.438]
4 1 Introduction The exponential family of distributions is ubiquitous in statistical machine learning. [sent-13, score-0.305]
5 Two prominent approximate inference techniques include Markov chain Monte Carlo (MCMC) methods [1] and deterministic methods [2, 3]. [sent-17, score-0.182]
6 Essentially, these methods are premised on the search for a proxy in an analytically solvable distribution family that approximates the true underlying distribution. [sent-19, score-0.231]
7 To measure the closeness between the true and the approximate distributions, the relative entropy between these two distributions is used. [sent-20, score-0.411]
8 When working with the exponential family, one uses the Shannon-Boltzmann-Gibbs (SBG) entropy, in which case the relative entropy is the well-known Kullback-Leibler (KL) divergence [2]. [sent-21, score-0.681]
9 Numerous well-known algorithms for the exponential family, such as the mean field method [2, 4] and expectation propagation [3, 5], are based on this criterion. [sent-22, score-0.141]
10 The thin-tailed nature of the exponential family makes it unsuitable for designing algorithms which are potentially robust against certain kinds of noisy data. [sent-23, score-0.286]
11 Notable work, including [6, 7], utilizes mixture or split exponential-family-based approximate models to improve robustness. [sent-24, score-0.329]
12 Meanwhile, effort has also been devoted to developing alternate, generalized distribution families in statistics [e. [sent-25, score-0.134]
13 The t-exponential family is a special case of the more general φ-exponential family of Naudts [11, 15–17]. [sent-33, score-0.2]
14 Related work in [18] has applied the t-exponential family to generalize logistic regression and obtain an algorithm that is robust against certain types of label noise. [sent-34, score-0.2]
15 In this paper, we attempt to generalize deterministic approximate inference by using the t-exponential family. [sent-35, score-0.157]
16 In other words, the approximate distribution used is from the t-exponential family. [sent-36, score-0.093]
17 To obtain the corresponding divergence measure as in the exponential family, we exploit the convex duality between the log-partition function of the t-exponential family and a new t-entropy to define the t-divergence. [sent-37, score-0.406]
18 (Footnote 1: the t-exponential family is sometimes also called the q-exponential family or the Tsallis distribution.) [sent-38, score-0.299]
19 To illustrate the usage of the above procedure, we use it for approximate inference in the Bayes Point Machine (BPM) [3] but with a Student’s t-prior. [sent-39, score-0.133]
20 In Section 4, the t-divergence is derived and is used for approximate inference in Section 5. [sent-43, score-0.133]
21 2 The t-exponential Family and Related Entropies The t-exponential family was first proposed by Tsallis and co-workers [10, 13, 14]. [sent-45, score-0.2]
22 It is defined as $p(x;\theta) := \exp_t(\langle \Phi(x), \theta\rangle - g_t(\theta))$, (1) where $\exp_t(x) := \exp(x)$ if $t = 1$, and $\exp_t(x) := [1 + (1-t)x]_+^{1/(1-t)}$ otherwise. (2) [sent-46, score-0.38]
23 The inverse of the $\exp_t$ function is called $\log_t$. [sent-48, score-0.602]
24 Note that the log-partition function $g_t(\theta)$ in (1) preserves convexity and satisfies $\nabla_\theta g_t(\theta) = \mathrm{E}_q[\Phi(x)]$. (3) Here $q(x)$ is called the escort distribution of $p(x)$, and is defined as $q(x) := \frac{p(x)^t}{\int p(x)^t\, dx}$. (4) [sent-49, score-0.539]
25 See the supplementary material for a proof of convexity of $g_t(\theta)$ based on material from [17], and a detailed review of the t-exponential family of distributions. [sent-50, score-0.574]
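As a quick, hedged illustration of definitions (1)-(4) (not from the paper itself), here is a minimal NumPy sketch of $\exp_t$, $\log_t$, and the escort of a discrete distribution; the function names and test values are our own choices.

```python
import numpy as np

def exp_t(x, t):
    """t-exponential (2): exp(x) at t = 1, otherwise [1 + (1 - t) x]_+^{1/(1-t)}."""
    if t == 1.0:
        return np.exp(x)
    return np.maximum(1.0 + (1.0 - t) * x, 0.0) ** (1.0 / (1.0 - t))

def log_t(x, t):
    """Inverse of exp_t on its range: log(x) at t = 1, otherwise (x^{1-t} - 1)/(1 - t)."""
    if t == 1.0:
        return np.log(x)
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)

def escort(p, t):
    """Escort distribution (4) of a discrete distribution p: q proportional to p^t."""
    q = p ** t
    return q / q.sum()

t = 1.5
x = np.linspace(-0.5, 1.5, 9)              # for t > 1 keep (1-t)*x > -1 so exp_t stays finite
assert np.allclose(log_t(exp_t(x, t), t), x)
p = np.array([0.5, 0.3, 0.2])
print(escort(p, t), escort(p, t).sum())    # the escort is again a normalized distribution
```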
26 There are various generalizations of the Shannon-Boltzmann-Gibbs (SBG) entropy which are proposed in statistical physics, and paired with the t-exponential family of distributions. [sent-51, score-0.421]
27 Perhaps the most well-known among them is the Tsallis entropy [10]: $H_{\mathrm{tsallis}}(p) := -\int p(x)^t \log_t p(x)\, dx$. (5) [sent-52, score-0.723]
28 Naudts [11, 15, 16, 17] proposed a more general framework, wherein the familiar exp and log functions are generalized to expφ and logφ functions which are defined via a function φ. [sent-53, score-0.072]
29 These generalized functions are used to define a family of distributions; corresponding to this family, an entropy-like measure called the information content $I_\phi(p)$, as well as its divergence measure, are defined. [sent-54, score-0.894]
30 The information content is the dual of a function $F(\theta)$, where $\nabla_\theta F(\theta) = \mathrm{E}_p[\Phi(x)]$. (6) [sent-55, score-0.077]
31 Setting φ(p) = p in the Naudts framework recovers the t-exponential family defined in (1). [sent-56, score-0.2]
32 Interestingly, when $\phi(p) = \tfrac{1}{t}\, p^{2-t}$, the information content $I_\phi$ is exactly the Tsallis entropy (5). [sent-57, score-0.264]
33 Another well-known non-SBG entropy is the Rényi entropy [19]. [sent-58, score-0.511]
34 The Rényi α-entropy (when α ≠ 1) of the probability distribution p(x) is defined as $H_\alpha(p) = \frac{1}{1-\alpha} \log\left(\int p(x)^\alpha\, dx\right)$. (7) [sent-59, score-0.242]
35 Besides these entropies proposed in statistical physics, it is also worth noting efforts that work with generalized linear models or utilize different divergence measures, such as [5, 8, 20, 21]. [sent-60, score-0.275]
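For concreteness, a small sketch (ours, not the authors') of the Rényi α-entropy of a discrete distribution, treating the integral in (7) as a sum; it also shows numerically that α → 1 approaches the Shannon entropy.

```python
import numpy as np

def renyi_entropy(p, alpha):
    """Rényi alpha-entropy of a discrete distribution p (Eq. 7 with the integral as a sum)."""
    if np.isclose(alpha, 1.0):
        return -np.sum(p * np.log(p))          # Shannon limit
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

p = np.array([0.5, 0.25, 0.125, 0.125])
for a in (0.5, 0.999, 2.0):
    print(a, renyi_entropy(p, a))
# alpha -> 1 approaches the Shannon entropy -sum(p log p), about 1.213 here
```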
36 It is well known that the negative SBG entropy is the Fenchel dual of the log-partition function of an exponential family distribution. [sent-61, score-0.522]
37 This fact is crucially used in variational inference [2]. [sent-62, score-0.114]
38 (Footnote 2: Although closely related, our t-entropy definition is different from either the Tsallis entropy [10] or the information content in [17]. [sent-63, score-0.264]
39 Nevertheless, it can be regarded as an example of the generalized entropy framework proposed in [8].) [sent-64, score-0.293]
40 Although all of the above generalized entropies are useful in their own way, none of them satisfies this important property for the t-exponential family. [sent-65, score-0.155]
41 In the following sections we attempt to find an entropy which satisfies this property, and outline the principles of approximate inference using the t-exponential family. [sent-66, score-0.354]
42 Note that although our main focus is the t-exponential family, we believe that our results can also be extended to the more general φ-exponential family of Naudts [15, 17]. [sent-67, score-0.2]
43 3 Convex Duality and the t-Entropy Definition 1 (Inspired by Wainwright and Jordan [2]) The t-entropy of a distribution p(x; θ) is defined as $H_t(p(x;\theta)) := -\int q(x;\theta)\,\log_t p(x;\theta)\, dx = -\mathrm{E}_q[\log_t p(x;\theta)]$, (8) [sent-68, score-0.675]
44 where q(x; θ) is the escort distribution of p(x; θ). [sent-69, score-0.307]
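A minimal numerical sketch of Definition 1 for a discrete distribution, with the integral replaced by a sum; the helper name t_entropy is ours.

```python
import numpy as np

def t_entropy(p, t):
    """H_t(p) = -E_q[log_t p] for a discrete distribution p, with q the escort of p (Eq. 8)."""
    q = p ** t / np.sum(p ** t)
    log_t_p = np.log(p) if t == 1.0 else (p ** (1.0 - t) - 1.0) / (1.0 - t)
    return -np.sum(q * log_t_p)

p = np.array([0.5, 0.3, 0.2])
print(t_entropy(p, 1.0))   # at t = 1 this is the Shannon (SBG) entropy
print(t_entropy(p, 1.3))
```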
45 Furthermore, the following theorem establishes the duality between −Ht and gt . [sent-71, score-0.278]
46 Theorem 2 For any µ, define θ(µ) (if it exists) to be the parameter of the t-exponential family s.t. $\mathrm{E}_{q(x;\theta(\mu))}[\Phi(x)] = \mu$. [sent-75, score-0.2]
47 Then $g_t^*(\mu) = -H_t(p(x;\theta(\mu)))$ if θ(µ) exists, and $g_t^*(\mu) = +\infty$ otherwise, (10) [sent-78, score-0.18]
48 where $g_t^*(\mu)$ denotes the Fenchel dual of $g_t(\theta)$. [sent-79, score-0.394]
49 By duality it also follows that $g_t(\theta) = \sup_{\mu} \{\langle \mu, \theta\rangle - g_t^*(\mu)\}$. (11) [sent-80, score-0.438]
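To make Theorem 2 concrete, here is a small numerical check of our own construction, assuming a finite sample space so that $g_t(\theta)$ can be found by bisection: it verifies that $\nabla_\theta g_t(\theta)$ equals the escort expectation $\mathrm{E}_q[\Phi(x)]$ (the identity quoted later before (26)) and that $\langle \mu, \theta\rangle - g_t(\theta) = -H_t$, which is the content of (10) at θ = θ(µ). All helper names are ours.

```python
import numpy as np

def exp_t(x, t):
    return np.exp(x) if t == 1.0 else np.maximum(1.0 + (1.0 - t) * x, 0.0) ** (1.0 / (1.0 - t))

def log_t(x, t):
    return np.log(x) if t == 1.0 else (x ** (1.0 - t) - 1.0) / (1.0 - t)

def log_partition(theta, phi, t, lo=-50.0, hi=50.0):
    """Find g_t(theta) so that sum_k exp_t(phi_k * theta - g) = 1 (bisection; helper name ours)."""
    for _ in range(200):
        g = 0.5 * (lo + hi)
        if exp_t(phi * theta - g, t).sum() > 1.0:
            lo = g
        else:
            hi = g
    return g

t, theta = 0.7, 0.7
phi = np.array([0.0, 1.0, 2.0, 3.0])            # sufficient statistics of a 4-state toy family
g = log_partition(theta, phi, t)
p = exp_t(phi * theta - g, t)                   # the t-exponential family member (1)
q = p ** t / np.sum(p ** t)                     # its escort (4)
mu = np.sum(q * phi)                            # escort expectation E_q[Phi(x)]

eps = 1e-5                                      # finite-difference check of the gradient identity
grad = (log_partition(theta + eps, phi, t) - log_partition(theta - eps, phi, t)) / (2 * eps)
H_t = -np.sum(q * log_t(p, t))                  # t-entropy (8)
print(grad, mu)                                 # gradient of g_t equals the escort mean
print(mu * theta - g, -H_t)                     # Theorem 2 / (10): <mu, theta> - g_t(theta) = -H_t
```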
50 3.1 Relation with the Tsallis Entropy Using (4), (5), and (8), the relation between the t-entropy and the Tsallis entropy is obvious. [sent-87, score-0.241]
51 Basically, the t-entropy is a normalized version of the Tsallis entropy: $H_t(p) = -\frac{1}{\int p(x)^t\,dx} \int p(x)^t \log_t p(x)\, dx = \frac{1}{\int p(x)^t\,dx}\, H_{\mathrm{tsallis}}(p)$. (14) [sent-88, score-0.502]
52 One can recover the SBG entropy by setting t = 1. [sent-108, score-0.221]
53 3.2 Relation with the Rényi Entropy We can equivalently rewrite the Rényi entropy as $H_\alpha(p) = \frac{1}{1-\alpha}\log\left(\int p(x)^\alpha dx\right) = -\log\left[\left(\int p(x)^\alpha dx\right)^{-1/(1-\alpha)}\right]$. (15) [sent-111, score-0.422]
54 The t-entropy of p(x) (when t ≠ 1) is equal to $H_t(p) = -\frac{\int p(x)^t \log_t p(x)\, dx}{\int p(x)^t dx} = -\log_t\left[\left(\int p(x)^t dx\right)^{-1/(1-t)}\right]$. (16) [sent-112, score-0.502]
55 Therefore, when α = t, $H_t(p) = -\log_t(\exp(-H_\alpha(p)))$. (17) When t and α → 1, both entropies reduce to the SBG entropy. [sent-113, score-1.371]
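The two relations can be checked numerically on a discretized density; the sketch below (ours) verifies (14) and (17) for a grid approximation of a standard Gaussian with α = t.

```python
import numpy as np

def log_t(x, t):
    return np.log(x) if t == 1.0 else (x ** (1.0 - t) - 1.0) / (1.0 - t)

x = np.linspace(-6.0, 6.0, 2001)                 # grid approximation of a standard Gaussian
dx = x[1] - x[0]
p = np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)

t = 1.4
Z_t = np.sum(p ** t) * dx                        # integral of p(x)^t
q = p ** t / Z_t                                 # escort density
H_t = -np.sum(q * log_t(p, t)) * dx              # t-entropy (8)
H_tsallis = -np.sum(p ** t * log_t(p, t)) * dx   # Tsallis entropy (5)
H_alpha = np.log(Z_t) / (1.0 - t)                # Rényi entropy (7) with alpha = t

print(H_t, H_tsallis / Z_t)                      # (14): t-entropy = normalized Tsallis entropy
print(H_t, -log_t(np.exp(-H_alpha), t))          # (17): H_t = -log_t(exp(-H_alpha)) when alpha = t
```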
56 4 The t-divergence Recall that the Bregman divergence defined by a convex function −H between p and $\tilde p$ is [22]: [sent-114, score-0.141]
57 $D(p\,\|\,\tilde p) = -H(p) + H(\tilde p) + \int \frac{dH(\tilde p)}{d\tilde p}\,\big(p(x) - \tilde p(x)\big)\,dx$. (18) For the SBG entropy, it is easy to verify that the Bregman divergence leads to the relative SBG-entropy (also widely known as the Kullback-Leibler (KL) divergence). [sent-115, score-0.172]
58 Analogously, one can define the t-divergence3 as the Bregman divergence or relative entropy based on the t-entropy. [sent-116, score-0.393]
59 Definition 3 The t-divergence, which is the relative t-entropy between two distributions p(x) and $\tilde p(x)$, is defined as $D_t(p\,\|\,\tilde p) = \int q(x)\log_t p(x) - q(x)\log_t \tilde p(x)\, dx$. (19) [sent-117, score-1.087]
60 Theorem 4 The t-divergence is the Bregman divergence defined on the negative t-entropy −Ht (p). [sent-120, score-0.12]
61 (Footnote 3: Note that the t-divergence is not a special case of the divergence measure of Naudts [17] because the entropies are defined differently, although the derivations are fairly similar in spirit.) [sent-121, score-0.222]
62 The t-divergence plays a central role in the variational inference that will be derived shortly. [sent-122, score-0.114]
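A hedged numerical sketch of Definition 3 for discrete distributions (sums in place of integrals, helper names ours): at t = 1 it reduces to the KL divergence, and for t ≠ 1 it vanishes when the two distributions coincide.

```python
import numpy as np

def log_t(x, t):
    return np.log(x) if t == 1.0 else (x ** (1.0 - t) - 1.0) / (1.0 - t)

def t_divergence(p, p_tilde, t):
    """D_t(p || p_tilde) of Eq. (19) for discrete distributions, with q the escort of p."""
    q = p ** t / np.sum(p ** t)
    return np.sum(q * (log_t(p, t) - log_t(p_tilde, t)))

p = np.array([0.5, 0.3, 0.2])
p_tilde = np.array([0.4, 0.4, 0.2])
print(t_divergence(p, p_tilde, 1.0))             # reduces to KL(p || p_tilde) at t = 1
print(np.sum(p * np.log(p / p_tilde)))           # KL for comparison
print(t_divergence(p, p_tilde, 1.5),             # positive in this example
      t_divergence(p, p, 1.5))                   # zero when the two distributions coincide
```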
63 5 Approximate Inference in the t-Exponential Family In essence, deterministic approximate inference finds an approximate distribution from an analytically tractable distribution family which minimizes the relative entropy (e. [sent-149, score-0.754]
64 Since the relative entropy is not symmetric, the results of minimizing $D(p\,\|\,\tilde p)$ and $D(\tilde p\,\|\,p)$ are different. [sent-152, score-0.273]
65 Given an arbitrary probability distribution p(x), in order to obtain a good approximation $\tilde p(x;\theta)$ in the t-exponential family, we minimize the relative t-entropy (19): $\tilde p = \mathrm{argmin}_{\tilde p}\, D_t(p\,\|\,\tilde p) = \mathrm{argmin}_{\tilde p} \int q(x)\log_t p(x) - q(x)\log_t \tilde p(x;\theta)\, dx$. (24) [sent-155, score-1.308]
66 Here $q(x) = \frac{1}{Z}\, p(x)^t$ denotes the escort of the original distribution p(x). [sent-156, score-0.307]
67 Since $\tilde p(x;\theta) = \exp_t(\langle \Phi(x), \theta\rangle - g_t(\theta))$, (25) using the fact that $\nabla_\theta g_t(\theta) = \mathrm{E}_{\tilde q}[\Phi(x)]$ (with $\tilde q$ the escort of $\tilde p$), one can take the derivative of (24) with respect to θ: [sent-157, score-0.46]
68 $\mathrm{E}_{\tilde q}[\Phi(x)] = \mathrm{E}_q[\Phi(x)]$. (26) In other words, the approximate distribution can be obtained by matching the escort expectation of Φ(x) between the two distributions. [sent-158, score-0.438]
69 The escort expectation matching in (26) is reminiscent of the moment matching in the Power-EP [5] or the Fractional BP [23] algorithm, where the approximate distribution is obtained by $\mathrm{E}_{\tilde p}[\Phi(x)] = \mathrm{E}_{p^\alpha \tilde p^{1-\alpha}/Z}[\Phi(x)]$. [sent-159, score-0.493]
70 In contrast, we use the generalized exponential family (t-exponential family) to build our approximate models. [sent-161, score-0.401]
71 In this context, the t-divergence plays the same role as KL divergence in the exponential family. [sent-162, score-0.187]
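To illustrate (26) outside the BPM setting, the following one-dimensional sketch (entirely ours) approximates a discretized bimodal target by a Student's t-density with Φ(x) = (x, x²): it computes the escort mean and variance of the target and uses them as (µ, σ²), relying on the property, invoked later in the text via Example 5, that the mean and variance of the escort of a Student's t equal its (µ, Σ) parameters. The one-dimensional link v = 2/(t−1) − 1 between t and the degrees of freedom comes from matching the power of exp_t to the Student's t density and is our own working assumption here.

```python
import numpy as np
from math import gamma, sqrt, pi

def student_t_pdf(x, mu, sigma2, v):
    """Scaled Student's t density with location mu, squared scale sigma2, and v degrees of freedom."""
    c = gamma((v + 1) / 2.0) / (gamma(v / 2.0) * sqrt(v * pi * sigma2))
    return c * (1.0 + (x - mu) ** 2 / (v * sigma2)) ** (-(v + 1) / 2.0)

# target: a bimodal mixture, discretized on a grid and normalized numerically
x = np.linspace(-12.0, 12.0, 4801)
dx = x[1] - x[0]
target = 0.6 * np.exp(-0.5 * (x - 1.0) ** 2) + 0.4 * np.exp(-0.125 * (x + 3.0) ** 2)
target /= np.sum(target) * dx

t = 1.2                          # t of the approximating Student's t family
v = 2.0 / (t - 1.0) - 1.0        # assumed 1-D link between t and the degrees of freedom (v = 9)
q = target ** t / (np.sum(target ** t) * dx)     # escort of the target, (4)
mu = np.sum(q * x) * dx                          # escort mean      E_q[x]
sigma2 = np.sum(q * (x - mu) ** 2) * dx          # escort variance  E_q[(x - mu)^2]

# matched approximation: its escort mean/variance equal (mu, sigma2), so (26) holds for Phi = (x, x^2)
approx = student_t_pdf(x, mu, sigma2, v)
print(mu, sigma2, np.sum(approx) * dx)           # the last value should be close to 1
```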
72 To illustrate our ideas on a non-trivial problem, we apply escort expectation matching to the Bayes Point Machine (BPM) [3] with a Student’s t-distribution prior. [sent-163, score-0.345]
73 For each training data point $(x_i, y_i)$, the conditional distribution of the label $y_i$ given $x_i$ and $w$ is modeled as [3]: $t_i(w) = p(y_i \mid x_i, w) = \epsilon + (1 - 2\epsilon)\,\Theta(y_i \langle w, x_i\rangle)$, (28) where Θ(z) is the step function: Θ(z) = 1 if z > 0 and = 0 otherwise. [sent-169, score-0.376]
74 Under the i.i.d. assumption about the data, the posterior distribution can be written as $p(w \mid \mathcal{D}) \propto p_0(w) \prod_i t_i(w)$, (29) where $p_0(w)$ denotes a prior distribution. [sent-173, score-0.164]
75 $\tilde p_0(w) = p_0(w)$, (31) $p_i(w) \propto \tilde p_{i-1}(w)\, t_i(w)$, (32) and $\tilde p_i(w) = \mathrm{St}(w; \mu^{(i)}, \Sigma^{(i)}, v) = \mathrm{argmin}\, D_t(p_i(w)\,\|\,\mathrm{St}(w; \mu, \Sigma, v))$. (33) [sent-178, score-0.234]
76 The extension to expectation propagation is similar to [3] and omitted due to space limitations. [sent-180, score-0.074]
77 The main idea is to process data points one by one and update the posterior by using escort moment matching. [sent-181, score-0.337]
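As a sketch of the overall flow (31)-(33) and (26) — not the closed-form updates (42)-(47) of the paper — the following toy implementation tracks a one-dimensional weight on a grid, forms p_i ∝ p̃_{i−1} t_i by brute force, and projects back to a Student's t by escort moment matching; all constants (v, ε, the data-generating process) are illustrative choices of ours.

```python
import numpy as np
from math import gamma, sqrt, pi

def student_t(w, mu, sigma2, v):
    c = gamma((v + 1) / 2.0) / (gamma(v / 2.0) * sqrt(v * pi * sigma2))
    return c * (1.0 + (w - mu) ** 2 / (v * sigma2)) ** (-(v + 1) / 2.0)

rng = np.random.default_rng(0)
w_grid = np.linspace(-8.0, 8.0, 4001)            # brute-force grid over a 1-D weight
dw = w_grid[1] - w_grid[0]

v, t, eps = 9.0, 1.2, 0.05                       # dof, corresponding t = 1 + 2/(v+1), label noise
mu, sigma2 = 0.0, 4.0                            # prior p0(w) = St(w; 0, 4, v), cf. (31)
w_true = 1.5
for _ in range(50):                              # process the data points one by one
    x = rng.normal()
    y = np.sign(w_true * x + 0.1 * rng.normal())
    prior = student_t(w_grid, mu, sigma2, v)
    lik = eps + (1.0 - 2.0 * eps) * (y * w_grid * x > 0)    # t_i(w), (28)
    post = prior * lik                                      # p_i(w) prop. to prior times t_i, (32)
    post /= post.sum() * dw
    q = post ** t / (np.sum(post ** t) * dw)                # escort of p_i(w)
    mu = np.sum(q * w_grid) * dw                            # escort moment matching, (26)/(33)
    sigma2 = np.sum(q * (w_grid - mu) ** 2) * dw
print(mu, sigma2)                                # mu ends up on the same side of zero as w_true
```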
78 Assume the approximate distribution after processing (x1 , y1 ), . [sent-182, score-0.093]
79 Here $\tilde q_i(w) \propto \tilde p_i(w)^t$, $q_i(w) \propto \tilde p_{i-1}(w)^t\, \tilde t_i(w)$, and $\tilde t_i(w) = t_i(w)^t = \epsilon^t + \big((1-\epsilon)^t - \epsilon^t\big)\,\Theta(y_i \langle w, x_i\rangle)$. [sent-186, score-0.66]
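A two-line check (ours) of the powered likelihood factor above: t̃_i(w) = t_i(w)^t because the step function only takes the values 0 and 1.

```python
import numpy as np

def t_i(w, x_i, y_i, eps):
    """Likelihood factor (28): eps + (1 - 2*eps) * step(y_i * <w, x_i>)."""
    return eps + (1.0 - 2.0 * eps) * float(y_i * np.dot(w, x_i) > 0)

def t_i_tilde(w, x_i, y_i, eps, t):
    """Its t-th power: eps^t + ((1 - eps)^t - eps^t) * step(y_i * <w, x_i>)."""
    return eps ** t + ((1.0 - eps) ** t - eps ** t) * float(y_i * np.dot(w, x_i) > 0)

w, x_i, y_i, eps, t = np.array([0.5, -1.0]), np.array([1.0, 2.0]), -1.0, 0.1, 1.3
print(t_i(w, x_i, y_i, eps) ** t, t_i_tilde(w, x_i, y_i, eps, t))   # identical by construction
```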
80 Equations (39) and (41) are analogous to Eq. [sent-188, score-0.094]
81 Combining with (38) and (40), we obtain the escort expectations of pi (w) from Z1 and Z2 (similar to Eq. [sent-192, score-0.393]
82 13) in [3]): $\mathrm{E}_q[w] = \frac{1}{Z_2}\int \tilde q_{i-1}(w)\,\tilde t_i(w)\, w\, dw = \mu^{(i-1)} + \Sigma^{(i-1)} g$ (44) and $\mathrm{E}_q[w w^\top] - \mathrm{E}_q[w]\,\mathrm{E}_q[w]^\top = \frac{1}{Z_2}\int \tilde q_{i-1}(w)\,\tilde t_i(w)\, w w^\top dw - \mathrm{E}_q[w]\,\mathrm{E}_q[w]^\top = r\,\Sigma^{(i-1)} - \Sigma^{(i-1)}\big(g\, g^\top - 2G\big)\Sigma^{(i-1)}$, (45) where $r = Z_1/Z_2$ and $\mathrm{E}_q[\cdot]$ means the expectation with respect to $q_i(w)$. [sent-195, score-0.142]
83 Since the mean and variance of the escort of $\tilde p_i(w)$ are $\mu^{(i)}$ and $\Sigma^{(i)}$ (again see Example 5), after combining with (42) and (43), $\mu^{(i)} = \mathrm{E}_q[w] = \mu^{(i-1)} + \alpha y_i \Sigma^{(i-1)} x_i$ (46) [sent-196, score-0.487]
84 and $\Sigma^{(i)} = \mathrm{E}_q[w w^\top] - \mathrm{E}_q[w]\,\mathrm{E}_q[w]^\top = r\,\Sigma^{(i-1)} - (\Sigma^{(i-1)} x_i)\,\frac{\alpha\, y_i \langle x_i, \mu^{(i)}\rangle}{x_i^\top \Sigma^{(i-1)} x_i}\,(\Sigma^{(i-1)} x_i)^\top$. (47) [sent-197, score-0.141]
85 6.1 Results In the above Bayesian online learning algorithm, every time a new data point $x_n$ comes in, p(θ | x1 , . [sent-198, score-0.085]
86 In our experiment, we build a synthetic online dataset which mimics the above scenario; that is, the underlying classification hyperplane changes at certain time intervals. [sent-222, score-0.076]
87 During each sub-sequence s, there is a base weight vector $w^b_{(s)} \in \{-1, +1\}^{100}$. [sent-225, score-0.165]
88 Each point $x_{(i)}$ of the subsequence is labeled as $y_{(i)} = \mathrm{sign}\big(w_{(i)}^\top x_{(i)}\big)$, where $w_{(i)} = w^b_{(s)} + n$ and $n$ is a random noise from [−0. [sent-226, score-0.088]
89 The base weight vector $w^b_{(s)}$ can be (I) totally randomly generated, or (II) generated based on the base weight vector $w^b_{(s-1)}$ in the following way: $w^b_{(s)j} = \mathrm{Rand}\{-1, +1\}$ if $j \in [400s - 399, 400s]$, and $w^b_{(s)j} = w^b_{(s-1)j}$ otherwise. (48) [sent-229, score-0.33]
90 Namely, only 10% of the base weight vector is changed based upon the previous base weight vector. [sent-230, score-0.178]
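A sketch of the synthetic stream described above, hedged heavily because the extracted text is truncated: the input distribution, the noise range [−0.5, 0.5], and the exact block of indices in (48) are our guesses (the stated dimension and the index range do not line up in the extraction), so the sketch simply re-randomizes 10% of the coordinates at each regime change, as the preceding sentence states.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_sub, sub_len, flip = 100, 5, 400, 10     # flip = 10% of the coordinates per regime change

w_base = rng.choice([-1.0, 1.0], size=dim)      # initial base weight vector in {-1, +1}^dim
X, Y = [], []
for s in range(n_sub):
    if s > 0:                                   # scheme (II): re-randomize 10% of the entries, (48)
        idx = rng.choice(dim, size=flip, replace=False)
        w_base = w_base.copy()
        w_base[idx] = rng.choice([-1.0, 1.0], size=flip)
    for _ in range(sub_len):
        x = rng.normal(size=dim)                            # input distribution: our assumption
        w = w_base + rng.uniform(-0.5, 0.5, size=dim)       # per-point weight w_(i) = w_base + n
        X.append(x)
        Y.append(np.sign(w @ x))
X, Y = np.array(X), np.array(Y)
print(X.shape, Y.shape)
```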
91 We report (1) for each point, the number of different signs between the base weight vector and the mean of the posterior, and (2) the error rate over all points. [sent-234, score-0.169]
92 7 Discussion In this paper, we investigated the convex duality of the log-partition function of the t-exponential family, and defined a new t-entropy. [sent-239, score-0.099]
93 By using the t-divergence as a divergence measure, we proposed approximate inference on the t-exponential family by matching the expectation of the escort distributions. [sent-240, score-0.798]
94 The results in this paper can be extended to the more general φ-exponential family of Naudts [15]. [sent-241, score-0.272]
95 The t-divergence-based approximate inference has so far only been applied to a toy example. [sent-242, score-0.133]
96 In particular, it is important to investigate a new family of graphical models based on heavy-tailed distributions for applications involving noisy data. [sent-244, score-0.271]
97 Comparing the mean field method and belief propagation for approximate inference in MRFs. [sent-266, score-0.191]
98 Generalized thermostatistics based on deformed exponential and logarithmic functions. [sent-333, score-0.161]
99 Estimators, escort probabilities, and φ-exponential families in statistical physics. [sent-341, score-0.307]
100 Relative loss bounds for on-line density estimation with the exponential family of distributions. [sent-372, score-0.267]
wordName wordTfidf (topN-words)
[('logt', 0.502), ('escort', 0.276), ('tsallis', 0.276), ('student', 0.243), ('eq', 0.235), ('entropy', 0.221), ('family', 0.2), ('gt', 0.18), ('naudts', 0.15), ('sbg', 0.15), ('dx', 0.142), ('ht', 0.141), ('divergence', 0.12), ('pi', 0.117), ('st', 0.112), ('dt', 0.108), ('qi', 0.104), ('bern', 0.103), ('expt', 0.1), ('ti', 0.096), ('physica', 0.095), ('wb', 0.088), ('entropies', 0.083), ('pt', 0.079), ('duality', 0.078), ('bpm', 0.075), ('generalized', 0.072), ('inference', 0.071), ('nyi', 0.069), ('exponential', 0.067), ('approximate', 0.062), ('bregman', 0.059), ('signs', 0.055), ('yi', 0.054), ('bayes', 0.053), ('relative', 0.052), ('gauss', 0.052), ('htsallis', 0.05), ('senario', 0.05), ('thermostatistics', 0.05), ('base', 0.048), ('bernoulli', 0.047), ('xi', 0.047), ('deformed', 0.044), ('variational', 0.043), ('content', 0.043), ('physics', 0.043), ('fenchel', 0.041), ('distributions', 0.038), ('expectation', 0.038), ('posterior', 0.037), ('propagation', 0.036), ('coming', 0.035), ('dual', 0.034), ('ep', 0.034), ('graphical', 0.033), ('distribution', 0.031), ('matching', 0.031), ('fractional', 0.031), ('families', 0.031), ('url', 0.03), ('weight', 0.029), ('kl', 0.028), ('convexity', 0.028), ('hyperplane', 0.027), ('prominent', 0.025), ('online', 0.025), ('xn', 0.025), ('deterministic', 0.024), ('moment', 0.024), ('preserves', 0.024), ('supplementary', 0.024), ('changed', 0.024), ('belief', 0.022), ('game', 0.022), ('grunwald', 0.022), ('culota', 0.022), ('nan', 0.022), ('approximative', 0.022), ('azoury', 0.022), ('mendes', 0.022), ('rosenthal', 0.022), ('convex', 0.021), ('wainwright', 0.021), ('measures', 0.02), ('sousa', 0.02), ('hungarica', 0.02), ('boyen', 0.02), ('studia', 0.02), ('bayesian', 0.02), ('relation', 0.02), ('theorem', 0.02), ('measure', 0.019), ('editors', 0.019), ('bouchard', 0.019), ('departments', 0.019), ('unsuitable', 0.019), ('obermayer', 0.019), ('purdue', 0.019), ('closeness', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 306 nips-2011-t-divergence Based Approximate Inference
Author: Nan Ding, Yuan Qi, S.v.n. Vishwanathan
Abstract: Approximate inference is an important technique for dealing with large, intractable graphical models based on the exponential family of distributions. We extend the idea of approximate inference to the t-exponential family by defining a new t-divergence. This divergence measure is obtained via convex duality between the log-partition function of the t-exponential family and a new t-entropy. We illustrate our approach on the Bayes Point Machine with a Student’s t-prior. 1
2 0.14239566 123 nips-2011-How biased are maximum entropy models?
Author: Jakob H. Macke, Iain Murray, Peter E. Latham
Abstract: Maximum entropy models have become popular statistical models in neuroscience and other areas in biology, and can be useful tools for obtaining estimates of mutual information in biological systems. However, maximum entropy models fit to small data sets can be subject to sampling bias; i.e. the true entropy of the data can be severely underestimated. Here we study the sampling properties of estimates of the entropy obtained from maximum entropy models. We show that if the data is generated by a distribution that lies in the model class, the bias is equal to the number of parameters divided by twice the number of observations. However, in practice, the true distribution is usually outside the model class, and we show here that this misspecification can lead to much larger bias. We provide a perturbative approximation of the maximally expected bias when the true model is out of model class, and we illustrate our results using numerical simulations of an Ising model; i.e. the second-order maximum entropy distribution on binary data. 1
3 0.09236642 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations
Author: Shie Mannor, Ohad Shamir
Abstract: We consider an adversarial online learning setting where a decision maker can choose an action in every stage of the game. In addition to observing the reward of the chosen action, the decision maker gets side observations on the reward he would have obtained had he chosen some of the other actions. The observation structure is encoded as a graph, where node i is linked to node j if sampling i provides information on the reward of j. This setting naturally interpolates between the well-known “experts” setting, where the decision maker can view all rewards, and the multi-armed bandits setting, where the decision maker can only view the reward of the chosen action. We develop practical algorithms with provable regret guarantees, which depend on non-trivial graph-theoretic properties of the information feedback structure. We also provide partially-matching lower bounds. 1
4 0.081065714 188 nips-2011-Non-conjugate Variational Message Passing for Multinomial and Binary Regression
Author: David A. Knowles, Tom Minka
Abstract: Variational Message Passing (VMP) is an algorithmic implementation of the Variational Bayes (VB) method which applies only in the special case of conjugate exponential family models. We propose an extension to VMP, which we refer to as Non-conjugate Variational Message Passing (NCVMP) which aims to alleviate this restriction while maintaining modularity, allowing choice in how expectations are calculated, and integrating into an existing message-passing framework: Infer.NET. We demonstrate NCVMP on logistic binary and multinomial regression. In the multinomial case we introduce a novel variational bound for the softmax factor which is tighter than other commonly used bounds whilst maintaining computational tractability. 1
5 0.073355377 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes
Author: Alex K. Susemihl, Ron Meir, Manfred Opper
Abstract: Bayesian filtering of stochastic stimuli has received a great deal of attention recently. It has been applied to describe the way in which biological systems dynamically represent and make decisions about the environment. There have been no exact results for the error in the biologically plausible setting of inference on point process, however. We present an exact analysis of the evolution of the meansquared error in a state estimation task using Gaussian-tuned point processes as sensors. This allows us to study the dynamics of the error of an optimal Bayesian decoder, providing insights into the limits obtainable in this task. This is done for Markovian and a class of non-Markovian Gaussian processes. We find that there is an optimal tuning width for which the error is minimized. This leads to a characterization of the optimal encoding for the setting as a function of the statistics of the stimulus, providing a mathematically sound primer for an ecological theory of sensory processing. 1
6 0.072034709 238 nips-2011-Relative Density-Ratio Estimation for Robust Distribution Comparison
7 0.067912281 170 nips-2011-Message-Passing for Approximate MAP Inference with Latent Variables
8 0.067189284 295 nips-2011-Unifying Non-Maximum Likelihood Learning Objectives with Minimum KL Contraction
9 0.065904245 258 nips-2011-Sparse Bayesian Multi-Task Learning
10 0.061631288 158 nips-2011-Learning unbelievable probabilities
11 0.05968421 288 nips-2011-Thinning Measurement Models and Questionnaire Design
12 0.059236765 248 nips-2011-Semi-supervised Regression via Parallel Field Regularization
13 0.05917779 145 nips-2011-Learning Eigenvectors for Free
14 0.057837542 59 nips-2011-Composite Multiclass Losses
15 0.05762003 92 nips-2011-Expressive Power and Approximation Errors of Restricted Boltzmann Machines
16 0.054693002 249 nips-2011-Sequence learning with hidden units in spiking neural networks
17 0.054104142 217 nips-2011-Practical Variational Inference for Neural Networks
18 0.052515805 104 nips-2011-Generalized Beta Mixtures of Gaussians
19 0.051924575 134 nips-2011-Infinite Latent SVM for Classification and Multi-task Learning
20 0.051159363 60 nips-2011-Confidence Sets for Network Structure
topicId topicWeight
[(0, 0.154), (1, -0.033), (2, 0.031), (3, -0.044), (4, -0.019), (5, -0.053), (6, -0.02), (7, -0.086), (8, -0.009), (9, -0.011), (10, -0.038), (11, -0.113), (12, 0.056), (13, -0.025), (14, -0.03), (15, 0.019), (16, 0.041), (17, -0.013), (18, 0.009), (19, 0.062), (20, -0.026), (21, 0.049), (22, 0.114), (23, -0.023), (24, 0.172), (25, -0.002), (26, 0.03), (27, 0.0), (28, 0.054), (29, 0.073), (30, 0.012), (31, 0.101), (32, -0.078), (33, 0.005), (34, -0.067), (35, -0.051), (36, 0.066), (37, 0.007), (38, 0.01), (39, 0.059), (40, -0.085), (41, 0.022), (42, 0.167), (43, -0.101), (44, 0.061), (45, 0.093), (46, 0.085), (47, -0.056), (48, -0.079), (49, 0.044)]
simIndex simValue paperId paperTitle
same-paper 1 0.94483733 306 nips-2011-t-divergence Based Approximate Inference
Author: Nan Ding, Yuan Qi, S.v.n. Vishwanathan
Abstract: Approximate inference is an important technique for dealing with large, intractable graphical models based on the exponential family of distributions. We extend the idea of approximate inference to the t-exponential family by defining a new t-divergence. This divergence measure is obtained via convex duality between the log-partition function of the t-exponential family and a new t-entropy. We illustrate our approach on the Bayes Point Machine with a Student’s t-prior. 1
2 0.76305008 123 nips-2011-How biased are maximum entropy models?
Author: Jakob H. Macke, Iain Murray, Peter E. Latham
Abstract: Maximum entropy models have become popular statistical models in neuroscience and other areas in biology, and can be useful tools for obtaining estimates of mutual information in biological systems. However, maximum entropy models fit to small data sets can be subject to sampling bias; i.e. the true entropy of the data can be severely underestimated. Here we study the sampling properties of estimates of the entropy obtained from maximum entropy models. We show that if the data is generated by a distribution that lies in the model class, the bias is equal to the number of parameters divided by twice the number of observations. However, in practice, the true distribution is usually outside the model class, and we show here that this misspecification can lead to much larger bias. We provide a perturbative approximation of the maximally expected bias when the true model is out of model class, and we illustrate our results using numerical simulations of an Ising model; i.e. the second-order maximum entropy distribution on binary data. 1
3 0.72764421 295 nips-2011-Unifying Non-Maximum Likelihood Learning Objectives with Minimum KL Contraction
Author: Siwei Lyu
Abstract: When used to learn high dimensional parametric probabilistic models, the classical maximum likelihood (ML) learning often suffers from computational intractability, which motivates the active developments of non-ML learning methods. Yet, because of their divergent motivations and forms, the objective functions of many non-ML learning methods are seemingly unrelated, and there lacks a unified framework to understand them. In this work, based on an information geometric view of parametric learning, we introduce a general non-ML learning principle termed as minimum KL contraction, where we seek optimal parameters that minimizes the contraction of the KL divergence between the two distributions after they are transformed with a KL contraction operator. We then show that the objective functions of several important or recently developed non-ML learning methods, including contrastive divergence [12], noise-contrastive estimation [11], partial likelihood [7], non-local contrastive objectives [31], score matching [14], pseudo-likelihood [3], maximum conditional likelihood [17], maximum mutual information [2], maximum marginal likelihood [9], and conditional and marginal composite likelihood [24], can be unified under the minimum KL contraction framework with different choices of the KL contraction operators. 1
4 0.70198321 238 nips-2011-Relative Density-Ratio Estimation for Robust Distribution Comparison
Author: Makoto Yamada, Taiji Suzuki, Takafumi Kanamori, Hirotaka Hachiya, Masashi Sugiyama
Abstract: Divergence estimators based on direct approximation of density-ratios without going through separate approximation of numerator and denominator densities have been successfully applied to machine learning tasks that involve distribution comparison such as outlier detection, transfer learning, and two-sample homogeneity test. However, since density-ratio functions often possess high fluctuation, divergence estimation is still a challenging task in practice. In this paper, we propose to use relative divergences for distribution comparison, which involves approximation of relative density-ratios. Since relative density-ratios are always smoother than corresponding ordinary density-ratios, our proposed method is favorable in terms of the non-parametric convergence speed. Furthermore, we show that the proposed divergence estimator has asymptotic variance independent of the model complexity under a parametric setup, implying that the proposed estimator hardly overfits even with complex models. Through experiments, we demonstrate the usefulness of the proposed approach. 1
5 0.57065439 188 nips-2011-Non-conjugate Variational Message Passing for Multinomial and Binary Regression
Author: David A. Knowles, Tom Minka
Abstract: Variational Message Passing (VMP) is an algorithmic implementation of the Variational Bayes (VB) method which applies only in the special case of conjugate exponential family models. We propose an extension to VMP, which we refer to as Non-conjugate Variational Message Passing (NCVMP) which aims to alleviate this restriction while maintaining modularity, allowing choice in how expectations are calculated, and integrating into an existing message-passing framework: Infer.NET. We demonstrate NCVMP on logistic binary and multinomial regression. In the multinomial case we introduce a novel variational bound for the softmax factor which is tighter than other commonly used bounds whilst maintaining computational tractability. 1
6 0.53744191 77 nips-2011-Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression
7 0.53312051 288 nips-2011-Thinning Measurement Models and Questionnaire Design
8 0.5328759 253 nips-2011-Signal Estimation Under Random Time-Warpings and Nonlinear Signal Alignment
9 0.53076488 69 nips-2011-Differentially Private M-Estimators
10 0.53043449 240 nips-2011-Robust Multi-Class Gaussian Process Classification
11 0.50348359 225 nips-2011-Probabilistic amplitude and frequency demodulation
12 0.49225107 81 nips-2011-Efficient anomaly detection using bipartite k-NN graphs
13 0.47565085 158 nips-2011-Learning unbelievable probabilities
14 0.47404188 92 nips-2011-Expressive Power and Approximation Errors of Restricted Boltzmann Machines
15 0.40309888 40 nips-2011-Automated Refinement of Bayes Networks' Parameters based on Test Ordering Constraints
16 0.38918927 3 nips-2011-A Collaborative Mechanism for Crowdsourcing Prediction Problems
17 0.38826543 191 nips-2011-Nonnegative dictionary learning in the exponential noise model for adaptive music signal representation
18 0.38746193 42 nips-2011-Bayesian Bias Mitigation for Crowdsourcing
19 0.38589755 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes
20 0.38298655 33 nips-2011-An Exact Algorithm for F-Measure Maximization
topicId topicWeight
[(0, 0.026), (4, 0.037), (20, 0.02), (26, 0.016), (31, 0.115), (33, 0.027), (43, 0.127), (45, 0.09), (57, 0.02), (74, 0.05), (77, 0.286), (83, 0.051), (99, 0.048)]
simIndex simValue paperId paperTitle
1 0.78536218 71 nips-2011-Directed Graph Embedding: an Algorithm based on Continuous Limits of Laplacian-type Operators
Author: Dominique C. Perrault-joncas, Marina Meila
Abstract: This paper considers the problem of embedding directed graphs in Euclidean space while retaining directional information. We model the observed graph as a sample from a manifold endowed with a vector field, and we design an algorithm that separates and recovers the features of this process: the geometry of the manifold, the data density and the vector field. The algorithm is motivated by our analysis of Laplacian-type operators and their continuous limit as generators of diffusions on a manifold. We illustrate the recovery algorithm on both artificially constructed and real data. 1 Motivation Recent advances in graph embedding and visualization have focused on undirected graphs, for which the graph Laplacian properties make the analysis particularly elegant [1, 2]. However, there is an important number of graph data, such as social networks, alignment scores between biological sequences, and citation data, which are naturally asymmetric. A commonly used approach for this type of data is to disregard the asymmetry by studying the spectral properties of W +W T or W T W , where W is the affinity matrix of the graph. Some approaches have been offered to preserve the asymmetry information contained in data: [3], [4], [5] or to define directed Laplacian operators [6]. Although quite successful, these works adopt a purely graph-theoretical point of view. Thus, they are not concerned with the generative process that produces the graph, nor with the interpretability and statistical properties of their algorithms. In contrast, we view the nodes of a directed graph as a finite sample from a manifold in Euclidean space, and the edges as macroscopic observations of a diffusion kernel between neighboring points on the manifold. We explore how this diffusion kernel determines the overall connectivity and asymmetry of the resulting graph and demonstrate how Laplacian-type operators of this graph can offer insights into the underlying generative process. Based on the analysis of the Laplacian-type operators, we derive an algorithm that, in the limit of infinite sample and vanishing bandwidth, recovers the key features of the sampling process: manifold geometry, sampling distribution, and local directionality, up to their intrinsic indeterminacies. 2 Model The first premise here is that we observe a directed graph G, with n nodes, having weights W = [Wij ] for the edge from node i to node j. In following with common Laplacian-based embedding approaches, we assume that G is a geometric random graph constructed from n points sampled according to distribution p = e−U on an unobserved compact smooth manifold M ⊆ Rl of known intrinsic dimension d ≤ l. The edge weight Wij is then determined by a directed similarity kernel k (xi , xj ) with bandwidth . The directional component of k (xi , xj ) will be taken to be derived 1 from a vector field r on M, which assigns a preferred direction between weights Wij and Wji . The choice of a vector field r to characterize the directional component of G might seem restrictive at first. In the asymptotic limit of → 0 and n → ∞ however, kernels are characterized by their diffusion, drift, and source components [7]. As such, r is sufficient to characterize any directionality associated with a drift component and as it turns out, the component of r normal M in Rl can also be use to characterize any source component. As for the diffusion component, it is not possible to uniquely identify it from G alone [8]. 
Some absolute knownledge of M is needed to say anything about it. Hence, without loss of generality, we will construct k (xi , xj ) so that the diffusion component ends being isotropic and constant, i.e. equal to Laplace-Beltrami operator ∆ on M. The schematic of this generative process is shown in the top left of Figure 1 below. From left to right: the graph generative process mapping the sample on M to geometric random graph G via the kernel k (x, y), then the subsequent embedding (α) Ψn of G by operators Haa,n , (α) Hss,n (defined in section 3.1). As these operators converge to (α) their respective limits, Haa and (α) Hss , so will Ψn → Ψ, pn → p, and rn → r. We design an algorithm that, given G, produces the top right embedding (Ψn , pn , and rn ). Figure 1: Schematic of our framework. The question is then as follows: can the generative process’ geometry M, distribution p = e−U , and directionality r, be recovered from G? In other words, is there an embedding of G in Rm , m ≥ d that approximates all three components of the process and that is also consistent as sample size increases and the bandwidth vanishes? In the case of undirected graphs, the theory of Laplacian eigenmaps [1] and Diffusion maps [9] answers this question in the affirmative, in that the geometry of M and p = e−U can be inferred using spectral graph theory. The aim here is to build on the undirected problem and recover all three components of the generative process from a directed graph G. The spectral approach to undirected graph embedding relies on the fact that eigenfunctions of the Laplace-Beltrami operator are known to preserve the local geometry of M [1]. With a consistent empirical Laplace-Beltrami operator based on G, its eigenvectors also recover the geometry of M and converge to the corresponding eigenfunctions on M. For a directed graph G, an additional operator is needed to recover the local directional component r, but the principle remains the same. (α) The schematic for this is shown in Figure 1 where two operators - Hss,n , introduced in [9] for (α) undirected embeddings, and Haa,n , a new operator defined in section 3.1 - are used to obtain the (α) (α) (α) embedding Ψn , distribution pn , and vector field rn . As Haa,n and Hss,n converge to Haa and (α) Hss , Ψn , pn , and rn also converge to Ψ, p, and r, where Ψ is the local geometry preserving the embedding of M into Rm . (α) The algorithm we propose in Section 4 will calculate the matrices corresponding to H·,n from the graph G, and with their eigenvectors, will find estimates for the node coordinates Ψ, the directional component r, and the sampling distribution p. In the next section we briefly describe the mathematical models of the diffusion processes that our model relies on. 2 2.1 Problem Setting The similarity kernel k (x, y) can be used to define transport operators on M. The natural transport operator is defined by normalizing k (x, y) as T [f ](x) = M k (x, y) f (y)p(y)dy , where p (x) = p (x) k (x, y)p(y)dy . (1) M T [f ](x) represents the diffusion of a distribution f (y) by the transition density k (x, y)p(y)/ k (x, y )p(y )dy . The eigenfunctions of this infinitesimal operator are the continuous limit of the eigenvectors of the transition probability matrix P = D−1 W given by normalizing the affinity matrix W of G by D = diag(W 1) [10]. Meanwhile, the infinitesimal transition ∂f (T − I)f = lim (2) →0 ∂t defines the backward equation for this diffusion process over M based on kernel k . 
Obtaining the explicit expression for transport operators like (2) is then the main technical challenge. 2.2 Choice of Kernel In order for T [f ] to have the correct asymptotic form, some hypotheses about the similarity kernel k (x, y) are required. The hypotheses are best presented by considering the decomposition of k (x, y) into symmetric h (x, y) = h (y, x) and anti-symmetric a (x, y) = −a (y, x) components: k (x, y) = h (x, y) + a (x, y) . (3) The symmetric component h (x, y) is assumed to satisfy the following properties: 1. h (||y − 2 x||2 ) = h(||y−x|| / ) , and 2. h ≥ 0 and h is exponentially decreasing as ||y − x|| → ∞. This form d/2 of symmetric kernel was used in [9] to analyze the diffusion map. For the asymmetric part of the similarity kernel, we assume the form a (x, y) = r(x, y) h(||y − x||2 / ) · (y − x) , d/2 2 (4) with r(x, y) = r(y, x) so that a (x, y) = −a (y, x). Here r(x, y) is a smooth vector field on the manifold that gives an orientation to the asymmetry of the kernel k (x, y). It is worth noting that the dependence of r(x, y) on both x and y implies that r : M × M → Rl with Rl the ambient space of M; however in the asymptotic limit, the dependence in y is only important “locally” (x = y), and as such it is appropriate to think of r(x, x) being a vector field on M. As a side note, it is worth pointing out that even though the form of a (x, y) might seem restrictive at first, it is sufficiently rich to describe any vector field . This can be seen by taking r(x, y) = (w(x) + w(y))/2 so that at x = y the resulting vector field is given by r(x, x) = w(x) for an arbitrary vector field w(x). 3 Continuous Limit of Laplacian Type Operators We are now ready to state the main asymptotic result. Proposition 3.1 Let M be a compact, closed, smooth manifold of dimension d and k (x, y) an asymmetric similarity kernel satisfying the conditions of section 2.2, then for any function f ∈ C 2 (M), the integral operator based on k has the asymptotic expansion k (x, y)f (y)dy = m0 f (x) + g(f (x), x) + o( ) , (5) M where g(f (x), x) = and m0 = Rd m2 (ω(x)f (x) + ∆f (x) + r · 2 h(||u||2 )du, m2 = Rd u2 h(||u||2 )du. i 3 f (x) + f (x) · r + c(x)f (x)) (6) The proof can be found in [8] along with the definition of ω(x) and c(x) in (6). For now, it suffices to say that ω(x) corresponds to an interaction between the symmetric kernel h and the curvature of M and was first derived in [9]. Meanwhile, c(x) is a new term that originates from the interaction between h and the component of r that is normal to M in the ambient space Rl . Proposition 3.1 foreshadows a general fact about spectral embedding algorithms: in most cases, Laplacian operators confound the effects of spatial proximity, sampling density and directional flow due to the presence of the various terms above. 3.1 Anisotropic Limit Operators Proposition 3.1 above can be used to derive the limits of a variety of Laplacian type operators associated with spectral embedding algorithms like [5, 6, 3]. Although we will focus primarily on a few operators that give the most insight into the generative process and enable us to recover the model defined in Figure 1, we first present four distinct families of operators for completeness. These operator families are inspired by the anisotropic family of operators that [9] introduced for undirected graphs, which make use of anisotropic kernels of the form: k (α) (x, y) = k (x, y) pα (x)pα (y) , (7) with α ∈ [0, 1] where α = 0 is the isotropic limit. 
To normalize the anisotropic kernels, we need (α) (α) (α) as p (x) = M k (x, y)p(y)dy. From (7), four to redefine the outdegrees distribution of k families of diffusion processes of the form ft = H (α) [f ](x) can be derived depending on which kernel is normalized and which outdegree distribution is used for the normalization. Specifically, (α) (α) we define transport operators by normalizing the asymmetric k or symmetric h kernels with the 1 asymmetric p or symmetric q = M h (x, y)p(y)dy outdegree distribution . To keep track of all options, we introduce the following notation: the operators will be indexed by the type of kernel and outdegree distribution they correspond to (symmetric or asymmetric), with the first index identifying the kernel and the second index identifying the outdegree distribution. For example, the family of anisotropic limit operators introduced by [9] is defined by normalizing the symmetric kernel by (α) the symmetric outdegree distribution, hence they will be denoted as Hss , with the superscript corresponding to the anisotropic power α. Proposition 3.2 With the above notation, (α) Haa [f ] = ∆f − 2 (1 − α) U · f + r· f (α) Has [f ] (α) Hsa [f ] (α) Hss [f ] = ∆f − 2 (1 − α) U · = ∆f − 2 (1 − α) U · = ∆f − 2(1 − α) U · (8) f − cf + (α − 1)(r · U )f − ( · r + (α − 1)r · f + (c + f. U )f · r)f + r · f (9) (10) (11) The proof of this proposition, which can be found in [8], follows from repeated application of Proposition 3.1 to p(y) or q(y) and then to k α (x, y) or hα (x, y), as well as the fact that p1 = α 1 p−α [1 − α (ω + ∆p p + 2r · p p +2 · r + c)] + o( ). (α) Thus, if we use the asymmetric k and p , we get Haa , defined by the advected diffusion equa(α) tion (8). In general, Haa is not hermitian, so it commonly has complex eigenvectors. This makes (1) embedding directed graphs with this operator problematic. Nevertheless, Haa will play an important role in extracting the directionality of the sampling process. If we use the symmetric kernel h but the asymmetric outdegree distribution p , we get the family (α) of operators Hsa , of which the WCut of [3] is a special case (α = 0). If we reverse the above, i.e. (α) (α) (α) use k and q , we obtain Has . This turns out to be merely a combination of Haa and Hsa . 1 The reader may notice that there are in fact eight possible combinations of kernel and degree distribution, since the anisotripic kernel (7) could also be defined using a symmetric or asymmetric outdegree distribution. However, there are only four distinct asymptotic results and they are all covered by using one kernel (symmetric or asymmetric) and one degree distribution (symmetric or asymmetric) throughout. 4 Algorithm 1 Directed Embedding Input: Affinity matrix Wi,j and embedding dimension m, (m ≥ d) 1. S ← (W + W T )/2 (Steps 1–6 estimate the coordinates as in [11]) n 2. qi ← j=1 Si,j , Q = diag(q) 3. V ← Q−1 SQ−1 (1) n 4. qi ← j=1 Vi,j , Q(1) = diag(q (1) ) (1) −1 5. Hss,n ← Q(1) V 6. Compute the Ψ the n × (m + 1) matrix with orthonormal columns containing the m + 1 largest (1) right eigenvector (by eigenvalue) of Hss,n as well as the Λ the (m + 1) × (m + 1) diagonal matrix of eigenvalues. Eigenvectors 2 to m + 1 from Ψ are the m coordinates of the embedding. (1) 7. Compute π the left eigenvector of Hss,n with eigenvalue 1. (Steps 7–8 estimate the density) n 8. π ← π/ i=1 πi is the density distribution over the embedding. n 9. pi ← j=1 Wi,j , P = diag(p) (Steps 9–13 estimate the vector field r) 10. T ← P −1 W P −1 (1) n 11. 
pi ← j=1 Ti,j , P (1) = diag(p(1) ) (1) −1 12. Haa,n ← P (1) T (1) (1) 13. R ← (Haa,n − Hss,n )Ψ/2. Columns 2 to m + 1 of R are the vector field components in the direction of the corresponding coordinates of the embedding. (α) Finally, if we only consider the symmetric kernel h and degree distribution q , we recover Hss , the anisotropic kernels of [9] for symmetric graphs. This operator for α = 1 is shown to separate the manifold from the probability distribution [11] and will be used as part of our recovery algorithm. Isolating the Vector Field r 4 Our aim is to esimate the manifold M, the density distribution p = e−U , and the vector field r. The (1) first two components of the data can be recovered from Hss as shown in [11] and summarized in Algorithm 1. At this juncture, one feature of generative process is missing: the vector field r. The natural approach (α) (α) for recovering r is to isolate the linear operator r · from Haa by substracting Hss : (α) (α) Haa − Hss = r · . (12) The advantage of recovering r in operator form as in (12) is that r · is coordinate free. In other words, as long as the chosen embedding of M is diffeomorphic to M2 , (12) can be used to express the component of r that lies in the tangent space T M, which we denote by r|| . Specifically, let Ψ be a diffeomorphic embedding of M ; the component of r along coordinate ψk is then given by r · ψk = rk , and so, in general, r|| = r · Ψ. (13) The subtle point that only r|| is recovered from (13) follows from the fact that the operator r · only defined along M and hence any directional derivative is necessarily along T M. is Equation (13) and the previous observations are the basis for Algorithm 1, which recovers the three important features of the generative process for an asymmetric graph with affinity matrix W . A similar approach can be employed to recover c + · r, or simply · r if r has no component perpendicular to the tangent space T M (meaning that c ≡ 0). Recovering c + · r is achieved by taking advantage of the fact that (1) (1) (Hsa − Hss ) = (c + 2 · r) , (14) (1) A diffeomorphic embedding is guaranteed by using the eigendecomposition of Hss . 5 (1) (1) which is a diagonal operator. Taking into account that for finite n (Hsa,n − Hss,n ) is not perfectly (1) (1) diagonal, using ψn ≡ 1n (vector of ones), i.e. (Hsa,n − Hss,n )[1n ] = (cn + · rn ), has been found (1) (1) empirically to be more stable than simply extracting the diagonal of (Hsa,n − Hss,n ). 5 Experiments Artificial Data For illustrative purposes, we begin by applying our method to an artificial example. We use the planet Earth as a manifold with a topographic density distribution, where sampling probability is proportional to elevation. We also consider two vector fields: the first is parallel to the line of constant latitude and purely tangential to the sphere, while the second is parallel to the line of constant longitude with a component of the vector field perpendicular to the manifold. The true model with constant latitude vector field is shown in Figure 2, along with the estimated density and vector field projected on the true manifold (sphere). Model Recovered Latitudinal (a) Longitudinal (b) Figure 2: (a): Sphere with latitudinal vector field, i.e East-West asymmetry, with Wew > Wwe if node w lies to the West of node e. The graph nodes are sampled non-uniformly, with the topographic map of the world as sampling density. We sample n = 5000 nodes, and observe only the resulting W matrix, but not the node locations. 
From W , our algorithm estimates the sample locations (geometry), the vector field (black arrows) generating the observed asymmetries, and the sampling distribution at each data point (colormap). (b) Vector fields on a spherical region (blue), and their estimates (red): latitudinal vector field tangent to the manifold (left) and longitudinal vector field with component perpendicular to manifold tangent plane (right). Both the estimated density and vector field agree with the true model, demonstrating that for artificial data, the recovery algorithm 1 performs quite well. We note that the estimated density does not recover all the details of the original density, even for large sample size (here n = 5000 with = 0.07). Meanwhile, the estimated vector field performs quite well even when the sampling is reduced to n = 500 with = 0.1. This can be seen in Figure 2, b, where the true and estimated vector fields are superimposed. Figure 2 also demonstrates how r · only recovers the tangential component of r. The estimated geometry is not shown on any of these figures, since the success of the diffusion map in recovering the geometry for such a simple manifold is already well established [2, 9]. Real DataThe National Longitudinal Survey of Youth (NLSY) 1979 Cohort is a representative sample of young men and women in the United States who were followed from 1979 to 2000 [12, 13]. The aim here is to use this survey to obtain a representation of the job market as a diffusion process over a manifold. The data set consists of a sample of 7,816 individual career sequences of length 64, listing the jobs a particular individual held every quarter between the ages of 20 and 36. Each token in the sequence identifies a job. Each job corresponds to an industry × occupation pair. There are 25 unique industry and 20 unique occupation indices. Out of the 500 possible pairings, approximately 450 occur in the data, with only 213 occurring with sufficient frequency to be included here. Thus, our graph G has 213 nodes - the jobs - and our observations consist of 7,816 walks between the graph nodes. We convert these walks to a directed graph with affinity matrix W . Specifically, Wij represents the number of times a transition from job i to job j was observed (Note that this matrix is asymmetric, 6 i.e Wij = Wji ). Normalizing each row i of W by its outdegree di gives P = diag(di )−1 W , the non-parametric maximum likelihood estimator for the Markov chain over G for the progression (0) of career sequences. This Markov chain has as limit operator Haa , as the granularity of the job market increases along with the number of observations. Thus, in trying to recover the geometry, distribution and vector field, we are actually interested in estimating the full advective effect of the (0) diffusion process generated by Haa ; that is, we want to estimate r · − 2 U · where we can use (0) (1) −2 U · = Hss − Hss to complement Algorithm 1. 0.25 1600 0.05 0.1 1400 0.1 3 0.7 1800 0.15 ! 0.8 0.15 0.2 0.9 0.2 2000 0.25 !30.05 0.6 0.5 0 0 0.4 1200 −0.05 −0.05 −0.1 0.3 −0.1 1000 0.2 −0.15 −0.15 800 −0.2 −0.25 0.1 −0.2 −0.25 −0.1 −0.05 0 !2 0.05 0.1 0.15 0.2 (a) −0.1 −0.05 0 !2 0.05 0.1 0.15 0.2 (b) Figure 3: Embedding the job market along with field r − 2 U over the first two non-constant eigenvectors. The color map corresponds to the mean monthly wage in dollars (a) and to the female proportion (b) for each job. 
We obtain an embedding of the job market that describes the relative position of jobs, their distribution, and the natural time progression from each job. Of these, the relative position and natural time progression are the most interesting. Together, they summarize the job market dynamics by describing which jobs are naturally “close” as well as where they can lead in the future. From a public policy perspective, this can potentially improve focus on certain jobs for helping individuals attain better upward mobility. The job market was found to be a high dimensional manifold. We present only the first two dimen(0) sions, that is, the second and third eigenvectors of Hss , since the first eigenvector is uninformative (constant) by construction. The eigenvectors showed correlation with important demographic data, such as wages and gender. Figure 3 displays this two-dimensional sub-embedding along with the directional information r − 2 U for each dimension. The plot shows very little net progression toward regions of increasing mean salary3 . This is somewhat surprising, but it is easy to overstate this observation: diffusion alone would be enough to move the individuals towards higher salary. What Figure 3 (a) suggests is that there appear to be no “external forces” advecting individuals towards higher salary. Nevertheless, there appear to be other external forces at play in the job market: Figure 3 (b), which is analogous to Figure 3 (a), but with gender replacing the salary color scheme, suggests that these forces push individuals towards greater gender differentiation. This is especially true amongst male-dominated jobs which appear to be advected toward the left edge of the embedding. Hence, this simple analysis of the job market can be seen as an indication that males and females tend to move away from each other over time, while neither seems to have a monopoly on high- or low- paying jobs. 6 Discussion This paper makes three contributions: (1) it introduces a manifold-based generative model for directed graphs with weighted edges, (2) it obtains asymptotic results for operators constructed from the directed graphs, and (3) these asymptotic results lead to a natural algorithm for estimating the model. 3 It is worth noting that in the NLSY data set, high paying jobs are teacher, nurse and mechanic. This is due to the fact that the career paths observed stop at at age 36, which is relatively early in an individual’s career. 7 Generative Models that assume that data are sampled from a manifold are standard for undirected graphs, but to our knowledge, none have yet been proposed for directed graphs. When W is symmetric, it is natural to assume that it depends on the points’ proximity. For asymmetric affinities W , one must include an additional component to explain the asymmetry. In the asymptotic limit, this is tantamount to defining a vector field on the manifold. Algorithm We have used from [9] the idea of defining anisotropic kernels (indexed by α) in order to separate the density p and the manifold geometry M. Also, we adopted their general assumptions about the symmetric part of the kernel. As a consequence, the recovery algorithm for p and M is identical to theirs. However, insofar as the asymmetric part of the kernel is concerned, everything, starting from the definition and the introduction of the vector field r as a way to model the asymmetry, through the derivation of the asymptotic expression for the symmetric plus asymmetric kernel, is new. 
We go significantly beyond the elegant idea of [9] regarding the use of anisotropic kernels by analyzing the four distinct renormalizations possible for a given α, each of them combining different aspects of M, p and r. Only the successful (and novel) combination of two different anisotropic operators is able to recover the directional flow r. Algorithm 1 is natural, but we do not claim it is the only possible one in the context of our model. For instance, we can also use H_sa^(α) to recover the operator ∇·r (which empirically seems to have worse numerical properties than r·∇). In the National Longitudinal Survey of Youth study, we were interested in the whole advective term, so we estimated it from a different combination of operators. Depending on the specific question, other features of the model could be obtained.

Limit Results Proposition 3.1 is a general result on the asymptotics of asymmetric kernels. Recovering the manifold and r is just one, albeit the most useful, of the many ways of exploiting these results. For instance, H_sa^(0) is the limit operator of the operators used in [3] and [5]. The limit analysis could also be extended to other digraph embedding algorithms such as [4, 6].

How general is our model? Any kernel can be decomposed into a symmetric and an asymmetric part, as we have done. The assumptions on the symmetric part h are standard. The paper [7] goes one step further from these assumptions; we discuss its relationship to our work shortly. The more interesting question is how limiting our assumptions are regarding the choice of kernel, especially the asymmetric part, which we parameterized as a(x, y) = (r/2)·(y − x) h(x, y) in (4). In the asymptotic limit, this choice turns out to be fully general, at least up to the identifiable aspects of the model. For a more detailed discussion of this issue, see [8].

In [7], Ting, Huang and Jordan presented asymptotic results for a general family of kernels that includes asymmetric and random kernels. Our kernel can be expressed in the notation of [7] by taking wx(y) ← 1 + r(x, y)·(y − x), rx(y) ← 1, K0 ← h, and h ← ε. Their assumptions are more general than the assumptions we make here, yet our model is general up to what can be identified from G alone. The distinction arises because [7] focuses on graph construction methods from an observed sample of M, while we focus on explaining an observed directed graph G through a manifold generative process. Moreover, while the results of [7] can be used to analyze data from directed graphs, they differ from our Proposition 3.1. Specifically, with respect to the limit in Theorem 3 of [7], we obtain the additional source terms f(x)∇·r and c(x)f(x), which follow from not enforcing conservation of mass when defining the operators H_sa^(α) and H_as^(α).

We applied our theory of directed graph embedding to the analysis of the career sequences in Section 5, but asymmetric affinity data abound in other social contexts, as well as in the physical and life sciences. Indeed, any "similarity" score that is obtained from a likelihood of the form Wvu = likelihood(u|v) is generally asymmetric. Hence our methods can be applied to study not only social networks, but also patterns of human movement, road traffic, and trade relations, as well as alignment scores in molecular biology. Finally, the physical interpretation of our model also makes it naturally applicable to physical models of flows.

Acknowledgments This research was partially supported by NSF awards IIS-0313339 and IIS-0535100.
References

[1] Belkin and Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15:1373–1396, 2002.
[2] Nadler, Lafon, and Coifman. Diffusion maps, spectral clustering and eigenfunctions of Fokker-Planck operators. In Neural Information Processing Systems Conference, 2006.
[3] Meila and Pentney. Clustering by weighted cuts in directed graphs. In SIAM Data Mining Conference, 2007.
[4] Zhou, Huang, and Schölkopf. Learning from labeled and unlabeled data on a directed graph. In International Conference on Machine Learning, pages 1041–1048, 2005.
[5] Zhou, Schölkopf, and Hofmann. Semi-supervised learning on directed graphs. In Advances in Neural Information Processing Systems, volume 17, pages 1633–1640, 2005.
[6] Fan R. K. Chung. The diameter and Laplacian eigenvalues of directed graphs. Electr. J. Comb., 13, 2006.
[7] Ting, Huang, and Jordan. An analysis of the convergence of graph Laplacians. In International Conference on Machine Learning, 2010.
[8] Dominique Perrault-Joncas and Marina Meilă. Directed graph embedding: an algorithm based on continuous limits of Laplacian-type operators. Technical Report TR 587, University of Washington, Department of Statistics, November 2011.
[9] Coifman and Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21:6–30, 2006.
[10] Mikhail Belkin and Partha Niyogi. Convergence of Laplacian eigenmaps. Preprint; short version in NIPS 2008, 2008.
[11] Coifman, Lafon, Lee, Maggioni, Warner, and Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition of data: diffusion maps. Proceedings of the National Academy of Sciences, pages 7426–7431, 2005.
[12] United States Department of Labor. National Longitudinal Survey of Youth 1979 cohort. http://www.bls.gov/nls/, retrieved October 2011.
[13] Marc A. Scott. Affinity models for career sequences. Journal of the Royal Statistical Society: Series C (Applied Statistics), 60(3):417–436, 2011.
same-paper 2 0.77353388 306 nips-2011-t-divergence Based Approximate Inference
Author: Nan Ding, Yuan Qi, S.v.n. Vishwanathan
Abstract: Approximate inference is an important technique for dealing with large, intractable graphical models based on the exponential family of distributions. We extend the idea of approximate inference to the t-exponential family by defining a new t-divergence. This divergence measure is obtained via convex duality between the log-partition function of the t-exponential family and a new t-entropy. We illustrate our approach on the Bayes Point Machine with a Student’s t-prior. 1
3 0.71676052 166 nips-2011-Maximal Cliques that Satisfy Hard Constraints with Application to Deformable Object Model Learning
Author: Xinggang Wang, Xiang Bai, Xingwei Yang, Wenyu Liu, Longin J. Latecki
Abstract: We propose a novel inference framework for finding maximal cliques in a weighted graph that satisfy hard constraints. The constraints specify the graph nodes that must belong to the solution as well as mutual exclusions of graph nodes, i.e., sets of nodes that cannot belong to the same solution. The proposed inference is based on a novel particle filter algorithm with state permeations. We apply the inference framework to a challenging problem of learning part-based, deformable object models. Two core problems in the learning framework, matching of image patches and finding salient parts, are formulated as two instances of the problem of finding maximal cliques with hard constraints. Our learning framework yields discriminative part based object models that achieve very good detection rate, and outperform other methods on object classes with large deformation. 1
4 0.66271436 195 nips-2011-On Learning Discrete Graphical Models using Greedy Methods
Author: Ali Jalali, Christopher C. Johnson, Pradeep K. Ravikumar
Abstract: In this paper, we address the problem of learning the structure of a pairwise graphical model from samples in a high-dimensional setting. Our first main result studies the sparsistency, or consistency in sparsity pattern recovery, properties of a forward-backward greedy algorithm as applied to general statistical models. As a special case, we then apply this algorithm to learn the structure of a discrete graphical model via neighborhood estimation. As a corollary of our general result, we derive sufficient conditions on the number of samples n, the maximum node degree d and the problem size p, as well as other conditions on the model parameters, so that the algorithm recovers all the edges with high probability. Our result guarantees graph selection for samples scaling as n = Ω(d^2 log(p)), in contrast to existing convex-optimization based algorithms that require a sample complexity of Ω(d^3 log(p)). Further, the greedy algorithm only requires a restricted strong convexity condition which is typically milder than irrepresentability assumptions. We corroborate these results using numerical simulations at the end.
5 0.58536375 273 nips-2011-Structural equations and divisive normalization for energy-dependent component analysis
Author: Jun-ichiro Hirayama, Aapo Hyvärinen
Abstract: Components estimated by independent component analysis and related methods are typically not independent in real data. A very common form of nonlinear dependency between the components is correlations in their variances or energies. Here, we propose a principled probabilistic model to model the energy-correlations between the latent variables. Our two-stage model includes a linear mixing of latent signals into the observed ones like in ICA. The main new feature is a model of the energy-correlations based on the structural equation model (SEM), in particular, a Linear Non-Gaussian SEM. The SEM is closely related to divisive normalization which effectively reduces energy correlation. Our new two-stage model enables estimation of both the linear mixing and the interactions related to energy-correlations, without resorting to approximations of the likelihood function or other non-principled approaches. We demonstrate the applicability of our method with a synthetic dataset, natural images and brain signals. 1
6 0.58481264 258 nips-2011-Sparse Bayesian Multi-Task Learning
7 0.58153772 301 nips-2011-Variational Gaussian Process Dynamical Systems
8 0.58111733 86 nips-2011-Empirical models of spiking in neural populations
9 0.58029389 183 nips-2011-Neural Reconstruction with Approximate Message Passing (NeuRAMP)
10 0.57669514 92 nips-2011-Expressive Power and Approximation Errors of Restricted Boltzmann Machines
11 0.5764637 24 nips-2011-Active learning of neural response functions with Gaussian processes
12 0.57561243 281 nips-2011-The Doubly Correlated Nonparametric Topic Model
13 0.57439673 75 nips-2011-Dynamical segmentation of single trials from population neural data
14 0.57275784 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
15 0.57222074 135 nips-2011-Information Rates and Optimal Decoding in Large Neural Populations
16 0.56970519 15 nips-2011-A rational model of causal inference with continuous causes
17 0.56919092 267 nips-2011-Spectral Methods for Learning Multivariate Latent Tree Structure
18 0.56707418 140 nips-2011-Kernel Embeddings of Latent Tree Graphical Models
19 0.56706798 83 nips-2011-Efficient inference in matrix-variate Gaussian models with \iid observation noise
20 0.56692773 236 nips-2011-Regularized Laplacian Estimation and Fast Eigenvector Approximation