nips nips2007 nips2007-66 knowledge-graph by maker-knowledge-mining

66 nips-2007-Density Estimation under Independent Similarly Distributed Sampling Assumptions


Source: pdf

Author: Tony Jebara, Yingbo Song, Kapil Thadani

Abstract: A method is proposed for semiparametric estimation where parametric and nonparametric criteria are exploited in density estimation and unsupervised learning. This is accomplished by making sampling assumptions on a dataset that smoothly interpolate from the extreme of independently distributed (or id) sample data (as in nonparametric kernel density estimators) to the extreme of independent identically distributed (or iid) sample data. This article makes independent similarly distributed (or isd) sampling assumptions and interpolates between these two using a scalar parameter. The parameter controls a Bhattacharyya affinity penalty between pairs of distributions on samples. Surprisingly, the isd method maintains certain consistency and unimodality properties akin to maximum likelihood estimation. The proposed isd scheme is an alternative for handling nonstationarity in data without making drastic hidden variable assumptions which often make estimation difficult and laden with local optima. Experiments in density estimation on a variety of datasets confirm the value of isd over iid estimation, id estimation and mixture modeling.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 A method is proposed for semiparametric estimation where parametric and nonparametric criteria are exploited in density estimation and unsupervised learning. [sent-3, score-0.305]

2 This is accomplished by making sampling assumptions on a dataset that smoothly interpolate from the extreme of independently distributed (or id) sample data (as in nonparametric kernel density estimators) to the extreme of independent identically distributed (or iid) sample data. [sent-4, score-0.231]

3 Surprisingly, the isd method maintains certain consistency and unimodality properties akin to maximum likelihood estimation. [sent-7, score-0.818]

4 The proposed isd scheme is an alternative for handling nonstationarity in data without making drastic hidden variable assumptions which often make estimation difficult and laden with local optima. [sent-8, score-0.814]

5 Experiments in density estimation on a variety of datasets confirm the value of isd over iid estimation, id estimation and mixture modeling. [sent-9, score-1.356]

6 Most approaches can be split into two categories: parametric methods where the functional form of the distribution is known a priori (often from the exponential family (Collins et al. [sent-11, score-0.196]

7 A popular non-parametric approach is kernel density estimation or the Parzen windows method (Silverman, 1986). [sent-14, score-0.15]

8 Alternatively, one may seed non-parametric distributions with parametric assumptions (Hjort & Glad, 1995) or augment parametric models with nonparametric factors (Naito, 2004). [sent-20, score-0.262]

9 This article instead proposes a continuous interpolation between iid parametric density estimation and id kernel density estimation. [sent-21, score-0.775]

10 In isd, a scalar parameter λ trades off parametric and non-parametric properties to produce an overall better density estimate. [sent-23, score-0.172]

11 The method avoids sampling or approximate inference computations and only recycles well known parametric update rules for estimation. [sent-24, score-0.15]

12 Section 2 shows how id and iid sampling setups can be smoothly interpolated using a novel isd posterior which maintains log-concavity for many popular models. [sent-27, score-1.303]

13 Section 3 gives analytic formulae for the exponential family case as well as slight modifications to familiar maximum likelihood updates for recovering parameters under isd assumptions. [sent-28, score-0.867]

14 Some consistency properties of the isd posterior are provided. [sent-29, score-0.791]

15 Section 5 provides experiments comparing isd with id and iid as well as mixture modeling. [sent-31, score-1.198]

16 2 A Continuum between id and iid: Assume we are given a dataset of N − 1 inputs x1, . . . , xN−1. [sent-33, score-0.454]

17 Given a new query input xN also in the same sample space, density estimation aims at recovering a density function p(x1, . . . , xN). [sent-37, score-0.215]

18 A common subsequent assumption is that the data points are id or independently sampled, which leads to the following simplification: pid(X) = ∏_{n=1}^{N} pn(xn). [sent-51, score-0.604]

19 The joint likelihood factorizes into a product of independent singleton marginals pn(xn), each of which can be different. [sent-52, score-0.409]

20 Forcing all marginals to be identical gives piid(X) = ∏_{n=1}^{N} p(xn), which is the popular iid sampling situation. [sent-54, score-0.245]

21 The id setup gives rise to what is commonly referred to as kernel density or Parzen estimation. [sent-56, score-0.374]

22 Meanwhile, the iid setup gives rise to traditional iid parametric maximum likelihood (ML) or maximum a posteriori (MAP) estimation. [sent-57, score-0.555]

23 The iid assumption may be too aggressive for many real world problems. [sent-59, score-0.208]

24 Similarly, the id setup may be too flexible and might over-fit when the marginal pn (x) is myopically recovered from a single xn . [sent-61, score-0.696]

25 The MAP id parametric setting involves maximizing the following posterior (likelihood times a prior) over the models: pid(X, Θ) = ∏_{n=1}^{N} p(xn|θn) p(θn). [sent-67, score-0.45]

26 To obtain the iid setup, we can maximize pid subject to constraints that force all marginals to be equal, in other words θm = θn for all m, n ∈ {1, . . . , N}. [sent-71, score-0.342]

27 Instead of applying N (N − 1)/2 hard pairwise constraints in an iid setup, consider imposing penalty functions across pairs of marginals. [sent-75, score-0.241]

28 These penalty functions reduce the posterior score when marginals disagree and encourage some stickiness between models (Teh et al. [sent-76, score-0.18]

29 We measure the level of agreement between two marginals pm(x) and pn(x) using the following Bhattacharyya affinity metric (Bhattacharyya, 1943) between two distributions: B(pm, pn) = B(p(x|θm), p(x|θn)) = ∫ p^β(x|θm) p^β(x|θn) dx. [sent-78, score-0.948]
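As a quick illustration (a sketch of ours, not code from the paper), the affinity can be approximated on a grid for two univariate Gaussians; with β = 1/2 it is symmetric, bounded by one, and equal to one exactly when the two densities coincide. The function name and grid settings are our own choices.

```python
# Grid-based approximation of B(pm, pn) = int pm(x)^beta pn(x)^beta dx
# for two univariate Gaussians (a sketch, not from the paper).
import numpy as np

def bhattacharyya_affinity(mu_m, mu_n, sigma=1.0, beta=0.5):
    """Approximate B(pm, pn) for N(mu_m, sigma^2) and N(mu_n, sigma^2) on a grid."""
    x = np.linspace(min(mu_m, mu_n) - 10 * sigma, max(mu_m, mu_n) + 10 * sigma, 20001)
    dx = x[1] - x[0]
    pm = np.exp(-0.5 * ((x - mu_m) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    pn = np.exp(-0.5 * ((x - mu_n) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return np.sum(pm ** beta * pn ** beta) * dx

print(bhattacharyya_affinity(0.0, 0.0))   # ~1.0: maximal when pm = pn
print(bhattacharyya_affinity(0.0, 2.0))   # ~0.6065 = exp(-(0 - 2)**2 / 8) for unit variance
print(bhattacharyya_affinity(2.0, 0.0))   # same value: the affinity is symmetric
```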

30 This is a symmetric non-negative quantity in both distributions pm and pn . [sent-79, score-0.641]

31 The natural choice for the setting of β is 1/2 and in this case, it is easy to verify the affinity is maximal and equals one if and only if pm (x) = pn (x). [sent-80, score-0.612]

32 A large family of alternative information divergences exist to relate pairs of distributions (Topsoe, 1999) and are discussed in the Appendix. [sent-81, score-0.139]

33 In addition, it leads to straightforward variants of the estimation algorithms as in the id and iid situations for many choices of parametric densities. [sent-83, score-0.58]

34 Clearly, if λ → 0, then the affinity penalty is always unity and the marginals are completely unconstrained as in the id setup. [sent-89, score-0.293]

35 We will refer to the isd posterior as Equation 1 and when p(θn ) is set to uniform, we will call it the isd likelihood. [sent-92, score-1.464]

36 One can also view the additional term in isd as id estimation with a modified prior p̃(Θ) as follows: p̃(Θ) ∝ ∏_n p(θn) ∏_{m≠n} B^{λ/N}(p(x|θm), p(x|θn)). [sent-93, score-0.995]

37 This prior is a Markov random field tying all parameters in a pairwise manner in addition to the standard singleton potentials in the id scenario. [sent-94, score-0.317]

38 However, this perspective is less appealing since it disguises the fact that the samples are not quite id or iid. [sent-95, score-0.246]

39 One of the appealing properties of iid and id maximum likelihood estimation is its unimodality for log-concave distributions. [sent-96, score-0.544]

40 The isd posterior also benefits from a unique optimum and log-concavity. [sent-97, score-0.758]

41 This set of distributions includes the Gaussian distribution (with fixed variance) and many exponential family distributions such as the Poisson, multinomial and exponential distribution. [sent-99, score-0.205]

42 We next show that the isd posterior score for log-concave distributions is log-concave in Θ. [sent-100, score-0.814]

43 This produces a unique estimate for the parameters as was the case for id and iid setups. [sent-101, score-0.488]

44 Theorem 1 The isd posterior is log-concave for jointly log-concave density distributions and for log-concave prior distributions. [sent-102, score-0.892]

45 Proof 1 The isd log-posterior is the sum of the id log-likelihoods, the singleton log-priors and the pairwise log-Bhattacharyya affinities: log pλ(X, Θ) = const + ∑_n log p(xn|θn) + ∑_n log p(θn) + (λ/N) ∑_n ∑_{m≠n} log B(pm, pn). [sent-103, score-1.533]
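For concreteness, this objective can be evaluated directly in the unit-variance Gaussian mean case, where log B(pm, pn) = −(θm − θn)²/8 and a uniform prior contributes only a constant. The sketch below is ours; names and example values are illustrative, not from the paper.

```python
# Evaluate the isd log-posterior (up to constants, uniform priors) for
# unit-variance Gaussian means; log B(pm, pn) = -(theta_m - theta_n)**2 / 8.
import numpy as np

def isd_log_posterior(theta, x, lam):
    theta = np.asarray(theta, dtype=float)
    x = np.asarray(x, dtype=float)
    N = len(x)
    # sum_n log p(x_n | theta_n) for N(theta_n, 1)
    log_lik = np.sum(-0.5 * (x - theta) ** 2 - 0.5 * np.log(2.0 * np.pi))
    diffs = theta[:, None] - theta[None, :]
    log_bhatt = -(diffs ** 2) / 8.0                    # log B(pm, pn) for all pairs
    pair_term = log_bhatt.sum() - np.trace(log_bhatt)  # keep only m != n terms
    return log_lik + lam / N * pair_term

x = np.array([-2.0, -1.5, 0.3, 1.8, 2.4])
print(isd_log_posterior(x, x, lam=2.0))                       # id solution theta_n = x_n
print(isd_log_posterior(np.full(5, x.mean()), x, lam=2.0))    # iid solution theta_n = mean(x)
```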

46 The id log-likelihood is the sum of the log-probabilities of distributions that are log-concave in the parameters and is therefore concave. [sent-104, score-0.292]

47 Finally, since the isd log-posterior is the sum of concave terms and concave log-Bhattacharyya affinities, it must be concave. [sent-111, score-0.742]

48 Furthermore, the isd setup will produce convenient update rules that build upon iid estimation algorithms. [sent-113, score-1.04]

49 There are additional properties of isd which are detailed in the following sections. [sent-114, score-0.706]

50 3 Exponential Family Distributions and β = 1/2: We first specialize the above derivations to the case where the singleton marginals obey the exponential family form as follows: p(x|θn) = exp(H(x) + θn^T T(x) − A(θn)). [sent-116, score-0.198]

51 Therefore, exponential family distributions are always log-concave in the parameters θn . [sent-121, score-0.143]

52 For the exponential family, the Bhattacharyya affinity is computable in closed form as follows: B(pm , pn ) = exp (A(θm /2 + θn /2) − A(θm )/2 − A(θn )/2) . [sent-122, score-0.339]
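A small sanity check of this closed form (our own specialization, not code from the paper) for the simplest member of the family, a unit-variance Gaussian mean with log-partition A(θ) = θ²/2, where the formula reduces to B(pm, pn) = exp(−(θm − θn)²/8).

```python
# Closed-form exponential-family Bhattacharyya affinity, specialized to the
# unit-variance Gaussian mean (a sketch; function names are ours).
import numpy as np

def A(theta):
    return 0.5 * theta ** 2            # log-partition of N(theta, 1) in its mean

def bhattacharyya_expfam(theta_m, theta_n):
    return np.exp(A(0.5 * (theta_m + theta_n)) - 0.5 * A(theta_m) - 0.5 * A(theta_n))

theta_m, theta_n = 0.0, 2.0
print(bhattacharyya_expfam(theta_m, theta_n))      # ~0.6065
print(np.exp(-(theta_m - theta_n) ** 2 / 8.0))     # known Gaussian value, same number
```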

53 Assuming uniform priors on the exponential family parameters, it is now straightforward to write an iterative algorithm to maximize the isd posterior. [sent-123, score-0.848]

54 We seek the settings θ1, . . . , θN that maximize the isd posterior or log pλ(X, Θ) using a simple greedy method. [sent-127, score-0.82]

55 It suffices to consider only terms in log pλ(X, Θ) that are variable with θn: log pλ(X, θn, Θ̃/n) = const + θn^T T(xn) − ((N + λ(N − 1))/N) A(θn) + (2λ/N) ∑_{m≠n} A(θ̃m/2 + θn/2). [sent-133, score-0.15]

56 If the exponential family is jointly log-concave in parameters and data (as is the case for Gaussians), this term is log-concave in θn. [sent-134, score-0.147]

57 Since A(θ) is a convex function, we can compute a linear variational lower bound on each A(θm/2 + θn/2) term for the current setting θ̃n: log pλ(X, θn, Θ̃/n) ≥ const + θn^T T(xn) − ((N + λ(N − 1))/N) A(θn) + (λ/N) ∑_{m≠n} [2A(θ̃m/2 + θ̃n/2) + A′(θ̃m/2 + θ̃n/2)^T (θn − θ̃n)]. [sent-141, score-0.139]

58 Since we have a variational lower bound, each iterative update of θn monotonically increases the isd posterior. [sent-145, score-0.796]
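Specializing the resulting fixed-point update to unit-variance Gaussian means (where T(x) = x and A′(θ) = θ) gives a particularly simple coordinate-ascent scheme. The sketch below is our own specialization and naming, not the authors' code; it only illustrates how λ interpolates between the id solution (each θn stays at xn) and an iid-like solution (all θn pulled toward the sample mean).

```python
# Coordinate-ascent isd update for unit-variance Gaussian means (a sketch).
# Each pass applies
#   theta_n <- N/(N + lam*(N-1)) * ( x_n + (lam/N) * sum_{m != n} (theta_m + theta_n)/2 )
# using the current parameter values on the right-hand side.
import numpy as np

def isd_gaussian_means(x, lam, n_iters=200):
    x = np.asarray(x, dtype=float)
    N = len(x)
    theta = x.copy()                              # start at the id (Parzen-like) solution
    for _ in range(n_iters):
        for n in range(N):
            pair_terms = 0.5 * (theta + theta[n])       # (theta_m + theta_n)/2 for all m
            s = pair_terms.sum() - pair_terms[n]        # exclude m == n
            theta[n] = N / (N + lam * (N - 1)) * (x[n] + lam / N * s)
    return theta

x = np.array([-2.0, -1.5, 0.3, 1.8, 2.4])
print(isd_gaussian_means(x, lam=0.0))    # lambda -> 0: means stay at the data (id)
print(isd_gaussian_means(x, lam=100.0))  # large lambda: means close to the sample mean (iid-like)
```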

59 We can also work with a robust (yet not log-concave) version of the isd score which has the form: log p̂λ(X, Θ) = const + ∑_n log p(xn|θn) + ∑_n log p(θn) + (λ/N) ∑_n log ( ∑_{m≠n} B(pm, pn) ). [sent-146, score-1.26]

60 This leads to the general update rule (where α = 0 reproduces isd and larger α increases robustness): A′(θn) = (N/(N + λ(N − 1))) ( T(xn) + (λ/N) ∑_{m≠n} [ (N − 1) B^α(p(x|θ̃m), p(x|θ̃n)) / ∑_{l≠n} B^α(p(x|θ̃l), p(x|θ̃n)) ] A′(θ̃m/2 + θ̃n/2) ). [sent-147, score-0.752]

61 We next examine marginal consistency, another important property of the isd posterior. [sent-148, score-0.732]

62 It is possible to show that the isd posterior is marginally consistent at least in the Gaussian mean case (one element of the exponential family). [sent-152, score-0.824]

63 In other words, marginalizing over an observation and its associated marginal’s parameter (which can be taken to be xN and θN without loss of generality) still produces a similar isd posterior on the remaining observations X/N and parameters Θ/N . [sent-153, score-0.792]

64 Theorem 2 The isd posterior with β = 1/2 is marginally consistent for Gaussian distributions. [sent-156, score-0.774]

65 Therefore, we get the same isd score that we would have obtained had we started with only (N − 1) data points. [sent-158, score-0.733]

66 The isd estimator thus has useful properties and still agrees with id when λ = 0 and iid when λ = ∞. [sent-160, score-1.185]

67 Next, the estimator is generalized to handle distributions beyond the exponential family where latent variables are implicated (as is the case for mixtures of Gaussians, hidden Markov models, latent graphical models and so on). [sent-161, score-0.246]

68 4 Hidden Variable Models and β = 1: One important limitation of most divergences between distributions is that they become awkward when dealing with hidden variables or mixture models. [sent-162, score-0.158]

69 Thus, evaluating the isd posterior is straightforward for such models when β = 1. [sent-171, score-0.78]

70 We next provide a variational method that makes it possible to maximize a lower bound on the isd posterior in these cases. [sent-172, score-0.809]

71 This gives isd the following auxiliary function: Q(θn|Θ̃) = ∑_h p(h|xn, θ̃n) log p(xn, h|θn) + log p(θn) + (2λ/N) ∑_{m≠n} ∫ p(x|θ̃m) ∑_h p(h|x, θ̃n) log p(x, h|θn) dx. [sent-182, score-0.856]

72 This is a variational lower bound which can be iteratively maximized instead of the original isd posterior. [sent-183, score-0.739]

73 Approximating the integral with s = 1, . . . , S virtual samples x̃m,s obtained from the m'th model p(x|θ̃m) for each of the other N − 1 models gives Q(θn|Θ̃) = ∑_h p(h|xn, θ̃n) log p(xn, h|θn) + log p(θn) + (2λ/(SN)) ∑_{m≠n} ∑_s ∑_h p(h|x̃m,s, θ̃n) log p(x̃m,s, h|θn). [sent-188, score-0.132]

74 5 Experiments: A preliminary way to evaluate the usefulness of the isd framework is to explore density estimation over real-world datasets under varying λ. [sent-192, score-0.821]

75 If we set λ large, we have the standard iid setup and only fit a single parametric model to the dataset. [sent-193, score-0.328]

76 In between, an iterative algorithm is available to maximize the isd posterior to obtain potentially superior models θ1*, . . . , θN*. [sent-195, score-0.825]

77 Figure 1 shows the isd estimator with Gaussian models on a ring-shaped 2D dataset. [sent-199, score-0.753]

78 To evaluate performance on real data, we aggregate the isd learned models into a single density estimate as is done with Parzen estimators and compute the iid likelihood of held-out test data.
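A minimal sketch of this evaluation protocol, under the assumption of univariate fixed-bandwidth Gaussian components aggregated uniformly as in Parzen estimation; the function name, bandwidth and example θ values are our own illustration, not the authors' setup.

```python
# Aggregate N learned Gaussian means theta_n into a single Parzen-style mixture
# p(x) = (1/N) sum_n N(x; theta_n, sigma^2) and score held-out points by their
# average log-likelihood under that mixture (a sketch, not the authors' code).
import numpy as np

def aggregate_test_log_likelihood(test_x, theta, sigma=1.0):
    test_x = np.asarray(test_x, dtype=float)[:, None]    # shape (T, 1)
    theta = np.asarray(theta, dtype=float)[None, :]      # shape (1, N)
    log_comp = (-0.5 * ((test_x - theta) / sigma) ** 2
                - np.log(sigma * np.sqrt(2.0 * np.pi)))  # log N(x_t; theta_n, sigma^2)
    log_mix = np.logaddexp.reduce(log_comp, axis=1) - np.log(theta.shape[1])
    return log_mix.mean()

# Illustrative values only: theta_star would come from the isd update sketched earlier.
theta_star = np.array([-1.8, -1.3, 0.3, 1.6, 2.1])
test_x = np.array([-1.0, 0.5, 2.0])
print(aggregate_test_log_likelihood(test_x, theta_star, sigma=0.8))
```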

79 Figure 1: Estimation with isd for Gaussian models (mean and covariance) on synthetic data. [sent-257, score-0.728]

80 Table 1: Gaussian test log-likelihoods using id, iid, EM, ∞ GMM and isd estimation. [sent-312, score-0.706]

81 On 6 standard datasets, we show the average test log-likelihood of Gaussian estimation while varying the settings of λ compared to a single iid Gaussian, an id Parzen RBF estimator and a mixture of 2 to 5 Gaussians using EM. [sent-316, score-0.56]

82 Cross-validation was used to choose σ, λ or the EM local minimum (from ten initializations) for the id, isd and EM algorithms, respectively. [sent-318, score-0.706]

83 The test log-likelihoods show that isd outperformed iid, id and EM estimation and was comparable to infinite Gaussian mixture (iid−∞) models (Rasmussen, 1999) (which is a far more computationally demanding method). [sent-320, score-1.055]

84 Table 2 shows that the isd estimator for certain values of λ produced higher test log-likelihoods than id and iid. [sent-326, score-0.977]

85 6 Discussion: This article has provided an isd scheme to smoothly interpolate between id and iid assumptions in density estimation. [sent-327, score-1.324]

86 The method maintains simple update rules for recovering parameters for exponential families as well as mixture models. [sent-329, score-0.23]

87 In addition, the isd posterior maintains useful log-concavity and marginal consistency properties. [sent-330, score-0.849]

88 Experiments show its advantages in real-world datasets where id or iid assumptions may be too extreme. [sent-331, score-0.474]

89 We are also considering computing the isd posterior [sent-333, score-0.706]

90 Table 2: HMM test log-likelihoods using id, iid and isd estimation. [sent-343, score-0.914]

91 7 Appendix: Alternative Information Divergences. There is a large family of information divergences (Topsoe, 1999) between pairs of distributions (Renyi measure, variational distance, χ2 divergence, etc. [sent-345, score-0.172]

92 These can be used to pull models pm and pn towards each other. [sent-346, score-0.655]

93 An alternative is the Kullback-Leibler divergence D(pm‖pn) = ∫ pm(x)(log pm(x) − log pn(x)) dx and its symmetrized variant D(pm‖pn)/2 + D(pn‖pm)/2. [sent-348, score-1.915]

94 Consider a variational distribution q that lies between the input pm and pn . [sent-350, score-0.645]

95 The log Bhattacharyya affinity with β = 1/2 can be written as follows: log B(pm, pn) = log ∫ q(x) √(pm(x)pn(x))/q(x) dx ≥ −D(q‖pm)/2 − D(q‖pn)/2. [sent-351, score-1.003]

96 Thus, B(pm, pn) ≥ exp(−D(q‖pm)/2 − D(q‖pn)/2). [sent-352, score-0.901]

97 The choice of q that maximizes the lower bound on the Bhattacharyya affinity is q(x) = (1/Z)√(pm(x)pn(x)). [sent-353, score-0.323]

98 Here, Z = B(pm , pn ) normalizes q(x) and is therefore equal to the Bhattacharyya affinity. [sent-354, score-0.289]

99 Thus we have the following property: −2 log B(pm, pn) = min_q [D(q‖pm) + D(q‖pn)]. [sent-355, score-0.945]
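This identity is easy to check numerically; the sketch below (ours, not from the paper) verifies it for two unit-variance Gaussians using the optimal q from the argument above.

```python
# Numerical check of -2 log B(pm, pn) = min_q [ D(q||pm) + D(q||pn) ]
# for two unit-variance Gaussians, with q(x) proportional to sqrt(pm(x) pn(x)).
import numpy as np

mu_m, mu_n = 0.0, 2.0
x = np.linspace(-12.0, 14.0, 40001)
dx = x[1] - x[0]

def gauss(x, mu):
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2.0 * np.pi)

pm, pn = gauss(x, mu_m), gauss(x, mu_n)
unnorm_q = np.sqrt(pm * pn)
Z = unnorm_q.sum() * dx            # Z equals the Bhattacharyya affinity B(pm, pn)
q = unnorm_q / Z

def kl(q, p):
    return np.sum(q * (np.log(q) - np.log(p))) * dx

print(-2.0 * np.log(Z))            # -2 log B(pm, pn)
print(kl(q, pm) + kl(q, pn))       # matches at the minimizing q
# For unit-variance Gaussians both equal (mu_m - mu_n)**2 / 4, i.e. 1.0 here.
```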

100 Simple manipulations then show 2JS(pm, pn) ≤ min(D(pm‖pn), D(pn‖pm)). [sent-357, score-0.901]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('isd', 0.706), ('pm', 0.323), ('pn', 0.289), ('id', 0.246), ('bhattacharyya', 0.209), ('iid', 0.208), ('af', 0.141), ('nity', 0.102), ('xn', 0.098), ('parametric', 0.083), ('density', 0.072), ('pid', 0.069), ('const', 0.062), ('parzen', 0.058), ('dxn', 0.055), ('singleton', 0.054), ('posterior', 0.052), ('jebara', 0.051), ('exponential', 0.05), ('nities', 0.048), ('symmetrized', 0.048), ('family', 0.047), ('marginals', 0.047), ('divergences', 0.046), ('hidden', 0.045), ('log', 0.044), ('estimation', 0.043), ('prekopa', 0.042), ('topsoe', 0.042), ('semiparametric', 0.039), ('mixture', 0.038), ('setup', 0.037), ('jointly', 0.033), ('variational', 0.033), ('consistency', 0.033), ('em', 0.032), ('maintains', 0.032), ('article', 0.032), ('divergence', 0.031), ('gyor', 0.031), ('update', 0.03), ('rasmussen', 0.029), ('distributions', 0.029), ('recovering', 0.028), ('mixtures', 0.028), ('glad', 0.028), ('hjort', 0.028), ('naito', 0.028), ('olking', 0.028), ('piid', 0.028), ('spiegelman', 0.028), ('unimodality', 0.028), ('wand', 0.028), ('score', 0.027), ('iterative', 0.027), ('devroye', 0.026), ('permits', 0.026), ('marginal', 0.026), ('estimator', 0.025), ('nonparametric', 0.025), ('silverman', 0.024), ('dx', 0.024), ('gaussian', 0.023), ('smoothly', 0.022), ('teh', 0.022), ('pos', 0.022), ('integral', 0.022), ('models', 0.022), ('sampling', 0.021), ('meanwhile', 0.021), ('pull', 0.021), ('assumptions', 0.02), ('formula', 0.02), ('gaussians', 0.02), ('markov', 0.02), ('efron', 0.019), ('collins', 0.019), ('families', 0.019), ('likelihood', 0.019), ('kernel', 0.019), ('interpolate', 0.018), ('maximize', 0.018), ('auxiliary', 0.018), ('hmms', 0.018), ('ml', 0.018), ('concave', 0.018), ('parameters', 0.017), ('distributed', 0.017), ('jones', 0.017), ('scalar', 0.017), ('pairs', 0.017), ('produces', 0.017), ('arguments', 0.017), ('rule', 0.016), ('award', 0.016), ('marginally', 0.016), ('et', 0.016), ('popular', 0.016), ('penalty', 0.016), ('rules', 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 66 nips-2007-Density Estimation under Independent Similarly Distributed Sampling Assumptions

Author: Tony Jebara, Yingbo Song, Kapil Thadani

Abstract: A method is proposed for semiparametric estimation where parametric and nonparametric criteria are exploited in density estimation and unsupervised learning. This is accomplished by making sampling assumptions on a dataset that smoothly interpolate from the extreme of independently distributed (or id) sample data (as in nonparametric kernel density estimators) to the extreme of independent identically distributed (or iid) sample data. This article makes independent similarly distributed (or isd) sampling assumptions and interpolates between these two using a scalar parameter. The parameter controls a Bhattacharyya affinity penalty between pairs of distributions on samples. Surprisingly, the isd method maintains certain consistency and unimodality properties akin to maximum likelihood estimation. The proposed isd scheme is an alternative for handling nonstationarity in data without making drastic hidden variable assumptions which often make estimation difficult and laden with local optima. Experiments in density estimation on a variety of datasets confirm the value of isd over iid estimation, id estimation and mixture modeling.

2 0.17143399 181 nips-2007-Sparse Overcomplete Latent Variable Decomposition of Counts Data

Author: Madhusudana Shashanka, Bhiksha Raj, Paris Smaragdis

Abstract: An important problem in many fields is the analysis of counts data to extract meaningful latent components. Methods like Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA) have been proposed for this purpose. However, they are limited in the number of components they can extract and lack an explicit provision to control the “expressiveness” of the extracted components. In this paper, we present a learning formulation to address these limitations by employing the notion of sparsity. We start with the PLSA framework and use an entropic prior in a maximum a posteriori formulation to enforce sparsity. We show that this allows the extraction of overcomplete sets of latent components which better characterize the data. We present experimental evidence of the utility of such representations.

3 0.16628195 22 nips-2007-Agreement-Based Learning

Author: Percy Liang, Dan Klein, Michael I. Jordan

Abstract: The learning of probabilistic models with many hidden variables and nondecomposable dependencies is an important and challenging problem. In contrast to traditional approaches based on approximate inference in a single intractable model, our approach is to train a set of tractable submodels by encouraging them to agree on the hidden variables. This allows us to capture non-decomposable aspects of the data while still maintaining tractability. We propose an objective function for our approach, derive EM-style algorithms for parameter estimation, and demonstrate their effectiveness on three challenging real-world learning tasks. 1

4 0.078229889 161 nips-2007-Random Projections for Manifold Learning

Author: Chinmay Hegde, Michael Wakin, Richard Baraniuk

Abstract: We propose a novel method for linear dimensionality reduction of manifold modeled data. First, we show that with a small number M of random projections of sample points in RN belonging to an unknown K-dimensional Euclidean manifold, the intrinsic dimension (ID) of the sample set can be estimated to high accuracy. Second, we rigorously prove that using only this set of random projections, we can estimate the structure of the underlying manifold. In both cases, the number of random projections required is linear in K and logarithmic in N , meaning that K < M ≪ N . To handle practical situations, we develop a greedy algorithm to estimate the smallest size of the projection space required to perform manifold learning. Our method is particularly relevant in distributed sensing systems and leads to significant potential savings in data acquisition, storage and transmission costs.

5 0.070780307 44 nips-2007-Catching Up Faster in Bayesian Model Selection and Model Averaging

Author: Tim V. Erven, Steven D. Rooij, Peter Grünwald

Abstract: Bayesian model averaging, model selection and their approximations such as BIC are generally statistically consistent, but sometimes achieve slower rates of convergence than other methods such as AIC and leave-one-out cross-validation. On the other hand, these other methods can be inconsistent. We identify the catch-up phenomenon as a novel explanation for the slow convergence of Bayesian methods. Based on this analysis we define the switch-distribution, a modification of the Bayesian model averaging distribution. We prove that in many situations model selection and prediction based on the switch-distribution is both consistent and achieves optimal convergence rates, thereby resolving the AIC-BIC dilemma. The method is practical; we give an efficient algorithm. 1

6 0.065622352 47 nips-2007-Collapsed Variational Inference for HDP

7 0.058337048 87 nips-2007-Fast Variational Inference for Large-scale Internet Diagnosis

8 0.050702434 82 nips-2007-Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization

9 0.049126092 43 nips-2007-Catching Change-points with Lasso

10 0.045686826 185 nips-2007-Stable Dual Dynamic Programming

11 0.045515023 135 nips-2007-Multi-task Gaussian Process Prediction

12 0.044340581 160 nips-2007-Random Features for Large-Scale Kernel Machines

13 0.043450437 63 nips-2007-Convex Relaxations of Latent Variable Training

14 0.041165061 153 nips-2007-People Tracking with the Laplacian Eigenmaps Latent Variable Model

15 0.039986271 61 nips-2007-Convex Clustering with Exemplar-Based Models

16 0.037341796 105 nips-2007-Infinite State Bayes-Nets for Structured Domains

17 0.036420241 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes

18 0.035611685 8 nips-2007-A New View of Automatic Relevance Determination

19 0.034996014 84 nips-2007-Expectation Maximization and Posterior Constraints

20 0.034626059 213 nips-2007-Variational Inference for Diffusion Processes


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.122), (1, 0.034), (2, -0.036), (3, -0.039), (4, -0.012), (5, -0.033), (6, -0.074), (7, -0.03), (8, -0.067), (9, 0.046), (10, 0.051), (11, 0.042), (12, 0.008), (13, -0.055), (14, 0.04), (15, -0.069), (16, -0.011), (17, 0.152), (18, -0.089), (19, -0.048), (20, 0.002), (21, -0.078), (22, -0.016), (23, 0.038), (24, 0.063), (25, 0.055), (26, -0.054), (27, -0.127), (28, 0.108), (29, 0.06), (30, -0.058), (31, 0.114), (32, 0.14), (33, -0.104), (34, -0.139), (35, 0.218), (36, -0.019), (37, -0.101), (38, -0.059), (39, -0.039), (40, 0.073), (41, -0.113), (42, 0.039), (43, 0.015), (44, 0.147), (45, -0.082), (46, 0.04), (47, -0.048), (48, -0.09), (49, -0.01)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9472065 66 nips-2007-Density Estimation under Independent Similarly Distributed Sampling Assumptions

Author: Tony Jebara, Yingbo Song, Kapil Thadani

Abstract: A method is proposed for semiparametric estimation where parametric and nonparametric criteria are exploited in density estimation and unsupervised learning. This is accomplished by making sampling assumptions on a dataset that smoothly interpolate between the extreme of independently distributed (or id) sample data (as in nonparametric kernel density estimators) to the extreme of independent identically distributed (or iid) sample data. This article makes independent similarly distributed (or isd) sampling assumptions and interpolates between these two using a scalar parameter. The parameter controls a Bhattacharyya affinity penalty between pairs of distributions on samples. Surprisingly, the isd method maintains certain consistency and unimodality properties akin to maximum likelihood estimation. The proposed isd scheme is an alternative for handling nonstationarity in data without making drastic hidden variable assumptions which often make estimation difficult and laden with local optima. Experiments in density estimation on a variety of datasets confirm the value of isd over iid estimation, id estimation and mixture modeling.

2 0.60205853 22 nips-2007-Agreement-Based Learning

Author: Percy Liang, Dan Klein, Michael I. Jordan

Abstract: The learning of probabilistic models with many hidden variables and nondecomposable dependencies is an important and challenging problem. In contrast to traditional approaches based on approximate inference in a single intractable model, our approach is to train a set of tractable submodels by encouraging them to agree on the hidden variables. This allows us to capture non-decomposable aspects of the data while still maintaining tractability. We propose an objective function for our approach, derive EM-style algorithms for parameter estimation, and demonstrate their effectiveness on three challenging real-world learning tasks. 1

3 0.59431762 82 nips-2007-Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization

Author: Xuanlong Nguyen, Martin J. Wainwright, Michael I. Jordan

Abstract: We develop and analyze an algorithm for nonparametric estimation of divergence functionals and the density ratio of two probability distributions. Our method is based on a variational characterization of f -divergences, which turns the estimation into a penalized convex risk minimization problem. We present a derivation of our kernel-based estimation algorithm and an analysis of convergence rates for the estimator. Our simulation results demonstrate the convergence behavior of the method, which compares favorably with existing methods in the literature. 1

4 0.57951343 181 nips-2007-Sparse Overcomplete Latent Variable Decomposition of Counts Data

Author: Madhusudana Shashanka, Bhiksha Raj, Paris Smaragdis

Abstract: An important problem in many fields is the analysis of counts data to extract meaningful latent components. Methods like Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA) have been proposed for this purpose. However, they are limited in the number of components they can extract and lack an explicit provision to control the “expressiveness” of the extracted components. In this paper, we present a learning formulation to address these limitations by employing the notion of sparsity. We start with the PLSA framework and use an entropic prior in a maximum a posteriori formulation to enforce sparsity. We show that this allows the extraction of overcomplete sets of latent components which better characterize the data. We present experimental evidence of the utility of such representations.

5 0.47441342 44 nips-2007-Catching Up Faster in Bayesian Model Selection and Model Averaging

Author: Tim V. Erven, Steven D. Rooij, Peter Grünwald

Abstract: Bayesian model averaging, model selection and their approximations such as BIC are generally statistically consistent, but sometimes achieve slower rates of convergence than other methods such as AIC and leave-one-out cross-validation. On the other hand, these other methods can be inconsistent. We identify the catch-up phenomenon as a novel explanation for the slow convergence of Bayesian methods. Based on this analysis we define the switch-distribution, a modification of the Bayesian model averaging distribution. We prove that in many situations model selection and prediction based on the switch-distribution is both consistent and achieves optimal convergence rates, thereby resolving the AIC-BIC dilemma. The method is practical; we give an efficient algorithm. 1

6 0.38375485 87 nips-2007-Fast Variational Inference for Large-scale Internet Diagnosis

7 0.37675002 84 nips-2007-Expectation Maximization and Posterior Constraints

8 0.36628211 47 nips-2007-Collapsed Variational Inference for HDP

9 0.35897538 153 nips-2007-People Tracking with the Laplacian Eigenmaps Latent Variable Model

10 0.35288358 159 nips-2007-Progressive mixture rules are deviation suboptimal

11 0.32988629 63 nips-2007-Convex Relaxations of Latent Variable Training

12 0.32939842 119 nips-2007-Learning with Tree-Averaged Densities and Distributions

13 0.31681263 185 nips-2007-Stable Dual Dynamic Programming

14 0.27829251 61 nips-2007-Convex Clustering with Exemplar-Based Models

15 0.26488212 101 nips-2007-How SVMs can estimate quantiles and the median

16 0.25824589 145 nips-2007-On Sparsity and Overcompleteness in Image Models

17 0.25424033 111 nips-2007-Learning Horizontal Connections in a Sparse Coding Model of Natural Images

18 0.2518827 67 nips-2007-Direct Importance Estimation with Model Selection and Its Application to Covariate Shift Adaptation

19 0.2499778 192 nips-2007-Testing for Homogeneity with Kernel Fisher Discriminant Analysis

20 0.23911339 201 nips-2007-The Value of Labeled and Unlabeled Examples when the Model is Imperfect


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.064), (13, 0.026), (16, 0.039), (18, 0.016), (21, 0.057), (34, 0.03), (35, 0.026), (47, 0.055), (49, 0.022), (76, 0.271), (83, 0.14), (85, 0.04), (87, 0.016), (90, 0.091)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.79583645 23 nips-2007-An Analysis of Convex Relaxations for MAP Estimation

Author: Pawan Mudigonda, Vladimir Kolmogorov, Philip Torr

Abstract: The problem of obtaining the maximum a posteriori estimate of a general discrete random field (i.e. a random field defined using a finite and discrete set of labels) is known to be NP-hard. However, due to its central importance in many applications, several approximate algorithms have been proposed in the literature. In this paper, we present an analysis of three such algorithms based on convex relaxations: (i) LP - S: the linear programming (LP) relaxation proposed by Schlesinger [20] for a special case and independently in [4, 12, 23] for the general case; (ii) QP - RL: the quadratic programming (QP) relaxation by Ravikumar and Lafferty [18]; and (iii) SOCP - MS: the second order cone programming (SOCP) relaxation first proposed by Muramatsu and Suzuki [16] for two label problems and later extended in [14] for a general label set. We show that the SOCP - MS and the QP - RL relaxations are equivalent. Furthermore, we prove that despite the flexibility in the form of the constraints/objective function offered by QP and SOCP, the LP - S relaxation strictly dominates (i.e. provides a better approximation than) QP - RL and SOCP - MS. We generalize these results by defining a large class of SOCP (and equivalent QP) relaxations which is dominated by the LP - S relaxation. Based on these results we propose some novel SOCP relaxations which strictly dominate the previous approaches.

2 0.75851536 137 nips-2007-Multiple-Instance Pruning For Learning Efficient Cascade Detectors

Author: Cha Zhang, Paul A. Viola

Abstract: Cascade detectors have been shown to operate extremely rapidly, with high accuracy, and have important applications such as face detection. Driven by this success, cascade learning has been an area of active research in recent years. Nevertheless, there are still challenging technical problems during the training process of cascade detectors. In particular, determining the optimal target detection rate for each stage of the cascade remains an unsolved issue. In this paper, we propose the multiple instance pruning (MIP) algorithm for soft cascades. This algorithm computes a set of thresholds which aggressively terminate computation with no reduction in detection rate or increase in false positive rate on the training dataset. The algorithm is based on two key insights: i) examples that are destined to be rejected by the complete classifier can be safely pruned early; ii) face detection is a multiple instance learning problem. The MIP process is fully automatic and requires no assumptions of probability distributions, statistical independence, or ad hoc intermediate rejection targets. Experimental results on the MIT+CMU dataset demonstrate significant performance advantages. 1

same-paper 3 0.75411594 66 nips-2007-Density Estimation under Independent Similarly Distributed Sampling Assumptions

Author: Tony Jebara, Yingbo Song, Kapil Thadani

Abstract: A method is proposed for semiparametric estimation where parametric and nonparametric criteria are exploited in density estimation and unsupervised learning. This is accomplished by making sampling assumptions on a dataset that smoothly interpolate between the extreme of independently distributed (or id) sample data (as in nonparametric kernel density estimators) to the extreme of independent identically distributed (or iid) sample data. This article makes independent similarly distributed (or isd) sampling assumptions and interpolates between these two using a scalar parameter. The parameter controls a Bhattacharyya affinity penalty between pairs of distributions on samples. Surprisingly, the isd method maintains certain consistency and unimodality properties akin to maximum likelihood estimation. The proposed isd scheme is an alternative for handling nonstationarity in data without making drastic hidden variable assumptions which often make estimation difficult and laden with local optima. Experiments in density estimation on a variety of datasets confirm the value of isd over iid estimation, id estimation and mixture modeling.

4 0.5721634 156 nips-2007-Predictive Matrix-Variate t Models

Author: Shenghuo Zhu, Kai Yu, Yihong Gong

Abstract: It is becoming increasingly important to learn from a partially-observed random matrix and predict its missing elements. We assume that the entire matrix is a single sample drawn from a matrix-variate t distribution and suggest a matrixvariate t model (MVTM) to predict those missing elements. We show that MVTM generalizes a range of known probabilistic models, and automatically performs model selection to encourage sparse predictive models. Due to the non-conjugacy of its prior, it is difficult to make predictions by computing the mode or mean of the posterior distribution. We suggest an optimization method that sequentially minimizes a convex upper-bound of the log-likelihood, which is very efficient and scalable. The experiments on a toy data and EachMovie dataset show a good predictive accuracy of the model. 1

5 0.57156342 63 nips-2007-Convex Relaxations of Latent Variable Training

Author: Yuhong Guo, Dale Schuurmans

Abstract: We investigate a new, convex relaxation of an expectation-maximization (EM) variant that approximates a standard objective while eliminating local minima. First, a cautionary result is presented, showing that any convex relaxation of EM over hidden variables must give trivial results if any dependence on the missing values is retained. Although this appears to be a strong negative outcome, we then demonstrate how the problem can be bypassed by using equivalence relations instead of value assignments over hidden variables. In particular, we develop new algorithms for estimating exponential conditional models that only require equivalence relation information over the variable values. This reformulation leads to an exact expression for EM variants in a wide range of problems. We then develop a semidefinite relaxation that yields global training by eliminating local minima. 1

6 0.57024342 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images

7 0.56963938 187 nips-2007-Structured Learning with Approximate Inference

8 0.56701618 182 nips-2007-Sparse deep belief net model for visual area V2

9 0.56682682 49 nips-2007-Colored Maximum Variance Unfolding

10 0.56486624 7 nips-2007-A Kernel Statistical Test of Independence

11 0.56414825 128 nips-2007-Message Passing for Max-weight Independent Set

12 0.5635097 180 nips-2007-Sparse Feature Learning for Deep Belief Networks

13 0.56339109 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models

14 0.56303537 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning

15 0.56070071 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)

16 0.55893385 185 nips-2007-Stable Dual Dynamic Programming

17 0.55815971 141 nips-2007-New Outer Bounds on the Marginal Polytope

18 0.55744535 82 nips-2007-Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization

19 0.55633432 186 nips-2007-Statistical Analysis of Semi-Supervised Regression

20 0.5561378 175 nips-2007-Semi-Supervised Multitask Learning