nips nips2006 nips2006-121 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Martin Raphan, Eero P. Simoncelli
Abstract: unkown-abstract
Reference: text
sentIndex sentText sentNum sentScore
1 Bayesian estimators are defined in terms of the posterior distribution. [sent-8, score-0.225]
2 Typically, this is written as the product of the likelihood function and a prior probability density, both of which are assumed to be known. [sent-9, score-0.089]
3 But in many situations, the prior density is not known, and is difficult to learn from data since one does not have access to uncorrupted samples of the variable being estimated. [sent-10, score-0.229]
4 We show that for a wide variety of observation models, the Bayes least squares (BLS) estimator may be formulated without explicit reference to the prior. [sent-11, score-0.425]
5 Specifically, we derive a direct expression for the estimator, and a related expression for the mean squared estimation error, both in terms of the density of the observed measurements. [sent-12, score-0.274]
6 Each of these prior-free formulations allows us to approximate the estimator given a sufficient amount of observed data. [sent-13, score-0.315]
7 We use the first form to develop practical nonparametric approximations of BLS estimators for several different observation processes, and the second form to develop a parametric family of estimators for use in the additive Gaussian noise case. [sent-14, score-0.805]
8 We examine the empirical performance of these estimators as a function of the amount of observed data. [sent-15, score-0.338]
9 Those that minimize the mean squared error (known as Bayes least squares, or BLS) are particularly widespread. [sent-17, score-0.079]
10 These estimators are usually derived assuming explicit knowledge of the observation process (expressed as the conditional density of the observation given the quantity to be estimated), and the prior density over that quantity. [sent-18, score-0.65]
11 Despite its appeal, this approach is often criticized for the reliance on knowledge of the prior distribution, since the true prior is usually not known, and in many cases one does not have data drawn from this distribution with which to approximate it. [sent-19, score-0.165]
12 In general, learning the prior distribution from the observed data presents a difficult, if not impossible task, even when the observation process is known. [sent-21, score-0.144]
13 In the commonly used “empirical Bayesian” approach [1], one assumes a parametric family of densities, whose parameters are obtained by fitting the data. [sent-22, score-0.112]
14 This prior is then used to derive an estimator that may be applied to the data. [sent-23, score-0.329]
15 If the true prior is not a member of the assumed parametric family, however, such estimators can perform quite poorly. [sent-24, score-0.351]
16 An estimator may also be obtained in a supervised setting, in which one is provided with many pairs containing a corrupted observation along with the true value of the quantity to be estimated. [sent-25, score-0.421]
17 In this case, selecting an estimator is a classic regression problem: find a function that best maps the observations to the correct values, in a least squares sense. [sent-26, score-0.339]
18 Given a large enough number of training samples, this function will approach the BLS estimate, and should perform well on new samples drawn from the same distribution as the training samples. [sent-27, score-0.074]
19 In this paper, we examine the BLS estimation problem in a setting that lies between the two cases described above. [sent-29, score-0.11]
20 Specifically, we assume the observation process (but not the prior) is known, and we assume unsupervised training data, consisting only of corrupted observations (without the correct values). [sent-30, score-0.156]
21 We show that for many observation processes, the BLS estimator may be written directly in terms of the observation density. [sent-31, score-0.442]
22 We also show a dual formulation, in which the BLS estimator may be obtained by minimizing an expression for the mean squared error that is written only in terms of the observation density. [sent-32, score-0.492]
23 A few special cases of the first formulation appear in the empirical Bayes literature [2], and of the second formulation in another branch of the statistical literature concerned with improvement of estimators [3, 4, 5]. [sent-33, score-0.449]
24 We develop practical nonparametric approximations of estimators for several different observation processes, demonstrating empirically that they converge to the BLS estimator as the amount of observed data increases. [sent-35, score-0.656]
25 We also develop a parametric family of estimators for use in the additive Gaussian case, and examine their empirical convergence properties. [sent-36, score-0.512]
26 We expect such BLS estimators, constructed from corrupted observations without explicit knowledge of, assumptions about, or samples from the prior, to prove useful in a variety of real-world estimation problems faced by machine or biological systems that must learn from examples. [sent-37, score-0.214]
27 2 Bayes least squares estimation Suppose we make an observation, Y , that depends on a hidden variable X, where X and Y may be scalars or vectors. [sent-38, score-0.083]
28 If the prior distribution on X is PX, and the likelihood function is PY|X, then the BLS estimate can be written using Bayes' rule as E{X|Y = y} = ∫ x PX|Y(x|y) dx = ∫ x PY|X(y|x) PX(x) dx / PY(y), (1) where the denominator is the distribution of the observed data: PY(y) = ∫ PY|X(y|x) PX(x) dx. [sent-40, score-0.182]
29 Given independent, identically distributed (i.i.d.) samples (Xn, Yn) drawn from the joint distribution of (X, Y), we can solve for the estimator f(y) = E{X|Y = y} nonparametrically, or we could choose a parametric family of estimators {fθ} and choose θ to minimize the empirical squared error: θ̂ = arg min_θ (1/N) Σ_{n=1}^{N} |fθ(Yn) − Xn|². [sent-45, score-0.806]
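To make the supervised baseline concrete, a minimal sketch follows (the polynomial basis, the function name, and the choice of eight basis functions are illustrative assumptions, not from the paper): given pairs (Xn, Yn), a linear-in-parameters estimator is fit by ordinary least squares on the empirical squared error.

```python
import numpy as np

def supervised_fit(x_train, y_train, n_basis=8):
    """Fit f_theta(y) = sum_k theta_k * y**k to (X_n, Y_n) pairs by least squares.

    A hedged sketch of the supervised regression baseline; the polynomial basis is
    an illustrative choice, not the paper's."""
    design = np.vander(y_train, n_basis, increasing=True)      # columns: 1, y, y^2, ...
    theta, *_ = np.linalg.lstsq(design, x_train, rcond=None)
    return lambda y: np.vander(np.atleast_1d(y), n_basis, increasing=True) @ theta

# With enough training pairs, the fitted function approaches E{X|Y=y} within the
# span of the chosen basis, and should generalize to new samples from the same PY.
```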
30 However, in many situations, one does not have access to PX, or to samples drawn from PX. [sent-46, score-0.115]
31 2.1 Prior-free reformulation of the BLS estimator In many cases, the BLS estimate may be written without explicit reference to the prior distribution. [sent-48, score-0.442]
32 In Eq. (1), the prior appears only in the numerator N(y) = ∫ PY|X(y|x) x PX(x) dx. [sent-50, score-0.088]
33 This equation may be viewed as a composition of linear transformations of the function PX: N(y) = (A ◦ X){PX}(y), where X{f}(x) = x f(x), and the operator A computes an inner product with the likelihood function: A{f}(y) = ∫ PY|X(y|x) f(x) dx. [sent-51, score-0.092]
34 If the linear transformation A is 1-1, and we restrict PY to lie in the range of A, then we can write the numerator as a linear transformation on PY alone, without explicit reference to PX: N(y) = (A ◦ X ◦ A⁻¹){PY}(y) = L{PY}(y). (3) [sent-54, score-0.116]
35 This allows us to write the BLS estimator as E{X|Y = y} = L{PY}(y) / PY(y). (4) [sent-56, score-0.275]
36 In the definition of the operator L, A⁻¹ effectively inverts the observation process, recovering PX from PY. [sent-60, score-0.142]
37 For example, in the case of additive Gaussian noise, A−1 is a deconvolution operation which is inherently unstable for high frequencies. [sent-62, score-0.126]
38 The advantage of Eq. (4) comes from the fact that in many cases, the composite operation L may be written explicitly, even when the inverse operation is poorly defined or unstable. [sent-64, score-0.089]
39 In section 3, we develop examples of operators L for a variety of observation processes. [sent-65, score-0.125]
40 2.2 Prior-free reformulation of the mean squared error In some cases, developing a stable nonparametric approximation of the ratio in Eq. (4) can be difficult. [sent-67, score-0.114]
41 However, the linear operator formulation of the BLS estimator also leads to a dual expression for the mean squared error that does not depend explicitly on the prior, and this may be used to select an optimal estimator from a parametric family of estimators. [sent-69, score-0.903]
42 Specifically, for any estimator fθ(Y) parameterized by θ, the mean squared error may be decomposed into two orthogonal terms: E|fθ(Y) − X|² = E|fθ(Y) − E(X|Y)|² + E|E(X|Y) − X|². [sent-70, score-0.337]
43 This yields Eq. (5), where the expectation on the right is over the observation variable, Y. [sent-77, score-0.087]
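A brief sketch of why this decomposition licenses prior-free selection of θ (the cross term vanishes by iterated expectation, since fθ(Y) − E(X|Y) is a function of Y alone):

```latex
\[
\begin{aligned}
E|f_\theta(Y)-X|^2
  &= E|f_\theta(Y)-E(X|Y)|^2 + E|E(X|Y)-X|^2
     + 2\,E\big[(f_\theta(Y)-E(X|Y))^\top (E(X|Y)-X)\big],\\
E\big[(f_\theta(Y)-E(X|Y))^\top (E(X|Y)-X)\big]
  &= E\big[(f_\theta(Y)-E(X|Y))^\top\, E\{E(X|Y)-X \mid Y\}\big] \;=\; 0 .
\end{aligned}
\]
```

Since the second term does not depend on θ, minimizing the full MSE over θ is equivalent to minimizing E|fθ(Y) − E(X|Y)|², and substituting the prior-free form of E(X|Y) from Eq. (4) leaves an expression that involves only the observation density.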
44 Again this does not require any knowledge of, or samples drawn from, the prior PX . [sent-80, score-0.128]
45 3 Example estimators In general, it can be difficult to obtain the operator L directly from the definition in Eq. [sent-81, score-0.301]
46 (3), because inversion of the operator A could be unstable or undefined. [sent-82, score-0.101]
47 This is an eigenfunction equation: for each value of x, the conditional density PY|X(y|x) must be an eigenfunction (eigenvector, for discrete variables) of the operator L, with associated eigenvalue x. [sent-84, score-0.192]
48 Consider a standard example, in which the variable of interest is corrupted by independent additive noise: Y = X + W . [sent-85, score-0.136]
49 The conditional density is PY |X (y|x) = PW (y − x). [sent-86, score-0.108]
50 We wish to find an operator which, when applied to this conditional density (viewed as a function of y), will give L{PW(y − x)} = x PW(y − x) (7) for all x. [sent-87, score-0.184]
51 We therefore write L{f}(y) = y f(y) + M{f}(y), (8) where M{f}(y) = L{f}(y) − y f(y) is a linear shift-invariant operator (acting in y). [sent-89, score-0.076]
52 This gives us the linear operator L{f}(y) = y f(y) − F⁻¹{ i ∇ω ln P̂W(ω) f̂(ω) }(y), (11) where F⁻¹ denotes the inverse Fourier transform and hats denote Fourier transforms. [sent-91, score-0.105]
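A minimal numerical sketch of applying Eq. (11) to data: estimate PY with a histogram on a uniform grid, apply the shift-invariant part of L in the Fourier domain, and divide by PY as in Eq. (4). The function names, the padding, and the bin count are assumptions of this sketch, not the paper's implementation.

```python
import numpy as np

def prior_free_bls_additive(y_samples, noise_char_log_grad, n_bins=512):
    """Prior-free BLS estimator for Y = X + W on a grid, following Eqs. (4) and (11).

    noise_char_log_grad(omega) must return the derivative of ln of the Fourier
    transform of P_W at omega; for Gaussian noise of standard deviation sigma
    this is lambda w: -sigma**2 * w."""
    # 1. Histogram estimate of the observation density P_Y on a uniform grid
    #    (padding reduces circular wrap-around in the FFT-based filtering below).
    lo, hi = y_samples.min(), y_samples.max()
    pad = 0.25 * (hi - lo)
    edges = np.linspace(lo - pad, hi + pad, n_bins + 1)
    p_y, _ = np.histogram(y_samples, bins=edges, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    dy = centers[1] - centers[0]

    # 2. Shift-invariant part of L, applied in the Fourier domain:
    #    M{f}(y) = -F^{-1}[ i (d/dw ln P_W_hat(w)) f_hat(w) ](y).
    omega = 2.0 * np.pi * np.fft.fftfreq(n_bins, d=dy)
    multiplier = -1j * noise_char_log_grad(omega)
    m_p = np.real(np.fft.ifft(multiplier * np.fft.fft(p_y)))

    # 3. Estimator f(y) = y + M{P_Y}(y) / P_Y(y), evaluated on the grid centers.
    f_grid = centers + m_p / np.maximum(p_y, 1e-12)
    return centers, f_grid

# Gaussian-noise example: the construction reduces to y + sigma^2 d/dy ln P_Y(y).
# centers, f = prior_free_bls_additive(y_obs, lambda w: -sigma**2 * w)
# x_hat = np.interp(y_obs, centers, f)
```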
53 Thus, in cases with known prior density and unknown additive noise density, one can formulate the estimator entirely in terms of the prior. [sent-93, score-0.545]
54 Our prior-free estimator methodology is quite general, and can often be applied to more complicated observation processes. [sent-94, score-0.341]
55 Table 1: Prior-free estimation formulas. [sent-124, score-0.106]
56 Bracketed numbers are references for operators L, with * denoting references for the parametric (dual) operator, L∗ . [sent-127, score-0.103]
57 4 Simulations 4.1 Non-parametric examples Since each of the prior-free estimators discussed above relies on approximating values from the observed data, the behavior of such estimators should approach that of the BLS estimator as the number of data samples grows. [sent-129, score-0.835]
58 In Fig. 1, we examine the behavior of three non-parametric prior-free estimators based on Eq. [sent-131, score-0.293]
59 The estimator does not know the binary distribution of the source (which was a “fair coin” for our simulation), but does know the bit-switching probability. [sent-134, score-0.313]
60 For this estimator we approximate PY using a simple histogram, and then use the matrix version of the linear operator in Eq. [sent-135, score-0.351]
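For the binary case, the operator L of Eq. (3) is just a 2 × 2 matrix and the whole construction fits in a few lines. A hedged sketch (the function name is an assumption; it also assumes the bit-switching probability eps ≠ 0.5 so that the likelihood matrix is invertible):

```python
import numpy as np

def binary_prior_free_bls(y_samples, eps):
    # Binary symmetric channel: likelihood matrix A[y, x] = P(Y=y | X=x),
    # with bit-switching probability eps (assumed known) and X, Y in {0, 1}.
    A = np.array([[1.0 - eps, eps],
                  [eps, 1.0 - eps]])
    x_vals = np.array([0.0, 1.0])
    # Matrix version of the operator L = A X A^{-1}, with X = diag(x values) (Eq. 3).
    L = A @ np.diag(x_vals) @ np.linalg.inv(A)
    # Histogram (empirical frequency) estimate of P_Y from corrupted observations alone.
    p_y = np.array([np.mean(y_samples == 0), np.mean(y_samples == 1)])
    # Prior-free BLS estimate, f(y) = (L P_Y)(y) / P_Y(y)  (Eq. 4).
    f = (L @ p_y) / p_y
    return f  # f[0] approximates E{X | Y=0}, f[1] approximates E{X | Y=1}
```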
61 Signal density is a generalized Gaussian with exponent 0. [sent-139, score-0.119]
63 Fig. 1: Empirical convergence of the prior-free estimator to the optimal BLS solution, as a function of the number of observed samples of Y. [sent-151, score-0.368]
64 For each number of observations, each estimator is simulated many times. [sent-152, score-0.275]
65 White line shows the mean improvement using the conventional BLS solution, E{X|Y = y}, assuming the prior density is known. [sent-154, score-0.239]
66 (a) Binary noise (10,000 simulations for each number of observations); (b) additive Gaussian noise (1,000 simulations); (c) Poisson noise (1,000 simulations). [sent-156, score-0.213]
67 We fit a local exponential model similar to that used in [9] to the data in bins, with binwidth adaptively selected so that the product of the number of points in the bin and the squared binwidth is constant. [sent-158, score-0.154]
68 This binwidth selection procedure, analogous to adaptive binning procedures developed in the density estimation literature [10], provides a reasonable tradeoff between bias and variance, and converges to the correct answer for any well-behaved density [8]. [sent-159, score-0.307]
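One way to realize the stated binwidth rule is a greedy sweep over the sorted data, growing each bin until (points in bin) × (binwidth)² reaches a constant; this is only an illustrative sketch of the criterion, not the exact procedure of [8]:

```python
import numpy as np

# Hedged sketch of the adaptive binning rule: grow each bin until
# (number of points in the bin) * (binwidth)^2 reaches a constant C, so dense regions
# get narrow bins (low bias) and sparse regions get wide bins (low variance).
def adaptive_bin_edges(samples, C):
    y = np.sort(samples)
    edges = [y[0]]
    i = 0
    while i < len(y) - 1:
        j = i + 1
        # extend the bin until count * width^2 >= C (or the data are exhausted)
        while j < len(y) - 1 and (j - i) * (y[j] - edges[-1]) ** 2 < C:
            j += 1
        edges.append(y[j])
        i = j
    return np.array(edges)
```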
69 Note that in this case, convergence is substantially slower than for the binary case, as might be expected given that we are dealing with a continuous density rather than a single scalar probability. [sent-160, score-0.085]
70 It should be noted that improved performance for this estimator would be expected if we used a more sophisticated approximation of the ratio of densities. [sent-166, score-0.275]
71 4.2 Parametric examples In this section, we discuss the empirical behavior of the parametric approach applied to the additive Gaussian case. [sent-168, score-0.225]
72 In this particular case, it is easier to represent the estimator as f(y) = y + g(y). [sent-170, score-0.334]
73 Therefore, if we have a parametric family {gθ} of such g, and are given data {Yn}, we can try to minimize (1/N) Σ_{n=1}^{N} { |gθ(Yn)|² + 2σ² g′θ(Yn) }. (12) [sent-173, score-0.129]
74 This expression, known as Stein’s unbiased risk estimator (SURE) [4], favors estimators gθ that have small amplitude and highly negative derivatives at the data values. [sent-174, score-0.5]
75 This is intuitively sensible: the resulting estimators will “shrink” the data toward regions of high probability. [sent-175, score-0.225]
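For concreteness, the empirical SURE objective of Eq. (12) can be written as a one-line function (a sketch assuming f(y) = y + g(y), additive Gaussian noise of known standard deviation sigma, and callables for g and its derivative):

```python
import numpy as np

def empirical_sure(y, g, g_prime, sigma):
    # Empirical version of Eq. (12): small |g| and negative derivatives are rewarded.
    # Minimizing this over the parameters of g approximates minimizing the true MSE,
    # without access to X, its prior, or clean samples.
    return np.mean(np.abs(g(y)) ** 2 + 2.0 * sigma ** 2 * g_prime(y))
```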
76 Eq. (12) was used as a criterion for density estimation in cases where the normalizing constant, or partition function, is difficult to obtain [11]. [sent-177, score-0.164]
77 Fig. 2: Example bump functions, used for linear parameterization of estimators in Figs. [sent-179, score-0.266]
80 Fig. 3: Empirical convergence (SNR improvement as a function of the number of data observations) of the parametric prior-free method to the optimal BLS solution, for three different parameterized estimators (panels (a)–(c)). [sent-191, score-0.092]
81 All cases use a generalized Gaussian prior (exponent 0. [sent-193, score-0.086]
82 The approach we are discussing provides an interpretation for this procedure: the optimal density is the one which, when converted into an estimator using the formula in Table 1 for the additive Gaussian case, gives the best MSE. [sent-195, score-0.454]
83 As an example, we parametrize g as a linear combination of nonlinear “bump” functions, gθ(y) = Σ_k θk gk(y), (13) where the functions gk are bumps in the warped coordinate (1/α) sgn(y) log₂(|y|/σ + 1) − kπ/2, as illustrated in Fig. 2. [sent-197, score-0.078]
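Because gθ is linear in θ, the SURE objective of Eq. (12) is quadratic in θ and can be minimized in closed form. The sketch below uses Gaussian bumps in the warped coordinate as hypothetical stand-ins for the exact bump shapes of Eq. (13); all names, spacings, and widths are assumptions.

```python
import numpy as np

def warped_coord(y, sigma, alpha=1.0):
    # Warped coordinate appearing in the bump argument of Eq. (13).
    return np.sign(y) * np.log2(np.abs(y) / sigma + 1.0) / alpha

def bump_design(y, sigma, K, width=1.0, h=1e-4):
    """For a 1-D array y, return G[n, k] = g_k(y[n]) and a finite-difference
    estimate of g_k'(y[n]).  Gaussian bumps spaced by pi/2 are a hypothetical choice."""
    centers = np.arange(K) * 0.5 * np.pi
    def bumps(yy):
        u = warped_coord(yy, sigma)[:, None]
        return np.exp(-0.5 * ((u - centers[None, :]) / width) ** 2)
    return bumps(y), (bumps(y + h) - bumps(y - h)) / (2.0 * h)

def sure_bump_fit(y, sigma, K=3):
    G, D = bump_design(y, sigma, K)
    A = G.T @ G / len(y)                          # quadratic term of the SURE objective
    b = D.mean(axis=0)                            # linear term: (1/N) sum_n g_k'(Y_n)
    theta = np.linalg.solve(A, -sigma ** 2 * b)   # minimizer of theta^T A theta + 2 sigma^2 b^T theta
    estimator = lambda yy: yy + bump_design(yy, sigma, K)[0] @ theta
    return theta, estimator
```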
84 Figure 3 shows the empirical behavior of these “SURE-bump” estimators when using three bumps ( Fig. [sent-206, score-0.373]
85 The three-bump estimator behaves fairly well, though its asymptotic behavior for large amounts of data is biased and thus falls short of ideal. [sent-209, score-0.106]
86 The fifteen-bump estimator asymptotes correctly but has very large variance for small amounts of data (overfitting). [sent-210, score-0.085]
87 5 Discussion We have reformulated the Bayes least squares estimation problem for a setting in which one knows the observation process, and has access to many observations. [sent-215, score-0.19]
88 We do not assume the prior density is known, nor do we assume access to samples from the prior. [sent-216, score-0.229]
89 In many cases, the prior-free estimator can be written explicitly, and we have shown a number of examples to illustrate the diversity of estimators that can arise under different observation processes. [sent-218, score-0.623]
90 For three simple cases, we developed implementations and demonstrated that these converge to optimal BLS estimators as the amount of data grows. [sent-219, score-0.282]
91 We also have derived a prior-free formulation of the MSE, which allows selection of an estimator from a parametric family. [sent-220, score-0.376]
92 We have shown simulations for a linear family of estimators in the additive Gaussian case. [sent-221, score-0.403]
93 These simulations serve to demonstrate the potential of this approach, which holds particular appeal for real-world systems (machine or biological) that must learn the priors from environmental observations. [sent-222, score-0.087]
94 It is of particular interest to develop incremental implementations, which would update the estimator based on incoming observations. [sent-224, score-0.303]
95 This would further enhance the applicability of this approach for systems that must learn to do optimal estimation from corrupted observations. [sent-225, score-0.129]
96 Berger, “Improving on inadmissible estimators in continuous exponential families with applications to simultaneous estimation of gamma scale parameters,” The Annals of Statistics, vol. [sent-240, score-0.307]
97 Hwang, “Improving upon standard estimators in discrete exponential families with applications to Poisson and negative binomial cases,” The Annals of Statistics, vol. [sent-250, score-0.324]
98 Miyasawa, “An empirical Bayes estimator of the mean of a normal population,” Bull. [sent-254, score-0.404]
99 Simoncelli, “Empirical Bayes least squares estimation without an explicit prior. [sent-269, score-0.111]
100 Loader, “Local likelihood density estimation,” Annals of Statistics, vol. [sent-275, score-0.085]
wordName wordTfidf (topN-words)
[('py', 0.703), ('bls', 0.324), ('estimator', 0.275), ('pw', 0.259), ('estimators', 0.225), ('px', 0.164), ('ypy', 0.112), ('density', 0.085), ('operator', 0.076), ('additive', 0.074), ('parametric', 0.072), ('bumps', 0.069), ('bayes', 0.067), ('observation', 0.066), ('simulations', 0.064), ('corrupted', 0.062), ('snr', 0.06), ('dy', 0.059), ('poisson', 0.056), ('binwidth', 0.056), ('raphan', 0.056), ('improvement', 0.056), ('yn', 0.056), ('prior', 0.054), ('samples', 0.049), ('sgn', 0.047), ('mse', 0.047), ('estimation', 0.047), ('empirical', 0.042), ('squared', 0.042), ('courant', 0.041), ('bump', 0.041), ('access', 0.041), ('family', 0.04), ('gk', 0.039), ('staistics', 0.037), ('behavior', 0.037), ('squares', 0.036), ('written', 0.035), ('numerator', 0.034), ('exponent', 0.034), ('wavelet', 0.033), ('stein', 0.032), ('cases', 0.032), ('sure', 0.032), ('operators', 0.031), ('examine', 0.031), ('fourier', 0.03), ('thresholding', 0.03), ('reformulation', 0.03), ('eigenfunction', 0.03), ('ln', 0.029), ('formulation', 0.029), ('explicit', 0.028), ('observations', 0.028), ('proc', 0.028), ('develop', 0.028), ('expression', 0.028), ('operation', 0.027), ('annals', 0.027), ('dual', 0.026), ('drawn', 0.025), ('noise', 0.025), ('unstable', 0.025), ('conventional', 0.024), ('observed', 0.024), ('discrete', 0.024), ('simoncelli', 0.024), ('situations', 0.024), ('xn', 0.023), ('dx', 0.023), ('conditional', 0.023), ('parameterizations', 0.023), ('appeal', 0.023), ('wk', 0.022), ('diversity', 0.022), ('nonparametric', 0.022), ('implementations', 0.021), ('expectation', 0.021), ('reference', 0.02), ('gaussian', 0.02), ('mean', 0.02), ('optimal', 0.02), ('arg', 0.019), ('families', 0.019), ('ck', 0.019), ('know', 0.019), ('quantity', 0.018), ('literature', 0.018), ('minimize', 0.017), ('transformation', 0.017), ('bayesian', 0.017), ('amount', 0.016), ('ml', 0.016), ('tradeoff', 0.016), ('asymptotes', 0.016), ('xf', 0.016), ('inadmissible', 0.016), ('cauchy', 0.016), ('atlanta', 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 121 nips-2006-Learning to be Bayesian without Supervision
Author: Martin Raphan, Eero P. Simoncelli
Abstract: unkown-abstract
2 0.1899088 200 nips-2006-Unsupervised Regression with Applications to Nonlinear System Identification
Author: Ali Rahimi, Ben Recht
Abstract: We derive a cost functional for estimating the relationship between highdimensional observations and the low-dimensional process that generated them with no input-output examples. Limiting our search to invertible observation functions confers numerous benefits, including a compact representation and no suboptimal local minima. Our approximation algorithms for optimizing this cost functional are fast and give diagnostic bounds on the quality of their solution. Our method can be viewed as a manifold learning algorithm that utilizes a prior on the low-dimensional manifold coordinates. The benefits of taking advantage of such priors in manifold learning and searching for the inverse observation functions in system identification are demonstrated empirically by learning to track moving targets from raw measurements in a sensor network setting and in an RFID tracking experiment. 1
3 0.12498555 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy
Author: Samuel S. Gross, Olga Russakovsky, Chuong B. Do, Serafim Batzoglou
Abstract: We consider the problem of training a conditional random field (CRF) to maximize per-label predictive accuracy on a training set, an approach motivated by the principle of empirical risk minimization. We give a gradient-based procedure for minimizing an arbitrarily accurate approximation of the empirical risk under a Hamming loss function. In experiments with both simulated and real data, our optimization procedure gives significantly better testing performance than several current approaches for CRF training, especially in situations of high label noise. 1
4 0.092033297 9 nips-2006-A Nonparametric Bayesian Method for Inferring Features From Similarity Judgments
Author: Daniel J. Navarro, Thomas L. Griffiths
Abstract: The additive clustering model is widely used to infer the features of a set of stimuli from their similarities, on the assumption that similarity is a weighted linear function of common features. This paper develops a fully Bayesian formulation of the additive clustering model, using methods from nonparametric Bayesian statistics to allow the number of features to vary. We use this to explore several approaches to parameter estimation, showing that the nonparametric Bayesian approach provides a straightforward way to obtain estimates of both the number of features used in producing similarity judgments and their importance. 1
5 0.076471746 20 nips-2006-Active learning for misspecified generalized linear models
Author: Francis R. Bach
Abstract: Active learning refers to algorithmic frameworks aimed at selecting training data points in order to reduce the number of required training data points and/or improve the generalization performance of a learning method. In this paper, we present an asymptotic analysis of active learning for generalized linear models. Our analysis holds under the common practical situation of model misspecification, and is based on realistic assumptions regarding the nature of the sampling distributions, which are usually neither independent nor identical. We derive unbiased estimators of generalization performance, as well as estimators of expected reduction in generalization error after adding a new training data point, that allow us to optimize its sampling distribution through a convex optimization problem. Our analysis naturally leads to an algorithm for sequential active learning which is applicable for all tasks supported by generalized linear models (e.g., binary classification, multi-class classification, regression) and can be applied in non-linear settings through the use of Mercer kernels. 1
6 0.072107166 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds
7 0.066070899 175 nips-2006-Simplifying Mixture Models through Function Approximation
8 0.063426197 64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors
9 0.055678867 128 nips-2006-Manifold Denoising
10 0.053068958 17 nips-2006-A recipe for optimizing a time-histogram
11 0.04601486 57 nips-2006-Conditional mean field
12 0.044705454 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation
13 0.043488644 65 nips-2006-Denoising and Dimension Reduction in Feature Space
14 0.040099006 68 nips-2006-Dirichlet-Enhanced Spam Filtering based on Biased Samples
15 0.039889865 116 nips-2006-Learning from Multiple Sources
16 0.037575204 114 nips-2006-Learning Time-Intensity Profiles of Human Activity using Non-Parametric Bayesian Models
17 0.036872644 60 nips-2006-Convergence of Laplacian Eigenmaps
18 0.035488192 56 nips-2006-Conditional Random Sampling: A Sketch-based Sampling Technique for Sparse Data
19 0.034132503 131 nips-2006-Mixture Regression for Covariate Shift
20 0.033818867 182 nips-2006-Statistical Modeling of Images with Fields of Gaussian Scale Mixtures
topicId topicWeight
[(0, -0.132), (1, 0.029), (2, 0.008), (3, 0.008), (4, -0.001), (5, 0.035), (6, 0.089), (7, -0.032), (8, -0.022), (9, 0.069), (10, 0.002), (11, 0.001), (12, 0.078), (13, 0.082), (14, -0.022), (15, 0.022), (16, 0.06), (17, -0.145), (18, -0.05), (19, 0.001), (20, 0.045), (21, -0.069), (22, 0.038), (23, -0.184), (24, -0.11), (25, 0.026), (26, 0.05), (27, -0.013), (28, -0.104), (29, 0.131), (30, -0.017), (31, 0.013), (32, -0.037), (33, -0.017), (34, -0.104), (35, -0.144), (36, -0.025), (37, 0.034), (38, -0.013), (39, -0.006), (40, 0.133), (41, -0.04), (42, 0.039), (43, 0.035), (44, 0.118), (45, 0.134), (46, -0.018), (47, 0.079), (48, -0.082), (49, -0.084)]
simIndex simValue paperId paperTitle
same-paper 1 0.94263417 121 nips-2006-Learning to be Bayesian without Supervision
Author: Martin Raphan, Eero P. Simoncelli
Abstract: unkown-abstract
2 0.5955559 200 nips-2006-Unsupervised Regression with Applications to Nonlinear System Identification
Author: Ali Rahimi, Ben Recht
Abstract: We derive a cost functional for estimating the relationship between highdimensional observations and the low-dimensional process that generated them with no input-output examples. Limiting our search to invertible observation functions confers numerous benefits, including a compact representation and no suboptimal local minima. Our approximation algorithms for optimizing this cost functional are fast and give diagnostic bounds on the quality of their solution. Our method can be viewed as a manifold learning algorithm that utilizes a prior on the low-dimensional manifold coordinates. The benefits of taking advantage of such priors in manifold learning and searching for the inverse observation functions in system identification are demonstrated empirically by learning to track moving targets from raw measurements in a sensor network setting and in an RFID tracking experiment. 1
3 0.40619349 128 nips-2006-Manifold Denoising
Author: Matthias Hein, Markus Maier
Abstract: We consider the problem of denoising a noisily sampled submanifold M in Rd , where the submanifold M is a priori unknown and we are only given a noisy point sample. The presented denoising algorithm is based on a graph-based diffusion process of the point sample. We analyze this diffusion process using recent results about the convergence of graph Laplacians. In the experiments we show that our method is capable of dealing with non-trivial high-dimensional noise. Moreover using the denoising algorithm as pre-processing method we can improve the results of a semi-supervised learning algorithm. 1
4 0.39059529 182 nips-2006-Statistical Modeling of Images with Fields of Gaussian Scale Mixtures
Author: Siwei Lyu, Eero P. Simoncelli
Abstract: The local statistical properties of photographic images, when represented in a multi-scale basis, have been described using Gaussian scale mixtures (GSMs). Here, we use this local description to construct a global field of Gaussian scale mixtures (FoGSM). Specifically, we model subbands of wavelet coefficients as a product of an exponentiated homogeneous Gaussian Markov random field (hGMRF) and a second independent hGMRF. We show that parameter estimation for FoGSM is feasible, and that samples drawn from an estimated FoGSM model have marginal and joint statistics similar to wavelet coefficients of photographic images. We develop an algorithm for image denoising based on the FoGSM model, and demonstrate substantial improvements over current state-ofthe-art denoising method based on the local GSM model. Many successful methods in image processing and computer vision rely on statistical models for images, and it is thus of continuing interest to develop improved models, both in terms of their ability to precisely capture image structures, and in terms of their tractability when used in applications. Constructing such a model is difficult, primarily because of the intrinsic high dimensionality of the space of images. Two simplifying assumptions are usually made to reduce model complexity. The first is Markovianity: the density of a pixel conditioned on a small neighborhood, is assumed to be independent from the rest of the image. The second assumption is homogeneity: the local density is assumed to be independent of its absolute position within the image. The set of models satisfying both of these assumptions constitute the class of homogeneous Markov random fields (hMRFs). Over the past two decades, studies of photographic images represented with multi-scale multiorientation image decompositions (loosely referred to as “wavelets”) have revealed striking nonGaussian regularities and inter and intra-subband dependencies. For instance, wavelet coefficients generally have highly kurtotic marginal distributions [1, 2], and their amplitudes exhibit strong correlations with the amplitudes of nearby coefficients [3, 4]. One model that can capture the nonGaussian marginal behaviors is a product of non-Gaussian scalar variables [5]. A number of authors have developed non-Gaussian MRF models based on this sort of local description [6, 7, 8], among which the recently developed fields of experts model [7] has demonstrated impressive performance in denoising (albeit at an extremely high computational cost in learning model parameters). An alternative model that can capture non-Gaussian local structure is a scale mixture model [9, 10, 11]. An important special case is Gaussian scale mixtures (GSM), which consists of a Gaussian random vector whose amplitude is modulated by a hidden scaling variable. The GSM model provides a particularly good description of local image statistics, and the Gaussian substructure of the model leads to efficient algorithms for parameter estimation and inference. Local GSM-based methods represent the current state-of-the-art in image denoising [12]. The power of GSM models should be substantially improved when extended to describe more than a small neighborhood of wavelet coefficients. To this end, several authors have embedded local Gaussian mixtures into tree-structured MRF models [e.g., 13, 14]. 
In order to maintain tractability, these models are arranged such that coefficients are grouped in non-overlapping clusters, allowing a graphical probability model with no loops. Despite their global consistency, the artificially imposed cluster boundaries lead to substantial artifacts in applications such as denoising. In this paper, we use a local GSM as a basis for a globally consistent and spatially homogeneous field of Gaussian scale mixtures (FoGSM). Specifically, the FoGSM is formulated as the product of two mutually independent MRFs: a positive multiplier field obtained by exponentiating a homogeneous Gaussian MRF (hGMRF), and a second hGMRF. We develop a parameter estimation procedure, and show that the model is able to capture important statistical regularities in the marginal and joint wavelet statistics of a photographic image. We apply the FoGSM to image denoising, demonstrating substantial improvement over the previous state-of-the-art results obtained with a local GSM model. 1 Gaussian scale mixtures A GSM random vector x is formed as the product of a zero-mean Gaussian random vector u and an d √ d independent random variable z, as x = zu, where = denotes equality in distribution. The density of x is determined by the covariance of the Gaussian vector, Σ, and the density of the multiplier, p z (z), through the integral Nx (0, zΣ)pz (z)dz ∝ p(x) = z z xT Σ−1 x 1 pz (z)dz. exp − √ 2z z|Σ| (1) A key property of GSMs is that when z determines the scale of the conditional variance of x given z, which is a Gaussian variable with zero mean and covariance zΣ. In addition, the normalized variable √ x z is a zero mean Gaussian with covariance matrix Σ. The GSM model has been used to describe the marginal and joint densities of local clusters of wavelet coefficients, both within and across subbands [9], where the embedded Gaussian structure affords simple and efficient computation. This local GSM model has been be used for denoising, by independently estimating each coefficient conditioned on its surrounding cluster [12]. This method achieves state-of-the-art performances, despite the fact that treating overlapping clusters as independent does not give rise to a globally consistent statistical model that satisfies all the local constraints. 2 Fields of Gaussian scale mixtures In this section, we develop fields of Gaussian scale mixtures (FoGSM) as a framework for modeling wavelet coefficients of photographic images. Analogous to the local GSM model, we use a latent multiplier field to modulate a homogeneous Gaussian MRF (hGMRF). Formally, we define a FoGSM x as the product of two mutually independent MRFs, √ d x = u ⊗ z, (2) where u is a zero-mean hGMRF, and z is a field of positive multipliers that control the local coefficient variances. The operator ⊗ denotes element-wise multiplication, and the square root operation is applied to each component. Note that x has a one-dimensional GSM marginal distributions, while its components have dependencies captured by the MRF structures of u and z. Analogous to the local GSM, when conditioned on z, x is an inhomogeneous GMRF p(x|z) ∝ √ 1 |Qu | exp − xT D z zi 2 i −1 Qu D √ z −1 x = 1 |Qu | exp − (x zi 2 i √ T z) Qu (x √ z) , (3) where Qu is the inverse covariance matrix of u (also known as the precision matrix), and D(·) denotes the operator that form a diagonal matrix from an input vector. Note also that the element√ wise division of the two fields, x z, yields a hGMRF with precision matrix Q u . 
To complete the FoGSM model, we need to specify the structure of the multiplier field z. For tractability, we use another hGMRF as a substrate, and map it into positive values by exponentiation, x u log z Fig. 1. Decomposition of a subband from image “boat” (left) into the normalized subband u (middle) and the multiplier field z (right, in the logarithm domain). Each image is rescaled individually to fill the full range of grayscale intensities. as was done in [10]. To be more specific, we model log(z) as a hGMRF with mean µ and precision matrix Qz , where the log operator is applied element-wise, from which the density of z follows as: pz (z) ∝ |Qz | 1 exp − (log z − µ)T Qz (log z − µ) . zi 2 i (4) This is a natural extension of the univariate lognormal prior used previously for the scalar multiplier in the local GSM model [12]. The restriction to hGMRFs greatly simplifies computation with FoGSM. Particularly, we take advantage of the fact that a 2D hGMRF with circular boundary handling has a sparse block-circulant precision matrix with a generating kernel θ specifying its nonzero elements. A block-circulant matrix is diagonalized by the Fourier transform, and its multiplication with a vector corresponds to convolution with the kernel θ. The diagonalizability with a fixed and efficiently computed transform makes the parameter estimation, sampling, and inference with a hGMRF substantially more tractable than with a general MRF. Readers are referred to [15] for a detailed description of hGMRFs. Parameter estimation: The estimation of the latent multiplier field z and the model parameters (µ, Qz , Qu ) may be achieved by maximizing log p(x, z; Q u, Qz , µ) with an iterative coordinate-ascent method, which is guaranteed to converge. Specifically, based on the statistical dependency structures in the FoGSM model, the following three steps are repeated until convergence: (i) z(t+1) = argmaxz log p(x|z; Q(t) ) + log p(z; Q(t) , µ(t) ) u z (ii) Q(t+1) = argmaxQu log p(x|z(t+1); Qu ) u (iii) (Q(t+1) , µ(t+1) ) = argmaxQz ,µ log p(z(t+1) ; Qz , µ) z (5) According to the FoGSM model structure, steps (ii) and (iii) correspond to maximum likelihood √ z(t+1) and log z(t+1) , respectively. Because of this, estimates of the parameters of hGMRFs, x both steps may be efficiently implemented by exploiting the diagonalization of the precision matrices with 2D Fourier transforms [15]. Step (i) in (5) may be implemented with conjugate gradient ascent [16]. To simplify description and computation, we introduce a new variable for the element-wise inverse square root of the multiplier: √ s=1 z. The likelihood in (3) is then changed to: p(x|s) ∝ i 1 si exp − (x ⊗ s)T Qu (x ⊗ s) = 2 i 1 si exp − sT D(x)Qu D(x)s . 2 (6) The joint density of s is obtained from (4), using the relations between densities of transformed variables, as 1 1 exp − (2 log s + µ)T Qz (2 log s + µ) . (7) p(s) ∝ si 2 i ˆ Combining . (6) and (7), step (i) in (5) is equivalent to computing s = argmaxs log p(x|s; Qu ) + log p(s; Qz , µ), which is further simplified into: argmin s 1 T 1 s D (x) Qu D (x) s + (2 log s + µ)T Qz (2 log s + µ) . 2 2 (8) boat house peppers x x x x log p(x) Barbara Fig. 2. Empirical marginal log distributions of coefficients from a multi-scale decomposition of photographic images (blue dot-dashed line), synthesized FoGSM samples from the same subband (red solid line), and a Gaussian with the same standard deviation (red dashed line). ˆ ˆ and the optimal z is then recovered as z = 1 (ˆ ⊗ s). 
We the optimize (8) with conjugate gradient s ˆ ascent [16]. Specifically, the negative gradient of the objective function in (8) with respect to s is − ∂ log p(x|s)p(s) ∂s = D (x) Qu D (x) s + 2 D(s)−1 Qz (2 log s + µ) = x ⊗ (θu (x ⊗ s)) + 2(θz (2 log s + µ)) s, and the multiplication of any vector h with the Hessian matrix can be computed as: ∂2 log p(x|s)p(s) h = x ⊗ (θu (x ⊗ h)) + 4 (θz (h s)) s − 2 θz (log s + µ) ⊗ h (s ⊗ s). ∂s2 Both operations can be expressed entirely in terms of element-wise operations ( and ⊗) and 2D convolutions ( ) with the generating kernels of the two precision matrices θ u and θz , which allows for efficient implementation. 3 Modeling photographic images We have applied the FoGSM model to subbands of a multi-scale image representation known as a steerable pyramid [17]. This decomposition is a tight frame, constructed from oriented multiscale derivative operators, and is overcomplete by a factor of 4K/3, where K is the number of orientation bands. Note that the marginal and joint statistics we describe are not specific to this decomposition, and are similar for other multi-scale oriented representations. We fit a FoGSM model to each subband of a decomposed photographic image, using the algorithms described in the previous section. For precision matrices Q u and Qz , we assumed a 5 × 5 Markov neighborhood (corresponding to a 5 × 5 convolution kernel), which was loosely chosen to optimize the tradeoff between accuracy and overfitting. Figure 1 shows the result of fitting a FoGSM model to an example subband from the “boat” image (left panel). The subband is decomposed into the product of the u field (middle panel) and the z field (right panel, in the logarithm domain), along with model parameters Q u , µ and Qz (not shown). Visually, the changing spatial variances are represented in the estimated log z field, and the estimated u is much more homogeneous than the original subband and has a marginal distribution close to Gaussian.1 However, the log z field still has a non-Gaussian marginal distribution and is spatially inhomogeneous, suggesting limitations of FoGSM for modeling photographic image wavelet coefficients (see Discussion). The statistical dependencies captured by the FoGSM model can be further revealed by examining marginal and joint statistics of samples synthesized with the estimated model parameters. A sample √ from FoGSM can be formed by multiplying samples of u and z. The former is obtained by sampling from hGMRF u, and the latter is obtained from the element-wise exponentiation followed by a element-wise square root operation of a sample of hGMRF log z. This procedure is again efficient for FoGSM due to the use of hGMRFs as building blocks [15]. Marginal distributions: We start by comparing the marginal distributions of the samples and the original subband. Figure 2 shows empirical histograms in the log domain of a particular subband 1 This ”Gaussianizing” behavior was first noted in photographic images by Ruderman [18], who observed that image derivative measurements that were normalized by a local estimate of their standard deviation had approximately Gaussian marginal distributions. close ∆ = 1 near ∆ = 4 far ∆ = 32 orientation scale real sim real sim Fig. 3. Examples of empirically observed distributions of wavelet coefficient pairs, compared with distributions from synthesized samples with the FoGSM model. See text for details. 
from four different photographic images (blue dot-dashed line), and those of the synthesized samples of FoGSM models learned from each corresponding subband (red solid line). For comparison, a Gaussian with the same standard deviation as the image subband is also displayed (red dashed line). Note that the synthesized samples have conspicuous non-Gaussian characteristics similar to the real subbands, exemplified by the high peak and heavy tails in the marginal distributions. On the other hand, they are typically less kurtotic than the real subbands. We believe this arises from the imprecise Gaussian approximation of log z (see Discussion). Joint distributions: In addition to one-dimensional marginal statistics, the FoGSM model is capable of capturing the joint behavior of wavelet coefficients. As described in [4, 9], wavelet coefficients of photographic images present non-Gaussian dependencies. Shown in the first and the third rows in Fig. 3 are empirical joint and conditional histograms for one subband of the “boat” image, for five pairs of coefficients, corresponding to basis functions with spatial separations of ∆ = {1, 4, 32} samples, two orthogonal orientations and two adjacent scales. Contour lines in the joint histogram are drawn at equal intervals of log probability. Intensities in the conditional histograms correspond to probability, except that each column is independently rescaled to fill the full range of intensity. For a pair of adjacent coefficients, we observe an elliptical joint distribution and a “bow-tie” shaped conditional distribution. The latter is indicative of strong non-Gaussian dependencies. For coefficients that are distant, the dependency becomes weaker and the corresponding joint and conditional histograms become more separable, as would be expected for two independent random variables. Random samples drawn from a FoGSM model, with parameters fitted to the corresponding subband, have statistical characteristics consistent with the general description of wavelet coefficients of photographic images. Shown in the second and the fourth rows of Fig. 3 are the joint and conditional histograms of synthesized samples from the FoGSM model estimated from the same subband as in the first and the third rows. Note that the joint and conditional histograms of the synthesized samples have similar transition of spatial dependencies as the separation increases (column 1,2 and 3), suggesting that the FoGSM accounts well for pairwise joint dependencies of coefficients over a full range of spatial separations. On the other hand, the dependencies between subbands of different orientations and scales are not properly modeled by FoGSM (column 4 and 5). This is especially true for subbands at different scales, which exhibit strong dependencies. The current FoGSM model original image noisy image (σ = 50) (PSNR = 14.15dB) GSM-BLS (PSNR = 26.34dB) FoGSM (PSNR = 27.01dB) Fig. 4. Denoising results using local GSM [12] and FoGSM. Performances are evaluated in peaksignal-to-noise-ratio (PSNR), 20 log10 (255/σe ), where σe is the standard deviation of the error. does not exhibit those dependencies as only spatial neighbors are used to make use the 2D hGMRFs (see Discussion). 4 Application to image denoising Let y = x+w be a wavelet subband of an image that has been corrupted with white Gaussian noise of known variance. 
In an overcomplete wavelet domain such as steerable pyramid, the white Gaussian noise is transformed into correlated Gaussian noise w ∼ N w (0, Σw ), whose covariance Σ w can be derived from the basis functions of the pyramid transform. With FoGSM as prior over x, commonly used denoising methods involve expensive high-dimensional integration: for instance, maximum ˆ a posterior estimate, x MAP = argmaxx log p(x|y), requires a high-dimensional integral over z, and ˆ the Bayesian least square estimation, x BLS = E(x|y) requires a double high-dimensional integral over x and z. Although it is possible to optimize with these criteria using Monte-Carlo Markov sampling or other approximations, we instead develop a more efficient deterministic algorithm that takes advantage of the hGMRF structure in the FoGSM model. Specifically, we compute (ˆ , z, Qu , Qz , µ) = argmaxx,z,Qu ,Qz ,µ log p(x, z|y; Q u, Qz , µ) x ˆ ˆ ˆ ˆ (9) ˆ and take x as the optimal denoised subband. Note that the model parameters are learned within the inference process rather than in a separate parameter learning step. This strategy, known as partial optimal solution [19], greatly reduces the computational complexity. We optimize (9) with coordinate ascent, iterating between maximizing each of (x, z, Q u , Qz , µ) while fixing the others. With fixed estimates of (z, Q u , Qz , µ), the optimization of x is argmaxx log p(x, z|y; Q u, Qz , µ) = argmaxx log p(x|z, y; Q u, Qz , µ) + log p(z|y; Q u, Qz , µ) , which reduces to argmax x log p(x|z, y; Q u), with the second term independent of x and can be dropped from optimization. Given the Gaussian structure of x given z, this step is then equivalent to a Wiener filter (linear in y). Fixing (x, Q u , Qz , µ), the optimization of z is argmaxz log p(x, z|y; Q u, Qz , µ)= argmaxz log p(y|x, z; Q u)+log p(x, z; Q u, Qz , µ)−log p(y; Qu , Qz , µ) , which is further reduced to argmax z log p(x, z; Qu , Qz , µ). Here, the first term was dropped since y is independent of z when conditioned on x. The last term was also dropped since it is also independent of z. Therefore, optimizing z given (x, Q u , Qz , µ) is equivalent to the first step of the algorithm in section 2, which can be implemented with efficient gradient descent. Finally, given (x, z), the FoGSM √ z(t+1) and log z(t+1) , similar to the model parameters (Q u , Qz , µ) are estimated from hGMRFs x second and third step in the algorithm of section 2. However, to reduce the overall computation time, instead of a complete maximum likelihood estimation, these parameters are estimated with a maximum pseudo-likelihood procedure [20], which finds the parameters maximizing the product of all conditional distributions (which are 1D Gaussians in the GMRF case), followed by a projection to the subspace of FoGSM parameters that results in positive definite precision matrices. We tested this denoising method on a standard set of test images [12]. The noise corrupted images were first decomposed these into a steerable pyramid with multiple levels (5 levels for a 512 × 512 image and 4 levels for a 256 × 256 image ) and 8 orientations. We assumed a FoGSM model for each subband, with a 5 × 5 neighborhood for field u and a 1 × 1 neighborhood for field log z. These sizes were chosen to provide a reasonable combination of performance and computational efficiency. We then estimate the optimal x with the algorithm described previously, with the initial values of x and z computed from subband denoised with the local GSM model [12]. Shown in Fig. 
4 is an example of denoising the “boat” image corrupted with simulated additive white Gaussian noise of strength σ = 50, corresponding to a peak-signal-to-noise-ratio (PSNR), of 14.15 dB. We compare this with the local GSM method in [12], which, assuming a local GSM model for the neighborhood consisting of 3 × 3 spatial neighbors plus parent in the next coarsest scale, computes a Bayes least squares estimate of each coefficient conditioned on its surrounding neighborhood. The FoGSM denoising achieves substantial improvement (+0.68 in PSNR) and is seen to exhibit better contrast and continuation of oriented features (see Fig. 4). On the other hand, FoGSM introduces some noticeable artifacts in low contrast areas, which is caused by numerical instability at locations with small z. We find that the improvement in terms of PSNR is consistent across photographic images and noise levels, as reported in Table 1. But even with a restricted neighborhood for the multiplier field, this PSNR improvement does come at a substantial computational cost. As a rough indication, running on a PowerPC G5 workstation with 2.3 Ghz processor and 16 Gb RAM memory, using unoptimized MATLAB (version R14) code, denoising a 512 × 512 image takes on average 4.5 hours (results averaging over 5 images), and denoising a 256×256 image takes on average 2.4 hours (result averaging over 2 images), to a convergence precision producing the reported results. Our preliminary investigation indicates that the slow running time is mainly due to the nature of coordinate ascent and the landscape of (9), which requires many iterations to converge. 5 Discussion We have introduced fields of Gaussian scale mixtures as a flexible and efficient tool for modeling the statistics of wavelet coefficients of photographic images. We developed a feasible (although admittedly computationally costly) parameter estimation method, and showed that samples synthesized from the fitted FoGSM model are able to capture structures in the marginal and joint wavelet statistics of photographic images. Preliminary results of applying FoGSM to image denoising indicate substantial improvements over the state-of-the-art methods based on the local GSM model. Although FoGSM has a structure that is similar to the local scale mixture model [9, 10], there is a fundamental difference between them. In FoGSM, hGMRF structures are enforced in u and log z, while the local scale mixture models impose minimal statistical structure on these variables. Because of this, our model easily extends to images of arbitrary size, while the local scale mixture models are essentially confined to describing small image patches (the curse of dimensionality, and the increase in computational cost prevent one from scaling the patch size up). On the other hand, the close relation to Gaussian MRF makes the analysis and computation of FoGSM significantly easier than other non-Gaussian MRF based image models [6, 7, 5]. We envision, and are currently working on, a number of model improvements. First, the model should benefit from the introduction of more general Markov neighborhoods, including wavelet coefficients from subbands at other scales and orientations [4, 12], since the current model is clearly not accounting for these dependencies (see Fig. 3). 
Secondly, the log transformation used to derive the multiplier field from a hGMRF is somewhat ad hoc, and we believe that substitution of another nonlinear transformation (e.g., a power law [14]) might lead to a more accurate description of the image statistics. Thirdly, the current denoising method estimates model parameter during the process of denoising, which produces image adaptive model parameters. We are exploring the possibility of using a set of generic model parameter learned a priori on a large set of photographic images, so that a generic statistical model for all photographic images based on FoGSM can be built. Finally, there exist residual inhomogeneous structures in the log z field (see Fig. 1) that can likely be captured by explicitly incorporating local orientation [21] or phase into the model. Finding tractable models and algorithms for handling such circular variables is challenging, but we believe their inclusion will result in substantial improvements in modeling and in denoising performance. σ/PSNR 10/28.13 25/20.17 50/14.15 100/8.13 σ/PSNR 10/28.13 25/20.17 50/14.15 100/8.13 Barbara 35.01 (34.01) 30.10 (29.07) 26.40 (25.45) 23.01 (22.61) Flintstones 32.47 (31.78) 28.29 (27.48) 24.82 (24.02) 21.24 (20.49) barco 35.05 (34.42) 30.44 (29.73) 27.36 (26.63) 24.44 (23.84) house 35.63 (35.27) 31.64 (31.32) 28.51 (28.23) 25.33 (25.31) boat 34.12 (33.58) 30.03 (29.34) 27.01 (26.35) 24.20 (23.79) Lena 35.94 (35.60) 32.11 (31.70) 29.12 (28.62) 26.12 (25.77) fingerprint 33.28 (32.45) 28.45 (27.44) 25.11 (24.13) 21.78 (21.21) peppers 34.38 (33.73) 29.78 (29.18) 26.43 (25.93) 23.17 (22.80) Table 1. Denoising results with FoGSM on different images and different noise levels. Shown in the table are PSNRs (20 log10 (255/σe ), where σe is the standard deviation of the error) of the denoised images, and in the parenthesis are the PSNRs of the same images denoised with a local GSM model [12]. References [1] P. J. Burt. Fast filter transforms for image processing. Comp. Graph. Image Proc., 16:20–51, 1981. [2] D. J. Field. Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am., 4(12):2379–2394, 1987. [3] B. Wegmann and C. Zetzsche. Statistical dependencies between orientation filter outputs used in human vision based image code. In Proc. Visual Comm. and Image Proc., volume 1360, pages 909–922, 1990. [4] R. W. Buccigrossi and E. P. Simoncelli. Image compression via joint statistical characterization in the wavelet domain. IEEE Trans. on Image Proc., 8(12):1688–1701, 1999. [5] Y. W. Teh, M. Welling, S. Osindero, and G. E. Hinton. Energy-based models for sparse overcomplete representations. J. of Machine Learning Res., 4:1235–1260, 2003. [6] S. C. Zhu, Y. Wu, and D. Mumford. Filters, random fields and maximum entropy (FRAME): Towards a unified theory for texture modeling. Int’l. J. Comp. Vis., 27(2):107–126, 1998. [7] S. Roth and M. J. Black. Fields of experts: a framework for learning image priors. In IEEE Conf. on Comp. Vis. and Pat. Rec., volume 2, pages 860–867, 2005. [8] P. Gehler and M. Welling. Products of ”edge-perts”. In Adv. in Neural Info. Proc. Systems (NIPS*05). MIT Press, 2006. [9] M. J. Wainwright and E. P. Simoncelli. Scale mixtures of Gaussians and the statistics of natural images. In Adv. Neural Info. Proc. Sys. (NIPS*99), volume 12, pages 855–861, May 2000. [10] Y. Karklin and M. S. Lewicki. A hierarchical Bayesian model for learning non-linear statistical regularities in non-stationary natural signals. 
Neural Computation, 17(2):397–423, 2005. [11] A. Hyv¨ rinen, P. O. Hoyer, and M. Inki. Topographic ICA as a model of natural image statistics. In the a First IEEE Int’l. Workshop on Bio. Motivated Comp. Vis., London, UK, 2000. [12] J. Portilla, V. Strela, M. J. Wainwright, and E. P. Simoncelli. Image denoising using scale mixtures of Gaussians in the wavelet domain. IEEE Trans. on Image Proc., 12(11):1338–1351, 2003. [13] J. Romberg, H. Choi, and R. G. Baraniuk. Bayesian tree-structured image modeling using wavelet domain hidden Markov models. IEEE Trans. on Image Proc., 10(7):303–347, 2001. [14] M. J. Wainwright, E. P. Simoncelli, and A. S. Willsky. Random cascades on wavelet trees and their use in modeling and analyzing natural imagery. Appl. and Comp. Harm. Ana., 11(1):89–123, 2001. [15] H. Rue and L. Held. Gaussian Markov Random Fields: Theory And Applications. Monographs on Statistics and Applied Probability. Chapman and Hall/CRC, 2005. [16] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes. Cambridge, 2nd edition, 2002. [17] E. P. Simoncelli and W. T. Freeman. The steerable pyramid: A flexible architecture for multi-scale derivative computation. In IEEE Int’l. Conf. on Image Proc., volume 3, pages 444–447, 1995. [18] D. Ruderman. The statistics of natural images. Network : Comp. in Neural Sys., 5:598–605, 1994. [19] M. Figueiredo and J. Leit¨ o. Unsupervised image restoration and edge location using compound Gaussa Markov random fields and MDL principle. IEEE Trans. on Image Proc., 6(8):1089–1122, 1997. [20] J. Besag. On the statistical analysis of dirty pictures. J. of the Royal Stat. Soc., Series B, 48:259–302, 1986. [21] D. K. Hammond and E. P. Simoncelli. Image denoising with an orientation-adaptive Gaussian scale mixture model. In Proc. 13th IEEE Int’l. Conf. on Image Proc., pages 1433–1436, October 2006.
5 0.38168666 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy
Author: Samuel S. Gross, Olga Russakovsky, Chuong B. Do, Serafim Batzoglou
Abstract: We consider the problem of training a conditional random field (CRF) to maximize per-label predictive accuracy on a training set, an approach motivated by the principle of empirical risk minimization. We give a gradient-based procedure for minimizing an arbitrarily accurate approximation of the empirical risk under a Hamming loss function. In experiments with both simulated and real data, our optimization procedure gives significantly better testing performance than several current approaches for CRF training, especially in situations of high label noise.
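The abstract above turns on replacing the non-differentiable per-label (Hamming) risk with a smooth approximation that gradient methods can minimize. The sketch below is our own illustration of that general idea, not the authors' actual objective: given per-position label marginals, it scores each position by a sigmoid of the margin between the correct label and the best competing label, yielding a differentiable surrogate for labelwise accuracy.

```python
import numpy as np

def smoothed_labelwise_accuracy(marginals, labels, sharpness=10.0):
    """Differentiable surrogate for per-label accuracy.

    marginals: (T, K) array of per-position label marginals P(y_t = k | x).
    labels:    (T,) array of true label indices.
    sharpness: larger values push the sigmoid toward a hard 0/1 count.
    """
    T = len(labels)
    p_true = marginals[np.arange(T), labels]
    # probability of the best competing (wrong) label at each position
    masked = marginals.astype(float).copy()
    masked[np.arange(T), labels] = -np.inf
    p_best_wrong = masked.max(axis=1)
    margin = p_true - p_best_wrong
    return np.mean(1.0 / (1.0 + np.exp(-sharpness * margin)))
```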
6 0.3556318 9 nips-2006-A Nonparametric Bayesian Method for Inferring Features From Similarity Judgments
7 0.35042492 20 nips-2006-Active learning for misspecified generalized linear models
8 0.34955287 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds
9 0.34016535 1 nips-2006-A Bayesian Approach to Diffusion Models of Decision-Making and Response Time
10 0.33649987 131 nips-2006-Mixture Regression for Covariate Shift
11 0.33559674 120 nips-2006-Learning to Traverse Image Manifolds
12 0.32218155 147 nips-2006-Non-rigid point set registration: Coherent Point Drift
13 0.31995791 17 nips-2006-A recipe for optimizing a time-histogram
14 0.29729331 56 nips-2006-Conditional Random Sampling: A Sketch-based Sampling Technique for Sparse Data
15 0.2948187 161 nips-2006-Particle Filtering for Nonparametric Bayesian Matrix Factorization
16 0.28017047 98 nips-2006-Inferring Network Structure from Co-Occurrences
17 0.27632317 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation
18 0.27321512 92 nips-2006-High-Dimensional Graphical Model Selection Using $\ell 1$-Regularized Logistic Regression
19 0.27196592 68 nips-2006-Dirichlet-Enhanced Spam Filtering based on Biased Samples
20 0.27010894 162 nips-2006-Predicting spike times from subthreshold dynamics of a neuron
topicId topicWeight
[(1, 0.088), (3, 0.073), (7, 0.112), (9, 0.025), (20, 0.061), (22, 0.059), (44, 0.092), (57, 0.045), (64, 0.01), (65, 0.038), (69, 0.028), (71, 0.015), (97, 0.24)]
simIndex simValue paperId paperTitle
1 0.80450439 136 nips-2006-Multi-Instance Multi-Label Learning with Application to Scene Classification
Author: Zhi-hua Zhou, Min-ling Zhang
Abstract: In this paper, we formalize multi-instance multi-label learning, where each training example is associated with not only multiple instances but also multiple class labels. Such a problem can occur in many real-world tasks; e.g., an image usually contains multiple patches, each of which can be described by a feature vector, and the image can belong to multiple categories since its semantics can be recognized in different ways. We analyze the relationship between multi-instance multi-label learning and the learning frameworks of traditional supervised learning, multi-instance learning and multi-label learning. Then, we propose the MIMLBOOST and MIMLSVM algorithms, which achieve good performance in an application to scene classification.
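To make the data layout implied by this abstract concrete, here is a minimal sketch of how one multi-instance multi-label example might be represented; the class and field names are our own and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Set
import numpy as np

@dataclass
class MIMLExample:
    """One training example: a bag of instances plus a set of class labels."""
    instances: List[np.ndarray]   # e.g. feature vectors for image patches
    labels: Set[str]              # e.g. {"mountain", "lake"} for a scene image

example = MIMLExample(
    instances=[np.random.rand(16) for _ in range(5)],  # 5 patches, 16-d features
    labels={"mountain", "lake"},
)
```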
same-paper 2 0.79440403 121 nips-2006-Learning to be Bayesian without Supervision
Author: Martin Raphan, Eero P. Simoncelli
Abstract: unkown-abstract
3 0.61470753 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming
Author: Kilian Q. Weinberger, Fei Sha, Qihui Zhu, Lawrence K. Saul
Abstract: In many areas of science and engineering, the problem arises of how to discover low dimensional representations of high dimensional data. Recently, a number of researchers have converged on common solutions to this problem using methods from convex optimization. In particular, many results have been obtained by constructing semidefinite programs (SDPs) with low rank solutions. While the rank of matrix variables in SDPs cannot be directly constrained, it has been observed that low rank solutions emerge naturally by computing high variance or maximal trace solutions that respect local distance constraints. In this paper, we show how to solve very large problems of this type by a matrix factorization that leads to much smaller SDPs than those previously studied. The matrix factorization is derived by expanding the solution of the original problem in terms of the bottom eigenvectors of a graph Laplacian. The smaller SDPs obtained from this matrix factorization yield very good approximations to solutions of the original problem. Moreover, these approximations can be further refined by conjugate gradient descent. We illustrate the approach on localization in large scale sensor networks, where optimizations involving tens of thousands of nodes can be solved in just a few minutes.
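The key preprocessing step described above, expanding the SDP solution in the bottom eigenvectors of a graph Laplacian, can be sketched as follows. This is our own illustration using SciPy, not the authors' code, and the factorization comment reflects our reading of the abstract.

```python
import numpy as np
from scipy.sparse import csgraph
from scipy.sparse.linalg import eigsh

def bottom_laplacian_eigenvectors(W, m):
    """Eigenvectors of the graph Laplacian with the m smallest eigenvalues.

    W : (n, n) symmetric (sparse) weight matrix of the neighborhood graph.
    The returned (n, m) matrix Q spans the smoothest functions on the graph
    and can serve as the basis in a low-rank factorization X ~ Q Y Q^T of the
    SDP matrix variable, shrinking the optimization from n x n to m x m.
    """
    L = csgraph.laplacian(W, normed=False)
    # 'SM' = smallest-magnitude eigenvalues; for very large graphs a
    # shift-invert solve (sigma near 0) is typically much faster.
    vals, vecs = eigsh(L, k=m, which='SM')
    return vecs
```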
4 0.61349326 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds
Author: Gloria Haro, Gregory Randall, Guillermo Sapiro
Abstract: The study of point cloud data sampled from a stratification, a collection of manifolds with possibly different dimensions, is pursued in this paper. We present a technique for simultaneously soft clustering and estimating the mixed dimensionality and density of such structures. The framework is based on a maximum likelihood estimation of a Poisson mixture model. The presentation of the approach is completed with artificial and real examples demonstrating the importance of extending manifold learning to stratification learning.
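As a small companion to this abstract, the sketch below shows a per-point maximum-likelihood intrinsic-dimension estimate computed from nearest-neighbor distances, in the spirit of the Poisson-based framework described above. It is our own illustration (a Levina–Bickel-style local estimator), not the soft-clustering algorithm of the paper itself.

```python
import numpy as np
from scipy.spatial import cKDTree

def local_dimension_mle(X, k=10):
    """Per-point intrinsic dimension estimate from nearest-neighbor distances.

    X: (n, D) array of points. Returns an (n,) array of estimates
    m_hat(x) = [ (1/(k-1)) * sum_j log(T_k / T_j) ]^{-1},
    where T_j is the distance from x to its j-th nearest neighbor.
    """
    tree = cKDTree(X)
    # k+1 neighbors because each point's nearest neighbor is itself
    dists, _ = tree.query(X, k=k + 1)
    T = dists[:, 1:]                              # drop the zero self-distance
    log_ratios = np.log(T[:, -1:] / T[:, :-1])    # log(T_k / T_j), j = 1..k-1
    return 1.0 / log_ratios.mean(axis=1)
```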
5 0.61075145 8 nips-2006-A Nonparametric Approach to Bottom-Up Visual Saliency
Author: Wolf Kienzle, Felix A. Wichmann, Matthias O. Franz, Bernhard Schölkopf
Abstract: This paper addresses the bottom-up influence of local image information on human eye movements. Most existing computational models use a set of biologically plausible linear filters, e.g., Gabor or Difference-of-Gaussians filters as a front-end, the outputs of which are nonlinearly combined into a real number that indicates visual saliency. Unfortunately, this requires many design parameters such as the number, type, and size of the front-end filters, as well as the choice of nonlinearities, weighting and normalization schemes, etc., for which biological plausibility cannot always be justified. As a result, these parameters have to be chosen in a more or less ad hoc way. Here, we propose to learn a visual saliency model directly from human eye movement data. The model is rather simplistic and essentially parameter-free, and therefore contrasts with recent developments in the field that usually aim at higher prediction rates at the cost of additional parameters and increasing model complexity. Experimental results show that, despite the lack of any biological prior knowledge, our model performs comparably to existing approaches, and in fact learns image features that resemble findings from several previous studies. In particular, its maximally excitatory stimuli have center-surround structure, similar to receptive fields in the early human visual system.
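The general recipe implied by this abstract is to learn a discriminant between image patches at fixated and non-fixated locations directly from eye-movement data. The sketch below is our own minimal illustration of that recipe; the use of scikit-learn and an RBF support vector machine is our choice and is not claimed to match the authors' exact model.

```python
import numpy as np
from sklearn.svm import SVC

def train_saliency_model(fixated_patches, control_patches):
    """Learn a bottom-up saliency function from raw image patches.

    fixated_patches: (n1, p) array of patches centered on human fixations.
    control_patches: (n2, p) array of patches from non-fixated (e.g. random)
                     locations in the same images.
    Returns a classifier whose decision function serves as a saliency score.
    """
    X = np.vstack([fixated_patches, control_patches])
    y = np.concatenate([np.ones(len(fixated_patches)),
                        np.zeros(len(control_patches))])
    model = SVC(kernel='rbf', C=1.0, gamma='scale')
    model.fit(X, y)
    return model

# saliency of a new patch: model.decision_function(patch.reshape(1, -1))
```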
6 0.60276723 128 nips-2006-Manifold Denoising
7 0.59689236 19 nips-2006-Accelerated Variational Dirichlet Process Mixtures
8 0.59579885 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization
9 0.59569675 55 nips-2006-Computation of Similarity Measures for Sequential Data using Generalized Suffix Trees
10 0.5923034 48 nips-2006-Branch and Bound for Semi-Supervised Support Vector Machines
11 0.59124309 161 nips-2006-Particle Filtering for Nonparametric Bayesian Matrix Factorization
12 0.58918542 43 nips-2006-Bayesian Model Scoring in Markov Random Fields
13 0.58901167 117 nips-2006-Learning on Graph with Laplacian Regularization
14 0.58846176 33 nips-2006-Analysis of Representations for Domain Adaptation
15 0.58843505 193 nips-2006-Tighter PAC-Bayes Bounds
16 0.58817303 159 nips-2006-Parameter Expanded Variational Bayesian Methods
17 0.58801949 20 nips-2006-Active learning for misspecified generalized linear models
18 0.58770323 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics
19 0.58751941 80 nips-2006-Fundamental Limitations of Spectral Clustering
20 0.58711523 169 nips-2006-Relational Learning with Gaussian Processes