We’re usually taught that it’s important to test model assumptions. For example, if your inference assumes the probability distribution for the data is a normal/gaussian distribution, then you should look at a histogram of the data to see if it actually looks like a gaussian. If it doesn’t, then you have a large risk of getting wrong or misleading answers. Or so we’re told.

Consider the following simple example. There is a parameter which we want to learn from 1000 data values , whose sampling distribution is independently for each data point. As I have argued elsewhere, the sampling distribution is really a prior for the data given the parameters, and has the same logical status as any other prior distribution. This prior doesn’t imply that a histogram of the -values *has to* look like a gaussian, but this hypothesis (call it “gaussianity”) does have quite a high prior probability.

With an improper uniform prior for , the posterior for given the data is also a gaussian with mean (i.e. the arithmetic mean of the data), and standard deviation . What we usually want to see is that the posterior density is high at the true value of . Let’s say that if the true value of is within 1.96 standard deviations of the peak of the posterior, then our inference has been “successful”. The prior probability of “success”, according to the joint prior (uniform for times gaussian for the data given ) is 95 %.

An important question is whether there is any relationship between “gaussianity” of the data, and whether the data would yield a “successful” posterior distribution. According to folklore, there should be a strong relationship: if the data are highly non-gaussian, then our inferences should be suspect, right? Well we can test that. We know the prior probability of success is 95%, but what’s the prior probability of success *given non-gaussianity*? To compute this I simulated fake parameter values and datasets from the joint prior, and calculated whether the data set was gaussian (I used a classical test to do this, and called the data non-gaussian if the p-value was below 0.0001.), and whether the inference was successful. Counting the fraction of inferences that were successful when the data were deemed non-gaussian gives us the result,

wait for it…

basically 95%! That’s right, the proposition that “the data look gaussian” has *nothing whatsoever* to do with whether the inference is successful or not! Therefore doing such a ‘check’ is basically superstition. To test this further I reduced the p-value threshold to (I had to use Nested Sampling to compute the results), and the results were unchanged. So what *does* produce a successful inference? In this example it’s easy to say. Since the posterior only depends on the arithmetic mean of the data, any data set whose mean is close to the true value of will result in success – no matter how the data points are actually “distributed”.

Real inference problems are more complicated than this example. Of course it is possible to get misleading results when there are problems with your priors. The way to improve this situation is to put more thought into your priors, not to apply superstitions.

The R code I used to calculate P(success | nongaussian)

reps = 10000000

# Data size

N = 1000

# Count number of nongaussian datasets seen

nongaussian = 0

# Count number of successful inferences when the dataset was nongaussian

success = 0

for(i in 1:reps)

{

# Wide-ish flat prior

mu = -10 + 20*runif(1)

# Generate data

x = mu + rnorm(N)

# A classical test to see if the data is nongaussian

pval = (shapiro.test(x))$p.value

if(pval < 0.0001)

{

nongaussian = nongaussian + 1

# Check for success

if(abs(mean(x) – mu) < 1.96/sqrt(N))

success = success + 1

hist(x, breaks=100)

# Estimate P(success | nongaussian)

a = success + 1

b = nongaussian – success + 1

m = a/(a + b)

s = sqrt(a*b)/(a + b)/sqrt(a + b + 1)

print(c(nongaussian, m))

}

}

Ahhh … confused. It seems like your inference never actually assumes the probability distribution for the data is a normal/gaussian distribution. It just assumes that it is drawn from a distribution with mean \mu.

In which case, which inferences do really assumes the probability distribution for the data is a normal/gaussian distribution?

Hi Luke,

The inferences were based on the uniform prior + gaussian likelihood. That’s where the gaussian posterior centered around the mean of the data comes from.

Thanks for a nice post. My experience is consistent with yours.

I think that if someone “hangs around” with “Bayesian stochastic engineering” for a while, the cartoon version of Bayesian inference drawn by many frequentist tests drops away and crumbles. That “cartoon version” says that choice of priors makes Bayesian inference subjective, and, so, results will differ depending upon the choice of priors, and, because there’s no systematic way to choose priors, the student takes their chances doing Bayesian inference. Bunk, at least for most problems. Bunk.

Unless a prior explicitly rules out a portion of the posterior density, in practice, what I find is that priors help speed convergence. No one wants to wait around ten years waiting for the Gelman-Rubin PSRF to get small. That is often helped by stating a model in a different way to encourage mixing, but good priors help, too. It is also important to try to use, as Gelman and as Best, Lund, …, Spiegelhalter advise wide “uninformative priors” as much as possible. These do not, as some frequentists seem to believe, necessarily result in the same answer as the m.l.e. version of the problem, if only for the reason that the posterior isn’t necessarily unimodal or doesn’t necessarily have a sharp peak anywhere, just wide plateaus.

There’s a view of practical Bayesian inference which sees it as stochastic search and optimization, and I think that can be helpful.

The advice about Gaussianity of distributions CAN affect certain frequentist techniques strongly. Kruschke’s comparison between the Bayesian way of doing the equivalent of t-tests has many cases where, if the incoming data skew away from the Gaussian, the t-test misleads, but the Gibbs sampler does just fine.

There are the efforts in “objective Bayes” work for systematically and non-parametrically choosing priors and I can’t comment on those. I have found the “stick breaking prior” which comes up in that work can be useful in cases where categories are supposed to remain distinct. I’ve used these, for example, in my recent post on Bayesian inversion of commingled solid waste tonnage to obtain estimates of components, at http://hypergeometric.wordpress.com/2014/09/08/blind-bayesian-recovery-of-components-of-residential-solid-waste-tonnage-from-totals-data/