“When you have eliminated the impossible, all that remains, no matter how improbable, must be the truth.”
– Sherlock Holmes (Arthur Conan Doyle)
For a long time Bayesian inference was something I understood without really understanding it. I only really got it after reading Chapter 2 of John K. Kruschke’s textbook Doing Bayesian Data Analysis, where he describes Bayesian inference as Reallocation of Credibility Across Possibilities.
I now understand Bayesian Inference to be essentially a mathematical generalization of Sherlock Holmes’ pithy statement about eliminating the impossible. This article is my own attempt to elucidate this idea. If this essay doesn’t do the trick, you might try Bayesian Reasoning for Intelligent People by Simon DeDeo or Kruschke’s Bayesian data analysis for newcomers.
This essay has two parts. Part 1 is a purely visual explanation of the concept of Bayesian inference, without any math, while Part 2 translates these ideas into mathematical formulas, including the famous Bayes’ theorem.
Part 1: The Concept
Prior Beliefs
A Bayesian reasoner starts out with a set of beliefs, which are modeled as a probability distribution. You can think of the probability distribution as simply a list of possibilities, each of which has a probability. These possibilities are mutually exclusive and their probabilities sum to 100%.
For example, suppose you pick a card at random from a 52-card deck. There are 52 possibilities, each with a probability of 1/52, and the total adds up to 1, or 100%.
Revising Beliefs
The Bayesian reasoner’s beliefs before acquiring new information are called prior beliefs.
Bayesian reasoners then revise their beliefs when they acquire new information (evidence), according to two very simple rules:
- reject any possibilities that are incompatible with the evidence
- reallocate probability to the remaining possibilities so that they sum to 100%
A Bayesian reasoner’s beliefs after acquiring new evidence are called the posterior beliefs.
For example, if you find out that the card is a heart, you must eliminate all non-hearts, and reallocate the remaining probability to the remaining 13 cards. Each card now has a probability of 1/13, and the total still adds up to 1.
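This update rule is mechanical enough to sketch in a few lines of code. Here is a minimal illustration in Python (the `update` helper and the card representation are my own, not from the original example), using exact fractions so the probabilities stay precise:

```python
from fractions import Fraction

# Build the prior: a uniform distribution over all 52 cards.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
prior = {(rank, suit): Fraction(1, 52) for rank in ranks for suit in suits}

def update(beliefs, is_compatible):
    """Reject possibilities incompatible with the evidence,
    then rescale the rest so they sum to 1."""
    kept = {k: p for k, p in beliefs.items() if is_compatible(k)}
    total = sum(kept.values())
    return {k: p / total for k, p in kept.items()}

# Evidence: the card is a heart.
posterior = update(prior, lambda card: card[1] == "hearts")

print(len(posterior))              # 13 possibilities remain
print(posterior[("Q", "hearts")])  # each now has probability 1/13
```

The whole of Bayesian updating, for a finite list of possibilities, is just this filter-and-rescale step.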
That’s All It Is
As simple as this seems, that’s all Bayesian inference is. This doesn’t seem all that powerful or impressive, but that’s just because this example probability distribution is so simple. It lacks interesting correlations. A card’s suit for example is not correlated with its rank: the prior probability of a queen was 1/13, and the posterior is also 1/13. Learning that the card was a heart doesn’t tell us much except that the card is a heart.
Bayesian inference becomes powerful when the prior beliefs contain interesting correlations, so that learning about one thing tells you about something else, sometimes in counter-intuitive ways.
The Case of the Disappearing Duchess
Let’s look at another example. Suppose that Sherlock Holmes is investigating the Case of the Disappearing Duchess, and believes that there is:
- a 50% chance that the Duke has kidnapped the Duchess and is holding her alive in captivity
- a 25% chance that the Duke has murdered her
- and a 25% chance that she has been murdered by the Count
The total of the probabilities is 100%. Holmes’ prior beliefs can be summarized in the table below.
Holmes’ Prior Beliefs
Culprit | Status | Probability |
---|---|---|
Duke | Alive | 50% |
Duke | Dead | 25% |
Count | Dead | 25% |
TOTAL | | 100% |
If the Duchess’s body is subsequently found buried under the Atrium, Holmes must eliminate the possibility that she is being held alive by the Duke. He must then reallocate probability among the remaining two scenarios in which the Duchess is dead.
Holmes’ Posterior Beliefs
Culprit | Status | Probability |
---|---|---|
Duke | Alive | 0% (eliminated) |
Duke | Dead | 50% |
Count | Dead | 50% |
TOTAL | 100% |
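The same filter-and-rescale step reproduces Holmes’ posterior table. A minimal sketch (the scenario encoding and `update` helper are my own, hypothetical names):

```python
from fractions import Fraction

# Holmes' prior beliefs over (culprit, status) scenarios.
prior = {
    ("Duke", "Alive"): Fraction(1, 2),
    ("Duke", "Dead"): Fraction(1, 4),
    ("Count", "Dead"): Fraction(1, 4),
}

def update(beliefs, is_compatible):
    """Reject incompatible possibilities, then rescale the rest to sum to 1."""
    kept = {k: p for k, p in beliefs.items() if is_compatible(k)}
    total = sum(kept.values())
    return {k: p / total for k, p in kept.items()}

# Evidence: the Duchess's body is found, so she is dead.
posterior = update(prior, lambda scenario: scenario[1] == "Dead")

print(posterior)
# {('Duke', 'Dead'): Fraction(1, 2), ('Count', 'Dead'): Fraction(1, 2)}
```

The two remaining scenarios each end up at 50%, matching the posterior table above.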
Reallocation of Probability Mass
Another way of looking at this is that the 50% probability mass previously allocated to the eliminated possibility is reallocated to the remaining possibilities. You can visualize this if we plot the prior and posterior probabilities as bar charts. The total volume of the bars in each of the two charts below is the same: 100%. The 50% probability mass that was previously allocated to the first possibility in Holmes’ prior beliefs is reallocated to the remaining two possibilities in his posterior beliefs.
Sequential Updating
Suppose Holmes subsequently finds evidence that exonerates the Count. To update Holmes’ beliefs again, we repeat the process.
First, the posterior after the first piece of evidence (the fact that the Duchess was found dead) becomes the new prior, as shown in the left-hand chart below.
Next, we eliminate possibility #3 (she was murdered by the Count). This time, since only one possibility remains, all probability mass is reallocated to this possibility, as shown in the right-hand chart below.
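Sequential updating is just the same step applied twice, with each posterior serving as the next prior. A sketch under the same hypothetical encoding as before:

```python
from fractions import Fraction

prior = {
    ("Duke", "Alive"): Fraction(1, 2),
    ("Duke", "Dead"): Fraction(1, 4),
    ("Count", "Dead"): Fraction(1, 4),
}

def update(beliefs, is_compatible):
    """Reject incompatible possibilities, then rescale the rest to sum to 1."""
    kept = {k: p for k, p in beliefs.items() if is_compatible(k)}
    total = sum(kept.values())
    return {k: p / total for k, p in kept.items()}

# First piece of evidence: the Duchess is dead.
after_body_found = update(prior, lambda s: s[1] == "Dead")

# Second piece of evidence: the Count is exonerated.
after_exoneration = update(after_body_found, lambda s: s[0] != "Count")

print(after_exoneration)
# {('Duke', 'Dead'): Fraction(1, 1)} -- one possibility remains, with probability 1
```

Once a single possibility holds all the probability mass, further evidence (about this case) can change nothing.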
All That Remains
So what happens now that we’ve reached a point where there are no more possibilities to eliminate? At this point, no more inferences can be made. There is nothing more to learn – at least with respect to the Case of the Disappearing Duchess. Holmes has eliminated the impossible and the remaining possibility must be the truth.
It’s not always possible to eliminate all uncertainty such that only one possibility remains. But Bayesian inference can be thought of as the process of reducing uncertainty as much as possible by eliminating the impossible, and then reallocating the probability mass to the possibilities that remain.
When you have eliminated the impossible, the probability of all that remains, no matter how improbable, must sum to 100%.
– Sherlock Thomas Bayes Holmes
Updating Beliefs based on Evidence
What makes Bayesian inference so powerful is that learning about one thing can shift beliefs in other things in counter-intuitive ways.
For example, learning that the Duchess is dead decreased the probability that the Duke did it (from 75% to 50%), and increased the probability that the Count did it (from 25% to 50%). You should be able to convince yourself that this is the case using common sense logic. But the rules of Bayesian inference give you a systematic way to come to the conclusion.
This is demonstrated visually in the four charts below. The first row of charts we have already seen: they show Holmes’ priors on the left, and his posteriors after learning that the Duchess is dead on the right.
The second row of charts show the same probabilities, but this time the charts show the marginal probabilities (e.g. the totals) for each possible culprit. The Duke is the culprit in two different scenarios in the priors, so the total prior probability for the Duke is 50% + 25% = 75%. The total prior probability for the Count is 25%.
After eliminating the Alive+Duke scenario, the total probability is 50% for both the Duke and the Count, as shown in the bottom-right chart. And so the probability decreased for the Duke and increased for the Count.
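Computing those marginal totals is a simple sum over scenarios sharing a culprit. A minimal sketch (the `marginal_culprit` helper is hypothetical, not from the original):

```python
from fractions import Fraction

# Holmes' prior beliefs over (culprit, status) scenarios.
prior = {
    ("Duke", "Alive"): Fraction(1, 2),
    ("Duke", "Dead"): Fraction(1, 4),
    ("Count", "Dead"): Fraction(1, 4),
}

# His posterior beliefs after the Duchess's body is found.
posterior = {
    ("Duke", "Dead"): Fraction(1, 2),
    ("Count", "Dead"): Fraction(1, 2),
}

def marginal_culprit(beliefs):
    """Sum probability over every scenario sharing the same culprit."""
    totals = {}
    for (culprit, status), p in beliefs.items():
        totals[culprit] = totals.get(culprit, Fraction(0)) + p
    return totals

print(marginal_culprit(prior))      # Duke: 3/4, Count: 1/4
print(marginal_culprit(posterior))  # Duke: 1/2, Count: 1/2
```

The Duke’s marginal probability falls from 75% to 50% while the Count’s rises from 25% to 50%, even though the evidence said nothing directly about either man.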
Beliefs as Joint Probability Distributions
The key to the power of Bayesian inference is that, once we know a rational agent’s prior beliefs, we know exactly how they should update their belief in one thing (the culprit), after learning another thing (the Duchess is dead).
Inferring one thing from another thing is only possible here because these things are correlated. And these correlations only exist because the prior probability distribution covers beliefs about combinations of propositions, not just individual propositions.
Holmes’ prior beliefs are not simply that there is a 50% chance that the Duchess is dead or there is a 75% chance that the Duke did it. If his beliefs were so simple, learning that the Duchess was dead would not tell Holmes anything about the culprit.
Rather, he assigns different probabilities to different combinations, such as the Duchess is Dead and the Duke did it. His beliefs form a joint probability distribution that encodes the knowledge about the correlations between propositions that enables Holmes to make inferences.
Part 2: The Math
Conditional Probability
Before discovering the Duchess’s body, we can calculate what Holmes’ beliefs would be if he learned that the Duchess was definitely alive or dead. The probability that the Duke/Count is the culprit given that the Duchess is Alive/Dead is called a conditional probability.
Conditional probabilities are written in the form $P(H \vert E)$. The evidence $E$ is whatever new information might be learned (e.g. the Duchess is Dead), and the hypothesis $H$ is any other proposition of interest (e.g. the Duke did it).
The conditional probability of some Hypothesis $H$ given some piece of Evidence $E$ can be calculated using the following formula:
$$ \begin{aligned} P(H \vert E) &= \frac{P(H, E)}{P(E)} \end{aligned} $$
Where $P(H, E)$ is the total prior probability of all possibilities where both the hypothesis and the evidence are true, and $P(E)$ is the total probability of all possibilities where the evidence is true.
For example, referring back to Holmes’ prior probability table, you can see that
$$ \begin{aligned} P(Dead+Duke) &= 25\% \cr P(Dead) &= P(Dead+Duke) + P(Dead+Count) \cr &= 25\% + 25\% \cr &= 50\% \end{aligned} $$
So:
$$ \begin{aligned} P(Duke|Dead) &= \frac{P(Dead+Duke)}{P(Dead)}\cr &= \frac{25\%}{50\%} \cr &= 50\% \end{aligned} $$
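The conditional probability formula can be checked directly against the prior table. A sketch (the `prob` and `conditional` helpers are my own hypothetical names):

```python
from fractions import Fraction

prior = {
    ("Duke", "Alive"): Fraction(1, 2),
    ("Duke", "Dead"): Fraction(1, 4),
    ("Count", "Dead"): Fraction(1, 4),
}

def prob(beliefs, predicate):
    """Total prior probability of all possibilities satisfying the predicate."""
    return sum(p for k, p in beliefs.items() if predicate(k))

def conditional(beliefs, hypothesis, evidence):
    """P(H | E) = P(H, E) / P(E)."""
    joint = prob(beliefs, lambda k: hypothesis(k) and evidence(k))
    return joint / prob(beliefs, evidence)

duke = lambda s: s[0] == "Duke"
dead = lambda s: s[1] == "Dead"

print(conditional(prior, duke, dead))  # Fraction(1, 2), i.e. 50%
```

Note that no updating has happened here: the conditional probability is computed entirely from the prior distribution.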
Posterior Belief Formula
It is a common convention to represent prior beliefs (before learning some piece of new information) as $P$, and posterior beliefs (after learning new information) as $P’$. $P’$ represents a whole new probability distribution, generated from $P$ by eliminating all possibilities incompatible with the evidence and scaling the remaining probabilities so they sum to 1.
For example, let’s say Holmes’ beliefs after finding the Duchess’s body are $P’$.
Now we don’t actually have to calculate all of $P’$ if all we want to know is $P’(Duke)$. Instead, we can use the conditional probability formula above to calculate $P(Duke \vert Dead)$. That is, the posterior belief, $P’(Duke)$, is equal to the prior conditional belief given the Duchess is dead, $P(Duke \vert Dead)$.
$$ P’(Duke) = P(Duke|Dead) $$
Or more generally
$$ \begin{aligned} P’(H) &= P(H|E)\cr &= \frac{P(H, E)}{P(E)} \end{aligned} $$
This three-part formula is a useful one to memorize. Note that the left-hand side is a posterior probability. The middle formula is the notation for the conditional probability of the hypothesis given the evidence. This is defined in terms of the prior probability distribution. And the right-hand side is the formula for calculating the conditional probability.
Bayes Theorem
So far, we have engaged in Bayesian inference without using the famous Bayes’ Theorem. Bayes’ theorem is not actually necessary for Bayesian inference, and the use of Bayes’ theorem should not be conflated with Bayesian inference.
However, now that we’ve got this far, the derivation of Bayes’ theorem is simple. We just need to observe that the formula for conditional probability can be applied “in reverse” to define the probability of the evidence given the hypothesis.
$$ \begin{aligned} P(H|E) &= \frac{P(H, E)}{P(E)}\cr P(E|H) &= \frac{P(E, H)}{P(H)}\cr \end{aligned} $$
Rearranging these formulas, we have:
$$ \begin{aligned} P(H|E)P(E) &= P(H, E)\cr P(E|H)P(H) &= P(E, H)\cr \end{aligned} $$
But $P(H, E) = P(E, H)$, and so:
$$ P(H|E)P(E) = P(E|H)P(H) $$
Which we can rearrange to get Bayes’ theorem:
$$ P(H|E) = \frac{P(E|H)P(H)}{P(E)} $$
So Bayes’ Theorem is just an alternative formula for calculating conditional probability.
$$ \begin{aligned} P(H|E) &= \frac{P(H, E)}{P(E)}\cr &= \frac{P(E|H)P(H)}{P(E)} \end{aligned} $$
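Both routes to the conditional probability can be checked numerically against Holmes’ prior. A sketch (same hypothetical `prob` and `conditional` helpers as above):

```python
from fractions import Fraction

prior = {
    ("Duke", "Alive"): Fraction(1, 2),
    ("Duke", "Dead"): Fraction(1, 4),
    ("Count", "Dead"): Fraction(1, 4),
}

def prob(beliefs, predicate):
    """Total probability of all possibilities satisfying the predicate."""
    return sum(p for k, p in beliefs.items() if predicate(k))

def conditional(beliefs, hypothesis, evidence):
    """P(H | E) = P(H, E) / P(E)."""
    joint = prob(beliefs, lambda k: hypothesis(k) and evidence(k))
    return joint / prob(beliefs, evidence)

duke = lambda s: s[0] == "Duke"
dead = lambda s: s[1] == "Dead"

# Direct route: the conditional probability formula.
lhs = conditional(prior, duke, dead)
# Bayes' theorem route: P(E|H) P(H) / P(E).
rhs = conditional(prior, dead, duke) * prob(prior, duke) / prob(prior, dead)

print(lhs, rhs)  # both are Fraction(1, 2)
```

Here $P(Dead \vert Duke) = (1/4)/(3/4) = 1/3$, so the Bayes route gives $(1/3)(3/4)/(1/2) = 1/2$, agreeing with the direct calculation.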
Using either of these formulas is just a shortcut. Although theoretically Bayesian inference involves updating our entire posterior probability distribution to $P’$ after learning some evidence, usually we are interested in a single hypothesis and just want to know what $P’(H)$ is. We can calculate this without calculating the entire posterior by calculating $P(H \vert E)$ using one of the formulas above.
Summary
So here’s a summary of the principle of Bayesian inference:
- Start with prior beliefs as a joint probability distribution
- Eliminate possibilities inconsistent with new evidence
- Reallocate probability to remaining possibilities such that they sum to 100%
- Update beliefs sequentially by eliminating possibilities as new evidence is learned
- Make inferences by simply calculating the total posterior probability of the hypothesis given the evidence using the conditional probability formula or Bayes theorem