“When you have eliminated the impossible, all that remains, no matter how improbable, must be the truth.”

– Sherlock Holmes (Arthur Conan Doyle)

For a long time Bayesian inference was something I understood without really understanding it. I only really *got it* after reading Chapter 2 of John K. Kruschke’s textbook *Doing Bayesian Data Analysis*, where he describes Bayesian Inference as *Reallocation of Credibility Across Possibilities*.

I now understand Bayesian Inference to be essentially Sherlock Holmes’ pithy statement about eliminating the impossible quoted above, taken to its mathematical conclusion. This article is my own attempt to elucidate this idea. If this essay doesn’t do the trick, you might try *Bayesian Reasoning for Intelligent People* by Simon DeDeo or Kruschke’s *Bayesian data analysis for newcomers*.

## Prior Beliefs

A Bayesian reasoner starts out with a set of beliefs, which are modeled as a probability distribution. You can think of the probability distribution as simply a list of **possibilities**, each of which has a **probability**. These possibilities are mutually exclusive and their probabilities sum to 100%.

For example, suppose you pick a card at random from a 52-card deck. There are 52 possibilities, each with a probability of 1/52, and the total adds up to 1, or 100%.

## Revising Beliefs

The Bayesian reasoner’s beliefs before acquiring new information are called **prior** beliefs.

Bayesian reasoners then revise their beliefs when they acquire new information/evidence according to very simple rules:

- reject any possibilities that are incompatible with the evidence
- reallocate probability to the remaining possibilities so that they sum to 100%

A Bayesian reasoner’s beliefs **after** acquiring new evidence are called the **posterior** beliefs.

For example, if you find out that the card is a heart, you must eliminate all non-hearts, and reallocate the remaining probability to the remaining 13 cards. These each now have a probability of 1/13, and the total still adds up to 1.
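
The two rules can be sketched in a few lines of Python. This is only an illustration: it assumes beliefs are stored as a dictionary mapping each possibility to its probability, and the names `update` and `is_compatible` are made up for this example rather than taken from any library.

```python
from fractions import Fraction

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "diamonds", "clubs", "spades"]

# Prior: 52 mutually exclusive possibilities, each with probability 1/52.
prior = {(rank, suit): Fraction(1, 52) for rank in RANKS for suit in SUITS}


def update(beliefs, is_compatible):
    """Reject possibilities incompatible with the evidence, then rescale the rest."""
    remaining = {h: p for h, p in beliefs.items() if is_compatible(h)}
    total = sum(remaining.values())
    return {h: p / total for h, p in remaining.items()}


# Evidence: the card is a heart.
posterior = update(prior, lambda card: card[1] == "hearts")

print(len(posterior))              # 13
print(posterior[("Q", "hearts")])  # 1/13
print(sum(posterior.values()))     # 1
```

Using `Fraction` instead of floating-point numbers keeps the probabilities exact, so the posterior really does sum to exactly 1.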

## That’s All It Is

As simple as this seems, that’s all Bayesian inference is. This doesn’t seem all that powerful or impressive, but that’s just because this example probability distribution is so simple. It lacks interesting correlations. A card’s suit for example is not correlated with its rank: the prior probability of a queen was 1/13, and the posterior is also 1/13. Learning that the card was a heart doesn’t tell us much except that the card is a heart.
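
The lack of correlation between suit and rank can be checked by counting. The sketch below assumes a standard deck represented as `(rank, suit)` pairs; the variable names are illustrative only.

```python
from fractions import Fraction

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "diamonds", "clubs", "spades"]
deck = [(rank, suit) for rank in RANKS for suit in SUITS]

# Prior probability of a queen: 4 queens among 52 equally likely cards.
queens_in_deck = sum(1 for rank, _suit in deck if rank == "Q")
p_queen_prior = Fraction(queens_in_deck, len(deck))

# After learning the card is a heart, only the 13 hearts remain,
# and exactly one of them is a queen.
hearts = [(rank, suit) for rank, suit in deck if suit == "hearts"]
queens_in_hearts = sum(1 for rank, _suit in hearts if rank == "Q")
p_queen_posterior = Fraction(queens_in_hearts, len(hearts))

print(p_queen_prior, p_queen_posterior)  # 1/13 1/13
```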

Bayesian inference becomes powerful when the prior beliefs contain interesting correlations, so that learning about one thing tells you about something else, sometimes in counter-intuitive ways.

## The Case of the Disappearing Duchess

Let’s look at another example. Suppose that Sherlock Holmes is investigating the Case of the Disappearing Duchess, and believes that there is:

- a 50% chance that the Duke has kidnapped the Duchess and is holding her alive in captivity
- a 25% chance that the Duke has murdered her
- and a 25% chance that she has been murdered by the Count

The total of the probabilities is 100%. Holmes’ prior beliefs can be summarized in the table below.

**Holmes’ Prior Beliefs**

Culprit | Status | Probability |
---|---|---|
Duke | Alive | 50% |
Duke | Dead | 25% |
Count | Dead | 25% |
TOTAL | | 100% |

If the Duchess’s body is subsequently found buried under the Atrium, Holmes must eliminate the possibility that she is being held alive by the Duke. He must then reallocate probability among the remaining two scenarios in which the Duchess is dead.

**Holmes’ Posterior Beliefs**

Culprit | Status | Probability |
---|---|---|
Duke | Alive | 0% (eliminated) |
Duke | Dead | 50% |
Count | Dead | 50% |
TOTAL | | 100% |
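
Holmes’ update can be reproduced with a short Python sketch. It assumes each scenario is stored as a `(culprit, status)` pair; the helper `update` is a hypothetical name used only for this illustration.

```python
# Holmes' prior beliefs: (culprit, status) -> probability
prior = {
    ("Duke", "Alive"): 0.50,
    ("Duke", "Dead"): 0.25,
    ("Count", "Dead"): 0.25,
}


def update(beliefs, is_compatible):
    """Eliminate scenarios that contradict the evidence, then rescale to sum to 1."""
    remaining = {h: p for h, p in beliefs.items() if is_compatible(h)}
    total = sum(remaining.values())
    return {h: p / total for h, p in remaining.items()}


# Evidence: the Duchess's body is found, so only "Dead" scenarios survive.
posterior = update(prior, lambda scenario: scenario[1] == "Dead")

print(posterior)
# {('Duke', 'Dead'): 0.5, ('Count', 'Dead'): 0.5}
```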

## Reallocation of Probability Mass

Another way of looking at this is that the 50% **probability mass** previously allocated to the eliminated possibility is **reallocated** to the remaining possibilities. You can visualize this if we plot the prior and posterior probabilities as bar charts. The total volume of the bars in each of the two charts below is the same: 100%. The 50% probability mass that was previously allocated to the first possibility in Holmes’ prior beliefs is reallocated to the remaining two possibilities in his posterior beliefs.

## Sequential Updating

Suppose Holmes subsequently finds evidence that exonerates the Count. To update Holmes’ beliefs again, we repeat the process.

First, the posterior after the first piece of evidence (the fact that the Duchess was found dead) becomes the new prior, as shown in the left-hand chart below.

Next, we eliminate possibility #3 (she was murdered by the Count). This time, since only one possibility remains, all probability mass is reallocated to this possibility, as shown in the right-hand chart below.
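
Sequential updating amounts to feeding each posterior back in as the next prior. The following is a minimal sketch under the same assumed representation (a dictionary of `(culprit, status)` scenarios); `update` is again an illustrative helper, not a library function.

```python
def update(beliefs, is_compatible):
    """Eliminate scenarios that contradict the evidence, then rescale to sum to 1."""
    remaining = {h: p for h, p in beliefs.items() if is_compatible(h)}
    total = sum(remaining.values())
    return {h: p / total for h, p in remaining.items()}


prior = {
    ("Duke", "Alive"): 0.50,
    ("Duke", "Dead"): 0.25,
    ("Count", "Dead"): 0.25,
}

# First piece of evidence: the Duchess is found dead.
after_body_found = update(prior, lambda s: s[1] == "Dead")

# Second piece of evidence: the Count is exonerated.
# The previous posterior is used as the new prior.
after_count_exonerated = update(after_body_found, lambda s: s[0] != "Count")

print(after_body_found)        # {('Duke', 'Dead'): 0.5, ('Count', 'Dead'): 0.5}
print(after_count_exonerated)  # {('Duke', 'Dead'): 1.0}
```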

## All That Remains

So what happens now that we’ve reached a point where there are no more possibilities to eliminate? At this point, no more inferences can be made. There is nothing more to learn – at least with respect to the Case of the Disappearing Duchess. Holmes has eliminated the impossible and the remaining possibility *must* be the truth.

It’s not always possible to eliminate all uncertainty such that only one possibility remains. But Bayesian inference can be thought of as the process of reducing uncertainty by eliminating the impossible, and then reallocating probability mass.

When you have eliminated the impossible, the probability of all that remains, no matter how improbable, must sum to 100%

– Sherlock Thomas Bayes Holmes (Jonathan Warden)

## Updating Beliefs based on Evidence

What makes Bayesian inference so powerful is that learning about one thing can shift beliefs in other things in counter-intuitive ways.

For example, learning that the Duchess is dead *decreased* the probability that the Duke did it (from 75% to 50%), and *increased* the probability that the Count did it (from 25% to 50%). You should be able to convince yourself that this is the case using common-sense logic. But the rules of Bayesian inference give you a systematic way to reach that conclusion.

This is demonstrated visually in the four charts below. The first row of charts we have already seen: they show Holmes’ priors on the left, and his posteriors after learning that the Duchess is dead on the right.

The second row of charts shows the same probabilities, but this time the charts show the *marginal probabilities* (i.e. the totals) for each possible culprit. The Duke is the culprit in two different scenarios in the priors, so the total prior probability for the Duke is 50% + 25% = 75%. The total prior probability for the Count is 25%.

After eliminating the Alive+Duke scenario, the remaining probability mass for the Duke and for the Count is 25% each – but these are then scaled up to 50% each so that they sum to 100%, as shown in the bottom-right chart. The net result is a decreased total probability for the Duke and an increased total probability for the Count.
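
The marginal totals for each culprit can be computed by summing over scenarios. This sketch assumes the prior and posterior tables from earlier are stored as dictionaries; `marginal_by_culprit` is a hypothetical helper name.

```python
from collections import defaultdict


def marginal_by_culprit(beliefs):
    """Sum the probability of every scenario that names the same culprit."""
    totals = defaultdict(float)
    for (culprit, _status), p in beliefs.items():
        totals[culprit] += p
    return dict(totals)


prior = {
    ("Duke", "Alive"): 0.50,
    ("Duke", "Dead"): 0.25,
    ("Count", "Dead"): 0.25,
}

# Posterior after learning that the Duchess is dead.
posterior = {
    ("Duke", "Dead"): 0.50,
    ("Count", "Dead"): 0.50,
}

print(marginal_by_culprit(prior))      # {'Duke': 0.75, 'Count': 0.25}
print(marginal_by_culprit(posterior))  # {'Duke': 0.5, 'Count': 0.5}
```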

## Beliefs as Joint Probability Distributions

The key to the power of Bayesian inference is that, once we know a rational agent’s prior beliefs, we know exactly how they should update their belief in one thing (the culprit), after learning another thing (the Duchess is dead).

Inferring one thing from another thing is only possible here because these things are correlated. And these correlations only exist because the prior probability distribution covers beliefs about **combinations** of propositions, not just individual propositions.

Holmes’ prior beliefs are not simply that *there is a 50% chance that the Duchess is dead* or *there is a 75% chance that the Duke did it*. If his beliefs were so simple, learning that the Duchess was dead would not tell Holmes anything about the culprit.

Rather, he assigns different probabilities to different **combinations** such as *the Duchess is Dead and the Duke did it*. His beliefs form a **joint probability distribution** that encodes the knowledge about the correlations between propositions that enables Holmes to make inferences.
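
To see why the joint distribution matters, compare Holmes’ actual prior with a hypothetical reasoner whose beliefs about the culprit and about the Duchess’s status are unrelated. The sketch below assumes that "unrelated" means each combination's probability is the product of the two separate probabilities; that construction, and the names used, are illustrative and not part of Holmes’ story.

```python
# Hypothetical reasoner: separate, uncorrelated beliefs.
p_culprit = {"Duke": 0.75, "Count": 0.25}
p_status = {"Alive": 0.5, "Dead": 0.5}
independent_prior = {
    (c, s): pc * ps
    for c, pc in p_culprit.items()
    for s, ps in p_status.items()
}

# Holmes' actual prior: a joint distribution with correlations built in.
holmes_prior = {
    ("Duke", "Alive"): 0.50,
    ("Duke", "Dead"): 0.25,
    ("Count", "Dead"): 0.25,
}


def culprit_given_status(beliefs, culprit, status):
    """Probability of the culprit after keeping only scenarios with this status."""
    p_both = beliefs.get((culprit, status), 0.0)
    p_status_total = sum(p for (_c, s), p in beliefs.items() if s == status)
    return p_both / p_status_total


# Without correlations, learning "Dead" leaves the Duke at 75%.
print(culprit_given_status(independent_prior, "Duke", "Dead"))  # 0.75
# With Holmes' joint prior, the same evidence moves the Duke to 50%.
print(culprit_given_status(holmes_prior, "Duke", "Dead"))       # 0.5
```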

## Conditional Probability

Before discovering the Duchess’s body, we can calculate what Holmes’ beliefs **would** be if he learned that the Duchess was definitely alive or dead. The probability that the Duke/Count is the culprit **given** the Duchess is Alive/Dead is called a **conditional** probability.

Conditional probabilities are written in the form $P(Hypothesis \vert Evidence)$. $Evidence$ is whatever new information might be learned (e.g. the *Duchess is Dead*), and $Hypothesis$ is any other proposition of interest (e.g. the *Duke did it* or the *Count did it*).

The conditional probability of some Hypothesis given some piece of Evidence can be calculated using the following formula:

$$ \begin{aligned} P(Hypothesis \vert Evidence) &= \frac{P(Hypothesis, Evidence)}{P(Evidence)} \end{aligned} $$

Where $P(Hypothesis, Evidence)$ is the **total prior probability** of all possibilities where both the evidence and the hypothesis are true, and $P(Evidence)$ is the total prior probability of all possibilities where the evidence is true.

For example, referring back to Holmes’ prior probability table, you can see that

$$ \begin{aligned} P(Dead+Duke) &= 25\% \cr P(Dead) &= P(Dead+Duke) + P(Dead+Count) \cr &= 25\% + 25\% \cr &= 50\% \end{aligned} $$

So:

$$ \begin{aligned} P(Duke|Dead) &= \frac{P(Dead+Duke)}{P(Dead)}\cr &= \frac{25\%}{50\%} \cr &= 50\% \end{aligned} $$
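
The same calculation can be written as a small Python function. This is a sketch that assumes hypotheses and evidence are expressed as true/false tests on a `(culprit, status)` scenario; the function and predicate names are invented for the example.

```python
prior = {
    ("Duke", "Alive"): 0.50,
    ("Duke", "Dead"): 0.25,
    ("Count", "Dead"): 0.25,
}


def probability(beliefs, predicate):
    """Total probability of all scenarios for which the predicate is true."""
    return sum(p for scenario, p in beliefs.items() if predicate(scenario))


def conditional(beliefs, hypothesis, evidence):
    """P(Hypothesis | Evidence) = P(Hypothesis, Evidence) / P(Evidence)."""
    p_joint = probability(beliefs, lambda s: hypothesis(s) and evidence(s))
    p_evidence = probability(beliefs, evidence)
    return p_joint / p_evidence


def is_duke(scenario):
    return scenario[0] == "Duke"


def is_count(scenario):
    return scenario[0] == "Count"


def is_dead(scenario):
    return scenario[1] == "Dead"


def is_alive(scenario):
    return scenario[1] == "Alive"


print(conditional(prior, is_duke, is_dead))   # 0.5
print(conditional(prior, is_count, is_dead))  # 0.5
print(conditional(prior, is_duke, is_alive))  # 1.0
```

Note that `conditional` divides by $P(Evidence)$, so it is only defined for evidence that has non-zero prior probability.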

## Posterior Belief Formula

It is a common convention to represent **prior** beliefs (before learning some piece of new information) as $P$, and **posterior** beliefs (after learning new information) as $P'$. $P'$ represents a whole new probability distribution, generated from $P$ by eliminating all possibilities incompatible with the evidence and scaling the remaining probabilities so they sum to 1.

For example, let’s say Holmes’ beliefs after finding the Duchess’s body are $P'$.

Now we don’t actually have to calculate all of $P'$ if all we want to know is $P'(Duke)$. Instead, we can use the conditional probability formula above to calculate $P(Duke \vert Dead)$. That is, the posterior belief, $P'(Duke)$, is equal to the prior *conditional* belief *given* the Duchess is dead, $P(Duke \vert Dead)$.

$$ P'(Duke) = P(Duke \vert Dead) $$

Or more generally

$$ \begin{aligned} P'(Hypothesis) &= P(Hypothesis \vert Evidence) \cr &= \frac{P(Hypothesis, Evidence)}{P(Evidence)} \end{aligned} $$

This three-part formula is a useful one to memorize. Note that the left-hand side is a *posterior* probability. The middle expression is the notation for the *conditional* probability of the hypothesis given the evidence. And the right-hand side lets us calculate the posterior probability of any hypothesis given any piece of evidence in terms of the prior probability distribution.
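
As a sanity check, the two routes to the posterior can be compared numerically. The sketch below, using illustrative variable names, first builds the whole posterior distribution by elimination and rescaling, and then computes the conditional probability directly from the prior.

```python
prior = {
    ("Duke", "Alive"): 0.50,
    ("Duke", "Dead"): 0.25,
    ("Count", "Dead"): 0.25,
}

# Route 1: build the entire posterior distribution, then total the Duke's share.
remaining = {s: p for s, p in prior.items() if s[1] == "Dead"}
total = sum(remaining.values())
posterior = {s: p / total for s, p in remaining.items()}
p_posterior_duke = sum(p for (culprit, _), p in posterior.items() if culprit == "Duke")

# Route 2: conditional probability computed directly from the prior.
p_duke_and_dead = sum(
    p for (culprit, status), p in prior.items()
    if culprit == "Duke" and status == "Dead"
)
p_dead = sum(p for (_, status), p in prior.items() if status == "Dead")
p_duke_given_dead = p_duke_and_dead / p_dead

print(p_posterior_duke, p_duke_given_dead)  # 0.5 0.5
```

Route 2 never constructs the full posterior, which is why the conditional probability formula is the convenient one when only a single hypothesis is of interest.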

## Summary

So far, we have engaged in Bayesian inference without using the famous Bayes’ Theorem. Bayes’ theorem is not actually necessary for Bayesian inference, and conflating the use of Bayes’ theorem with Bayesian inference can interfere with an understanding of the more fundamental principle of Bayesian inference as reallocation of probabilities.

So here’s a summary of the principle of Bayesian inference:

- Start with prior beliefs as a joint probability distribution
- Eliminate possibilities inconsistent with new evidence
- Reallocate probability to remaining possibilities such that they sum to 100%
- Update beliefs sequentially by eliminating possibilities as new evidence is learned
- Make inferences by simply calculating the total posterior probability of the hypothesis given the evidence using the conditional probability formula