“When you have eliminated the impossible, all that remains, no matter how improbable, must be the truth.”
– Sherlock Holmes (Arthur Conan Doyle)
For a long time Bayesian inference was something I understood without really understanding it. I only really got it after reading Chapter 2 of John K. Kruschke’s textbook Doing Bayesian Data Analysis, where he describes Bayesian inference as Reallocation of Credibility Across Possibilities.
I now understand Bayesian Inference to be essentially a mathematical generalization of Sherlock Holmes’ pithy statement about eliminating the impossible. This article is my own attempt to elucidate this idea. If this essay doesn’t do the trick, you might try Bayesian Reasoning for Intelligent People by Simon DeDeo or Kruschke’s Bayesian data analysis for newcomers.
This essay has two parts. Part 1 is a purely visual explanation of the concept of Bayesian inference, without any math, while Part 2 translates these ideas into mathematical formulas, including the famous Bayes’ theorem.
Part 1: The Concept
Prior Beliefs
A Bayesian reasoner starts out with a set of beliefs, which are modeled as a probability distribution. You can think of the probability distribution as simply a list of possibilities, each of which has a probability. These possibilities are mutually exclusive and their probabilities sum to 100%.
For example, suppose you pick a card at random from a 52-card deck. There are 52 possibilities, each with a probability of 1/52, and the total adds up to 1, or 100%.
Revising Beliefs
The Bayesian reasoner’s beliefs before acquiring new information are called prior beliefs.
Bayesian reasoners then revise their beliefs when they acquire new information (evidence), according to two very simple rules:
- reject any possibilities that are incompatible with the evidence
- reallocate probability to the remaining possibilities so that they sum to 100%
A Bayesian reasoner’s beliefs after acquiring new evidence are called the posterior beliefs.
For example, if you find out that the card is a heart, you must eliminate all non-hearts, and reallocate the remaining probability to the remaining 13 cards. Each card now has a probability of 1/13, and the total still adds up to 1.
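This update rule is mechanical enough to sketch in a few lines of code. Here is a minimal illustration in Python (the `update` helper and the card representation are my own, not from the original example), using exact fractions so the probabilities stay precise:

```python
from fractions import Fraction

# Build the prior: a uniform distribution over all 52 cards.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
prior = {(rank, suit): Fraction(1, 52) for rank in ranks for suit in suits}

def update(beliefs, is_compatible):
    """Reject possibilities incompatible with the evidence,
    then rescale the rest so they sum to 1."""
    kept = {k: p for k, p in beliefs.items() if is_compatible(k)}
    total = sum(kept.values())
    return {k: p / total for k, p in kept.items()}

# Evidence: the card is a heart.
posterior = update(prior, lambda card: card[1] == "hearts")

print(len(posterior))              # 13 possibilities remain
print(posterior[("Q", "hearts")])  # each now has probability 1/13
```

The whole of Bayesian updating, for a finite list of possibilities, is just this filter-and-rescale step.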
That’s All It Is
As simple as this seems, that’s all Bayesian inference is. This doesn’t seem all that powerful or impressive, but that’s just because this example probability distribution is so simple. It lacks interesting correlations. A card’s suit for example is not correlated with its rank: the prior probability of a queen was 1/13, and the posterior is also 1/13. Learning that the card was a heart doesn’t tell us much except that the card is a heart.
Bayesian inference becomes powerful when the prior beliefs contain interesting correlations, so that learning about one thing tells you about something else, sometimes in counter-intuitive ways.
The Case of the Disappearing Duchess
Let’s look at another example. Suppose that Sherlock Holmes is investigating the Case of the Disappearing Duchess, and believes that there is:
- a 50% chance that the Duke has kidnapped the Duchess and is holding her alive in captivity
- a 25% chance that the Duke has murdered her
- and a 25% chance that she has been murdered by the Count
The total of the probabilities is 100%. Holmes’ prior beliefs can be summarized in the table below.
Holmes’ Prior Beliefs
Culprit | Status | Probability |
---|---|---|
Duke | Alive | 50% |
Duke | Dead | 25% |
Count | Dead | 25% |
TOTAL | | 100% |
If the Duchess’s body is subsequently found buried under the Atrium, Holmes must eliminate the possibility that she is being held alive by the Duke. He must then reallocate probability among the remaining two scenarios in which the Duchess is dead.
Holmes’ Posterior Beliefs
Culprit | Status | Probability |
---|---|---|
Duke | Alive | 0% (eliminated) |
Duke | Dead | 50% |
Count | Dead | 50% |
TOTAL | 100% |
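The same filter-and-rescale step reproduces Holmes’ posterior table. A minimal sketch (the scenario encoding and `update` helper are my own, hypothetical names):

```python
from fractions import Fraction

# Holmes' prior beliefs over (culprit, status) scenarios.
prior = {
    ("Duke", "Alive"): Fraction(1, 2),
    ("Duke", "Dead"): Fraction(1, 4),
    ("Count", "Dead"): Fraction(1, 4),
}

def update(beliefs, is_compatible):
    """Reject incompatible possibilities, then rescale the rest to sum to 1."""
    kept = {k: p for k, p in beliefs.items() if is_compatible(k)}
    total = sum(kept.values())
    return {k: p / total for k, p in kept.items()}

# Evidence: the Duchess's body is found, so she is dead.
posterior = update(prior, lambda scenario: scenario[1] == "Dead")

print(posterior)
# {('Duke', 'Dead'): Fraction(1, 2), ('Count', 'Dead'): Fraction(1, 2)}
```

The two remaining scenarios each end up at 50%, matching the posterior table above.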
Reallocation of Probability Mass
Another way of looking at this is that the 50% probability mass previously allocated to the eliminated possibility is reallocated to the remaining possibilities. You can visualize this if we plot the prior and posterior probabilities as bar charts. The total volume of the bars in each of the two charts below is the same: 100%. The 50% probability mass that was previously allocated to the first possibility in Holmes’ prior beliefs is reallocated to the remaining two possibilities in his posterior beliefs.
Sequential Updating
Suppose Holmes subsequently finds evidence that exonerates the Count. To update Holmes’ beliefs again, we repeat the process.
First, the posterior after the first piece of evidence (the fact that the Duchess was found dead) becomes the new prior, as shown in the left-hand chart below.
Next, we eliminate possibility #3 (she was murdered by the Count). This time, since only one possibility remains, all probability mass is reallocated to this possibility, as shown in the right-hand chart below.
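Sequential updating is just the same step applied twice, with each posterior serving as the next prior. A sketch under the same hypothetical encoding as before:

```python
from fractions import Fraction

prior = {
    ("Duke", "Alive"): Fraction(1, 2),
    ("Duke", "Dead"): Fraction(1, 4),
    ("Count", "Dead"): Fraction(1, 4),
}

def update(beliefs, is_compatible):
    """Reject incompatible possibilities, then rescale the rest to sum to 1."""
    kept = {k: p for k, p in beliefs.items() if is_compatible(k)}
    total = sum(kept.values())
    return {k: p / total for k, p in kept.items()}

# First piece of evidence: the Duchess is dead.
after_body_found = update(prior, lambda s: s[1] == "Dead")

# Second piece of evidence: the Count is exonerated.
after_exoneration = update(after_body_found, lambda s: s[0] != "Count")

print(after_exoneration)
# {('Duke', 'Dead'): Fraction(1, 1)} -- one possibility remains, with probability 1
```

Once a single possibility holds all the probability mass, further evidence (about this case) can change nothing.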
All That Remains
So what happens now that we’ve reached a point where there are no more possibilities to eliminate? At this point, no more inferences can be made. There is nothing more to learn – at least with respect to the Case of the Disappearing Duchess. Holmes has eliminated the impossible and the remaining possibility must be the truth.
It’s not always possible to eliminate all uncertainty such that only one possibility remains. But Bayesian inference can be thought of as the process of reducing uncertainty as much as possible by eliminating the impossible, and then reallocating the probability mass to the possibilities that remain.
When you have eliminated the impossible, the probability of all that remains, no matter how improbable, must sum to 100%.
– Sherlock Thomas Bayes Holmes
Updating Beliefs based on Evidence
What makes Bayesian inference so powerful is that learning about one thing can shift beliefs in other things in counter-intuitive ways.
For example, learning that the Duchess is dead decreased the probability that the Duke did it (from 75% to 50%), and increased the probability that the Count did it (from 25% to 50%). You should be able to convince yourself that this is the case using common sense logic. But the rules of Bayesian inference give you a systematic way to come to the conclusion.
This is demonstrated visually in the four charts below. The first row of charts we have already seen: they show Holmes’ priors on the left, and his posteriors after learning that the Duchess is dead on the right.
The second row of charts show the same probabilities, but this time the charts show the marginal probabilities (e.g. the totals) for each possible culprit. The Duke is the culprit in two different scenarios in the priors, so the total prior probability for the Duke is 50% + 25% = 75%. The total prior probability for the Count is 25%.
After eliminating the Alive+Duke scenario, the total probability is 50% for both the Duke and the Count, as shown in the bottom-right chart. And so the probability decreased for the Duke and increased for the Count.
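Computing those marginal totals is a simple sum over scenarios sharing a culprit. A minimal sketch (the `marginal_culprit` helper is hypothetical, not from the original):

```python
from fractions import Fraction

# Holmes' prior beliefs over (culprit, status) scenarios.
prior = {
    ("Duke", "Alive"): Fraction(1, 2),
    ("Duke", "Dead"): Fraction(1, 4),
    ("Count", "Dead"): Fraction(1, 4),
}

# His posterior beliefs after the Duchess's body is found.
posterior = {
    ("Duke", "Dead"): Fraction(1, 2),
    ("Count", "Dead"): Fraction(1, 2),
}

def marginal_culprit(beliefs):
    """Sum probability over every scenario sharing the same culprit."""
    totals = {}
    for (culprit, status), p in beliefs.items():
        totals[culprit] = totals.get(culprit, Fraction(0)) + p
    return totals

print(marginal_culprit(prior))      # Duke: 3/4, Count: 1/4
print(marginal_culprit(posterior))  # Duke: 1/2, Count: 1/2
```

The Duke’s marginal probability falls from 75% to 50% while the Count’s rises from 25% to 50%, even though the evidence said nothing directly about either man.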
Beliefs as Joint Probability Distributions
The key to the power of Bayesian inference is that, once we know a rational agent’s prior beliefs, we know exactly how they should update their belief in one thing (the culprit), after learning another thing (the Duchess is dead).
Inferring one thing from another thing is only possible here because these things are correlated. And these correlations only exist because the prior probability distribution covers beliefs about combinations of propositions, not just individual propositions.
Holmes’ prior beliefs are not simply that there is a 50% chance that the Duchess is dead or there is a 75% chance that the Duke did it. If his beliefs were so simple, learning that the Duchess was dead would not tell Holmes anything about the culprit.
Rather, he assigns different probabilities to different combinations, such as the Duchess is Dead and the Duke did it. His beliefs form a joint probability distribution that encodes the knowledge about the correlations between propositions that enables Holmes to make inferences.
Part 2: The Math
Conditional Probability
Before discovering the Duchess’s body, we can calculate what Holmes’ beliefs would be if he learned that the Duchess was definitely alive or dead. The probability that the Duke/Count is the culprit given that the Duchess is Alive/Dead is called a conditional probability.
Conditional probabilities are written in the form $P(H \vert E)$. The evidence $E$ is whatever new information might be learned (e.g. the Duchess is Dead), and the hypothesis $H$ is any other proposition of interest (e.g. the Duke did it).
The conditional probability of some Hypothesis $H$ given some piece of Evidence $E$ can be calculated using the following formula:
$$ \begin{aligned} P(H \vert E) &= \frac{P(H, E)}{P(E)} \end{aligned} $$
Where $P(H, E)$ is the total prior probability of all possibilities where both the hypothesis and the evidence are true, and $P(E)$ is the total probability of all possibilities where the evidence is true.
For example, referring back to Holmes’ prior probability table, you can see that
$$ \begin{aligned} P(Dead+Duke) &= 25\% \cr P(Dead) &= P(Dead+Duke) + P(Dead+Count) \cr &= 25\% + 25\% \cr &= 50\% \end{aligned} $$
So:
$$ \begin{aligned} P(Duke|Dead) &= \frac{P(Dead+Duke)}{P(Dead)}\cr &= \frac{25\%}{50\%} \cr &= 50\% \end{aligned} $$
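The conditional probability formula can be checked directly against the prior table. A sketch (the `prob` and `conditional` helpers are my own hypothetical names):

```python
from fractions import Fraction

prior = {
    ("Duke", "Alive"): Fraction(1, 2),
    ("Duke", "Dead"): Fraction(1, 4),
    ("Count", "Dead"): Fraction(1, 4),
}

def prob(beliefs, predicate):
    """Total prior probability of all possibilities satisfying the predicate."""
    return sum(p for k, p in beliefs.items() if predicate(k))

def conditional(beliefs, hypothesis, evidence):
    """P(H | E) = P(H, E) / P(E)."""
    joint = prob(beliefs, lambda k: hypothesis(k) and evidence(k))
    return joint / prob(beliefs, evidence)

duke = lambda s: s[0] == "Duke"
dead = lambda s: s[1] == "Dead"

print(conditional(prior, duke, dead))  # Fraction(1, 2), i.e. 50%
```

Note that no updating has happened here: the conditional probability is computed entirely from the prior distribution.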
Posterior Belief Formula
It is a common convention to represent prior beliefs (before learning some piece of new information) as $P$, and posterior beliefs (after learning new information) as $P’$. $P’$ represents a whole new probability distribution, generated from $P$ by eliminating all possibilities incompatible with the evidence and scaling the remaining probabilities so they sum to 1.
For example, let’s say Holmes’ beliefs after finding the Duchess’s body are $P’$.
Now we don’t actually have to calculate all of $P’$ if all we want to know is $P’(Duke)$. Instead, we can use the conditional probability formula above to calculate $P(Duke \vert Dead)$. That is, the posterior belief, $P’(Duke)$, is equal to the prior conditional belief given the Duchess is dead, $P(Duke \vert Dead)$.
$$ P’(Duke) = P(Duke|Dead) $$
Or more generally
$$ \begin{aligned} P’(H) &= P(H|E)\cr &= \frac{P(H, E)}{P(E)} \end{aligned} $$
This three-part formula is a useful one to memorize. Note that the left-hand side is a posterior probability. The middle formula is the notation for the conditional probability of the hypothesis given the evidence. This is defined in terms of the prior probability distribution. And the right-hand side is the formula for calculating the conditional probability.
Bayes Theorem
So far, we have engaged in Bayesian inference without using the famous Bayes’ Theorem. Bayes’ theorem is not actually necessary for Bayesian inference, and the use of Bayes’ theorem should not be conflated with Bayesian inference.
However, now that we’ve got this far, the derivation of Bayes’ theorem is simple. We just need to observe that the formula for conditional probability can be applied “in reverse” to define the probability of the evidence given the hypothesis.
$$ \begin{aligned} P(H|E) &= \frac{P(H, E)}{P(E)}\cr P(E|H) &= \frac{P(E, H)}{P(H)}\cr \end{aligned} $$
Rearranging these formulas, we have:
$$ \begin{aligned} P(H|E)P(E) &= P(H, E)\cr P(E|H)P(H) &= P(E, H)\cr \end{aligned} $$
But $P(H, E) = P(E, H)$, and so:
$$ P(H|E)P(E) = P(E|H)P(H) $$
Which we can rearrange to get Bayes’ theorem:
$$ P(H|E) = \frac{P(E|H)P(H)}{P(E)} $$
So Bayes’ Theorem is just an alternative formula for calculating conditional probability.
$$ \begin{aligned} P(H|E) &= \frac{P(H, E)}{P(E)}\cr &= \frac{P(E|H)P(H)}{P(E)} \end{aligned} $$
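Both routes to the conditional probability can be checked numerically against Holmes’ prior. A sketch (same hypothetical `prob` and `conditional` helpers as above):

```python
from fractions import Fraction

prior = {
    ("Duke", "Alive"): Fraction(1, 2),
    ("Duke", "Dead"): Fraction(1, 4),
    ("Count", "Dead"): Fraction(1, 4),
}

def prob(beliefs, predicate):
    """Total probability of all possibilities satisfying the predicate."""
    return sum(p for k, p in beliefs.items() if predicate(k))

def conditional(beliefs, hypothesis, evidence):
    """P(H | E) = P(H, E) / P(E)."""
    joint = prob(beliefs, lambda k: hypothesis(k) and evidence(k))
    return joint / prob(beliefs, evidence)

duke = lambda s: s[0] == "Duke"
dead = lambda s: s[1] == "Dead"

# Direct route: the conditional probability formula.
lhs = conditional(prior, duke, dead)
# Bayes' theorem route: P(E|H) P(H) / P(E).
rhs = conditional(prior, dead, duke) * prob(prior, duke) / prob(prior, dead)

print(lhs, rhs)  # both are Fraction(1, 2)
```

Here $P(Dead \vert Duke) = (1/4)/(3/4) = 1/3$, so the Bayes route gives $(1/3)(3/4)/(1/2) = 1/2$, agreeing with the direct calculation.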
Using either of these formulas is just a shortcut. Although theoretically Bayesian inference involves updating our entire posterior probability distribution to $P’$ after learning some evidence, usually we are interested in a single hypothesis and just want to know what $P’(H)$ is. We can calculate this without calculating the entire posterior by calculating $P(H \vert E)$ using one of the formulas above.
Summary
So here’s a summary of the principle of Bayesian inference:
- Start with prior beliefs as a joint probability distribution
- Eliminate possibilities inconsistent with new evidence
- Reallocate probability to remaining possibilities such that they sum to 100%
- Update beliefs sequentially by eliminating possibilities as new evidence is learned
- Make inferences by simply calculating the total posterior probability of the hypothesis given the evidence using the conditional probability formula or Bayes theorem