Entropy as a Measure of Uncertainty

Measuring Uncertainty

How do you measure “uncertainty”?

That may seem like an odd question. But let’s just dive right into it, because starting down this path of inquiry will lead us step by step to the definition of the fascinating concept of Shannon entropy.

“I’m 99% Certain”

We can start with one common way people express certainty. You might say “I’m 99% certain it will rain today”. This, of course, implies that you’re 1% uncertain. So one obvious definition of uncertainty is the inverse of certainty, or $1 - p$, where $p$ is certainty expressed as a percentage.

But if you are 99% certain that it will rain, then you are 1% certain that it won’t rain! You are certain about one thing, and uncertain about it’s opposite. So are you certain, or uncertain?

A better definition of uncertainty would make sense no matter how you frame it. We can get this by always taking the most probable outcome to represent certainty. So the belief that there’s a 1% chance of rain also represents 99% certainty.

Although $1 - p$ is a sensible definition of uncertainty, here’s another definition that also decreases as certainty increases:

Definition 1

$$ \text{uncertainty} = \frac{1}{p} $$

Where $p$ is the probability of the most probable outcome. So if you think there’s a 99% chance that it (will/won’t) rain, uncertainty would be $\frac{1}{.99} ≈ 1.01$.

Let’s start with this definition and see where it takes us.

The Number of Possibilities

Now what if there are more than two possibilitiies? Imagine a murder has been committed. We don’t know who did it, so there’s uncertainty. If there are only two people who could have done it (say, Colonel Mustard and Professor Plum) then it seems there’s little uncertainty. With ten possible suspects uncertainty increases. On the other hand if there’s only one person who could have done it then there’s no uncertainty at all.

So another straightforward measure of uncertainty might be the number of possibilities. Or to use the conventional terminology of probability theory, the number of possible outcomes.

Definition 2

$$ \text{uncertainty} = n $$

Where $n$ is the number of possible outcomes.

Probability vs. Possible Outcomes

Okay we have two possible definitions of uncertainty so far. Definition 1 is based on the probability when there are two possible outcomes, and Definition 2 is based on the number of possible outcomes.

There is a relationship between probability and the number of possible outcomes. If there are $n$ equally-probable outcomes, then the probability of each outcome is $p = \frac{1}{n}$. Conversely, $n = \frac{1}{p}$.

Substituting this into Definition 2, we get:

$$ \text{uncertainty} = \frac{1}{p} $$

Which is the same as Definition 1!

So we now have a definition of uncertainty that makes sense both when:

there are multiple equally-probable outcomes
there are only two (not necessarily equally-probable) possible outcomes

Examples:

2 equally-probable outcomes: Uncertainty is $\frac{1}{(1/2)} = 2$.

The uncertainty of $n = 2$ equally-probable outcomes is the same as the uncertainty of a 50% chance of rain.

99% chance of rain: uncertainty $\frac{1}{0.99} \approx 1.01$.

The uncertainty of a 99% chance of rain is equal to the uncertainty of only $n \approx 1.01$ possible outcomes.

75% chance of rain: uncertainty $\frac{1}{0.75} \approx 1.33$.

The uncertainty of a 75% chance of rain is equal to the uncertainty of $n \approx 1.33$ possible outcomes.

1000 equally-probable outcomes: uncertainty is $\frac{1}{(1/1000)} = 1000$.

Almost There

Now we are actually very close to the actual definition of Shannon entropy. We just have two more steps:

convert it to a log scale.
generalize for cases where there are more than two possible outcomes that are not equally probable.

Surprisal: Log-Scale Uncertainty

If we take the log of our metric, we get something known in information theory as surprisal.

Definition 3: Surprisal

$$ \text{surprisal} = \log\left(\frac{1}{p}\right) = -\log(p) $$

We can use any base, but as a software engineer I prefer base 2.

Advantages of Logs

There are a couple of benefits to measuring uncertainty on a log scale.

First, because when there is only one possible outcome, it seems intuitive that uncertainty should be zero. And sure enough, $log(1) = 0$!

Second, working with logs makes uncertainty additive. For example, going back to our murder mystery, suppose there are two equally-probable murder suspects. Uncertainty, measured as surprisal, is $log(2) = 1$. And suppose there are four possible murder weapons. That means the uncertainty about the murder weapon is $log(4) = 2$.

To get the total number of possibilities, we need to count the number of culprit-weapon combinations (Professor Plum with the lead pipe, etc). This means we need to multiply:

$$ n = 2 \times 4 = 8 $$

So total surprisal is $log(8) = 3$.

But we could have gotten that by adding the individual uncertainties:

$$ log(2) + log(4) = 1 + 2 = 3 = log(8) $$

Adding uncertainties can be a bit more convenient than multiplying.

A log scale is especially useful when the number of possible outcomes is very large. Suppose there are 1,000,000 suspects. Multiplying this by the number of possible outcomes for weapons, motives, times of death, etc., can yield trillions of combinations.

If there are a trillion possible outcomes, instead of saying there’s “1,000,000,000,000 possible-outcomes-worth” of uncertainty, we say uncertainty is $log(10^{12}) \approx 39.86$.

Uncertainty for Unequal Probabilities

Okay, our final step is to deal with situations where there are multiple possible outcomes, but they are not all equally probable.

We can’t just use surprisal ($\log\left(\frac{1}{p(x)}\right)$), because there are multiple values for $p(x)$.

Using the surprisal of the most probable outcome doesn’t quite work either. Say the most probable outcome is 90%. Surprisal would be ($\log\left(\frac{1}{.9}\right)$), no matter how many other possible outcomes there are. But uncertainty should increase with the number of possible outcomes.

So what if we used a weighted average? We could weigh the surprisal of each possible outcome by its probability. So our uncertainty measure would be determined mostly be the surprisal of the most probable outcome – which is good – but the surprisal of the remaining outcomes would still contribute.

Entropy as Weighted Average Surprisal

This gives us the following measure of uncertainty, which – tada! – is exactly the definition of Shannon entropy.

Definition 4: Shannon Entropy

$$ \begin{aligned} H(X) &= \sum_{x} p(x) \cdot \text{surprisal}(x) \cr &= \sum_{x} p(x) \cdot \log(\frac{1}{p(x)}) \end{aligned} $$

The formula for Shannon entropy can also be understood as the formula for the expected value of surprisal.

$$ \begin{aligned} H(X) &= \sum_{x} p(x) \cdot \text{surprisal}(x) \cr &= \sum_{x} p(x) \cdot \log(\frac{1}{p(x)}) \cr &= 𝐸 \log(\frac{1}{p(X)}) \end{aligned} $$

Properties of Shannon Entropy

This definition of uncertainty has some desirable properties that we’ve previously discussed.

Here’s a chart showing Shannon entropy in the case of 2 possible outcomes. It shows entropy as a function of the probability of one of the outcomes.

1: It approaches zero as the probability of the most probable outcome approaches 1

For example, in the case of a 99% chance of rain, Shannon entropy is:

$$ \begin{aligned} &= .99 \cdot \log(\frac{1}{.99}) &+ .01 \cdot \log(\frac{1}{.01}) &= 0.081 \end{aligned} $$

Which is pretty close to zero because there is not a lot of uncertainty.

What’s more, it’s easy to see that in the case of 1% chance of rain, entropy is exactly the same!

2: it is maximized when each outcome is equally probable

In the case of 2 possible outcomes, it is maximized when the probability of each outcome is 50%.

In the case of more than 2 outcomes, Shannon entropy is also maximized when they are all equally probable.

In the case of multiple possible outcomes, it has another desirable property:

3: it increases as the number of possible outcomes increases

If all $n$ outcomes are equally probable, then $p(x) = \frac{1}{n} = \log(\frac{1}{p(x)})$ for all $x$. In this case Shannon entropy is just equal to the surprisal of each possibility ($\log(\frac{1}{p(x)})$):

$$ \begin{aligned} H(X) &= \sum_{x} p(x) \cdot \log(\frac{1}{p(x)}) \cr &= \sum_{x} \frac{1}{n} \cdot \log(n) \cr &= n \cdot \left( \frac{1}{n} \cdot \log(n) \right) \cr &= \log(n) \cr &= \log(\frac{1}{p(x)}) \end{aligned} $$

Which increases with the number of possibilities.

Information as the Resolution of Uncertainty

“Information is the resolution of uncertainty.”
– Claude Shannon

So now we have a nice way to actually quantify uncertainty. But Shannon entropy is also a measure of information. What is the relationship between uncertainty and information?

Suppose I know who the murderer is. But you don’t – for you there are still two possibilities. How many bits of information do I need to provide to you to tell you who did it? Just one. I might send you a “1” for Professor Plum and “0” for Colonel Mustard for example. I need to give you 1 bit of information to resolve your 1 bit of uncertainty about the murderer.

How many bits do I need to tell you who the murder weapon is? We said above there are 4 possibilities, and a 2-bit number can encode four possibilities. So I need to provide 2 bits of information to resolve your 2 bits of uncertainty about the weapon.

So “uncertainty” and “information” are two sides of the same coin. Every time you receive one bit of information, you can look at it as resolving one bit of uncertainty. For example, suppose I am sending you a byte of information, one bit at a time. Initially there are $2^8 = 256$ possible values for that byte. So your uncertainty is $log(256) = 8$ bits. When you find out the value of the first bit, you have cut the number of possible outcomes in half to $2^7 = 128$, which means uncertainty is now $log(128) = 7$ bits. Each bit of information reduces uncertainty by 1 bit.

Efficient Encoding

Now let’s suppose the 256 possible values are not equally-probable: 99% of the time, the value was zero. The remaining values are all equally probable. I could devise an encoding schema that, on average, took much less than 8 bits. For example, if the value was zero, I could send a zero, and if not, I could send you a 1 followed by the value. Sometimes it would take 1 bit to communicate the value, sometimes it would take 9, but the average would be closer to 1.

In his seminal 1948 paper, “A Mathematical Theory of Communication,” Claude Shannon probed that, no matter what the probabilities are, it is possible to devise an encoding scheme such that, on average, the number of bit required to communicate a value is equal to the entropy of the associated probability distribution.

Conclusion: Entropy as Uncertainty

So entropy can be understood as a measure of uncertainty. We can calculate it given the distribution of probabilities of a set of possible outcomes. It is measured in bits. Information is also measured in bits. And the number of bits of information required to resolve all uncertainty is equal to the number of bits of entropy.