Entropy Made Intuitive

20 Apr 2020

Entropy

More uncertainty about a trial $\Rightarrow$ higher information gained

\[H(X) = \color{blue}{-}\sum_i \color{red}{p_i} \cdot \color{blue}{\log_2 p_i} = \sum_i \color{red}{p_i} \cdot \color{blue}{\log_2 \frac{1}{p_i}}\]

$H(X)$ is the entropy of random variable $X$: the expected amount of information in $X$’s possible outcomes. Mirroring the expected value definition, $\color{red}{p_i}$ is the probability of each outcome and $\color{blue}{\log_2 \frac{1}{p_i}}$ is then the amount of information for each outcome.

This all makes sense, but the part that confused me is how $\log_2 \frac{1}{p_i}$ measures the amount of information. Why not some other measure?

We have two criteria for our mysterious information measure $I(x)$, with outcomes $x_1, x_2$:

  1. More uncertainty, more information: $p(x_1) > p(x_2) \Rightarrow I(x_1) < I(x_2)$
  2. If $x_1, x_2$ are independent: $I(x_1, x_2) = I(x_1) + I(x_2)$

For (1), we could use an inverse proportion such as $I(x) = \frac{1}{p(x)}$. However, this does not add up correctly for independent events. If we instead use $I(x) = \log\frac{1}{p(x)} = -\log(p(x))$, then (2) checks out because $\log\frac{1}{p(x_1)\,p(x_2)} = \log\frac{1}{p(x_1)} + \log\frac{1}{p(x_2)}$, and (1) is still satisfied because the log is monotonically increasing.

So we can use this $\log_b \frac{1}{p_i}$ for any base $b$ and both criteria will be satisfied. $\log_2 \frac{1}{p_i}$ is used to represent the “number of bits” of information.
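For example, a fair coin flip carries exactly one bit of information, while an outcome with probability $\frac{1}{8}$ carries three bits:

\[\log_2 \frac{1}{1/2} = 1 \text{ bit}, \qquad \log_2 \frac{1}{1/8} = 3 \text{ bits}\]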

from math import log2

def entropy(P):
    # H(P) = -sum_i p_i * log2(p_i); skip zero-probability outcomes (0 * log 0 = 0)
    total = 0
    for prob in P:
        if prob > 0:
            total += prob * log2(prob)
    return -total
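A quick sanity check with the function above: a fair coin gives exactly 1 bit, a biased coin gives less because its outcome is less uncertain, and a certain outcome gives 0.

print(entropy([0.5, 0.5]))  # 1.0 bit: fair coin
print(entropy([0.9, 0.1]))  # ~0.47 bits: biased coin, less uncertainty
print(entropy([1.0, 0.0]))  # 0.0 bits: no uncertainty at all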

Cross Entropy

Cross entropy between two probability distributions $q$ and $p$ is the expected number of bits of information needed to identify an event when using the estimated distribution $q$ instead of the true distribution $p$. It is usually used when we are estimating a probability distribution, such as in classification problems with a softmax output layer.

\[H_Q(P) = \sum_i \color{red}{p_i} \cdot \color{blue}{\log_2 \frac{1}{q_i}}\]

Just as before, $\color{blue}{\log_2 \frac{1}{q_i}}$ represents the amount of information and $\color{red}{p_i}$ is the probability of outcome $i$.

It’s clear the probability weight should be $p_i$ because that’s the true probability distribution. And since $q$ is an estimate of $p$, the best we can do is when the estimate is perfect, $q = p$, in which case the cross entropy equals the entropy of $P$:

\[H_Q(P)=H_P(P)=H(P)\]

What’s the worst we can do with our estimate $q$? In binary classification, if we are 100% wrong on the labels then $H_Q(P) \rightarrow \infty$. To see this, notice the term approaches $\infty$ for the true class where $p_i=1$ and our estimate $q_i \rightarrow 0$, but terms for the wrong classes equal 0 where $p_i=0$ and our estimate $q_i \rightarrow 1$.

So the range of $H_Q(P)$ is $\big[H(P), \infty \big)$. This makes it effective as a loss function for classification: we want to minimize the loss between our model’s class distribution $Q$ (given by the softmax at the last layer) and the true class distribution $P$, where the minimum loss is $H(P)$.

def cross_entropy(P, Q):
    # H_Q(P) = -sum_i p_i * log2(q_i); skip terms where p_i = 0
    total = 0
    for pi, qi in zip(P, Q):
        if pi > 0:
            total += pi * log2(qi)
    return -total
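A quick check with the functions above, using a one-hot true distribution and two made-up model outputs: a confident, correct estimate stays near the lower bound $H(P)$, while a confident, wrong estimate blows up as $q_i \rightarrow 0$ for the true class.

P = [1.0, 0.0, 0.0]              # true class distribution (one-hot)
Q_good = [0.9, 0.05, 0.05]       # confident and correct
Q_bad = [0.01, 0.01, 0.98]       # confident and wrong

print(entropy(P))                # 0.0, the lower bound H(P)
print(cross_entropy(P, Q_good))  # ~0.15 bits
print(cross_entropy(P, Q_bad))   # ~6.64 bits, growing without bound as q -> 0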

KL Divergence

Kullback-Leibler (KL) divergence is a measure of how one probability distribution differs from another. Note: it is not a distance metric because it’s asymmetric, i.e., $D(P\vert\vert Q) \neq D(Q\vert\vert P)$.

With a true probability distribution $P$ and an estimated distribution $Q$ (say, model A’s distribution $Q_A$ or model B’s distribution $Q_B$), the KL divergence $D_{KL}(P\vert\vert Q)$ is the relative entropy of $P\ \text{w.r.t. } Q$. Or, it’s the relative entropy of the true distribution with respect to an estimated distribution. It’s the expected additional information needed to encode $P$ when the encoding is optimized for $Q$.

But how do we measure this extra information (in bits)? Recall that the cross entropy $H_Q(P)$ is the base entropy $H(P)$ plus the additional bits incurred by encoding with $Q$. We just want those additional bits! The KL divergence is simply:

\[D_{KL}(P\vert\vert Q) = H_Q(P) - H(P)\]

In other words, $D_{KL}(P\vert\vert Q)$ is just the cross entropy minus the entropy. It has a minimum of 0 when $P = Q$ because there is no extra uncertainty. By the way, it’s asymmetric because cross entropy is too.

In ML models, minimizing the KL divergence is equivalent to minimizing the cross entropy (given that $P$ is stationary), because the entropy of the true distribution $H(P)$ does not change. However, in a field like multi-agent reinforcement learning we encounter nonstationary distributions frequently. For example, consider two agents $A_1, A_2$, each predicting the next state, which depends on the other (still training) agent’s policy. The “true” distribution of states $S$ changes over time because the agents affect the environment differently as they learn.

def kl(P, Q):
    # D_KL(P||Q) = H_Q(P) - H(P): the extra bits from encoding with Q
    return cross_entropy(P, Q) - entropy(P)
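A quick check with the functions above: the divergence is 0 when the estimate matches the true distribution exactly, and swapping the arguments gives a different value, showing the asymmetry.

P = [0.5, 0.5]
Q = [0.9, 0.1]

print(kl(P, P))  # 0.0: no extra bits when the estimate is exact
print(kl(P, Q))  # ~0.74 bits
print(kl(Q, P))  # ~0.53 bits: not the same, so not a distance metric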

Joint Entropy

The joint entropy $H(X,Y)$ describes the expected information across multiple variables, or distributions. Simply: it is the entropy of the joint distribution $P(X,Y)$.

Remember, the joint distribution is given by the probability of each pair of outcomes occurring together:

\[P(X,Y) = \{ p(x,y) | X=x, Y=y \}\]

This would be a simple multiplication, $p(x,y) = p(x)\,p(y)$, if the distributions were independent. Then the joint entropy is:

\[H(X,Y) = \sum_{x,y} p(x,y) \log_2 \left( \frac{1}{p(x,y)} \right)\]
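A minimal sketch, assuming the joint distribution is given as a 2D table where PXY[x][y] is $p(x,y)$ (a hypothetical joint_entropy helper that reuses entropy from above):

def joint_entropy(PXY):
    # H(X, Y) is just the entropy of the flattened joint distribution
    return entropy([p for row in PXY for p in row])

# two independent fair coins: p(x, y) = 0.25 everywhere
PXY = [[0.25, 0.25],
       [0.25, 0.25]]
print(joint_entropy(PXY))  # 2.0 bits = H(X) + H(Y) since X, Y are independent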

Conditional Entropy

If we already know the outcome of a random variable $Y$, we don’t need to send as much information. It is as if we selected that outcome of $Y$ in our joint $P(X,Y)$ and then renormalized the probabilities (to get the conditional probability distribution).

The conditional entropy describes the average information content of a distribution $X$ when we already know $Y$:

\[H(X|Y) = \sum_{x,y} p(x,y) \log_2 \left( \frac{1}{p(x|y)} \right)\]
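A minimal sketch using the same 2D joint table and the log2 import from above, where $p(x \vert y) = p(x,y) / p(y)$ and the marginal $p(y)$ comes from summing out $x$ (a hypothetical conditional_entropy helper):

def conditional_entropy(PXY):
    # H(X|Y) = sum_{x,y} p(x, y) * log2(1 / p(x|y)), with p(x|y) = p(x, y) / p(y)
    total = 0
    for y in range(len(PXY[0])):
        p_y = sum(PXY[x][y] for x in range(len(PXY)))  # marginal p(y)
        for x in range(len(PXY)):
            p_xy = PXY[x][y]
            if p_xy > 0:
                total += p_xy * log2(p_y / p_xy)  # log2(1 / p(x|y))
    return total

# Y is a copy of X for a fair coin, so knowing Y removes all uncertainty about X
PXY = [[0.5, 0.0],
       [0.0, 0.5]]
print(conditional_entropy(PXY))  # 0.0 bits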

Mutual Information

The joint entropy and conditional entropy both describe the distributions and something about their overlap, or differences. $H(X|Y)$ describes the information that is in $X$ but not in $Y$.

The mutual information describes exactly this overlap between distributions:

\[I(X,Y) = H(X) + H(Y) - H(X,Y)\]

(Notice the similarity to probability union: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$)
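A minimal sketch with the hypothetical helpers above; the marginals $P(X)$ and $P(Y)$ come from summing the joint table over the other variable:

def mutual_information(PXY):
    # I(X, Y) = H(X) + H(Y) - H(X, Y)
    PX = [sum(row) for row in PXY]        # p(x) = sum_y p(x, y)
    PY = [sum(col) for col in zip(*PXY)]  # p(y) = sum_x p(x, y)
    return entropy(PX) + entropy(PY) - joint_entropy(PXY)

print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # 0.0: independent, no overlap
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))      # 1.0 bit: Y fully determines X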

The information that is not shared between them is called the variation of information:

\[VI(X,Y) = H(X,Y) - I(X,Y) = H(X) + H(Y) - 2 I(X,Y)\]
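And the variation of information follows directly from the hypothetical helpers above:

def variation_of_information(PXY):
    # VI(X, Y) = H(X, Y) - I(X, Y)
    return joint_entropy(PXY) - mutual_information(PXY)

print(variation_of_information([[0.5, 0.0], [0.0, 0.5]]))      # 0.0: everything is shared
print(variation_of_information([[0.25, 0.25], [0.25, 0.25]]))  # 2.0 bits: nothing is shared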