Information Theory Crash Course

Let’s start with a game.

A coin is hidden among four opaque cups. At each round, we ask yes/no questions until we find the cup containing the coin. The game is played over many rounds, and we aim to minimize the average number of questions asked per round.

We first assume the coin has a 1 in 4 chance of being under each cup. The best strategy is then a binary search, asking 2 questions per round.

Suppose we notice the coin has in fact a 1 in 2 chance of being in the first cup, 1 in 4 of being in the last one, and 1 in 8 of being in each of the two others. Using the questionnaire “Is the coin in the first cup? Is the coin in the last cup? Is the coin in this remaining cup?” we will ask on average only \frac{1}{2}+\frac{1}{4}\times2+\left(\frac{1}{8}+\frac{1}{8}\right)\times3=1.75 questions per round.

Imagine now playing the same game using only two cups, with a 70% chance for the coin to be in the first cup and a 30% chance for it to be in the other. Here, we must ask one question per round.

But let’s change the rules of the game to allow questions about two simultaneous rounds. There is a 49% chance for the two coins to be in their first cups, a 9% chance for them to be in their last cups, and a 21% chance for each of the two remaining configurations. The best strategy will ask \frac{1}{2}(0.49+0.21\times2+\left(0.21+0.09\right)\times3)=0.905 questions per coin.
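
The arithmetic above is easy to reproduce. Here is a small Python sketch (my own, not part of the original game description) that computes the average number of questions for the two strategies just described; the values 1.75 and 0.905 come out directly.

```python
# Single-round strategy on the distribution (1/2, 1/4, 1/8, 1/8): the coin is
# located after 1, 2, 3 or 3 questions depending on the cup it is under.
probs = [0.5, 0.25, 0.125, 0.125]
questions = [1, 2, 3, 3]
print(sum(p * q for p, q in zip(probs, questions)))                   # 1.75

# Two simultaneous rounds of the 70%/30% game: joint probabilities of the four
# configurations, and one decision tree matching the computation in the text.
joint_probs = [0.49, 0.21, 0.21, 0.09]
joint_questions = [1, 2, 3, 3]
print(sum(p * q for p, q in zip(joint_probs, joint_questions)) / 2)   # 0.905 per coin
```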

Entropy and KL divergence

Given n cups and their respective probabilities p_{1},\dots,p_{n} of containing the coin, as we increase the number of simultaneous rounds, the number of questions per coin asked by the best strategy will converge to,
-\sum_{i=1}^{n}p_{i}\log_{2}\left(p_{i}\right)

where \log_{2} is the binary logarithm.

This is the entropy of the distribution p and is denoted H(p). If a single-round optimal strategy exists, we can view it as a decision tree: -\log_{2}\left(p_{i}\right) is the length of the branch we will follow with probability p_{i} when the coin is found in the i^{th} cup.
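
As a sanity check, entropy is straightforward to compute numerically; the sketch below (an addition for illustration) evaluates it for the distributions used in the game.

```python
import math

def entropy(probs):
    """Entropy in bits: -sum(p * log2(p)), skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.25] * 4))                 # 2.0  (uniform over four cups)
print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 (the questionnaire above is optimal)
print(entropy([0.7, 0.3]))                 # ~0.881: the per-coin limit as more rounds
                                           # are combined (two rounds gave 0.905)
```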

Now if we design our questions assuming a wrong distribution q, the number of questions we ask on average is:
-\sum_{i=1}^{n}p_{i}\log_{2}\left(q_{i}\right)

This is the cross-entropy H(p,q). As expected:
H(p,q) \geq H(p)

The difference is called the Kullback-Leibler divergence and we write:
D_{KL}(p\Vert q)=H(p,q)-H(p)
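
In code, cross-entropy and KL divergence are equally short. The sketch below (again an illustration, not from the text) checks H(p,q) \geq H(p) on the four-cup distribution when we wrongly assume the uniform distribution.

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Average number of questions when the strategy is designed for q
    while the coin actually follows p."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return cross_entropy(p, q) - entropy(p)

p = [0.5, 0.25, 0.125, 0.125]   # actual distribution of the coin
q = [0.25, 0.25, 0.25, 0.25]    # wrongly assumed uniform distribution

print(entropy(p), cross_entropy(p, q), kl_divergence(p, q))  # 1.75, 2.0, 0.25
```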

Conditional entropy

Let X and Y be two discrete random variables. Using conditional probabilities we get,
P(X,Y)=P(X)P(Y|X)

hence,
\log_{2}P(X,Y)=\log_{2}P(X)+\log_{2}P(Y|X)

and the expected value operator being linear,
\mathbb{E}\left[\log_{2}P(X,Y)\right]=\mathbb{E}\left[\log_{2}P(X)\right]+\mathbb{E}\left[\log_{2}P(Y|X)\right]

we will write:
H(XY)=H(X)+H(\pi_X(Y))

\pi_X(Y) can be interpreted as a random variable obtained from Y once X has been realized; its entropy H(\pi_X(Y))=H(XY)-H(X) is the usual conditional entropy H(Y|X).

Similarly, given X_{1},\dots,X_{n} discrete random variables:
H(X_{1}\dots X_{n})=\sum_{i=1}^{n}H\left(\pi_{X_{j\lt i}}\left(X_{i}\right)\right)

Note that H(X)+H(Y)-H(XY)=D_{KL}(P(X,Y)\Vert P(X)\otimes P(Y))\geq 0, where P(X)\otimes P(Y) denotes the product distribution (the tensor product of the marginals).

This is 0 if and only if X and Y are independent. Since X and \pi_X(Y) together determine Y, applying the above to the pair X, \pi_X(Y) gives:
H(XY)\leq H(X\pi_X(Y)) \leq H(X) + H(\pi_X(Y))=H(XY)

Both inequalities are therefore equalities, proving that X and \pi_X(Y) are independent.
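
These identities can be checked numerically on any small joint distribution. In the sketch below (an illustration with arbitrary numbers), H(\pi_X(Y)) is computed as H(XY)-H(X), and H(X)+H(Y)-H(XY) is compared with the KL divergence between the joint distribution and the product of its marginals.

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# An arbitrary joint distribution P(X, Y) over {0, 1} x {0, 1}.
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

p_x = {x: sum(joint[x, y] for y in (0, 1)) for x in (0, 1)}
p_y = {y: sum(joint[x, y] for x in (0, 1)) for y in (0, 1)}

h_xy = entropy(joint.values())
h_x, h_y = entropy(p_x.values()), entropy(p_y.values())

h_pi_x_y = h_xy - h_x        # H(pi_X(Y)), i.e. the conditional entropy of Y given X
mutual = h_x + h_y - h_xy    # should equal the KL divergence below and be >= 0
kl = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in joint.items())

print(h_xy, h_x + h_pi_x_y)  # chain rule: both values agree
print(mutual, kl)            # same non-negative value
```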

Source coding theorem (lower bound)

For a discrete random variable X, a binary encoding b of X is an injective function from the domain of X to the set of binary strings. We write B_X=b(X), \left\Vert B_X\right\Vert for the length of B_X, and B_{i} for the binary variable given by the i^{th} bit when \left\Vert B_X\right\Vert \geq i. For such a binary encoding we have:
\mathbb{E}[\left\Vert B_X\right\Vert ]\geq H(X)

Indeed, since b is injective we get:
\begin{align*} H(X) & =H(B_X)\\ & =\sum_{i=1}^{\max\left\Vert b\right\Vert }P(\left\Vert B_X\right\Vert =i)H(\pi_{\left\Vert B_X\right\Vert =i}(B_X))\\ & =\sum_{i=1}^{\max\left\Vert b\right\Vert }P(\left\Vert B_X\right\Vert =i)\sum_{j=1}^{i}H\left(\pi_{\left\Vert B_X\right\Vert =i}\left(\pi_{B_{k\lt j}}\left(B_{j}\right)\right)\right)\\ & \leq\sum_{i=1}^{\max\left\Vert b\right\Vert }P(\left\Vert B_X\right\Vert =i)\times i\\ & \leq\mathbb{E}[\left\Vert B_X\right\Vert ] \end{align*}

where we used the fact that, for any binary random variable B, H(B)\leq 1.

Source coding theorem (upper bound)

Given \epsilon>0 and n>1, writing X^n for a sequence of n independent random variables with the same distribution as X, there exists a binary encoding B_{X^n} such that:
nH(X)\leq\mathbb{E}\left[\left\Vert B_{X^n}\right\Vert \right]\leq n\left(H(X)+\epsilon\right)

From the conditional entropy formula, and by independence, we have H(X^{n})=nH(X), so the left inequality is given by our previous result.

For the right-hand side, we will first show that given natural numbers l_{1}\leq\dots\leq l_{n} with \sum2^{-l_{i}}\leq1 (Kraft's inequality), we can build binary strings of lengths l_{1},\dots,l_{n} such that no string is a prefix of another.

Indeed, we notice that 1-\sum2^{-l_i} is a multiple of 2^{-l_n}, so by adding the length l_{n} to our set 2^{l_{n}}-\sum2^{l_{n}-l_{i}} more times we get \sum 2^{-l_i} = 1. Then, if we prune an infinite binary tree at these lengths, the paths from the root to the leaves, described using left/right instructions, form a set of binary strings in which no string is a prefix of another.
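
The construction can be made concrete in code. The sketch below (my own canonical-code variant of the tree-pruning argument, with hypothetical lengths in the example) assigns codewords in order of increasing length so that no codeword is a prefix of another, assuming Kraft's inequality holds.

```python
def prefix_code(lengths):
    """Build binary codewords with the given lengths (sorted increasingly),
    no codeword being a prefix of another.  Works whenever Kraft's inequality
    sum(2**-l) <= 1 holds (canonical code construction)."""
    assert sum(2 ** -l for l in lengths) <= 1
    assert list(lengths) == sorted(lengths)
    codewords, value = [], 0
    previous_length = lengths[0] if lengths else 0
    for length in lengths:
        value <<= length - previous_length   # keep the already assigned bits as a prefix
        codewords.append(format(value, "b").zfill(length))
        value += 1
        previous_length = length
    return codewords

print(prefix_code([1, 2, 3, 3]))  # ['0', '10', '110', '111']
```

The lengths [1, 2, 3, 3] are exactly those of the four-cup questionnaire from the first section, and the resulting codewords spell out its yes/no answers.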

Now, if we pick the lengths \left\lceil -\log_{2}P(X^{n}=(x_{1},\dots,x_{n}))\right\rceil, then:
\begin{align*} \mathbb{E}\left[\left\Vert B_{X^n}\right\Vert \right] & =\sum P(X^{n}=(x_{1},\dots,x_{n}))\left\lceil -\log_{2}P(X^{n}=(x_{1},\dots,x_{n}))\right\rceil \\ & \leq\sum P(X^{n}=(x_{1},\dots,x_{n}))(1-\log_{2}P(X^{n}=(x_{1},\dots,x_{n})))\\ & \leq n\left(H(X)+\frac{1}{n}\right) \end{align*}

And we obtain the right inequality for n>1/\epsilon.
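
As a numeric illustration (with a hypothetical four-symbol source and n=1), picking the lengths \left\lceil -\log_{2}p\right\rceil and computing the expected codeword length shows it falling between H(X) and H(X)+1, the n=1 case of the bounds above.

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A hypothetical source distribution (blocks of a single symbol, i.e. n = 1).
probs = [0.4, 0.3, 0.2, 0.1]
lengths = [math.ceil(-math.log2(p)) for p in probs]

assert sum(2 ** -l for l in lengths) <= 1        # Kraft's inequality holds
expected_length = sum(p * l for p, l in zip(probs, lengths))

print(entropy(probs), expected_length)           # H(X) <= E[length] <= H(X) + 1
```

Applying the same recipe to blocks of n symbols reduces the per-symbol overhead from 1 to 1/n, which is exactly the statement of the theorem.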

Asymptotic codelength property

With the previous notation, we have:
0\leq\mathbb{E}\left[\left\Vert B_{X^n}\right\Vert \right]-nH(X)\leq 1

Let’s denote this quantity \theta and set \epsilon = \theta/n. Given m a multiple of n, and encoding X^m as a concatenation of m/n independent X^n encodings, Chebyshev’s inequality gives,
P\left(\left|\left\Vert B_{X^m}\right\Vert - mH(X)- \epsilon m\right|>\sqrt{\frac{m}{\delta}}\sigma_n\right)<\delta

for any \delta > 0, with \sigma_n^2 the variance of \left\Vert B_{X^n}\right\Vert.

Then using the triangle inequality we get,
P\left(\left|\left\Vert B_{X^m}\right\Vert - mH(X)\right|> \epsilon m+\sqrt{\frac{m}{\delta}}\sigma_n\right)<\delta

and finally, with Landau’s notation:
P\left(\left|\left\Vert B_{X^m}\right\Vert - mH(X)\right|>\epsilon m+o(m)\right)<\delta
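
The concentration can also be observed by simulation. The sketch below (an illustration with arbitrary parameters: a 0.7/0.3 source, blocks of n=8 symbols, m=8000 symbols in total) encodes each block with the Shannon length \left\lceil -\log_{2}P(\text{block})\right\rceil and compares the total codelength per symbol with H(X).

```python
import math
import random

random.seed(0)

probs = {"a": 0.7, "b": 0.3}                 # distribution of X
h = -sum(p * math.log2(p) for p in probs.values())

n = 8                                        # block size
m = 8000                                     # total number of symbols, a multiple of n

def block_length(block):
    """Shannon codeword length of a block: ceil(-log2 P(block))."""
    p = math.prod(probs[s] for s in block)
    return math.ceil(-math.log2(p))

symbols, weights = list(probs), list(probs.values())
total = sum(block_length(random.choices(symbols, weights=weights, k=n))
            for _ in range(m // n))

print(total / m, h)   # codelength per symbol: within roughly 1/n of H(X), up to fluctuations
```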

Mutual information and channel capacity

Given two discrete random variables X and Y, we call
I(X,Y)=H(X)+H(Y)-H(XY)

the mutual information between X and Y.

A communication channel is modeled by a source of input messages X, transmitted and received as output messages Y at the other end of the channel. The conditional probability P(Y|X) characterizes the reliability of the channel.

Given a fixed conditional distribution P(Y|X), we can choose the distribution P(X) of the input messages so as to maximize the mutual information I(X,Y). This maximal mutual information is called the capacity of the channel.
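
The capacity can be estimated numerically for small channels. The sketch below (my own, anticipating the bit flip channel of the next sections with a hypothetical flip probability of 0.1) computes I(X,Y)=H(X)+H(Y)-H(XY) for a grid of input distributions and keeps the maximum.

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(p_x, channel):
    """I(X,Y) = H(X) + H(Y) - H(XY) for an input distribution p_x and a
    channel given by channel[x][y] = P(Y=y | X=x)."""
    joint = [p_x[x] * channel[x][y] for x in range(len(p_x)) for y in range(len(channel[x]))]
    p_y = [sum(p_x[x] * channel[x][y] for x in range(len(p_x))) for y in range(len(channel[0]))]
    return entropy(p_x) + entropy(p_y) - entropy(joint)

# Hypothetical binary channel flipping each bit with probability 0.1.
flip = 0.1
channel = [[1 - flip, flip], [flip, 1 - flip]]

# Sweep the input distribution P(X=0) = a to estimate the capacity.
capacity, best_a = max((mutual_information([a, 1 - a], channel), a)
                       for a in (i / 1000 for i in range(1, 1000)))
print(best_a, capacity)   # maximum near a = 0.5, value about 0.531 bits
```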

Channel coding theorem

Given a joint probability distribution with mutual information C, and X^n,Y^n the sequences formed by n independent pairs of random variables sampled from it, there exist two sequences of functions (e_n) and (d_n) such that,
\lim_{n\to\infty}P(e_n(X^n)=d_n(Y^n))=1

and, using Landau’s notation:
H(e_n(X^n))=nC+o(n)=H(d_n(Y^n))

Conversely, if
H(e_n(X^n))=nC+|\Omega(n)|=H(d_n(Y^n))

and the above limit exists, then it cannot be 1.

Bit flip channel (capacity)

Before proving the theorem in a more general setting, and to make things concrete, let's look at an example: the bit flip channel. Inputs and outputs of this channel take binary values and the conditional probabilities are

P(Y=1|X=0)=P(Y=0|X=1)=p

where 0\leq p\lt \frac{1}{2} and the two missing conditional probabilities can be deduced.

We can bound the mutual information:
\begin{align*} I(X,Y) & =-H(Y|X)+H(Y)\\ & =p\log_{2}p+(1-p)\log_{2}(1-p)+H(Y)\\ & \leq p\log_{2}p+(1-p)\log_{2}(1-p)+1 \end{align*}

And the bound is reached when P(Y=0)=1/2, that is, when P(X=0)=1/2.

From now on, we will write h(p)=-p\log_{2}p-(1-p)\log_{2}(1-p), so the capacity of the bit flip channel is C=1-h(p).

Since we now have the capacity of the bit flip channel, we can try to prove the first half of the channel coding theorem in this specific case.

Bit flip channel (channel coding theorem)

The idea of the proof is to take a random set of binary strings of a given length n and to show that if this set contains fewer than 2^{n(C-\epsilon)} strings, for any \epsilon>0, we will be almost certain to recognize the transmitted string after transmission as n goes to infinity.

Let \eta>0 be our target probability of error. If we transmit a random string of length n chosen among a set of size 2^{nR}, the number of bit flips F will follow a binomial distribution with parameters n and p.

With Chebyshev’s inequality we get:
P\left(|F-np|\geq k\sqrt{p(1-p)n}\right)\leq\frac{1}{k^{2}}

Let’s pick k such that 1/k^{2}<\eta/2.

Now, assuming |F-np|\lt k\sqrt{p(1-p)n}, we must find a condition on R such that the probability of another string of the set lying within a radius of np+k\sqrt{p(1-p)n} flips of the received string is less than \eta/2.

Since the set of strings is chosen uniformly at random, the number of flips F' between the received string and any other string of the set follows a binomial distribution with parameters n and 1/2. So we need:
p_{error}=1-\left(1-P\left(F'\lt np+k\sqrt{p(1-p)n}\right)\right)^{2^{nR}-1}\leq\frac{\eta}{2}

Using the fact that for n>1, 1-(1-x)^{n} lies below its tangent nx at 0:
\begin{align*} p_{error} & \leq(2^{nR}-1)\times P\left(F'\lt np+k\sqrt{p(1-p)n}\right)\\ & \leq(2^{nR}-1)\times\sum_{i=0}^{np+o(n)}\binom{n}{i}2^{-n}\\ & \leq2^{n(R-1)}\sum_{i=0}^{np+o(n)}\binom{n}{i}\\ & \leq2^{n(R-1)}(np+o(n))\binom{n}{np+o(n)} \end{align*}

Now, as a small interlude, let’s prove that for k\leq n, \binom{n}{k}\leq2^{nh\left(\frac{k}{n}\right)}:
\begin{align*} \left(\frac{k}{n}+1-\frac{k}{n}\right)^{n} & =1\\ \binom{n}{k}\left(\frac{k}{n}\right)^{k}\left(1-\frac{k}{n}\right)^{n-k} & \leq1\\ \binom{n}{k} & \leq2^{nh\left(\frac{k}{n}\right)} \end{align*}
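
A quick numeric check of this bound (added for illustration):

```python
import math

def h(x):
    """Binary entropy h(x), with the convention h(0) = h(1) = 0."""
    return 0.0 if x in (0.0, 1.0) else -x * math.log2(x) - (1 - x) * math.log2(1 - x)

for n, k in [(10, 3), (50, 20), (200, 90)]:
    print(math.comb(n, k), 2 ** (n * h(k / n)))  # the left value never exceeds the right
```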

For our main inequality, we get,
p_{error} \leq2^{n(R-1)}(np+o(n))2^{nh(p+o(1))}

and if R<1-h(p)=C, then p_{error} goes to 0 as desired.
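
The random coding argument can be simulated directly. The sketch below (my own, with arbitrarily chosen parameters n, R and p) draws a random codebook of size 2^{nR}, flips each transmitted bit with probability p, decodes to the closest codeword (a slight variant of the radius argument above), and estimates the error rate; with R well below 1-h(p), errors are already rare at moderate n.

```python
import random

random.seed(0)

n = 60                  # block length
p = 0.05                # bit flip probability; the capacity is 1 - h(0.05), about 0.71
R = 0.15                # rate, chosen well below the capacity
codebook = [random.getrandbits(n) for _ in range(2 ** int(n * R))]

def transmit(word):
    """Flip each of the n bits independently with probability p."""
    noise = sum(1 << i for i in range(n) if random.random() < p)
    return word ^ noise

def decode(received):
    """Nearest-neighbour decoding: the codeword differing in the fewest bits."""
    return min(codebook, key=lambda c: bin(c ^ received).count("1"))

trials = 200
errors = sum(decode(transmit(c)) != c for c in random.choices(codebook, k=trials))
print(errors / trials)  # empirical error rate, expected to be very small here
```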

Proof of the channel coding theorem (first half)

Given n>0, we have,

\begin{align*}H(X^{n}Y^{n}) & =H(\pi_{Y^n}(X^n))+H(Y^n)\\ &= H(\pi_{Y^n}(X^n))+H(\pi_{X^n}(Y^n))+H(\pi_{\pi_{X^n}(Y^n)}(Y^n)) \end{align*}

As proven earlier, \pi_{Y^n}(X^n) is independent of Y^n, so that:

\begin{align*}H(X^{n}Y^{n}) & =H(\pi_{Y^n}(X^n))+H(\pi_{X^n}(Y^n))+H(\pi_{\pi_{X^n}(Y^n)\pi_{Y^n}(X^n)}(Y^n\pi_{Y^n}(X^n)))\\ &= H(\pi_{Y^n}(X^n))+H(\pi_{X^n}(Y^n))+H(\pi_{\pi_{X^n}(Y^n)\pi_{Y^n}(X^n)}(X^nY^n)) \end{align*}

Writing M_{X^nY^n}=\pi_{\pi_{X^{n}}(Y^{n})\pi_{Y^{n}}(X^{n})}\left(X^{n}Y^{n}\right), we notice that M_{X^nY^n} and \pi_{X^{n}}(Y^{n}) together represent Y^n, since adjoining \pi_{Y^{n}}(X^{n}) recovers X^nY^n.

Similarly, M_{X^nY^n} and \pi_{Y^{n}}(X^{n}) together represent X^n, so there exist two encodings obtained by concatenation:

B_{X^{n}}=B_{M_{X^{n}Y^{n}}}B_{\pi_{Y^{n}}(X^{n})}\qquad\text{and}\qquad B_{Y^{n}}=B_{M_{X^{n}Y^{n}}}B_{\pi_{X^{n}}(Y^{n})}

And since,
\begin{align*} H(M_{X^{n}Y^{n}}) & =H(X^{n}Y^{n})-H(\pi_{Y^{n}}(X^{n}))-H(\pi_{X^{n}}(Y^{n}))\\ &= H(X^n)-H(\pi_{Y^{n}}(X^{n})) +H(X^{n}Y^{n}) - H(X^{n}Y^{n}) \\ &=H(X^n) + H(Y^n) - H(X^nY^n)\\ & =nC \end{align*}

Using the asymptotic codelength property, and given \epsilon > 0:
\lim_{n\to\infty}P\left(\left\Vert B_{M_{X^{n}Y^{n}}}\right\Vert \lt n(C-\epsilon)\right)=0

Hence, by encoding Y^n and considering the first n(C-\epsilon) bits, we are almost certain to recover a prefix of B_{X^n}, which concludes the argument.

Proof of the channel coding theorem (converse)

By contradiction, suppose there exists f_n such that,
I(f_n(Y^n),X^n)-nC=|\Omega(n)|

Since \pi_{X^n}(Y^n) is independent of X^n, we would have,
I(\pi_{\pi_{X^n}(Y^n)}(f_n(Y^n)),X^n)-nC=|\Omega(n)|

but then:
\begin{align*}H(Y^{n}) & \geq H(\pi_{\pi_{X^{n}}(Y^{n})}(f_{n}(Y^{n}))\pi_{X^{n}}(Y^{n}))\\ H(Y^{n}) & \geq H(\pi_{\pi_{X^{n}}(Y^{n})}(f_{n}(Y^{n})))+H(\pi_{X^{n}}(Y^{n}))\\ nC & \geq H(\pi_{\pi_{X^{n}}(Y^{n})}(f_{n}(Y^{n})))\end{align*}

Contradiction.