Information Theory: A Constructive Crash Course

This text aims to explain and prove, intuitively and with limited prerequisites, the main definitions and theorems of Shannon's information theory: cross/conditional entropy, KL divergence, mutual information, the asymptotic codelength property (a variant of the AEP), and the source and channel coding theorems. It contains a new constructive proof of the coding theorem for discrete memoryless channels and another, non-constructive, proof of the achievability of Shannon's capacity for the bit-flip channel.

The target audience is "undergraduates/good high school students" but anyone is welcome here.

Find the coin

As a playful formalist, I would like to introduce the subject with a small game. ;)

A coin is hidden among four opaque cups. At each round, we ask yes/no questions to uncover the cup concealing the coin. Multiple rounds are played and we aim to minimize the number of questions asked.

We first assume the coin has 1 in 4 chances to be under each cup, such that the best strategy is a binary search asking 2 questions per round.

But after playing a few rounds, we notice the coin has in fact 1 in 2 chance to be in the first cup, 1 in 4 to be in the last one and 1 in 8 to be in the others. If we update our questionnaire asking: "Is the coin in the first cup? Is the coin in the last cup? Is the coin in this remaining cup?" we will ask on average only $\frac{1}{2}+\frac{1}{4}\times2+\left(\frac{1}{8}+\frac{1}{8}\right)\times3=1.75$ questions per round.

Now imagine playing the same game using only two cups with a $70\%$ chance for the coin to be in the first cup and $30\%$ chance to be in the other. Here, we must ask one question per round.

But let's change the rules of the game, allowing questions relative to two simultaneous rounds. There is a $49\%$ chance for the two coins to be in their first cups, a $9\%$ chance for them to be in their last cups and $21\%$ for the two other cases. The best strategy will ask $\frac{1}{2}(0.49+0.21\times2+(0.21+0.09)\times3)=0.905$ questions per coin.

Entropies and KL divergence

Given $n$ cups and their probabilities $p_{1},\dots,p_{n}$ of containing the coin, as we increase the number of simultaneous rounds, the number of questions per coin of the best strategy will converge to,

$$-\sum_{i=1}^{n}p_{i}\log_{2}\left(p_{i}\right)$$

where $\log_{2}$ is the binary logarithm.

This is the entropy of the distribution $p$ and is denoted $H(p)$. If a single-round optimal strategy exists, we can view it as a decision tree: $-\log_{2}\left(p_{i}\right)$ is the length of the branch we will follow with probability $p_{i}$ when the coin is found in the $i^{\text{th}}$ cup.
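
To see this convergence numerically, here is a small sketch (my own illustration, not part of the text's argument). It uses the classical Huffman construction, which builds an optimal yes/no questionnaire, to compute the best average number of questions per coin when several rounds of the two-cup game above are grouped together.

```python
import heapq
from itertools import product
from math import log2, prod

def optimal_questions(probs):
    """Expected number of yes/no questions of an optimal questionnaire,
    computed with the classical Huffman merging procedure."""
    heap = list(probs)
    heapq.heapify(heap)
    expected = 0.0
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        expected += a + b            # each merge adds one question on this subtree
        heapq.heappush(heap, a + b)
    return expected

cup = [0.7, 0.3]                     # the two-cup game above
entropy = -sum(p * log2(p) for p in cup)
for rounds in (1, 2, 3, 8):
    joint = [prod(c) for c in product(cup, repeat=rounds)]
    print(rounds, optimal_questions(joint) / rounds)  # 1.0, 0.905, ... towards H
print(entropy)                       # about 0.881
```

Grouping more simultaneous rounds brings the average number of questions per coin closer and closer to the entropy.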

Now if we assume the wrong distribution $q$, the number of questions we ask on average is:

$$-\sum_{i=1}^{n}p_{i}\log_{2}\left(q_{i}\right)$$

This is the cross-entropy $H(p,q)$ and it is not symmetric. As expected:

$$H(p,q) \geq H(p)$$

The difference is called the Kullback-Leibler divergence; we write:

$$D_{KL}(p\Vert q)=H(p,q)-H(p)$$
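
To make these definitions concrete, here is a small sketch (with helper names of my choosing) computing the entropy of the four-cup distribution above, the cross-entropy when we wrongly assume a uniform distribution, and their difference, the KL divergence.

```python
from math import log2

def entropy(p):
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return cross_entropy(p, q) - entropy(p)

p = [1/2, 1/4, 1/8, 1/8]    # true cup probabilities from the game above
q = [1/4, 1/4, 1/4, 1/4]    # wrongly assumed uniform distribution

print(entropy(p))           # 1.75 questions per round with the right strategy
print(cross_entropy(p, q))  # 2.0 questions per round under the wrong assumption
print(kl_divergence(p, q))  # 0.25 extra questions paid for the wrong assumption
```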

In the rest of the text we will drop the $2$ from $\log_{2}$ for simplicity.

Conditional entropies

Let $X$ and $Y$ be two discrete random variables; we will write the joint probability in terms of the conditional probability as,

$$P(X,Y = x,y) = P(X = x)P(Y = y|X = x)$$

hence,

$$\log P(X,Y = x,y) = \log P(X = x)+\log P(Y = y|X = x)$$

and the expected value operator being linear,

$$\mathbb{E}_{x,y}\left[\log P(X,Y = x,y)\right]=\mathbb{E}_{x,y}\left[\log P(X = x)\right]+\mathbb{E}_{x,y}\left[\log P(Y = y|X = x)\right]$$

so, negating each term, we will write,

$$H(XY)=H(X)+H(Y|X)$$

and, defining the conditional entropy given a specific value $x$ as:

$$H(Y|X = x) = -\mathbb{E}_{y}\left[\log P(Y = y|X = x)\right]$$

so that $H(Y|X)=\mathbb{E}_{x}\left[H(Y|X=x)\right]$.

With a similar reasoning, given discrete random variables $X_{1},\dots,X_{n}$:

$$H(X_{1}\dots X_{n})=\sum_{i=1}^{n}H\left(X_{i}|X_{j\lt i}\right)$$
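
As a quick numeric check of this chain rule, the sketch below builds a toy joint distribution (made up for illustration) and verifies that $H(XY)=H(X)+H(Y|X)$.

```python
from math import log2

# A toy joint distribution P(X = x, Y = y), made up for illustration.
joint = {('a', 0): 0.3, ('a', 1): 0.2, ('b', 0): 0.1, ('b', 1): 0.4}

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# Marginal P(X = x).
px = {}
for (x, y), pr in joint.items():
    px[x] = px.get(x, 0.0) + pr

# H(Y|X) = sum_x P(X = x) H(Y | X = x).
H_Y_given_X = sum(
    pxv * H(joint[(x, y)] / pxv for (xx, y) in joint if xx == x)
    for x, pxv in px.items()
)

print(H(joint.values()))             # H(XY)
print(H(px.values()) + H_Y_given_X)  # H(X) + H(Y|X), the same value
```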

Mutual information

Given two discrete random variables $X$ and $Y$ we have $H(X)+H(Y)-H(XY)=D_{KL}(P(X,Y)\Vert P(X)\otimes P(Y))\geq 0$, where $\otimes$ denotes the outer product.

This quantity is called the mutual information $I(X, Y)$.

It can be equivalently expressed as:

\begin{align*}
I(X, Y) & = H(XY) - H(X|Y) - H(Y|X)\\
& = H(X) - H(X|Y)\\
& = H(Y) - H(Y|X)
\end{align*}

This is left as an exercise, but rest assured: from this point on we will prove everything.
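
For readers who want a numeric sanity check of the exercise, here is a small sketch (the toy joint distribution is made up) verifying that $H(X)+H(Y)-H(XY)$ coincides with $D_{KL}(P(X,Y)\Vert P(X)\otimes P(Y))$.

```python
from math import log2

# A toy joint distribution, made up for illustration.
joint = {('a', 0): 0.3, ('a', 1): 0.2, ('b', 0): 0.1, ('b', 1): 0.4}

px, py = {}, {}
for (x, y), pr in joint.items():
    px[x] = px.get(x, 0.0) + pr
    py[y] = py.get(y, 0.0) + pr

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# I(X, Y) as H(X) + H(Y) - H(XY) ...
i_entropies = H(px.values()) + H(py.values()) - H(joint.values())
# ... and as the KL divergence between the joint and the product of marginals.
i_kl = sum(pr * log2(pr / (px[x] * py[y])) for (x, y), pr in joint.items())

print(i_entropies, i_kl)  # both give the same mutual information
```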

Source coding theorem (lower bound)

For a discrete random variable $X$, a binary encoding $b$ of $X$ is an injection from the domain of $X$ to the set of binary strings. We write $B_X=b(X)$, $\overline{B_X}$ for its length in bits, and $B_{X}^{i}$ for the binary random variable given by the $i^{\text{th}}$ bit when $\overline{B_X} \geqslant i$. For such an encoding that is also prefix-distinct (no codeword is the prefix of another), we have:

$$\mathbb{E}[\overline{B_X}]\geqslant H(X)$$

Indeed, for $j>\overline{B_X}$ let us set $B_{X}^{j}$ to a constant blank symbol, so that $B_X$ determines and is determined by the symbols $B_{X}^{1},\dots,B_{X}^{\max\overline{b}}$. Since $b$ is injective, the conditional entropy formula gives:

$$H(X)=H(B_X)=\sum_{j=1}^{\max\overline{b}}H\left(B_{X}^{j}|B_{X}^{k\lt j}\right)$$

Because the encoding is prefix-distinct, the previous symbols $B_{X}^{k\lt j}$ already determine whether the codeword has ended, that is whether $\overline{B_X}\lt j$. When it has ended, $B_{X}^{j}$ is the blank constant and contributes no entropy; when it has not, $H\left(B_{X}^{j}|B_{X}^{k\lt j}=b^{k\lt j}\right)\leq1$, using the fact that for any binary random variable $B$, $H(B)\leq1$. Hence $H\left(B_{X}^{j}|B_{X}^{k\lt j}\right)\leq P\left(\overline{B_X}\geqslant j\right)$ and, summing over $j$:

$$H(X)\leq\sum_{j=1}^{\max\overline{b}}P\left(\overline{B_X}\geqslant j\right)=\mathbb{E}[\overline{B_X}]$$

Source coding theorem (upper bound)

Given $\epsilon>0$ and $n>1$, writing $X^n$ for a sequence of $n$ independent random variables with the same distribution as $X$, there exists a binary encoding $B_{X^n}$ such that:

$$nH(X)\leq\mathbb{E}\left[\overline{B_{X^n}}\right]\leq n\left(H(X)+\epsilon\right)$$

From the conditional entropy formula and independence we have $H(X^{n})=nH(X)$, so the left inequality follows from our previous result, the encoding we build below being prefix-distinct.

For the right hand side, we first notice that given some naturals $l_{1}\leq\dots\leq l_{m}$ where $\sum2^{-l_{i}}\leq1$, we can build binary strings of lengths $l_{1},\dots,l_{m}$ such that no string is the prefix of another.

Indeed, $1-\sum2^{-l_i}$ is a multiple of $2^{-l_m}$, such that adding $2^{l_{m}}-\sum2^{l_{m}-l_{i}}$ extra copies of the length $l_{m}$ to our set of lengths we get $\sum 2^{-l_i} = 1$. Then, if we prune an infinite binary tree at these lengths, taking at each length the leftmost still-available node as a leaf, the set of paths from the root to the leaves described using left/right instructions will form a valid prefix-distinct set of binary strings.
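
This construction can be made explicit. The sketch below is one way to do it (a greedy allocation that always takes the leftmost free node at each depth; the helper name is mine), producing codewords none of which is a prefix of another whenever the $\sum 2^{-l_i}\leq 1$ condition holds.

```python
def prefix_code_from_lengths(lengths):
    """Build binary codewords with the given lengths such that no codeword is
    a prefix of another, assuming sum(2**-l) <= 1 (the condition above)."""
    assert sum(2 ** -l for l in lengths) <= 1
    codewords, node, depth = [], 0, 0
    for l in sorted(lengths):
        node <<= l - depth                    # descend to depth l in the tree
        codewords.append(format(node, f"0{l}b"))
        node += 1                             # next free node at this depth
        depth = l
    return codewords

print(prefix_code_from_lengths([1, 2, 3, 3]))  # ['0', '10', '110', '111']
print(prefix_code_from_lengths([2, 2, 2]))     # ['00', '01', '10']
```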

Now if we pick the lengths as $\left\lceil -\log_{2}P(X^{n}=(x_{1},\dots,x_{n}))\right\rceil$, which satisfy $\sum2^{-l}\leq\sum P(X^{n}=(x_{1},\dots,x_{n}))=1$, then:

\begin{align*}
\mathbb{E}\left[\overline{B_{X^n}}\right] & =\sum P(X^{n}=(x_{1},\dots,x_{n}))\left\lceil -\log_{2}P(X^{n}=(x_{1},\dots,x_{n}))\right\rceil \\
& \leq\sum P(X^{n}=(x_{1},\dots,x_{n}))\left(1-\log_{2}P(X^{n}=(x_{1},\dots,x_{n}))\right)\\
& \leq n\left(H(X)+\frac{1}{n}\right)
\end{align*}

And we obtain the right inequality for $n>1/\epsilon$.
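
As a numeric sketch of this choice of lengths (shown here on a single-symbol block with a made-up distribution), the ceilings satisfy the Kraft condition of the previous paragraph and the expected length indeed falls between $H$ and $H+1$:

```python
from math import ceil, log2

probs = [0.5, 0.25, 0.15, 0.1]              # a made-up distribution
lengths = [ceil(-log2(q)) for q in probs]   # the lengths chosen above

assert sum(2 ** -l for l in lengths) <= 1   # Kraft condition: a prefix code exists
H = -sum(q * log2(q) for q in probs)
expected_length = sum(q * l for q, l in zip(probs, lengths))
print(H, expected_length, H + 1)            # H <= expected length <= H + 1
```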

Asymptotic Codelength Property (ACP)

Given $n > 1$, some sequence of $n$ discrete random variables $X^n$ and some logical statement $S(X^n)$, we will say that $S(X^n)$ holds almost surely asymptotically (a.s.a.) if we have:

$$\lim_{n\to\infty}P(S(X^n))=1$$

Let $m > 1$; using the previous notation, we can encode $X^{nm} =(X^n_{1},\dots,X^n_{m})$ as $B_{X^{nm}}=B_{X^n_1}\dots B_{X^n_m}$ since the encoding $B_{X^n}$ is prefix-distinct.

With the weak law of large numbers we get a.s.a.,

$$\left|\frac{1}{m}\overline{B_{X^{nm}}} - \mathbb{E}\left[\overline{B_{X^n}}\right]\right|\leq 1$$

and using the last inequality of the previous section together with $\mathbb{E}[\overline{B_{X^{n}}}]\geqslant nH(X)$, still a.s.a.,

$$n\left(H(X)-\frac{1}{n}\right) \leq\frac{1}{m}\overline{B_{X^{nm}}} \leq n\left(H(X)+\frac{2}{n}\right)$$

and with $n=m$ and $k=n^2$, using Landau's notation we get a.s.a.:

$$\overline{B_{X^{k}}} \in \left[kH(X)\pm o(k)\right]$$
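
A small simulation can illustrate this concentration (my own sketch, reusing the Shannon block lengths $\left\lceil-\log_{2}P(X^{k}=x^{k})\right\rceil$ from the source coding section): the per-symbol codeword length of a long i.i.d. block gets close to $H(X)$.

```python
import random
from math import ceil, log2

p = {'a': 0.7, 'b': 0.3}                    # the two-cup distribution again
H = -sum(q * log2(q) for q in p.values())

random.seed(0)
for k in (100, 10_000, 1_000_000):
    block = random.choices(list(p), weights=list(p.values()), k=k)
    length = ceil(-sum(log2(p[s]) for s in block))  # Shannon length of the block
    print(k, length / k)                            # approaches H, about 0.881
```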

Given a sequence of $k$ discrete random variables $Y^k$ sampled from a conditional distribution $P(Y=y|X=x)$, we can similarly consider the conditional encoding $B_{Y^k|X^k=x^k}$.

With the weak law of large numbers we have a.s.a. $H(Y^k|X^k=x^k)\in [kH(Y|X)\pm o(k)]$. Pairing this result with the above construction we get a.s.a. $\overline{B_{Y^k|X^k=x^k}} \in \left[kH(Y|X)\pm o(k)\right]$.

Prefix tree rebalancing

$B_{Y^{k}|X^{k}=x^k}$ is a prefix-distinct encoding, being a concatenation of prefix-distinct encodings.

Given $\epsilon > 0$, we can then perform a rebalancing of the prefix-distinct tree defining $B_{Y^{k}|X^{k}=x^k}$ as follows:

  1. if $\overline{B_{Y^{k}|X^{k}=x^k}(y^k)}<k(H(Y|X) - \epsilon)$, pad it with ones to length $k(H(Y|X) - \epsilon)$
  2. use the new internal nodes with a single child to graft branches pruned from level $k(H(Y|X) - \epsilon)$

We repeat these two operations till the level $k(H(Y|X) - \epsilon)$ of the binary tree is "full" or there is no branch left to prune.

We can bound the increase in expected length from the padding step by $P\left(\overline{B_{Y^{k}|X^{k}=x^k}(y^k)}<k(H(Y|X) - \epsilon)\right)kH(Y|X)=|o(k)|$, and the grafting step followed by the second padding step only decreases the expected length (inductively for the following iterations).

For large enough $k$, the rebalancing procedure terminates with a full $k(H(Y|X) - \epsilon)$ level since a.s.a. $H(B_{Y^{k}|X^{k}=x^k}) = H(Y^k|X^k=x^k) \in [kH(Y|X) \pm o(k)]$. Let's pick such a $k$.

Since this is valid for any $\epsilon > 0$, we can pick a suitable $o(k)$ such that $\epsilon = |o(k)|/k$.

Unification of prefix encodings using branch permutations

After the rebalancing, we can view the prefix of $B_{Y^{k}|X^{k}=x^k}$ of length $kH(Y|X) - |o(k)|$ as defining a family of surjections, indexed by $x^k$, from the domain of $Y^k$ to binary strings of that length.

We can then consider the right inverses $f_{x^k}$ (note we don't need the axiom of choice since all sets are finite) such that $\text{prefix}_{kH(Y|X) - o(k)}\left(B_{Y^{k}|X^{k}=x^k}\right)\circ f_{x^k}=B_{Y^k|X^k}$.

Note that $B_{Y^k|X^k}$ is a function of $y^k$ that does not depend upon the realisation $x^k$ of $X^k$. Using $H(B_{Y^k|X^k=x^k})\in [kH(Y|X)\pm o(k)]$ we have $H(B_{Y^k|X^k}) = kH(Y|X)-|o(k)|$.

Similarly, the suffix $B'_{Y^k|X^k=x^k}$ has an entropy of $|o(k)|$. We are now ready to dive into the channel coding theorem.

Channel capacity

A communication channel is modeled by a source of input messages $X$ transmitted and received as output messages $Y$ on the other end of the channel. The conditional distribution $P(Y|X)$ defines the reliability of the channel.

Given a fixed conditional distribution $P(Y|X)$, we can try to modify the distribution of the input messages $P(X)$ to maximize the mutual information $I(X,Y)$. This maximal mutual information is called the capacity of the channel.
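
For a channel with binary inputs, the capacity can be approximated by simply scanning over input distributions. The sketch below does that; the helper functions and the example channel matrices are illustrations of mine, not part of the text.

```python
from math import log2

def mutual_information(px, channel):
    """I(X, Y) in bits, with channel[x][y] = P(Y=y | X=x) and px[x] = P(X=x)."""
    py = [sum(px[x] * channel[x][y] for x in range(len(px)))
          for y in range(len(channel[0]))]
    return sum(px[x] * channel[x][y] * log2(channel[x][y] / py[y])
               for x in range(len(px)) for y in range(len(py))
               if px[x] > 0 and channel[x][y] > 0)

def capacity(channel, steps=1000):
    """Brute-force search over P(X = 0) = t for a channel with binary inputs."""
    return max(mutual_information([t, 1 - t], channel)
               for t in (i / steps for i in range(steps + 1)))

flip = [[0.9, 0.1], [0.1, 0.9]]   # bit-flip channel with p = 0.1 (see Bonus 1)
skew = [[0.9, 0.1], [0.2, 0.8]]   # a made-up asymmetric channel
print(capacity(flip))             # about 0.531
print(capacity(skew))             # here the best P(X) is not uniform
```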

Channel coding theorem

Given $n$ pairs of discrete random variables sampled independently from a joint probability distribution, we write $X^n$ and $Y^n$ for the deinterlaced sequences.

Let $(e_n)$ and $(d_n)$ be sequences of functions such that $H(e_n(X^n)|d_n(Y^n))=|o(n)|$ and $H(d_n(Y^n)|e_n(X^n))=|o(n)|$. If we write,

\begin{align*}
h_{e,d} & =\sup_{n}\frac{1}{n}H(e_{n}(X^{n}))\\
& =\sup_{n}\frac{1}{n}H(d_{n}(Y^{n}))
\end{align*}

then:

$$\sup_{e,d}h_{e,d}=I(X,Y)$$

Channel coding proof (achievability)

For the rest of the proof we will write $C=I(X,Y)$ and take $n>1$.

We first remark that we can encode $(X^n, Y^n)$ in two ways:

with $H(B'_{X})=|o(n)|$ and $H(B'_{Y})=|o(n)|$ as concatenations of various suffixes.

Dropping $B_{X^n|Y^n}$ and $B_{Y^n|X^n}$ we notice,

\begin{align*}
H(B_{Y^n|B_{Y^n|X^n}}B'_{Y}|B_{X^n|B_{X^n|Y^n}}B'_{X})&\geq H(B_{Y^n|B_{Y^n|X^n}}B'_{Y}|B_{X^n|B_{X^n|Y^n}})-H(B'_{X})\\
0&\geq H(B_{Y^n|B_{Y^n|X^n}}B'_{Y}|B_{X^n|B_{X^n|Y^n}}) - |o(n)|
\end{align*}

that is $H(B_{Y^n|B_{Y^n|X^n}}B'_{Y}|B_{X^n|B_{X^n|Y^n}})=|o(n)|$ and, dropping $B'_{Y}$, we have,

$$H(B_{Y^n|B_{Y^n|X^n}}|B_{X^n|B_{X^n|Y^n}})=|o(n)|$$

and similarly:

$$H(B_{X^n|B_{X^n|Y^n}}|B_{Y^n|B_{Y^n|X^n}})=|o(n)|$$

Furthermore we have $H(B_{Y^n|B_{Y^n|X^n}}) = nC - |o(n)|$ and $H(B_{X^n|B_{X^n|Y^n}}) = nC - |o(n)|$.

Channel coding proof (optimality)

By contradiction, using Landau's notation, let's assume there is a $d_n$ such that,

$$I(d_n(Y^n), X^n)-nC=|\Omega(n)|$$

Since $H(X^nY^n)=H(X^n)+H(B_{Y^n|X^n})+o(n)$, we would have,

$$I(B_{d_n(Y^n)|B_{Y^n|X^n}}, X^n)-nC=|\Omega(n)|$$

but since:

\begin{align*}
H(Y^{n}) & \geq H(B_{d_n(Y^n)|B_{Y^n|X^n}}B_{Y^n|X^n})\\
H(Y^{n}) & \geq H(B_{d_n(Y^n)|B_{Y^n|X^n}})+H(B_{Y^n|X^n})\\
nC+o(n) & \geq H(B_{d_n(Y^n)|B_{Y^n|X^n}})
\end{align*}

Contradiction, since $I(B_{d_n(Y^n)|B_{Y^n|X^n}}, X^n)\leq H(B_{d_n(Y^n)|B_{Y^n|X^n}})\leq nC+o(n)$.

Bonus 1: capacity of the bit flip channel

Inputs and outputs of this channel take binary values and the conditional probabilities are,

$$P(Y=1|X=0)=P(Y=0|X=1)=p$$

where $0\leq p\lt \frac{1}{2}$ and the two missing conditional probabilities can be deduced.

We can bound the mutual information:

\begin{align*}
I(X,Y) & =-H(Y|X)+H(Y)\\
& =p\log_{2}p+(1-p)\log_{2}(1-p)+H(Y)\\
& \leq p\log_{2}p+(1-p)\log_{2}(1-p)+1
\end{align*}

And the bound is reached when $P(Y=0)=1/2$, that is when $P(X=0)=1/2$.

From now on, we will write $h(p)=-p\log_{2}p-(1-p)\log_{2}(1-p)$.
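
As a quick numeric sketch of the resulting capacity $C=1-h(p)$ for a few flip probabilities (the helper name is mine):

```python
from math import log2

def h(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

for p in (0.0, 0.01, 0.1, 0.25, 0.5):
    print(p, 1 - h(p))   # capacities: 1.0, ~0.92, ~0.53, ~0.19, 0.0 bits per use
```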

Bonus 2: proof of achievability for the bit flip channel

The idea of the proof is to take a random set of binary strings of a given length $n$ and to show that, for any $\epsilon>0$, if this set contains fewer than $2^{n(C-\epsilon)}$ strings, we will be almost certain to recognize the strings after a transmission as $n$ goes to infinity.

Let's assume $\eta>0$ is our targeted probability of error. If we transmit a random string of length $n$ chosen among a set of size $2^{nR}$, the number of bit flips $F$ will follow a binomial distribution with parameters $n$ and $p$.

With Chebyshev's inequality we get:

$$P\left(|F-np|\geq k\sqrt{p(1-p)n}\right)\leq\frac{1}{k^{2}}$$

Let's pick $k$ such that $1/k^{2}<\eta/2$.

Now, assuming $|F-np|\lt k\sqrt{p(1-p)n}$, we must find a condition on $R$ such that the probability of having another transmitted string within a radius of $np+k\sqrt{p(1-p)n}$ flips is lower than $\eta/2$.

Since the set of strings is chosen uniformly at random, the number of flips $F'$ from the original transmitted string to any other transmitted string will follow a binomial distribution with parameters $n$ and $1/2$. So we need:

$$p_{error}=1-\left(1-P\left(F'\lt np+k\sqrt{p(1-p)n}\right)\right)^{2^{nR}-1}\leq\frac{\eta}{2}$$

Using the fact that for $n>1$, $1-(1-x)^{n}$ is below its tangent $nx$ at $0$:

\begin{align*}
p_{error} & \leq(2^{nR}-1)\times P\left(F'\lt np+k\sqrt{p(1-p)n}\right)\\
& \leq(2^{nR}-1)\times\sum_{i=0}^{np+o(n)}\binom{n}{i}2^{-n}\\
& \leq2^{n(R-1)}\sum_{i=0}^{np+o(n)}\binom{n}{i}\\
& \leq2^{n(R-1)}(np+o(n))\binom{n}{np+o(n)}
\end{align*}

Indeed, since $p<1/2$, for $n$ large enough we have, for all $i\lt np+o(n)$:

$$\binom{n}{i}\lt\binom{n}{np+o(n)}$$

Now, as a small interlude, let's prove that for $k\leq n$, $\binom{n}{k}\leq2^{nh\left(\frac{k}{n}\right)}$:

\begin{align*}
\left(\frac{k}{n}+1-\frac{k}{n}\right)^{n} & =1\\
\sum_{i=0}^{n}\binom{n}{i}\left(\frac{k}{n}\right)^{i}\left(1-\frac{k}{n}\right)^{n-i} & =1\\
\binom{n}{k}\left(\frac{k}{n}\right)^{k}\left(1-\frac{k}{n}\right)^{n-k} & \leq1\\
\binom{n}{k} & \leq2^{nh\left(\frac{k}{n}\right)}
\end{align*}
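
A quick numeric check of this bound, using exact binomial coefficients (my own sketch):

```python
from math import comb, log2

def h(x):
    return 0.0 if x in (0.0, 1.0) else -x * log2(x) - (1 - x) * log2(1 - x)

n = 100
for k in (0, 10, 37, 50, 90, 100):
    assert comb(n, k) <= 2 ** (n * h(k / n))
print("bound holds for all tested values")
```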

For our main inequality, we get,

\begin{align*}
p_{error} & \leq2^{n(R-1)}(np+o(n))2^{nh(p+o(1))}\\
& \leq(np+o(n))2^{n\left(R-(1-h(p+o(1)))\right)}
\end{align*}

and if $R<1-h(p)=C$, the probability $p_{error}$ goes to $0$ as desired.
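
To see the random-coding argument in action, here is a Monte-Carlo sketch (my own illustration, not part of the proof): it draws $2^{nR}$ random codewords, flips each bit of the transmitted one with probability $p$, and declares an error whenever some other codeword is at least as close to the received string in Hamming distance. With $R$ below $C=1-h(p)$, the estimated error rate shrinks as $n$ grows.

```python
import random

def error_rate(n, R, p, trials=100, seed=0):
    """Estimated decoding error rate of a random codebook of rate R over a
    bit-flip channel with flip probability p (nearest-codeword decoding)."""
    rng = random.Random(seed)
    M = 2 ** int(n * R)                               # number of codewords
    errors = 0
    for _ in range(trials):
        book = [rng.getrandbits(n) for _ in range(M)]
        sent = book[0]
        flips = sum(1 << i for i in range(n) if rng.random() < p)
        received = sent ^ flips

        def distance(c):
            return bin(c ^ received).count('1')       # Hamming distance

        # Error if some other codeword is at least as close as the sent one.
        errors += any(distance(c) <= distance(sent) for c in book[1:])
    return errors / trials

p = 0.1                                   # capacity C = 1 - h(0.1), about 0.53
for n in (16, 32, 64):
    print(n, error_rate(n, R=0.25, p=p))  # the error rate drops as n grows
```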