Information Entropy

In the late 1940s, the American mathematician Claude Shannon set out to answer the question “what is information?”, and in doing so he single-handedly developed the subject of “Information Theory”, which turned out to be one of the fundamental cornerstones of computer science.  Shannon’s key idea was to work out how much “actual information” is contained within a “stream of data”.

As it turns out, most things that contain a lot of data actually contain very little real information.  (Think of how often a 300-page book is simply a 3,000-word essay fleshed out to 100,000 words.)  According to Shannon, most of the data in a data stream is “redundant” and can be “compressed” down to its pure information content.  Shannon called this pure compressed content “information entropy”, and the more incompressible a system is, the higher its information entropy.

Information Entropy is, in effect, “the limit of compression of redundant information”; it is “pure information undiluted by repetitive redundancy”…


Information entropy is also known as Shannon’s Entropy and is usually represented by the letter H.

H  =  – Σ( P(x)  ×  log2(P(x)) )   where P(x) is the probability of event x, and the sum runs over all possible events x
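
To make the formula concrete, here is a minimal Python sketch (the function name shannon_entropy is my own choice, not a standard library routine); it takes a list of probabilities and returns H in bits:

    import math

    def shannon_entropy(probabilities):
        # H = -sum(p * log2(p)) over all outcomes, measured in bits.
        # Outcomes with probability 0 contribute nothing to the sum.
        return -sum(p * math.log2(p) for p in probabilities if p > 0)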

So, for example, the Shannon’s Entropy of a fair coin (probability of heads or tails = ½) is

H =  –  (½ log2(½)  +  ½ log2(½))

H =  –  ( ½ (-1)  +  ½ (-1) )

H =  –  ( -½  +  -½ )

H =  –  ( -1 )  =  1 bit
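
Checking the arithmetic with the shannon_entropy sketch above:

    print(shannon_entropy([0.5, 0.5]))    # 1.0 → exactly 1 bit per fair coin flip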

The Shannon’s Entropy of a biased coin (e.g. probability of heads = ¼ and probability of tails = ¾) is

H =  –  (¼ log2(¼)  +  ¾ log2(¾))

H =  –  (¼ (-2)  +  ¾ (-0.415) )

H =  –  ( -0.50 +  -0.31 )

H =  –  ( -0.81 )  =  0.81 bits
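
The same check for the biased coin:

    print(shannon_entropy([0.25, 0.75]))    # ≈ 0.8113 → about 0.81 bits per flip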

The Shannon’s Entropy of a fair die (probability of each number = 1/6) is

H =  –  (1/6 log2(1/6)  +  1/6 log2(1/6)  +  1/6 log2(1/6)  +  1/6 log2(1/6)  +  1/6 log2(1/6)  +  1/6 log2(1/6))

H =  –  (6  ×  1/6 log2(1/6))

H =  –  (log2(1/6))

H =  –  ( -2.585 )  =  2.58 bits
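
And for the fair die:

    print(shannon_entropy([1/6] * 6))    # ≈ 2.585 → about 2.58 bits per roll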

If all 27 symbols of written English (the 26 letters plus the space) were equally likely, then the Shannon’s Entropy would be

H =  –  (27  ×  1/27 log2(1/27))

H =  –  (log2(1/27))

H =  –  ( -4.755 )  =  4.75 bits
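
Likewise for 27 equally likely symbols:

    print(shannon_entropy([1/27] * 27))    # ≈ 4.755 → about 4.75 bits per symbol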

But since they are not equally likely (with “e” having the highest probability and “z” the lowest), and since there are quite a few constraints (such as “q” being nearly always followed by “u”, and the famous “i” before “e” except after “c”, etc.), there is extra redundancy in the language, which reduces the Shannon’s Entropy from 4.75 bits to approximately 1 bit per letter.


The redundancy in any message, or data stream, is therefore equal to {the number of bits used to encode it} minus {the number of bits of Shannon’s Entropy}.  Consequently, the redundancy in a message is related to the extent to which it is possible to compress it.
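
As a rough illustration using the biased-coin figures from above (a sketch, not a full compression scheme): if each flip is stored naively as one raw bit, the redundancy is 1 bit minus the roughly 0.81 bits of entropy, so a long record of flips could in principle be squeezed to about 81% of its raw size.

    # Biased coin from the earlier example: P(head) = 1/4, P(tail) = 3/4
    bits_used_per_flip = 1                               # naive encoding: one raw bit per flip
    entropy_per_flip = shannon_entropy([0.25, 0.75])     # ≈ 0.81 bits of real information per flip
    redundancy_per_flip = bits_used_per_flip - entropy_per_flip
    print(f"Redundancy: {redundancy_per_flip:.2f} bits per flip")                        # ≈ 0.19
    print(f"Compression limit: {entropy_per_flip / bits_used_per_flip:.0%} of raw size")  # ≈ 81%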

Shannon’s Entropy is thus concentrated uncertainty, which is to say concentrated information, not diluted by redundancy.  In other words

Information Entropy is the “Limit of Compression”…