Information Entropy

In the late 1940s, the American mathematician Claude Shannon set out to answer the question “what is information?”, and in doing so he single-handedly developed the subject of “Information Theory”, which turned out to be one of the fundamental cornerstones of all of computer science.  Shannon’s key idea was to work out how much actual “information” is contained within a stream of “data”.

As it turns out, most things that contain a lot of data actually contain very little real information.  (Think of how often a 300-page book is simply a 3,000-word essay fleshed out to 100,000 words.)  According to Shannon, most data in a data stream is “redundant” and can be “compressed” down to its pure information content.  Shannon called the measure of this pure content “information entropy”; so the more incompressible a system is, the higher its information entropy.

Information Entropy (which incidentally is sometimes referred to as “Shannon Entropy”) is, in effect, “the limit of compression of redundant data”; it is “pure information undiluted by repetitive redundancy”…

Mathematically, Information Entropy is usually represented by the letter H.

H  =  – Σ (P(x) × log2(P(x)))   where P(x) is the probability of event x
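This formula translates directly into a few lines of Python (a minimal sketch; the function name `entropy` is my own):

```python
from math import log2

def entropy(probabilities):
    """Shannon entropy H = -sum(P(x) * log2(P(x))), in bits.

    Zero-probability terms are skipped, since p * log2(p) -> 0 as p -> 0.
    """
    return -sum(p * log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))  # fair coin: 1.0 bit
```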

So, for example, the Information Entropy of a fair coin (probability of head or tail = ½) is

H =  –  (½ log2(½)  +  ½ log2(½))

H =  –  ( ½ (-1)  +  ½ (-1) )

H =  –  ( -½  +  -½ )

H =  –  ( -1 )  =  1 bit

The Information Entropy of a biased coin (e.g. probability of head = ¼ and probability of tail = ¾) is

H =  –  (¼ log2(¼)  +  ¾ log2(¾))

H =  –  (¼ (-2)  +  ¾ (-0.415) )

H =  –  ( -0.50 +  -0.31 )

H =  –  ( -0.81 )  =  0.81 bits
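Both coin calculations can be checked numerically (a quick sketch; the variable names are my own):

```python
from math import log2

# Fair coin: H = -(1/2 * log2(1/2) + 1/2 * log2(1/2)) = 1 bit
h_fair = -(0.5 * log2(0.5) + 0.5 * log2(0.5))

# Biased coin: H = -(1/4 * log2(1/4) + 3/4 * log2(3/4)) = ~0.81 bits
h_biased = -(0.25 * log2(0.25) + 0.75 * log2(0.75))

print(round(h_fair, 2), round(h_biased, 2))  # 1.0 0.81
```

Note that the biased coin carries less than a full bit per toss: its outcome is more predictable, so each toss resolves less uncertainty.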

The Information Entropy of a fair die (probability of each number = 1/6) is

H =  –  (1/6 log2(1/6)  +  1/6 log2(1/6)  +  1/6 log2(1/6)  +  1/6 log2(1/6)  +  1/6 log2(1/6)  +  1/6 log2(1/6))

H =  –  (6  ×  1/6 log2(1/6))

H =  –  (log2(1/6))

H =  –  ( -2.585 )  =  2.59 bits
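The die calculation checks out the same way, and it also illustrates the shortcut used in the derivation above: for n equally likely outcomes, the sum collapses to log2(n) (a sketch; names are my own):

```python
from math import log2

# Fair die: six outcomes, each with probability 1/6
h_die = -sum((1 / 6) * log2(1 / 6) for _ in range(6))

print(round(h_die, 3))    # 2.585
print(round(log2(6), 3))  # same value via the log2(n) shortcut
```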

If all 27 symbols of written English (the 26 letters plus the space) were equally likely, then the Information Entropy would be

H =  –  (27  ×  1/27 log2(1/27))

H =  –  (log2(1/27))

H =  –  ( -4.755 )  =  4.76 bits
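The uniform-distribution shortcut applies here too: H = log2(27) (a quick check in Python; the 27 symbols are assumed to be the 26 letters plus the space):

```python
from math import log2

# 27 equally likely symbols: H = -(27 * 1/27 * log2(1/27)) = log2(27)
h_alphabet = log2(27)

print(round(h_alphabet, 3))  # 4.755
```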

But since they are not equally likely (with “e” having the highest probability and “z” the lowest), and since there are quite a few constraints (such as “q” being nearly always followed by “u”, the famous “i” before “e” except after “c”, etc.), there is extra redundancy in the language, which reduces the Information Entropy from 4.76 bits to approximately 1 bit per character.

The redundancy in any message, or data stream, is therefore equal to {the number of bits used to encode it} minus {the number of bits of Information Entropy}.  Consequently, the redundancy in a message is related to the extent to which it is possible to “compress” it.
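This redundancy is exactly what a general-purpose compressor removes.  A rough demonstration using Python’s zlib as a stand-in for an ideal compressor (the byte strings are my own examples):

```python
import os
import zlib

redundant = b"abc" * 1000       # 3,000 bytes of data, very little information
random_data = os.urandom(3000)  # 3,000 bytes of (nearly) pure entropy

# The repetitive stream shrinks dramatically; the random stream does not
# (zlib even adds a few bytes of overhead, since there is nothing to remove).
print(len(zlib.compress(redundant)))
print(len(zlib.compress(random_data)))
```

A real compressor never quite reaches the entropy limit, but the contrast makes the point: compressibility is a direct measure of redundancy.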

So, in the simplest possible terms, Information Entropy is the “Limit of Compression of Data Redundancy”; it is a measure of “Pure Information”.  And since pure information is itself a measure of uncertainty (i.e. pure information is something that we did not previously know and could not predict), it could therefore be said that Information Entropy is the limit of compression of uncertainty; or, more precisely:

Information Entropy is the “Limit of Prediction”, the “Limit of Certainty”…

[Note: One final thing to pay attention to, in the overall context of this website, is that Information Entropy is the data equivalent of the mathematical “Law of Large Numbers”, which is effectively a “Limit of Compression of Mathematical Redundancy”…]