What Is Data Entropy?

The concept of data entropy is used heavily in RAID Reconstructor. Introduced by Claude Shannon in 1948, entropy provides a way to measure the average number of bits needed to encode a string of symbols.

When looking through an unknown set of raw data on a drive, we calculate the number of bytes needed to encode the content of any given sector. Since a sector contains 512 bytes, this is a number between 0 and 512; dividing by 512 gives a number between 0 and 1. A sector containing only ‘a’s has an entropy near 0, while a highly compressed JPEG has an entropy near 1.
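As a minimal sketch of this per-sector measurement, the following Python function computes Shannon's byte-frequency entropy normalized to the 0–1 range. The function name and the Python implementation are illustrative only, not RAID Reconstructor's actual code:

```python
import math
import os

SECTOR_SIZE = 512

def sector_entropy(sector: bytes) -> float:
    """Normalized Shannon entropy of a sector, between 0 and 1.

    H = -sum(p * log2(p)) over the byte frequencies gives bits per
    byte (0..8); dividing by 8 scales it to the 0..1 range.
    """
    counts = [0] * 256
    for b in sector:
        counts[b] += 1
    n = len(sector)
    bits_per_byte = -sum((c / n) * math.log2(c / n) for c in counts if c)
    return bits_per_byte / 8

print(sector_entropy(b"a" * SECTOR_SIZE))       # a sector of only 'a's: 0.0
print(sector_entropy(os.urandom(SECTOR_SIZE)))  # random bytes: close to 1
```

Note that a sector of a single repeated byte scores exactly 0, and a sector in which every byte value occurs equally often scores exactly 1, matching the two extremes described above.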

Think of popular compression utilities such as WinZip or WinRAR. If such a program compresses a 1,000,000-byte file down to 1,000 bytes, the entropy is around 1,000/1,000,000, which is 0.001. A 1,000,000-byte file that only compresses to 900,000 bytes has an entropy of 0.9. Shannon’s formula enables us to calculate the entropy without actually performing the compression.
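You can observe this relationship with Python’s built-in zlib module, standing in here for WinZip or WinRAR purely as an illustration of the compressed-size ratio:

```python
import os
import zlib

def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size: a rough entropy estimate."""
    return len(zlib.compress(data, 9)) / len(data)

# Highly repetitive data compresses to almost nothing: ratio near 0
print(compression_ratio(b"a" * 1_000_000))
# Random data barely compresses at all: ratio near 1 (or slightly above)
print(compression_ratio(os.urandom(1_000_000)))
```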

But how does this help us to reconstruct broken RAIDs? Most files have consistent entropy that does not vary much between sectors. For example, the entropy of the English language averages about 1.3 bits per character. This means you need roughly 666 bits (≈ 83 bytes) to encode a sentence of 512 characters (stored in one sector). The entropy of this sector containing English text would be 83/512, which is about 0.16. You can assume that sectors with similar entropies belong together. This is how RAID Reconstructor determines the drive order and the block size of drives that previously belonged to a RAID. It also explains why RAID Reconstructor’s analysis sometimes fails: if the probed areas on the drives contain large stretches of the same kind of data, there is nothing distinctive for RAID Reconstructor to “see”.
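As a toy sketch of that idea (not RAID Reconstructor’s actual algorithm; the function names and the simple “smoothness” score are my own assumptions), one could try every candidate drive order, interleave the probed blocks as a stripe would, and prefer the order in which entropy varies least between neighboring blocks:

```python
from itertools import permutations

def smoothness(entropies):
    """Sum of absolute entropy jumps between consecutive blocks;
    lower means the blocks more plausibly belong together."""
    return sum(abs(a - b) for a, b in zip(entropies, entropies[1:]))

def best_drive_order(per_drive):
    """per_drive: one list of block entropies per drive.

    Interleave the blocks for every candidate drive order (as RAID-0
    striping would) and return the order yielding the smoothest sequence.
    """
    def interleaved(order):
        return [per_drive[d][i]
                for i in range(len(per_drive[0]))
                for d in order]
    return min(permutations(range(len(per_drive))),
               key=lambda order: smoothness(interleaved(order)))

# Example: a file whose entropy ramps up gently, striped over two drives.
drive0 = [0.00, 0.20, 0.40]  # stripe blocks 0, 2, 4
drive1 = [0.10, 0.30, 0.50]  # stripe blocks 1, 3, 5
print(best_drive_order([drive0, drive1]))  # → (0, 1)
```

The correct order reassembles the gentle ramp, so its entropy jumps stay small; the wrong order produces a zigzag with larger jumps. If all blocks had near-identical entropy, every order would score the same, which mirrors the failure case described above.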

Next week I will have a look at RAID Reconstructor’s Entropy Test.

