So what is deduplication? We know it's the elimination of duplicates, but how is this done in storage? All we're doing is taking a thing (a file or a block of data, depending on the type of dedupe deployed) and hashing it (in most cases using SHA-1) to generate an effectively unique fingerprint based on the 1's and 0's of that “thing”. So when a “thing” is written to disk, we hash it and check the fingerprint. If the fingerprint already exists, we don't store the data again; we just point to the pre-existing identical “thing”. If it doesn't exist, we write it to disk and store the new fingerprint for future things to be pointed at. End result... surprise, surprise... storage savings!
Apologies for the excessive use of the word “thing”… A necessary evil.
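To make the hash-and-point idea concrete, here is a minimal sketch in Python. It is a toy, not how any particular product does it: the DedupeStore class and its method names are made up for illustration, and everything lives in memory rather than on disk.

```python
import hashlib

class DedupeStore:
    """Toy store (hypothetical): one stored copy per unique fingerprint."""

    def __init__(self):
        self.blobs = {}      # fingerprint -> the single stored copy of the data
        self.pointers = {}   # logical name -> fingerprint it points at

    def write(self, name, data):
        """Store data under name; return True only if new data was stored."""
        fingerprint = hashlib.sha1(data).hexdigest()
        is_new = fingerprint not in self.blobs
        if is_new:
            self.blobs[fingerprint] = data   # first time seen: store it
        self.pointers[name] = fingerprint    # either way, point at the copy
        return is_new

    def read(self, name):
        return self.blobs[self.pointers[name]]

    def stored_bytes(self):
        return sum(len(blob) for blob in self.blobs.values())

store = DedupeStore()
store.write("a.doc", b"quarterly report")
store.write("b.doc", b"quarterly report")   # duplicate: nothing new is stored
store.write("c.doc", b"something else")
print(len(store.pointers), "things written,", len(store.blobs), "things stored")
```

Three “things” go in, but only two are actually kept, and reading b.doc still returns the right data because it simply points at the copy written for a.doc.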
Firstly, let's look at the different kinds of deduplication deployed in the market today. There are a few aspects to consider when looking at dedupe: where the hashing and checking occur, at what point the dedupe process takes place, and the level of granularity at which data is deduplicated.
Where is deduplication Handled (hashing/checking)
Dedupe at Source
We have dedupe at source, where the block deltas are tracked on the client side by an agent. This is currently deployed in the shape of Avamar by EMC and is used for backup to maximise capacity and minimise LAN/WAN traffic (see previous post on Avamar). I believe Commvault may also be making a play for this in Simpana 9.
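As a rough illustration of why source-side dedupe cuts network traffic, the sketch below hashes blocks on the "client" and only sends the ones the "server" has never seen. This is not Avamar's actual protocol; the BackupTarget class, backup_from_client function and the 4 KB block size are all assumptions for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, for illustration only

class BackupTarget:
    """Stands in for the backup server; holds blocks keyed by fingerprint."""

    def __init__(self):
        self.blocks = {}

    def missing(self, fingerprints):
        # The client sends fingerprints only; the target replies with unknowns.
        return {fp for fp in fingerprints if fp not in self.blocks}

    def store(self, fingerprint, block):
        self.blocks[fingerprint] = block

def backup_from_client(data, target):
    """Hash on the client, then send only the blocks the target lacks.

    Returns the number of data bytes that crossed the 'network'.
    """
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    by_fingerprint = {hashlib.sha1(b).hexdigest(): b for b in blocks}
    sent = 0
    for fingerprint in target.missing(by_fingerprint):
        target.store(fingerprint, by_fingerprint[fingerprint])
        sent += len(by_fingerprint[fingerprint])
    return sent

target = BackupTarget()
monday = b"A" * 4096 + b"B" * 4096 + b"C" * 4096
tuesday = b"A" * 4096 + b"X" * 4096 + b"C" * 4096   # one block changed
first_run = backup_from_client(monday, target)
second_run = backup_from_client(tuesday, target)
print(first_run, second_run)
```

The first backup ships all three blocks; the second ships only the one block that changed, which is the whole point of doing the hashing before the data leaves the client.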
Deduplication at target
This simply means that dedupe is handled at the storage target. This is pretty common and is used by the likes of Data Domain, EMC Celerra, Quantum, etc. The list goes on.
When does deduplication Occur
Inline deduplication
Data is handled immediately and deduplicated as part of the process of writing it to disk. This is not so common because, unless it's done very well, there can be a lot of latency involved: the deduplication process has to take place before a write is committed to disk. Data Domain do this and they do it very well. Their process uses a system called SISL, where write performance relies on CPU power rather than spindle count. Fingerprints are stored in memory, so when data is written to the device the fingerprint lookups are handled in memory and the CPU power determines the speed of the hashing process. If it doesn't find a fingerprint in memory, it will look for it on disk, but upon finding it will pull up a shed load of other fingerprints from the same locality of data (kinda similar to cache prefetching), so subsequent sequential writes can again reference fingerprints from memory, not disk.
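The locality trick described above can be sketched in a few lines. To be clear, this is not Data Domain's implementation: the FingerprintIndex class, the "container" grouping and the counters are my own stand-ins, and a real system needs far more than this (cache eviction, for a start). It only shows why one slow disk lookup can pay for many fast memory hits.

```python
import hashlib

class FingerprintIndex:
    """Toy index (hypothetical): an in-memory cache over a slower 'disk' index."""

    def __init__(self):
        self.cache = set()       # fingerprints currently held in memory
        self.on_disk = {}        # fingerprint -> container id (the slow index)
        self.containers = {}     # container id -> fingerprints written together
        self.disk_lookups = 0

    def add(self, fingerprint, container_id):
        self.on_disk[fingerprint] = container_id
        self.containers.setdefault(container_id, []).append(fingerprint)

    def is_duplicate(self, fingerprint):
        if fingerprint in self.cache:
            return True                        # answered from memory
        self.disk_lookups += 1                 # slow path: go to disk
        container_id = self.on_disk.get(fingerprint)
        if container_id is None:
            return False                       # genuinely new data
        # Found on disk: pull in every fingerprint stored alongside it.
        self.cache.update(self.containers[container_id])
        return True

def fingerprint_of(data):
    return hashlib.sha1(data).hexdigest()

index = FingerprintIndex()
stream = [b"block-%d" % i for i in range(100)]
for block in stream:
    # Pretend an earlier backup wrote all of these into the same container.
    index.add(fingerprint_of(block), container_id=0)

# The same stream arrives again, e.g. the next night's backup.
hits = sum(index.is_duplicate(fingerprint_of(block)) for block in stream)
print(hits, "duplicates found with", index.disk_lookups, "disk lookup(s)")
```

Because backup streams tend to arrive in much the same order each time, the first miss drags in the neighbours and the rest of the stream is resolved from memory.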
Want more info on this? See the attached DataDomain-SISL-Whitepaper.
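Post-process deduplication
Data is written to disk first and deduplicated afterwards by a background or scheduled process. This is the most common approach, as most vendors can't handle inline dedupe as efficiently as Data Domain.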
Level of Deduplication Granularity
File Level Dedupe
File level dedupe is where an entire file is hashed and referenced. Also known as single instancing, this saves less space than block level dedupe but requires less processing power. You may be familiar with this technology from the likes of EMC Centera or Commvault's single instancing (SIS) in Simpana 7.
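A quick sketch of single instancing, assuming the files are already in memory as a dict of path to contents (the single_instance function and the file names are invented for the example):

```python
import hashlib
from collections import defaultdict

def single_instance(files):
    """Group file names by the SHA-1 of their full contents."""
    groups = defaultdict(list)
    for name, content in files.items():
        groups[hashlib.sha1(content).hexdigest()].append(name)
    return list(groups.values())

files = {
    "alice/policy.pdf": b"company policy v1",
    "bob/policy.pdf": b"company policy v1",
    "carol/policy.pdf": b"company policy v2",   # one character different
}
groups = single_instance(files)
for group in groups:
    print(group)
```

Alice's and Bob's copies collapse into one instance, but Carol's file differs by a single character and so gets stored in full. That all-or-nothing behaviour is why file level dedupe saves less than block level.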
Fixed Block Dedupe
This is hashing individual fixed-size blocks of data in a data stream. It finds far more duplicates than file level dedupe, although it requires a fair amount more processing power.
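Here is a small sketch of the same idea at block level. The 8-byte block size is absurdly small and chosen only so the data is readable; the function names are made up.

```python
import hashlib

def fixed_blocks(data, size=8):
    """Split data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def unique_fingerprints(*streams, size=8):
    """Fingerprints that would actually need storing across all streams."""
    seen = set()
    for stream in streams:
        for block in fixed_blocks(stream, size):
            seen.add(hashlib.sha1(block).hexdigest())
    return seen

v1 = b"AAAAAAAABBBBBBBBCCCCCCCCDDDDDDDD"
v2 = b"AAAAAAAABBBBBBBBXXXXXXXXDDDDDDDD"   # third block overwritten in place
total_blocks = len(fixed_blocks(v1)) + len(fixed_blocks(v2))
stored_blocks = len(unique_fingerprints(v1, v2))
print(stored_blocks, "of", total_blocks, "blocks stored")
```

File level dedupe would treat v1 and v2 as two completely different files. At block level only the one changed block is new, at the cost of hashing and looking up every block.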
Variable block size dedupe
This is essentially where the size of the blocks being hashed can vary. The benefits of this for file data are minimal. It is best placed where there are multiple data sources in heterogeneous environments, or in environments where data may be misaligned (i.e. B2D data or VTL). Data Domain do this... and inline... which is impressive.
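The misalignment problem is easier to see in code. The sketch below uses one common way of getting variable blocks, cutting wherever a checksum of the last few bytes matches a pattern, so boundaries follow the content rather than the offset. This is a generic technique, not a description of Data Domain's algorithm; WINDOW, DIVISOR and the function names are assumptions, and a real implementation would use a proper rolling hash plus minimum and maximum block sizes.

```python
import random
import zlib

WINDOW = 16     # bytes of context that decide a boundary (assumed value)
DIVISOR = 64    # roughly one boundary every 64 bytes on average (assumed value)

def variable_chunks(data):
    """Cut wherever the CRC of the preceding WINDOW bytes hits a pattern."""
    chunks, start = [], 0
    for end in range(WINDOW, len(data)):
        if zlib.crc32(data[end - WINDOW:end]) % DIVISOR == 0:
            chunks.append(data[start:end])
            start = end
    chunks.append(data[start:])   # whatever is left after the last boundary
    return chunks

def fixed_chunks(data, size=64):
    return [data[i:i + size] for i in range(0, len(data), size)]

rng = random.Random(42)
original = bytes(rng.getrandbits(8) for _ in range(4000))
shifted = b"!" + original   # one byte inserted at the front misaligns the rest

fixed_shared = len(set(fixed_chunks(original)) & set(fixed_chunks(shifted)))
variable_shared = len(set(variable_chunks(original)) & set(variable_chunks(shifted)))
print("fixed blocks still matching:", fixed_shared)
print("variable blocks still matching:", variable_shared)
```

With fixed blocks, a single inserted byte shifts every block boundary, so the fingerprints stop lining up. With content-defined boundaries, only the block containing the insertion changes and everything after it is cut in the same places as before.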
EMC Celerra uses file level dedupe and compression, with a post-process mechanism. When you specify that you wish to enable dedupe on a file system, you also specify a policy of file types and/or files of a certain age which qualify for dedupe. It then periodically scans the appropriate file system(s) for files which match the policy criteria, compresses them, hashes them and moves them to a specific portion of the file system (transparent to the user). When the next scan runs and finds new data which meets the policy criteria, it compresses and hashes those files, then looks at the hashes of previously stored files. If a file already exists, it doesn't get stored (it just points to the existing original file); if it doesn't, it gets stored... simples. There will most likely be a fair few duplicate files in user home directories, so you should see a fair number of commonalities which qualify for dedupe in many environments, and with compression also being used this helps make the best use of the available storage on your Celerra.
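The scan-compress-hash-point cycle can be sketched roughly as follows. This is only a model of the behaviour described above, not Celerra's code or configuration: the FileEntry record, the POLICY dict, the hidden_store and pointers dicts are all hypothetical, and the files are held in memory instead of on a file system.

```python
import hashlib
import zlib
from dataclasses import dataclass

@dataclass
class FileEntry:
    path: str
    content: bytes
    age_days: int

# Hypothetical policy: which file types and ages qualify for dedupe.
POLICY = {"extensions": (".doc", ".pdf"), "min_age_days": 30}

def qualifies(entry, policy):
    return (entry.path.endswith(policy["extensions"])
            and entry.age_days >= policy["min_age_days"])

def dedupe_scan(entries, policy, hidden_store, pointers):
    """One periodic scan: compress, hash, then either store or just point."""
    for entry in entries:
        if not qualifies(entry, policy) or entry.path in pointers:
            continue
        compressed = zlib.compress(entry.content)
        digest = hashlib.sha1(compressed).hexdigest()
        hidden_store.setdefault(digest, compressed)   # stored only if unseen
        pointers[entry.path] = digest                 # file now points at it

hidden_store, pointers = {}, {}
filesystem = [
    FileEntry("/home/amy/handbook.pdf", b"staff handbook " * 100, 90),
    FileEntry("/home/raj/handbook.pdf", b"staff handbook " * 100, 45),
    FileEntry("/home/raj/notes.doc", b"meeting notes", 2),          # too new
    FileEntry("/home/amy/photo.jpg", b"\xff\xd8 not in policy", 400),
]
dedupe_scan(filesystem, POLICY, hidden_store, pointers)
print(len(pointers), "files deduped into", len(hidden_store), "stored copy")
```

Both copies of the handbook end up pointing at one compressed copy, while the file that is too new and the file type outside the policy are left alone until a later scan (or forever, in the case of the photo).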
More information is available in an EMC white paper on the subject here, and in an online demo from EMC below.