Celerra Dedupe… How does it work?!

I’m getting a lot of questions about how EMC Celerra deduplication works. As deduplication is becoming ever more relevant in the market, I thought I’d best address it.

So what is deduplication? We know it’s the elimination of duplicates… but how is this done in storage? All we’re doing is taking a thing (a file or a block of data, depending on the type of dedupe deployed) and hashing that “thing” (in most cases using SHA-1), which generates a unique fingerprint based on the 1s and 0s of that “thing”. So when a “thing” is written to disk, we hash it; if the generated fingerprint already exists, we don’t store the data again, we just point to the pre-existing identical “thing”. If it doesn’t exist, we write it to disk and store a new fingerprint for future things to be pointed at. End result… surprise, surprise… storage savings!!

Apologies for the excessive use of the word “thing”…  A necessary evil.
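To make the “thing” logic concrete, here is a minimal sketch in Python of the hash-and-check flow described above. The fingerprint index and store names are purely illustrative, not any vendor’s actual implementation:

```python
import hashlib

fingerprint_index = {}   # fingerprint -> location of the stored "thing"
stored_things = {}       # location -> raw data (a real array persists this properly)

def write_thing(thing: bytes) -> str:
    """Store a file or block, deduplicating against previously seen data."""
    fingerprint = hashlib.sha1(thing).hexdigest()
    if fingerprint in fingerprint_index:
        # Duplicate: don't store it again, just point at the existing copy.
        return fingerprint_index[fingerprint]
    # New data: store it and remember its fingerprint for future writes.
    location = f"block-{len(stored_things)}"
    stored_things[location] = thing
    fingerprint_index[fingerprint] = location
    return location

# Writing identical data twice only consumes space once.
assert write_thing(b"hello world") == write_thing(b"hello world")
```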

First, let’s look at the different kinds of deduplication deployed in the market today. There are a few aspects to consider: where the hashing and checking occurs, at what point the dedupe process takes place, and the level of granularity of the deduplication.

Where is deduplication handled (hashing/checking)?

Dedupe at Source

We have dedupe at source, where the block deltas are tracked on the client side in the form of an agent. This is currently deployed by EMC in the shape of Avamar and is used for backup to maximise capacity savings and minimise LAN/WAN traffic (see my previous post on Avamar). I believe Commvault may also be making a play for this in Simpana 9.
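As a rough illustration of the source-side idea (this is not Avamar’s actual protocol, just the general pattern), the agent hashes each chunk locally and only ships the chunks the target has never seen, which is what keeps LAN/WAN traffic down:

```python
import hashlib

def chunks_to_send(local_chunks, server_known_fingerprints):
    """Hypothetical agent-side check: hash each chunk locally and only
    return the chunks the backup target has never seen."""
    to_send = []
    for chunk in local_chunks:
        fp = hashlib.sha1(chunk).hexdigest()
        if fp not in server_known_fingerprints:
            to_send.append((fp, chunk))  # new data: has to cross the wire
        # duplicate chunks are skipped entirely, which saves bandwidth
    return to_send

known = {hashlib.sha1(b"already backed up").hexdigest()}
print(chunks_to_send([b"already backed up", b"brand new data"], known))
```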

Deduplication at target

This simply means that dedupe is handled at the storage target. This is pretty common and is used by the likes of Data Domain, EMC Celerra, Quantum… the list goes on.

When does deduplication occur?

Inline

Data is handled immediately and deduplicated as part of the process of writing it to disk. This is not so common because, unless it’s done very well, there can be a lot of latency involved, as the deduplication process has to take place before a write is committed to disk. Data Domain do this and they do it very well. Their approach uses a system called SISL, where write performance relies on CPU power rather than spindle count. Fingerprints are stored in memory, so that when data is written to the device, the fingerprint lookups are handled in memory and the CPU power determines the speed of the hashing process. If it doesn’t find a fingerprint in memory, it will look for it on disk, but upon finding it, it will pull up a shed load of fingerprints which relate to the same locality of data (kind of similar to cache prefetching), so subsequent sequential writes can again reference fingerprints from memory, not disk.

Want more info on this? See the attached DataDomain-SISL-Whitepaper.
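A very rough sketch of that lookup pattern (the class and names here are assumptions, not Data Domain’s real code): fingerprints are checked in memory first, and a disk hit pulls in the whole locality group of neighbouring fingerprints so that subsequent sequential writes can be resolved from memory:

```python
class FingerprintCache:
    """Toy model of an in-memory fingerprint lookup with locality prefetch."""

    def __init__(self, on_disk_index):
        # on_disk_index: locality_group_id -> set of fingerprints stored together
        self.on_disk_index = on_disk_index
        self.in_memory = set()

    def seen_before(self, fingerprint, locality_group_id):
        if fingerprint in self.in_memory:
            return True                      # fast path: no disk I/O at all
        group = self.on_disk_index.get(locality_group_id, set())
        if fingerprint in group:
            # Disk hit: prefetch the whole locality group so the next
            # sequential writes can be resolved from memory.
            self.in_memory.update(group)
            return True
        return False
```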

Post Process

Post-process dedupe runs after the data has already been written to disk: a background job later scans it, hashes it and removes the duplicates. This is the most common approach, as most vendors can’t handle inline dedupe as efficiently as Data Domain.

Level of Deduplication Granularity

File Level Dedupe

File-level dedupe is where an entire file is hashed and referenced. Also known as single instancing, this is not as efficient as block-level dedupe, but it requires less processing power. You may be familiar with this technology from the likes of EMC Centera or Commvault’s SiS single instancing from Simpana 7.

Fixed Block Dedupe

This involves hashing individual fixed-size blocks of data in a data stream and is much more efficient than file-level dedupe, although it incurs a fair amount more processing overhead.
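As a simple illustration (the block size and names are just examples), a fixed-block scheme cuts the stream into, say, 8 KB blocks and fingerprints each one:

```python
import hashlib

def fixed_block_fingerprints(data: bytes, block_size: int = 8192):
    """Split a data stream into fixed-size blocks and hash each block."""
    return [hashlib.sha1(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]
```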

Variable block size dedupe

This is essentially where the size of the blocks being hashed can vary. The benefit of this for file data is minimal; it is best placed where there are multiple data sources in heterogeneous environments, or where data may be misaligned (i.e. B2D data or VTL). Data Domain do this… and inline… which is impressive.
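One common way to get variable-size blocks (not necessarily how Data Domain do it internally) is content-defined chunking: block boundaries are derived from the data itself, so a few inserted bytes don’t shift every boundary that follows. A simplified sketch:

```python
import hashlib

def content_defined_chunks(data: bytes, window: int = 16, mask: int = 0x3FF,
                           min_size: int = 512, max_size: int = 16384):
    """Cut a stream into variable-size chunks at content-defined boundaries.
    A boundary is declared where a hash of the trailing window matches a mask,
    so identical data produces identical chunks even if it is misaligned."""
    chunks, start = [], 0
    for i in range(len(data)):
        size = i - start + 1
        if size < min_size:
            continue
        window_hash = int.from_bytes(
            hashlib.sha1(data[max(start, i - window + 1):i + 1]).digest()[:4], "big")
        if (window_hash & mask) == 0 or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return [(hashlib.sha1(c).hexdigest(), c) for c in chunks]
```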

EMC Celerra uses file-level dedupe plus compression, and it uses a post-process mechanism. When you specify that you wish to enable dedupe on a file system, you also specify a policy of file types and/or files of a certain age which qualify for dedupe. The Celerra then periodically scans the appropriate file system(s) for files which match the policy criteria, compresses them, hashes them and moves them to a specific portion of the file system (transparent to the user). When the next scan runs and finds new data which meets the policy criteria, it compresses and hashes it, then checks the hashes of previously stored files. If a matching file already exists, the new copy doesn’t get stored (it just points to the existing original file); if it doesn’t, it gets stored… simples. Since there will most likely be a fair few duplicate files in user home directories, you should see a fair number of commonalities which qualify for dedupe in many environments, and with compression also being used, this helps make the best use of the available storage on your Celerra.
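Purely to illustrate that flow (the policy fields and helper names here are assumptions, not Celerra’s actual internals), a post-process pass might look something like this: scan for files matching the policy, compress and hash them, then either point at an existing copy or store a new one:

```python
import gzip, hashlib, os, time

# Hypothetical policy: which file types qualify, and how old they must be (days).
POLICY = {"extensions": {".doc", ".pdf", ".txt"}, "min_age_days": 30}

dedupe_store = {}   # fingerprint -> path of the single stored (compressed) copy

def matches_policy(path: str) -> bool:
    ext_ok = os.path.splitext(path)[1].lower() in POLICY["extensions"]
    age_days = (time.time() - os.path.getmtime(path)) / 86400
    return ext_ok and age_days >= POLICY["min_age_days"]

def post_process_scan(file_system_root: str):
    """One periodic pass: compress + hash qualifying files, single-instance duplicates."""
    for dirpath, _, filenames in os.walk(file_system_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not matches_policy(path):
                continue
            with open(path, "rb") as f:
                compressed = gzip.compress(f.read())
            fingerprint = hashlib.sha1(compressed).hexdigest()
            if fingerprint in dedupe_store:
                print(f"{path}: duplicate, pointing at {dedupe_store[fingerprint]}")
            else:
                dedupe_store[fingerprint] = path
                print(f"{path}: new data, stored compressed copy")
```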

More information is available in an EMC white paper on the subject here.

There is also an online demo from EMC below.


