Disaster recovery is moving higher and higher up the agenda on companies' "to do" lists. It is becoming increasingly apparent what downtime and/or data loss costs a business, and people are starting to think about the monetary cost when services or applications are unavailable to internal staff and, more importantly, to customers. And with the big push towards server virtualization over the last few years, where are the application data, file data, and often the application server itself sitting? On the SAN. So it makes sense to leverage that existing infrastructure and use some form of SAN-based replication.
Bear in mind that the SAN is no longer a luxury only the privileged enterprise has access to; it is becoming ever more important even to small businesses. Not all of these organisations have access to big dedicated links between sites, and if they do, those links are probably subject to significant contention. Unfortunately, TCP isn't the most efficient of protocols over distance either.
So what do you do to make sure the DR solution you have in mind is feasible and realistic?
Firstly, make sure you pick the right technology
The first port of call is sitting down with the customer and mapping out the availability requirements of their applications: things like the RPO and RTO requirements of the applications they have in use. A lot of the time the company won't have thought about this in much detail, so if you are a reseller you can really add value here. Ultimately it boils down to the following being considered for each service:
- How much downtime can you afford before the business starts losing money on each given application?
- How much data can you afford to lose in the event of a disaster before it does significant damage to the business?
If you can get them to apply a monetary figure to each of the above, it helps when positioning return on investment.
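To make the "monetary figure" conversation concrete, here is a minimal sketch of how you might rough out those numbers. All figures and function names are made-up examples for illustration, not a real costing model:

```python
# Hypothetical sketch: turning RPO/RTO targets into rough monetary figures.
# The numbers below are invented examples, not real customer data.

def downtime_cost(hours_down: float, revenue_per_hour: float) -> float:
    """Rough cost of an outage: lost revenue while the service is down (RTO)."""
    return hours_down * revenue_per_hour

def data_loss_cost(rpo_hours: float, transactions_per_hour: float,
                   value_per_transaction: float) -> float:
    """Rough cost of losing everything written since the last replica (RPO)."""
    return rpo_hours * transactions_per_hour * value_per_transaction

# Example: an order-entry app taking 200 orders/hour at £50 each,
# costing £1,000/hour in lost revenue and productivity while down.
print(downtime_cost(4, 1000))       # a 4-hour RTO costs 4000.0
print(data_loss_cost(1, 200, 50))   # a 1-hour RPO risks 10000.0
```

Even a back-of-the-envelope model like this gives the customer a per-service figure to weigh against the cost of the replication solution.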
There are a few types of array-based replication out there. They normally come in three flavours: asynchronous, synchronous, and journaling/CDP. Synchronous replication can be a bit risky for a lot of businesses, as application response time usually becomes dependent on writes being committed to disk on both the production and DR storage (so application response times also depend on the round-trip latency of the link between the two sites, and spindle count becomes very important at both sites). I often find that, aside from banks and large conglomerates, the main candidate for synchronous replication in the SMB space is actually universities. Why? Because universities often don't replicate over massive distances; they have a campus DR setup where they replicate a couple of hundred metres from building to building, so laying fibre isn't too costly. For the average SMB who wants to replicate to another town, however, synchronous replication usually isn't preferable, due to latency over distance and the cost of the large link required.
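The latency point can be shown with a toy model. This is my own simplification (assumed additive latencies, invented figures), but it illustrates why a synchronous write picks up roughly one WAN round trip:

```python
# Sketch (assumed model): with synchronous replication, every write must be
# committed on BOTH arrays before the host gets an ACK, so write latency
# gains roughly one WAN round trip. All figures are illustrative only.

def sync_write_latency_ms(local_commit_ms: float, rtt_ms: float,
                          remote_commit_ms: float) -> float:
    # Write commits locally, travels to DR, commits there, ACK travels back.
    return local_commit_ms + rtt_ms + remote_commit_ms

# Campus DR over a few hundred metres of fibre: RTT is negligible.
print(sync_write_latency_ms(1.0, 0.01, 1.0))   # ~2 ms per write

# Replicating to another town over a contended link with a 2 ms RTT:
print(sync_write_latency_ms(1.0, 2.0, 1.0))    # ~4 ms per write
```

Doubling per-write latency may sound small, but for a write-heavy database issuing serialised writes it can halve effective throughput, which is why the distance and link quality matter so much.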
MirrorView Asynchronous (EMC)
Asynchronous replication is what I typically see in most small to medium-sized businesses. Why? Firstly, because it avoids tying application response times to the round-trip time of the link, as synchronous replication does. With asynchronous replication, a copy-on-first-write mechanism is usually used to effectively ship snapshots at specified intervals over an IP link. Below is a diagram showing how EMC MirrorView/A does this:
EMC uses what's called a delta bitmap (a representation of the data blocks on the volume) to track what has been sent to the secondary array and what hasn't. The delta bitmap works in conjunction with reserved LUNs (the delta set) on the array to ensure that the data sent across to the secondary array remains consistent. The secondary array also has reserved LUNs in place, so that if replication is interrupted or the link is lost, the secondary array can roll back to its previous consistent state and the data isn't compromised.
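The delta-bitmap idea itself is simple enough to sketch in a few lines. This is my own simplified model of the concept, not EMC's actual MirrorView/A implementation:

```python
# Minimal sketch of the delta-bitmap concept (a simplification, not EMC's
# real MirrorView/A code): one flag per chunk of the LUN marks regions
# written since the last update cycle; each cycle ships only dirty chunks.

class DeltaBitmap:
    def __init__(self, num_chunks: int):
        self.dirty = [False] * num_chunks

    def record_write(self, chunk: int):
        self.dirty[chunk] = True     # track the changed region, not the data

    def start_update(self):
        """Return the dirty chunks to ship, then reset for the next cycle."""
        to_ship = [i for i, d in enumerate(self.dirty) if d]
        self.dirty = [False] * len(self.dirty)
        return to_ship

bm = DeltaBitmap(8)
bm.record_write(2)
bm.record_write(5)
bm.record_write(2)          # rewriting the same chunk: still shipped once
print(bm.start_update())    # [2, 5]
```

Note the bandwidth saving this implies: however many times a block is overwritten between updates, only its final contents cross the link.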
You can also use higher-capacity, less expensive disks at the DR site without affecting production response times (although application response times will still be affected in the event of a failover, as servers will be accessing disk on the DR box). One potential drawback of asynchronous replication is that, as the two SANs are no longer in a synchronous state, you have to decide whether it's important that your remote copies of data are in an application-consistent state. If it is, you'll need a technology that sits on the host and talks to both the application and the storage. In the EMC world we have a tool called Replication Manager, which does all the required bits on the host side (calling VSS/hot backup mode, flushing host buffers, etc.).
Replication Manager is licensed per application server (or virtual server in a cluster) and also requires an agent per mount host, plus a server licence (or two, depending on the scenario). There is a lot more to Replication Manager, but that's a whole post in itself.
RecoverPoint is another replication technology from EMC which allows very granular restore points and small RPOs over IP, because it employs journaling rather than copy-on-first-write. It journals and timestamps at very regular intervals (almost every write in some cases), allowing you to roll volumes back to very specific, granular points in time. See the diagram below for more detail:
RecoverPoint provides out-of-band replication: the RecoverPoint appliance itself is not involved in the I/O process. Instead, a component of RecoverPoint called the splitter (or Kdriver) is. The splitter's function is to intercept writes destined for a volume being replicated by RecoverPoint. Each write is then split ("copied"), with one copy sent to the RecoverPoint appliance and the original sent to the target.
With RecoverPoint, three types of splitters can be used. The first resides on a host server that accesses a volume being protected by RecoverPoint. It sits in the I/O stack, below the file system and volume manager layers and just above the multipath layer. This splitter operates as a device driver, inspecting each write sent down the I/O stack to determine whether it is destined for one of the volumes RecoverPoint is protecting. If the write is destined for a protected LUN, the splitter sends the write downward and rewrites the address packet so that a copy of the write is also sent to the RecoverPoint appliance. When the ACK (acknowledgement) for the original write is received, the splitter waits until a matching ACK is received from the RecoverPoint appliance before sending an ACK up the I/O stack. The splitter can also be part of the storage services on intelligent SAN switches from Brocade or Cisco.
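The host splitter's ordering rule described above can be sketched as follows. This is a toy model of the behaviour, not the real Kdriver, and all function names are hypothetical:

```python
# Toy model of the host splitter's ACK ordering (not the real Kdriver):
# a write to a protected LUN is sent to both the storage and the
# RecoverPoint appliance, and the splitter only acknowledges up the
# I/O stack once BOTH copies have been acknowledged.

def splitter_write(lun: str, data: bytes, protected: set,
                   send_to_storage, send_to_appliance) -> str:
    storage_ack = send_to_storage(lun, data)        # the original write
    if lun not in protected:
        return storage_ack                           # unprotected: pass through
    appliance_ack = send_to_appliance(lun, data)    # the split ("copied") write
    # Only acknowledge upward when both ACKs are in hand.
    if storage_ack == "ACK" and appliance_ack == "ACK":
        return "ACK"
    return "RETRY"

ok = lambda lun, data: "ACK"
print(splitter_write("lun0", b"x", {"lun0"}, ok, ok))   # ACK (split, both acked)
print(splitter_write("lun1", b"x", {"lun0"}, ok, ok))   # ACK (not protected, not split)
```

Waiting for both ACKs is what guarantees the appliance's journal never claims a write that didn't actually reach it.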
For the CLARiiON CX4 and CX3, the CLARiiON storage processor also has a write splitter. When a write enters the CLARiiON array (through either a Gigabit Ethernet port or a Fibre Channel port), its destination is examined. If it is destined for one of the LUNs being replicated by RecoverPoint, a copy of that write is sent back out one of the storage processor's Fibre Channel ports to the RecoverPoint appliance. Since the splitter resides in the CLARiiON array, any open-systems server qualified for attachment to the CLARiiON array can be supported by RecoverPoint. Additionally, both Fibre Channel and iSCSI volumes residing in the CLARiiON CX4 or CX3 storage array can be replicated by RecoverPoint. RecoverPoint/SE only supports the Windows host-based splitter and the CLARiiON-based write splitter, and automatic installation and configuration for RecoverPoint/SE only supports the CLARiiON-based write splitter.
Below is a video from EMC demonstrating RecoverPoint in a VMware environment:
Optimise
So how do we ensure we are getting the most out of the links we use (especially contended links such as VPN or MPLS)? WAN optimisation. There are a number of ways this can be done: some products use an appliance to acknowledge back to the production SAN locally, then cache the data and burst it over the WAN. Some companies have found a more efficient way of transmitting data over a WAN, using proprietary, more efficient protocols in place of TCP (such as HyperIP). Below is a snippet from a mail I received from a company called Silver Peak, who seem to deal with the challenges of optimising WAN efficiency quite well, in particular with SAN replication:
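The "local ACK, cache, and burst" approach mentioned above can be sketched with a toy model. This is my own simplification of the general technique, not any specific vendor's implementation:

```python
# Toy model of the "local ACK" trick some WAN optimisers use (a
# simplification, not any vendor's real product): acknowledge the
# replicating array immediately at LAN speed, queue the data locally,
# and drain the queue over the WAN in large coalesced bursts.

from collections import deque

class WanOptimiser:
    def __init__(self):
        self.queue = deque()

    def receive_from_san(self, block: bytes) -> str:
        self.queue.append(block)   # cache the block locally
        return "ACK"               # SAN sees a LAN-speed acknowledgement

    def burst_over_wan(self, max_blocks: int):
        """Drain up to max_blocks in one burst over the WAN."""
        burst = []
        while self.queue and len(burst) < max_blocks:
            burst.append(self.queue.popleft())
        return burst

opt = WanOptimiser()
for b in (b"a", b"b", b"c"):
    opt.receive_from_san(b)
print(opt.burst_over_wan(2))   # [b'a', b'b']
```

The trade-off, of course, is that data queued in the appliance but not yet across the WAN is at risk if the site fails, so the appliance's buffer depth effectively becomes part of your RPO.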
Replication is a big topic, and there are many more factors to consider, such as automation, cluster awareness, etc. I think the best way to summarise this post is…