Tag Archives: SAN

Sizing for FAST performance

So EMC launched the VNX and changed elements of how we size for IO. We still have the traditional approach to sizing for IO, where we take our LUNs and size for traditional RAID groups. So let's start here first as a refresher:

Everything starts with the application. So what kind of load is the application going to put on the disks of our nice shiny storage array ?

So let's say we have run perfmon or a similar tool to identify the number of disk transfers (IOPS) occurring on a logical volume for an application. For the sake of argument, we are sizing for a SQL DB volume which is generating 1000 IOPS.

Before we get into the grit of the math, we must decide what RAID type we want to use (the two below are the most common for transactional workloads).

RAID 5 = Distributed parity. It has a reasonably high write penalty and a good usable vs raw capacity ratio (the equivalent of one drive's capacity is lost to parity), so a fair few people use this to get the most bang for their buck. Bear in mind that RAID 5 can survive a single drive failure (which will incur performance degradation), but will not protect against double disk failure. The EMC Clariion does employ hot spares, which can be proactively built when the Clariion detects a failing drive and then substituted for it, although if no hot spare exists, or if a second drive fails during a rebuild or while the hot spare is being built, you will lose your data. Write penalty = 4.

RAID 1/0 = Mirrored/striped. It has a lesser write penalty, but is more costly per GB as you lose 50% of usable capacity to mirroring. RAID 1/0 provides better fault resilience and rebuild performance than RAID 5, and better overall performance by combining the speed of RAID 0 with the redundancy of RAID 1 without requiring parity calculations. Write penalty = 2.

Yes there are only 2 RAID types here, but this is more to keep the concept simple.

So, depending on the RAID type we use, a certain write penalty is incurred due to mirroring or parity operations.

Let's take a view on the bigger piece now. Our application generates 1000 IOPS. We need to separate this into reads and writes:

So let's say 20% writes vs 80% reads. We then multiply the number of writes by the appropriate write penalty (2 for RAID 1/0 or 4 for RAID 5). Let's say RAID 5 is our selection:

The math is as follows :

800 Reads + (200 Writes x 4) = 1600 IOPS. This is the actual disk load we need to support.

We then divide that disk load by the IO rating of the drive type we wish to use. Generally speaking, at a 4KB block size the IO ratings below apply (these go down as block/IO sizes get bigger).

EMC EFD = 2500 IOPS
15K SAS/FC = 180 IOPS
10k SAS/FC = 150 IOPS
7.2K NLSAS/SATA = 90 IOPS

The figure we are left with after dividing the disk load by the IO rating is the number of spindles required. The same approach applies when sizing for sequential disk load, but we refer to MB/s and bandwidth instead of disk transfers (IOPS). Avoid using EFD for sequential data (overkill and not much benefit).

15k SAS/FC = 42 MB/s
10k SAS/FC = 35 MB/s
7.2k NLSAS = 25 MB/s

Bear in mind this does not take array cache into account; sequential writes to disk benefit massively from cache, to the point where many papers suggest that NLSAS/SATA gives comparable results to FC/SAS. A quick worked sketch of the whole calculation follows.
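To put some code behind the arithmetic above, here is a minimal sketch (Python, purely illustrative) of the front-end to back-end conversion and the spindle count. The write penalties and IO ratings are the rule-of-thumb figures quoted in this post, not vendor guarantees, and the function names are my own.

```python
import math

# Rule-of-thumb figures quoted above (4KB blocks); treat these as estimates, not guarantees.
WRITE_PENALTY = {"RAID5": 4, "RAID10": 2}
IO_RATING = {"EFD": 2500, "15K_SAS": 180, "10K_SAS": 150, "NLSAS": 90}

def backend_iops(frontend_iops, write_fraction, raid_type):
    """Convert host (front-end) IOPS into the back-end disk load."""
    writes = frontend_iops * write_fraction
    reads = frontend_iops - writes
    return reads + writes * WRITE_PENALTY[raid_type]

def spindles_required(frontend_iops, write_fraction, raid_type, drive_type):
    """Data spindles needed to service that load (excluding parity drives and hot spares)."""
    return math.ceil(backend_iops(frontend_iops, write_fraction, raid_type)
                     / IO_RATING[drive_type])

# The worked example from this post: 1000 IOPS, 80/20 read/write, RAID 5 on 15k SAS.
print(backend_iops(1000, 0.20, "RAID5"))                   # 1600.0
print(spindles_required(1000, 0.20, "RAID5", "15K_SAS"))   # 9
```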

So What about FAST ?

FAST is slightly different. It allows us to define Tier 0, Tier 1 and Tier 2 layers of disk. Tier 0 might be EFD, Tier 1 might be 15k SAS and Tier 2 might be NLSAS. We can have multiple tiers of disk residing in a common pool of storage (a bit like a RAID group, but allowing for functions such as thin provisioning and tiering).

We can then create a LUN in this pool and specify that we want the LUN to start life on any given tier. As access patterns to that LUN are analysed by the array over time, the LUN is split up into GB-sized chunks and only the most active chunks utilise Tier 0 disk; the less active chunks are trickled down to our Tier 1 and Tier 2 disks in the pool.

Fundamentally speaking, the aim is to service around 90% of the IOPS with the Tier 0 disk (EFD) and bulk out the capacity by splitting the remaining capacity between Tier 1 and Tier 2. You will find that in most cases you can service the IO with a fraction of the number of EFD disks compared to doing it all with SAS disks. I would suggest that if you know something should never require EFD, such as B2D, archive data or test/dev, put it in a separate disk pool with no EFD. A rough sketch of that tier split is shown below.
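Here is a rough, hypothetical sketch of that tiering heuristic, assuming the ~90% of IOPS on Tier 0 figure above, the earlier rule-of-thumb IO ratings, and made-up drive capacities. Any real FAST design would come from proper sizing tools; this just illustrates the shape of the sums.

```python
import math

# Illustrative sketch of the heuristic above: push ~90% of the back-end IOPS to
# Tier 0 (EFD) and bulk out the remaining capacity on Tier 1/Tier 2.
# IO ratings are the rule-of-thumb figures from earlier; drive capacities are
# example values only.

def fast_tier_sketch(backend_iops, capacity_gb,
                     efd_iops=2500, sas_iops=180, nlsas_iops=90,
                     efd_gb=200, sas_gb=600, nlsas_gb=2000,
                     tier0_io_share=0.9):
    efd_drives = math.ceil(backend_iops * tier0_io_share / efd_iops)
    remaining_gb = max(capacity_gb - efd_drives * efd_gb, 0)
    # Split the leftover capacity evenly between Tier 1 and Tier 2 (a simplification).
    sas_drives = math.ceil((remaining_gb / 2) / sas_gb)
    nlsas_drives = math.ceil((remaining_gb / 2) / nlsas_gb)
    # Sanity check: can the lower tiers carry the IOPS that didn't land on EFD?
    leftover_iops = backend_iops * (1 - tier0_io_share)
    assert sas_drives * sas_iops + nlsas_drives * nlsas_iops >= leftover_iops, \
        "lower tiers cannot service the remaining IOPS - add spindles"
    return efd_drives, sas_drives, nlsas_drives

# Example: 16,000 back-end IOPS and 20TB of capacity.
print(fast_tier_sketch(backend_iops=16000, capacity_gb=20000))
```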


Protocol considerations with VMware

A good video I came across from EMC discussing some storage protocol considerations when looking at VMware.


SAN Based replication ? no problem.. latency.. Problem..

Disaster recovery is moving higher and higher up the agenda on companies' "to do" lists. It is becoming increasingly apparent what the costs to a given business are when companies suffer downtime and/or loss of data. People are starting to think about the monetary cost to the business when services or applications are unavailable to internal staff and, more importantly, customers. And with the big push of server virtualisation over the last few years, where is the application data, the file data and often the application server itself sitting? On the SAN. So it makes sense to leverage that existing SAN infrastructure and use some form of SAN-based replication.

Bear in mind the SAN is no longer a luxury only the privileged enterprise has access to; it is becoming ever more important to even small businesses. Not all of these organisations have access to biiiig dedicated links between sites, and if they do, those links are probably subject to significant contention. Unfortunately, TCP isn't the most efficient of protocols over distance.

So what do you do to make sure the DR solution you have in mind is feasible and realistic ?    

Firstly, make sure you pick the right technology.

The first port of call is sitting down with the customer and mapping out the availability requirements of their applications: things like the RPO/RTO requirements of the applications they have in use. A lot of the time the company may not have thought about this in a lot of detail, so you can really add value here if you are a reseller. Ultimately it boils down to the following being considered for each service:

  • How much downtime can you afford before the business starts losing money on each given application?
  • How much data can you afford to lose in the event of a disaster before it does significant damage to the business?

 

If you can get them to apply a monetary figure to the above, it can help when positioning return on investment.    

There are a few types of array-based replication out there. They normally come in three flavours: asynchronous, synchronous and journaling/CDP. Synchronous replication can be a bit risky for a lot of businesses, as application response time becomes dependent on writes being committed to disk on both the production and DR storage (so application response times also become dependent on round-trip latency across the link between the two sites, and spindle count becomes very important on both sites). I often find that, aside from banks and large conglomerates, the main candidate for synchronous replication in the SMB space is actually universities. Why? Because often universities don't replicate over massive distances; they have a campus DR setup where they replicate over a couple of hundred metres from building to building, so laying fibre in this case isn't too costly. However, for the average SMB who wants to replicate to another town, synchronous replication isn't usually preferable due to latency over distance and the cost of the large link required.

MirrorView Asynchronous (EMC)

Asynchronous replication is typically what I see in most small to medium sized businesses. Why? Firstly, because it avoids making application response times dependent on the round-trip time, as synchronous replication does. With asynchronous replication, a copy-on-first-write mechanism is usually utilised to effectively ship snapshots at specified intervals over an IP link. Below is a diagram showing how EMC MirrorView/A does this:

(Diagram: EMC MirrorView/A asynchronous replication cycle)

EMC uses what's called a delta bitmap (a representation of the data blocks on the volume) to track what has been sent to the secondary array and what hasn't. This delta bitmap works in conjunction with reserved LUNs (the delta set) on the array to ensure that the data sent across to the secondary array remains consistent. The secondary also has reserved LUNs in place, so that if replication is interrupted or the link is lost, the secondary array can roll back to its original state and the data isn't compromised. The toy sketch below illustrates the idea.
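Conceptually, the delta bitmap is just a per-block dirty flag that decides what gets shipped at each update cycle. The sketch below is my own illustration of that idea, not EMC code; the class and method names are hypothetical.

```python
# Toy illustration of the delta-bitmap idea (not EMC's implementation):
# track which blocks have changed since the last update, then ship only
# those blocks to the secondary at each interval.

class DeltaBitmapMirror:
    def __init__(self, num_blocks):
        self.dirty = [False] * num_blocks   # the "delta bitmap"
        self.primary = {}                   # block -> data on the production LUN
        self.secondary = {}                 # block -> data on the DR copy

    def write(self, block, data):
        """Host write: lands on the primary and marks the block dirty."""
        self.primary[block] = data
        self.dirty[block] = True

    def update_cycle(self):
        """Ship only the changed blocks to the secondary, then clear the bitmap."""
        for block, is_dirty in enumerate(self.dirty):
            if is_dirty:
                self.secondary[block] = self.primary[block]
                self.dirty[block] = False

mirror = DeltaBitmapMirror(num_blocks=8)
mirror.write(3, b"payroll")
mirror.write(5, b"orders")
mirror.update_cycle()
print(mirror.secondary)   # only blocks 3 and 5 were transferred
```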

Also, you can use higher-capacity, less expensive disks on the DR site without affecting response times in production (although application response times will still be affected in the event of a failover, as servers will be accessing disk on the DR box). One potential drawback with asynchronous replication is that, as the two SANs are no longer in a synchronous state, you have to decide whether it is important that your remote copies of data are in an application-consistent state. If it is, then you'll have to look at a technology which sits on the host, talks to the application and also talks to the storage. In the EMC world we have a tool called Replication Manager which does the various required bits on the host side (calling VSS/hot backup mode, flushing host buffers, etc.).

Replication Manager is licensed per application server (or virtual server in a cluster) and also requires an agent per mount host, plus a server licence (or two, depending on the scenario). There is a lot more to Replication Manager, but that's a whole post in itself.

EMC RecoverPoint    

RecoverPoint is another replication technology from EMC which allows very granular restore points and small RPOs over IP, because it employs journaling rather than copy on first write. It journals and timestamps writes at very regular intervals (almost every write in some cases), allowing you to roll back volumes to very specific, granular points in time. See the diagram below for more detail:

(Diagram: EMC RecoverPoint journaling architecture)

RecoverPoint provides out-of-band replication: the RecoverPoint appliance is not in the primary I/O path. Instead, a component of RecoverPoint called the splitter (or Kdriver) is. The function of the splitter is to intercept writes destined for a volume being replicated by RecoverPoint. The write is then split ("copied"), with one copy being sent to the RecoverPoint appliance and the original being sent to the target.

With RecoverPoint, three types of splitters can be used. The first splitter resides on a host server that accesses a volume being protected by RecoverPoint. This splitter resides in the I/O stack, below the file system and volume manager layer and just above the multipath layer. It operates as a device driver and inspects each write sent down the I/O stack, determining whether the write is destined for one of the volumes that RecoverPoint is protecting. If the write is destined for a protected LUN, the splitter sends the write downward and rewrites the address packet so that a copy of the write is sent to the RecoverPoint appliance. When the ACK (acknowledgement) for the original write is received, the splitter waits until a matching ACK is received from the RecoverPoint appliance before sending an ACK up the I/O stack, as sketched below. The splitter can also be part of the storage services on intelligent SAN switches from Brocade or Cisco.
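As a simplified illustration of that write-splitting behaviour (my own sketch, not RecoverPoint code), the splitter only acknowledges up the stack once both the target and the appliance have acknowledged. The transports here are hypothetical stand-ins for the real I/O paths.

```python
# Conceptual sketch of the splitter behaviour described above: the write is
# forwarded to both the protected LUN and the RecoverPoint appliance, and the
# host only gets an ACK once both have acknowledged.

def split_write(write, send_to_target, send_to_appliance):
    ack_target = send_to_target(write)        # original write to the protected LUN
    ack_appliance = send_to_appliance(write)  # copy of the write to the appliance
    if ack_target and ack_appliance:
        return True                           # ACK sent back up the I/O stack
    raise IOError("write not acknowledged by both target and appliance")

# Hypothetical transports standing in for the real I/O paths.
ok = split_write(b"block-42",
                 send_to_target=lambda w: True,
                 send_to_appliance=lambda w: True)
print(ok)
```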

For a CLARiiON CX4 and CX3, the CLARiiON storage processor also has a write splitter. When a write enters the CLARiiON array (either through a Gigabit Ethernet port or a Fibre Channel port), its destination is examined. If it is destined to one of the LUNs being replicated by RecoverPoint, then a copy of that write is sent back out one of the Fibre Channel ports of the storage processor to the RecoverPoint appliance. Since the splitter resides in the CLARiiON array, any open systems server that is qualified for attachment to the CLARiiON array can be supported by RecoverPoint. Additionally, both Fibre Channel and iSCSI volumes that reside inside the CLARiiON CX4 or CX3 storage array can be replicated by RecoverPoint. RecoverPoint/SE only supports a Windows host-based splitter and the CLARiiON-based write splitter. Also automatic installation and configuration for RecoverPoint/SE only supports the CLARiiON-based write splitter.   

Below is a Video from EMC demonstrating Recoverpoint in a VMWare Environment : 

   

  

Optimise

So how do we ensure we are getting the most out of the links we use (especially over contended links such as VPN or MPLS)? WAN optimisation. There are a number of ways this can be done: some use an appliance to acknowledge back to the production SAN locally, then cache the data and burst it over the WAN; some companies have found more efficient ways of transmitting data over a WAN by replacing TCP on the WAN leg with proprietary protocols (such as HyperIP). Below is a snippet from a mail I received from a company called Silver Peak, who seem to deal with the challenges of optimising WAN efficiency quite well, in particular with SAN replication:

“Just a few years ago, it was unheard of to combine SAN traffic with other storage applications, let alone on the same network as non-storage traffic. That is no longer the case. Silver Peak customers like NYK logistics are doing real-time data replication over the Internet. Want to learn more? Here is a demo of EMC replication running across a shared WAN ”  

  In summary   

Replication is a biiiig topic.. there are many more factors to be considered, such as automation, cluster awareness, etc. I think the best way to summarise this post is…

To be continued     

      

 


An Apple a day…. Could help EMC get back into the Mac space ?

Now, for the bulk of organisations (in the UK at least), the majority of business applications are hosted on operating systems such as Windows, Linux, HP-UX and Solaris. EMC do very well with these organisations; they have extensive lists of supported operating systems, with all their revisions and service pack releases to boot. For these organisations, and for resellers selling into them, life is good, interoperability is rife and big vendors such as EMC give them much love. But there is another world out there, one often overlooked by the likes of EMC… a world of glorious white, multicoloured fruit and virus-free environments… I shall call this place Mac land, often visited by the likes of graphic design, advertising and publishing companies.

Without a support statement in sight involving the words Mac OS X for some years, and with the likes of Emulex and QLogic not forthcoming with a resolution, the future was looking bleak for resellers wanting to sell EMC SAN storage into Mac user environments. But wait!! A solution has presented itself, in the form of a company called ATTO Technology. Much like Saint Nick delivering presents in the night, these guys have been sneaking Mac OS X support statements onto EMC interoperability support matrices. I heard no song and dance about this!? But I was pleased to see it none the less…

The supported range of FC HBAs comes in single-port, dual-port and quad-port models (FC-41ES, FC-42ES, FC-44ES), and the iSCSI software initiator is downloadable from their website.

Support covers Mac OS X 10.5.5 through 10.5.10 on Apple Xserve servers and Intel-based Mac Pro workstations attaching to EMC's CX4 range only. Rather than just providing basic support out of necessity, there are a few bells and whistles: multipathing is supported with ATTO's own multipathing driver and integrates with ALUA on the Clariion, and a number of Brocade, Cisco MDS and QLogic SANbox switches are supported (with the exception of a few popular recent switches such as the Brocade Silkworm 300, 5100 and 5300 and the QLogic SANbox 1404). Also, ATTO have released an iSCSI software initiator for iSCSI connectivity to Clariion or Celerra, which is also supported.

Just a brief disclaimer.. I've mentioned some specific support statements; that is not to say that EMC would not support the switches that aren't currently listed, but you may have to jump through some hoops to get your solution supported if certain elements aren't on the standard support statements. I would recommend checking the relevant support statements from EMC if you are a Mac user looking at EMC, just to make sure your bases are covered.

Take a look at the press release from ATTO Technologies here


Is your EMC Solution Supported ? Why not check ?

EMC are pretty good at making sure they test, test and test again when it comes to interoperability with other vendors. The EMC Elab enables you to make use of all that testing and check that your Storage environment is supported with EMC.

See below for a walkthrough guide of the Elab storage wizard.


Sizing for performance on Clariion

A few things I would suggest you do before just sizing a bunch of disks for capacity.

Firstly, your application response times are dependent upon a few things; one of the key things is ensuring you provision enough spindles/drives to support the disk load you are going to put on the SAN. If performance isn't considered and the SAN isn't sized with performance in mind, you could potentially see queue depth increasing on drives; the queue depth directly relates to the number of read/write requests waiting to access the drives. If the queue depth gets too high, applications which require sub-5ms response times (which a few do) may start timing out, and then you have problems. So you need to do a bit of data gathering.

In Windows terms, run something like perfmon in logging mode, looking at counters like bytes written, bytes read, number of reads, number of writes and queue depth. In Linux/Unix terms, something like iostat should be fine.

Ensure you start logging over a reasonable period of time and that you capture metrics over the peak hours of activity. We'll come back to this in a minute.

Identify the profile of your data: which applications write to disk in a sequential fashion and which write in a random fashion. Sequential writes are optimised by some clever bits the Clariion does in cache; if you mix sequential and random type data on the same RAID groups, you won't see optimised writes with the sequential data.

Sequential data = large writes (typically backup-to-disk apps, archive applications, media streaming, large files)

Random data = lots of small reads/writes, i.e. databases (Exchange, SQL, Oracle)

Next you need to think about the level of RAID protection required:

RAID 5 = Distributed parity. It has a reasonably high write penalty and a good usable vs raw capacity ratio (the equivalent of one drive's capacity is lost to parity), so a fair few people use this to get the most bang for their buck. Bear in mind that RAID 5 can survive a single drive failure (which will incur performance degradation), but will not protect against double disk failure. The EMC Clariion does employ hot spares, which can be proactively built when the Clariion detects a failing drive and then substituted for it, although if no hot spare exists, or if a second drive fails during a rebuild or while the hot spare is being built, you will lose your data. Write penalty = 4.

RAID 3 = Dedicated parity disk, great for large files, media streaming, etc., although RAID 5 can work in the same fashion as RAID 3 under the right conditions.

RAID 1/0 = Mirrored/striped. It has a lesser write penalty, but is more costly per GB as you lose 50% of usable capacity to mirroring. RAID 1/0 provides better fault resilience and rebuild performance than RAID 5, and better overall performance by combining the speed of RAID 0 with the redundancy of RAID 1 without requiring parity calculations. Write penalty = 2.

RAID 6 = Again distributed parity, but instead of calculating horizontal parity only (as RAID 5 does) it also calculates diagonal parity, essentially protecting you from double disk failure. There is a greater capacity overhead than RAID 5, but not as great as RAID 1/0 (the equivalent of two drives' capacity is lost to parity). The write penalty for RAID 6 is greater than RAID 5, although typically RAID 6 should only be used for sequential type data (backup to disk, media streaming, large files), so writes to disk will be optimised by write coalescing in cache, writing all data and parity out to disk in one go without recalculating parity on the disk (thus incurring a lesser write penalty). This will only happen if the write sizes are greater than the stripe size of the RAID group and the drives are properly aligned.

There are a number of documents on Powerlink outlining RAID sizing considerations for specific applications. The rule of thumb is to keep your log files on separate spindles from your main DB volumes.

I’m going to cut this a bit short –

From the data you gathered (mentioned at the beginning of this post), take the following for each logical drive that is currently local or direct attached:

number of reads + (number of writes x write penalty) = disk load (the write penalty is specific to the RAID type being used: RAID 5 = 4, RAID 1/0 = 2)

Each drive type has an IOPS rating: 10k FC = 150 IOPS, 15k FC = 180 IOPS, SATA = approx 80 IOPS.

Divide the disk load by the IOPS rating of the disk type you've chosen and that will give you the spindles required to support your disk load (excluding parity drives). You will most likely have multiple volumes (LUNs) on a given RAID group, so ensure you divide the aggregate disk load of the volumes that will reside on that RAID group by the IOPS rating of the drives in question. A quick sketch of this aggregation is below.
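A quick sketch of that per-RAID-group aggregation, using the rule-of-thumb penalties and ratings from this post and made-up LUN loads; the function names are my own.

```python
import math

# Sum the back-end load of every LUN destined for the RAID group, then divide
# by the IOPS rating of the chosen drive type. Penalties and ratings are the
# rule-of-thumb figures from this post; the LUN figures are example values.

WRITE_PENALTY = {"RAID5": 4, "RAID10": 2}
IO_RATING = {"15K_FC": 180, "10K_FC": 150, "SATA": 80}

def lun_disk_load(reads, writes, raid_type):
    return reads + writes * WRITE_PENALTY[raid_type]

def spindles_for_group(luns, raid_type, drive_type):
    total = sum(lun_disk_load(reads, writes, raid_type) for reads, writes in luns)
    return math.ceil(total / IO_RATING[drive_type])

# Two example LUNs (reads, writes) sharing one RAID 5 group of 15k FC drives.
print(spindles_for_group([(600, 150), (300, 100)], "RAID5", "15K_FC"))   # 11
```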

That's a starting point for you and some food for thought.. For more detail, there are FLARE best practice guides and application-specific guides galore on Powerlink…


EMC Clariion CX4 Virtual Provisioning

The EMC Clariion CX4 is a very flexible box. One of the ways it enables admins to get the most bang for their buck is by utilising virtual (thin) provisioning. This effectively enables an admin to create a pool of storage from which smaller volumes of storage (thin LUNs) are provisioned. See below:

(Diagram: thin LUNs provisioned from a common storage pool)

Many people will see thin provisioning as the answer to all their problems in terms of storage management. No. It still needs to be monitored and managed, and some intelligent thinking still needs to go into deciding which volumes will be thin provisioned. The EMC Clariion CX4 also has some limitations around thin provisioning.

Each model of Clariion has a limit on how many drives may be contained in a pool of storage for virtual provisioning, so if you're thinking of positioning a CX4-120 with one big pool of disks (e.g. 30 disks) for thin provisioning, think again. Please see below:

(Table: maximum drives per virtual provisioning pool, by CX4 model)

Also please bear in mind that, although a disk pool for thin provisioning may contain more drives than a RAID group allows, the maximum size of a single LUN may not exceed 14TB.

 

 


What is a SAN and how do I sell it ?

What is a SAN? A SAN is a storage area network, with the sole purpose of providing dedicated block-level storage to a server environment. A SAN provides a central point of management for server storage, flexibility as to how that storage is managed, and it addresses the whole problem of under-utilised pools of storage which you get when giving direct attached storage to servers on a one-to-one basis.

 

With the ever increasing interest in virtualisation technologies such as VMware, virtualisation is driving more and more storage opportunities, as centralised storage is a must-have requirement for features such as High Availability and VMotion and the like.

 

So, you've established an opportunity; Mr Customer tells you they want to buy a SAN. What next?

First things first, why do they want a SAN? Typical reasons would be: they have direct attached storage for each server and it's a nightmare to manage; they're implementing or have implemented a virtualisation technology and need a SAN to unlock features around high availability, load balancing, etc.; they have an old SAN which is obsolete and simply won't scale to the capacity or performance they need; or, our favourite, they bought a SAN based on a price driver only and their IT systems are suffering!

So the first step is something of a data gathering exercise, and really you need to glean the following information:

The Data

What type of data will they be storing (file data/application data) ?

What applications will they be giving storage to ?

Performance requirements of these applications (based on perfmon/IO stat figures if possible).

Remember that different types of drives have different performance ratings. SATA drives, being higher density, lower performance drives, are suited to sequential type data (i.e. file data, backup to disk, archive, streaming).

Applications which will be accessing storage with data of a more random profile (i.e. SQL, Exchange, Oracle, databases in general) will usually require something with a bit more grunt, such as fibre channel or SAS drives, which typically come in 10k RPM and 15k RPM flavours.

The rule of thumb being more drives = more performance (there is a science to this, so understanding the applications is key). Sometimes a good indicator is to look at a given server and the number of drives (DAS) it is using; if it is performing as it should, then ensure it is given drives which will deliver equal or greater IOPS in the SAN to meet its performance requirements. Also, try to keep data with a sequential profile on separate disks from data with a random profile.

What is the current Capacity requirement ? and what is the expected growth over 3 years ?

How many LUNs (volumes) will be required for the customer environment ?

Connectivity

Do they want to utilise their existing infrastructure and use iSCSI, or have a dedicated storage area network using fibre channel?

How many servers will be attaching to the SAN ?

What kind of IOP/Bandwidth load will each server be putting on the storage network ?

Most customers will want a redundant infrastructure, so make sure you eliminate single points of failure from the environment. This means ideally two NICs or HBAs per server, two switches, failover/multipathing software, etc. With this in mind, make sure that you size the number of ports required on fibre/layer 2 switches accordingly (taking into account ISLs), for example as sketched below.
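As a simplified example of that port-count exercise (my own assumptions on array ports, ISL count and growth headroom, not an EMC sizing rule):

```python
import math

# Quick, simplified port-count check for a dual-fabric design: one HBA from each
# server per fabric, plus array front-end ports and ISLs on each switch.
# The defaults below are placeholder assumptions.

def ports_per_switch(num_servers, array_ports_per_switch=2, isl_ports=2, headroom=0.2):
    server_ports = num_servers            # one HBA port from each server per fabric
    total = server_ports + array_ports_per_switch + isl_ports
    return math.ceil(total * (1 + headroom))   # leave some growth headroom

print(ports_per_switch(num_servers=12))   # ports needed on each of the two switches
```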

Replication

In the spirit of minimising risk within a customer environment, customers are increasingly looking to local and remote replication technologies.

Local recovery would be achieved by taking a point-in-time or full copy of a given volume within the storage array, so that if data is corrupted, a previous non-corrupt version of the data from a prior point in time can be utilised.

If doing this, ensure the appropriate capacity is accounted for on the storage array:

If using clones, then each clone requires the full capacity of the source volume it is associated with. If snapshots are being used, then each point in time ideally requires 2x the amount of data that changes on the source volume during the life of the point-in-time copy. Normally point-in-time copies would only exist for a number of hours until the next point-in-time image is taken. It is advised that volumes used for snapshotting reside on separate drives from their respective source volumes, as there is an IO overhead involved in snapshotting. The sketch below shows the sums.
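A small sketch of that capacity accounting, using the rule of thumb above and made-up figures; the function names are my own.

```python
# Clones consume the full source capacity per clone, while snapshot reserve is
# sized here at twice the data expected to change during the life of each
# point-in-time copy (the rule of thumb quoted in this post, not a formal
# requirement).

def clone_capacity_gb(source_gb, num_clones):
    return source_gb * num_clones

def snapshot_reserve_gb(changed_gb_per_interval, concurrent_snapshots):
    return 2 * changed_gb_per_interval * concurrent_snapshots

# Example: a 500GB source with 2 clones, plus hourly snapshots changing ~20GB
# each, keeping 4 snapshots at a time.
print(clone_capacity_gb(500, 2))          # 1000 GB for clones
print(snapshot_reserve_gb(20, 4))         # 160 GB of reserved LUN capacity
```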

Remote Replication

If the customer wants to implement a remote DR solution, sometimes customer expectations of what is achievable within a given price bracket are slightly out of line with reality. So it's important to sit with the customer and discuss things like:

  • How much data can I afford to lose before my business is adversely affected? (RPO – Recovery Point Objective)

 

  • How much time can I allow for this volume/application to be offline before it becomes unacceptable (RTO – Recovery Time Objective)

The smaller the RTO and RPO, the greater the cost to the customer.

If the customer wants to replicate at array level, using something like EMC's MirrorView/Synchronous product for example, these are the considerations.

Application response time on critical applications is always going to be the key consideration. If replicating synchronously, the write from the application is sent to the storage array, and before a write acknowledgement is sent back to the application, the write must have been committed to disk on both the local and remote storage systems. If the response time requirement is around 5ms, then that write needs to have travelled the link, committed to disk on the DR array, been acknowledged back to the production storage and been committed to the production disk, all within 5ms; this means fast spinny disks and a very good link. If this is not sized properly, applications will time out and database admins will scream bloody murder. The quick budget sketch below shows the idea.
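A back-of-the-envelope version of that latency budget, with purely illustrative figures for link latency and disk commit times:

```python
# The synchronous write must cross the link, commit on the DR array, be
# acknowledged back, and commit locally, all inside the application's
# response-time requirement. All figures below are illustrative assumptions.

def sync_write_time_ms(one_way_link_ms, remote_commit_ms, local_commit_ms):
    round_trip = 2 * one_way_link_ms
    return round_trip + remote_commit_ms + local_commit_ms

budget_ms = 5.0
actual = sync_write_time_ms(one_way_link_ms=1.5, remote_commit_ms=1.0, local_commit_ms=1.0)
print(actual, "ms -", "within budget" if actual <= budget_ms else "over budget")
```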

For volumes which may not require such an aggressive RPO, asynchronous replication may be the way forward. So let's take a look at EMC's MirrorView/Asynchronous product.

  

Firstly, the source volume is synchronised in its entirety with its respective opposite volume on the DR site. After this synchronisation has taken place, MirrorView will take point-in-time images of the source volume at specified intervals and replicate only the delta changes to the remote site. The benefit of this is that the application receives acknowledgement of a write once it has been written to the local disk, and is not dependent upon the whole replication process.

There are other methods, such as continuous data protection (journaling), and methods which ensure application transactional consistency, but I will come back to those in another post.