In one of my lectures at university, the lecturer said something about strategic planning that has always stayed with me. That was:
“Many businesses fail to plan, but few businesses plan to fail”.
Well, I am saying that you should have a plan for when things fail: your Disaster Recovery Plan. DRP is a process that any company that has production IT systems should have in place.
DRP is not just about having a backup plan for data. It goes further, addressing what you do with the data you have backed up and what hardware you restore it onto. It takes into account the likely costs and associated business issues that come with downtime. It is not an IT document but rather a business document.
Let me provide a real life example, which highlights the value of having a plan in place and how “Murphy’s Law” can apply at times:
On Tuesday 13th June 2006, the ineedhits.com website suffered a number of hours of downtime whilst we implemented our own disaster recovery plan at an individual server level. Whilst the overall circumstances are quite complex, they are best understood by looking at the timeline below:
4pm Friday 9th June: ineedhits’ production SQL server reported the failure of a drive in its RAID 5 array. Those of you with an IT background will know that RAID 5 offers a level of redundancy that allows one disk in the array to fail without data loss.
A new disk was ordered and, under our agreement with our hardware manufacturer, was to be delivered the next business day.
Unfortunately, the next business day was Tuesday 13th June 2006, due to a localized public holiday where our data centre is located, which is in a different state from ineedhits’ head office.
2pm Monday 12th June: the same server reported a high probability of failure in one of the “mirrored” drives within this machine. A call was logged raising the urgency of the replacement drive(s); however, the public holiday again slowed progress.
12 noon Tuesday 13th June: A second drive in the array reported failure. The machine stopped responding. The maintenance banner was placed on the site whilst we examined our options.
Our plan called for a full copy of the database to be copied down to our alternative data centre via a secure VPN tunnel. Even with a high speed link, this took multiple hours to achieve, finishing in the very early hours of the morning. In the meantime, we double checked the security and patch levels on our backup SQL server and brought them up to date.
Wednesday 14th June 2006: The restore was completed on Wednesday morning and site connectivity restored.
The first hardware technician replaced one of the failed drives in the array. Unfortunately this person was a Tier 1 level support person and did not have a great deal of experience or knowledge.
Thursday 15th June 2006: A more experienced Tier 2 support engineer arrived and replaced the SCSI backplane, as well as the failed mirror drive. He used his initiative and brought the backplane with him, as two drives failing in a server less than 4 months old (from a name brand vendor) is highly unusual.
A rebuild of the array was commenced.
Friday 16th June 2006: The rebuild of the array completed but showed corruption of the data on the drive.
The decision was made to rely on backups and continue running from our alternative data centre until the main production server’s reliability could be assured.
As such, a 60-hour “stress test” was applied to this server over the weekend.
Monday 19th June 2006: With confidence restored in the server after passing the stress test, the entire process that began on Tuesday 13th June and finished on Wednesday 14th June had to be reversed.
Tuesday 20th June: All systems appear to be up and running. However, if you are experiencing an issue, I strongly urge you to contact the ineedhits’ customer care team and they will gladly assist.
I would like to stress that ineedhits’ data has not been compromised by an external party. All data remains in a secure encrypted state. Thanks to having our plan in place, we were able to proceed with an acceptable downtime and with minimal disruptions. It is always highly regrettable when our site is unavailable.
For that, and perhaps most importantly, I’d like to apologize to all our customers and site visitors for the inconvenience that this downtime may have caused. If you have any questions about orders you placed between Sunday 11th June and Wednesday 15th June 2006, please contact our customer care team!
Some hints / lessons learnt with DRP:
Do not assume that you are immune from a failure. Plan to fail – at least in an IT sense.
Check the definition of what is a “business day”. Generally this refers to a business day where your data centre is located, not in the region where you sign your contract.
Check your maintenance contracts extremely carefully. Generally they will have phrases such as “commercially best efforts” in there, with regard to replacement parts, unless you have paid extra.
A maintenance contract is important; however, it is only as good as the person responding to the call out. It is a matter of luck as to the level of experience, knowledge and customer focus that the person who responds to your call will have. Be thankful when you get a good person, do the right thing by them and acknowledge them to their management. If you get a person who does not meet expectations, then this feedback also needs to be provided, in a calm and rational way.
If you have a database, make sure you know HOW this data is being backed up. Is it a point-in-time backup or a real-time one? Point in time means that a “snapshot” of the database is taken at that point in time (generally once a day). Should a failure occur, you roll back to that point in time and lose everything written since the snapshot. Real time means that every transaction to the database is backed up as it happens.
Practice your DRP! I strongly suggest you run through it end to end to make sure the plan is feasible.
Communicate as effectively as you can to all parties who have vested interests during downtime. This helps ensure expectations are being set correctly.
Be realistic: there is no point aiming for 10 minutes’ worth of downtime if it takes 30 minutes for your backup server to be turned on, mount the logical drives and be ready to start taking orders. See the point above about practicing your DRP.
Identify which systems have a dependency on others and use this to identify single points of failure. For example, what if your firewall goes down? Does this mean your email stops working, and with it the fax server that relies on it, which is your primary way of accepting orders?
Finally, if you have to put your DRP into place, determine how effective it was.
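The dependency point in the list above lends itself to a quick check in code. What follows is a minimal Python sketch for illustration only: the system names and the `depends_on` map are hypothetical, loosely modelled on the firewall, email and fax example, and it assumes every dependency is a hard one with no redundant alternative. For each component it works out everything that would stop working if that component failed.

```python
from collections import defaultdict


def impacted_by(depends_on):
    """Map each component to the set of systems that stop working if it fails.

    `depends_on` maps a system to the systems it directly relies on.
    Assumption: every dependency is hard (no standby or redundant path).
    """
    # Invert the map: component -> systems that rely on it directly.
    dependents = defaultdict(set)
    for system, requirements in depends_on.items():
        for requirement in requirements:
            dependents[requirement].add(system)

    impact = {}
    for component in set(depends_on) | set(dependents):
        affected = set()
        stack = [component]
        # Walk outwards through direct and indirect dependents.
        while stack:
            current = stack.pop()
            for system in dependents.get(current, ()):
                if system not in affected:
                    affected.add(system)
                    stack.append(system)
        affected.discard(component)  # guards against circular dependencies
        impact[component] = affected
    return impact


# Hypothetical example: orders arrive by fax, fax relies on email,
# and email relies on the firewall.
depends_on = {
    "order intake": ["fax server"],
    "fax server": ["email"],
    "email": ["firewall"],
    "firewall": [],
}

impact = impacted_by(depends_on)
# List the components whose failure hurts the most first.
for component in sorted(impact, key=lambda c: (-len(impact[c]), c)):
    affected = ", ".join(sorted(impact[component])) or "nothing else"
    print(f"If {component} fails, it takes down: {affected}")
```

A component with a long list of affected systems and no standby is a candidate single point of failure, and a good place to focus the plan.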
Do not fall into the trap of thinking that it won’t happen to you or that DRP is not for small businesses. It is!
If you plan to fail you will also plan to get back up and running!