
Welcome to ineedhits Blog

Welcome to the ineedhits Search Engine Marketing blog, where we share the latest search engine and online marketing news, releases, industry trends and great DIY tips and advice.



Monday, June 19, 2006

Disaster Recovery Planning – When you should be Planning to Fail

Posted by @ 10:03 pm

In one of my lectures at university, the lecturer said something in regard to strategic planning that has always stayed with me. That was:

“Many businesses fail to plan, but few businesses plan to fail”.

Well, I am saying that you should have a plan for when things fail – your Disaster Recovery Plan (DRP). A DRP is something that any company that has production IT systems should have in place.

DRP is not just about having a backup plan for data. It goes further and addresses what you do with the data you have backed up and what hardware you put it on. It takes into account the likely costs and associated business issues that come with downtime. It is not an IT document but rather a business document.

Let me provide a real life example, which highlights the value of having a plan in place and how “Murphy’s Law” can apply at times:

On Tuesday 13th June 2006, the ineedhits.com website suffered a number of hours of downtime whilst we implemented our own disaster recovery plan, at an individual server level. Whilst the overall circumstances are quite complex, they are best understood by looking at the timeline below:

4pm Friday 9th June: ineedhits’ production SQL server reported failure on a RAID 5 drive. Those of you with an IT background will know that RAID 5 offers a level of redundancy that allows one disk in the array to fail without loss of data.
A new disk was ordered and, under our agreement with our hardware manufacturer, would be delivered next business day.
Unfortunately, the next business day was Tuesday 13th June 2006, due to a local public holiday where our data centre is located, which is in a different state from ineedhits’ head office.
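
For readers without an IT background, the single-disk tolerance of RAID 5 can be illustrated with a minimal Python sketch. It is a simplification, not how our controller works internally: real RAID 5 stripes data and rotates the parity across all disks, and the block contents below are invented for illustration.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical data blocks, one per disk, plus a parity block.
data_blocks = [b"ORDERS01", b"CLIENTS2", b"INVOICE3"]
parity = xor_blocks(data_blocks)

# One disk lost: XOR of the survivors and the parity rebuilds the missing block.
rebuilt = xor_blocks([data_blocks[0], data_blocks[2], parity])
print(rebuilt == data_blocks[1])

# Two disks lost: the survivors only yield the XOR of the two missing blocks,
# which is not enough to recover either of them.
partial = xor_blocks([data_blocks[0], parity])
print(partial in (data_blocks[1], data_blocks[2]))
```

The first check succeeds and the second does not, which is why a second drive failing before the replacement arrives is so serious.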

2pm Monday 12th June: the same server reported a high probability of failure on one of the “mirrored” drives within this machine. A call was logged raising the urgency of the replacement drive(s); however, the public holiday again slowed progress.
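
The effect of the public holiday on a “next business day” delivery can be sketched in a few lines of Python. The dates come from the timeline above; the function name and the idea of keeping a separate holiday calendar per location are assumptions made for illustration, not part of any contract wording.

```python
from datetime import date, timedelta

def next_business_day(start, holidays):
    """Return the first day after `start` that is not a weekend or a holiday."""
    day = start + timedelta(days=1)
    while day.weekday() >= 5 or day in holidays:  # 5 = Saturday, 6 = Sunday
        day += timedelta(days=1)
    return day

ordered = date(2006, 6, 9)                   # Friday, when the disk was ordered
data_centre_holidays = {date(2006, 6, 12)}   # public holiday at the data centre
head_office_holidays = set()                 # no holiday at head office

print(next_business_day(ordered, data_centre_holidays))  # 2006-06-13
print(next_business_day(ordered, head_office_holidays))  # 2006-06-12
```

The same order yields a different delivery date depending on whose calendar applies, which is worth confirming before a failure rather than after.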

12noon Tuesday 13th June: A second drive in the array reported failure. The machine stopped responding. The maintenance banner was placed on the site whilst we examined our options.
Our plan called for a full copy of the database to be copied down to our alternative data centre via a secure VPN tunnel. Even with a high speed link, this took multiple hours to achieve, finishing in the very early hours of the morning. In the meantime, we double checked the security and patch levels on our backup SQL server and brought them up to date.
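
Why a copy over a fast link still takes hours is easy to estimate in advance. The Python sketch below is a rough calculation only: the database sizes, link speeds and the 80% efficiency factor are hypothetical figures chosen for illustration, not our actual numbers.

```python
def transfer_hours(size_gb, link_mbps, efficiency=0.8):
    """Estimate hours needed to copy `size_gb` gigabytes over a `link_mbps` link.

    `efficiency` is an assumed allowance for VPN and protocol overhead.
    """
    size_megabits = size_gb * 1000 * 8
    effective_mbps = link_mbps * efficiency
    return size_megabits / effective_mbps / 3600

# Hypothetical combinations of database size (GB) and link speed (Mbit/s).
for size_gb, link_mbps in [(20, 10), (50, 10), (50, 100)]:
    hours = transfer_hours(size_gb, link_mbps)
    print(f"{size_gb} GB over {link_mbps} Mbit/s: about {hours:.1f} hours")
```

Running the numbers for your own database size and link gives a realistic lower bound on recovery time before any restore work even starts.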

Wednesday 14th June 2006: The restore was completed on Wednesday morning and site connectivity restored.
The first hardware technician replaced one of the failed drives in the array. Unfortunately this person was a Tier 1 level support person and did not have a great deal of experience or knowledge.

Thursday 15th June 2006: A more experienced Tier 2 support engineer arrived and replaced the SCSI backplane, as well as the failed mirror drive. He used his initiative and brought the backplane with him, as two drives failing in a server less than 4 months old (from a name brand vendor) is highly unusual.
A rebuild of the array was commenced.

Friday 16th June 2006: The rebuild of the array completed but showed corruption of the data on the drive.
The decision was made to rely on backups and continue running on our alternative data centre until the main production server reliability could be assured.
As such, a 60 hour long “stress test” was applied to this server over the weekend.

Monday 19th June 2006: With confidence restored in the server after it passed the stress test, the entire process that started on Tuesday 13th June and finished on Wednesday 14th June had to be reversed.

Tuesday 20th June: All systems appear to be up and running. However, if you are experiencing an issue, I strongly urge you to contact the ineedhits’ customer care team and they will gladly assist.

I would like to stress that ineedhits’ data has not been compromised by an external party. All data remains in a secure encrypted state. Thanks to having our plan in place, we were able to proceed with an acceptable downtime and with minimal disruptions. It is always highly regrettable when our site is unavailable.

For that – and perhaps most importantly – I’d like to apologize to all our customers and site visitors for the inconvenience that this downtime may have caused. If you have any questions about orders you placed between Sunday 11th June and Wednesday 14th June 2006, please contact our customer care team!

Some hints / lessons learnt with DRP:

  1. Do not assume that you are immune from a failure. Plan to fail – at least in an IT sense.
  2. Check the definition of what is a “business day”. Generally these refer to the business day where your data centre is located and not the region where you sign your contract.
  3. Check your maintenance contracts extremely carefully. Generally they will have phrases such as “commercially best efforts” in there with regard to replacement parts, unless you have paid extra.
  4. A maintenance contract is important – however it is only as good as the person responding to the call out. It is a matter of luck as to the level of experience, knowledge and customer focus that the person who responds to your call will have. Be thankful when you get a good person, do the right thing by them and acknowledge them to their management. If you get a person who does not meet expectations, then this feedback also needs to be provided, in a calm and rational way.
  5. If you have a database, make sure you know HOW this data is being backed up. Is it a point-in-time backup or a real-time one? Point in time means that a “snapshot” of the database is taken at that point in time (generally once a day). Should a failure occur, you roll back to that point in time and lose any data written since the snapshot. Real time means that every transaction to the database is backed up as it happens.
  6. Practice your DRP! I strongly suggest you practice your DRP to make sure the plan is feasible.
  7. Communicate as effectively as you can to all parties who have vested interests during downtime. This helps ensure expectations are being set correctly.
  8. Be realistic – there is no point aiming for 10 minutes’ worth of downtime if it takes 30 minutes for your backup server to be turned on, have its logical drives mounted and be ready to start taking orders. See point 6.
  9. Identify which systems have a dependency on others and use this to identify single points of failure, e.g. what if your firewall goes down? Does this mean your email, which is also your fax server, stops working, which is your primary way of accepting orders?
  10. Finally, if you have to put your DRP into place, determine how effective it was.
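
The dependency check in point 9 can be done on paper, but a small Python sketch shows the idea. The firewall, email/fax server and order intake chain follows the example in that point; the website and SQL server entries, and the function names, are assumptions added for illustration. The sketch also assumes that no component has a redundant alternative.

```python
def transitive_dependencies(service, depends_on, seen=None):
    """Collect everything `service` relies on, directly or indirectly."""
    seen = set() if seen is None else seen
    for dep in depends_on.get(service, []):
        if dep not in seen:
            seen.add(dep)
            transitive_dependencies(dep, depends_on, seen)
    return seen

def impact_of_failure(depends_on):
    """Map each component to the set of services that stop if it fails."""
    impact = {}
    for service in depends_on:
        for dep in transitive_dependencies(service, depends_on):
            impact.setdefault(dep, set()).add(service)
    return impact

# Hypothetical dependency map: each service lists what it relies on.
depends_on = {
    "order intake": ["email and fax server"],
    "email and fax server": ["firewall"],
    "website": ["firewall", "SQL server"],
    "firewall": [],
    "SQL server": [],
}

impact = impact_of_failure(depends_on)
for component, affected in sorted(impact.items(), key=lambda kv: -len(kv[1])):
    print(f"{component} fails -> {sorted(affected)} stop working")
```

Components that take down the most services are the first candidates for redundancy or a documented workaround in the plan.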

Do not fall into the trap of thinking that it won’t happen to you or that DRP is not for small businesses. It is!

If you plan to fail you will also plan to get back up and running!









Discussion (2 - comments)

I did face the problem of disk crash and i never thought it would be so difficult to get to the solution and after much efforts i sent it to Disk Doctors Labs Inc where my Disk Was recovered

By Robin - June 29, 2006



Hi Robin,

We looked at that as an option but the cost of doing that was extremely high. Each of the disks in the array would have had to be provided to the data recovery expert, who would then charge “per meg” of data recovered.

With the cost of IDE hard drives coming down, I strongly recommend people use mirror drives in their machines or even back up to USB thumb drives. Simple and cheap.

Warren

By Warren Duff - June 29, 2006



