Friday, September 14, 2012

Testing the theory - DR Testing Takeaways

Everyone says you need to do it - hey, I even said you should do it - but how many of us really test our disaster recovery processes? I mean from start to finish, warts and all.  We've been testing our plans for years, but somehow never really got to putting everything together into a single full-scale "get out there and do it" kind of test.  We've been able to extrapolate and make assumptions, and generally be pretty happy about our capabilities should the world of BAU come to an end, but we could never say for sure that, in the end, it would all hang together.

We decided, particularly on the back of some fairly big updates this year, that we really needed to do a full-on, no-holds-barred test by taking our primary centre fully offline and then seeing how it all went. Here are some of the things we learned:

Be Prepared?
Yes, it's a question and not a statement - you really want to make a decision on just how "prepared" you want to be.  When we talked to our partners, we found everyone wanted to "plan" this test, and that's something you need to be a little careful of.  Having everyone and everything in the best possible places to achieve a successful outcome might well be what you are used to doing, but in this case you risk lulling yourself into a false sense of security: in a real disaster, all the prep time you had has already gone.

What you DO need to be prepared for, however, and prepared as well as possible, is a clean and rapid rollback.  If something goes disastrously wrong with the test, or if circumstances mean that safety or the business is put at risk, you need to make sure that however you simulate the disaster, you can "unsimulate" it as fast as possible.
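To make that concrete, here is a minimal sketch of what a scripted rollback checklist might look like. The steps and function names are hypothetical placeholders rather than our actual tooling; the point is simply that the rollback path should be written down, ordered, and able to stop and escalate the moment a step fails.

#!/usr/bin/env python3
"""Hypothetical rollback checklist sketch - the steps below are placeholders,
not any real environment's commands."""
import sys


def reconnect_wan_links():
    # Placeholder: re-enable the internet/WAN connections that were
    # physically or logically disconnected for the simulation.
    return True


def revert_phone_diverts():
    # Placeholder: remove the diverts that sent critical lines to mobiles.
    return True


def restart_primary_services():
    # Placeholder: bring primary-site systems back into production.
    return True


ROLLBACK_STEPS = [
    ("Reconnect WAN and internet links", reconnect_wan_links),
    ("Revert phone diverts to primary lines", revert_phone_diverts),
    ("Restart primary-site services", restart_primary_services),
]


def run_rollback():
    for description, action in ROLLBACK_STEPS:
        print(f"[rollback] {description}...", end=" ")
        ok = action()
        print("OK" if ok else "FAILED")
        if not ok:
            # Stop and escalate rather than ploughing on with a broken rollback.
            sys.exit(f"Rollback halted at step: {description}")


if __name__ == "__main__":
    run_rollback()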

Accept Failure
Whatever you think going into it, there are going to be things that don't work as you thought they should.  To be honest, I'd be more worried if everything DID work as it was supposed to - if so, you probably missed something!  If your test is realistic (and not planned primarily to highlight the best bits of your DR plan!) then there should always be things you can learn, even if it's just an opportunity to speed something up.

When writing the test plan, it helps to have someone who didn't design the recovery procedure recommend the scenario - and if you can resist it, try not to overthink the situation.

Record everything
Have someone who is not involved in the recovery act as referee; they can stay out of the hustle and bustle of trying to make things work, and they will actually have the time to write things down as the test progresses. The referee is also a great pair of eyes for other opportunities to improve processes that might be missed by those who are in the middle of it all.

So.. How did it go?
I guess you are all wondering how it went for us, then?  It would be unfair of me to preach the things above and not tell our story - so here goes...

We planned our DR event for late at night, when we only have a few staff around and the impact to customers would be minimal - maybe not as big and scary as the middle of the day, but the systems and processes are identical, so it's still a valid test.

We simulated a complete loss of our contact centre and data systems, and at 11:18pm we pulled the plug (quite literally in some cases) on our internet, phone lines and external WAN connections.  Simultaneously we killed the lights and the staff had to get themselves out and into cabs to our DR site (diverting critical lines to mobiles as they went).

Once at the DR site, the fun began....

Overall we had a successful test - it was a great validation of the work we'd done over the last year - but the real value came in the things that didn't go 100% to plan.  It took longer than expected for some of our recovery servers to come up, something only a realistic test would show; we've since reorganised the startup process.  One of our backup telephony servers also decided not to fail over cleanly (even though the previous six tests were flawless).  We'd never seen the issue before, but now we know about it we're better placed for next time (test or real event).
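On the startup reorganisation, a simple way to think about it is as a dependency ordering problem: start each service only once the things it relies on are already up. The sketch below uses hypothetical service names and dependencies (not our actual environment) purely to illustrate the idea.

"""Hypothetical sketch of dependency-ordered recovery startup.
Service names and dependencies are made up for illustration."""
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each service maps to the set of services it depends on.
DEPENDENCIES = {
    "database": set(),
    "telephony": {"database"},
    "crm": {"database"},
    "agent-desktop": {"telephony", "crm"},
}


def startup_order(deps):
    """Return an order in which every dependency starts before its dependants."""
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    for service in startup_order(DEPENDENCIES):
        # Placeholder: replace this print with whatever actually starts the
        # service (hypervisor API, init system, orchestration tool, etc.).
        print(f"starting {service}")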

One of the most surprising things, though, was how engaged the staff who participated in the exercise were.  Even though they were "off the clock" by later in the evening (morning!), many were keen to stay on and help out even when they weren't specifically needed anymore (we had a second crew back at HQ who took over once the testing was complete and we'd "rolled back" - this avoided additional delays while we got staff back to base).

If you want to hear more about our test (particularly if you are a client of ours), drop me a line - I'll be happy to tell you in more gory detail!  But I'd like to leave you with one final, and most important, lesson from this exercise.  I've said it before, and I'll surely say it again...

Test it. Test it again
No amount of talking about it, looking at diagrams, or testing parts of your DR process is anywhere near as valuable as taking the risk to do a full-scale test.  If you've never done it (or only done it part way), make a resolution to yourself to prove it.  What's the worst that can happen?  If you keep your primary site/systems ready to go, not a lot - but you sure will learn where you need to focus your efforts. Once that's done, start thinking about doing it all over again... best of luck!

Steve Hennerley
GM IS, Telnet
