Friday, March 2, 2012

The Art of Business Continuity – 5 Learnings

Many of our long-time customers will know that Telnet puts quite a bit of effort into business continuity and disaster planning – often going much further than companies of Telnet’s size would normally go, and expecting more from our suppliers than they are used to providing. For Telnet, it’s a simple equation – our customers simply expect that we’ll be able to continue providing a service – almost no matter what – so we put in the time and effort to make sure that this is, as far as practicable, what we deliver.

Along the way we’ve learned a few things, turned many corners, and watched the goalposts move – many, many times. We’re never actually “finished” with our implementations, and you quickly learn that this slippery, ever-changing beast known as BCP will never let you rest.

It’s a Journey not a Destination

It’s easy to get frustrated, depressed and downright annoyed at the fact that, almost immediately after you thought you had created a solution, someone gives you another problem – or a much easier way to solve the first one. Some solutions, in fact, take so long to put in place that by the time you’ve got there, technology, that fast-moving animal, has seen fit to move on.

This is just part of the game we are playing. Once you accept this, and realise you’ve never really got an endpoint – just a series of ever-improving “beta” solutions (a bit like Google!) – then you realise that this is in fact a better state of affairs, and you’ll always be that little bit better than you were. They’re not goalposts you are chasing, simply milestones along a very, very long road.

It does make it a little trickier to manage budgets and notify your users, but overall it’s just something you need to accept – the inevitable, and somewhat inescapable, scope creep. The only thing you can really do to stay sane is to make sure you always stay focused on my next point:

What are you trying to protect?

The obvious answer to this question – particularly if it’s on the table at an executive meeting – is “everything”. That may well be achievable depending on the size and complexity of your organisation (if you only have one laptop and can back it up to the cloud, you’re pretty much done), but more likely you have to decide what’s really important for your business, and whether there is only one way to do it or not.

For us, it is perhaps this question that has most driven the first point. “Back in the day”, when I first started with Telnet (almost 10 years ago now), the most important thing in the world was the ability to take phone calls, mainly for just our “most important” clients, with everything else (including capturing that we had in fact taken that call) as a secondary concern.

Roll forward a few years, and incrementally we’ve been adding feature after feature, gradually closing the gap between what we do as BAU and what we can do in an emergency. From having a few seats reserved in someone else’s call centre for our use (with their systems), we moved on to having our database environment replicated to a backup, then on to domain controllers, and eventually to having real-time cloud-based backups for almost all our line-of-business services. We’ve ended up with a second PABX and ACD server, and redundant, diverse fibre paths and telco links between us and our DR site, hosted by a well-known specialist BCP company in Albany!

Picking what to change and “improve” is always an interesting challenge, lying somewhere between “what our clients say we must do”, through “what our gut tells us” and into “what’s realistic to achieve” (passing somewhere near “what the latest conference/white-paper/news-report says we should do”). Of course, disasters such as the Christchurch earthquakes “help” by focussing the mind too – but that’s just the start of the process:

No, everyone else doesn’t have that issue too

When you’ve chosen what to change, you normally have a bit of a think, rationalise what it is you want to protect, and almost inevitably decide that everyone else must have this problem too – so you’ll just ask your suppliers/vendors/partners what you should do and make it all happen… right?

Now, I really don’t think that I have had any really “out there” ideas around BC – take, for example, wanting to have our internet access delivered in a way that means we can use it from either our primary or secondary site, allowing us to spin up the DR copies of our web servers with little or no interruption if we were to lose our main site. Sounds like a problem many people would have, doesn’t it?
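To make the idea concrete, here is a minimal sketch (in Python) of the kind of check that sits behind that sort of arrangement – probe the primary site’s web service and fall back to the DR copy only when the primary stops answering. The URLs and the “repoint DNS” step are hypothetical placeholders, not a description of our actual setup.

import urllib.request
import urllib.error

PRIMARY_HEALTH = "https://www.example.com/health"   # hypothetical primary-site check URL
DR_HEALTH = "https://dr.example.com/health"          # hypothetical DR-site check URL

def is_up(url, timeout=5.0):
    """Return True if the URL answers with an HTTP 2xx inside the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False

def choose_site():
    """Prefer the primary site; fall back to the DR copy only when the primary is down."""
    if is_up(PRIMARY_HEALTH):
        return "primary"
    if is_up(DR_HEALTH):
        return "dr"    # this is where you would repoint DNS or the load balancer
    return "none"      # both down: time to wake somebody up

if __name__ == "__main__":
    print("Active site:", choose_site())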

Another example is diversion of phone calls – we have well over 200 active phone numbers we answer calls on – surely most call centres ask their telcos for a way they can move this traffic at short notice to an alternate delivery location?
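As an illustration only (the numbers and file name below are made up, and this isn’t our actual system), one simple way to keep that kind of bulk diversion ready is to maintain a mapping of every answered number to its alternate delivery destination, and export the whole lot as a single request for the telco rather than 200+ ad-hoc changes. A short Python sketch:

import csv

# Hypothetical mapping of inbound numbers to their DR delivery numbers.
# In practice this would come from the ACD / number-management system.
DIVERSION_PLAN = {
    "+6491234567": "+6497654321",
    "+6492223333": "+6498887777",
}

def write_diversion_request(path="diversion_request.csv"):
    """Write the full set of diversions as one file the telco can act on at short notice."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["inbound_number", "divert_to"])
        for inbound, target in sorted(DIVERSION_PLAN.items()):
            writer.writerow([inbound, target])

if __name__ == "__main__":
    write_diversion_request()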

There are many other examples, but the point is never to assume this sort of thing – just because your supplier is the biggest, with the best reputation and range of clients, doesn’t mean for a second that they’ll “just know” how to solve a seemingly simple problem. In our case we spent months (and in some cases years) just trying to find a solution to an apparently simple, common problem that it seemed nobody else (in NZ at least) even had.

Test it, test it again

OK, OK – I know that when you started reading this you fully expected me to say this, and it IS obvious – but it’s so important I couldn’t miss it out, and not only for the blindingly obvious reason (i.e. does it work?). Simply going through the process opens up cans of worms, Pandora’s boxes and any other metaphor you can think of.

It’s the testing process that is one of the most direct drivers of the “journey” and “what to protect” phases, and the one that keeps you thinking about what the next stage should be. When you’re testing, you should think a little outside your test plan, get users to “try stuff” and make notes on what you learn – these are the jewels that will take you to the next peg.

I’ll not go on about this too much – we all know we need to do it, and none of us do it enough – but writing it here reminds you (and me!) once again that it needs to be done. And again. And once more, at least.

Make sure you never have to use it

The best thing you can do after expending all this effort, though (on an ongoing, cyclic, ad nauseam basis), is do your level best to make sure that you’ve just wasted all the money you’ve spent on kit, and services, and processes. Nothing keeps your business going better than not having a failure in the first place.

In the real world, of course, this is just as impossible as the concept of “finishing” your BCP – but the point is that no matter how much effort you put into backup systems and alternate ways of doing things, these will never be as good as never having to use them. Regardless of how good your BCP is, the regular patches and maintenance, the redundant hardware, and the monitoring software required to keep “Plan A” ticking along are more important now than ever!
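By way of illustration, “monitoring software” here can be as modest as something that regularly confirms the Plan A services are still answering, so problems are found and fixed long before the backup systems are ever needed. A hedged sketch in Python – the host names and ports are invented examples, not Telnet’s:

import socket

# Hypothetical "Plan A" services to keep an eye on.
SERVICES = {
    "PABX": ("pabx.example.local", 5060),
    "ACD server": ("acd.example.local", 8080),
    "Database": ("db.example.local", 1433),
}

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for name, (host, port) in SERVICES.items():
        state = "OK" if port_open(host, port) else "DOWN - look at it now, not during a DR event"
        print(f"{name}: {state}")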

I’ve just sent out to all Telnet’s customers the latest iterations of what we’ve been working on – there are quite a few things we’ve done in the last 12 months to move us up to the next level – but, as you might expect, there’s still more we can do. And of course that will be the way of things, not just for the near future or the foreseeable one: if we stop looking for improvements, we’ve simply given up – and that’s not a place we want to be!

The conclusion I have come to is that there simply isn’t one answer, or one approach, that works. There doesn’t even appear to be a consensus among service providers on how to do the things you might first think are obvious – providing the right solution for your business really is all about the art rather than the science.

Steve Hennerley

GM Information Systems @ Telnet
