This article will answer the following questions:
- What are the major IT risks that can cause disruption in our business?
- Is testing essential for continuity management?
- What levels of preparedness are there?
- What are best practices for backing up our systems?
- How often should we back up our systems?
- What is a Recovery Point Objective (RPO)?
- Is a business impact assessment useful for our company?
- How do I develop a Disaster Recovery Plan (DRP)?
- Who should I involve in creating a DRP?
Risk Part IV: Business Continuity Management
In an ideal world, the IT cybersecurity manager will know the major continuity risks a company faces and will have preparations in place to make sure nothing ever disrupts critical processes. But if a disaster does occur, the cybersecurity manager needs a plan for that as well. We’re covering business continuity management in this chapter from a cybersecurity angle. The cybersecurity manager won’t be alone and should cooperate with other business units to cover other types of continuity risks besides cybersecurity.
Planning for those disasters is known as business continuity management: the high-level planning for how to recover, restore normalcy, and minimise the loss after a serious incident. For instance, if you’re running a factory and there’s a flood, you might have to pause production for a week. That’s major. Water damage might be one risk; fire might be another. Of course, how probable these types of disaster are depends on where your business is located.
Most companies, however, face the risk of an IT problem disrupting their factories or core processes. Most businesses depend on IT running smoothly. That means computers, chips, software, networks—even the flow of money depends on everything working in concert. The question is, when something major disrupts IT and, in turn, interrupts critical production and processes, how will you react, minimise the loss, and restore operations to normal? And what preparations can be made to make sure nothing like that happens?
The simplest scenario that affects just about every business would be a communications breakdown. As we know, almost every company is using the cloud. This means that if a factory is up and running, it’s connected to the cloud, and its business processes depend on the cloud. What happens when there’s no network? It can’t help but disrupt some part of the business.
Business continuity management requires looking at a number of different pieces: the network connecting the offices, the production sites, and the data centre or cloud are the obvious ones, but not the only ones. Another area that deserves attention is the domain name system (DNS). This is usually critical to everything happening inside networks. If the DNS fails or someone takes it down, the systems cannot resolve host names. They cannot find each other unless they connect by IP address. Similar systemic dependencies exist in most IT environments. The key is to identify single points of failure and build redundancy to counter scenarios that could bring the business to its knees.
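As an illustration of hunting for single points of failure, here is a minimal Python sketch. It assumes you keep a simple inventory of which components each business process depends on and how many independent instances provide each component. All process names, component names, and instance names below are hypothetical, and a real environment would need a far richer model than this.

```python
from collections import defaultdict

# Hypothetical inventory (illustrative names only):
# each business process lists the components it depends on...
DEPENDS_ON = {
    "order-processing": ["dns", "wan-link", "erp"],
    "factory-line": ["dns", "wan-link", "plc-gateway"],
}
# ...and each component lists the independent instances that provide it.
INSTANCES = {
    "dns": ["dns-1"],
    "wan-link": ["isp-a", "isp-b"],
    "erp": ["erp-cloud"],
    "plc-gateway": ["gw-1", "gw-2"],
}


def single_points_of_failure(depends_on, instances):
    """Map each component with fewer than two instances to the processes it would take down."""
    affected = defaultdict(list)
    for process, components in depends_on.items():
        for component in components:
            # A component missing from the inventory counts as having no redundancy.
            if len(instances.get(component, [])) < 2:
                affected[component].append(process)
    return dict(affected)


spofs = single_points_of_failure(DEPENDS_ON, INSTANCES)
for component, processes in sorted(spofs.items()):
    print(f"{component}: no redundancy, affects {', '.join(processes)}")
```

In this made-up inventory, the single DNS server stands out because both processes depend on it, which is the kind of shared dependency the paragraph above warns about.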
In a perfect world, the cybersecurity manager would have a budget to cover double and triple backups or to run a redundant server for all systems. Few do. Cybersecurity managers have to at least cover the critical functions; the difficulty lies in knowing which parts are critical.
Testing Is Essential
IT continuity management requires preparation, planning, testing, practising, and updating. A lot of companies focus on the plans but neglect the testing and practising parts because they’re harder and more expensive. But without testing, the company has no assurance that the backup plan actually works. Without practice, people won’t have the skills and experience to do what’s necessary to get everything back online in the event of a disaster.
It’s not unusual for a company to sink a million dollars into redundancy and backup hardware, or even establish a secondary disaster recovery site for IT, without ever testing the system. Will it work if something happens? No one knows.
We’ve seen numerous examples where a company had a primary processing site and a secondary site (a hot site and a cold site) and never tried to switch over and operate from the other site. If they don’t even know whether it works, what is the point?
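One way to make switchover testing routine is to script the drill so that each run leaves a record of what worked. The Python sketch below is only an outline under stated assumptions: the site names are invented, and the two check functions are stand-ins for whatever real verification a company would use (for example, running a test transaction against each site after the switch).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DrillResult:
    site: str
    passed: bool
    detail: str


def run_failover_drill(checks: dict[str, Callable[[], bool]]) -> list[DrillResult]:
    """Run one check per site and record the outcome instead of stopping at the first failure."""
    results = []
    for site, check in checks.items():
        try:
            ok = bool(check())
            detail = "ok" if ok else "check returned failure"
        except Exception as exc:  # a crashing check is itself a drill finding
            ok, detail = False, f"check raised {type(exc).__name__}: {exc}"
        results.append(DrillResult(site, ok, detail))
    return results


# Hypothetical stand-in checks; replace with real verification steps.
def primary_serves_traffic() -> bool:
    return True


def secondary_serves_traffic() -> bool:
    raise ConnectionError("secondary site not reachable")


results = run_failover_drill({
    "primary": primary_serves_traffic,
    "secondary": secondary_serves_traffic,
})
for r in results:
    print(f"{r.site}: {'PASS' if r.passed else 'FAIL'} ({r.detail})")
```

The point of the design is that a failed or crashing check is recorded rather than hidden, so an untested secondary site shows up as a finding in the drill report rather than as a surprise during a real incident.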
Levels of Preparedness
Disaster recovery preparedness exists on a spectrum. At the lowest level, the company does not have any redundancy. Everything’s running on one site, and that’s it. If a system or network goes down, they’re pretty much sunk and forced to recover any way they can. Quite often, recovery at this level of preparedness is slow; perhaps it takes a few days to get everything back on track.
The second level of preparation is to have some level of replication in place, like a cold site that holds data from the systems in a separate physical location or even in the cloud. Then, if there is an event, the other system can be brought to life, and network traffic can be routed to that new site. Ideally, systems can be brought back online in a relatively short amount of time, probably within a day or so.