This article will answer the following questions:

  • What are the major IT risks that can cause disruption in our business?
  • Is testing essential for continuity management?
  • What levels of preparedness are there?
  • What are best practices for backing up our systems?
  • How often should we back up our systems?
  • What is a Recovery Point Objective (RPO)?
  • Is a business impact assessment useful for our company?
  • How do I develop a Disaster Recovery Plan (DRP)?
  • Who should I involve in creating a DRP?

Risk Part IV: Business Continuity Management

In an ideal world, the IT cybersecurity manager will know the major continuity risks a company faces and will have preparations in place to make sure nothing ever disrupts critical processes. But if a disaster does occur, the cybersecurity manager needs a plan for that as well. We’re covering business continuity management in this article from a cybersecurity angle. The cybersecurity manager won’t be alone and should cooperate with other business units to cover continuity risks beyond cybersecurity.

Planning for those disasters is also known as business continuity management, the high-level planning for how to recover, restore normalcy, and minimise the loss after a serious incident. For instance, if you’re running a factory and there’s a flood, you might have to pause production for a week. That’s major. Water damage might be one risk; fire might be another. Of course, these types of disaster are improbable, depending on where your business is located.

Most companies, however, face the risk of an IT problem disrupting their factories or core processes. Most businesses depend on IT running smoothly. That means computers, chips, software, networks—even the flow of money depends on everything working in concert. The question is, when something major disrupts IT and, in turn, interrupts critical production and processes, how will you react, minimise the loss, and restore operations to normal? And what preparations can be made to make sure nothing like that happens?

The simplest scenario that affects just about every business is a communications breakdown. As we know, almost every company uses the cloud. This means that if a factory is up and running, it’s connected to the cloud, and its business processes depend on the cloud. What happens when there’s no network? It can’t help but disrupt some part of the business.

Business continuity management requires looking at a number of different pieces: the network connecting the offices, the production sites, and the data centre or cloud are the obvious ones, but not the only ones. Another critical area to focus on is the domain name system (DNS). This is usually critical to everything happening inside networks. If the DNS fails or someone takes it down, the systems cannot resolve host names. They cannot find each other unless they are using an IP address for connecting. Similar systemic dependencies exist in most IT environments. The key is to identify single points of failure and build redundancy to counter scenarios that could bring the business to its knees.

In a perfect world, the cybersecurity manager would have a budget to cover double and triple backups or to run a redundant server for all systems. Few do. Cybersecurity managers have to at least cover the critical functions; the difficulty lies in knowing which parts are critical.

Testing Is Essential

IT continuity management requires preparation, planning, testing, practising, and updating. A lot of companies focus on the plans but neglect the testing and practising parts because they’re harder and more expensive. But without testing, the company has no assurance that the backup plan actually works. Without practice, people won’t have the skills and experience to do what’s necessary to get everything back online in the event of a disaster.

It’s not unusual for a company to sink a million dollars into redundancy and backup hardware, or even establish a secondary disaster recovery site for IT, without ever testing the system. Will it work if something happens? No one knows.

We’ve seen numerous examples where a company had a primary processing site and a secondary site—a hot site and a cold site—and they never tried to switch over and operate it from the other side. If they don’t even know if it works, what is the point?

Levels of Preparedness

Disaster recovery preparedness exists on a spectrum; at the lowest level, the company does not have any redundancy. Everything’s running on one site, and that’s it. If a system or network goes down, they’re pretty much sunk and forced to recover any way they can. Quite often, this level of preparedness makes recovery slower; perhaps it takes a few days to get everything back on track.

The second level of preparation is to have some level of replication in place, like a cold site that holds data from the systems in a separate physical location or even in the cloud. Then, if there is an event, the other system can be brought to life, and network traffic can be routed to that new site. Ideally, systems can be brought back online in a relatively short amount of time, probably within a day or so.

Let’s say there is a primary data centre somewhere and the cybersecurity manager wants to make sure that by having a cold site, they can recover in a few days’ time. The data, and maybe the hardware, are in the cloud and some other physical location, and there’s a plan in place for making the switch. If the network goes down or a data centre is destroyed, the cybersecurity manager can start firing up those systems in critical order in the new place. This will require some installation work for the systems, restoration of data from backups, and so on. Not the fastest option for recovery.

A more expensive option for redundancy is to create a recovery hot site. In a hot site, everything is duplicated in another location that resembles the primary data centre. Data is copied almost in real time from site to site, with dual systems up and running at all times in case something goes down in the primary site. The big benefit of using a hot site is that the switchover to a backup is almost transparent to users, at least in theory. Because the second site is hot, it’s running all the time—perhaps with a bit fewer resources than the primary one but functional nevertheless. Of course, this level of redundancy costs a lot of money.

The absolute minimum is to have backups of all the important data in another location, not all in one place. If companies don’t do that, they’ve given up a valuable form of insurance. There’s nothing the cybersecurity manager can do if the data is lost entirely. But if they have backups somewhere else, they always have a chance to recover. Remember 9/11? A lot of companies went down with those two towers. The ones who had backups of their data and some capacity to rebuild their systems mostly made it through the disaster. But there were many that didn’t.

Threats to Continuity

Business continuity risks are perhaps the most important ones to prepare for. Losing the company’s data is one, of course. It’s amazing how many companies don’t have backups. Many think they do until they test it the first time, usually in the event of an actual disaster. Unfortunately, some of them find out the system was behaving in an unexpected way. Maybe it wasn’t really doing a backup, or maybe the backup was encrypted so that no one could get their hands on it, but the only encryption key was in the primary system. Or the decryption and restoration process turns out to take so long that it doesn’t matter. We’ve seen all of these scenarios happen, and whatever the reason, they happen a lot.

One hospital in Singapore demonstrated just how bad these scenarios can get. They had a problem with ransomware, like nearly every company; nobody is immune, it just varies how much damage the attack does to operations. This hospital got hit badly. They had ten locations running—all hospitals and clinics—and someone clicked a link in a phishing email, downloaded ransomware, and infected the receptionist’s computer.

After the initial infection, it took around three months until the hackers launched their attack. They had been infecting everything inside the network during that time. When the ransomware hit, it activated on seven of the ten sites, and on all of the servers and workstations. All of the data on the servers was encrypted. The company received a message telling them to contact a certain email address and pay a large sum of money to get the decryption key to get their data back.

Seven locations had no access to their office records or patient records or anything else. This is, of course, a continuity event. They couldn’t bill patients, couldn’t see customer data or anything. The IT manager went to restore data from the backups. They expected to be up and running again the next day, or in a few days’ time at least. But guess what? Their backups had been done with the same servers and technology that they used for storing the data normally. They had set up Windows shared folders and used software to back up data from servers and workstations to this backup share. The ransomware, of course, was able to find all the shared folders in the network and encrypt everything. So all of their backups were encrypted, too, by the same attackers. Data wasn’t copied to an offsite location away from the company’s operational IT systems. Big mistake.

So they had two options. The first one was to pay the ransom. The second one was to re-enter all the customer data manually from paper printouts. They finally opted to take whatever printouts they had in their archives and re-enter the data manually. Since they didn’t pay the ransom, they were never able to access their data, and they ultimately lost it all.

Whether it’s a ransomware attack or something else, the most common danger of IT disruptions is system downtime due to system malfunctions, unexpected changes, or any other unforeseen reason. Companies don’t understand how often this happens and what it means to the business. They think they have backups, so it doesn’t matter. When a problem occurs, fixing it costs many times more than preparing for it upfront would have. The lesson is, if companies do nothing else, they should do backups and do them often. End of story.

Scheduling Backups

Nobody wants to risk a total loss of data, so one task that always makes the critical list is to put a system in place for regular backups. “Regular” depends on the type of business and system. This obviously needs to be planned well because the number of systems, the data, the backup frequency, the restoration tactic, and every other detail has to go in there.

There are two things cybersecurity managers need to figure out when they’re planning a backup system. The first is what data is important and needs to be in the backups, like customer data or banking records or encryption keys for the backup. The second is how often to make the backup. This might not seem very important, so many people tell us they do it every Friday and assume that’s good enough. But if they’re collecting data 24/7, 365 days a year, weekly backups aren’t often enough.

What they don’t think about, in many cases, is that the date and time of a backup defines the point in time to which they can return, the latest data they can get. Think of it as a time machine that can go back only to the moment when the last backup was made. All progress after that moment can be lost forever! A problem on Saturday morning is fairly minor if there’s a backup from Friday night. But what if the disruption happens on Friday morning? Then they lose one full week of data, because they have to go back to last week’s backup. In business terms, that means the company has decided it’s okay to lose one week’s worth of data. In professional terms, we call this their Recovery Point Objective, or RPO. Their RPO is then seven days: up to a week’s worth of data can be lost, and the business accepts that. This is a fairly high-level decision in any company. A lot can happen in a week, and in most companies, that is not acceptable.
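The arithmetic behind an RPO is simple enough to sketch. A minimal illustration, assuming a weekly backup taken on Friday night (the dates below are made up for the example):

```python
from datetime import datetime

# Illustrative dates only: a weekly backup taken Friday at 22:00,
# and a disruption the following Friday morning.
last_backup = datetime(2024, 1, 5, 22, 0)
disruption = datetime(2024, 1, 12, 9, 0)

# Worst-case data loss: everything written since the last backup.
worst_case_loss = disruption - last_backup
print(f"Worst-case data loss: {worst_case_loss}")  # 6 days, 11:00:00
```

If that figure is unacceptable to the business, the backup frequency, not the business, is what has to change.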

The cybersecurity manager should advocate for frequent backups. If IT gets to define the RPO, backups will most likely be made every Friday for all systems, or maybe once a month with an incremental backup weekly for whatever data was changed. Quite often, they forget to ask the executives what is actually an acceptable amount of data loss that the business can bear. The wise cybersecurity manager makes sure that this discussion takes place and that everyone understands the implications.

Underestimated Likelihood

If there’s a backup system in place, a lot of people in the company are going to ask the cybersecurity manager why they should do more than that. They’ve got this backup system, but they don’t see why they need a plan around it. The IT security manager or cybersecurity manager has to answer that question.

Why does the company need a backup plan, or why should they test backups? Why spend on redundancy, or on something as expensive as hot site replication? Because the probability of IT failures and cyberattacks is fairly high, much higher than that of conventional disasters like fires or floods. These things happen almost yearly, sometimes many times a year. Cyberattacks occur routinely; a fire happens in a business building maybe once in fifty years, on average. Yet most office buildings are mandatorily equipped with fire exits, fire alarms, sprinkler systems, and so on, measures that are often far more expensive yet so familiar that people overlook their importance.

Business Impact Assessment

One effective way to demonstrate the need for a plan is to do a business impact assessment (BIA). Doing a BIA lets people see what would happen in different scenarios. For instance, the cybersecurity manager could work with a technical person and a business person, and challenge them to imagine what would happen if a certain IT service or system was not available. What if it’s down for a couple of hours? A day? A week? What happens to the business? By considering these scenarios, the participants will see that the longer the downtime, the bigger the impact on the business.

In most cases, in two hours, probably not that much happens. The service desk handles the customers with apologies. What if the outage lasts a week? How would people respond? Is there any way to do things manually for a longer period of time? What plans should be in place for this situation? Writing these procedures down and estimating their cost and effectiveness creates a BIA.

The BIA can be extremely useful, but most companies don’t do it. Even companies that recognise the critical nature of certain systems don’t usually do a BIA.

Companies usually begin doing BIAs when they grow to have hundreds of IT systems in place, and they need to classify them and decide which ones are really critical. A BIA can help them do that. At first, they may list thirty to fifty items as critical, but after doing BIAs, they might be able to narrow it down to ten critical systems. Not everything can be considered critical. The BIA has the power to help companies prioritise.
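The prioritisation a BIA produces can be captured in something as simple as a table of estimated losses per outage duration. A minimal sketch, with entirely hypothetical systems and figures:

```python
# Hypothetical systems and loss estimates (in euros) per outage duration,
# as they might come out of BIA workshops with business and technical people.
bia = {
    "payment processing": {"2h": 10_000, "1d": 250_000, "1w": 3_000_000},
    "intranet wiki":      {"2h": 0,      "1d": 1_000,   "1w": 20_000},
    "order entry":        {"2h": 5_000,  "1d": 80_000,  "1w": 900_000},
}

# Rank systems by the impact of a one-week outage to find the truly critical ones.
ranked = sorted(bia, key=lambda system: bia[system]["1w"], reverse=True)
for system in ranked:
    print(f"{system}: up to {bia[system]['1w']:,} lost in a one-week outage")
```

Even a rough table like this makes the critical-versus-nice-to-have conversation concrete: the systems at the top of the list are the ones that justify redundancy spending.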

Developing Disaster Recovery Plans

Sometimes, all the preventive measures fail, and disaster happens. That’s when the company needs a disaster recovery plan, or DRP. A DRP explains in detail how to recover from certain types of disasters. For IT, it should be technical in nature, like planning for blackouts or disruption of critical IT systems. A company can have a bunch of these plans—one plan for each scenario.

One scenario could be losing network connectivity—not having access to resources at your main data-processing site or the internet. For example, this could involve losing connectivity via VoIP to your customer service, or something like that.

For the plan to be effective, it should be drafted with the people who are actually responsible for using it when a disaster happens. The cybersecurity manager cannot make this plan all alone. Ivory-tower documents don’t matter at all when an IT disaster strikes; the DRP must be the exception. The cybersecurity manager should think about who will react and take action to bring the facility or system back online. Those people should be on the team that creates the plan. It’s also critical that the people who are actually responsible for bringing systems back online have access to the DRP, have paper copies of it, and know how to use it.

The cybersecurity manager doesn’t have to be too pushy about getting these people involved. They just have to say something like, “Hey, I have these headers here on an empty document, and I need to explain what to do in the event of a disaster. So I’m going to share it with you online in our collaboration platform and ask you to fill it in. I’ll send a printed copy, then everybody can access it if something happens. Sound good?”

The DRP doesn’t have to be complicated; it might be expressed in simple bullet points. At minimum, it should contain a list of systems in order of importance, along with instructions about what to do and who to contact in case of emergency. The plan should also have actionable details, like how to run diagnostics to find the cause of the problems, log in to systems, restart servers or services, rebuild entire systems, restore data, and so on. It’s a very different type of document from a policy. A policy should be readable and understandable. The DRP should be useful for the people who are restoring the system: step by step, in very succinct terms. Its core purpose is to be there when people are working under a lot of stress and have no time to start looking for instructions. They’re going to be working under a lot of pressure, so if the plan helps them find the information faster, it will work.

The steps in the plan might be something like, “Is it working? If not, run this command. If that’s successful, it’s working. Go to the next step. Sign into this system. Check the log and see what’s in there. Stop and start the service in question.” Very technical, very straightforward. A trained professional might do this in a few minutes once they get hold of the system and the DRP document. In the best-case scenario, at least a few people will have access to the plan and practical experience in restoring normalcy after a disaster.
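As a sketch of that idea, the steps of a DRP can be kept as a plain ordered checklist that anyone on the team can walk through in sequence; the service name and commands below are purely illustrative, not from any real plan:

```python
# Purely illustrative service name and commands for a DRP checklist.
drp_steps = [
    ("Check whether the service responds", "systemctl is-active billing-api"),
    ("Inspect the last 50 log lines",      "journalctl -u billing-api -n 50"),
    ("Restart the service",                "systemctl restart billing-api"),
    ("Verify recovery",                    "curl -fsS http://localhost:8080/health"),
]

# Print the checklist in the order it should be executed.
for number, (description, command) in enumerate(drp_steps, start=1):
    print(f"Step {number}: {description}\n    $ {command}")
```

The format matters less than the properties: numbered, ordered, one concrete command per step, and printable so paper copies stay usable when the network is down.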

With a concise, well-organised DRP, the company might be able to restore operations within fifteen minutes of a minor disaster. Without a plan, it can be a huge circus just to discover where the problem is and find out who to call. If too much time is spent trying to sort out who’s in charge, people will start to do their own analysis without understanding how to debug the problem, and that makes things even more difficult. In such cases it often happens that people make incorrect deductions about the cause of the problem, and the fix takes even longer.

But if all the commands are listed in the DRP, anybody on the team can log in and follow the plan. There’s no need to be dependent on just one person. Anyone with a copy of the DRP can execute it.

Too Many Bulldozers

Here’s a story that illustrates the key points in this article. There was a payment processing company that processed credit card transactions for almost an entire country. It was a huge component of the economy in that country. If the payment processing company were to fail, people would stop paying in the stores. Failure was not an option.

The company actually planned pretty well. They had two physical locations for their data centres that were close but not too close. All the systems were redundant—they just duplicated all the servers and all the data; it was thought out and executed well. But they didn’t test several scenarios, like what happens if they physically lose connectivity between those two sites. Would the customer still be able to pay, or would the entire payment processing operation go down?

In between those two data centres, a construction site went up. The construction company dug a hole in the ground, twenty metres deep. One of the bulldozers cut the fibre line between the two sites, and the payment processing company’s primary site was disconnected from the internet. Then the switchover to the secondary site didn’t work right, so they were down for the remainder of the day until they could figure out and bypass the problem.

In another situation, a big internet service provider and mobile service provider had a similar problem. This was also due to a bulldozer, but it was way worse. They thought that their two physically separate fibre optic lines in two locations didn’t run through the same service tunnel, but they did. When the bulldozer hit, all of their customers were cut off and couldn’t be restored until the physical fibre optic line was repaired. This took down communications for hundreds of thousands of people in the metropolitan area, and for all of their business customers as well.

Redundancy is often a challenge in communications because companies often buy communication lines from two or three different internet service providers but don’t realise those companies are sharing one physical fibre. If that line is severed, having a variety of providers doesn’t make any difference. This is especially true for transoceanic fibre lines—there just aren’t many choices.
