David as a Service

Notes on tech and many other topics

Endpoint wipeout – a(nother) gap in DR planning

After a rather rough Friday for many of our colleagues in the IT industry, it’s clear that serious discussions will be taking place across organisations. One question will no doubt be: “What is our disaster recovery plan for this?”

Many DR plans focus primarily on infrastructure and services, often giving only a cursory nod to client devices. Unfortunately, this time client devices took the brunt of the impact from the breaking endpoint protection update, or at least were the most difficult to remediate.

While the scenario is fairly specific, below are just some of the things we should be including in DR plans and general policies around this topic.

DR plans and supporting materials

You should store your DR plans and supporting materials (key passwords, encryption keys, IP addresses, etc.) securely in places that can still be reached during a disaster. Don’t just keep a copy on one server, as that server may be inaccessible at the time of need. Keep copies in different places (e.g. OneDrive or Confluence), and also consider keeping a copy on an encrypted USB drive, with a procedure in place to keep that copy up to date.
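As a rough illustration of the "keep that copy up to date" procedure, the sketch below compares each copy of a DR document against a master file by SHA-256 hash and reports any copy that is missing or out of date. The function names and the idea of a single master file are assumptions made for this example, not a prescribed process.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stale_copies(master: Path, copies: list[Path]) -> list[Path]:
    """Return the copies that are missing or differ from the master file."""
    expected = sha256_of(master)
    return [
        copy for copy in copies
        if not copy.is_file() or sha256_of(copy) != expected
    ]
```

Run on a schedule (or as part of a checklist when the plan changes), a check like this could flag the USB copy that nobody has refreshed since the last revision.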

BitLocker

This particular event was far harder to remediate if the device in question had BitLocker enabled, since the recovery key was needed to access the drive from recovery mode. The solution isn’t to avoid BitLocker (it should be used as part of your layered security), but to hold the recovery keys somewhere you can reach them during an incident, whether that is manually in a secure vault or, at scale, in Intune or another centralised management tool.

Update management

In the ‘prepare’ section of the DR plan, ensure updates are released in a controlled manner. In the case of this recent incident there wasn’t any way of controlling updates; however, if the products you are using do allow for some control over update deployment or update rings, use it. Where possible:

  1. Deploy updates to a test group first
  2. After a time, deploy to the wider business

There are, however, trade-offs to be made with security products, as you always want the latest protection available. Opting to delay agent updates even just briefly (e.g. 12 hours), while still receiving signature/protection updates immediately, may be the solution.
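The two-step rollout above can be sketched as a simple ring model. This is only an illustration: the ring names, the 5% test group, and the hash-based assignment are assumptions for this example, and a real product's update rings will have their own configuration.

```python
import hashlib
from datetime import datetime, timedelta

# Hypothetical rings: a small test group gets agent updates immediately,
# everyone else after a delay (12 hours, matching the example above).
RING_DELAYS = {
    "test": timedelta(hours=0),
    "broad": timedelta(hours=12),
}
TEST_RING_PERCENT = 5  # assumed share of devices in the test ring


def ring_for(device_id: str) -> str:
    """Assign a device to a ring deterministically from a hash of its ID."""
    bucket = int(hashlib.sha256(device_id.encode()).hexdigest(), 16) % 100
    return "test" if bucket < TEST_RING_PERCENT else "broad"


def update_due(device_id: str, released_at: datetime, now: datetime) -> bool:
    """True once the delay for the device's ring has elapsed."""
    return now >= released_at + RING_DELAYS[ring_for(device_id)]
```

Hashing the device ID keeps ring membership stable between runs without having to store an assignment table, although in practice you would probably pick the test group by hand so it covers a representative mix of hardware and users.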

Diversify protection tools

To mitigate future incidents like this involving antivirus tools, it may be sensible to use different antivirus solutions for different areas of the business.

For example, client endpoints could use a different AV solution from servers. In this scenario, that would have reduced the impact, leaving you with just one of the two groups to remediate. This does, however, come with some additional management overhead.

Separated DR devices

Giving your IT team members a secondary device, loaded with all the tools and details they need during an incident, may be sensible. It would be kept offline until needed, and so would not receive any breaking updates automatically. It should still be updated and tested periodically, but otherwise remain a break-glass device.

Prevent BSOD rebooting

Prevent Windows from automatically restarting after a BSOD. This will allow individuals to read the BSOD details and send them to you, and it helps prevent boot loops.
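One way to apply this setting is through the AutoReboot value under the CrashControl registry key, which is what the "Automatically restart" checkbox in Startup and Recovery controls. The sketch below only builds and prints the reg.exe command so it can be reviewed first. The helper names are made up for this example, and in a managed environment you would more likely push the setting through Group Policy or your MDM.

```python
import platform
import subprocess

# Registry location of the Windows "Automatically restart" crash setting.
CRASH_CONTROL_KEY = r"HKLM\SYSTEM\CurrentControlSet\Control\CrashControl"


def disable_autoreboot_command() -> list[str]:
    """Build the reg.exe command that sets AutoReboot to 0 (do not restart)."""
    return [
        "reg", "add", CRASH_CONTROL_KEY,
        "/v", "AutoReboot",
        "/t", "REG_DWORD",
        "/d", "0",
        "/f",  # overwrite the existing value without prompting
    ]


def apply_setting() -> None:
    """Run the command; needs an elevated prompt on a Windows device."""
    if platform.system() != "Windows":
        raise RuntimeError("This setting only applies to Windows devices")
    subprocess.run(disable_autoreboot_command(), check=True)


if __name__ == "__main__":
    # Print rather than apply, so the command can be reviewed first.
    print(" ".join(disable_autoreboot_command()))
```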

Device reporting

Identifying which devices need fixing will expedite the recovery process. Have an RMM, MDM, or inventory tool that can tell you what hardware and software is in use where, and who or what uses it.
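To show the kind of question an inventory should be able to answer quickly, here is a minimal sketch that filters a device list for a given software version. The Device fields and the example agent name are hypothetical. A real RMM or MDM would expose this through its own reports or API.

```python
from dataclasses import dataclass


@dataclass
class Device:
    hostname: str
    owner: str
    os: str
    installed: dict[str, str]  # software name -> installed version


def affected_devices(
    inventory: list[Device], software: str, bad_versions: set[str]
) -> list[Device]:
    """Return devices running one of the known-bad versions of a product."""
    return [
        device for device in inventory
        if device.installed.get(software) in bad_versions
    ]


# Example with made-up data: who do we need to contact first?
inventory = [
    Device("LT-01", "alice", "Windows 11", {"ExampleAgent": "7.1"}),
    Device("LT-02", "bob", "Windows 11", {"ExampleAgent": "7.0"}),
    Device("MAC-01", "carol", "macOS", {}),
]
for device in affected_devices(inventory, "ExampleAgent", {"7.1"}):
    print(f"{device.hostname} ({device.owner})")
```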

Communications

Clear communications and instructions are hard to write at the best of times, let alone when the pressure is on during an incident.

This specific incident required IT teams to talk remote and distributed individuals, often people without a technical background, through some non-trivial steps, because the issue also prevented RMM and other remote tools from working. It may be sensible to draw up communication templates that clearly explain key functions and tasks for use in DR scenarios.

While the list of possible templates is endless, one example could be: How to boot a Windows device into recovery mode with command prompt. This gives you some easy-to-follow instructions ready to use, with room to add further steps specific to the incident at the time. Drafting templates for other communications, such as an initial incident acknowledgement to be sent to the business, is also useful.
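As a small illustration, an initial incident acknowledgement could be kept as a fill-in-the-blanks template like the one sketched below. The wording and field names are placeholders for this example and should be adapted to your organisation.

```python
from string import Template

# Placeholder wording; the $fields are filled in at the time of the incident.
ACK_TEMPLATE = Template(
    "Subject: [IT Incident] $summary\n"
    "\n"
    "We are aware of an issue affecting $scope and are investigating.\n"
    "What you should do now: $action\n"
    "Next update: $next_update\n"
    "Contact: $contact\n"
)


def render_ack(**fields: str) -> str:
    """Fill in the template; raises KeyError if any field is left out."""
    return ACK_TEMPLATE.substitute(**fields)
```

Using substitute rather than safe_substitute is deliberate here: a missing field fails loudly instead of sending a message with a stray placeholder in it.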

Test your plans

Simply put, any DR plan must be tested. This can be a tabletop test, a parallel test, or, if you really want to refine it, a full real-world test.