Every engineering discipline has a theory of reliability. For civil engineers, it lives in load calculations and redundancy specs. Aviation earned its safety record through obsessive incident review. In the software industry, reliability has become the practice of assuming things will eventually break: what happens when you stop treating failure as an anomaly, and start preparing for it as a certainty? The answer to this question has become known as Site Reliability Engineering, or SRE.
SRE originated at Google in 2003 when Ben Treynor Sloss founded the first site reliability engineering team. The discipline produced the Google SRE Handbook, which sought to answer one question above all others: how do you keep massive, complex technology infrastructure running reliably when the people operating it are human?
From ACE IoT’s vantage point, the building controls industry is only just beginning to have SRE-related conversations. OT teams manage sophisticated technology infrastructure, and for the most part, they manage it by feel. SRE, by contrast, offers a mature, proven approach to managing that infrastructure. Its principles start with observability, build from there to structured performance accountability, and end with the ability to recover from failures faster. The IT world has found this framework indispensable, and OT teams managing increasingly critical infrastructure have every reason to follow suit.
“When we talk about all of the fanciful promises that we make for our systems, all of the advanced things that we want to do, they presuppose that we can validate things are running correctly. That’s the foundation that SRE lets us set; we can actually know we’ve got a system in a state where it’s capable of doing new things, more things, better things.”
- Michael Melillo
Technology systems will always eventually fail. Instead of attempting to defy that reality and build perfect systems, SRE asks teams to be ready for the failure. That means instrumenting systems so that failure can be seen coming, and building recovery procedures that are tested thoroughly and work consistently. Applying the three practices listed below to OT environments could mean a future where OT teams both trust the systems that keep buildings operating effectively and know exactly what to do when they don’t.
SRE Practice #1: Start Monitoring Your Systems
Michael Melillo, Head of Technology at Albireo Energy and a recent ACE Flight Logs guest, has spent 12 years in the controls industry. When he came across the SRE Handbook for the first time, the recognition was uncomfortable. “[Discovering SRE] really was a shifting of the foundations for me,” he says. “I needed to rethink everything about how I build systems up from the ground.”
That rethinking started with a simple question: do you actually know what your systems are doing right now? For most OT teams, Michael’s question is not easily answered. Without metrics being collected from supervisory devices and controllers, the system is effectively a black box until something breaks.
Michael’s advice: start with what is measurable today. CPU usage, memory allocation, remaining disk space, BACnet packet volume, and command latency on your automation servers and network control engines are all accessible. Collected consistently, these metrics can help establish a behavioral baseline. A trending metric can surface a struggling controller weeks before it fails. Melillo describes discovering a client’s controllers were silently struggling each night, visible only because the monitoring infrastructure caught the pattern. “We saw it before it happened,” he says, “and we could pull people into the room to start fixing problems.”
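As a rough illustration of what a behavioral baseline and a trending metric can look like in practice, here is a minimal Python sketch. It is not taken from the interview or from any particular monitoring product: the function names, the three-standard-deviation threshold, and the sample values are all assumptions chosen for illustration. It summarizes samples from a healthy period, flags readings that fall far outside that range, and estimates whether a metric is drifting upward over time.

```python
import shutil
from statistics import mean, stdev


def disk_free_percent(path="."):
    """Percentage of disk space still free on the volume holding `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


def build_baseline(samples):
    """Summarize samples from a healthy period as (mean, standard deviation)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to build a baseline")
    return mean(samples), stdev(samples)


def is_anomalous(value, baseline, threshold=3.0):
    """True if `value` is more than `threshold` standard deviations from the mean.

    The default threshold of 3.0 is an assumption, not a recommendation.
    """
    mu, sigma = baseline
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


def trend_per_sample(samples):
    """Least-squares slope of the samples; positive means the metric is rising."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate a trend")
    x_mean = (n - 1) / 2
    y_mean = mean(samples)
    numerator = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(samples))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator


# Hypothetical nightly CPU readings (percent) from a controller.
healthy_cpu = [20, 22, 21, 19, 20, 21, 22, 20]
baseline = build_baseline(healthy_cpu)
tonight_is_unusual = is_anomalous(95, baseline)
```

In a real deployment the samples would come from whatever the automation server or network control engine exposes, and the results would feed an alerting tool rather than a variable. The point is only that a baseline is a small amount of arithmetic once the data is being collected consistently.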
Logs add a second dimension. Centralizing logs makes it possible to compare what a system looks like during an incident with what it looked like months earlier when everything was healthy. This comparison is what turns a failure investigation from a guess (based on gut feel) into a diagnosable event (based on data).
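To make the healthy-versus-incident comparison concrete, here is a small Python sketch. It assumes a simple log format in which each line starts with a timestamp followed by a severity level; real controllers and supervisory servers will each have their own formats, so the parsing step is a placeholder. The function names are hypothetical.

```python
from collections import Counter


def severity_counts(lines):
    """Count log lines per severity.

    Assumes the severity is the second whitespace-separated field,
    e.g. "2024-01-01T02:00:00 WARN trend buffer full".
    """
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            counts[parts[1]] += 1
    return counts


def compare_windows(healthy_lines, incident_lines):
    """Return {severity: (healthy_count, incident_count)} where the counts differ."""
    healthy = severity_counts(healthy_lines)
    incident = severity_counts(incident_lines)
    levels = sorted(set(healthy) | set(incident))
    return {
        level: (healthy[level], incident[level])
        for level in levels
        if healthy[level] != incident[level]
    }
```

A jump in warnings or errors relative to the healthy window gives an investigation somewhere specific to start looking.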
So, where should an OT team start? Automation servers and network control engines are the right first targets. Their internals are accessible, and their health has downstream consequences across the system. But for the most practical starting point, Michael says, “Take the last thing that crashed, and monitor that. Like OSHA rules, every crash, every rule is written in blood of what came before. So, where are you bleeding? Monitor that.”
SRE Practice #2: Exercise Your Backups
Observability tells you when something is going wrong. A backup strategy determines what happens next. While many OT teams do have backups, most have never tested whether restoring those backups actually works.
There is a wide gap between having a backup and having a recovery strategy. “Anybody can throw darts at the wall and get lucky,” Melillo says. “But it takes a strategy, and having deployed that strategy beforehand to be able to do that consistently.” A backup that has never been restored cannot be a recovery plan on its own.
“If part of your strategy involves deploying backups, you’ve already presupposed that you have a way of knowing the backups are good.”
- Michael Melillo
Exercising backups means actually running them: spinning up a recent backup, confirming it loads correctly, and verifying the system returns to a known good state. For most of the industry, this happens rarely, if at all. Exercising backups annually is a reasonable starting point, and even that modest commitment puts a team meaningfully ahead of where most of the industry sits today.
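One way to picture a backup exercise is as a script that restores into a scratch location and checks the result against what was recorded when the backup was taken. The Python sketch below assumes the backup is a tar archive and that a manifest of SHA-256 checksums was saved alongside it. Both are assumptions made for illustration; building automation backups are usually vendor-specific, and a real exercise would also confirm the restored system actually starts and behaves correctly.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def sha256_of(path):
    """SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def exercise_backup(archive_path, manifest):
    """Restore a tar backup into a scratch directory and check it.

    `manifest` maps relative file paths to the SHA-256 digests recorded
    at backup time (a hypothetical format). Returns a list of problems;
    an empty list means every listed file restored intact.
    """
    problems = []
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive_path) as tar:
            # Use the safer "data" extraction filter where this Python supports it.
            if hasattr(tarfile, "data_filter"):
                tar.extractall(scratch, filter="data")
            else:
                tar.extractall(scratch)
        for rel_path, expected in manifest.items():
            restored = Path(scratch) / rel_path
            if not restored.is_file():
                problems.append(f"missing: {rel_path}")
            elif sha256_of(restored) != expected:
                problems.append(f"checksum mismatch: {rel_path}")
    return problems
```

Even a check this simple turns "we have backups" into "we restored last month’s backup and every file matched."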
Take it a step further, and put your organization’s recovery plan (including the backup exercise schedule) in writing. When backup exercising appears in service agreements as a defined deliverable, it becomes something both parties can plan around, resource appropriately, and hold each other accountable to. That accountability is how the industry arrives at recovery practices it can depend on.
SRE Practice #3: Run Tabletop Exercises
Exercising backups requires access to systems, scheduled downtime, and organizational buy-in. Free advice: don’t wait until it’s critical. The tabletop exercise is where any team can start, regardless of budget or technical readiness.
The premise is simple: gather the people responsible for your building systems and walk through a failure scenario together. What happens when a supervisory server goes offline? What happens when an IT team detects a breach and cuts network access? Who calls whom, what gets isolated, and what does recovery actually look like given the tools and people currently available?
Wouldn’t it be helpful to have experience with these scenarios before they really happen?
A team that has talked through a failure scenario in a low-stakes setting carries a practiced, shared understanding into a real incident. Every person in the room learns their role before the pressure arrives and can apply it efficiently when the situation is real.
Running a tabletop exercise quarterly is a reasonable cadence, but even once a year is likely to deliver clear benefits to most organizations. Tabletop exercises cost almost nothing to run. Recovery, and the time it takes to recover, depends on the people who have to act under pressure having a common language and a rehearsed response ready before the moment arrives. Having a few tabletop exercises under your belt results in a prepared team and a well-managed, fast recovery.
The Conversation Has Started
OT teams have the advantage of inheriting a mature framework, tested at scale, with a freely available handbook and a growing body of practitioners who have applied it in exactly the kinds of environments buildings are becoming. As Michael says,
“If there’s a road to keeping systems healthy and not having to worry about mean time to recovery, but getting ahead of the crash, that’s the world I want to live in.”
- Michael Melillo
Notably, the three recommended SRE practices outlined above require no organizational overhaul to begin. A team that starts monitoring its supervisory devices, commits to exercising its backups annually, and runs a single tabletop exercise this quarter will be operating with more reliability discipline than the vast majority of the industry.
Buildings are becoming critical infrastructure faster than the industry is building the practices to match. Effective implementation of SRE principles and practices is how the gap closes.
If any part of this journal made you think differently about how your OT systems are managed, the full conversation goes even deeper. Check out the full interview between Michael Melillo and Andrew Rodgers in Episode 08 of Flight Logs.
Subscribe to ACE Flight Logs on YouTube or Spotify to be notified when the next episode drops!