Drifting Into Failure

Drifting into failure. It's one of my favorite ways to explain how the problems we face day to day became problems in the first place.

You may have heard of drifting into failure before. Sidney Dekker published an entire book around the idea, aptly named Drift into Failure. He presents it as the idea that every small problem we ignore (or cover up) gets us closer to failure. What is a problem? What is a failure? That's one of the great things about keeping the idea abstract: it lets you fill in the blanks with your own definitions.

Miss an oil change? Not the end of the world. Keep on putting it off week after week, month after month? Eventually it's going to catch up to you.

Deadlines pushing you to implement a less-than-ideal solution? Again, probably not the end of the world, but if you keep implementing those less-than-ideal solutions, you're eventually going to end up with a mess.

You get the picture. The idea here is that failures aren't often born from a single event, but from the culmination of a lot of little events that didn't necessarily mean much on their own. We take little shortcuts here and there, but in the end, they catch up to us in a potentially catastrophic way.

In the context of this post, I want to continue using this idea of drifting into failure, but speak specifically to how it applies to infrastructure.

To begin, you should be aware that there exists an approach to defining infrastructure using code. This is called IaC or Infrastructure as Code. It's a contract that says, make my infrastructure look exactly like this.

One alternative to Infrastructure as Code is having someone on an operations team make ad hoc changes to the infrastructure. They would be notified that the infrastructure needs to change, so they would log in and run some commands.

NOTE: Unfamiliar with the idea of contracts? You can read all about what they are and their benefits in Locked into a Contract? Lucky You!

These are two very different styles of managing infrastructure. Making ad hoc changes is an imperative approach, and Infrastructure as Code is a declarative approach. These are two very important concepts, so it's worthwhile explaining them.

Imperative and declarative approaches

For me, one of the easiest ways to reason about both of these approaches is that declarative focuses on the what and imperative focuses on the how.

If I wanted you to come to my office, I could say in a declarative way "Come to my office". I don't care how you get here, but my desired state is that you are in my office. Alternatively, I could say in an imperative way, "Leave your office. Go to the elevator. Take the elevator to the 14th floor. Walk into my office".

The end result is the same in both scenarios, but the declarative approach gives you the freedom to do it however you choose to get it done.

It's also important to note that declarative approaches often result in a series of imperative statements. To make that a little more clear, let's continue with the above example. I told you to come to my office, in a declarative manner. But that doesn't mean you just magically teleport in front of me. I made clear what I wanted the end result to be, but I left it up to you to figure out the individual, imperative steps to get there.

Dockerfiles are a great representation of this, in my opinion. A Dockerfile is a series of imperative commands, but they get bundled up into some desired state. The act of using an image from a Dockerfile is declarative (I want exactly this), but the steps required to get there were imperative.

Another important aspect of declarative approaches is state. In a declarative environment, we need to know what the current state of the infrastructure is. We also need the ability to figure out what needs to be done to get the infrastructure into the desired state. For example, if I tell some process to make my infrastructure look like X, and that process gets interrupted, it should be able to figure out what's remaining to get the infrastructure to look like X when it resumes.

An imperative approach does not allow for this, as it has no notion of state. While imperative approaches are simpler, the lack of state means that if the process is interrupted we have to start over again, because we don't know where we left off.
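To make the idea of state a bit more concrete, here is a minimal sketch of how a tool might work out what's remaining. This is not how any particular tool does it. The plan and apply functions, the package names, and the version numbers are all made up for illustration, and the "infrastructure" is just a dictionary.

```python
def plan(desired, current):
    """Return the (action, name) steps still needed to make current match desired."""
    steps = []
    for name, version in sorted(desired.items()):
        if name not in current:
            steps.append(("install", name))
        elif current[name] != version:
            steps.append(("upgrade", name))
    for name in sorted(current):
        if name not in desired:
            steps.append(("remove", name))
    return steps


def apply(desired, current, fail_after=None):
    """Carry out the plan, optionally simulating an interruption after N steps."""
    for done, (action, name) in enumerate(plan(desired, current)):
        if fail_after is not None and done == fail_after:
            return False  # interrupted; current holds whatever was finished
        if action == "remove":
            del current[name]
        else:
            current[name] = desired[name]
    return True


# Hypothetical desired state and starting point.
desired = {"docker": "24.0", "dotnet": "8.0"}
current = {"dotnet": "6.0", "telnet": "1.0"}

apply(desired, current, fail_after=1)  # interrupted after the first step
remaining = plan(desired, current)     # only the unfinished steps come back
apply(desired, current)                # resume and finish
```

After the simulated interruption, remaining only holds the upgrade and the removal. The install already happened, and because the plan is always computed from the current state, nothing gets repeated.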

This is the power of being declarative and using contracts. They allow us to say exactly what we want the environment to look like with the contract, and leave all of the finer details up to the tooling that is fulfilling the contract.

Declarative infrastructure

You may have figured out by now that Infrastructure as Code is very much a declarative way of defining what your infrastructure should look like. There are a lot of benefits to leveraging Infrastructure as Code. To give you an idea, here are a few that really stand out for me:

Always know what your environment looks like

How many times have you wondered what version of .NET is on the build servers? What was your solution to finding out that information? Most likely it was to call up someone from the Ops team and have them tell you. Using Infrastructure as Code, the configuration of each environment is stored in source control, just like code. If you want to know some properties of an environment, you just need to go to the infrastructure repository and look for yourself.

Push button environments

You'll most likely be hearing this phrase a lot in the days to come. The idea is that we should be able to define exactly what we want an environment to look like, and have some tooling do all of the work to fulfill our contract. Bringing this close to home, think about the time and effort it takes to stand up a new N. It takes a lot of people and a lot of time to make sure everything is just right.

The holy grail for push button environments is that you do just that. You push a button, and after some time your environment materializes before you. There shouldn't be any assumptions about where you want to stand up your environment. Everything required to get that infrastructure up and running should be contained within the contract.

Testability

When we talk about testing, our minds most likely go to testing to make sure our applications work. We have an environment we want to deploy to, and we need to run our unit, integration, <noun> tests to make sure that our application will work when deployed. But where did that environment come from? How can we be sure the environment is correct?

Because Infrastructure as Code is... code (it's right there in the name!), it gives us the ability to run tests against the infrastructure before it is actually applied. Some tests might verify that the Docker daemon is installed and running, or that some registry values exist. The key here is that we can now test the environment that our applications will be deployed to.
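As a rough sketch of what such a test could look like, assume the contract for a build server is plain data. The BUILD_SERVER layout, the registry key, and the validate function below are all hypothetical. Real tooling has its own formats and test frameworks, so treat this as the shape of the idea rather than a recipe.

```python
# Hypothetical contract for a build server, expressed as plain data.
BUILD_SERVER = {
    "services": {"docker": {"installed": True, "running": True}},
    "registry": {r"HKLM\SOFTWARE\Example\BuildAgent": "enabled"},
}


def validate(contract):
    """Return a list of problems found in the contract before it is applied."""
    problems = []
    docker = contract.get("services", {}).get("docker", {})
    if not docker.get("installed"):
        problems.append("docker is not installed")
    if not docker.get("running"):
        problems.append("docker daemon is not running")
    if not contract.get("registry"):
        problems.append("no registry values declared")
    return problems


# An empty list means the contract passed every check.
problems = validate(BUILD_SERVER)
```

A check like this can run on every pull request, so a contract that forgets the Docker daemon gets caught before any environment is touched.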

What does any of this have to do with drift?

Glad you asked! One of the key takeaways about contracts and Infrastructure as Code is that you cannot be allowed to make any change to the environment without updating the contract. Doing so causes drift.

Environment drift occurs whenever someone changes the environment (applies a patch, adds some new registry value, etc.) without updating the contract. The whole point of all of this is that we know exactly what the environment looks like, and that we can easily stand the environment up anywhere because everything about it is outlined in a contract. If the environment differs from the contract, we lose those benefits.

When working with Infrastructure as Code, the same principles and processes we apply to deploying applications should also be applied to changes to our infrastructure. We don't just open up some source code on the production server, make a change, and apply it. The same rules and regulations should apply to our infrastructure.
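Here is a small sketch of what detecting drift might look like, assuming both the contract and the observed environment can be flattened into simple key/value settings. The detect_drift function and every setting name and value below are invented for the example.

```python
def detect_drift(contract, observed):
    """Map each drifted setting to (value in contract, value observed).

    None on either side means the setting is absent there.
    """
    drift = {}
    for key in sorted(contract.keys() | observed.keys()):
        expected = contract.get(key)
        actual = observed.get(key)
        if expected != actual:
            drift[key] = (expected, actual)
    return drift


# Hypothetical contract and what is actually running on the machine.
contract = {"os_patch": "KB100", "dotnet": "8.0"}
observed = {"os_patch": "KB101", "dotnet": "8.0", "debug_flag": "1"}

drift = detect_drift(contract, observed)
# drift == {"debug_flag": (None, "1"), "os_patch": ("KB100", "KB101")}
```

Both kinds of drift show up here: a patch applied by hand, and a setting the contract knows nothing about. An empty result means the environment still matches the contract.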

If you want to make a change to the infrastructure, you must make a change to the contract and submit a pull request. If approved, that infrastructure will go through a series of tests, and will eventually be applied to the intended environment. How that change is applied could even mean we completely destroy the old environment and just stand up the result of the pull request.

Hooray immutability!

John Reese

Go . Kubernetes . DevOps