Deterministic software development practices and the horror of losing your Docker images

8 minute read

Sometimes when developing software, it can be hard to anticipate or account for the horrors that might arise as a result of a small and seemingly insignificant decision. Small, ostensibly innocuous decisions can have monumental ramifications down the line, as I’m certain any software developer with a few production systems under their belt can attest with a harrowed grimace. Over the past two days I had the joy of experiencing exactly such a consequence.

Deterministic Software Development Practices

While I continue the joys of pandemic-era South African enforced teetotalism despite my frayed nerves, it’s worth providing some context for my sudden desire to take a six-month sabbatical as a barista or an alcoholic.

Good software development practice emphasises determinism and reproducibility when it comes to the code you write, the systems you develop and the applications you deploy. In an ideal world, all code does the same thing every time, every system deploys identically every time, and applications only ever do what you expect them to.

Much to our misery, such an ideal world is the stuff of fairy tales, and it’s only through the rigorous implementation of protocols and procedures that we can enforce some degree of reasonable expectation upon reality. Such efforts are the spawning pool of concepts such as Unit Testing, Test Driven Development (TDD), codified deployments, Continuous Integration (CI) and Continuous Delivery (CD), Docker, and Infrastructure as Code, among others.

Effectively tested code provides a valuable safety net: surety that the code will produce the desired results when executed as intended, and, if you’re testing effectively, that it will fail in a reliable and expected fashion when executed in an unexpected way.
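To make that concrete with a toy example (sketched in Python rather than our actual Ruby stack, with an entirely hypothetical parse_port helper), the shape of that safety net looks something like this:

```python
# Illustrative sketch only: a hypothetical helper plus the two kinds of test
# described above - the happy path succeeds, the unexpected input fails loudly.
import pytest


def parse_port(value: str) -> int:
    """Hypothetical helper: parse a TCP port from configuration text."""
    port = int(value)
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port


def test_parse_port_happy_path():
    # Executed as intended: produces the desired result.
    assert parse_port("5432") == 5432


def test_parse_port_rejects_out_of_range_values():
    # Executed in an unexpected way: fails in a reliable, expected fashion.
    with pytest.raises(ValueError):
        parse_port("99999")
```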

Codified deployments with tools like Ansible, Chef or Salt are an effective means of removing the human element of error from potentially lengthy and complicated processes, providing an invaluable assurance that your deployment processes will continue to work in a consistent manner even after John the DevOps whiz leaves for his new job in Silicon Valley (or quits software development entirely in favour of a more relaxing life as a carpenter).

This testing and these strictly defined deployments can be further formalised into a deterministic process through the introduction of effective CI and CD, which can ensure that all code written is held to an appropriate testing standard, validating that no existing tests have been broken (and, in better implementations, that newly added code is effectively tested as well), and that deployments follow an expected procedure based on the merging and versioning of code to various version control branches.

This would normally take the form of the codified deployment being executed with the newest versioned set of code: a push to a specific git branch, probably called “develop”, deploys to a “Staging” server for manual validation and testing, while a versioned merge to a “master” or “main” branch triggers the deployment of that code to “Production”, with all manner of potential checks and balances to ensure a safe and successful deployment… if you’re not too scared of the word “DevOps”.
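As a very rough sketch of that branch-to-environment convention (the branch and environment names here are just assumptions, and in reality this logic lives in your CI tool’s own configuration rather than a Python script):

```python
# Rough sketch of the branch-to-environment convention described above.
# Branch and environment names are assumptions, not a prescription.
from typing import Optional


def target_environment(branch: str) -> Optional[str]:
    """Map a freshly merged branch to the environment it should deploy to."""
    if branch == "develop":
        return "staging"     # manual validation and testing happens here
    if branch in ("master", "main"):
        return "production"  # only after all checks and balances have passed
    return None              # feature branches: test and build, but don't deploy


if __name__ == "__main__":
    for branch in ("feature/fix-login", "develop", "main"):
        print(branch, "->", target_environment(branch))
```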

The result of all these efforts is that you can ensure that whatever code you have deployed is, at minimum, exactly as functional as it was in any prior deployment (if not more so), and that in the event of disaster you can recover by following the same proceduralized steps and deploy a functioning version of your application, should your manager/boss/product owner/client demand it be done. (Though we do drift ever so slightly closer to that mystical land of fairy tales here, as disaster recovery is rarely quite so straightforward.)

The Horror of losing your Docker images

This all brings us to my past two days, having enjoyed the safety net of stable CI/CD which would reliably run the unit test suite for my applications and, in the event of successful testing and merging, build the associated application’s Docker image to be stored in our internal registry for later deployment. (No automated deployments just yet, but the dream lives on for the moment.)

For the uninitiated, Docker is a nifty containerization platform that took the developer world by storm as an effective way to package up both your application and the environment in which it needs to be executed, thereby ensuring yet another layer of determinism and providing an amusing and succinct response to the problem so well known by the phrase “it works on my machine.” Which is to say: “Well, then we’ll package your machine.”

In terms of deployment, having your historical Docker images tagged and stored is an effective measure for disaster recovery, as it provides a relatively easy mechanism by which to roll back to your exact previous deployment in the event of disaster (or an unfortunately dysfunctional deployment). It can also be highly beneficial for simplifying the development of an application between developers who might have differing local environments.
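As a rough sketch of what that rollback looks like in practice (the registry, image and tag names below are entirely hypothetical), you simply repoint your deployment at a previously tagged image:

```python
# Sketch of a tag-based rollback: pull the last known good image and retag it
# as the version to run. Registry, image and tag names are hypothetical.
import subprocess


def rollback(image: str, good_tag: str, active_tag: str = "deployed") -> None:
    """Repoint the 'deployed' tag at a previously built image."""
    subprocess.run(["docker", "pull", f"{image}:{good_tag}"], check=True)
    subprocess.run(
        ["docker", "tag", f"{image}:{good_tag}", f"{image}:{active_tag}"],
        check=True,
    )


if __name__ == "__main__":
    # e.g. roll the app back to the image built for release 1.4.2
    rollback("registry.internal.example/myapp", "1.4.2")
```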

Today’s horror boiled down to one seemingly innocuous decision. Due to space constraints, it was decided that our internal container registry would not be backed up; these images could easily be rebuilt from their Dockerfile definitions in any case, right? Right. Absolutely. In fact, I had made various efforts to script up these build processes and attach them to our CD processes. So when disaster struck and our registry was lost, it seemed like no real calamity at first.

But then there was a bug we needed to patch in one of our systems… a story in and of itself. Once we eventually tracked down the relevant issue, we applied the relevant patch to our code, wrote the related tests to ensure its validity, and pushed it up for the lovely automated scripts to validate and build the Docker image for future deployment. All seemed like smooth sailing as we confidently awaited the opportunity to resolve this strange bug in our system with but a singular deployment, none the wiser to our impending stress-level hike.

We deployed the newly built and packaged Docker image to our staging server as the vastly important final sanity check prior to deploying it to master, but were surprised to find that, weirdly, the application suddenly refused to connect to the database. Hmmm. Peculiar. So the debugging questions began. Did you change something? Did I change something? Is something out of date? The error seemed to imply TLS certificate validation was failing; maybe let’s rerun the cert generation script? Okay. Did that, certs are all in place, and it’s still broken.

What, oh what, could it possibly be? Our despondency only grew as we checked increasingly minute details. Cert timestamps checked out. Were we mounting the certs into the correct containers? Seems so. Did they verify? Yes again. Did they match? Unfortunately yes, which meant we needed to find yet another thing to check. Each increasingly unlikely question was answered unequivocally with the word “yes”, leaving us to break for lunch in severe frustration, feeling increasingly nervous as we were asked once again for a status on the patch we’d been working on prior to this issue.
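For the curious, here is roughly the kind of checking we were doing by hand, sketched with Python’s cryptography package (the file names are hypothetical, and this is only an illustration of the checks, not our actual tooling):

```python
# Sketch of the manual certificate checks: is the cert currently valid, and
# does the private key actually belong to it? File names are hypothetical.
from datetime import datetime

from cryptography import x509
from cryptography.hazmat.primitives import serialization

with open("server.crt", "rb") as f:
    cert = x509.load_pem_x509_certificate(f.read())
with open("server.key", "rb") as f:
    key = serialization.load_pem_private_key(f.read(), password=None)

# "Cert timestamps checked out": are we inside the validity window?
now = datetime.utcnow()
assert cert.not_valid_before <= now <= cert.not_valid_after, "cert outside validity window"

# "Did they match?": does the private key correspond to the certificate?
assert (
    cert.public_key().public_numbers() == key.public_key().public_numbers()
), "key does not match certificate"

print("certs look fine... so the problem is somewhere else")
```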

Let it be said, taking a break for lunch can do wonders for even the desperate, for once we came back, my colleague had begun thinking about the thing we had changed and thought nothing of: all the Docker images we’d rebuilt, and more specifically, the base Docker images we had rebuilt and published to our registry, on top of which our code would be layered and deployed. I found the idea completely unlikely, until he used the “docker image history” command to examine the sub-layers of our deployed image versus those of the last known good image we had previously been running on staging. What became evident was that the underlying operating system layers of the newer image were also newer.
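If you ever need to reproduce that comparison, a quick sketch of it (with hypothetical image names) looks something like this:

```python
# Sketch of the comparison that cracked the case: diff the layer history of the
# newly built image against the last known good one. Image names are hypothetical.
import subprocess


def layer_history(image: str) -> list[str]:
    """Return one line per layer, as reported by 'docker image history'."""
    result = subprocess.run(
        ["docker", "image", "history", "--no-trunc",
         "--format", "{{.ID}}\t{{.CreatedBy}}", image],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    old = layer_history("registry.internal.example/myapp:known-good")
    new = layer_history("registry.internal.example/myapp:freshly-rebuilt")
    for line in set(new) - set(old):
        print("only in the new image:", line)
```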

This, to be perfectly clear… is not very deterministic. The implication is that the base Ruby image atop which we had previously built our internal base image had been functionally modified, which meant that our new base images were not identical to the base images we’d had prior to the loss of our image registry’s contents. Rolling back to a Ruby image that had not been modified proved this further, as our mystical cert verification error disappeared in a puff of smoke. This, of course, left us feeling relieved, exhausted and, understandably, very concerned that we may be unable to trust our Docker images to build and behave in a reliable, expected and deterministic fashion. Or at least until we have rebuilt all our images and manually ensured that they still run in the expected fashion.

It is frustrating to realize that this core versioned Ruby image dependency could be changed right out from under us like this. While our internal image storage had been shielding us from the results of this external influence, it’s worth understanding that any of your code or systems which rely on external dependencies to build or deploy are at the mercy of decisions made outside of your control, decisions which may be actively deleterious to your own software development efforts. We need only refer back to the events of the left-pad debacle to recall how one small, seemingly meaningless dependency broke the internet.

So the moral of the story is quite simple. Your build artifacts are important, and your deployment mechanism’s reliance on external entities is an active risk to the successful continuation and operation of your application, and potentially the business itself. Back up your Docker images, and make sure that whatever images or dependencies you might be using as a base from which to build your applications are, to the best of your ability, cached internally to shield them from external influence, be it mistaken or actively malicious.
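One way to do that, sketched below with hypothetical image and registry names, is to resolve the upstream tag you depend on to its immutable digest and mirror it into your own registry, so that future rebuilds cannot silently pick up a changed upstream image:

```python
# Sketch: pin an upstream base image by digest and mirror it into an internal
# registry. Image and registry names are hypothetical.
import subprocess


def run(*args: str) -> str:
    return subprocess.run(args, check=True, capture_output=True, text=True).stdout.strip()


def mirror(upstream: str, internal: str) -> str:
    run("docker", "pull", upstream)
    # Resolve the mutable tag to its immutable content digest (repo@sha256:...).
    digest = run("docker", "image", "inspect",
                 "--format", "{{index .RepoDigests 0}}", upstream)
    run("docker", "tag", upstream, internal)
    run("docker", "push", internal)
    return digest


if __name__ == "__main__":
    print(mirror("ruby:2.7", "registry.internal.example/base/ruby:2.7"))
```

Recording that digest also means you can reference the base as FROM repo@sha256:… in your Dockerfiles, so you always know exactly which base you built against, no matter what happens to the upstream tag.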
