Replacing Terraform (for fun and profit?)

The client I’m currently spending most (all) of my working time with has a lot of Terraform code. And I mean a lot. Like, you will spend a solid amount of time getting an overview of where everything is before you start being productive, a lot.

And we are currently working on replacing most, if not all, of it. “But why?” some will ask. “Finally!” others will shout from some dark corner. So I thought I’d begin with this first post in what might become a series as I dig myself into a rabbit hole, starting with the reasons why this might be a good (or bad) idea.

But why?!

It’s not because Terraform is shit

It really isn’t. Terraform is fine for the problems it wants to solve. And naturally it is a bad fit for problems that are out of scope. Just about every week there is some dude pushing a slopped-together hot take on why Terraform is shit and how he has found the silver bullet to solve it. Usually followed by a mix of equally ignorant dudes replying “this is insightful” and brave veterans telling him he’s so far up his own ass he can’t see sunlight.

Terraform isn’t perfect. Far from it. No tool in your toolbox ever is. They screwed up when they changed their license model in 2023 and made people angry. I’m still a bit sceptical about the whole IBM acquisition. I curse over slow replies to bugs and feature requests on a regular basis. But overall it does a pretty good job of what it aims to solve.

We’re not replacing it with OpenTofu either

That would have been easy. The aforementioned license scandal of ’23 and the nerdy revolt that followed ended with OpenTofu becoming part of the CNCF and maturing as a standalone product. Right now it’s a drop-in replacement in most use cases. Swap your CI/CD from terraform to tofu, keep calm and carry on. They might diverge more in the future, and there might be that one super nice feature in one or the other that makes your world so much better, but mostly it’s a matter of preference.

It’s the pesky developers …

They are not happy. And when nerds are not happy they make noise. On rare occasions the angry clattering from mechanical keyboards is loud enough for management to notice. The combination of technical knowledge, frustration and a shared enemy is one hell of a propellant for motivation and driving change. So we’re throwing out a lot of old and boring, in favor of new and shiny.

Really?! But why?!

First of all: Our current codebase has sprawled somewhat out of control. There is usually no single reason why this happens, but rather a series of unfortunate events: pressure to prioritize new features over maintenance, lack of bandwidth, unclear expectations, scope creep, outside changes you don’t have control over, and so on. I think most struggling teams and organizations see some degree of most, if not all, of these factors when analyzing why the promised land looks a bit like Shrek’s swampy cottage.
It’s CI/CD, stupid: I don’t know a single being who enjoys waiting for that pipeline to go from waiting to running to passed/failed. Especially not if it fails and you have to read the output, figure out why, fix it, and go back to start without collecting a prize. Over time, workflows have gotten too complex, intertwined, riddled with dependencies, dependent on approvers noticing that something is waiting, prone to random failures, or most/all of the above. We “solve” parts of it with pre-commit hooks, and spend time decoupling and refactoring the worst tangles. But every time something ends up displaying a big ❌ it fuels frustration and lowers the platform’s reputation among its users.
The world changes: Other people and companies are making changes that affect our little world. And no, I’m not even talking about the big bad slop. APIs change, open source projects are abandoned by their maintainers, Terraform modules are deprecated. All of it triggers a steady stream of changes, workarounds, accepted risks, explanations to management of why something isn’t necessarily a huge problem, and so on.
If people don’t like it, they’re not gonna use it: Their reasons may be sane, or they may be ridiculous. But the end result is the same. People don’t buy products they don’t like. They don’t talk nicely about them, they don’t recommend them to others, and most of the time they won’t tell you why they don’t like them either, so you’ll never get to know what needs fixing.

Uhhmmm … these are not Terraform problems

You’re right. They are not Terraform problems. They are mostly not technology problems at all. They are much harder than that: they are people problems.

We could have spent our time chipping away at each problem. Refactoring and untangling, removing dependencies to be more self-sustained, advocating to change outdated company policies and so on (and on it seems to go). It would have worked. It would have gotten easier to use, faster, smoother, more reliable. There is almost nothing we couldn’t have solved in the current platform stack.

Except this:

No matter how often you run CI/CD, it’s still a push-based workflow

Between each and every run there is potential for someone “just doing a quick fix” and forgetting (or skipping) to update the code. This creates drift, frustration when the next run shows an unexpected result in the plan, frustration when fixes are needed, frustration when configuration is rolled back. Where discipline is good and deploys are frequent enough, this problem almost goes away. But when you follow good practices and split your deployments into small, manageable, self-sustained parts, there will be some pipelines that have days, weeks or even months between each run (or you trigger them on a schedule just to make your life a bit miserable, because you spend each Monday correcting last week’s drift).
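To make the drift problem concrete, here is a minimal sketch in Python. It is a toy model, not how Terraform computes a plan: the plan function and the setting names (sku, public_access, min_tls) are made up for illustration. It only shows that in a push-based workflow the difference between code and reality is discovered when a run happens, not when the change is made.

```python
def plan(desired: dict, actual: dict) -> dict:
    """Return every setting whose live value differs from the value in code."""
    keys = desired.keys() | actual.keys()
    return {
        key: {"in_code": desired.get(key), "live": actual.get(key)}
        for key in sorted(keys)
        if desired.get(key) != actual.get(key)
    }


# What the repository says the resource should look like (hypothetical values).
desired = {"sku": "Standard", "public_access": False, "min_tls": "1.2"}

# The last pipeline run made the live resource match the code.
actual = dict(desired)
clean = plan(desired, actual)

# Someone "just does a quick fix" by hand and never updates the code.
actual["public_access"] = True

# Nothing notices until the next run, which may be days or months away.
drift = plan(desired, actual)
print(drift)
```

The comparison itself is trivial. The pain comes from when it runs: only when somebody, or a schedule, pushes the button.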

What we want boils down to two very easy to understand demands:

Self-service in an interface we actually like
Self-healing all the time

The first we already solve. You get a landing zone. Whether you write your code to deploy whatever you need in Ansible, ARM templates (please don’t though), Bicep, OpenTofu, Pulumi, Terraform or by raw dogging HTTP requests is up to you. Whatever makes you happy bro!

The second part is a bit harder. Us oldies remember the days of configuration management software for servers. We deployed an agent, told the server what kind of server it was going to be, the agent called home, got the details, and magic happened in the background. When we made changes, we just waited for more magic to propagate.

Few Infrastructure as Code tools work this way. And no cloud platform, private or public, has a magic “configuration agent” service you can just tell to call home to figure things out. Except … developers are already used to this interface for their applications in Kubernetes. Update a Helm chart? Flux does its magic. A node dies? Pods are moved and the application heals itself. You have “eventual consistency” and “continuous reconciliation”.
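For anyone who hasn’t met the idea, continuous reconciliation can be sketched in a few lines of Python. This is a toy model under loud assumptions: reconcile, control_loop and the settings are hypothetical names, and it is not how Flux or Kubernetes are actually implemented. A real controller would loop forever and talk to an API; here a fixed number of ticks and a dict stand in for both.

```python
def reconcile(desired: dict, actual: dict) -> list[str]:
    """Make `actual` match `desired` in place; return the keys that were corrected."""
    corrected = []
    for key, value in desired.items():
        if actual.get(key) != value:
            actual[key] = value
            corrected.append(key)
    # Anything that exists live but not in the desired state gets removed.
    for key in list(actual.keys() - desired.keys()):
        del actual[key]
        corrected.append(key)
    return sorted(corrected)


def control_loop(desired: dict, actual: dict, manual_changes: dict) -> list:
    """Run one reconcile pass per tick; `manual_changes` maps tick -> hand-made edits."""
    log = []
    for tick in range(max(manual_changes, default=0) + 2):
        # Simulate someone changing the live resource by hand at this tick.
        actual.update(manual_changes.get(tick, {}))
        log.append((tick, reconcile(desired, actual)))
    return log


desired = {"replicas": 3, "public_access": False}
actual = {}  # nothing deployed yet
log = control_loop(desired, actual, {1: {"public_access": True}})
for tick, corrected in log:
    print(tick, corrected)
```

The first pass creates everything, the second quietly reverts the manual change, and the third has nothing to do. Nobody had to remember to trigger anything, and the quick fix never got the chance to become next Monday’s surprise.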

And that is exactly why we are replacing a bunch of Terraform, offering by offering, component by component, with Crossplane.