Recipe for disaster: AWS' Deployment Pipeline Reference Architecture

Another "so close and yet so far away" release from AWS. AWS needs to realize that: 1️⃣ there's a world outside AWS 2️⃣ not everybody has AWS' reliability requirements & profit margins. This could've been awesome, but no, AWS keeps repeating the same mistakes: pipelines.devops.aws.dev

To be fair, the Deployment Pipeline Reference Architecture (DPRA) gets *A LOT* of things right. Good job on that! Seriously 👏 I especially like all the examples! And there's a big need for something like this. I applaud AWS tackling this! But then problems start to appear...

Mean TLDR: no more OSS projects to run as a service? Let's make public services out of things we built internally. Nooo, of course everybody has the same needs as AWS and of course everybody will just get it. No need to spend precious time productizing it! Bias for action! 🙄

"Vlad, don't be mean, how can you know this is what AWS does for CI/CD internally?" They released a document detailing it a few years ago: aws.amazon.com/builders-library/automating-safe-hands-off-deployments/ Eliminate the multi-region deployment pipeline and it's eerily similar!

Problem 0: it's not pretty or easy to read. Look, this is teaching material. We make teaching materials easy to read and pretty so folks can focus on the content and not go "huh?" and "wait what does that symbol mean?" and "there are colors here, do they mean different things?"

Problem 1: it's slow as heck, with a target of "a couple hours". AWS suggests that the pipeline should take 10 minutes for building, 1 hour for testing, and god-knows how long for the prod rollout. So between 1 and ∞ hours 🤦‍♂️

How long should the whole pipeline take? Most experts argue less than 15 minutes, but I am seeing high performers targeting less than 5 minutes. References (this is not something new or groundbreaking):
- ‼️ From 2016: netflixtechblog.com/how-we-build-code-at-netflix-c5d9bd727f15
- charity.wtf/2021/02/19/how-much-is-your-fear-costing-you
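
To put numbers on the gap, here's a quick back-of-the-envelope sketch. The build and test durations are the ones quoted above; the prod rollout is left out because no upper bound is given for it. The names (`DPRA_STAGES`, `minutes_over`) are made up for illustration.

```python
# Stage durations in minutes, taken from the targets quoted above.
# The prod rollout is left out because no upper bound is given for it.
DPRA_STAGES = {"build": 10, "test": 60}

TARGET_MINUTES = 15         # the "most experts" target
HIGH_PERFORMER_MINUTES = 5  # what high performers aim for


def total_minutes(stages: dict) -> int:
    """Sum of all stage durations, in minutes."""
    return sum(stages.values())


def minutes_over(stages: dict, target: int) -> int:
    """How far past the target the pipeline runs (0 if within budget)."""
    return max(0, total_minutes(stages) - target)


print(total_minutes(DPRA_STAGES))                          # 70
print(minutes_over(DPRA_STAGES, TARGET_MINUTES))           # 55
print(minutes_over(DPRA_STAGES, HIGH_PERFORMER_MINUTES))   # 65
```

So even before a single prod deploy starts, you're 55 minutes past the 15-minute target.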

Problem 2: it's pushing outdated, expensive, hard-to-implement practices. "But Vlad, not everybody has to be on the bleeding edge." That is not what I am arguing for! CI/CD pipelines should be boring, I agree. Let's dive a bit deeper!

TL;DR for problem 2: "wanna learn how to drive a car? Oh, no, no, nonono. First you must learn competitive running, and then cycling, and then horse riding. Learning those first will help you learn how to drive" Hell no, that will be a HUGE time and money sink.

Example that highlights problem 2: the "Test" steps are outdated and gloss over exceedingly complicated problems.

"Test (Beta)" suggests creating test environments on demand. Awesome. Sounds good. Except it's a horror story with countless pitfalls. Worse, we *know* this and have known for years.

Oh, databases on demand? Yeah, you need either golden images or seed scripts. Both need pipelines, care, and maintenance. And they're useless cause of course the data will always be nice and clean and perfect and not at all match any real environments.

Oh, are you running EKS? HA, good luck trying to automate cluster creation, workload deployments, and cleanup. It's not impossible, but requires your firstborn and mountains of workarounds.

You'll waste YEARS of engineering time trying to create environments on demand. And even if you succeeded, you're going to land into another ambush: they're expensive as heck and they create the worst cultural incentives. Your bill will explode and there won't be any easy fixes.

"Test (Gamma)" suggests having an "as production-like as possible" environment. We know that is a lie. Prod-like environments don't exist: they're the spherical cow of the software world! Staging environments are both useful and useless at the same time, so I'll give this a pass.

But this is not how testing should be done today! This is how we did testing when we only had basic building blocks (VMs and immutable infra), around 2014-ish. We, as an industry, have seen this pattern fail over and over again. It does not work for most companies!

- Most companies don't have AWS-levels of reliability requirements 🚨
- Most companies can't afford creating 100s of environments a day 💰
- Most companies don't have countless internal dev tools to make this process work, and for most it does not make sense to have such teams 🛒♾️

So, what do high-performing teams do? They invest in people & they use off-the-shelf software and tooling that enable them to move fast in a cost-effective way while at the same time improving the customer experience. They build operable software and safely test in production!

Release each change in production under a feature flag and test it like that! Seriously, it's a well-documented and mature practice. We've known for years how to do this. We've had mature off-the-shelf tools for this since 2016! Boring rolling deploy + feature-flag release = 🚀
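
For folks who haven't seen it, here's roughly what "release under a feature flag" looks like in code. This is a minimal sketch, not any vendor's API: the `Flag` class, `is_enabled`, `checkout`, and the `new-checkout` flag name are all hypothetical, and a real setup would use an off-the-shelf flag service instead of an in-memory object.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Flag:
    """Hypothetical in-memory flag; real tools keep this in a managed service."""
    enabled: bool = False       # global kill switch
    rollout_percent: int = 0    # 0-100, share of users who get the change
    allow_users: set = field(default_factory=set)  # targeted previews


def bucket(flag_name: str, user_id: str) -> int:
    """Map a user to a stable bucket in [0, 100) so decisions don't flap."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag_name: str, flag: Flag, user_id: str) -> bool:
    if not flag.enabled:                 # kill switch wins over everything
        return False
    if user_id in flag.allow_users:      # e.g. QA or a preview customer
        return True
    return bucket(flag_name, user_id) < flag.rollout_percent


def checkout(user_id: str, flags: dict) -> str:
    """The new code ships to prod dark; the flag decides who actually runs it."""
    flag = flags.get("new-checkout", Flag())
    if is_enabled("new-checkout", flag, user_id):
        return "new checkout flow"
    return "old checkout flow"
```

The deploy and the release are now two separate events: the code is already in prod, and "rolling back" is flipping `enabled` to `False` rather than redeploying.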

Boring deploys! Rolling deploys are shockingly simple and reliable when everything is behind a feature flag 😉 Rare, complex change? Blue/green: add a new deployment with a new URL that gets only specific traffic (controlled by upstream feature flags)!
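
The blue/green variant can be sketched the same way. Again, this is an illustration under assumptions: the URLs, the `RoutingFlag` class, and the idea of routing by tenant are all made up here, and in practice this decision would live in your load balancer or API gateway config.

```python
from dataclasses import dataclass

# Hypothetical internal URLs for the two deployments.
BLUE_URL = "https://blue.internal.example"    # current deployment
GREEN_URL = "https://green.internal.example"  # new deployment, new URL


@dataclass
class RoutingFlag:
    """Hypothetical upstream flag that decides who reaches the new deployment."""
    enabled: bool = False
    tenants: frozenset = frozenset()  # only these tenants get routed to green


def pick_upstream(tenant_id: str, flag: RoutingFlag) -> str:
    """Send specific traffic to green; everyone else stays on blue."""
    if flag.enabled and tenant_id in flag.tenants:
        return GREEN_URL
    return BLUE_URL


flag = RoutingFlag(enabled=True, tenants=frozenset({"internal-dogfood"}))
print(pick_upstream("internal-dogfood", flag))  # https://green.internal.example
print(pick_upstream("acme-corp", flag))         # https://blue.internal.example
```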

Feature flagging for releases! They're:
- safer ▶️ rollbacks are done in seconds instead of minutes
- more reliable ▶️ no always-clean databases and envs
- cost-effective ▶️ no creating 20 servers to test a typo
- better for the customer ▶️ targeted previews and rollouts

I am not saying folks should not have the ability to create environments on demand: that should definitely still be a thing! But it should be manual and rare: only done for rare intense changes or specific testing that demands *this level of effort*.

Getting back to DPRA's problems, Problem 3: it glosses over vital topics. There's barely any discussion of local development or versioning. Gitpod, Codespaces, Cloud9(🤣)? Tilt, Telepresence? GitOps? These are complex, highly debated topics that are just... ignored.

AWS' Deployment Pipeline Reference Architecture sets folks up for failure. Even if you somehow succeed, you'll find yourself with a sluggish, expensive, unmaintainable mess. Worse, this will traumatize a whole new generation of developers and ruin a whole new batch of companies.

AWS' Deployment Pipeline Reference Architecture, in its current form, is a recipe for disaster. There is a desperate need for a reference architecture and more documentation, but this is not it.

Since I know I'll be spammed: I have no open slots for this type of work, but David Raistrick is an awesome consultant who can help with this! Reach out to him on LinkedIn: linkedin.com/in/draistrick
