One of the most crucial things, if not the most crucial, when building software is to minimise the risk of deploying to production. Of course, we want our deployments to affect customers positively, because any negative impact ultimately costs the business money, whether through lost customers or reputational damage. If production is down, customers are unhappy; unhappy customers mean no revenue, and no revenue means no way of paying the developers and engineers who have lovingly crafted your product.
In a world where we’re always striving to deliver more value faster, moving with control or introducing gates that slow a deployment’s path to customers can seem counter-productive. However, much of this “red tape” can be automated, and in high-performing IT teams, deployment largely is.
Deployment rings are one of many DevOps practices used to limit impact on end-users while gradually deploying and validating change in production. The impact, sometimes called the “blast radius”, is typically evaluated through observation, testing, diagnosis of telemetry, and most importantly, user feedback. It is controlled change through a slow, measured roll-out of updates. It can (and probably should) be used alongside things like feature toggling, which is something we’ll cover in another post.
You can feasibly have as many rings as make sense and, generally speaking, you can define them however you like, though if possible, always start with a canary or dog-fooding ring. This lets you gather feedback and telemetry on your changes in a real, in-use environment, from real people, without affecting customers. It is your first point of genuine user validation: not just internal development teams and automated or manual tests, but real-life users. Alongside that feedback, your update will be generating telemetry, from how it’s performing on your App Service or VM through to how many of your users are interacting with the new feature/update/shiny thing. There are plenty of tools available for this, from plain old writing logs to disk and pushing them into tooling like Splunk, to my preferred option, Application Insights. You can never have enough telemetry, so long as it is meaningful.
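To make that concrete, here is a minimal sketch of what emitting structured telemetry for a new feature might look like. The `emit_event` function and the event names are hypothetical; in a real system the JSON line would be shipped to a sink such as Splunk or Application Insights rather than just logged.

```python
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("telemetry")

def emit_event(name, properties):
    """Emit a structured telemetry event as one JSON log line.

    A real implementation would forward this to Splunk,
    Application Insights, or similar; here we log to stdout.
    """
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event": name,
        "properties": properties,
    }
    logger.info(json.dumps(event))
    return event  # returned so callers (and tests) can inspect it

# Example: record that a canary-ring user touched the new feature
emit_event("feature_used", {"feature": "shiny-thing", "ring": "canary"})
```

The key design choice is structure: a JSON event with a name and a properties bag is trivially queryable later, whereas free-text log lines are not.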
Your next ring could be early adopters, or it could be low-load customers. The key to deployment rings is that each ring increases the number of actual users interacting with the update, until you reach 100% saturation – this progression is called the impact of change. Below is an example of how a company could manage its impact of change. Each of these blocks, with the exception of pre-production (added for illustrative purposes), would be a deployment ring.
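One way to picture that widening saturation is to model the rings as plain data. The ring names and percentages below are invented for illustration; the point is only that each ring reaches a larger slice of users until you hit 100%.

```python
# Hypothetical rings, ordered by increasing impact of change.
RINGS = [
    {"name": "canary",         "audience": "internal dog-fooders", "saturation": 0.01},
    {"name": "early-adopters", "audience": "opt-in customers",     "saturation": 0.05},
    {"name": "low-load",       "audience": "small customers",      "saturation": 0.25},
    {"name": "broad",          "audience": "all remaining users",  "saturation": 1.00},
]

def next_ring(current_name):
    """Return the ring after the named one, or None at full saturation."""
    names = [r["name"] for r in RINGS]
    i = names.index(current_name)
    return RINGS[i + 1] if i + 1 < len(RINGS) else None
```

For example, `next_ring("canary")` yields the early-adopters ring, while `next_ring("broad")` returns `None`, signalling the roll-out is complete.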
Fictional Company Inc. have set up their deployment rings based on role and usage. Every change goes first to a canary environment; after an observation and feedback period, the update is deployed to their trial ring, a series of environments for pre-sales demos and free-trial customers. For the rings after that, they decided it was best to deploy to environments grouped by customer size, ranging from 10 users to 10,000, with the update reaching more users at each ring.
This movement between rings could be completely automated, and ideally it should be. But as part of that automation, what should happen if something goes wrong? How do we know when a change is okay to continue on to the next ring? This is where the observation and feedback period really helps: we can use it to define gates, or bulkheads, between the rings. A gate might monitor the number of errors per minute and halt further deployment if a threshold is exceeded, or it could be something like Twitter sentiment analysis; needless to say, there are plenty of options when it comes to automating your gates. Should a gate fail, the deployment should stop and the change may need to be rolled back. However, thanks to the joy of deployment rings, and a ton of other supporting practices such as feature toggling and working in small batches, the impact of such a failure is greatly reduced – though if it does happen, still focus on getting your production deployment back up and running!
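The errors-per-minute gate described above can be sketched in a few lines. The function names and the threshold of 5 errors per minute are assumptions for illustration; a real pipeline would pull these numbers from its telemetry platform and feed the decision back into the release tooling.

```python
def errors_per_minute_gate(error_count, minutes, threshold=5.0):
    """Pass only if the observed error rate stays under the threshold.

    The threshold of 5 errors/minute is an illustrative default.
    """
    return (error_count / minutes) < threshold

def evaluate_gates(gates):
    """Run gates in order; stop promotion on the first failure.

    `gates` is a list of (name, passed) pairs. A failed gate means
    the deployment should not move to the next ring, and the change
    may need rolling back.
    """
    for name, passed in gates:
        if not passed:
            return {"promote": False, "failed_gate": name}
    return {"promote": True, "failed_gate": None}

decision = evaluate_gates([
    # 12 errors over 10 minutes is 1.2/min, under the threshold
    ("errors-per-minute", errors_per_minute_gate(error_count=12, minutes=10)),
    # Stand-in for an external check such as sentiment analysis
    ("sentiment", True),
])
```

Here both gates pass, so `decision["promote"]` is `True`; had the error rate been, say, 80 errors in 10 minutes, the first gate would fail and promotion to the next ring would stop there.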