Do you remember the times when applications were developed in sprints and deployed once per month? I certainly do, and believe me, I do not want to go back!
This is a story about our transformation.
Once a sprint ended after three or four weeks (some rascals had two-week sprints), you were supposed to have a deployable, tested product which could be installed to the production environment so that users would finally be able to use the new features. How often did that happen? Most of the time the working software contained content from multiple sprints, i.e., one to two months’ worth of work.
If by some miracle you had working software by the end of the sprint, then the fun part started. We built the deployment package, and either you, your teammate or someone from the ops team was responsible for installing the new package into production. Usually, this required a maintenance break which had to be agreed with the business, and they always insisted that it be either on a weekend or at night. A few times we made the mistake of deploying on a Friday. Spending Friday afternoon putting out the fires taught us a lesson: “don’t deploy on Fridays.”
The deployment itself included a twenty-foot-long checklist so that we could be sure every deployment-related task was performed correctly. When all the steps were completed, we executed some quick smoke tests and waited for the next business day, when all the users would start hammering the system. Sometimes everything went well, but quite often we needed to release fix-packs, one after another, to get things to stabilize. And there went the first week of our next sprint, spent putting out fires.
The terrifying part was when something went wrong during the deployment and the person who performed it had no clue what had happened. Usually, this was because the person deploying the software was not the developer who originally implemented the feature, and who might have noticed which task was missing from the checklist.
When things started to go wrong, we eventually began to fear the releases, and the release interval grew. Fewer releases only led to bigger problems.
However, is there something we can do about this? Almost ten years ago, in 2009, John Allspaw and Paul Hammond from Flickr gave a disruptive presentation called “10 deploys per day”. Surely you cannot have ten deploys per day even today… or can you?
Let me begin by being clear about one important thing. Changing old monolithic deployments will be hard, I mean really hard, maybe even impossible. You will have a much easier time if you can build all of the following characteristics directly into a new service. A state-of-the-art deployment pipeline must be built into the service from day one, even before implementing any features. It does not need to be top-notch right at the beginning; you will learn why later.
Our state-of-the-art pipeline allows us to deploy at any time, even on a Friday afternoon! This gives us a great opportunity to push features (code) into production as soon as they are finished, by the very developer who implemented them. There is almost zero waiting time before new features are usable by the end users. Because we deploy whenever we want, we do not need to schedule maintenance breaks with the business. When deploys are part of our daily work, all fear and doubt magically disappear.
However, to be able to deploy at any time, quite a lot of things need to be taken care of beforehand. We need duplicated services, rolling deployments, feature flags, fully automated deployment, confidence that the code works, small deployments and lean principles.
We have duplicated all of our services, and they are highly resilient by design. We plan for failures in downstream services and ensure that our services tolerate them. Hence, even if all our downstream services are offline, the user will at least see a friendly error message advising them to try again later.
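The idea of tolerating a downstream outage can be sketched in a few lines: wrap the downstream call and return a friendly message instead of crashing. This is only an illustration under assumed names; `fetch_recommendations` and the wrapper are not from any real service described here.

```python
# A minimal sketch of downstream-failure tolerance: if the dependency
# fails, the user gets a friendly message instead of a crash.
FRIENDLY_ERROR = "Service is temporarily unavailable, please try again later."

def call_with_fallback(downstream_call, fallback=FRIENDLY_ERROR):
    """Return the downstream result, or a friendly fallback on failure."""
    try:
        return downstream_call()
    except Exception:
        # In production you would also log the failure and alert the team.
        return fallback

def fetch_recommendations():
    # Hypothetical downstream call; here we simulate an outage.
    raise ConnectionError("downstream service offline")

print(call_with_fallback(fetch_recommendations))
```

In a real system the fallback would typically be paired with timeouts and retries, but the principle is the same: the failure is contained, not propagated to the user.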
Our services are updated one at a time, so there is always at least one instance up and running, ready to serve our users. For our purposes, two instances have been sufficient, but you need to know how much traffic your service can handle, as one instance will handle everything alone while the other is being updated.
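The rolling update described above boils down to a simple loop: take one instance, update it, confirm it is healthy, then move to the next. The sketch below assumes hypothetical `deploy` and `healthy` callables; real setups delegate this to the orchestrator or load balancer.

```python
# A minimal sketch of a rolling deployment: update instances one at a
# time so at least one instance is always serving traffic.
def rolling_deploy(instances, deploy, healthy):
    """Deploy to each instance in turn; abort if an instance stays unhealthy."""
    for instance in instances:
        deploy(instance)           # take one instance out and update it
        if not healthy(instance):  # verify before touching the next one
            raise RuntimeError(f"{instance} unhealthy after deploy, aborting")

deployed = []
rolling_deploy(
    ["app-1", "app-2"],
    deploy=deployed.append,        # stand-in for the real update step
    healthy=lambda instance: True, # stand-in for a real health check
)
print(deployed)
```

The key property is the health check between steps: a bad build stops after taking down a single instance, while the other keeps serving users.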
Not every feature we deploy to production should be seen by the majority of users. We use feature flags to control which features are shown to which users, and therefore we can push new features that are visible only to a small number of users. This allows us to experiment with new features, gather feedback from end users and tweak the features before they are published to the rest of the users. We have even created dummy user interfaces for a new feature and gathered feedback from end users without building any real functionality.
Deploying code to production and rolling out new features are two separate actions and should always be kept distinct from each other. Our feature rollouts are merely a flick of a switch.
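A feature flag can be as simple as a lookup that decides, per user, whether the already-deployed code is shown. The flag store and names below are illustrative assumptions, not the article's actual setup; real systems usually keep flags in a database or a dedicated flag service so the "switch" can be flicked without a deploy.

```python
# A minimal sketch of a feature flag: the code is in production for
# everyone, but only a small allowlist of users actually sees it.
FLAGS = {
    "new-dashboard": {"enabled": True, "allowed_users": {"alice"}},
}

def is_enabled(flag_name, user):
    """Show the feature only to the group it is currently rolled out to."""
    flag = FLAGS.get(flag_name, {})
    return bool(flag.get("enabled")) and user in flag.get("allowed_users", set())

print(is_enabled("new-dashboard", "alice"))  # True: in the pilot group
print(is_enabled("new-dashboard", "bob"))    # False: sees the old UI
```

Rolling the feature out to everyone is then a data change (widen `allowed_users` or drop the allowlist), not a code deployment.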
Computers do not make mistakes, so we have fully automated our deployment. All deployment-related steps (there are only a few) are guaranteed to be performed. The most important ones are running all the tests once more, building the production package, checking that it is valid and installing it to production.
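The guarantee comes from running the steps as one ordered script where any failure halts everything that follows. The sketch below models exactly the four steps named above with placeholder functions; it is not the API of any real CI tool.

```python
# A sketch of an automated pipeline: every step runs in order, and any
# failure (a raised exception) aborts the deployment before production
# is touched.
def run_pipeline(steps):
    """Run every (name, step) pair in order; a failing step stops the run."""
    for name, step in steps:
        print(f"running: {name}")
        step()

log = []
run_pipeline([
    ("run all tests",            lambda: log.append("tests")),
    ("build production package", lambda: log.append("build")),
    ("validate package",         lambda: log.append("validate")),
    ("install to production",    lambda: log.append("install")),
])
```

Because the checklist lives in code rather than in someone's head, it is performed identically whoever triggers the deploy.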
To deploy at any time, we need enough confidence that we will not break anything. Do we have excessive manual regression tests? Do we verify our features in a QA environment? Do we have testers to test each feature? The answer is “No!” Our confidence comes from the automated tests that run on every deployment.
If we still happen to break something, it usually affects only the users who see the new feature. Sometimes we do mess up badly and make the service unavailable to users. In such cases, rollbacks are similarly quick and easy to make, so time to recover is usually less than 30 minutes. The important thing is not to fear failure but to learn from it. If we break something, rest assured that there will be an automatic check to verify that it does not happen again.
To deploy multiple times per day, we need small deployments. Features are split into small, incrementally built tasks which can be completed independently and deployed to production. Feature flags can be used to hide features that do not yet have all the bells and whistles needed for the final version.
Because we only ever have a small amount of code to deploy, we also merge our code regularly to the main branch in our version control system. There are no long-lived feature branches which drift out of sync with the main branch and cause painful merges as developers work for weeks or months on a single feature. We measure the lifetime of our feature branches in days.
All of this technical excellence is not worth much if you do not know which features are valuable and what the minimum set of features is that will start producing value. It is common for specification documents to list a long set of required features which are all equally important and must all be implemented before the system is “usable.” I challenge you to question whether this is indeed the case!
Implementing one feature at a time and deploying it to production so that end users can use it is the only way to start producing value. Half-completed features lying in some test environment deliver no value. One should constantly ask what the most straightforward implementation is that still produces value. Instead of a full-blown integration, could plain emails suffice in the beginning? Instead of trying to tackle all the complex special cases, how about implementing the most common use case first? It might be all you need, and the difficult corner cases can be handled in the old way (usually we are implementing some existing business process in software, right?).
The same goes for the deployment pipeline. It needs just a sufficient set of features in the beginning. It does not need to be perfect.
You might think that deploying features to production is the end. Far from it! Now we get to the hard part: monitoring and learning.
We have built an almost state-of-the-art monitoring system into our environment. Our monitoring is usually the first to notice when something is wrong, after a deployment or otherwise, and it alerts our team. Besides technical monitoring, we try to monitor how end users are using the system. People seldom complain about the bad usability of a feature, but it can be detected proactively.
We share the outcome of our monitoring on a simple status page which shows the current state of our services. The goal is to notify users instantly when there are problems with our services, so they know that we are working on the issue.
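At its core, a status page aggregates per-service health checks into one answer. This sketch uses invented service names and trivial checks; a real page would run the checks on a schedule and render the result as HTML.

```python
# A minimal sketch of the data behind a status page: run every service's
# health check and report "ok" only when all of them pass.
def overall_status(checks):
    """Return ('ok' | 'degraded', per-service results) for the given checks."""
    results = {name: check() for name, check in checks.items()}
    status = "ok" if all(results.values()) else "degraded"
    return status, results

status, results = overall_status({
    "api":      lambda: True,
    "database": lambda: True,
    "search":   lambda: False,  # pretend one dependency is currently down
})
print(status, results)
```

Showing the degraded state publicly is the point: users learn about the problem from you, not from a failed request.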
Another important part is learning from and gathering feedback from the end users. We collect preliminary requirements for a new feature by running surveys right inside the application. The goal is to learn what features end users need, how they would benefit the business and how to improve things continually.
Having a quick deployment cycle and short lead times allows us to run different experiments and to collect feedback from them. If something does not work, we change it.
You might think that companies such as Google, Amazon and Twitter, which do countless deployments per day, were born as unicorns and that your company cannot possibly do the same. Well, all of those companies have struggled with their deployments in the past. They simply managed to start the transformation and succeeded.
You may also think that none of this concerns you because you are building enterprise software. We are building enterprise software, with a state-of-the-art deployment pipeline! In today’s hectic world, you simply do not have the luxury of waiting months for deployments; you need to experiment, measure, learn and improve continuously, not quarterly.
Building an excellent service for your employees or customers will translate into higher value for your company, either as increased revenue or as cost savings. You created the service because you needed it to solve some business problem. If no one is willing to use the software, or the software is more complicated than the original problem, the value will be negative. Building intuitive and straightforward services allows users to concentrate on their real work, and it also reduces the time and money spent instructing end users.
What would be better than a happy user who enjoys using your software?