Helix Engineering: Making dependency management suck less
The importance of repeatable builds and deploys
At Helix, predictable and repeatable builds and deployments are crucial. A large portion of our software is responsible for securely storing, retrieving, and processing sensitive information. It is critical that our engineers know exactly which version of each microservice is currently deployed, the versions of its major dependencies, and even the version of the toolchain used to build and deploy the code. Together, that deterministic knowledge lets us reason about, audit, and troubleshoot issues, and manage upgrades in a systematic way.
Semantic versioning and GitOps FTW
Let’s talk shop. The primary dependencies we’re concerned with are:
| Dependency | Description |
|---|---|
| hss | Helix shared services: a library that all microservices use for common functionality |
| terraform | A collection of Terraform projects and modules for provisioning AWS resources and deploying services to ECS |
| build-scripts | A set of build scripts for creating Docker images, plus CI/CD logic |
| base docker images | A predefined set of Docker images on ECR that services use as their base image |
Each of these dependencies (except the Docker images) lives in its own Git repo and contains a VERSION file with a semantic version that gets bumped for every change. Our tooling ensures that changes can’t be merged without bumping the version.
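To make the "no merge without a bump" rule concrete, here is a minimal sketch of the kind of check such tooling could run. It is not our actual tooling: the type and function names (`Version`, `ParseVersion`, `CheckBumped`) are hypothetical, and it assumes the VERSION file holds a plain MAJOR.MINOR.PATCH string with no pre-release or build metadata.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Version is a parsed MAJOR.MINOR.PATCH semantic version.
// (Hypothetical type for illustration; pre-release tags are not modeled.)
type Version struct {
	Major, Minor, Patch int
}

// String renders the version back into MAJOR.MINOR.PATCH form.
func (v Version) String() string {
	return fmt.Sprintf("%d.%d.%d", v.Major, v.Minor, v.Patch)
}

// ParseVersion parses the contents of a VERSION file such as "1.4.2\n".
func ParseVersion(contents string) (Version, error) {
	parts := strings.Split(strings.TrimSpace(contents), ".")
	if len(parts) != 3 {
		return Version{}, fmt.Errorf("expected MAJOR.MINOR.PATCH, got %q", contents)
	}
	var nums [3]int
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return Version{}, fmt.Errorf("invalid component %q in %q", p, contents)
		}
		nums[i] = n
	}
	return Version{Major: nums[0], Minor: nums[1], Patch: nums[2]}, nil
}

// Compare returns -1, 0, or 1 when v is lower than, equal to, or higher than other.
// Components are compared numerically, so 1.10.0 is higher than 1.9.9.
func (v Version) Compare(other Version) int {
	a := [3]int{v.Major, v.Minor, v.Patch}
	b := [3]int{other.Major, other.Minor, other.Patch}
	for i := range a {
		if a[i] < b[i] {
			return -1
		}
		if a[i] > b[i] {
			return 1
		}
	}
	return 0
}

// CheckBumped returns an error unless the proposed VERSION contents are
// strictly higher than the VERSION contents on the base branch.
func CheckBumped(baseContents, proposedContents string) error {
	base, err := ParseVersion(baseContents)
	if err != nil {
		return fmt.Errorf("base VERSION: %w", err)
	}
	proposed, err := ParseVersion(proposedContents)
	if err != nil {
		return fmt.Errorf("proposed VERSION: %w", err)
	}
	if proposed.Compare(base) <= 0 {
		return fmt.Errorf("version must increase: base %s, proposed %s", base, proposed)
	}
	return nil
}
```

A CI step could read the VERSION file from the base branch and from the PR branch, pass both strings to `CheckBumped`, and fail the build on a non-nil error.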
Each service has a file that contains the semantic versions of the terraform repo and the build-scripts repo that it requires to be built and deployed. It also has a Dockerfile that specifies the exact version of its base Docker image. Finally, the glide.yaml and glide.lock files specify the required version of hss. All these files live in the git repo with their corresponding version of the code. It’s GitOps at its best.
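As an illustration of how a per-service pin file can be consumed, the sketch below parses one into a map. The post doesn't show the real file format, so the `name=version` layout, the `#` comments, and the function name `ParsePins` are all assumptions made for this example.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// ParsePins reads a hypothetical pin file with one "name=version" entry per
// line, skipping blank lines and lines that start with '#'.
// It returns a map from dependency name to pinned version.
func ParsePins(contents string) (map[string]string, error) {
	pins := make(map[string]string)
	scanner := bufio.NewScanner(strings.NewReader(contents))
	for line := 1; scanner.Scan(); line++ {
		text := strings.TrimSpace(scanner.Text())
		if text == "" || strings.HasPrefix(text, "#") {
			continue
		}
		name, version, ok := strings.Cut(text, "=")
		name, version = strings.TrimSpace(name), strings.TrimSpace(version)
		if !ok || name == "" || version == "" {
			return nil, fmt.Errorf("line %d: expected name=version, got %q", line, text)
		}
		// Reject duplicates so a pin can never be silently overridden.
		if _, dup := pins[name]; dup {
			return nil, fmt.Errorf("line %d: duplicate entry for %q", line, name)
		}
		pins[name] = version
	}
	if err := scanner.Err(); err != nil {
		return nil, err
	}
	return pins, nil
}
```

For example, `ParsePins("terraform=2.3.1\nbuild-scripts=0.9.0\n")` yields a map with two entries (the version numbers here are made up). Because the file is committed alongside the code, checking out any commit also tells you exactly which tooling versions it expects.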
Well, LGTM. Problem solved, right? No, not really…
Upgrading dependencies across multiple repos
Most of these dependencies (with the possible exception of hss) are operational dependencies. Most of the time, the core logic and main purpose of a service are not affected by changes to them. Yet all these components evolve rapidly, gaining improvements and new capabilities. Because those changes rarely touch what a service actually does, service owners and developers have little incentive to upgrade to the latest and greatest. It can be argued that there is no business value in upgrading dependencies just for the sake of upgrading dependencies. But at some point, services do need to upgrade. It may be that the service requires a new capability, or it could be a critical security update or some other global change.
When that happens, it can be a nightmare. The developer suddenly has to upgrade to a version that is weeks or months ahead of the one the service currently depends on. Just reading through the list of changes is overwhelming, and the upgrade might break the service or its tests in very interesting ways.
So, the status quo used to be that services had a wide spectrum of versions of all these dependencies. When a service had to upgrade, it was an existential crisis. When a global change had to be rolled out to all the services, it took forever and some services didn’t upgrade even then.
The “Right Thing™” is for services to stay up to date and upgrade to the latest routinely, but that’s a thankless and labor-intensive task. If only it were easy to stay up to date…
To address this problem, the Helix DevOps team developed a tool called “Bumper.” Bumper has a little YAML config file that specifies what changes need to be applied, and it automatically generates a PR with all the necessary changes in a consistent format, including updates to the CHANGES.md and VERSION files. It also integrates with Jira and creates a story in the current sprint for the target service owner (identified via GitHub custom topics). Bumper can also generate a CSV report that shows the current version of each dependency for each service and the status of open PRs. At the end of the day, the service owner/developer only has to approve the PR, and we’re off to the races.
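To give a feel for the report, here is a rough sketch of how a per-service dependency table could be rendered as CSV. This is not Bumper's code or its real output format: the `ServiceStatus` type, the column names, and the PR state values are assumptions made for illustration.

```go
package main

import (
	"encoding/csv"
	"sort"
	"strings"
)

// ServiceStatus describes one service's pinned dependency versions and the
// state of its bump PR. (Hypothetical shape, not Bumper's data model.)
type ServiceStatus struct {
	Service string
	Pins    map[string]string // dependency name -> pinned version
	PRState string            // e.g. "open", "merged", "none" (assumed values)
}

// Report renders one CSV row per service, with one column per dependency in
// the order given by deps, followed by the PR status. Rows are sorted by
// service name so that successive reports are easy to diff.
func Report(deps []string, statuses []ServiceStatus) (string, error) {
	var sb strings.Builder
	w := csv.NewWriter(&sb)

	header := append([]string{"service"}, deps...)
	header = append(header, "pr_status")
	if err := w.Write(header); err != nil {
		return "", err
	}

	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]ServiceStatus(nil), statuses...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Service < sorted[j].Service })

	for _, s := range sorted {
		row := []string{s.Service}
		for _, d := range deps {
			row = append(row, s.Pins[d]) // a missing pin becomes an empty cell
		}
		row = append(row, s.PRState)
		if err := w.Write(row); err != nil {
			return "", err
		}
	}
	w.Flush()
	return sb.String(), w.Error()
}
```

One column per dependency makes version drift visible at a glance: a service that is several releases behind stands out as soon as the file is opened in a spreadsheet.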
Now, what really happened is…
• Some service tests failed due to the changes, so the PR couldn’t be approved immediately without fixing those tests
• Service owners got upset when a new Bumper Jira issue got dropped into their sprint
• Some regulated services need to go through a special validation and verification process
• It took several weeks for the first mega-bump to propagate through all the services
We needed to improve the process. We switched from a push model where Bumper does everything, creating PRs and Jira issues for all services, to a pull model. With the pull model, each service owner is responsible for running Bumper just for their service. Bumper will create the PR and the Jira issue, but the service owner controls the timing.
Time will tell if this model is successful in the long run, but it is definitely better than being helpless and unable to control the proliferation of legacy dependencies. In a future blog post I’ll cover the internals of Bumper and show how it works. Stay tuned!