Helix Engineering: Our approach to microservices, part 2
In Part 1 of our series published last week, we touched on the benefits of microservices, why Helix uses them, and how we’ve progressed down that path. This week, we’ll dig a little deeper.
The tech stack
The Helix platform tech stack relies heavily on AWS, including:
- Step Functions
Authentication and authorization
Security is paramount at Helix. Even inside our network, inter-service communication is both authenticated and authorized. Each service holds a set of tokens: the ones it presents when calling other services, and the ones it uses to authenticate its own callers. When service A makes an API call to service B, it passes the shared auth token A-B. Service B verifies that token A-B is on its list of allowed tokens and then processes the request. If service A also wants to invoke service C, it uses a different token, A-C. Because tokens are issued per pair of services, service B, which knows the A-B token, can’t use it to impersonate service A and access service C on its behalf. If service B needs to talk to service C, it requires its own B-C auth token. (I won’t go into all the details involved in provisioning and maintaining the auth tokens in this post.)
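To make the pairwise-token idea concrete, here is a minimal sketch in Python. It is not our actual implementation: the `TokenRegistry` class, the service names, and the in-memory storage are all assumptions made for illustration, and real provisioning and rotation are out of scope.

```python
import hmac
import secrets


class TokenRegistry:
    """Tokens accepted by one target service, keyed by caller name (hypothetical)."""

    def __init__(self):
        self._tokens = {}  # caller name -> shared token for this caller/target pair

    def issue(self, caller):
        """Create and remember a fresh token for one caller."""
        token = secrets.token_urlsafe(32)
        self._tokens[caller] = token
        return token

    def authenticate(self, presented):
        """Return the caller that owns the presented token, or None."""
        for caller, token in self._tokens.items():
            # Constant-time comparison avoids leaking token contents via timing.
            if hmac.compare_digest(token.encode(), presented.encode()):
                return caller
        return None


# Each target service keeps its own registry, so tokens are per pair.
service_b = TokenRegistry()
service_c = TokenRegistry()

token_a_b = service_b.issue("service-a")  # A presents this to B
token_a_c = service_c.issue("service-a")  # A presents this to C
token_b_c = service_c.issue("service-b")  # B presents this to C
```

With this setup, service B knows `token_a_b`, but presenting it to service C fails because C only accepts `token_a_c` and `token_b_c`.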
But authentication is just one part of the story. Services that talk to each other don’t necessarily need access to all the data or all the endpoints exposed by the target service. For example, service A may expose an API with endpoints E1, E2, and E3, while service B needs to access only endpoint E1. The target service enforces this restriction itself, and we have general-purpose helpers for marking endpoints as requiring internal service access.
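The endpoint-level check in the example above could look roughly like the following sketch. The allow-list structure, the function names, and the use of HTTP-style status codes are assumptions for illustration, not a description of our helpers.

```python
# Hypothetical allow-list kept by service A: caller -> endpoints it may use.
ENDPOINT_ACCESS = {
    "service-b": frozenset({"E1"}),
}


def is_authorized(caller, endpoint):
    """True if the (already authenticated) caller may use this endpoint."""
    return endpoint in ENDPOINT_ACCESS.get(caller, frozenset())


def handle_request(caller, endpoint):
    """Return an HTTP-style status code for a request to service A."""
    if not is_authorized(caller, endpoint):
        return 403  # authenticated, but not allowed to use this endpoint
    return 200
```

Unknown callers get an empty set of endpoints, so the default is to deny.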
This covers cross-service communication, but several services and applications are exposed to end users. For user-facing authentication we use OAuth2. Once a user is authenticated, there is another authorization phase where the user’s role and permissions are determined. For example, one of our critical user-facing tools is called Data Delivery Review, where the genome sequencing data received from the lab is examined by certified personnel. The review process is meticulous, with three levels of review by three different people in different roles (technical reviewer, licensed personnel, and lab director). At each level, the user may perform actions (e.g., accept or reject a sample) that trigger a workflow. All users use the same tool, but what they see and which actions they can perform are determined by the role-based authorization mechanism.
For sensitive workflows, every user action is recorded for later analysis or audit.
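As a rough illustration of role-based authorization combined with an audit trail, consider the sketch below. The three roles come from the review process described above; everything else (the mapping of roles to review levels, the `AuditRecord` fields, and the choice to record denied attempts as well) is an assumption for the sake of the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum


class Role(Enum):
    TECHNICAL_REVIEWER = "technical reviewer"
    LICENSED_PERSONNEL = "licensed personnel"
    LAB_DIRECTOR = "lab director"


# Assumption: each role signs off at exactly one review level, in this order.
REVIEW_LEVEL = {
    Role.TECHNICAL_REVIEWER: 1,
    Role.LICENSED_PERSONNEL: 2,
    Role.LAB_DIRECTOR: 3,
}


@dataclass(frozen=True)
class AuditRecord:
    timestamp: datetime
    user: str
    role: Role
    sample_id: str
    action: str
    allowed: bool


# In-memory stand-in for a durable, append-only audit store.
audit_log = []


def review(user, role, sample_id, sample_level, action):
    """Check whether the role may act at this level and record the attempt."""
    if action not in ("accept", "reject"):
        raise ValueError(f"unknown action: {action}")
    allowed = REVIEW_LEVEL[role] == sample_level
    audit_log.append(
        AuditRecord(datetime.now(timezone.utc), user, role, sample_id, action, allowed)
    )
    return allowed
```

A technical reviewer can act on a sample at level 1 but not at level 2, and both attempts end up in `audit_log`.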
Most of the Helix services expose REST APIs that accept and return JSON payloads. A small number of these services are public and are used by our partners (always authenticated and scoped) or by third-party services that notify our services via webhooks. Most of our services are internal and talk only to each other.
Another way to simplify interactions between services is to eliminate APIs altogether. Loosely coupled services can interact indirectly through queues. They don’t need to be aware of each other, let alone authenticate and authorize each other. Queues change many things in the system architecture and have their own pros and cons. I personally believe that queues are a fundamental building block of distributed systems and are the best way to model and implement many real-world use cases. Note that we explicitly forbid services from communicating via shared data sources, which is a major anti-pattern.
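Here is a small sketch of what queue-based decoupling looks like. It uses Python’s in-process `queue.Queue` purely as a stand-in for a managed queue; the event name, the message shape, and the function names are hypothetical.

```python
import json
import queue

# In-process stand-in for a managed message queue.
sample_events = queue.Queue()


def publish_sample_received(q, sample_id):
    """Producer: emits an event without knowing who will consume it."""
    q.put(json.dumps({"event": "sample_received", "sample_id": sample_id}))


def drain(q):
    """Consumer: processes whatever is queued and returns the sample IDs seen."""
    processed = []
    while True:
        try:
            raw = q.get_nowait()
        except queue.Empty:
            break
        message = json.loads(raw)
        processed.append(message["sample_id"])
        q.task_done()  # acknowledge the message as handled
    return processed
```

The producer and the consumer share only the queue and the message format. Neither one calls the other’s API, so there is nothing to authenticate between them.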
Show me the data
Most of our services keep persistent data (not personally identifiable information) in relational databases on AWS RDS. This is really convenient because RDS does most of the heavy lifting, which removes a lot of the operational effort needed to keep persistent data safe, secure, and highly available. We started with MySQL, began migrating to Postgres, and are now trying to standardize on Aurora PostgreSQL. With Aurora, you get automatic replication across three availability zones and backups to S3, and you don’t need to do capacity planning or provision resources yourself.
We also store non-relational genomics data directly in S3 and use dedicated services to access it. More transient data is stored in SQS until processed.
Finally, we have an operational DB that aggregates data from several services (including third-party services) and stores it in AWS Redshift. Redshift allows us to slice and dice the data easily and get a global view of lab and analysis workflows.
Stay tuned for more
That’s it for part two! In our final chapter, I talk about our CI/CD pipeline, testing, and where we’re headed next. Check it out here!
To make sure you’re seeing the latest from the Helix Engineering team, follow along here!