Compute
Compute is the component of Common in charge of providing out-of-band processing. It can run jobs both on demand and on a schedule.
Public Oath
Fluid Attacks will constantly look for out-of-band computing solutions that balance:
- Cost
- Security
- Scalability
- Speed
- Traceability
Such solutions must also be:
- Cloud based
- Integrable with the rest of our stack
Architecture
- The module is managed as code using Terraform.
- Batch jobs use AWS EC2 Spot machines.
- Spot machines have Internet access.
- Spot machines use the `aarch64-linux` architecture.
- Batch jobs only run for as long as an EC2 Spot instance lasts, so design them with idempotency and retry mechanisms in mind.
- Jobs can be sent to batch in two ways:
- Using curl, boto3, or any other tool that allows interacting with AWS API.
- Defining a schedule, which periodically submits a job to a queue.
- AWS EventBridge is used to trigger scheduled jobs.
- On failure, an email is sent to development@fluidattacks.com
- Batch machines come in two sizes:
  - `small`, with 1 vCPU and 8 GiB of memory.
  - `large`, with 2 vCPU and 16 GiB of memory.
- All runners have internal solid-state drives for maximum performance.
- A special compute environment called `warp`, meant for cloning repositories via Cloudflare WARP, uses machines with 2 vCPU and 4 GiB of memory on the `x86_64-linux` architecture.
- Compute environments use subnets in all availability zones within `us-east-1` for maximum Spot availability.
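As noted above, jobs can be submitted to Batch with boto3 (or any other tool that talks to the AWS API). Below is a minimal sketch of building a `submit_job` request; the job, queue, and definition names are hypothetical, and the retry strategy reflects the idempotency advice above, since Spot instances can be reclaimed mid-run.

```python
# Sketch: building a submit_job request for AWS Batch (names are hypothetical).

def build_submit_job_request(
    name: str,
    queue: str,
    definition: str,
    command: list[str],
) -> dict:
    """Build the keyword arguments for batch_client.submit_job(...)."""
    return {
        "jobName": name,
        "jobQueue": queue,
        "jobDefinition": definition,
        "containerOverrides": {"command": command},
        # Spot instances can be reclaimed at any time,
        # so let Batch retry the job a few times.
        "retryStrategy": {"attempts": 3},
    }


# Usage (requires AWS credentials):
# import boto3
# boto3.client("batch").submit_job(**build_submit_job_request(
#     "example-job", "example-queue", "example-definition", ["echo", "hello"]))
```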
Contributing
Please read the contributing page first.
General
- You can access the Batch console after authenticating to AWS via Okta.
- If a scheduled job takes longer than six hours, it should generally run in Batch; otherwise, you can use the CI.
Schedules
Schedules are a powerful way to run tasks periodically.
You can find all schedules here.
Creating a new schedule
We highly advise you to take a look at the currently existing schedules to get an idea of what is required.
Some special considerations are:
- The `scheduleExpression` option follows the AWS schedule expression syntax.
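For illustration, AWS schedule expressions come in two forms, `rate(...)` and `cron(...)`; the expressions below are hypothetical examples, and the checker is only a naive shape check, not the full AWS grammar.

```python
# Hypothetical examples of AWS schedule expressions:
#   "rate(12 hours)"      -> every 12 hours
#   "cron(0 6 * * ? *)"   -> daily at 06:00 UTC

def is_schedule_expression(expr: str) -> bool:
    """Naive shape check for an AWS schedule expression."""
    if expr.startswith("rate(") and expr.endswith(")"):
        return True
    if expr.startswith("cron(") and expr.endswith(")"):
        # AWS cron expressions have six fields:
        # minutes, hours, day-of-month, month, day-of-week, year.
        return len(expr[5:-1].split()) == 6
    return False
```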
Testing the schedules
Schedules are tested by two Makes jobs:
- `m . /common/compute/schedule/test` ensures that:
  - all schedules comply with a given schema;
  - all schedules have at least one maintainer with access to the universe repository;
  - every schedule is reviewed by a maintainer on a monthly basis.
- `m . /deployTerraform/commonCompute` tests the infrastructure that will be deployed when new schedules are created.
Deploying schedules to production
Once a schedule reaches production, the infrastructure required for running it is created.
Technical details can be found here.
Local reproducibility in schedules
Once a new schedule is declared, a Makes job with the format `computeOnAwsBatch/schedule_<name>` is created for local reproducibility. Generally, to run any schedule, all that is necessary is to export the `UNIVERSE_API_TOKEN` variable.

Bear in mind that `data.nix` becomes the single source of truth regarding schedules. Everything is defined there, albeit with a few exceptions.
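A minimal sketch of how the local invocation could be assembled, assuming a hypothetical schedule named `example`; the only real prerequisite from the text above is that `UNIVERSE_API_TOKEN` is exported.

```python
import os


def schedule_command(name: str) -> list[str]:
    """Build the Makes invocation for reproducing a schedule locally."""
    # The schedule requires the Universe API token to be exported first.
    if "UNIVERSE_API_TOKEN" not in os.environ:
        raise RuntimeError("export UNIVERSE_API_TOKEN before running a schedule")
    return ["m", ".", f"/computeOnAwsBatch/schedule_{name}"]


# Usage (requires Makes to be installed):
# import subprocess
# subprocess.run(schedule_command("example"), check=True)
```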
Testing compute environments
Testing compute environments is hard for multiple reasons:
- Environments use init data that is critical for properly provisioning machines.
- Environments require AWS AMIs that are especially optimized for ECS.
- When upgrading an AMI, many things within the machines change, including the `cloud-init` version (the software that initializes the machine using the provided init data), the GLIBC version, among many others.
- There is no comfortable way to test this locally or in CI, which forces us to rely on productive test environments.

`test` is the compute environment for testing. It uses the `common/compute/infra/init/test` init data.
Below is a step-by-step guide to testing environments.
1. Change the `test` environment with whatever changes you want to test.
2. `direnv` to the AWS `prod_common` role.
3. Export `CACHIX_AUTH_TOKEN` in your environment. You can find this variable in GitLab's CI/CD variables. If you do not have access to it, ask a maintainer.
4. Deploy the changes made to the environment you want to test with `m . /deployTerraform/commonCompute`.
5. Queue compute test jobs.
6. Review that jobs are running properly on the `test` environment.
7. Extend your changes to the production environments.