CI
CI is the component of Common in charge of providing the infrastructure of a Continuous Integration and Continuous Delivery system (CI/CD).
Public Oath
As the CI/CD is the backbone of our technological operation, we will constantly look for new technology that improves our development, testing and release processes.
Fluid Attacks will look for a CI/CD solution that is:
- Cloud based
- Scalable so developers get feedback as soon as possible
- Secure
- Integrable with the rest of our stack
Architecture
- The module is managed as code using Terraform.
- It implements an `inspector` AWS Lambda in Python.
- Our CI/CD system is GitLab CI.
- It is hosted on AWS.
- It is an implementation of the terraform-aws-gitlab-runner module.
- Each product has a bastion machine that spawns workers based on job demand.
- Workers have an `aarch-64-linux` architecture.
- There is a `common-x86` bastion that spawns `x86_64-linux` workers for jobs that only run on that architecture.
- `aarch-64-linux` workers have 1 vCPU and 8 GiB of memory.
- `x86_64-linux` workers have 2 vCPUs and 8 GiB of memory.
- Workers have Internet access.
- All workers have internal solid-state drives for maximum performance.
- Workers use the GitLab OpenID provider to assume AWS IAM roles provided by common/users.
- There is a `terraform_state_lock` DynamoDB table that allows locking Terraform states when deploying infrastructure changes, helping prevent state corruption.
- It uses an AWS S3 bucket for storing cache.
- There is an API Gateway that receives merge request and pipeline triggers from GitLab.
- The API Gateway forwards the requests to the Lambda.
- The Lambda performs the following actions:
  - Cancel unnecessary pipelines.
  - Rebase merge requests with trunk.
  - Merge merge requests with green pipelines.
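The three Lambda actions above can be pictured as a small decision function. This is only a minimal sketch, not the actual Lambda code: the `MergeRequest` fields, the `Action` names, and the rule for what makes a pipeline "unnecessary" (its merge request is closed, or a newer pipeline exists) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    # Hypothetical action names, one per behavior listed above.
    CANCEL_PIPELINE = "cancel_pipeline"
    REBASE = "rebase"
    MERGE = "merge"
    WAIT = "wait"


@dataclass(frozen=True)
class MergeRequest:
    # Hypothetical fields; the real Lambda may track different data.
    is_open: bool
    is_behind_trunk: bool
    pipeline_status: str  # for example "running", "success" or "failed"
    pipeline_is_latest: bool


def decide(mr: MergeRequest) -> Action:
    """Pick one action for a merge request, checked in priority order."""
    # Assumption: a running pipeline is unnecessary when its merge request
    # is no longer open or when a newer pipeline has replaced it.
    if mr.pipeline_status == "running" and (
        not mr.is_open or not mr.pipeline_is_latest
    ):
        return Action.CANCEL_PIPELINE
    if not mr.is_open:
        return Action.WAIT
    # Rebase first so the pipeline result reflects the current trunk.
    if mr.is_behind_trunk:
        return Action.REBASE
    # Only merge requests with a green pipeline are merged.
    if mr.pipeline_status == "success":
        return Action.MERGE
    return Action.WAIT
```

For instance, `decide(MergeRequest(True, True, "success", True))` returns `Action.REBASE` in this sketch, because an out-of-date branch is rebased before it is considered for merging.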
Contributing
Please read the contributing page first.
General
- Any changes to the CI pipelines must be done via Merge Requests.
- Any changes to the AWS autoscaler infrastructure must be done via Merge Requests by modifying its Terraform module.
- If a scheduled job takes longer than six hours, it should generally run in Compute; otherwise, it can use GitLab CI.
Components
We use:
- The terraform-aws-gitlab-runner module for defining our CI as code.
- GitLab Bot for listening to GitLab webhooks and triggering actions like canceling unnecessary pipelines, rebasing MRs and merging MRs.
- Terraform state locking to avoid race conditions.
Make the lambda review my merge request
The lambda reviews all currently opened merge requests when:
- A new merge request is created.
- An existing merge request is updated, approved (by all required approvers), unapproved, merged, or closed.
- An individual user adds or removes their approval to an existing merge request.
- All threads are resolved on the merge request.
If you want the lambda to rebase or merge your merge request, you can perform one of the previously mentioned actions on any of the currently opened merge requests.
Tuning the CI
Any team member can tune the CI for a specific product by modifying the values passed to it in the terraform module runners section.
One of the most important values is `idle-count`, as it:
- Specifies how many idle machines should be waiting for new jobs. The more jobs a product pipeline has, the more idle machines it should have. You can take the integrates runner as a reference.
- Dictates the rate at which the CI turns on new machines. That is, if a pipeline with 100 jobs is triggered for a CI with `idle-count = 8`, it will turn on new machines in batches of `8` until it stabilizes.

More information about how the autoscaling algorithm works can be found here.
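To get a feel for the batching described above, the following is a deliberately simplified model, not the real autoscaling algorithm. It assumes one machine per job, no machines already running, and that each batch turns on at most `idle-count` machines. The function name `scale_up_batches` is hypothetical.

```python
def scale_up_batches(pending_jobs: int, idle_count: int) -> list[int]:
    """Return how many machines are turned on in each successive batch.

    Simplified model: one machine per job, and every batch is capped
    at idle_count machines until all pending jobs have a machine.
    """
    if pending_jobs < 0:
        raise ValueError("pending_jobs must not be negative")
    if idle_count <= 0:
        raise ValueError("idle_count must be positive")

    batches = []
    remaining = pending_jobs
    while remaining > 0:
        batch = min(idle_count, remaining)
        batches.append(batch)
        remaining -= batch
    return batches
```

Under these assumptions, a pipeline with 100 jobs and `idle-count = 8` needs twelve full batches of 8 machines plus a final batch of 4, which is why a higher `idle-count` shortens the ramp-up for large pipelines.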
Terraform state lock
Sometimes, Terraform CI jobs get stuck in a failed state due to a locked state file.
The Terraform state file
stores local information
regarding our infrastructure configuration,
which is used to determine
the changes that need to be made in the real world (`terraform apply`).
This state file is shared amongst team members to ensure consistency;
however, if it is not properly locked,
it can lead to data loss, conflicts, and state file corruption.
In case of conflicts with the state file, please follow the steps below:
- Obtain the state lock ID from the failed job.
- Access the `terraform_state_lock` table in DynamoDB by going to AWS - production in Okta (requires the prod_integrates role).
- Search for the ID in the Info attribute and delete the `.tfstate` item.
- Attempt to rerun the job that failed.
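The first step above, obtaining the lock ID, can be done by eye or with a small helper. The sketch below assumes the failed job log contains Terraform's usual `Lock Info:` block with an `ID:` line. The helper name `extract_lock_id` is hypothetical, and the sample log is abridged and uses placeholder values.

```python
import re
from typing import Optional

# Matches a line such as "  ID:        <lock-id>" in the "Lock Info:" block.
LOCK_ID_PATTERN = re.compile(r"^\s*ID:\s+(?P<lock_id>\S+)\s*$", re.MULTILINE)


def extract_lock_id(job_log: str) -> Optional[str]:
    """Return the first state lock ID found in a job log, or None."""
    match = LOCK_ID_PATTERN.search(job_log)
    return match.group("lock_id") if match else None


# Abridged, illustrative log with placeholder values.
SAMPLE_LOG = """\
Error: Error acquiring the state lock

Lock Info:
  ID:        00000000-0000-0000-0000-000000000000
  Path:      example-bucket/example.tfstate
  Operation: OperationTypeApply
"""

if __name__ == "__main__":
    print(extract_lock_id(SAMPLE_LOG))
```

The returned ID is the value to search for in the Info attribute of the `terraform_state_lock` table. Logs copied from the CI web interface may carry color codes or timestamps at the start of each line, in which case the pattern would need adjusting.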
Debugging
As we use a multi-bastion approach, the following tasks can be considered when debugging the CI.
Review GitLab CI/CD Settings
If you’re an admin in GitLab, you can visit the CI/CD Settings to validate if bastions are properly communicating.
Inspect infrastructure
You can inspect both bastions and workers from the AWS EC2 console. Another useful place to look when you suspect spot availability problems is the spot requests view.
Connect to bastions or workers
You can connect to any bastion or worker using AWS Session Manager.
Just go to the AWS EC2 console,
select the instance you want to connect to,
click on Connect,
and start a Session Manager session.
Debugging the bastion
Typical things you want to look at when debugging a bastion are:
- `docker-machine` commands. This will allow you to inspect and access workers with commands like `docker-machine ls`, `docker-machine inspect <worker>`, and `docker-machine ssh <worker>`.
- `/var/log/messages` for relevant logs from the `gitlab-runner` service.
- `/etc/gitlab-runner/config.toml` for bastion configurations.
Debugging a specific CI job
You can find out which machine ran a job by looking at its logs.
Consider an example job where the line
Running on runner-cabqrx3c-project-20741933-concurrent-0 via runner-cabqrx3c-ci-worker-skims-small-0-1677537770-87c5ed70...
is displayed.
This tells us that the worker with the name
runner-cabqrx3c-ci-worker-skims-small-0-1677537770-87c5ed70
was the one that ran it.
From there you can access the bastion and run memory or disk debugging.
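When checking many jobs, the worker name can be pulled out of that log line programmatically. This is a minimal sketch that assumes the line keeps the `Running on <executor> via <worker>...` shape shown in the example and that worker names contain no dots. The function name `extract_worker` is hypothetical.

```python
import re
from typing import Optional

# Captures the name after "via", stopping at whitespace or the trailing dots.
RUNNING_ON_PATTERN = re.compile(
    r"Running on (?P<executor>\S+) via (?P<worker>[^\s.]+)"
)


def extract_worker(log_line: str) -> Optional[str]:
    """Return the worker name from a 'Running on ... via ...' log line."""
    match = RUNNING_ON_PATTERN.search(log_line)
    return match.group("worker") if match else None


if __name__ == "__main__":
    line = (
        "Running on runner-cabqrx3c-project-20741933-concurrent-0 via "
        "runner-cabqrx3c-ci-worker-skims-small-0-1677537770-87c5ed70..."
    )
    print(extract_worker(line))
```

The extracted name is the one to pass to `docker-machine inspect <worker>` or `docker-machine ssh <worker>` on the bastion.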
Custom API Tooling
We’ve developed tooling specifically for monitoring job performance in CI. These tools interact directly with GitLab’s GraphQL API, extracting and analyzing data to generate both plots and tabulated data in CSV format on your local machine.
General analytics
Retrieve and visualize general job, pipeline or merge requests analytics using the CLI tool with its default workflow.
For fetching, parsing and plotting data, run:
Optional arguments:
Functional tests
Identify the slowest and flakiest functional tests within integrations using the CLI tool:
Optional arguments:
End to end tests
Detect flakiness, generate heatmaps, and timing plots for end-to-end tests with the following CLI tool:
Optional arguments:
Customization
Easily create your own CLI tool to extract, parse, and visualize GitLab
information by leveraging the module located at `common/ci/analytics`.
Refer to the source code of the aforementioned tools for inspiration.