Debugging machine flow
In production, the machine flow relies on several cloud services and involves many different actions and steps.
The basic process is as follows:
- An action triggers a machine execution (a scheduler, a reattack request, etc.).
- The integrates backend generates the YAML config that skims will use to analyze the Git Roots.
- After that, it also submits a batch job to the skims queue.
- The batch job downloads the repository that skims will analyze and the config file generated by integrates.
- It executes skims and uploads the SARIF file to S3.
- It puts a message pointing to that SARIF file on an SQS queue.
- Our k8s cluster has servers dedicated to consuming the messages from that queue.
- A server uses the message to download the SARIF and the config YAML, and processes the results accordingly.
Many things can go wrong during this process, and a bug might be in any of these steps.
To make debugging easier, once a developer identifies an issue, here is a step-by-step guide to locating the error:
- Search for the batch job associated with the execution you need to analyze.
Machine batch jobs are named using the following format:
machine_<group name>_<job origin>_<job type>_<id>
Where each component means:
- group name: Each machine job runs a skims execution for any given number of roots belonging to a certain group.
- job origin: The action that triggered the job. As mentioned before, many actions can trigger a machine job.
- job type: The type of technique the job executes (SAST, APK, ALL, etc.).
- id: A unique identifier that is used to name the config YAML files and the SARIF files uploaded to S3. This is useful for finding each associated file.
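When searching through many jobs, it can help to split a job name back into its components. The following is only a minimal sketch, not code from integrates or skims: the names `MachineJobName` and `parse_machine_job_name` are hypothetical, and it assumes that the job origin, job type, and id never contain underscores (the group name may). The sample value in the usage note is made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MachineJobName:
    """Components of a machine batch job name (hypothetical helper type)."""

    group_name: str
    job_origin: str
    job_type: str
    job_id: str


def parse_machine_job_name(name: str) -> MachineJobName:
    """Split machine_<group name>_<job origin>_<job type>_<id> into parts.

    Assumption: only the group name may contain underscores, so the
    last three underscore-separated fields are origin, type, and id.
    """
    prefix = "machine_"
    if not name.startswith(prefix):
        raise ValueError(f"not a machine job name: {name!r}")
    # Split from the right so a group name with underscores stays intact.
    parts = name[len(prefix):].rsplit("_", 3)
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"unexpected machine job name format: {name!r}")
    return MachineJobName(*parts)
```

For example, `parse_machine_job_name("machine_mygroup_scheduler_SAST_abc123").job_id` gives the id to look for in the config YAML and SARIF file names on S3.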
- If no batch job is found, the bug might be in the integrates action in charge of queueing the job.
- If the batch job is found, analyze the logs of the job to see if there were any errors during skims execution.
- If no errors are found and the SARIF was generated correctly, check whether there is a problem in the SQS queue (for example, a delay in message processing).
- If no errors are found there either, the SARIF processing might be the issue. Check that code on the integrates side (server_async module).
In almost all cases, the bug is in one of these processes.
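The checklist above can be read as a small decision procedure. The sketch below encodes it in Python purely as an illustration: the `Suspect` enum and the `next_suspect` function are hypothetical names, and the boolean inputs stand for observations the developer makes manually (nothing here queries AWS or integrates).

```python
from enum import Enum


class Suspect(Enum):
    """Where to look next, following the order of the checklist."""

    QUEUEING_ACTION = "integrates action in charge of queueing the job"
    SKIMS_EXECUTION = "skims execution inside the batch job"
    SQS_QUEUE = "SQS queue (for example, delayed message processing)"
    SARIF_PROCESSING = "SARIF processing in integrates (server_async module)"


def next_suspect(
    batch_job_found: bool,
    skims_errors_in_logs: bool,
    sarif_generated: bool,
    sqs_healthy: bool,
) -> Suspect:
    """Return the first component the checklist points to."""
    if not batch_job_found:
        # No job was queued, so the failure happened before batch.
        return Suspect.QUEUEING_ACTION
    if skims_errors_in_logs or not sarif_generated:
        # The job ran but skims failed or produced no SARIF.
        return Suspect.SKIMS_EXECUTION
    if not sqs_healthy:
        # The SARIF exists, but its message was lost or delayed.
        return Suspect.SQS_QUEUE
    # Everything upstream looks fine; check the results processing.
    return Suspect.SARIF_PROCESSING
```

For instance, a job that finished cleanly and uploaded its SARIF while the queue shows a backlog would be described as `next_suspect(True, False, True, False)`, which points at the SQS queue.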