
Debugging machine flow

Machine flow in production relies on many cloud services and involves many different actions and steps.

The basic process is as follows:

  • An action triggers a machine execution (a scheduler, a reattack request, etc.).
  • The integrates backend generates the YAML config that skims will use to analyze the Git Roots.
  • It then puts a batch job on the skims queue.
  • The batch job downloads the repository that skims will analyze and the config file generated by integrates.
  • The job executes skims and uploads the resulting SARIF file to S3.
  • The job then puts a message on an SQS queue that points to that SARIF file.
  • Our k8s cluster has servers dedicated to consuming the messages from that queue.
  • Those servers use the message to download the SARIF and the config YAML, and process the results accordingly.

A lot of things can go wrong during this process, and a bug might be found in any of these steps.

To make debugging easier, here is a step-by-step guide a developer can follow to locate the error once an issue has been identified:

  1. Search for the batch job associated with the execution you need to analyze. Machine batch jobs are named using the following format: machine_<group name>_<job origin>_<job type>_<id>, where each component means:
    • group name: The group the job belongs to. Each machine job runs a skims execution for any given number of roots belonging to a certain group.
    • job origin: The action that triggered the job. As mentioned before, there are many actions that can trigger a machine job.
    • job type: The type of technique the job executes (SAST, APK, ALL, etc.).
    • id: A unique identifier that is used to name the config YAML files and the SARIF files uploaded to S3. This is useful for finding each associated file.
  2. If no batch job is found, the bug might be in the integrates action in charge of queueing the job.
  3. If the batch job is found, analyze the logs of the job to see if there were any errors during skims execution.
  4. If no errors are found and the SARIF was generated correctly, check whether there is an error in the SQS queue (for example, a delay in message processing).
  5. If no errors are found there either, the SARIF processing might be the issue. Check that code on the integrates side (server_async module).
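
When searching for a job, it can help to split its name back into the components described in step 1. The following is a minimal sketch, not code from integrates or skims. It assumes that the job origin, job type, and id never contain underscores, so any extra underscores belong to the group name. The names parse_job_name and search_prefix, and the sample values "scheduler" and "abc123", are made up for illustration.

```python
import re
from typing import NamedTuple, Optional

# Assumption: job origin, job type, and id contain no underscores,
# so underscores left over in the middle belong to the group name.
_JOB_NAME = re.compile(
    r"^machine_(?P<group>.+)_(?P<origin>[^_]+)"
    r"_(?P<job_type>[^_]+)_(?P<job_id>[^_]+)$"
)


class MachineJobName(NamedTuple):
    group: str
    origin: str
    job_type: str
    job_id: str


def parse_job_name(name: str) -> Optional[MachineJobName]:
    """Split a machine batch job name into its four components.

    Returns None when the name does not follow the expected format.
    """
    match = _JOB_NAME.match(name)
    if match is None:
        return None
    return MachineJobName(**match.groupdict())


def search_prefix(group: str) -> str:
    """Build the prefix shared by every machine job of a group."""
    return f"machine_{group}_"


if __name__ == "__main__":
    # Hypothetical job name; real origins and ids will differ.
    print(parse_job_name("machine_my_group_scheduler_SAST_abc123"))
    print(search_prefix("my_group"))
```

The id extracted this way is the one to look for in the names of the config YAML and SARIF files in S3.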

In almost all cases, the bug is found in one of these processes.
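
The checklist can also be condensed into a small decision function that maps what was observed to the stage most worth inspecting first. This is only a sketch of the reasoning in the guide: the Evidence fields and the suspect_stage function are hypothetical names, and the values have to be filled in by hand after checking the batch job, its logs, S3, and the SQS queue.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Observations collected by hand while following the checklist."""

    batch_job_found: bool
    job_log_has_errors: bool = False
    sarif_uploaded: bool = False
    sqs_delay_or_error: bool = False


def suspect_stage(evidence: Evidence) -> str:
    """Return the stage to inspect first, in checklist order."""
    if not evidence.batch_job_found:
        # Step 2: the job was never queued.
        return "integrates: action in charge of queueing the job"
    if evidence.job_log_has_errors or not evidence.sarif_uploaded:
        # Step 3: the job ran but skims failed or produced no SARIF.
        return "batch job: skims execution"
    if evidence.sqs_delay_or_error:
        # Step 4: the SARIF exists but the message was not handled in time.
        return "SQS queue: message processing"
    # Step 5: everything upstream looks fine.
    return "integrates: SARIF processing (server_async module)"


if __name__ == "__main__":
    print(suspect_stage(Evidence(batch_job_found=True, sarif_uploaded=True)))
```

Keeping the checks in the same order as the pipeline means the first failing check points at the earliest stage that could explain the problem.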