Introduction

Sorts is the product that helps penetration testers sort the files in a Git repository by their probability of containing security vulnerabilities. It uses Machine Learning to produce a prediction model, which Fluid Attacks’ internal systems then use to update the priority in Integrates of the source code that Fluid Attacks penetration testers audit.

Public Oath

The Sorts model can predict vulnerabilities with a precision of at least 75%, helping penetration testers prioritize files more efficiently.
The data used to train the model is transformed and anonymized to prevent any connection with customers’ sensitive data.
Fluid Attacks will constantly look for new technologies that improve the model’s performance in predicting vulnerabilities and help penetration testers find them more efficiently.
Fluid Attacks will constantly look for new technologies that enhance overall security measures and protect customers’ data.

Architecture

The Sorts architecture is designed to gather the data needed to train its model and to use that model to help penetration testers find vulnerabilities more efficiently. The process can be divided into the following phases:

Data Collection

Data is collected for a given file on the Fluid Attacks platform if, and only if, the file is:
- Part of a currently active repository.
- Not identified as third-party code.
- Written in one of the supported languages.
For each such file, the Git data necessary for training the model is extracted. This data includes the following:
- File path
- Creation date and time
- Last modification date and time
- Authors’ emails
- Associated commit deltas
- Associated commit hashes
- Number of lines
This data is then transformed into Features, which are the information used to train the Sorts model. These Features DO NOT include PII (Personally Identifiable Information).
All file paths are encrypted with Fernet, an industry-standard symmetric encryption scheme that uses AES-128 in CBC mode with HMAC-SHA256 authentication.
The collection of Features is stored in an S3 bucket to be used as a dataset for training the Sorts model.
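
The eligibility check and the transformation into Features described above can be sketched as follows. This is a minimal illustration, not the actual Sorts implementation: the `FileGitData` class, the feature names, and the `SUPPORTED_EXTENSIONS` set are all hypothetical, since the real feature set and language list are not given here. It shows one way author emails can feed a feature (a count) without the addresses themselves being kept.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import PurePosixPath

# Hypothetical extensions; the real list of supported languages is not stated here.
SUPPORTED_EXTENSIONS = {".py", ".java", ".js", ".ts", ".go"}


@dataclass
class FileGitData:
    """Raw Git data for one file, mirroring the list above (hypothetical type)."""
    path: str
    created_at: datetime
    modified_at: datetime
    author_emails: list
    commit_deltas: list  # lines changed per associated commit
    commit_hashes: list
    num_lines: int
    repo_active: bool = True
    third_party: bool = False


def is_eligible(data: FileGitData) -> bool:
    """Apply the three collection conditions to one file."""
    supported = PurePosixPath(data.path).suffix in SUPPORTED_EXTENSIONS
    return data.repo_active and not data.third_party and supported


def to_features(data: FileGitData, now: datetime) -> dict:
    """Derive numeric features; emails are only counted, never stored."""
    return {
        "file_age_days": (now - data.created_at).days,
        "days_since_last_change": (now - data.modified_at).days,
        "num_authors": len(set(data.author_emails)),
        "num_commits": len(set(data.commit_hashes)),
        "total_churn": sum(abs(delta) for delta in data.commit_deltas),
        "num_lines": data.num_lines,
    }


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
sample = FileGitData(
    path="src/auth/login.py",
    created_at=datetime(2024, 1, 1, tzinfo=timezone.utc),
    modified_at=datetime(2024, 1, 21, tzinfo=timezone.utc),
    author_emails=["a@example.com", "b@example.com", "a@example.com"],
    commit_deltas=[120, -15, 30],
    commit_hashes=["a1", "b2", "c3"],
    num_lines=135,
)

if is_eligible(sample):
    print(to_features(sample, now))
```

The sketch leaves out path encryption and the upload to S3; it covers only the filtering and feature-derivation steps.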

Training

The dataset stored in S3 is used to train several models using different combinations of features.
The best model is the one that predicts vulnerable files with the highest precision.
The selected model is then tuned further to try to improve its performance.
The resulting model is then stored in S3.
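
The selection step above can be illustrated with a small sketch. It assumes precision is computed in the usual way, as true positives divided by all positive predictions, and it swaps the real trained models for a toy rule (`toy_predict`) so the example stays self-contained. The feature names and validation rows are hypothetical, and the tuning step is not shown.

```python
from itertools import combinations


def precision(y_true, y_pred):
    """Fraction of files flagged as vulnerable that really are vulnerable."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical validation rows: features scaled to [0, 1] plus a label
# (1 = the file turned out to be vulnerable).
VALIDATION = [
    ({"churn": 0.9, "authors": 0.8, "age": 0.3}, 1),
    ({"churn": 0.7, "authors": 0.2, "age": 0.9}, 1),
    ({"churn": 0.4, "authors": 0.9, "age": 0.8}, 0),
    ({"churn": 0.1, "authors": 0.6, "age": 0.7}, 0),
    ({"churn": 0.2, "authors": 0.1, "age": 0.1}, 0),
]


def toy_predict(row, combo):
    """Stand-in for a trained model: flag the file when the mean of the
    selected features exceeds 0.5."""
    mean = sum(row[name] for name in combo) / len(combo)
    return 1 if mean > 0.5 else 0


def select_best(feature_names, rows):
    """Try every non-empty feature combination and keep the most precise one."""
    y_true = [label for _, label in rows]
    best_combo, best_score = None, -1.0
    for size in range(1, len(feature_names) + 1):
        for combo in combinations(feature_names, size):
            y_pred = [toy_predict(row, combo) for row, _ in rows]
            score = precision(y_true, y_pred)
            if score > best_score:  # ties keep the earlier, smaller combination
                best_combo, best_score = combo, score
    return best_combo, best_score


best_combo, best_score = select_best(["churn", "authors", "age"], VALIDATION)
print(best_combo, best_score)
```

In a real pipeline each combination would train its own model on the S3 dataset and be scored on held-out data; only the comparison logic carries over from this sketch.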

Execution

The model stored in S3 is used to run Sorts over each active client repository, providing a sorted list of files based on their probability of containing vulnerabilities.
The list for each group then goes through a final prioritization that takes into account the last time each file in that list was reviewed by a Fluid Attacks hacker.
The result is used to assign a priority value to all the files within each group.
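
How the predicted probability and the last review date are combined is not specified here, so the following sketch uses an assumed formula purely for illustration: the probability is scaled by how stale the last review is, capped at one year, and files that were never reviewed count as fully stale. The `priority` function, its `max_days` parameter, and the sample file paths are all hypothetical.

```python
from datetime import date
from typing import Optional


def priority(probability: float, last_reviewed: Optional[date],
             today: date, max_days: int = 365) -> int:
    """Assumed scoring: probability scaled by review staleness, on a 0-100 scale."""
    if last_reviewed is None:
        staleness = 1.0  # never reviewed: treat as fully stale
    else:
        days = (today - last_reviewed).days
        staleness = min(max(days, 0), max_days) / max_days
    return round(100 * probability * staleness)


today = date(2024, 6, 30)
# (path, predicted probability, last review date); illustrative values only
predictions = [
    ("src/login.py", 0.9, date(2024, 6, 20)),
    ("src/upload.py", 0.6, None),
    ("src/payments.py", 0.8, date(2023, 6, 30)),
]

ranked = sorted(
    ((path, priority(prob, reviewed, today)) for path, prob, reviewed in predictions),
    key=lambda item: item[1],
    reverse=True,
)
for path, value in ranked:
    print(path, value)
```

Under this assumed formula, a file with a high probability that was reviewed ten days ago ranks below a moderately risky file that has never been reviewed, which is the kind of reordering the final prioritization step is meant to produce.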

Data Security and Privacy

Amazon SageMaker

Amazon SageMaker protects customer data privacy by encrypting data at rest and in transit, supporting multi-factor authentication, and providing tools such as AWS Identity and Access Management (IAM) for precise access control when using the service. (SageMaker Data Protection)

Amazon S3

Amazon S3 protects data with automatic encryption at rest (SSE-S3) and SSL/TLS for data in transit. It also offers fine-grained access control via IAM, bucket policies, and ACLs, while the S3 Block Public Access feature prevents unintended exposure. (S3 Security and Access Management)

Contributing

Please read the contributing page first.

Development Environment

Configure your Development Environment.

If prompted for an AWS role, choose dev, and when prompted for a Development Environment, pick sorts.