Model Training

Sorts uses a machine learning model that outputs the probability of a file containing a vulnerability. To obtain this model, a training process runs weekly with a variety of training algorithms, and the best model produced is selected.

Training Algorithms

Training a model for our specific purposes requires an appropriate machine learning algorithm. There are four main types of machine learning algorithms:

Supervised learning
Unsupervised learning
Semi-supervised learning
Reinforcement learning

Supervised learning is the type used to train Sorts’ model. It trains the model on “labeled” data, and with this information the model can assign a label to new data that needs to be classified.

The supervised machine learning algorithms that have been used to train Sorts’ model are:

AdaBoost classifier
Gradient Boosting classifier
Histogram-based Gradient Boosting classifier
K-Nearest Neighbor Classifier
LightGBM classifier
Multi-layer Perceptron classifier
Random Forest classifier
eXtreme Gradient Boosting classifier

The labeled data that these algorithms receive consists of all the files that exist in Integrates: each of them has been labeled as “Safe” or “Vulnerable”, and each has git characteristics that can be used as input to the algorithms.
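To make the idea of learning from labeled data concrete, here is a minimal sketch of one of the listed algorithm families, a K-Nearest Neighbor classifier, written in plain Python. It is not Sorts’ implementation: the training examples, the two-number feature vectors and the function names are all hypothetical and only illustrate how “Safe” and “Vulnerable” labels let a model classify a new file and estimate a probability.

```python
from collections import Counter
from math import dist

# Hypothetical labeled examples: (feature vector, label).
# The two numbers per file are made up for illustration only.
TRAINING_DATA = [
    ((2.0, 1.0), "Safe"),
    ((3.0, 1.0), "Safe"),
    ((4.0, 2.0), "Safe"),
    ((40.0, 9.0), "Vulnerable"),
    ((55.0, 12.0), "Vulnerable"),
    ((38.0, 7.0), "Vulnerable"),
]


def nearest_neighbors(features, training_data, k):
    """Return the k labeled examples closest to the given feature vector."""
    return sorted(training_data, key=lambda item: dist(item[0], features))[:k]


def predict(features, training_data=TRAINING_DATA, k=3):
    """Label a new file by majority vote among its k nearest neighbors."""
    neighbors = nearest_neighbors(features, training_data, k)
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def vulnerable_probability(features, training_data=TRAINING_DATA, k=3):
    """Fraction of the k nearest neighbors that are labeled "Vulnerable"."""
    neighbors = nearest_neighbors(features, training_data, k)
    return sum(label == "Vulnerable" for _, label in neighbors) / k
```

Calling `predict((50.0, 10.0))` on this toy data looks at the three closest labeled files and returns the label most of them share.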

Feature Variables

To train Sorts’ model with these algorithms, we need to define the features of the labeled files. A feature is any measurable characteristic of the object being analyzed. In the case of Sorts, the features are the git characteristics of each file, which can be obtained from the git log of the repository where the file resides, plus the file extension, encoded in binary so that the algorithm receives a numeric value. The following features are extracted from the git characteristics of each file:

Number of commits (CM): The number of commits that have included this file ever since it was created
Number of unique authors (AU): The number of unique authors that have modified the file since it was created
File age (FA): The number of days since the creation of the file
Midnight commits (MC): The number of times that the file was modified between 12 AM and 6 AM
Risky commits (RC): The number of commits modifying the file that have had more than 200 deltas
Seldom contributors (SC): The number of authors that have modified the file but rarely contribute to the repository
Number of lines (LC): The total number of lines the file has
Busy file (BF): Whether the file has been modified by more than 9 authors since it was created
Commit frequency (CF): The number of commits divided by number of days since the file was created
Last commit days (CD): The number of days since the file was last modified
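As an illustration of how several of these features could be derived from commit history, here is a hedged sketch. The `Commit` record and the `file_features` function are hypothetical names, not part of Sorts. Seldom contributors (SC) is left out because the threshold for "rarely contributes" is not stated here, and the handling of files created on the same day is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Commit:
    """Hypothetical, simplified record of one commit that touched a file."""
    author: str
    date: datetime
    deltas: int  # changes in the commit, as counted for "risky commits"


def file_features(commits, today, line_count):
    """Compute several of the git features listed above for one file."""
    first = min(c.date for c in commits)
    last = max(c.date for c in commits)
    authors = {c.author for c in commits}
    file_age = (today - first).days
    return {
        "CM": len(commits),
        "AU": len(authors),
        "FA": file_age,
        # Assumption: the 6 AM boundary itself is excluded.
        "MC": sum(1 for c in commits if 0 <= c.date.hour < 6),
        "RC": sum(1 for c in commits if c.deltas > 200),
        "LC": line_count,
        "BF": len(authors) > 9,
        # Assumption: avoid dividing by zero for files created today.
        "CF": len(commits) / file_age if file_age else float(len(commits)),
        "CD": (today - last).days,
    }
```

In practice the commit records would be parsed from the repository's git log. Here they are passed in directly so the feature definitions stay easy to read.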

These 10 git characteristics, the encoded extension of the file and the label stating whether the file is vulnerable make up all the information used to train Sorts’ model. Sorts therefore does not use any Personally Identifiable Information (PII) of the users of Fluid Attacks’ platform.
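The exact scheme used to turn the file extension into a binary value is not described here. One common approach, shown below purely as an assumption, is to take the position of the extension in a fixed list and write that position as a fixed-width list of bits. The list `KNOWN_EXTENSIONS` and the function `encode_extension` are hypothetical.

```python
# Hypothetical, tiny list of known extensions; a real list would be longer.
KNOWN_EXTENSIONS = ["c", "java", "js", "py", "ts"]


def encode_extension(extension, known=KNOWN_EXTENSIONS):
    """Return the extension's 1-based position in `known` as a list of bits.

    Position 0 (all zeros) is reserved for extensions not in the list.
    """
    width = len(known).bit_length()
    position = known.index(extension) + 1 if extension in known else 0
    return [int(bit) for bit in format(position, f"0{width}b")]
```

Each bit can then be fed to the algorithm as a separate numeric column, which keeps the number of columns small even when the list of extensions is long.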

Training process

The training process for Sorts’ model is performed every week. To get the best model possible, multiple training algorithms are combined with different sets of features, following these steps:

The features of all the files present in Integrates are collected into a CSV file, which is then used as input for the algorithms.
A number of algorithms are selected randomly from the current pool of algorithms that are used for training.
The feature variables are not all used at once. Instead, a list of feature combinations is created, where each element is a unique set of 5 to 9 features. The resulting list contains 250 combinations.
Each of the selected algorithms takes the list of combinations and iterates through it, producing one model for each combination.
After this process there are 250 trained models for each algorithm, from which the one with the best F1 score is selected.
Then the tuning process is performed: 200 variations of the winning algorithm are produced by slightly modifying its parameters, and each variation is trained again.
Finally, the best of these tuned models is selected as Sorts’ new model.
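The combination and selection steps above can be sketched as follows. This is an illustration under stated assumptions, not Sorts’ code: the sketch assumes the combinations are drawn at random from the ten git features, and `feature_combinations`, `best_combination` and the `evaluate` callback are hypothetical names. The F1 score is computed from its standard definition as the harmonic mean of precision and recall.

```python
import random
from itertools import combinations

FEATURES = ["CM", "AU", "FA", "MC", "RC", "SC", "LC", "BF", "CF", "CD"]


def feature_combinations(features=FEATURES, count=250, seed=0):
    """Sample `count` distinct feature subsets, each with 5 to 9 features."""
    pool = [
        combo
        for size in range(5, 10)
        for combo in combinations(features, size)
    ]
    # Assumption: combinations are picked at random without repetition.
    return random.Random(seed).sample(pool, count)


def f1_score(true_positives, false_positives, false_negatives):
    """F1 = 2TP / (2TP + FP + FN), the harmonic mean of precision and recall."""
    denominator = 2 * true_positives + false_positives + false_negatives
    return 2 * true_positives / denominator if denominator else 0.0


def best_combination(combos, evaluate):
    """Return the combination whose model gets the highest F1 score.

    `evaluate` is a hypothetical callback that trains a model on the given
    features and returns (true_positives, false_positives, false_negatives).
    """
    return max(combos, key=lambda combo: f1_score(*evaluate(combo)))
```

The same `max` over F1 scores could be reused for the tuning step, with parameter variations in place of feature combinations.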

Scope definition

Sorts’ intended scope covers all files that have at least a small possibility of containing vulnerabilities. Many types of files fall outside this scope because vulnerabilities in them are exceptionally rare or impossible. To avoid using files of little relevance, Sorts keeps a defined list of file extensions and common file names, and only files that match it are analyzed or added to the training dataset.

Allowed extensions

The list of extensions and common file names allowed by Sorts is long, with over 700 items. It was built from the extensions of most programming languages, those used by development frameworks of all kinds, and the types of files where vulnerabilities are commonly found in Fluid Attacks’ platform. The list is defined in two files: one for the allowed extensions and one for the allowed common file names.
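A scope check of this kind could look like the sketch below. The two sets are short hypothetical excerpts, not the real lists, and `in_scope` is an illustrative name. Lowercasing the extension before comparing is also an assumption.

```python
from pathlib import PurePosixPath

# Hypothetical excerpts; the real lists hold over 700 items in total.
ALLOWED_EXTENSIONS = {"c", "java", "js", "py", "ts"}
ALLOWED_FILE_NAMES = {"Dockerfile", "Makefile"}


def in_scope(path):
    """Return True if the file's extension or name is on an allowed list."""
    file_path = PurePosixPath(path)
    extension = file_path.suffix.lstrip(".").lower()
    return extension in ALLOWED_EXTENSIONS or file_path.name in ALLOWED_FILE_NAMES
```

Checking the file name in addition to the extension lets files without an extension, such as a `Dockerfile`, stay in scope.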

Excluded extensions

Even though the list of allowed extensions is extensive, it is small compared to the thousands of extensions that exist. That is why Sorts uses a whitelist of the comparatively few relevant file extensions. Some examples of file types considered irrelevant for Sorts are:

Image files: .jpeg, .gif, .png, .psd, .tiff and more
Video/Audio files: .mp3, .mp4, .wav, .mkv, .flv and more
Document files: .docx, .pdf, .wpd, .xlsx and more

The file types above have a very low probability of containing vulnerabilities and are therefore excluded from Sorts’ scope. However, we are always looking to improve Sorts, not only by making it more accurate but also by adjusting its scope. If you have suggestions about file types that should be included in or excluded from Sorts’ scope, you can send them to us through the help channel.