Fluid Attacks' platform ETL
This ETL process (a.k.a. dynamo_etl) extracts data from the Fluid Attacks platform database (dynamo) and sends it to the warehouse (Snowflake).
Architecture
The ETL has two core procedures:
- Data-schema determination: where the schema of the data is inferred.
- Data refresh (the ETL): where all the data is updated from dynamo to the warehouse.
The ETL has four phases:
- Segment ETL (dynamo -> s3): where the ETL runs over a segment of the dynamo data and saves the result on s3.
- Preparation: where a pristine staging (a.k.a. loading) warehouse-schema is created to temporarily store the new data.
- Upload ETL (s3 -> warehouse): where the encoded s3 data is uploaded to the corresponding tables on the staging warehouse-schema.
- Replacement: where the staging schema becomes the new source of truth.
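As a rough illustration of how these phases relate, here is a minimal Python sketch that chains them sequentially; the run_* function names and the per-segment loop are placeholders for illustration, not the actual dynamo_etl API:

```python
# Illustrative sketch only: the run_* functions and the segment loop are
# hypothetical placeholders, not the real dynamo_etl modules.
def run_segment_etl(segment: int, total_segments: int) -> None:
    """Phase 1: scan one dynamo segment, transform it and save it on s3."""
    ...

def run_preparation() -> None:
    """Phase 2: create a pristine staging (loading) schema on the warehouse."""
    ...

def run_upload_etl() -> None:
    """Phase 3: load the csv files from s3 into the staging-schema tables."""
    ...

def run_replacement() -> None:
    """Phase 4: promote the staging schema to be the new source of truth."""
    ...

def run_pipeline(total_segments: int) -> None:
    # One segment ETL per dynamo segment, then the remaining phases in order
    for segment in range(total_segments):
        run_segment_etl(segment, total_segments)
    run_preparation()
    run_upload_etl()
    run_replacement()
```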
ETL phase details
- Segment ETL
  - Segment extraction: the data is extracted using a parallel scan over one specific segment.
  - Data transform: the data is adjusted using the auto-generated data-schemas.
  - S3 upload: the data is encoded into csv files (one for each data-schema) and uploaded into the observes.etl-data bucket.
Data is uploaded first to s3 and then to the warehouse for performance reasons: loading from s3 is more efficient than direct upload queries.
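As an illustration of this phase, here is a hedged Python sketch using boto3; the table name, csv columns and s3 key layout are assumptions, and the real transform is driven by the auto-generated data-schemas rather than the naive projection shown:

```python
# Sketch only: the table name, csv columns and key layout are assumptions;
# only the parallel-scan segment, the csv encoding and the
# observes.etl-data bucket come from the docs.
import csv
import io

import boto3

def segment_etl(segment: int, total_segments: int) -> None:
    table = boto3.resource("dynamodb").Table("integrates_vms")  # hypothetical name
    s3 = boto3.client("s3")

    # Segment extraction: parallel scan over one specific segment
    items = []
    kwargs = {"Segment": segment, "TotalSegments": total_segments}
    while True:
        page = table.scan(**kwargs)
        items.extend(page["Items"])
        if "LastEvaluatedKey" not in page:
            break
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

    # Data transform + csv encoding: naive projection; the real process
    # adjusts each item against its auto-generated data-schema
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    for item in items:
        writer.writerow([item.get("pk"), item.get("sk")])  # hypothetical columns

    # S3 upload: one csv file per data-schema on the observes.etl-data bucket
    s3.put_object(
        Bucket="observes.etl-data",
        Key=f"dynamo_etl/segment_{segment}.csv",  # hypothetical key layout
        Body=buffer.getvalue().encode("utf-8"),
    )
```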
- Upload ETL: where the csv files staged on s3 are copied into the corresponding tables of the staging warehouse-schema (a hedged sketch of this step appears after this list).
- Data-schema determination: this process infers data-schemas from the raw data and stores them in the observes.cache s3 bucket, which serves as a cache. It is triggered by a schedule and runs once a week.
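The idea behind the schema cache can be sketched as follows: infer a coarse schema from a sample of raw items and store it as JSON on the observes.cache bucket. The schema representation and key layout are assumptions; only the bucket and the cache-on-s3 idea come from the docs.

```python
# Sketch only: the schema representation and key layout are assumptions;
# only the observes.cache bucket comes from the docs.
import json
from typing import Any

import boto3

def infer_schema(items: list[dict[str, Any]]) -> dict[str, str]:
    # Map every attribute seen in the sample to a coarse type name
    schema: dict[str, str] = {}
    for item in items:
        for key, value in item.items():
            schema[key] = type(value).__name__
    return schema

def cache_schema(entity: str, items: list[dict[str, Any]]) -> None:
    boto3.client("s3").put_object(
        Bucket="observes.cache",
        Key=f"dynamo_etl/schemas/{entity}.json",  # hypothetical key layout
        Body=json.dumps(infer_schema(items)).encode("utf-8"),
    )
```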
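For the Upload ETL, a hedged sketch with the Snowflake Python connector is shown below; the connection parameters and the schema, table and stage names are assumptions, and only the overall flow (bulk COPY of the staged s3 csv files instead of row-by-row inserts) reflects the docs:

```python
# Sketch only: connection parameters and schema/table/stage names are
# assumptions; the real ETL derives the target tables from the data-schemas.
import snowflake.connector

def upload_segment_files() -> None:
    conn = snowflake.connector.connect(
        account="...", user="...", password="...",  # credentials elided
        warehouse="...", database="...",
    )
    cursor = conn.cursor()
    try:
        # Preparation already created a pristine staging (loading) schema
        cursor.execute("USE SCHEMA dynamo_loading")  # hypothetical schema name
        # Bulk COPY from s3 is much cheaper than one INSERT per row, which is
        # why the csv files are staged on s3 first
        cursor.execute(
            """
            COPY INTO some_entity             -- hypothetical target table
            FROM @etl_data_stage/dynamo_etl/  -- external stage over the s3 bucket
            FILE_FORMAT = (TYPE = CSV)
            """
        )
    finally:
        cursor.close()
        conn.close()
```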
Remote and local execution
The main interface for execution is m . /observes/etl/dynamo/bin, which from now on we call dynamo-etl for simplicity. With it you can trigger the execution of a phase locally (on the current machine) or remotely (on AWS Batch):
- locally, with dynamo-etl local <phase-X>
- remotely, with dynamo-etl remote <phase-X>
where <phase-X> is a placeholder. For more information use the --help flag.
The pipeline
The ETL pipeline consists of a chain of AWS Batch jobs that are coupled, i.e. when one phase ends the next one starts.
The pipeline is triggered by a schedule.
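A minimal Python sketch of such a chain using boto3 and AWS Batch job dependencies follows; the queue, job definition and job names are assumptions, and the actual pipeline may wire its jobs differently:

```python
# Sketch only: queue, job definition and job names are assumptions; the
# chaining idea (a phase starts when the previous one ends) comes from the docs.
import boto3

def submit_chained_pipeline() -> None:
    batch = boto3.client("batch")
    previous_job_id = None
    for phase in ("segment-etl", "preparation", "upload-etl", "replacement"):
        depends_on = [{"jobId": previous_job_id}] if previous_job_id else []
        job = batch.submit_job(
            jobName=f"dynamo-etl-{phase}",  # hypothetical job name
            jobQueue="observes",            # hypothetical queue
            jobDefinition="dynamo-etl",     # hypothetical job definition
            dependsOn=depends_on,           # start only after the previous phase
        )
        previous_job_id = job["jobId"]
```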