Fluid Attacks' platform ETL
This ETL process (a.k.a. dynamo_etl) extracts data from the Fluid Attacks platform database (dynamo) and sends it to the warehouse (Snowflake).
Architecture
The ETL has two core procedures:
- Data-schema determination: where the schema of the data is inferred.
- Data refresh (the ETL): where all the data is updated from dynamo to the warehouse.
The ETL has four phases:
- Segment ETL (dynamo -> s3): where the ETL runs over a segment of the dynamo data and saves the result on s3.
- Preparation: where a pristine staging (a.k.a. loading) warehouse-schema is created to temporarily store the new data.
- Upload ETL (s3 -> warehouse): where the encoded s3 data is uploaded to the corresponding tables on the staging warehouse-schema.
- Replacement: where the staging schema becomes the new source of truth.
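As a rough illustration of how these phases relate, here is a minimal Python sketch that chains them sequentially; the run_* function names and the per-segment loop are placeholders for illustration, not the actual dynamo_etl API:

```python
# Illustrative sketch only: the run_* functions and the segment loop are
# hypothetical placeholders, not the real dynamo_etl modules.
def run_segment_etl(segment: int, total_segments: int) -> None:
    """Phase 1: scan one dynamo segment, transform it and save it on s3."""
    ...

def run_preparation() -> None:
    """Phase 2: create a pristine staging (loading) schema on the warehouse."""
    ...

def run_upload_etl() -> None:
    """Phase 3: load the csv files from s3 into the staging-schema tables."""
    ...

def run_replacement() -> None:
    """Phase 4: promote the staging schema to be the new source of truth."""
    ...

def run_pipeline(total_segments: int) -> None:
    # One segment ETL per dynamo segment, then the remaining phases in order
    for segment in range(total_segments):
        run_segment_etl(segment, total_segments)
    run_preparation()
    run_upload_etl()
    run_replacement()
```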
ETL phase details
- Segment ETL
  - Segment extraction: the data is extracted using a parallel scan over one specific segment.
  - Data transform: the data is adjusted using the auto-generated data-schemas.
  - S3 upload: the data is encoded into csv files (one for each data-schema) and uploaded into the observes.etl-data bucket.
Data is uploaded first to s3 and then to the warehouse for performance reasons: loading from s3 is more efficient than direct upload queries.
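As an illustration of this phase, here is a hedged Python sketch using boto3; the table name, csv columns and s3 key layout are assumptions, and the real transform is driven by the auto-generated data-schemas rather than the naive projection shown:

```python
# Sketch only: the table name, csv columns and key layout are assumptions;
# only the parallel-scan segment, the csv encoding and the
# observes.etl-data bucket come from the docs.
import csv
import io

import boto3

def segment_etl(segment: int, total_segments: int) -> None:
    table = boto3.resource("dynamodb").Table("integrates_vms")  # hypothetical name
    s3 = boto3.client("s3")

    # Segment extraction: parallel scan over one specific segment
    items = []
    kwargs = {"Segment": segment, "TotalSegments": total_segments}
    while True:
        page = table.scan(**kwargs)
        items.extend(page["Items"])
        if "LastEvaluatedKey" not in page:
            break
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

    # Data transform + csv encoding: naive projection; the real process
    # adjusts each item against its auto-generated data-schema
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    for item in items:
        writer.writerow([item.get("pk"), item.get("sk")])  # hypothetical columns

    # S3 upload: one csv file per data-schema on the observes.etl-data bucket
    s3.put_object(
        Bucket="observes.etl-data",
        Key=f"dynamo_etl/segment_{segment}.csv",  # hypothetical key layout
        Body=buffer.getvalue().encode("utf-8"),
    )
```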
- Upload ETL: where the csv files staged on s3 are copied into the corresponding tables of the staging warehouse-schema (a hedged sketch of this step appears after this list).
- Data-schema determination: this process infers data-schemas from the raw data and stores them in the observes.cache s3 bucket, which serves as a cache. It is triggered by a schedule and runs once a week.
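The idea behind the schema cache can be sketched as follows: infer a coarse schema from a sample of raw items and store it as JSON on the observes.cache bucket. The schema representation and key layout are assumptions; only the bucket and the cache-on-s3 idea come from the docs.

```python
# Sketch only: the schema representation and key layout are assumptions;
# only the observes.cache bucket comes from the docs.
import json
from typing import Any

import boto3

def infer_schema(items: list[dict[str, Any]]) -> dict[str, str]:
    # Map every attribute seen in the sample to a coarse type name
    schema: dict[str, str] = {}
    for item in items:
        for key, value in item.items():
            schema[key] = type(value).__name__
    return schema

def cache_schema(entity: str, items: list[dict[str, Any]]) -> None:
    boto3.client("s3").put_object(
        Bucket="observes.cache",
        Key=f"dynamo_etl/schemas/{entity}.json",  # hypothetical key layout
        Body=json.dumps(infer_schema(items)).encode("utf-8"),
    )
```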
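For the Upload ETL, a hedged sketch with the Snowflake Python connector is shown below; the connection parameters and the schema, table and stage names are assumptions, and only the overall flow (bulk COPY of the staged s3 csv files instead of row-by-row inserts) reflects the docs:

```python
# Sketch only: connection parameters and schema/table/stage names are
# assumptions; the real ETL derives the target tables from the data-schemas.
import snowflake.connector

def upload_segment_files() -> None:
    conn = snowflake.connector.connect(
        account="...", user="...", password="...",  # credentials elided
        warehouse="...", database="...",
    )
    cursor = conn.cursor()
    try:
        # Preparation already created a pristine staging (loading) schema
        cursor.execute("USE SCHEMA dynamo_loading")  # hypothetical schema name
        # Bulk COPY from s3 is much cheaper than one INSERT per row, which is
        # why the csv files are staged on s3 first
        cursor.execute(
            """
            COPY INTO some_entity             -- hypothetical target table
            FROM @etl_data_stage/dynamo_etl/  -- external stage over the s3 bucket
            FILE_FORMAT = (TYPE = CSV)
            """
        )
    finally:
        cursor.close()
        conn.close()
```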
Remote and local execution
The main interface for execution is m . /observes/etl/dynamo/bin, which from now on we call dynamo-etl for simplicity. With it you can trigger the execution of a phase locally (on the current machine) or remotely (on AWS Batch):
- locally, with dynamo-etl local <phase-X>
- remotely, with dynamo-etl remote <phase-X>
where <phase-X> is a placeholder. For more information use the --help flag.
The pipeline
The ETL pipeline consists of a chain of AWS Batch jobs that are coupled, i.e. when one phase ends the next one starts.
The pipeline is triggered by a schedule.
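A minimal Python sketch of such a chain using boto3 and AWS Batch job dependencies follows; the queue, job definition and job names are assumptions, and the actual pipeline may wire its jobs differently:

```python
# Sketch only: queue, job definition and job names are assumptions; the
# chaining idea (a phase starts when the previous one ends) comes from the docs.
import boto3

def submit_chained_pipeline() -> None:
    batch = boto3.client("batch")
    previous_job_id = None
    for phase in ("segment-etl", "preparation", "upload-etl", "replacement"):
        depends_on = [{"jobId": previous_job_id}] if previous_job_id else []
        job = batch.submit_job(
            jobName=f"dynamo-etl-{phase}",  # hypothetical job name
            jobQueue="observes",            # hypothetical queue
            jobDefinition="dynamo-etl",     # hypothetical job definition
            dependsOn=depends_on,           # start only after the previous phase
        )
        previous_job_id = job["jobId"]
```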