Fluid Attacks' platform ETL
This ETL process (a.k.a. dynamo_etl) extracts data from Fluid Attacks' platform database (dynamo) and sends it to the warehouse (redshift).
Architecture
The ETL has two core procedures:
- Data-schema determination: where the schema of the data is inferred.
- Data refresh (the ETL): where all data is updated from dynamodb to redshift.
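As a rough sketch, the two procedures can be pictured as follows; the function names and types are hypothetical, chosen only for illustration:

```python
# Sketch only: how the two core procedures relate (hypothetical names, simplified).
Schema = dict[str, str]  # column name -> inferred column type


def determine_schemas() -> dict[str, Schema]:
    """Infer one data-schema per entity from the raw dynamo data."""
    ...


def data_refresh(schemas: dict[str, Schema]) -> None:
    """Update all data from dynamo to redshift, driven by the inferred schemas."""
    ...


def main() -> None:
    schemas = determine_schemas()  # in practice read from a weekly-refreshed cache
    data_refresh(schemas)
```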
The ETL has four phases:
- Segment ETL (dynamo -> s3): where the ETL is executed over a segment of the dynamo data and the result is saved on s3.
- Preparation: where a pristine staging (a.k.a. loading) redshift-schema is created as temporary storage for the new data.
- Upload ETL (s3 -> redshift): where the codified s3 data is uploaded to the corresponding tables on the staging redshift-schema.
- Replacement: where the staging schema becomes the new source of truth.
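The four phases chain together roughly as follows; this is a hedged sketch with hypothetical function names and segment count, not the actual implementation:

```python
# Rough sketch of the four-phase data refresh (hypothetical names, simplified).
TOTAL_SEGMENTS = 100  # assumption: the real segment count may differ


def segment_etl(segment: int) -> None:
    """Phase 1: scan one dynamo segment, transform it and save csv files on s3."""


def preparation() -> None:
    """Phase 2: create a pristine staging (loading) redshift-schema."""


def upload_etl() -> None:
    """Phase 3: load the csv files from s3 into the staging tables."""


def replacement() -> None:
    """Phase 4: promote the staging schema to be the new source of truth."""


def data_refresh() -> None:
    for segment in range(TOTAL_SEGMENTS):
        segment_etl(segment)
    preparation()
    upload_etl()
    replacement()
```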
ETL phases details
Segment ETL

- Segment extraction: the data is extracted using a parallel scan over one specific segment.
- Data transform: using the auto-generated data-schemas, the data is adjusted to the expected shape.
- S3 upload: the data is transformed into csv files (one for each data-schema) and uploaded into the observes.etl-data bucket.

Note: data is uploaded first to s3 and then to redshift for performance reasons; the custom redshift load query from s3 is more efficient than direct upload (insert) queries.
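A minimal sketch of this phase using boto3 is shown below; the table name and s3 key layout are assumptions, and the transform is reduced to stringifying values into a single csv:

```python
# Hedged sketch of one segment ETL run (illustrative names; not the real code).
import csv
import tempfile

import boto3

TABLE = "platform-data"       # assumption: the real dynamo table name differs
BUCKET = "observes.etl-data"  # bucket named in this doc; key layout below is illustrative


def segment_etl(segment: int, total_segments: int) -> None:
    table = boto3.resource("dynamodb").Table(TABLE)

    # Segment extraction: parallel scan over one specific segment.
    items, start_key = [], None
    while True:
        kwargs = {"Segment": segment, "TotalSegments": total_segments}
        if start_key:
            kwargs["ExclusiveStartKey"] = start_key
        response = table.scan(**kwargs)
        items.extend(response["Items"])
        start_key = response.get("LastEvaluatedKey")
        if not start_key:
            break

    # Data transform: adjust items to a tabular shape (here: one csv for everything;
    # the real process emits one csv per data-schema).
    columns = sorted({key for item in items for key in item})
    rows = [[str(item.get(column, "")) for column in columns] for item in items]

    # S3 upload: write the csv file and upload it to the etl-data bucket.
    with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as file:
        writer = csv.writer(file)
        writer.writerow(columns)
        writer.writerows(rows)
    boto3.client("s3").upload_file(file.name, BUCKET, f"segments/{segment}/data.csv")
```

During the Upload ETL phase these csv files are then loaded into the staging tables, presumably via redshift's COPY command (the load-from-s3 query referred to in the note above), which is much faster than row-by-row inserts.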
Upload ETL

The codified csv files on s3 are loaded into the corresponding tables of the staging redshift-schema.

Data-schema determination

This process infers the data-schemas from the raw data and stores them in the observes.cache s3 bucket, which serves as a cache.
It is triggered by a schedule and runs once a week.
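A minimal sketch of the inference and caching step, assuming a JSON representation of the schemas and a hypothetical key layout inside the observes.cache bucket:

```python
# Hedged sketch of data-schema determination and its s3 cache (simplified).
import json

import boto3

CACHE_BUCKET = "observes.cache"  # bucket named in this doc; the key layout is an assumption


def infer_schema(items: list[dict]) -> dict[str, str]:
    """Infer a column -> type mapping from a sample of raw dynamo items."""
    schema: dict[str, str] = {}
    for item in items:
        for key, value in item.items():
            schema.setdefault(key, type(value).__name__)
    return schema


def cache_schema(entity: str, schema: dict[str, str]) -> None:
    """Store a determined data-schema on s3 so later ETL runs can reuse it."""
    boto3.client("s3").put_object(
        Bucket=CACHE_BUCKET,
        Key=f"schemas/{entity}.json",  # hypothetical key layout
        Body=json.dumps(schema).encode("utf-8"),
    )
```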
Remote and local execution
The main interface for execution is through m . /observes/etl/dynamo/bin, which from now on we call dynamo-etl for simplicity. With it you can trigger the execution of a phase locally (on the current machine) or remotely (on aws batch):

- locally with dynamo-etl local <phase-X>
- remotely with dynamo-etl remote <phase-X>

Here <phase-X> is a placeholder for the phase to run. For more information use the --help flag.
The pipeline
The ETL pipeline consists of a chain of aws-batch jobs that are coupled, i.e. when one phase ends, the next one starts.
The pipeline is triggered by a schedule.
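One way to express this coupling is with aws-batch job dependencies, sketched here using boto3; the queue, job definition and phase names are assumptions, not the real configuration:

```python
# Hedged sketch of the pipeline as chained aws-batch jobs (hypothetical names).
import boto3

PHASES = ["segment-etl", "preparation", "upload-etl", "replacement"]  # illustrative
QUEUE = "observes-batch-queue"  # assumption: real queue name may differ
JOB_DEFINITION = "dynamo-etl"   # assumption: real job definition may differ


def submit_pipeline() -> None:
    batch = boto3.client("batch")
    previous_job_id = None
    for phase in PHASES:
        # dependsOn makes each phase start only after the previous one ends.
        depends_on = [{"jobId": previous_job_id}] if previous_job_id else []
        response = batch.submit_job(
            jobName=f"dynamo-etl-{phase}",
            jobQueue=QUEUE,
            jobDefinition=JOB_DEFINITION,
            dependsOn=depends_on,
        )
        previous_job_id = response["jobId"]
```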