Fluid Attacks' platform ETL

This ETL process (a.k.a. dynamo_etl) extracts data from the Fluid Attacks platform database (dynamo) and sends it to the warehouse (redshift).

Architecture

The ETL has two core procedures:

  • Data-schema determination

    where the schema of the data is inferred.

  • Data refresh (the ETL)

    where all data is updated from dynamo to redshift.

The ETL has four phases:

  1. Segment ETL (dynamo -> s3)

    where the ETL is executed over a segment of the dynamo data and the result is saved to s3.

  2. Preparation

    where a pristine staging (a.k.a. loading) redshift-schema is created for temporary storage of the new data (sketched after this list).

  3. Upload ETL (s3 -> redshift)

    where the encoded s3 data is uploaded to the corresponding tables in the staging redshift-schema.

  4. Replacement

    where the staging schema becomes the new source of truth.
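
Phases 2 to 4 run against redshift. The following is a minimal sketch of them, assuming psycopg2 (redshift speaks the postgres wire protocol); the schema names, tables, s3 prefixes and IAM role are hypothetical, and creating the staging tables from the data-schemas is omitted.

```python
# Minimal sketch of phases 2-4, assuming psycopg2. Schema names, tables,
# s3 prefixes and the IAM role are hypothetical; creating the staging
# tables from the cached data-schemas is omitted.
import psycopg2

STAGING = "dynamo_staging"  # hypothetical staging (loading) schema
TARGET = "dynamo"           # hypothetical source-of-truth schema


def refresh(dsn: str, tables: dict[str, str], iam_role: str) -> None:
    """tables maps each table name to the s3 prefix holding its csv files."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        # 2. Preparation: a pristine staging schema for the new data
        cur.execute(f"DROP SCHEMA IF EXISTS {STAGING} CASCADE")
        cur.execute(f"CREATE SCHEMA {STAGING}")
        # (create the staging tables from the cached data-schemas here)
        # 3. Upload ETL: bulk-load the csv files from s3 into staging
        for table, prefix in tables.items():
            cur.execute(
                f"COPY {STAGING}.{table} "
                f"FROM 's3://observes.etl-data/{prefix}' "
                f"IAM_ROLE '{iam_role}' CSV"
            )
        # 4. Replacement: staging becomes the new source of truth
        cur.execute(f"DROP SCHEMA IF EXISTS {TARGET} CASCADE")
        cur.execute(f"ALTER SCHEMA {STAGING} RENAME TO {TARGET}")
```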

ETL phase details

  1. Segment ETL

    • Segment extraction: The data is extracted using a parallel scan over one specific segment.
    • Data transform: The data is adjusted to conform to the auto-generated data-schemas.
    • S3 upload: The data is written to csv files (one for each data-schema) and uploaded to the observes.etl-data bucket (see the sketch below).

Data is uploaded to s3 first and then to redshift for performance reasons: redshift's bulk load from s3 (COPY) is more efficient than direct insert queries.
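
The following is a minimal sketch of the segment ETL for a single segment, assuming boto3; the csv key and the flattening are hypothetical, and the real transform is driven by the auto-generated data-schemas (one csv per schema).

```python
# Minimal sketch of phase 1 for a single segment, assuming boto3. The csv
# key and the flattening are hypothetical; the real transform is driven by
# the auto-generated data-schemas.
import csv
import io

import boto3

dynamo = boto3.client("dynamodb")
s3 = boto3.client("s3")


def segment_etl(table: str, segment: int, total_segments: int) -> None:
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    kwargs = {"TableName": table, "Segment": segment, "TotalSegments": total_segments}
    while True:
        # Segment extraction: parallel scan over one specific segment
        page = dynamo.scan(**kwargs)
        for item in page["Items"]:
            # Data transform: unwrap the dynamo type tags (schema-driven in the real ETL)
            writer.writerow([next(iter(value.values())) for value in item.values()])
        if "LastEvaluatedKey" not in page:
            break
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
    # S3 upload: store the csv in the observes.etl-data bucket
    s3.put_object(
        Bucket="observes.etl-data",
        Key=f"{table}/{segment}.csv",
        Body=buffer.getvalue().encode("utf-8"),
    )
```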

Upload ETL

  1. Data-schema determination

This process infers the data-schemas from raw data and stores them in the observes.cache s3 bucket, which serves as a cache.

This process is triggered by a schedule and runs once a week.
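
A minimal sketch of the inference and caching, assuming boto3; the type mapping and the cache key layout are hypothetical simplifications of the real inference.

```python
# Minimal sketch of the data-schema determination, assuming boto3. The type
# mapping and the cache key layout are hypothetical simplifications.
import json

import boto3

s3 = boto3.client("s3")

# dynamo attribute type -> redshift column type (hypothetical mapping)
TYPE_MAP = {"S": "VARCHAR", "N": "DECIMAL", "BOOL": "BOOLEAN"}


def determine_schema(table: str, raw_items: list[dict]) -> dict[str, str]:
    schema: dict[str, str] = {}
    for item in raw_items:
        for attribute, typed_value in item.items():
            dynamo_type = next(iter(typed_value))  # e.g. "S", "N", "BOOL"
            schema.setdefault(attribute, TYPE_MAP.get(dynamo_type, "SUPER"))
    # Cache the inferred data-schema so later runs can reuse it
    s3.put_object(
        Bucket="observes.cache",
        Key=f"dynamo_etl/schemas/{table}.json",
        Body=json.dumps(schema).encode("utf-8"),
    )
    return schema
```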

Remote and local execution

The main interface for execution is m . /observes/etl/dynamo/bin; from now on we call it dynamo-etl for simplicity. With it you can trigger the execution of a phase locally (on the current machine) or remotely (on aws batch):

  • locally with dynamo-etl local <phase-X>
  • remotely with dynamo-etl remote <phase-X>

Where <phase-X> is a placeholder for the phase to run. For more information, use the --help flag.

The pipeline

The ETL pipeline consists of a chain of coupled aws-batch jobs, i.e. when one phase ends, the next one starts.

The pipeline is triggered by a schedule.
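
A minimal sketch of that coupling, assuming boto3; the job names, queue, job definition and container command are hypothetical. Each job declares a dependency on the previous one, so it starts only when the previous phase ends.

```python
# Minimal sketch of the chained aws-batch jobs, assuming boto3. The job
# names, queue, job definition and command are hypothetical.
import boto3

batch = boto3.client("batch")


def submit_pipeline(phases: list[str], queue: str, job_definition: str) -> None:
    previous_job_id = None
    for phase in phases:
        job = batch.submit_job(
            jobName=f"dynamo-etl-{phase}",
            jobQueue=queue,
            jobDefinition=job_definition,
            # each job waits for the previous phase to end
            dependsOn=[{"jobId": previous_job_id}] if previous_job_id else [],
            containerOverrides={"command": ["dynamo-etl", "local", phase]},
        )
        previous_job_id = job["jobId"]
```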