Introduction

Observes is the product responsible for company-wide data analytics.

Observes follows the Data Warehouse architecture: most of its work consists of extracting data from different sources, transforming it into a relational model, and loading the results into a Data Warehouse. This process is usually known as ETL (Extract, Transform, Load). Once the data is in the Warehouse, it can be used to create the dashboards and infographics that end users consume.

Observes also provides a few services outside of the Data Warehouse architecture, for example generating billing information and stopping stuck GitLab jobs.

Public Oath

  1. Data in the Warehouse is consistent, correct, and reasonably up-to-date.
  2. When deciding between correctness and speed, correctness will be given priority.

Architecture

[Figure: Architecture diagram]

  1. Observes declares all its ETLs using Python.
  2. It declares its own infrastructure using Terraform.
  3. Sensitive secrets like Cloudflare authentication tokens are stored in encrypted YAML files using Mozilla SOPS.
  4. The Data Warehouse is a Snowflake cluster.
  5. AWS QuickSight is the solution we use for Business Intelligence (BI).
  6. ETL tasks are scheduled using the Compute component of Common.
  7. ETLs fetch data from their corresponding service and transport it to a target location, usually the warehouse.

Generic ETL

  1. Fetch: The ETL fetches data from the source, e.g. an API endpoint.

  2. Get schema: The ETL then determines the schema (the metadata describing the data). This can be done in two ways:

    • typings hardcoded in the SDK.
    • automatic determination through analysis of the raw data.

    Both methods have pros and cons.

  3. Target transform: Once the data is correctly typed, it is transformed into the types the target expects. The most common transform is flattening.

    • Flattening: nested or complex fields (e.g. lists, dictionaries) are mapped into flat collections of fields that are primitive as defined by the target. For example, for Snowflake, lists and dictionaries are mapped into separate tables.
  4. Load: Once the data has been transformed into the data types that the target expects, load commands are triggered to save it. For example, in Snowflake these correspond to SQL queries.
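
The schema and flattening steps above can be illustrated with a minimal sketch. It is not the Observes implementation: the function names, the `_id` and `_parent_id` columns, and the `parent__field` naming of child tables are all hypothetical, and only one level of nesting is handled.

```python
from typing import Any

Record = dict[str, Any]


def infer_schema(records: list[Record]) -> dict[str, str]:
    """Auto-determine a schema (field name -> type name) from raw records."""
    schema: dict[str, str] = {}
    for record in records:
        for field, value in record.items():
            found = type(value).__name__
            # Assumption: a field seen with conflicting types falls back to "str".
            if schema.setdefault(field, found) != found:
                schema[field] = "str"
    return schema


def flatten(table: str, records: list[Record]) -> dict[str, list[Record]]:
    """Map lists and dictionaries into separate flat tables (one level deep)."""
    tables: dict[str, list[Record]] = {table: []}
    for parent_id, record in enumerate(records):
        row: Record = {"_id": parent_id}  # hypothetical surrogate key
        for field, value in record.items():
            child = f"{table}__{field}"  # hypothetical child table name
            if isinstance(value, list):
                # Each list item becomes one row in the child table.
                rows = [{"_parent_id": parent_id, "value": item} for item in value]
                tables.setdefault(child, []).extend(rows)
            elif isinstance(value, dict):
                # The dictionary becomes a single row in the child table.
                tables.setdefault(child, []).append({"_parent_id": parent_id, **value})
            else:
                row[field] = value
        tables[table].append(row)
    return tables
```

Calling `flatten("issues", records)` on records that contain a `tags` list and an `owner` dictionary would produce three tables: `issues`, `issues__tags`, and `issues__owner`, each holding only primitive values.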

Particular Cases

For almost all ETLs:

  • target is set to Snowflake.
  • source is an API endpoint exposed on the internet.
  • schema is hardcoded in the SDK or automatically determined on each trigger.
  • all data is erased and re-uploaded into the target.
  • the ETL job triggers the full ETL procedure from start to finish.
  • the Compute component of Common triggers the ETL.
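
The "erase and re-upload" behavior can be sketched as follows. This is only an illustration: it uses Python's built-in `sqlite3` module as a stand-in for the warehouse, and the function name and table layout are hypothetical rather than taken from Observes.

```python
import sqlite3
from collections.abc import Iterable, Sequence
from typing import Any


def full_refresh(
    conn: sqlite3.Connection,
    table: str,
    columns: Sequence[str],
    rows: Iterable[Sequence[Any]],
) -> None:
    """Erase every row in the target table and re-upload the fresh data."""
    # Table and column names are interpolated directly, so they must be trusted.
    column_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({column_list})")
    # The connection context manager commits on success and rolls back on
    # error, so readers never see a half-loaded table.
    with conn:
        conn.execute(f"DELETE FROM {table}")
        conn.executemany(
            f"INSERT INTO {table} ({column_list}) VALUES ({placeholders})",
            rows,
        )
```

Running `full_refresh` twice against the same table leaves only the rows from the second run, which is the trade-off of this strategy: it is simple and self-correcting, but every run pays the cost of the full upload.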

A few exceptions exist, though:

  • S3 bucket target: these ETLs emit data into a bucket rather than into Snowflake. For example:

    • /observes/etl/timedoctor/backup emits into observes.etl-data/backup_timedoctor
    • /observes/etl/code/compute-bills emits into integrates/continuous-data/bills
    • dynamo_etl emits first into observes.etl-data/dynamodb and later into Snowflake.
  • Differential ETLs: these ETLs do not erase and re-upload all data. Instead, they store the current streaming state in the observes.state bucket, e.g. gitlab_etl and checkly_etl.

  • Cached schema: these ETLs have neither a hardcoded schema nor automatic determination. They read a schema previously saved in the observes.cache bucket, e.g. dynamo_etl (a separate job auto-determines the schema and saves it to the bucket).

  • Fractioned ETL phases:

    • Zoho ETL: has two jobs that separate the ETL phases, i.e. /observes/etl/zoho-crm/fluid/prepare and /observes/etl/zoho-crm/fluid
    • Dynamo ETL: the start phase is split into multiple concurrent jobs, but the final phase is triggered manually.
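
The differential approach can be sketched under some assumptions: the saved state is a single "last processed id", the source records carry a monotonically increasing `id`, and a plain mapping stands in for the state bucket. The `STATE_KEY` value and the function name are hypothetical, and the real ETLs may track state differently.

```python
import json
from collections.abc import MutableMapping
from typing import Any

STATE_KEY = "example_etl/state.json"  # hypothetical object key in the state bucket


def differential_run(
    source: list[dict[str, Any]],
    bucket: MutableMapping[str, str],
) -> list[dict[str, Any]]:
    """Return only the records added since the last run and update the state."""
    raw = bucket.get(STATE_KEY)
    # With no saved state, everything is considered new.
    last_id = json.loads(raw)["last_id"] if raw is not None else 0
    new_records = [record for record in source if record["id"] > last_id]
    if new_records:
        # A real ETL would save the state only after the load succeeded.
        newest = max(record["id"] for record in new_records)
        bucket[STATE_KEY] = json.dumps({"last_id": newest})
    return new_records
```

A second call with the same source returns an empty list, because the saved state already covers every record; only records appended afterwards are emitted.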

Jobs

  • /observes/job/batch-stability: monitors the Compute component of Common.
  • /observes/job/cancel-ci-jobs: cancels old CI jobs on GitLab that got stuck.
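
As an illustration of what "old jobs that got stuck" could mean, the sketch below selects jobs that are still pending or running after a maximum age. Everything here is an assumption: the 24-hour threshold, the record fields, and the function name are hypothetical, and the actual job may use different criteria. Cancelling the selected jobs through the GitLab API is left out.

```python
from datetime import datetime, timedelta
from typing import Any


def select_stuck_jobs(
    jobs: list[dict[str, Any]],
    now: datetime,
    max_age: timedelta = timedelta(hours=24),  # hypothetical threshold
) -> list[int]:
    """Return the ids of unfinished jobs that are older than max_age."""
    unfinished = {"pending", "running"}
    return [
        job["id"]
        for job in jobs
        if job["status"] in unfinished and now - job["created_at"] > max_age
    ]
```

Keeping the selection as a pure function makes the criteria easy to test without calling GitLab.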

Contributing

Please read the contributing page first.