Introduction
Observes is the product responsible for company-wide data analytics.
Observes follows the Data Warehouse architecture, which means that most of what it does is Extracting data from different sources, Transforming it into a relational model, and Loading the results into a Data Warehouse. This process is usually known as an ETL. Once the data is in the Warehouse, it can be consumed to create dashboards and infographics for End Users.
Observes also provides a few services outside of the Data Warehouse architecture, for example, generating billing information and stopping stuck GitLab jobs.
Public Oath
- Data in the Warehouse is consistent, correct, and reasonably up-to-date.
- When deciding between correctness and speed, correctness will be given priority.
Architecture
- Observes declares all its ETLs using Python.
- It declares its own infrastructure using Terraform.
- Sensitive secrets like Cloudflare authentication tokens are stored in encrypted YAML files using Mozilla SOPS (see the sketch after this list).
- The Data Warehouse is a Redshift cluster on AWS deployed on many subnets provided by the VPC component of Common for High Availability.
- AWS QuickSight is the solution we use for Business Intelligence (BI).
- ETL tasks are scheduled using the Compute component of Common.
- ETLs fetch data from their corresponding service and transport it to a target location, commonly the Warehouse.
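A minimal sketch of how an ETL could read such a SOPS-encrypted file at runtime, assuming the sops CLI is available and PyYAML is used for parsing; the file path and secret key below are hypothetical:

```python
# Minimal sketch: decrypt a SOPS-encrypted YAML file and read one secret.
# The file path and key name are hypothetical examples.
import subprocess

import yaml  # PyYAML


def read_secrets(path: str = "observes/secrets.yaml") -> dict:
    # `sops --decrypt` prints the plaintext YAML to stdout.
    decrypted = subprocess.run(
        ["sops", "--decrypt", path],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return yaml.safe_load(decrypted)


# Example (hypothetical key name):
# cloudflare_token = read_secrets()["cloudflare_api_token"]
```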
Generic ETL
- Fetch: The ETL fetches data from the source, e.g. an API endpoint.
- Get schema: The ETL then determines the schema (the metadata of the data). This can be done by various means:
  - hardcoded typings from an SDK;
  - auto determination by analyzing the raw data.
  Both methods have pros and cons.
- Target transform: With the data correctly typed, it is transformed into the types that the target expects. For Redshift the most common transform is flattening.
  - Flattening: nested/complex fields (e.g. lists, dictionaries) are mapped into flat collections of primitive fields (as defined by the target), i.e. for Redshift, lists and dictionaries are mapped into separate tables.
- Load: Having transformed the data into the types that the target expects, load commands are triggered to save the data, i.e. in Redshift this corresponds to SQL queries.
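A minimal sketch of these four steps in Python, using an in-memory sample record instead of a real source; the function names, sample data, and table naming are illustrative assumptions, not the actual Observes code:

```python
# Sketch of the generic ETL flow: fetch, get schema, flatten, load.
from typing import Any


def fetch_records() -> list[dict[str, Any]]:
    # Fetch: a real ETL would call the source API endpoint here.
    return [
        {
            "id": 1,
            "name": "job-a",
            "tags": ["ci", "etl"],
            "meta": {"owner": "observes", "retries": 2},
        },
    ]


def infer_schema(record: dict[str, Any]) -> dict[str, str]:
    # Get schema: naive auto determination by inspecting the raw values.
    return {key: type(value).__name__ for key, value in record.items()}


def flatten(
    records: list[dict[str, Any]], parent: str = "source"
) -> dict[str, list[dict[str, Any]]]:
    # Target transform (flattening): nested lists/dicts become separate child
    # tables keyed by the parent id, so every remaining column is primitive.
    tables: dict[str, list[dict[str, Any]]] = {parent: []}
    for record in records:
        flat: dict[str, Any] = {}
        for key, value in record.items():
            if isinstance(value, dict):
                tables.setdefault(f"{parent}_{key}", []).append(
                    {f"{parent}_id": record["id"], **value}
                )
            elif isinstance(value, list):
                tables.setdefault(f"{parent}_{key}", []).extend(
                    {f"{parent}_id": record["id"], "value": item} for item in value
                )
            else:
                flat[key] = value
        tables[parent].append(flat)
    return tables


def load(tables: dict[str, list[dict[str, Any]]]) -> None:
    # Load: for Redshift this would become SQL statements; here we just print.
    for name, rows in tables.items():
        print(f"-- would load {len(rows)} row(s) into table {name}")


if __name__ == "__main__":
    raw = fetch_records()
    print("inferred schema:", infer_schema(raw[0]))
    load(flatten(raw))
```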
Particular Cases
For almost all ETLs:
- target is set to Redshift.
- source is an API endpoint exposed on the internet.
- schema is hardcoded in the SDK or auto determined on each trigger.
- all data is erased and re-uploaded into the target (see the full-refresh sketch after this list).
- the ETL job triggers the full ETL procedure from start to finish.
- the Compute component of Common triggers the ETL.
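As a rough illustration of the erase-and-re-upload behavior, this hedged sketch prints the kind of full-refresh statements a Redshift load could use; the table names are hypothetical and the actual ETLs may load data differently (e.g. via COPY):

```python
# Hedged sketch of a full refresh: wipe the target table and re-insert
# everything inside one transaction. The table names are hypothetical.
def full_refresh_statements(table: str, staging: str) -> list[str]:
    return [
        "BEGIN;",
        f"DELETE FROM {table};",
        f"INSERT INTO {table} SELECT * FROM {staging};",
        "COMMIT;",
    ]


if __name__ == "__main__":
    for statement in full_refresh_statements("zoho_crm_contacts", "zoho_crm_contacts_staging"):
        print(statement)
```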
A few exceptions exist though:
- S3 bucket target: they emit data into a bucket rather than to Redshift, i.e.
  - /observes/etl/timedoctor/backup into observes.etl-data/backup_timedoctor
  - /observes/etl/code/compute-bills into integrates/continuous-data/bills
- dynamo_etl emits first to observes.etl-data/dynamodb and later into Redshift.
- Differential ETLs: they do not erase and re-upload all data. They store the current streaming state in the observes.state bucket (see the state-handling sketch after this list), e.g. gitlab_etl and checkly_etl.
- Cached schema: they have neither a hardcoded schema nor auto determination. They extract a schema previously saved in the observes.cache bucket, e.g. dynamo_etl (another job auto determines the schema and saves it in the bucket).
- Fractioned ETL phases:
  - Zoho ETL: has two jobs separating the ETL phases, i.e. /observes/etl/zoho-crm/fluid/prepare and /observes/etl/zoho-crm/fluid.
  - Dynamo ETL: the start phase is triggered as multiple concurrent jobs, but the final phase is manually triggered.
- LogRocket pushes its data to an S3 bucket that is later used by Redshift, meaning there is no custom ETL running on Batch for this scenario.
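As referenced in the Differential ETLs item above, here is a hedged sketch of how streaming state could be kept in the observes.state bucket; the object key, state layout, and extractor stub are assumptions for illustration, not the actual gitlab_etl code:

```python
# Hedged sketch: resume a differential ETL from a cursor stored in the
# observes.state bucket instead of erasing and re-uploading everything.
import json
from typing import Any

import boto3

S3 = boto3.client("s3")
BUCKET = "observes.state"


def read_state(key: str) -> dict[str, Any]:
    # Start from scratch when no state object exists yet.
    try:
        body = S3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        return json.loads(body)
    except S3.exceptions.NoSuchKey:
        return {"last_updated_at": None}


def write_state(key: str, state: dict[str, Any]) -> None:
    S3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(state).encode())


def fetch_newer_than(cursor: str | None) -> list[dict[str, Any]]:
    # Placeholder for the real extractor (e.g. a paginated API call) that
    # returns only records modified after `cursor`.
    return []


def run(key: str = "gitlab_etl.json") -> None:
    state = read_state(key)
    records = fetch_newer_than(state["last_updated_at"])
    if records:
        # ...load the new records into the target here...
        state["last_updated_at"] = max(record["updated_at"] for record in records)
        write_state(key, state)
```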
Jobs
- /observes/job/batch-stability: its task is to monitor the Compute component of Common.
- /observes/job/cancel-ci-jobs: its task is to cancel old CI jobs on GitLab that got stuck.
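For instance, a hedged sketch of what /observes/job/cancel-ci-jobs could do with the GitLab jobs API; the project id, token variable, and age threshold below are assumptions, not the real configuration:

```python
# Hedged sketch: cancel GitLab CI jobs that have been running for too long.
import os
from datetime import datetime, timedelta, timezone

import requests

GITLAB_API = "https://gitlab.com/api/v4"
PROJECT_ID = "12345"  # hypothetical project id
TOKEN = os.environ["GITLAB_TOKEN"]
MAX_AGE = timedelta(hours=4)  # assumed threshold for a "stuck" job


def running_jobs() -> list[dict]:
    response = requests.get(
        f"{GITLAB_API}/projects/{PROJECT_ID}/jobs",
        headers={"PRIVATE-TOKEN": TOKEN},
        params={"scope[]": "running", "per_page": 100},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()


def cancel_stuck_jobs() -> None:
    now = datetime.now(timezone.utc)
    for job in running_jobs():
        if not job.get("started_at"):
            continue
        started = datetime.fromisoformat(job["started_at"].replace("Z", "+00:00"))
        if now - started > MAX_AGE:
            requests.post(
                f"{GITLAB_API}/projects/{PROJECT_ID}/jobs/{job['id']}/cancel",
                headers={"PRIVATE-TOKEN": TOKEN},
                timeout=30,
            ).raise_for_status()


if __name__ == "__main__":
    cancel_stuck_jobs()
```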
Contributing
Please read the contributing page first.