Improving the quality and reliability of our end-user metrics | A data engineering journey

In 2021 the Data Engineering team took on an ambitious project to redesign our data infrastructure. In this post, we provide a high-level overview of where we started and where we are now. This overview will include the reasons behind the architectural decisions made, and what we see as the next steps in the evolution of data engineering at WeTransfer.

Where we started

Initially, to provide our end-users with metrics, we had a BI tool that could directly query the backend OLTP databases of our three products (Collect, Paste & WeTransfer); see the diagram below. We also loaded internal data with Amazon's Database Migration Service (DMS), and external data via webhooks and Kinesis Firehose. The transformations on all data sources were done with Apache Spark on EMR (orchestrated by Airflow), the results of which were either loaded into one of three possible Redshift warehouses or dropped into S3 to be queried via Spectrum.

Previous architecture


Limitations of our initial approach

There were several problems with this setup; here we will outline just a few.

  1. The data were spread across three separate warehouses (one for non-PII data, one for PII data, and one for Snowplow events), leading to unnecessary complexity and data duplication.
  2. The traditional ETL design abstracted the transformations away from the analysts; those with the greatest business knowledge now had the least control over the data modelling.
  3. The quality of data entering the warehouse was questionable as we had no real quality control checks; often an error would only be flagged by an end-user noticing inconsistency in the data weeks after loading.
  4. The end-users had nowhere to perform custom analysis, which forced them to use Metabase dashboards for this purpose instead. Metabase is not designed for this, e.g. it has no version control, and we quickly found we had a huge number of dashboards that were hard to keep track of and organise.

As WeTransfer's data is growing exponentially, we needed to make some radical changes. To begin the redesign, we started by asking ourselves the following questions and coming up with solutions to them.

Redesign questions

Is it scalable?

To simplify our setup and prevent unnecessary data duplication, we decided to consolidate our three data warehouses into one. To do so, we needed to consider the storage capacity of our Redshift cluster. Our DC2 nodes did not offer sufficient storage, which led us to upgrade to the recently released RA3 nodes; these separate compute from managed storage, guaranteeing that our cluster could handle a large increase in data volume.

Next, we needed to assess whether our orchestration tool, Airflow, could cope with a large increase in daily tasks and dependencies. We saw this as a great opportunity to move our Airflow deployment to Kubernetes. Airflow on Kubernetes runs one worker pod per Airflow task, enabling Kubernetes to spin pods up and down depending on load. Kubernetes also allows Airflow to work as a distributed system: the scheduler, web server, and worker pods all run on different nodes, improving the service's fault tolerance.
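To make the pod-per-task model concrete, here is a toy sketch (not Airflow's actual executor code) of how a scheduler launches an isolated, short-lived worker for each task only once its upstream dependencies have finished. Task names and the DAG shape are illustrative:

```python
from collections import defaultdict, deque

def schedule_pod_per_task(tasks, deps):
    """Toy illustration of the KubernetesExecutor model: each task
    gets its own short-lived worker ("pod"), launched only when all
    of its upstream dependencies have completed.

    tasks: iterable of task names
    deps:  dict mapping task -> list of upstream tasks
    Returns the order in which pods were launched.
    """
    indegree = {t: 0 for t in tasks}
    downstream = defaultdict(list)
    for task, ups in deps.items():
        for up in ups:
            indegree[task] += 1
            downstream[up].append(task)

    ready = deque(t for t in tasks if indegree[t] == 0)
    launched = []
    while ready:
        task = ready.popleft()
        launched.append(task)  # pod spun up, task runs, pod destroyed
        for down in downstream[task]:
            indegree[down] -= 1
            if indegree[down] == 0:
                ready.append(down)
    return launched

# Example DAG: extract -> transform -> load, plus an independent audit task
order = schedule_pod_per_task(
    ["extract", "transform", "load", "audit"],
    {"transform": ["extract"], "load": ["transform"]},
)
# → ["extract", "audit", "transform", "load"]
```

Because each "pod" exists only for the duration of one task, a crash in one worker cannot take down the scheduler or other running tasks, which is the fault-tolerance win described above.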

The Data Engineering team at WeTransfer is relatively small, and we wanted to make big changes fast. It was not feasible for us to handle the extract and load of all sources internally while still guaranteeing data quality, which led us to sign up with a data delivery provider, Fivetran. So far we have been happy with this decision, though it is important to monitor the cost closely, as under certain circumstances it can balloon. We also decided not to load our two largest tables this way, as their size would make the cost prohibitive under Fivetran's pricing model.

Are the data discoverable?

We needed to improve the accessibility of the data for those with the greatest business knowledge: the analysts. Until now it had seemed only natural that the transformation steps were the data engineers' responsibility, as this spared analysts from having to skill up on Spark and set up EMR clusters. However, we found that this setup created barriers between the teams and prevented data democracy within the company.

With this in mind, we set our sights on simplifying the process so that other teams could have greater control over, and access to, the data. This was achieved by moving from an ETL to an ELT approach. We removed the EMR Spark jobs running the ‘T’ step and replaced them with a SQL orchestration tool, the Data Build Tool (DBT), which now performs the transformations inside our data warehouse using the Redshift cluster's computing power. It is now much easier for analysts to build data models themselves to feed their own KPI dashboards.
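The core mechanic that makes this analyst-friendly is that a DBT model is just SQL with references to other models, which DBT compiles into fully qualified table names before running it in the warehouse. The sketch below mimics that compilation step very loosely in Python; the schema name and model names are made up for illustration:

```python
import re

def render_model(sql, models, schema="analytics"):
    """Loose imitation of how DBT compiles a model: resolve
    {{ ref('model_name') }} placeholders into qualified table names
    so the resulting SQL can run directly in the warehouse."""
    def resolve(match):
        name = match.group(1)
        if name not in models:
            raise ValueError(f"unknown model: {name}")
        return f"{schema}.{name}"
    return re.sub(r"\{\{\s*ref\('(\w+)'\)\s*\}\}", resolve, sql)

# A hypothetical analyst-written model referencing a staging model
compiled = render_model(
    "select user_id, count(*) as transfers "
    "from {{ ref('stg_transfers') }} group by 1",
    models={"stg_transfers"},
)
# → "select user_id, count(*) as transfers from analytics.stg_transfers group by 1"
```

Because the `ref()` calls declare dependencies explicitly, the tool can also derive the DAG of models automatically, which is what powers the documentation site mentioned below.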

Furthermore, under the previous architecture the overall DAG logic was hidden from the analysts, but now they can make use of DBT's built-in documentation website. There the analysts can gain a high-level overview of model dependencies, view the compiled SQL code, and browse table metadata.

Is it reliable?

It was imperative that the metrics we generated could be trusted by end-users, which entailed implementing daily tests against our batch loads. This was simple to do using DBT's test module. Now, instead of end-users spotting mistakes weeks after the fact, the Data Engineering team is alerted to pipeline failures in our alert channel on the day the job runs. Having these tests allows us to communicate issues to the business more quickly, which increases end-users' trust in the data.
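The checks themselves are conceptually simple. A minimal Python stand-in for the kind of schema tests DBT runs (not-null, uniqueness, non-empty batch) might look like this; the column name and batch shape are invented for the example:

```python
def run_batch_tests(rows, key):
    """Minimal stand-in for DBT-style schema tests on a freshly
    loaded batch: fail fast on load day instead of letting bad data
    reach dashboards weeks later. `rows` is a list of dicts."""
    failures = []
    if not rows:
        failures.append("row_count: batch is empty")
    keys = [r.get(key) for r in rows]
    if any(k is None for k in keys):
        failures.append(f"not_null: {key} contains nulls")
    if len(set(keys)) != len(keys):
        failures.append(f"unique: {key} has duplicates")
    return failures

# A batch with a duplicated primary key triggers exactly one failure
batch = [{"transfer_id": 1}, {"transfer_id": 2}, {"transfer_id": 2}]
failures = run_batch_tests(batch, "transfer_id")
# → ["unique: transfer_id has duplicates"]
```

In practice any non-empty `failures` list would page the alert channel rather than just being returned.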

Where we are now

Our primary objective was to improve the quality and reliability of our end-user metrics. We achieved this by ensuring our BI dashboards rely on data from a single warehouse rather than three, with no direct querying of the OLTP databases. Within this one warehouse we built a data mart that serves all business questions, dashboards, metrics, and KPIs - it is our single source of truth. This required switching from an ETL to an ELT design paradigm, whereby we run our transformations in the warehouse using the cluster's computing power, orchestrated by DBT. The key KPIs are displayed within Metabase, but we also deployed a JupyterHub cluster on Kubernetes to give end-users an environment in which to perform custom analysis. This has helped ensure that Metabase is only used to track key metrics, keeping the number of dashboards manageable.

We also opted to employ Fivetran to handle the extraction and load of almost all our internal data source tables, and all our external sources except Snowplow events. Furthermore, we've started migrating to a more event-driven architecture, adding custom Snowplow events that we hope will, with time, remove our reliance on the OLTP databases. Lastly, we moved our orchestrator, Airflow, onto Kubernetes.

Current Architecture


Looking ahead

Currently, running DBT using our Redshift cluster's computing power does have some performance issues, especially when applying transformations to our largest tables. At the moment we navigate this problem by splitting the data into separate monthly tables and only running transformations on the last day of the latest partition. However, this technique increases the complexity of the orchestration.
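The partitioning workaround boils down to routing each daily run at the current monthly table instead of one giant table. A small sketch of that routing logic, with an illustrative table-naming scheme (not our exact one):

```python
from datetime import date

def latest_partition_table(base, today):
    """Pick the monthly partition table that today's transformation
    should target, e.g. transfers_2021_09 for any day in Sept 2021.
    The naming convention here is illustrative only."""
    return f"{base}_{today.year}_{today.month:02d}"

table = latest_partition_table("transfers", date(2021, 9, 21))
# → "transfers_2021_09"
```

The cost of this scheme is exactly the orchestration complexity mentioned above: every downstream model now needs to know which partition to read, and month boundaries become special cases.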

An alternative could be to harness the power of Spark with DBT for the larger tables, or to replace our current Redshift data warehouse with Snowflake. Spark would give us more flexibility, offering APIs that are not available in Snowflake and that our Data Scientists could use to train ML models. On the other hand, Snowflake abstracts away the resource management, making our pipelines easier to maintain. Additionally, Snowflake has recently announced the release of ‘Snowpark’, which provides some Spark-like APIs. Both options have given us plenty of food for thought.

Another drawback of our current approach is that we haven't been able to completely remove DMS from our infrastructure. As previously mentioned, our two largest tables would be too expensive to bring in via Fivetran, so we still have a DMS dependency there. With that in mind, we hope to eventually be able to stop these DMS jobs as we continue to add more custom Snowplow events, which would allow us to stop relying on our OLTP backend databases for building our analytical data models.

Overall, it's been a busy year filled with great learning opportunities in the Data Engineering team. We cannot wait to see what the next year will bring as we continue to build further on the solid foundations laid in the past year.