How ManyPets Implemented The Modern Data Stack

Mark MacArdle
Published in Data@ManyPets · Mar 18, 2022 · 10 min read

Using the latest tools to build a data platform that massively empowers data users

ManyPets’ data architecture (somewhat simplified)

In recent years machine learning and AI have firmly held the spotlight in the data world. At the same time, however, there’s been another, quieter, but no less transformational data revolution happening: a huge improvement in the tools available for importing, modelling and analysing data. These tools have greatly sped up time to value and massively empowered data analysts and scientists by enabling them to do tasks that were once limited to data engineers.

This revolution started 10 years ago and really picked up speed in the last 5, but it’s only recently that this set of tools has gotten a collective name: The Modern Data Stack. In this post we’ll talk through what these tools are and how we’ve implemented them at ManyPets.

What is The Modern Data Stack?

It’s not just a description of a data stack that’s modern. It’s a modular set of best-in-class tools, which consists of:

  • A data ingestion tool, also called an EL (extract-load) tool. For getting data into your warehouse.
  • A cloud data warehouse. For storing and transforming your data.
  • A data modelling tool. For managing your data transforming queries.
  • A business intelligence tool. For making dashboards, charts and reports.
  • A reverse-ETL tool, also called an operational analytics tool. For getting data out of your warehouse to where it’s most useful.

Data Ingestion

We use: Fivetran, Meltano, Airflow and various GCP services

Other options: Stitch, Airbyte

This is where you get the raw data from your many different sources into your warehouse.

Custom pipelines (and why to avoid them)

By this we mean pipelines where you have to write all the code and run the infrastructure yourself. They’re not really a part of The Modern Data Stack, but most companies will have a few unusual sources or use cases that force a do-it-yourself approach.

The very first pipeline we built at ManyPets was to import our policy and quote data from Mongo to our warehouse. We created a custom pipeline for this as the data is heavily nested and we felt it’d be easier to flatten out the JSON documents with Python than with SQL. The second pipeline we made was for importing our customer reviews from Feefo, and that was because none of the SaaS tools (discussed next) had a connector for it.

The Mongo pipeline uses PubSub and Dataflow on Google Cloud Platform (GCP). The Feefo pipeline uses Cloud Functions (GCP’s equivalent of AWS Lambda) and Python scripts running on Airflow. We use Cloud Composer, GCP’s managed Airflow service, to run Airflow and orchestrate the pipelines.
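To give a flavour of what that flattening step looks like, here’s a minimal Python sketch. The field names are hypothetical and our real policy documents are far more nested, but the idea is the same: turn one nested document into a flat row that loads cleanly into the warehouse.

```python
# Minimal sketch of the flattening step, with hypothetical field names.
# Real policy documents are far more nested, but the idea is the same:
# turn one nested document into a flat row that loads cleanly into BigQuery.
import json


def flatten(doc, parent_key="", sep="_"):
    """Recursively flatten a nested dict, joining keys with underscores."""
    items = {}
    for key, value in doc.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            items.update(flatten(value, new_key, sep))
        else:
            items[new_key] = value
    return items


if __name__ == "__main__":
    policy = {
        "policy_id": "abc123",
        "pet": {"name": "Rex", "species": "dog"},
        "premium": {"amount": 25.0, "currency": "GBP"},
    }
    # One newline-delimited JSON row, ready for a BigQuery load job.
    print(json.dumps(flatten(policy)))
```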

The biggest lesson we learned from developing these custom pipelines was: don’t make custom pipelines if you can avoid it. Both of them took weeks to develop and we have an ongoing (albeit relatively minor) burden of monitoring and running them ourselves. Even for cases we thought would be easy, like the customer reviews from Feefo, we still spent weeks getting lost in bad API documentation and finding/resolving obscure bugs.

Data ingestion tools (the joy of paying someone else to do it)

With SaaS tools like Stripe or Zendesk becoming so common, it should be unsurprising to hear that further SaaS companies popped up to offer the service of importing data from those tools into your warehouse. Fivetran and Stitch are the two biggest companies in this area, and on both of their sites you can set up pipelines to import data from all the common SaaS tools in a few clicks. This is great, both for the weeks of development effort it saves and because it means the importing pipeline can be set up by data analysts. No longer do they have to wait on data engineers to do it for them.

We initially used Stitch, which was quite cheap. However, we later switched to Fivetran because Fivetran realised that not only does everyone want the same raw data in their warehouse, they probably also want to do similar cleaning of it. For example, if you have Stripe payment data you probably want to see total spend by customer. Fivetran offers this cleaning as dbt SQL packages (discussed more in the Modelling section later).

Fivetran is multiple times the cost of Stitch, but it was these cleaning packages that made it worth it to us. Getting familiar with a new data source and producing useful tables and metrics from it can take weeks, so these packages are a huge time saver.

Meltano (open source to the rescue)

Meltano is an open source tool that makes it easy to run connectors written in the Singer open source standard. The connectors are created by the community, so there’s a huge library of over 300 of them, although they can vary in quality. We found a connector there for our telephony tool, PureCloud, when no one else was providing one (thank you open source community!).

Meltano doesn’t currently offer a paid, hosted service, so you have to run the pipelines yourself. We do this in a container triggered by Airflow. We wrote a guide on how to get Meltano running in a container, and how to run that container on GCP, here.
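As a rough sketch of the Airflow side (the image, namespace, schedule and tap/target names below are placeholders rather than our exact setup, and the operator import path varies between provider versions), a single containerised task is enough:

```python
# Sketch of an Airflow DAG that runs a Meltano sync in a container.
# The image, namespace, schedule and tap/target names are placeholders, and the
# operator import path can differ between versions of the Kubernetes provider.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="meltano_purecloud_sync",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_meltano = KubernetesPodOperator(
        task_id="run_meltano_elt",
        name="meltano-purecloud",
        namespace="default",
        image="your-registry/meltano-project:latest",  # placeholder image
        # Meltano extracts from the Singer tap and loads straight into the warehouse.
        cmds=["meltano", "elt", "tap-purecloud", "target-bigquery"],
        get_logs=True,
    )
```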

Cloud Data Warehouse

We use: Google BigQuery

Other options: Snowflake, AWS Redshift

This is the foundation of The Modern Data Stack. Cloud data warehouses offer cheap storage (similar to S3 bucket prices), near limitlessly scalable compute and are either fully or nearly-fully managed.

When Redshift, the OG cloud data warehouse, launched in 2012 it changed the game for how pipelines could be built and by whom. The cheap storage and powerful compute meant you could chuck in all your raw data (instead of only the reduced and cleaned data of the old ETL (extract-transform-load) processes). This enabled the rise of the data ingestion tools. With the raw data in the warehouse, it could then be modelled with just SQL, rather than with programming languages before being uploaded to the warehouse.

This meant that data analysts and scientists could do the modelling themselves without having to wait on engineering teams. This profound change has even given birth to a new job title: those specialising in modelling raw data into more useful cleaned tables are now called Analytics Engineers.

For our data warehouse we use BigQuery. In my opinion it’s the easiest to run as it’s completely serverless and fully managed. It’s entirely on demand, so you don’t have to do things like spec your compute size beforehand as you do with Snowflake. If you aren’t on GCP, Snowflake is the most common choice.
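To make the ELT pattern concrete, here’s a minimal sketch using the BigQuery Python client. The project, dataset and table names are made up; the point is that the raw data gets loaded as-is and all the cleaning happens afterwards in SQL.

```python
# Minimal ELT sketch: load raw JSON rows into BigQuery as-is, then clean with SQL.
# The project, dataset and table names here are made up for illustration.
from google.cloud import bigquery

client = bigquery.Client()

raw_rows = [
    {"policy_id": "abc123", "event": "quote_created", "payload": '{"pet": "dog"}'},
    {"policy_id": "def456", "event": "policy_bound", "payload": '{"pet": "cat"}'},
]

job_config = bigquery.LoadJobConfig(
    autodetect=True,                   # let BigQuery infer the schema of the raw data
    write_disposition="WRITE_APPEND",  # keep appending raw events; transform later in SQL
)

load_job = client.load_table_from_json(
    raw_rows, "your-project.raw.policy_events", job_config=job_config
)
load_job.result()  # wait for the load to finish

# All the cleaning happens afterwards, in the warehouse, with plain SQL (or dbt).
query = "SELECT event, COUNT(*) AS n FROM `your-project.raw.policy_events` GROUP BY event"
for row in client.query(query).result():
    print(row.event, row.n)
```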

Data modelling

We use: dbt

Other options: Dataform

dbt is a tool for aiding SQL modelling (transforming/cleaning) of your data. It does a few things:

  • Makes it easy to reuse SQL code and generate the SQL programmatically with Jinja templating
  • Allows you to document column descriptions and specify SQL tests to run on columns or tables in YAML documentation files
  • Automatically manages dependencies between queries and runs them in the right order.

You can use it as a free, open source command line tool, or pay for their managed service, dbt Cloud. We use the free version and run it in production with Airflow.
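As a rough sketch of what that can look like (the project path and schedule below are placeholders, not our exact setup), the Airflow DAG just calls the dbt CLI and lets dbt figure out the ordering between models:

```python
# Sketch of running the open source dbt CLI from Airflow with BashOperators.
# The project directory and schedule are placeholders for illustration.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

DBT_DIR = "/opt/airflow/dbt"  # wherever the dbt project is checked out

with DAG(
    dag_id="dbt_daily",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command=f"cd {DBT_DIR} && dbt run",
    )
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command=f"cd {DBT_DIR} && dbt test",
    )

    # dbt works out the dependency order between models itself;
    # Airflow just runs the project and then the tests.
    dbt_run >> dbt_test
```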

Before dbt we stored our modelling queries as SQL files and ran them as tasks in an Airflow job. We had to manually add dependencies between the tasks and we used Python’s string formatting to allow some basic code reuse. Initially, I’m now embarrassed to say, I didn’t think we needed dbt to help manage this. As we grew further though it became clear that we did and now I can’t imagine life without it!

Another innovation of dbt is its introduction of open source packages for SQL on dbt Hub. Packages are a key part of any mainstream programming language, but SQL has never really had them because the many different databases implement slightly different flavours of SQL. There are only three or four major cloud data warehouses in use today, though, and their syntax differences can be handled with dbt’s Jinja templating.

This has made it feasible for companies like Fivetran (our data ingestion tool) to write packages that model the raw data they extract and run in any of the major data warehouses. We love these packages! Sometimes we modify them, as we do for our Stripe cleaning; other times we use them as-is, as with our Google, Facebook and Microsoft Bing ad data. Fivetran even has a package that combines the cleaned output of multiple ad data sources into one summary reporting table.

The final thing to say about dbt is that they have an amazing community Slack with channels for a wide variety of topics (even including data-related memes). Anyone can join and I thoroughly encourage you to.

Business Intelligence Tool

We use: PowerBI

Other options: Looker, Tableau, Preset, Metabase, Hex and a million more

This is the tool for making charts and dashboards and exploring data. Currently we use PowerBI as it was the best compromise across a wide variety of use cases. We chose it for its low cost (which means less resistance to rolling it out widely), its intuitive way of joining tables, and the fact that it works well with local CSV files (which we get a lot of from partners).

It has some major downsides though, and if we ever change a tool in our stack in future it will likely be here. It takes a desktop-app-centric approach and the app isn’t available for Macs. There is a web app but its functionality is limited. Those of us on Macs use AWS Windows VMs to get around this. More modern tools like Looker were developed as web apps from the start, so they don’t have this issue. They generally look better and do nice things like automatically switching to a mobile view when needed. However, they have other trade-offs, like higher cost (especially for Looker 💸💸💸) and not playing nicely with local files.

Reverse-ETL aka Operational Analytics

We use: Hightouch

Other options: Census, Grouparoo

You’ve gotten all your data sources into your warehouse, you’ve modelled and combined them with dbt, and you’ve found interesting segments of customers with your business intelligence tool. Typically at this point the next step is to download a CSV of the relevant data and upload it into another tool like Salesforce to enrich the data about a customer. The process is manual, and analysts and scientists have just had to put up with it.

In 2021, however, things changed as tools with the inaccurate but catchy name of Reverse-ETL gained traction (ETL being the old process of getting data into your warehouse). These tools let you specify a SQL query and then, with a few clicks, have the results sent to the APIs of most popular SaaS tools on a schedule. This means they can be used by analysts or scientists, again without depending on the engineers they needed in the past. The lower effort required means value can be created from insights in more situations, without the tedium of manually downloading and then uploading CSVs.
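Conceptually, a sync boils down to something like the sketch below: query the warehouse, then push the rows to a SaaS API. Everything in it (the table, the endpoint, the token) is hypothetical; the whole point of these tools is that you configure this in a few clicks instead of writing and scheduling it yourself.

```python
# Conceptual sketch of what a reverse-ETL sync does: run a query against the
# warehouse and push the results to a SaaS API. The table, endpoint and token
# are hypothetical; tools like Hightouch handle this for you on a schedule.
import requests
from google.cloud import bigquery

client = bigquery.Client()

sql = """
    SELECT customer_id, lifetime_value
    FROM `your-project.marts.customer_summary`  -- hypothetical modelled table
    WHERE lifetime_value > 1000
"""

high_value_customers = [dict(row) for row in client.query(sql).result()]

for customer in high_value_customers:
    # Hypothetical destination API: enrich each customer record downstream.
    requests.post(
        "https://api.example-crm.com/customers",
        json=customer,
        headers={"Authorization": "Bearer YOUR_API_TOKEN"},
        timeout=30,
    )
```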

The most popular tools in the space are Hightouch and Census. We chose Hightouch for its better development environment, which made it easier to see and debug errors. Its pricing was also transparent, and it has been adding new connectors faster than Census.

This is the newest tool we’ve added to our stack and we’re only getting started with it. Currently we use it to send CSVs to Slack, S3 buckets and SFTP destinations, and sales data to partner APIs. Likely future use cases include enriching customer profiles in Segment or Customer.io (our email tool).

If you’re familiar with Segment (a customer data platform) you may think this sounds like the same thing. There is a lot of crossover, but reverse-ETL has a few benefits. The main one is the aforementioned fact that reverse-ETL pipelines can be created by analysts or scientists, rather than the software dev teams which typically own Segment. The lower effort required means automation can be used in more situations and by a broader group. Another benefit is that you can use whatever data model you like, rather than Segment’s customer-centred one.

Conclusion

The data team at ManyPets has been very resource constrained in the past. Over our three years we’ve been a team of 1, then 4, and only in the last six months have we grown to 10. With this small team we not only have to build infrastructure, we also carry a significant workload of reporting and analysis for the business. Despite all this, we’ve built a data platform capable of supporting the analytics needs of a company that’s now nearly 500 people with a $2 billion valuation 💪💪💪. This level of productivity and impact would have been unthinkable before the advent of The Modern Data Stack.

We’ve got a lot more to do here in Data at ManyPets and we’d love your help to do it 🚀. See our careers page to come join us!
