Data book review: The Informed Company

Mark MacArdle
4 min read · Nov 28, 2021

A modern data book for modern data warehouses

This is a book from Matt David (The Data School, Chartio) and Dave Fowler (CEO/founder of Chartio before its acquisition by Atlassian). It covers what approaches to use for managing a company’s data for analysis at four stages of maturity and lets you know when you should switch between them. They want to move the discussion on from Kimball’s Data Warehouse Toolkit book, which was originally released in the ’90s.

Book Content

Stage 1: Querying Source Data

This stage is for those who don’t have much data or who are just getting started. You can get by with querying the production database or 3rd party tools (eg Zendesk) directly. You don’t have much need to combine data sources.

A good piece of advice they give is not to start new dashboards by exploring the data (what I always do!). You’ll likely make a lot of charts and it’ll be hard to tell which ones, if any, are useful (a problem I often get!). Start by sketching out and iterating on ideas with the relevant stakeholders and you’ll better address their needs.

Stage 2: Data Lake

I found their terminology a bit confusing in this section as when they say “data lake” they mean raw, untransformed data in a data warehouse (eg Snowflake/BigQuery/Redshift). I don’t think they’re wrong, I’m just surprised they never clarify they don’t mean using an S3 bucket or Databricks.

Aside from the odd names there’s more good advice here. Switch to this stage when you need to start combining data sources. Use ELT not ETL. Don’t build connectors yourself, use tools like Fivetran or Stitch to do the extracting and loading for you.
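To make the ELT idea concrete, here’s a rough sketch of my own (not from the book). It uses Python’s built-in sqlite3 as a stand-in for a cloud warehouse, and the table, view and column names are all made up for illustration. The raw data is loaded untouched first, and the cleaning happens afterwards in SQL inside the "warehouse".

```python
import sqlite3

# Hypothetical raw rows, as a tool like Fivetran or Stitch might land them.
raw_tickets = [
    ("T-1", " Alice@Example.com ", "2021-11-01", "open"),
    ("T-2", "bob@example.com", "2021-11-02", "CLOSED"),
]

# An in-memory SQLite database stands in for the warehouse (an assumption).
conn = sqlite3.connect(":memory:")

# Extract + Load: land the data as-is in a raw table, no cleaning yet.
conn.execute(
    "CREATE TABLE raw_tickets (id TEXT, email TEXT, created TEXT, status TEXT)"
)
conn.executemany("INSERT INTO raw_tickets VALUES (?, ?, ?, ?)", raw_tickets)

# Transform: done afterwards, in SQL, on top of the raw table.
conn.execute(
    """
    CREATE VIEW tickets_clean AS
    SELECT id,
           lower(trim(email)) AS email,
           created,
           lower(status) AS status
    FROM raw_tickets
    """
)

clean_rows = conn.execute("SELECT * FROM tickets_clean ORDER BY id").fetchall()
raw_email = conn.execute(
    "SELECT email FROM raw_tickets WHERE id = 'T-1'"
).fetchone()[0]
print(clean_rows)
```

The raw table is never modified, so if a cleaning rule turns out to be wrong you can change the view without re-extracting anything.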

This section also features a great diss of Hadoop.

Hadoop is like a barge. It can haul a lot but not well and only down a river. Just avoid it.

Ooof! 🔥🔥🔥

The book is only 200 pages and has a lot of pictures so you can get through it really quickly. Also ❤️loving❤️ the wide margins. So helpful for note taking!

Stage 3: Data warehouse

For when you start having a wider group of analysts/queries and need to start properly cleaning and structuring your data to improve usability and avoid repeating work.

The advice here is about cleaning tables in layers and setting up a staging schema to hold intermediate cleaning steps. Lots of practical aspects are covered, like using dbt for managing your SQL transformations and why you should use git and code reviews when updating code.
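Here’s a small sketch of what cleaning in layers with a staging schema could look like. This is my own illustration rather than the book’s code: it uses sqlite3 with attached in-memory databases to mimic separate schemas, and the schema names, table names and cleaning rules are all hypothetical. In practice each step would more likely be a dbt model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Attached in-memory databases stand in for separate schemas (an assumption).
conn.execute("ATTACH DATABASE ':memory:' AS staging")
conn.execute("ATTACH DATABASE ':memory:' AS warehouse")

# Raw layer: messy data exactly as loaded.
conn.execute(
    "CREATE TABLE main.raw_orders (order_id INTEGER, amount TEXT, country TEXT)"
)
conn.executemany(
    "INSERT INTO main.raw_orders VALUES (?, ?, ?)",
    [(1, "10.50", "ie"), (1, "10.50", "ie"), (2, "n/a", "GB"), (3, "7", " gb ")],
)

# Staging step 1: drop exact duplicate rows.
conn.execute(
    """
    CREATE TABLE staging.orders_dedup AS
    SELECT DISTINCT order_id, amount, country FROM main.raw_orders
    """
)

# Staging step 2: fix types and normalise formats.
conn.execute(
    """
    CREATE TABLE staging.orders_typed AS
    SELECT order_id,
           CAST(NULLIF(amount, 'n/a') AS REAL) AS amount,
           upper(trim(country)) AS country
    FROM staging.orders_dedup
    """
)

# Warehouse layer: the cleaned table analysts actually query.
# Dropping rows without an amount is a made-up rule for this example.
conn.execute(
    """
    CREATE TABLE warehouse.orders AS
    SELECT order_id, amount, country
    FROM staging.orders_typed
    WHERE amount IS NOT NULL
    """
)

orders = conn.execute(
    "SELECT order_id, amount, country FROM warehouse.orders ORDER BY order_id"
).fetchall()
print(orders)
```

Keeping the intermediate tables in their own schema means each step can be inspected when something looks off, while analysts only need to know about the final layer.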

Stage 4: Data Marts

Data marts here are smaller schemas that filter or combine a sub-group of the tables available in the warehouse for better ease of use. Moving from the lake to warehouse stages was about cleaning. Warehouse to marts is about focusing on the needs of different teams or business processes.

They make a strong recommendation here to not use Kimball modeling and instead use wide tables. Wide tables are normally easier to use, as well as to design and implement. The removal of the storage/processing cost constraints of Kimball’s day now makes this their preferred approach.
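As a toy illustration of the difference (my own sketch with made-up table names, again using sqlite3 as a stand-in warehouse), here’s a small fact/dimension pair in the Kimball style and the wide table you could build from it. The join is done once when the wide table is created, so the people querying it don’t need to write it themselves.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Kimball-style star schema: one fact table plus a dimension table.
    CREATE TABLE dim_customer (
        customer_id INTEGER PRIMARY KEY, name TEXT, plan TEXT
    );
    CREATE TABLE fact_orders (
        order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL
    );
    INSERT INTO dim_customer VALUES (1, 'Acme', 'pro'), (2, 'Globex', 'free');
    INSERT INTO fact_orders VALUES (10, 1, 99.0), (11, 1, 49.0), (12, 2, 0.0);

    -- Wide table: the join is done once and customer attributes are
    -- repeated on every order row.
    CREATE TABLE orders_wide AS
    SELECT o.order_id,
           o.amount,
           o.customer_id,
           c.name AS customer_name,
           c.plan AS customer_plan
    FROM fact_orders AS o
    JOIN dim_customer AS c ON c.customer_id = o.customer_id;
    """
)

# A typical question now needs no joins at all.
revenue_by_plan = conn.execute(
    """
    SELECT customer_plan, SUM(amount)
    FROM orders_wide
    GROUP BY customer_plan
    ORDER BY customer_plan
    """
).fetchall()
print(revenue_by_plan)
```

The trade-off is duplicated data (the customer’s name and plan appear on every one of their orders), which is exactly the storage cost that used to matter more than it does now.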

The final chapter was my favourite. It’s a real world example of three iterations they went through at Chartio of creating a mart to be used as a single source of truth for company metrics. They cover difficult points they had, like whether similar, but not quite identical, tables should be combined. They’re candid about mistakes they made and issues they found with each iteration, even the last one. I thought this was so helpful as, while the book contains straightforward advice, actually implementing something in the real world will present many little sticking points.

Other Similar Books

If you’re new to the modern data stack, another good book I’d recommend is The Analytics Setup Guidebook. That book is broader in scope and less focused specifically on how to structure your warehouse. It goes into more detail on the history of past tools and approaches. That historical information was really helpful to me for understanding where people only familiar with older approaches are coming from when they ask questions.

Conclusion

I think this is a great book for giving guidance on where you are and where you might want to go in organising your data warehouse. There are precious few books that do this well.

As someone who’s already familiar with the modern data stack, what I found myself craving throughout the book was more discussion of interpersonal or prioritisation issues. Like having a ticketing system for business user requests is good advice, but what do you do when you get too many requests?

That doesn’t stop me recommending it though as it’s hard to give clear advice on those things without being overly generic. And besides, including it would have lengthened what is a short and focused book.

As a side point this book immediately got me onside by stating at the start that it’s not for AI/machine learning or Big Data use cases. That’s quite an untrendy statement to make so it’s refreshing to see it stated openly. They also give the biggest definition of “big data” I’ve seen yet: when you’re generating more than 100GB per day of new raw data. I think what they’re getting at is that your data is only “big” if the off-the-shelf tools most companies use can’t handle it anymore.
