April 2025 - amazonwebshark

In this post, I review Gaurav Ashok Thalpati’s 2024 book ‘Practical Lakehouse Architecture’, published by O’Reilly Media.

Introduction
- The Author
- The Book
Motivations
Book Review
Summary

Introduction

I first found O’Reilly books a few years back in a Data Engineering-themed Humble Bundle. Since then, I’ve built an extensive library of both e-books and physical books, with many more on my Amazon wish list. At the start of 2025, I decided to actually start reading them…

So far, I’ve finished three. Now, I don’t feel compelled to review them all. But having finished Practical Lakehouse Architecture, I decided to start the Shark Shelf. This will be an occasional series of review posts about books that I really like, or that deserve some fanfare. And yes – How To Solve It belongs on the Shark Shelf.

Now let’s talk about Practical Lakehouse Architecture.

The Author

Gaurav Ashok Thalpati hails from Pune, India, where he’s worked as an independent cloud data consultant for decades. He’s a blogger and YouTuber, holds multiple data certifications and is an AWS Community Builder.

In July 2024, O’Reilly published his first book, Practical Lakehouse Architecture.

The Book

From the Practical Lakehouse Architecture blurb:

This guide explains how to adopt a data lakehouse architecture to implement modern data platforms. It reviews the design considerations, challenges, and best practices for implementing a lakehouse and provides key insights into the ways that using a lakehouse can impact your data platform, from managing structured and unstructured data and supporting BI and AI/ML use cases to enabling more rigorous data governance and security measures.

Practical Lakehouse Architecture is available in both physical and eBook forms from O’Reilly, Amazon US, Amazon UK and eBooks.

Motivations

Reading a book?! In 2025?! I know, right? This section examines my motivations for buying and reading Practical Lakehouse Architecture.

Project Wolfie

I recently wrote about the beginning of Project Wolfie. I kinda expected to have started coding by now. Instead, most of my work is currently on paper and whiteboards. But there’s a good reason for this.

Project Wolfie is greenfield. I don’t have any existing code or resources, and I can use modern tools freely. However, with this freedom comes responsibility. Every choice I make now affects the architecture and involves tradeoffs. As much as I want to start working on the deliverables, I also want to make sensible decisions that can withstand scrutiny.

My hope with Practical Lakehouse Architecture was that it would help me with critical areas like observability, CI/CD, and security. Because it’s not that there isn’t advice online…

Advice Spread Thin

Lakehouse architectures are relatively recent in the data landscape. As a result, they are not as well understood as data warehouses and data lakes, and some aspects of Lakehouse architecture are still evolving.

Many Lakehouse resources are either brief overviews, opinionated deep dives into specific use cases or marketing posts presented as best practices. This makes it hard to find balanced advice. My hope with Practical Lakehouse Architecture was that it would offer clear, unbiased views.

Professional Curiosity

As of 2025, I’ve spent nearly a decade in technical data roles. And in that time I’ve seen massive changes in data management, ranging from a server cupboard in Stockport to huge, multi‑region distributed data platforms.

Over the years, I’ve cultivated a passion for data technology, evolving from writing blog posts and speaking at meetups to working as an AWS consultant. As an AWS Community Builder in the Data category, I can access early previews and best practices from AWS experts. Additionally, as an AWS User Group Leader, I help attendees and guest speakers discuss data patterns.

With this in mind, I was curious about what new insights Practical Lakehouse Architecture could offer me.

Book Review

Onto the review! In this section, I’ll summarise the chapters and examine what stood out in each.

Chapters 1 – 3

The first set of chapters introduces the foundations of Lakehouse architecture, comparing it with traditional models and exploring the importance of storage in modern data platforms.

Chapter 1: Introduction to Lakehouse Architecture lays the groundwork for the book, putting all readers on equal footing for the chapters ahead. Gaurav starts by defining and exploring the ideas and concepts of various data architectures. He then examines the characteristics, evolution and benefits of the Lakehouse architecture.

Chapter 1 can be viewed on the O’Reilly site.

Chapter 2: Traditional Architectures and Modern Platforms contrasts the Lakehouse architecture with traditional data lakes and data warehouses, outlining the benefits and limitations of each. Gaurav then shifts his focus to how modern cloud platforms have transformed these traditional architectures.

I like how Gaurav hasn’t dismissed lakes and warehouses here. Both are proven and well-understood options, and in certain situations they are still a better choice than a Lakehouse.

Chapter 3: Storage: The Heart Of The Lakehouse examines the various factors surrounding data storage. Gaurav looks at row-based and column-based storage formats. He then explains the features and uses of Parquet, ORC, and Avro. He also compares newer open table formats, like Iceberg, Hudi, and Delta Lake, highlighting their similarities, differences, and use cases.

This is one area where the book really shines. Having topics like this explained clearly in one place, without having to go online, is incredibly useful!
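As a rough illustration of the row-based versus column-based distinction, here is a minimal Python sketch. It is not taken from the book: the records, field names and the `to_columns` helper are all made up for illustration, and real formats add encoding, compression and metadata on top of this basic idea.

```python
# Hypothetical records, invented purely for illustration.
rows = [
    {"order_id": 1, "country": "UK", "amount": 20.0},
    {"order_id": 2, "country": "IN", "amount": 35.5},
    {"order_id": 3, "country": "UK", "amount": 12.5},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row-based layout: one whole record is available in a single lookup.
second_order = rows[1]

# Column-based layout: an aggregate only touches the column it needs.
total_amount = sum(columns["amount"])
```

Row-oriented formats such as Avro suit record-at-a-time writes and reads, while columnar formats such as Parquet and ORC suit analytical queries that scan a few columns across many records.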

Chapters 4 – 6

Next, these chapters focus on the operational and organisational elements of Lakehouse architectures. Topics include metadata management, compute engines, and governance. These elements are essential for effectively scaling and securing a modern data platform.

Chapter 4: Data Catalogs explores the purpose of data catalogs and the different types of metadata they can contain. It explains how catalogs support essential processes such as classification, governance, and lineage. Gaurav also compares data catalog implementations across AWS, Azure, and GCP.

Including multi-cloud examples both broadens the chapter’s scope and reinforces the cloud-agnostic nature of Lakehouse architecture – an important theme of the book.

Chapter 5: Compute Engines for Lakehouse Architectures examines compute options for batch and real-time data processing. Gaurav covers open-source tools such as Spark, Flink, and Presto, as well as cloud-native services like AWS Glue, Google BigQuery, and Databricks. He offers practical advice for selecting a compute engine, considering factors such as provisioning complexity, open-source support and AI/ML capabilities.

Chapter 6: Data and AI Governance and Security in Lakehouse Architecture explores governance and security, crucial areas for any production-ready data platform. Gaurav discusses core topics such as data quality, ownership, sensitivity and compliance. He also explores how governance responsibilities span both business and technical domains, emphasising the importance of organisational roles in maintaining control and oversight.

Chapters 7 – 9

Finally, these chapters focus on the practical realities of Lakehouse implementation – moving between theory and practice, and looking ahead to the architecture’s potential future.

Chapter 7: The Big Picture: Designing and Implementing a Lakehouse Platform examines considerations ranging from requirements gathering to defining business goals. Recommended Lakehouse zones are analysed and explained, and the expectations for each zone are defined. Finally, CI/CD is considered, and a sample design questionnaire is provided to help guide implementation planning.

Zones, or layers, are currently one of the most contentious areas of Lakehouse architectures. I like Gaurav’s stance on this – it’s somewhat similar to Simon Whiteley’s. Yup – this video again.

Chapter 8: Lakehouse in the Real World does something I don’t see often – contrasting ideal scenarios with real-world events. It covers key stages in a Lakehouse’s development like analysis, testing and maintenance, examining what could go wrong and offering mitigation strategies.

This section is definitely accurate, as I’ve encountered some of these factors! It includes comparing greenfield and brownfield implementations, examining how business constraints affect technology choices, and considering if the desired RPO and RTO targets are financially and logistically possible.

Finally, Chapter 9: Lakehouse Of The Future looks ahead, exploring how Lakehouses might evolve in the years to come. Gaurav discusses potential intersections with trends like Data Mesh, Zero ETL and AI model integration. He also introduces emerging technologies like Delta UniForm and Apache XTable, which aim to improve interoperability across data processing systems and query engines. Finally, he touches on future innovations such as Apache Puffin and Ververica Streamhouse that could further transform the data landscape.

(Sidenote: this Dremio post explores UniForm and XTable very well.)

Thoughts

Having finished the book (in two weeks no less!), here are my thoughts:

Firstly, it’s not an intimidating read. At 283 pages, Practical Lakehouse Architecture is authoritative and content-rich without being overly complex or wordy. It also uses familiar O’Reilly conventions and style. When placed next to similar books I own, like The Data Warehouse Toolkit (600 pages) and Designing Data-Intensive Applications (614 pages), it’s easier to pick up and get into. And with some books, that’s a battle in itself!

Also, Practical Lakehouse Architecture’s flow is natural and the chapters make their points well. I find some technical books, including some O’Reilly ones, hard to follow because they feel disjointed and jargon-heavy. That wasn’t the case here. The book held my attention throughout, and will serve me well as a future reference point.

Practical Lakehouse Architecture also feels like it will be relevant for a while. Some of my technical books have sections that are now outdated due to rapid technological changes. Here, ideas such as decoupled storage and compute, unified governance, and data personas will continue to matter for years to come.

Overall, an excellent book that I enjoyed reading.

Summary

In this post, I reviewed Gaurav Ashok Thalpati’s 2024 book ‘Practical Lakehouse Architecture’, published by O’Reilly Media.

Ultimately, Practical Lakehouse Architecture is a well-written and informative book that caters to a wide range of skills. It’s a strong addition to the O’Reilly catalogue and complements titles like Rukmani Gopalan’s 2022 book, The Cloud Data Lake, which I’m currently reading. It’s a great knowledge source for this constantly evolving modern data architecture.


Thanks for reading ~~^~~
