Wednesday, September 6, 2023

Unveiling Lakehouse – Data Warehouse Deep Dive Part3

 

This is this article we're looking at the good and not-so-good sides of the data warehouse and its potential replacement, the data lakehouse. In this article, we're checking out the things the data lakehouse needs to meet if it's going to fully replace the traditional warehouse.

The initial article “What is DataLakehouse?” introduces the data warehouse and examines its unique features. In the second article “ExplainingData Lakehouse as Cloud-native DW“, we explore data lakehouse architecture, aiming to adjust the essential requirements of data warehouse architecture to align with the priorities of cloud-native software design. Moving on, the fourth article will focus on the role of data modelling in creating, maintaining, and utilizing the lakehouse. Lastly, the final article will evaluate both the differences and the equally important similarities between the lakehouse and the platform-as-a-service (PaaS) data warehouse.

 

A Quick Recap of Data Lakehouse Architecture

The ideal data lakehouse architecture is like a puzzle where each piece works independently, unlike the classic data warehouse architecture. When I say "ideal," I mean the perfect design of this architecture. For instance, it breaks down the data warehouse capabilities into basic software functions (explained in the “Explaining Data Lakehouse as Cloud-native DW”) that operate as separate services.

These services are "loosely coupled," meaning they communicate through well-designed APIs. They don't need to know the internal details of the other services they interact with. Loose coupling is a fundamental principle of cloud-native software design, as discussed in previous articles. The ideal lakehouse is created by stacking these services on top of each other, allowing us, in theory, to replace one service's functions with another.

An alternative, practical approach links the data lake and data lakehouse services. Prominent providers like Databricks and Dremio have adopted this approach in their combined data lake/house implementations. This practical method has advantages compared to the ideal data lakehouse architecture, as we'll explore.

It's crucial to understand that while the tightly connected nature of a classic data warehouse has downsides, it also has advantages. Loose coupling can be a point of failure, especially when coordinating multiple, transaction-like operations in a distributed software architecture with independent services.

 

The Technical Side of Data Warehouse Architecture

Let's break down the formal, technical requirements of data warehouse architecture. To understand if the data lakehouse can truly replace the data warehouse, we need to see if its capabilities align with these requirements.

From a data warehouse perspective, what matters most is not just getting query results quickly but ensuring these results are consistent and reproducible. Striking a balance between speed, uniformity, and reproducibility is a real challenge.

Implementing this is trickier than it sounds. That's why solutions like Hive + Hadoop struggled as data warehouse replacements. Even distributed NoSQL systems often face issues when trying to step into the shoes of traditional databases or data warehouses.

Now, let's go through the specific requirements of data warehouse architecture:

  1. Central Data Repository: It serves as a single, central storage for business data, both current and historical.
  2. Panoptic View: Allows a comprehensive view across the entire business and its functional areas.
  3. Monitoring/Feedback Loop: Enables monitoring and feedback mechanisms into the business's performance.
  4. User Queries: Supports users in asking common or unpredictable (ad hoc) questions.
  5. Consistent Query Results: Ensures that everyone gets the same data through consistent and uniform query results.
  6. Concurrent Workloads: Handles concurrent jobs and users along with demanding mixed workloads.
  7. Data Management Controls: Enforces strict controls on data management and processing.
  8. Conflict Resolution: Anticipates and resolves conflicts arising from the simultaneous requirements of consistency, uniformity, and data processing controls.

Does the data lakehouse meet these criteria? It depends on how you implement the architecture. If you set up your lakehouse by using a SQL query service on a curated data lake section, you'll likely address requirements 1 through 4. However, handling requirements 5 through 8, which involve enforcing consistency and managing conflicts during concurrent operations, can be challenging for this type of implementation.

 

Reality Check: Maintaining Data Integrity Matters

In a typical, closely connected data warehouse setup, the warehouse often uses a relational database, or RDBMS. Most RDBMSs have safeguards known as ACID, ensuring they can handle multiple operations on data simultaneously while maintaining strong consistency.

While ACID safeguards are commonly linked with online transaction processing (OLTP) and RDBMS, it's essential to clarify that a data warehouse isn't an OLTP system. You don't necessarily need to set up a data warehouse on an RDBMS.

To simplify, the database engine in a data warehouse requires two things: a data store that can create and manage tables, and logic to resolve conflicts arising from concurrent data operations. It's possible to design the data warehouse as an append-only data store, committing new records over time, like adding new rows. With this approach, you avoid concurrency conflicts by only appending new records without changing or deleting existing ones. Coordination logic ensures that multiple users or jobs querying the warehouse simultaneously get the same records.

However, in reality, the most straightforward way to meet these requirements is by using an RDBMS. An RDBMS is optimized to efficiently perform essential analytical operations, like various types of joins. This is why the traditional on-premises data warehouse is often synonymous with the RDBMS. Attempts to replace it with alternatives like Hadoop + Hive have typically fallen short.

It's also why nearly all Platform-as-a-Service (PaaS) data warehouse services mimic RDBMS systems. As mentioned in a Explaining Data Lakehouse as Cloud-native DW article, if you choose to avoid in-database ACID safeguards, you must either build ACID logic into your application code, create and manage your own ACID-compliant database, or delegate this responsibility to a third-party database. In essence, maintaining data integrity is crucial.

 

Ensuring Data Consistency in Workloads

Whether we like it or not, production data warehouse workloads demand consistency, uniformity, and replicability. Imagine core business operations regularly querying the warehouse. In a real-world scenario, the data lakehouse replacing it must handle hundreds of such queries every second.

Let's break it down with an example – think of a credit application process that queries the lakehouse for credit scores multiple times per second. Statutory and regulatory requirements demand that simultaneous queries return accurate results, using the same scoring model and point-in-time data adjusted for customer variations.

Now, what if a concurrent operation tries to update the data used for the model's parameters? In a traditional RDBMS setup, ACID safeguards ensure this update only happens after committing the results of dependent credit-scoring operations.

Can a SQL query service do the same? Can it maintain these safeguards even when objects in the data lake's curated zone are accessible to other services, like an AWS Glue ETL service, which may update data simultaneously?

This example is quite common in real-world scenarios. In simple terms, if you want consistent, uniform, and replicable results, you need ACIDic safeguards. This is why data warehouse workloads insist on having these safeguards in place.

 

Can Data Lakehouse Architecture Ensure These Safeguards?

The answer isn't straightforward. The first challenge revolves around the difficulty of coordinating operations across loosely connected services. For instance, how can an independent SQL query service limit access to records in an independent data lake service? This limitation is crucial to prevent multiple users from changing items in the lake's curated area. In a tightly connected RDBMS, the database kernel handles this by locking rows in the table(s) where dependent data is stored, preventing other operations from altering them. The process is not as clear-cut in data lakehouse architecture with its layered stack of detached services.

A well-designed data lakehouse service should be able to enforce safeguards similar to ACID—especially if it controls concurrent access and modifications to objects in its data lake layer. Databricks and Dremio have addressed this challenge in their data lakehouse architecture implementations. They achieve this by reducing the loose coupling between services, ensuring more effective coordination of concurrent access and operations on shared resources.

However, achieving strong consistency becomes much tougher when the data lakehouse is structured as a stack of loosely connected, independent services. For example, having a distinct SQL query service on top of a separate data lake service, which sits on its own object storage service. In such a setup, it becomes challenging to ensure strong consistency because there's limited control over access to objects in the data lake.

 

Closing Thoughts: Navigating Distributed Challenges

In any distributed system, the main challenge is coordinating simultaneous access to shared resources while handling various operations on these resources across different locations and times. This applies whether software functions and their resources are closely or loosely connected.

For instance, a classic data warehouse tackles distributed processing by becoming a massively parallel processing (MPP) database. The MPP database kernel efficiently organizes and coordinates operations across nodes in the MPP cluster, resolving conflicts between operations. In simple terms, it makes sure it can enforce strict ACID safeguards while dealing with multiple operations happening at the same time across different places.

On the flip side, a loosely connected distributed software architecture, like data lakehouse architecture, deals with the challenge of coordinating access and managing dependencies across essentially independent services. It's a tricky problem.

This complexity is one reason why the data lakehouse, much like the data lake itself, typically operates as what's called an eventually consistent platform rather than a strongly consistent one.

On one hand, it can enforce ACID-like safeguards; on the other hand, it may lose data and struggle to consistently replicate results. Enforcing strict ACID safeguards would mean combining the data lakehouse and the data lake into one platform—closely connecting both services to each other. This seems to be the likely direction in the evolution of data lake/lakehouse concepts, assuming the idea of the data lakehouse sticks around.

However, implementing the data lakehouse as its own data lake essentially mirrors the evolution of the data warehouse. It involves closely connecting the lakehouse and the lake, creating a dependency on a single software platform and provider.

Stay tuned for the next article in this series, where we'll explore the use of data modeling with the data lakehouse.