Monday, January 15, 2024

Unveiling Lakehouse – Comparing Data Lakehouse and PaaS DW, Part 5

Exploring the Data Lakehouse and PaaS Data Warehouse

This marks the last article in a series where we’ve delved into the world of the data lakehouse, examining it independently and as a potential substitute for the data warehouse. In case you missed the first article, you can find it here.

In our previous discussions, we often set the data warehouse up as a bit of a strawman: we mainly compared the data lakehouse with traditional data warehouse setups, almost as if cloud-native concepts had never been applied to data warehouses, as if data warehouse architecture were frozen in time.

So far, however, I haven't really touched on the platform-as-a-service (PaaS) or query-as-a-service (QaaS) data warehouse. I haven't treated these offerings as innovations comparable, in both capability and cloud-native design, to the equally novel data lakehouse.

Although not explicitly discussed before, this idea has lingered in the background. In a previous article, I highlighted that data warehouse architecture is more of a technical guideline than a strict technology rulebook. Instead of specifying how to build a data warehouse, it outlines what the system should do and how it should behave, detailing the necessary features and capabilities.

This implies that there are multiple ways to implement a data warehouse, and the requirements of data warehouse architecture don’t necessarily clash with those of cloud-native design. Moreover, the cloud-native data warehouse shares quite a few commonalities with the data lakehouse, even as it diverges in crucial aspects.

With this foundation, let’s now shift our focus to the ultimate questions of this series: What similarities exist between the data lakehouse and the PaaS data warehouse, and where do they differ?

PaaS Data Warehouse: A Lot Like the Data Lakehouse

The PaaS data warehouse and the data lakehouse share many similarities. Just like the data lakehouse, the PaaS data warehouse:

  • Resides in the cloud.
  • Separates compute, storage, and other resources.
  • Can adjust its size based on demand spikes, seasonal use, or specific events.
  • Responds to events by provisioning or removing compute and storage resources.
  • Locates itself close to other cloud services, including the data lake.
  • Writes and reads data from cost-effective cloud object storage, similar to the data lake/house.
  • Can query and provide access to data in the data lake's various zones (see the sketch after this list).
  • Doesn't necessarily need complex data modeling, opting instead for flat or one-big-table (OBT) schemas.
  • Handles semi- and multi-structured data, managing and performing operations on them.
  • Executes queries across diverse data models like time-series, document, graph, and text.
  • Presents denormalized views (models) for specific use cases and applications.
  • Offers various RESTful endpoints, not just SQL.
  • Supports GraphQL, Python, R, Java, and more through distinct APIs or language-specific SDKs.
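
To make the overlap concrete, here is a minimal sketch of the pattern both systems support: a SQL engine querying Parquet objects directly in a lake's curated zone. DuckDB stands in for a generic SQL query service, and the path and column names are hypothetical.

```python
# A minimal sketch: a SQL engine querying Parquet objects in place in a
# data lake's curated zone. DuckDB stands in for a generic SQL query
# service; the path and column names are hypothetical.
import duckdb

con = duckdb.connect()

# No load step into warehouse-managed storage: the query reads the
# Parquet objects directly.
result = con.execute(
    """
    SELECT region, SUM(amount) AS total_sales
    FROM read_parquet('curated_zone/sales/*.parquet')
    GROUP BY region
    ORDER BY total_sales DESC
    """
).fetchall()

print(result)
```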

Tighter Connections in PaaS Data Warehouse

Compared with the data lakehouse, the cloud-native data warehouse is more tightly coupled. This means the warehouse has greater control over how data is read, written, scheduled, distributed, and operated on. It can also manage dependencies between these operations and enforce consistency, uniformity, and replicability safeguards. In simpler terms, it can enforce strict ACID safeguards.
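
As a toy illustration of what ACID safeguards buy you (using SQLite purely for convenience, not as a stand-in for any vendor's platform), note that a failure mid-transaction rolls back every change:

```python
# A toy illustration of ACID atomicity (SQLite used purely for
# convenience, not as any vendor's engine): a failure mid-transaction
# rolls back every change made inside it.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")

try:
    with con:  # opens a transaction; commits on success, rolls back on error
        con.execute("INSERT INTO sales (amount) VALUES (100.0)")
        raise RuntimeError("simulated failure mid-transaction")
except RuntimeError:
    pass

# The insert was rolled back: the table is still empty.
print(con.execute("SELECT COUNT(*) FROM sales").fetchone()[0])  # 0
```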

On the other hand, the “ideal” data lakehouse is constructed from separate, purpose-specific services. For instance, this ideal implementation includes a SQL query service on top of a data lake service, which sits on a cloud object storage service. This design trend breaks down large programs into smaller, function-specific services that interact with minimal knowledge about each other. While this approach offers benefits, especially in terms of design flexibility, it also introduces challenges in managing concurrent computing, as discussed in the third article of this series.

Solving this problem in an ideal data lakehouse implementation is not straightforward. Databricks takes a different approach by coupling the data lake and data lakehouse into a single platform. This way, the data lakehouse can potentially enforce ACID-like safeguards. However, this also means tightly coupling the data lakehouse and the data lake, creating a dependence on a single software platform and provider.
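
For the curious, here is a minimal sketch of the kind of transactional operation this coupling enables: an atomic upsert against a Delta table using the open-source delta-spark package. The path and schema are hypothetical.

```python
# A minimal sketch of an atomic upsert on a Delta table, assuming the
# pyspark and delta-spark packages are installed. The path and schema
# are hypothetical.
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("acid-upsert")
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Seed an initial Delta table.
spark.createDataFrame([(1, 100.0)], ["id", "amount"]) \
    .write.format("delta").mode("overwrite").save("/tmp/sales_delta")

updates = spark.createDataFrame([(1, 150.0), (2, 200.0)], ["id", "amount"])

# The MERGE commits atomically: concurrent readers see the old version
# or the new one, never a partial write.
target = DeltaTable.forPath(spark, "/tmp/sales_delta")
(target.alias("t")
    .merge(updates.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```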

Comparing Data Warehouse and Data Lakehouse: A Closer Look

Now, let’s explore a thought-provoking question: Can the PaaS data warehouse perform all the functions of the data lakehouse? It’s a possibility. Consider this: What sets apart a SQL query service that interacts with data in the curated zone of a data lake from a PaaS data warehouse in the same cloud environment, with access to the same underlying cloud object storage service, and the ability to perform similar tasks? What distinguishes a SQL query service offering access to data in the lake’s archival, staging, and other zones from a PaaS data warehouse capable of the same?

Over time, it seems like the data lake and the data warehouse have been moving closer together. On one side, the lakehouse appears to exemplify convergence from lake to warehouse. On the flip side, the warehouse’s support for various data models and its integration with data federation and multi-structured query capabilities—meaning the capability to query files, objects, or diverse data structures—are examples of a trend moving from warehouse to lake.

Let’s delve into some supposed differences between the data lakehouse and the data warehouse and examine if convergence has rendered these differences obsolete. Here are a few notable ones to consider:

Comparing Data Warehouse and Data Lakehouse Features: A Simplified View

  1. Enforcing Safeguards:
    • Original: Has the ability to enforce safeguards to ensure the uniformity and replicability of results.
    • Simplified: The PaaS data warehouse enforces these safeguards at least as readily, ensuring consistent and replicable results.
  2. Performing Core Workloads:
    • Original: Has the ability to perform core data warehousing workloads.
    • Simplified: The PaaS data warehouse excels at core data warehousing workloads and typically outperforms a standalone SQL query service at them.
  3. Data Modeling Requirement:
    • Original: Eliminates the requirement to model and engineer data structures prior to storage.
    • Simplified: Both PaaS data warehouse and data lakehouse benefit from basic data modeling for clarity, governance, and reuse.
  4. Protection Against Lock-In:
    • Original: Protects against cloud-service-provider lock-in.
    • Simplified: While the data lakehouse aims for flexibility, switching services may involve challenges like transferring modeling logic and data movement.
  5. Diverse Practices and Consumers:
    • Original: Has the ability to support a diversity of practices, use cases, and consumers.
    • Simplified: The data lake offers more flexibility and convenience for experimenting with data, giving it an advantage over the data warehouse.
  6. Querying Across Data Models:
    • Original: Has the ability to query against/across multiple data models.
    • Simplified: Both data lakehouse and PaaS data warehouse can query diverse data models, but challenges exist in linking information across models.

In summary, while the PaaS data warehouse and data lakehouse share some capabilities, they also have unique strengths and challenges in areas like flexibility, data modeling, and querying across different data models.

Final Thoughts on the Complementary Data Lakehouse

Let’s not underestimate the value of the data lakehouse—it’s a useful innovation. The compelling use cases we discussed earlier in this series are hard to dispute. Using the data lakehouse can be easier for time-sensitive, unpredictable, or one-off tasks, as it allows for quick action without being hindered by internal constraints.

Unlike the data warehouse, a strictly governed system with slow turnaround times, the data lakehouse offers a less strictly governed, more agile alternative. In simpler terms, the lakehouse is not here to replace the warehouse but to complement it.

The challenges discussed in this article and its counterparts arise when trying to replace the data warehouse with the data lakehouse. In this particular aspect, the data lakehouse falls short. It’s tough, if not impossible, to find a perfect solution that aligns the design requirements of an ideal data lakehouse with the technical needs of data warehouse architecture.

Unveiling Lakehouse – Data Modeling, Part 4

This is the fourth article in "Unveiling Lakehouse," a five-part series explaining the data lakehouse. The first article, "What is Data Lakehouse?", introduced the data lakehouse and explored what makes it new and different. The second article, "Explaining Data Lakehouse as Cloud-native DW," looked at the data lakehouse from a cloud-native design perspective, a significant departure from classic data warehouse architecture. The third article, "Unveiling Lakehouse – Data Warehouse Deep Dive, Part 3," explored whether the lakehouse and its architecture can replace the traditional data warehouse. The final article evaluates the differences (and some surprising similarities) between the lakehouse and the platform-as-a-service (PaaS) data warehouse.

This article examines the role of data modeling in designing, maintaining, and using the lakehouse. It evaluates the claim that the lakehouse is a lightweight alternative to the data warehouse.

Data Lakehouse vs. Data Warehouse: Making It Simple

Supporters argue that the lakehouse is a better replacement for traditional data warehouses, citing some extra benefits. Firstly, they claim that the lakehouse simplifies data modeling, making ETL/data engineering easier. Secondly, there’s a supposed cost reduction in managing and maintaining ETL code. Thirdly, they argue that the absence of data modeling makes the lakehouse less likely to “break” due to routine business changes like mergers, expansions, or new services. In essence, the lakehouse remains resilient because there’s no data model to break.

How Data Is, or Isn’t, Modeled for the Data Lakehouse

Let’s break down what this means by looking at an ideal scenario for modeling in the data lakehouse:

  1. Data enters the data lake’s landing zone.
  2. Optionally, some or all raw data is stored separately for archival purposes.
  3. Raw data or predefined extracts move into one of the data lake’s staging zones, which may be separate for different user types.
  4. Immediate data engineering, like scheduled batch ETL transformations, can be applied to raw OLTP data before loading it into the data lake’s curated zone.
  5. Data in staging zones becomes available to various jobs and expert users.
  6. A portion of data in staging zones undergoes engineering and moves into the curated zone.
  7. Data in the curated zone undergoes light modeling, such as being stored in an optimized columnar format.
  8. The data lakehouse acts as a modeling overlay, like a semantic model, superimposed over data in the curated zone or optionally over selected data in staging zones.
  9. Beyond this light engineering, data in the curated zone remains otherwise unmodeled. In the data lakehouse, modeling is handled by specific logical models for applications or use cases, similar to denormalized views.

For instance, instead of extensively engineering data for storage and management by a data warehouse (usually an RDBMS), the data is lightly engineered, like being put into a columnar format, before being established in the data lake’s curated zone. This is where the data lakehouse comes into play.
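
As a rough sketch of this light engineering, assuming pyarrow and hypothetical paths and columns, the step below converts a raw CSV extract into partitioned Parquet in the curated zone without remodeling anything:

```python
# A minimal sketch of "light engineering": converting a raw CSV extract
# into a columnar (Parquet) layout, with no logical remodeling. Paths
# and the partition column are hypothetical; assumes pyarrow.
import pyarrow.csv as pv
import pyarrow.parquet as pq

# Read a raw extract from a staging zone.
table = pv.read_csv("staging_zone/orders_2024_01.csv")

# Write it to the curated zone as Parquet, partitioned for efficient scans.
pq.write_to_dataset(
    table,
    root_path="curated_zone/orders",
    partition_cols=["order_date"],  # hypothetical column
)
```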

Simplifying Data Volume Choices in the Lakehouse

How much data should be in the lakehouse’s curated zone? Well, the simple answer is: as much or as little as you prefer. But, in practice, it really depends on what the data lakehouse is meant to do – the uses, practices, and the people who will be using it. Let’s dig into this idea a bit.

Firstly, let's understand what happens to the data once it's loaded into the data lake's curated zone. Typically, the data in this zone is stored in a columnar format like Apache Parquet, spread across many Parquet objects living in object storage. Here's why the curated zone often goes for a simple data model, like a flat or one-big-table (OBT) schema. In simple terms, it means putting all the data in one denormalized table. Why? This maximizes the benefits of object storage (high bandwidth and steady throughput, at low cost) while minimizing the impact of its chief drawback (higher, less predictable latency than block storage). One big plus, according to lakehouse supporters, is that this approach eliminates the need for the complex logical data modeling typically done in 3NF or Data Vault modeling, or the dimensional data modeling seen in Kimball-type data warehouse design. It's a big time-saver, they say.
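
Here is a toy sketch of the OBT idea: collapsing a small fact/dimension layout into one denormalized table. The tables and columns are invented for illustration, and pandas is assumed.

```python
# A toy sketch of one-big-table (OBT) modeling: joining a fact table and
# a dimension table into a single denormalized table. All tables and
# columns are invented; assumes pandas (and pyarrow for to_parquet).
import pandas as pd

orders = pd.DataFrame(
    {"order_id": [1, 2], "customer_id": [10, 11], "amount": [250.0, 99.0]}
)
customers = pd.DataFrame(
    {"customer_id": [10, 11],
     "customer_name": ["Acme", "Globex"],
     "region": ["EMEA", "APAC"]}
)

# One wide table: queries scan a single denormalized table, which plays
# to object storage's sequential throughput and avoids join-time seeks.
obt = orders.merge(customers, on="customer_id", how="left")

obt.to_parquet("curated_zone/orders_obt.parquet")
print(obt)
```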

Rethinking Data Modeling in Warehouses

But hold on, isn’t this how data is modeled in some data warehouse systems?

The catch here is that data warehouse systems also often use flat-table and one-big-table (OBT) schemas. In fact, OBT schemas date back to the first data warehouse appliances of the early 2000s. Even today, cloud platform-as-a-service (PaaS) data warehouses like Amazon Redshift and Snowflake commonly use OBT schemas. So, if you're not keen on heavy-duty data modeling for the data warehouse, you don't have to do it. Many organizations choose to skip it.

Now, here’s the head-scratcher: Why bother modeling data for the warehouse in the first place? What’s the big deal for data management experts?

The thing is, whether we like it or not, data modeling and engineering are tightly linked to the core priorities of data management, data governance, and data reuse. We model data to handle it better, govern it, and (a mix of both) reuse it. When we model and engineer data for the warehouse, we aim to keep tabs on its origin, track the changes made to it, know when these changes happened, and importantly, who or what made them. (By the way, the ETL processes used to fill the data warehouse generate detailed technical metadata about this.) Similarly, we manage and govern data to make it available and discoverable by a broader audience, especially those who aren’t data experts.
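
As a loose illustration, assuming invented fields and a stand-in transform, here is the kind of technical metadata an ETL step might emit about a dataset's origin, the change applied, and who or what applied it:

```python
# A loose sketch of ETL emitting technical metadata: where a dataset came
# from, what was done to it, when, and by what. Fields and the transform
# are invented for illustration.
import json
from datetime import datetime, timezone

def normalize_amounts(rows):
    """Stand-in transform: round amounts to two decimal places."""
    return [{**r, "amount": round(r["amount"], 2)} for r in rows]

raw = [{"order_id": 1, "amount": 250.456}]
curated = normalize_amounts(raw)

lineage_record = {
    "source": "staging_zone/orders_2024_01.csv",   # hypothetical
    "target": "curated_zone/orders",               # hypothetical
    "transform": normalize_amounts.__doc__,
    "run_at": datetime.now(timezone.utc).isoformat(),
    "run_by": "etl_job_orders_daily",              # hypothetical job name
    "row_count": len(curated),
}
print(json.dumps(lineage_record, indent=2))
```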

To sum it up, we model data so we can grasp it, bring some order to it, and turn it into well-managed, governed, and reusable data collections. This is why data management experts insist on modeling data for the warehouse. In their view, this focus on engineering and modeling makes the warehouse suitable for a wide range of potential applications, use cases, and consumers. This stands out from alternatives that concentrate on engineering and modeling data for a semantic layer or embed data engineering and modeling logic directly in code. Such alternatives usually target specific applications, use cases, and consumers.

Navigating Challenges in Data Modeling

Let’s talk about the challenges with data modeling.

One issue is that the typical anti-data modeling perspective can be misleading. If you avoid modeling at the data warehouse/lakehouse layer, you end up focusing on data modeling in another layer. Essentially, you’re still working on modeling and engineering data, just in different places like a semantic layer or directly in code. And guess what? You still have code to take care of, and things can (and will) go awry.

Consider this scenario: A business used to treat Europe, the Middle East, and Africa (EMEA) as one region, but suddenly decides to create separate EU, ME, and Africa divisions. Making this change requires adjustments to the data warehouse’s data model. However, it also impacts the denormalized views in the semantic layer. Modelers and business experts need to update or even rebuild these views.
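
Here is a toy sketch of why such a change ripples. The country-to-region mapping and the "view" are invented for illustration; the point is that every downstream artifact keyed on "EMEA" must be found and updated:

```python
# A toy sketch of how a business change breaks modeling logic downstream.
# The country-to-region mapping and the "view" are invented for
# illustration.

# Before the reorganization: one EMEA region.
region_of = {"DE": "EMEA", "AE": "EMEA", "ZA": "EMEA", "JP": "APAC"}

# After it: the mapping must be rebuilt, and every view, report, and
# query keyed on "EMEA" must be found and updated.
region_of = {"DE": "EU", "AE": "ME", "ZA": "Africa", "JP": "APAC"}

def sales_by_region(rows):
    """Stand-in for a denormalized semantic-layer view."""
    totals = {}
    for row in rows:
        region = region_of[row["country"]]
        totals[region] = totals.get(region, 0.0) + row["amount"]
    return totals

print(sales_by_region([{"country": "DE", "amount": 100.0},
                       {"country": "AE", "amount": 50.0}]))
# {'EU': 100.0, 'ME': 50.0} -- anything expecting an 'EMEA' key now breaks.
```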

The claim here is that it’s supposedly easier, faster, and cheaper to fix issues in a semantic layer or in code than to make changes to a central repository like a data warehouse or a data lakehouse. This claim isn’t entirely wrong, but it’s a bit biased. It comes from a somewhat distorted understanding of how and why data gets modeled, whether it’s for the traditional data warehouse or the modern data lakehouse.

Both sides of this debate have valid concerns and good points. It’s ultimately about finding the right balance between the costs and benefits.

Key Points to Consider

Let’s wrap up with some important thoughts.

Assuming that the lakehouse eliminates the need for data modeling and makes ETL engineering less complex overlooks the essential role of data modeling in managing data. It’s like playing a game of moving tasks around—you can’t escape the work; you can only shift it elsewhere.

Adapting to changes in the business is never straightforward. Altering something about the business breaks the alignment between a data model that represents events in the business world and reality itself. While it might seem easier to move most data modeling logic to a BI/semantic layer, that comes with its own set of challenges. When changes happen, modelers still need to design a new warehouse data model, repopulate the data warehouse, and fix broken queries and procedures. In addition, they must repair the modeling logic in the BI/semantic layer, which adds extra work.

This challenge isn’t unique to data warehouses; it’s equally relevant for organizations implementing data lakehouse systems. The concept of a lightly modeled historical repository for business data is not new. If you choose to avoid modeling for the data lakehouse or warehouse, that’s an option, but it has been available for some time.

On the flip side, an organization that chooses to model data for its lakehouse should have less modeling to do in its BI/semantic layer, perhaps much less. The data in this lakehouse becomes clearer and more understandable to a larger audience, making it more trustworthy.

Interestingly, a less loosely coupled data lakehouse implementation, like Databricks' Delta Lake or Dremio's SQL Lakehouse Platform, has an advantage over an "ideal" implementation composed of loosely coupled services. It makes more sense to model and govern data in a tightly coupled data lakehouse implementation, where the lakehouse has control over the business's data. How the same could be achieved in an implementation where a SQL query service lacks control over objects in the curated zone of the underlying data lake is much less clear.