Nouvelle MOMENTUM

Data Warehouse and Data Lake, when 1 + 1 > 2

The terms data warehouse (DW) and data lake (DL) are both used to refer to the storage of large volumes of data, but they are not interchangeable. Let's define these two concepts, look at how they combine, and see how a company can put them to concrete use.

The difference between a Data Warehouse and a Data Lake

Data Warehouse 

A DW is a repository of structured, filtered data that has already been transformed for a specific purpose. It is a platform used to collect and analyze data from multiple, heterogeneous sources. Data stored in a DW is typically cleansed, organized and optimized to meet business needs.

 
Here are some key features:

  • Data structure: Data stored in a DW has been transformed and cleansed, so it is ready to be used for decision-making.

  • Target users: Specialists and business analysts access DW data.

  • Accessibility: Data access is more complex and costly, but queries are optimized.

 

Let's take a large retail company as an example. It uses a DW to store and analyze sales, inventory and customer data. Marketers can access the reports to make decisions on promotions and advertising campaigns.


In this example, the DW contains transformed data, such as sales by region, profit margins, and seasonal trends. This information is cleaned and organized upstream to facilitate analysis by marketing and/or sales analysts.
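To make the retail example more concrete, here is a minimal sketch of what querying already-cleansed warehouse data can look like. It uses an in-memory SQLite database as a stand-in for a real DW; the `sales` table, its columns and the sample figures are assumptions made for illustration only.

```python
import sqlite3

# Hypothetical, already-cleansed sales rows: (region, revenue, cost).
CLEAN_SALES = [
    ("North", 1200.0, 800.0),
    ("North", 800.0, 600.0),
    ("South", 500.0, 450.0),
]


def build_warehouse(rows):
    """Load cleansed rows into an in-memory table with a fixed schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales ("
        "region TEXT NOT NULL, revenue REAL NOT NULL, cost REAL NOT NULL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def margin_by_region(conn):
    """Return {region: (total_revenue, margin)}, margin as a fraction of revenue."""
    query = """
        SELECT region,
               SUM(revenue),
               (SUM(revenue) - SUM(cost)) / SUM(revenue)
        FROM sales
        GROUP BY region
        ORDER BY region
    """
    return {
        region: (revenue, round(margin, 3))
        for region, revenue, margin in conn.execute(query)
    }


if __name__ == "__main__":
    warehouse = build_warehouse(CLEAN_SALES)
    for region, (revenue, margin) in margin_by_region(warehouse).items():
        print(f"{region}: revenue={revenue:.0f}, margin={margin:.1%}")
```

Because the schema is fixed and the data is cleaned upstream, the analyst's query stays short: all the preparation work happened before loading.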


Data lake

A data lake is a vast repository of raw data whose purpose has not yet been defined. It stores untransformed data, whether structured, semi-structured or unstructured.


Here are a few key characteristics:

  • Data structure: Raw data is malleable and well suited to machine learning. However, it requires rigorous governance to avoid turning into a data “swamp”.

 

  • Target users: Data scientists and data analysis experts access data from the data lake.

 

  • Accessibility: Data is easy to access and quick to update, but reshaping raw data for a specific analysis is more complex and costly.

 

Let's take the example of a social media company. It collects billions of tweets, photos and videos every day. Rather than transforming them immediately, it stores them in a DL.


Data scientists can mine this raw data to detect trends, analyze user sentiments, or predict viral topics. For example, they could search for popular hashtags linked to a sporting event or an election.
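As a rough illustration of mining raw data, the sketch below counts hashtags in untransformed records. The record layout (one JSON document per line, with an optional `text` field) and the sample posts are hypothetical; a real data lake would hold far larger volumes and be read with a distributed engine rather than a Python list.

```python
import json
import re
from collections import Counter

# Hypothetical raw records as they might land in the lake: one JSON document
# per line, with inconsistent fields and no cleanup applied.
RAW_POSTS = [
    '{"user": "a", "text": "What a match! #WorldCup #Final"}',
    '{"user": "b", "text": "Watching the #worldcup tonight"}',
    '{"user": "c", "caption": "Sunset photo", "media": "photo"}',
    "not valid json",
]

HASHTAG = re.compile(r"#(\w+)")


def top_hashtags(raw_lines, n=3):
    """Count hashtags across raw records, skipping anything unreadable."""
    counts = Counter()
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # raw data can be malformed; skip instead of failing
        if not isinstance(record, dict):
            continue
        text = str(record.get("text", ""))  # the field may be missing
        counts.update(tag.lower() for tag in HASHTAG.findall(text))
    return counts.most_common(n)


if __name__ == "__main__":
    print(top_hashtags(RAW_POSTS))
```

Note that the structure is applied at read time: the code has to cope with missing fields and malformed records, which is the price of storing data without transforming it first.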


Combining a Data Warehouse and a Data Lake: the best of both worlds

Now let's combine the two concepts: the flexibility and scalability of a data lake with the data management features and ACID transactions of a data warehouse. The result is what is now called a data lakehouse, or simply a lakehouse.
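To illustrate what an ACID transaction guarantees, here is a small sketch of atomicity: either every change in a transaction is applied, or none is. SQLite stands in for the transactional engine, and the `stock` table, location names and quantities are hypothetical; this is not how any particular lakehouse product implements transactions.

```python
import sqlite3


def make_inventory():
    """Create a hypothetical stock table; the CHECK forbids negative stock."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE stock ("
        "location TEXT PRIMARY KEY, qty INTEGER NOT NULL CHECK (qty >= 0))"
    )
    conn.executemany(
        "INSERT INTO stock VALUES (?, ?)", [("paris", 5), ("lyon", 0)]
    )
    conn.commit()
    return conn


def transfer(conn, source, target, qty):
    """Move stock atomically: both updates apply, or neither does."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE stock SET qty = qty + ? WHERE location = ?", (qty, target)
            )
            conn.execute(
                "UPDATE stock SET qty = qty - ? WHERE location = ?", (qty, source)
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK failed, so the first update was undone too


def stock_levels(conn):
    """Return the current quantities as {location: qty}."""
    return dict(conn.execute("SELECT location, qty FROM stock"))


if __name__ == "__main__":
    inventory = make_inventory()
    print(transfer(inventory, "paris", "lyon", 10), stock_levels(inventory))
    print(transfer(inventory, "paris", "lyon", 3), stock_levels(inventory))
```

The first transfer asks for more stock than is available, so the whole transaction is rolled back and no partial update is left behind.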


Data lakehouse

The lakehouse represents a new, unified approach to data management and analysis, merging the best elements of data lakes and data warehouses into a single platform. Unlike traditional data warehouse architectures, which tightly couple storage and compute, the lakehouse decouples storage from compute, enabling organizations to scale each independently and optimize resource utilization.

 

The benefits of lakehouse architecture

 

Bill Inmon, widely regarded as one of the fathers of the data warehouse, is a fervent advocate of lakehouse architecture.
Here are some of his views on the benefits of this architecture:

  • Flexibility: He points out that this approach offers great flexibility by enabling a wide variety of data, both structured and unstructured, to be stored in its native format. This enables companies to capture and exploit data from a variety of sources without having to transform it immediately.

 

  • Scalability: Thanks to distributed storage and processing technologies such as Hadoop and Spark, lakehouse architectures can scale to support growing data volumes without compromising performance.

 

  • Real-time analysis: This architecture is designed to support real-time analysis, enabling businesses to base decisions on data as it arrives.

 

  • Cost-efficiency: By combining the low-cost storage capabilities of data lakes with the analytical capabilities of traditional data warehouses, lakehouse architectures can offer a more cost-effective solution for large-scale data management and analysis.

Figure: Data Lakehouse
Source: Data Lakehouse from Databricks (https://www.databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html)

Why use a lakehouse architecture: The example of an e-commerce business

Let's imagine an e-commerce company that wants to improve its customer experience and optimize its operations by using a lakehouse architecture.

1. Real-time data collection
The company continuously collects data from its website, its mobile applications, its interactions with customers on social networks, as well as transactional data from its order and inventory management systems. This data is stored in a data lake in its raw format, without any initial transformation.

2. Real-time analysis
Using real-time analysis tools based on technologies such as Apache Spark, the company can analyze data continuously to detect emerging trends, real-time customer behavior, and potential problems on its website or applications.

3. Integration with the data warehouse
Relevant data is then extracted from the data lake and transformed as required for loading into the traditional DW. This enables further analysis, periodic reporting and complex queries that require cleansed and structured data.

4. Insight-based decision-making
Analysts and decision-makers use the data stored in the data warehouse to make strategic decisions, such as optimizing marketing campaigns, improving the customer experience, managing inventory, and identifying new business opportunities.

5. Training predictive models
Data from both the data lake and the data warehouse is also used to train predictive models and machine learning algorithms. These models can then be deployed in production to personalize product recommendations, detect fraud, forecast demand, and support other use cases.
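The first four steps can be sketched as a small pipeline: raw events are kept as-is, relevant ones are extracted and typed, loaded into a structured table, then queried. Everything here is an assumption for illustration: the event format, the field names, and the use of a Python list and an in-memory SQLite table in place of a real data lake and data warehouse.

```python
import json
import sqlite3

# Step 1: hypothetical raw events stored untransformed in the lake (JSON lines).
RAW_EVENTS = [
    '{"type": "order", "order_id": 1, "sku": "A1", "qty": 2, "unit_price": "19.90"}',
    '{"type": "page_view", "url": "/product/A1"}',
    '{"type": "order", "order_id": 2, "sku": "B7", "qty": 1, "unit_price": "5.00"}',
    '{"type": "order", "order_id": 3, "sku": "A1", "qty": null, "unit_price": "19.90"}',
]


def extract_orders(raw_lines):
    """Step 3: keep only complete order events and cast fields to proper types."""
    for line in raw_lines:
        event = json.loads(line)
        if event.get("type") != "order" or event.get("qty") is None:
            continue  # not an order, or incomplete: leave it in the lake
        yield (
            event["order_id"],
            event["sku"],
            int(event["qty"]),
            float(event["unit_price"]),
        )


def load_warehouse(orders):
    """Step 3 (continued): load the cleansed rows into a structured table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        "order_id INTEGER PRIMARY KEY, sku TEXT, qty INTEGER, unit_price REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", orders)
    conn.commit()
    return conn


def units_by_sku(conn):
    """Step 4: a reporting query that analysts could run on the structured data."""
    return dict(conn.execute("SELECT sku, SUM(qty) FROM orders GROUP BY sku"))


if __name__ == "__main__":
    warehouse = load_warehouse(extract_orders(RAW_EVENTS))
    print(units_by_sku(warehouse))
```

The page view and the incomplete order never reach the structured table, but they remain available in the raw data for other uses, such as the model training described in step 5.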


A winning combination? 


The data lakehouse concept is fascinating, even if it may seem paradoxical at first glance. It shows how combining a data warehouse and a data lake can create a whole greater than the sum of its parts. Which is why, in this case, 1 + 1 > 2!

 

Written by Sanzio Castor, Data Analyst

References


Building the Data Lakehouse by Bill Inmon
What is a Data Lake? https://azure.microsoft.com/en-us/resources/cloud-computing-dictionary/what-is-a-data-lake/
Data lake vs data warehouse https://www.talend.com/fr/resources/data-lake-vs-data-warehouse/
A quoi sert un Data Lakehouse? https://www.oracle.com/fr/big