OZ Digital, LLC

  1. Home
  2. /
  3. Resources
  4. /
  5. Blog
  6. /
  7. IS THE DATA WAREHOUSE...

IS THE DATA WAREHOUSE DEAD? YES AND NO.

sal

It all started simply enough until everything became data. With the rise and rise—and rise of data, data warehouses proliferated, too. Almost every business has one (sometimes up to five or six). But do we need one? Before we argue its merits, what is a data warehouse even?

Defining a data warehouse is like trying to define a sandwich, as Benn Stancil, Co-founder and Chief Analytics Officer of Mode points out: “Every attempt has weird and unacceptable exceptions. A sandwich is meat between bread? A PB&J isn’t a sandwich, and a hot dog is. A sandwich must have two separate pieces of bread? Most cheesesteaks aren’t sandwiches, and tiramisu is. A sandwich is food surrounded by leavened bread? Ice cream sandwiches aren’t sandwiches, and soup in a bread bowl is. By two pieces of leavened bread?” Defining data warehouses can be just as confusing.

As it stands, a data warehouse is like a menu. It’s not any one thing but a series of things—a collection of myriad data, processes, and activities.

 

How Did We Get Here

Traditionally, a data warehouse has meant collecting data from disparate sources in on-premises infrastructure. A data warehouse is where data extracted through passive pipelines (the “E” in ETL) got processed and stored. Later, this data would be transformed for specific purposes such as analytics, machine learning, or integration with other systems like Salesforce, aided by technologies for data observability, governance, and cataloging.

Early on, the original ETL (extract, transform, load) process involved extracting data from the source and transforming it before it entered the data warehouse. Many businesses still operate this way today, particularly large enterprises that prioritize data quality.

However, tech giants like Google abandoned this process, choosing instead to dump everything in the data lake. Fast-growing startups followed suit as the ROI of logically organizing the data didn’t quite add up. And loading (the “L” in ELT) data in the cloud had become much easier.

Leading transformation tools, modular code, and reduced runtimes made transforming data in the warehouse easier than ever. As a result, many of these transformation tools moved downstream from data engineers to data scientists and analysts.

It seemed like we had found a new best practice and were on our way to data integration and standardization. Except, there are several issues with an architecture that is monolithic and inflexible, chief among them:

Offers No Support for Machine Learning and AI

The term “big data” is so frequently used it feels like a cliché, but the reality is the world’s data has increased fivefold since 2018. And with the emergence of AI, we’re going to need even more data in a variety of formats: “Internet of Things” sensors, web, social media, mobile apps, and more.

As the volume of data grows and the demand for AI-powered insights increases, legacy data warehouses will struggle to keep pace. Let’s face it: it will no longer be enough for data warehousing to be of value to a handful of business use cases but will have to support how businesses will be run in the future. And that future is already unfolding, shaped by automation, AI, and machine learning. If data warehousing is to succeed, it must be modernized to support this future.

Difficult to Scale

Traditional data warehouses are typically less flexible compared to modern cloud-based solutions. Since they were conceived as monolithic architectures with all components tightly integrated (compute, storage, and network), scaling these systems often requires scaling all components together. It may also include hardware provisioning, database tuning, and migration of data, which can be both time-consuming and error-prone. The inflexibility makes it hard to scale resources up or down based on workload demands, leading to inefficient utilization and higher costs.

Expensive to Maintain

As the data increases, structuring becomes difficult, slowing down the ETL process while running up high costs. These typically complex and outdated architectures weren’t designed for the exponential data growth we’re experiencing. They’re underperforming—while the cost of owning and operating them is mounting.

In a world that’s becoming more AI-first, they don’t support advanced analytics capabilities and real-time insights either, differentiators businesses need to succeed. When an organization’s data needs can’t be met, it often leads to shadow IT as business teams turn to a variety of disparate and incompatible solutions, creating even bigger data silos.

 

Evolve Your Data Warehouse

Contrary to the original vision of a single data warehouse serving a single version of the truth, most organizations have multiple data warehouses. What we need to do is step back and reexamine the purpose of a data warehouse. The primary objective is to integrate and reconcile internal enterprise data, collecting historical snapshots for analysis.

From an architecture standpoint, that does not mean it has to reside on-premises in relational databases or rely entirely on SQL. Those are relics of data warehousing in the ’90s.

We should consider cloud migration—taking advantage of Azure  or Microsoft Fabric for fast data transformation—and accelerating ETL and NoSQL for offloading infrequently accessed data so you can improve performance for frequently accessed data by reducing table and database sizes.

Read: Why I Would Choose Microsoft Fabric as My Data Management Tool

Migrate to a Cloud-Based Data Warehouse

Cloud-based data warehouses based on platforms like Microsoft Fabric, in contrast, are more scalable. They decouple storage and compute, leverage distributed computing, and offer automatic scaling, which allows for more efficient handling of large and growing datasets.

With a cloud-based warehouse, you don’t have to throw all your existing ETL and BI tools away. You should be able to bring semi-structured data in and give the same TLC you give relational data. The cloud offers massive amounts of cheap storage and unlimited scalability, enabling you to offer processing power when needed. And just as important, not apply when it’s not needed. Performance and flexibility are also built in.

 

What’s Next?

On a scale of dying to thriving, the traditional data warehouse exists as “struggling.” Many of today’s data warehouses could be considered legacy systems, yet people still depend on them.

The idea of data warehousing will endure, even if a physical data store called a data warehouse does not.

To make it work in the era of AI, we ought to rethink the data warehouse in the context of:

Data Virtualization:

  • Are there cases where the data warehouse is trying to deliver real-time or low-latency data without needing extensive historical records? Would these scenarios be better handled with a data virtualization model?
  • Should some functions currently managed by legacy data warehouses be moved to a data lake, allowing the data warehouse to focus on what it does best—handling integrated, structured, and relational data?

Data Integration:

  • How are we moving data into the data warehouse and thinking about integration? Typically, data warehouses have a separate staging area. Does it make more sense for the data lake to be the landing zone for all incoming data and also serve as the staging area for data before it enters the warehouse?

Data Transformation Models:

  • Is the traditional ETL process still the best approach, or should we consider using extract, load, transform (ELT) in some cases? What about bulk loads?
  • Should we process data streams directly into the warehouse? If so, should these streams first land in the data lake and then be filtered for relevant events to be pulled into the warehouse?

There are many ways to modernize the data warehouse. However, we must first address these questions and redefine our entire data management architecture, not just data warehousing architecture. This includes understanding the roles of data warehouses, data lakes, analytic sandboxes, and master data management within the data ecosystem.

If you’re grappling with these questions, schedule a free consultation with our Microsoft-certified experts. And give your data a second life.