Today’s data warehouses are a far cry from the single-stack warehouses of the past. Instead of focusing primarily on data processing, as those early warehouses did, the modern version is all about storing lots of data from multiple sources, in various formats, and gaining insights compelling enough to drive business decisions.
Big data, cloud computing, and advanced analytics have all played major roles in the development of the modern data warehouse. In fact, they demanded it. Conventional data warehouses typically struggle to keep up with the growing challenges of large volumes of data—whether it’s structured and unstructured data managed on premises or cloud-based data hosted by third parties. Gleaning value from it all is harder still.
If your organization has plans to migrate to a modern data warehouse, here’s what you need to know.
Data store vs. distribution center
If a conventional data warehouse could be thought of as a data store, today’s modern version more closely resembles a mega distribution center.
The data warehouse is “best represented by the convergence of the traditional data warehouse and the data lake,” said John Santaferraro, research director at Enterprise Management Associates (EMA). In fact, it is “better defined as a unified analytics warehouse” (UAW).
A data lake is simply a repository that takes in data from multiple sources and can store it in any format.
The modern data warehouse is unified because it adequately handles multi-structured data in a single platform. It is an analytics platform because the primary use case for both the data lake and the data warehouse has always been analytics, Santaferraro said. It is a warehouse “because it stores multi-structured data in an organized and accessible manner for a broad range of analytics use cases.”
Traditionally, data lakes have focused more on data science use cases, while the data warehouse focused more on enterprise analytics. Enterprise data warehouses, by contrast, were designed to focus on specific raw data to draw conclusions about only that information and use a set of practices aimed at regular analysis for reporting and dashboards.
Data scientists take a broader approach that applies scientific methods, processes, and algorithms to extract insights from data overall, whether structured or unstructured, and can involve data mining and deep learning techniques.
Benefits of modernizing
There are many compelling reasons to develop and maintain a modern data warehouse, both at the user and admin level, and for the organization overall.
Users and administrators can expect to:
- Spend less time moving and preparing data
- Spend more time on innovative uses of analytics to drive new business models
Organizations deploying a unified analytics warehouse can expect to:
- Speed time-to-analytics
- Reduce overall cost of ownership
- Increase the productivity of their analytics workforce
From a technology standpoint, a modern data warehouse:
- Is always available
- Is scalable to large amounts of data
- Provides correct answers to queries in any schema
- Provides real-time updates
- Handles extract, transform and load (ETL, the process required when stored data is accessed prior to analysis)
- Supports batch and interactive workloads
- Supports large numbers of simultaneous users
Elements of a modern data warehouse
There are several major components to today’s data warehouses.
Organizations have historically moved their data from databases to file systems to save money. Now they are moving from file systems to object storage, EMA’s Santaferraro said.
In the world of analytics, it is important to remember that cheap storage has its limitations, Santaferraro said. “If the data is not accessible for analysis, cheap is not enough.”
For this reason, the UAW must provide a rich and consistent set of analytical capabilities across all storage tiers. More advanced UAWs will automate the movement of data in and out of file systems and object storage when needed, he said.
While many IT pros equate Hadoop with a data lake, many other tools are in common use and are mostly open source. These include:
- Apache HBase, a key-value columnar storage and database system
- Apache HCatalog, a metadata, table, and storage management system
- Hadoop MapReduce, a scalable data processing tool normally used with large datasets
- Apache Hive, an open-source language built on top of MapReduce that assists with the analysis of large datasets
- Oozie, a MapReduce job scheduling tool
- Apache Pig, a language connected with MapReduce used in parallel data processing
- Apache ZooKeeper, a hierarchical key-value store for synchronization
Cloud, multi-cloud, and hybrid solutions
Most organizations today have at least some data residing on cloud-based platforms, so a modern data warehouse must support those. Your data warehouse should also support cloud-to-cloud interoperability to enable those cloud platforms to share data. A modern data warehouse should also support interoperability among multiple cloud systems and on-premises systems, allowing all to work together without isolating data on any of the respective systems.
Cloud platforms for UAWs have become increasingly popular, as organizations seek ways to reconcile structured warehouse data with unstructured data in the data lake.
The movement by many organizations to a software-as-a-service model for enterprise applications has also led to greater interest in a UAW approach. The benefits can include greater scalability, agility, cost savings, faster processing speeds, faster deployments, easier disaster recovery, and improved governance and security capabilities.
Computing and processing
Some cloud-based architectures supporting the modern data warehouse completely separate compute functions from storage to optimize an organization’s investments in infrastructure. This total separation and the ability to query data in any tier of storage can produce huge advantages in total cost of ownership.
That is partly because cloud vendors typically charge higher rates for compute vs. storage (and, naturally, the compute-intensive analytics processes are the whole reason for the storage). So if compute capacity can be spun down when not needed, teams can save money by using just the storage capacity. As workloads require compute capacity again, that can be spun up dynamically.
Data-intensive analytical applications benefit from the use of multi-tiered data storage. The most advanced platforms provide high-performance techniques for complex data types within their original format.
To support a unified workforce, the modern data warehouse should support multiple approaches to data analytics. For example, Santaferraro said, a data scientist would need to be able to use R, Python, and notebooks to execute discovery analytics or advanced analytics such as machine learning on multi-structured data. The platform should also provide easy-to-access (i.e., SQL-based), high-performance analytics.
“It must be simple to combine these analytics for greater insight or ask questions of data in near-real time,” he said. “With the modern data warehouse, data engineers, data scientists, and data analysts no longer need to fight about who is right and who is wrong. They have a single environment where they can collaborate for the greater good of the enterprise.”
How to successfully implement a modern data warehouse
Organizations looking to implement a modern data warehouse should start by asking several key questions, starting with “what is the ultimate business
purpose?” After you’ve determined that, the more technical questions to ask are:
- Can the system handle data from multiple sources?
- Can the system handle extensive amounts of data flowing simultaneously?
- Does the architecture allow for scalability and increased performance?
- Does the architecture allow for real-time analysis of streaming data?
- Does the organization support a bimodal business intelligence model?
- Does the system support data virtualization and data integration?
- Does the system accommodate automated orchestration and improved agility?
How an organization answers each of these questions will help it determine a best practices approach to adopting a modern data warehouse that aligns with business needs.
Your organization should start off by identifying the business goals and needs so that results match those. Once you’ve defined a data model, create a data flow chart, develop an integration layer, adopt an architecture standard, and consider an agile data warehouse methodology.
Your warehouse model should accommodate multi-source database aggregation, database updates, automation, transaction logging, the ability to evaluate and analyze data sources, and easy-to-change development tools.
Finally, as your organization weighs the option of build versus buy, don’t forget the basic requirements: Your data warehouse should support cloud-hybrid and multi-cloud environments; must support all types of data, including structured, semi-structured, and unstructured; and that must support all data latencies, including batch, real-time, and streaming data.
- Get up to speed fast with TechBeacon’s guide to the modern data warehouse.
- Download the Buyer’s Guide to Data Warehousing in the Cloud.
- Get up to speed on digital transformation with TechBeacon’s Guide.
- How important is DX to your org? Take our survey and find out how you stand next to the competition.
Very interesting article, and much more technical than I expected to see here. We’ve built a software, M.Toolbox, focused on warehouse management and supply chain, that does all this technical work for you, and allows you to focus on the data, creating your dashboards, customer portals, etc. with a mix of real time or historical data stored in the system’s data lake. Feel free to reach out if you want more details at https://macgregorpartners.com/m-toolbox/