Modern Approaches for Data Platforms

Strategic data use: Why flexibility and self-service are key to unlocking the value of modern data platforms.

Introduction

In any sufficiently complex company, different domains take care of different aspects of the business, for example logistics, customer relationship management, or fulfillment. Each domain owns data that is crucial for its operations: customer relationship management obviously owns customer data; other examples are orders in e-commerce, the flow of goods in and out of a warehouse, or activities on a website. This data accumulates over time and becomes historical.

Historical or analytical data is of huge importance for companies. It is used for decision making supported by business analytics, a better understanding of customer behavior, optimizing customer experience, automated pricing, and fraud detection – just to name a few use-cases.

Data-driven companies are committed to grounding their decisions in analytical data – from strategic and tactical decisions to the workings of operational systems. Analytical data needs to be discoverable and usable with minimum effort, without risking wrong conclusions due to unexpectedly poor data quality or misunderstandings about the meaning of the data.

In this article, we discuss some of the challenges that companies with complex and evolving business logic and technology stacks face when designing their platform for analytical data. We compare the classical and conceptually opposed approaches of

  • Data Warehouse and
  • Data Lake.

Finally, we take a look at the three novel approaches of

  • Data Lakehouse,
  • Data Fabric, and
  • Data Mesh

that aim to fix the shortcomings of Data Warehouses and Data Lakes from different perspectives.

Challenges for Data Platforms

The design of a data platform for a data-driven company poses unique challenges:

  • How to integrate and govern analytical data across domains?
  • How to ensure that the semantics of analytical data, based on complex and evolving business logic, are not lost?
  • How to support different ways of storing and processing analytical data for different use-cases?
  • How to integrate legacy components and be ready for an evolving technology stack?

It is important to find a good balance between standardization and control on the one hand, and openness, flexibility, and the distributed nature of domain knowledge on the other, to ensure that the data platform increases the value gained from data instead of being bypassed due to its limitations.

A crucial ingredient of every data platform is metadata, i.e., data about analytical data across boundaries of domains and technologies. Metadata ensures that analytical data is discoverable and understandable, and that rules of access control and compliance are properly applied.
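
To make this more concrete, the following minimal sketch shows what such a metadata record could look like as a plain Python data structure. The field names (owner, domain, schema, classification, retention) are illustrative assumptions and do not refer to any particular catalog product.

  from dataclasses import dataclass

  @dataclass
  class DatasetMetadata:
      """Illustrative metadata record describing one analytical data set."""
      name: str             # e.g. "orders_daily"
      domain: str           # owning domain, e.g. "fulfillment"
      owner: str            # team or contact responsible for the data
      description: str      # human-readable meaning of the data
      schema: dict          # column name -> type, for discoverability
      classification: str   # drives access control, e.g. "public", "internal", "pii"
      retention_days: int   # compliance rule, e.g. derived from GDPR requirements

  # A consumer browsing the catalog sees such records instead of raw storage paths.
  orders = DatasetMetadata(
      name="orders_daily",
      domain="e-commerce",
      owner="team-orders@example.com",
      description="One row per order with its final state of the day.",
      schema={"order_id": "string", "state": "string", "total_eur": "decimal"},
      classification="pii",
      retention_days=730,
  )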

Analytical Data across Domains

Very often, use-cases require analytical data across domain boundaries. One domain might require analytical data from other domains, e.g., to train a machine learning model. Decision makers might want to combine analytical data from multiple domains to gain new insights through business analytics.

However, cross-domain sharing of analytical data comes with a number of challenges that need to be tackled to ensure long-term usability.

  • Analytical data is derived from very different data sources using different technologies and data structures, e.g., operational systems, external partners, …
  • The meaning of analytical data might depend on subtleties in the operations of its domain. E.g., an order in e-commerce might go through a complex sequence of state changes.
  • The technologies used, the data structures, and the meaning of data change over time according to the needs of the individual domains.
  • Different domains might even use the same name for different concepts, leading to a risk of misinterpretation.

Data Governance

The aim of data governance is to increase the benefits that can be derived from data by building trust in data and the ways it is used – for producers, consumers, and also its subjects.

Producers need to be sure that their analytical data is accessible only to the right persons and use-cases, is stored and shared in compliance with regulations like the GDPR and with company interests, and is understood by consumers in terms of its meaning and quality.

Consumers want to be able to find the right analytical data for their use-case, understand how it can be used and where its limits in terms of quality lie, and be sure that they cannot accidentally use it in a way that conflicts with regulations or company interests.

Finally, the subjects of analytical data, e.g., the customers of a retailer, are only willing to share their data if they retain autonomy over it. That means they can be sure that their data is only used for purposes they support, is corrected when incorrect, and is even deleted if they wish, according to the rules laid out in the GDPR.

Heterogeneous Technologies for Analytical Data

The complexity and variety of companies’ operations, of the business logic required to carry them out, and of the analytical data that is collected and utilized are growing.

The landscape of approaches and technologies for storing and integrating analytical data is also growing in complexity and variety. Some approaches focus on scalability and flexibility while accepting a lack of data modelling, whereas others favor a consistent and integrated data model from the outset to facilitate the use of analytical data for certain use-cases. Moreover, there is an ever-growing diversity of technologies for analytical data which, even within the same company, might be spread over a hybrid multi-cloud environment combining components that live on-premise with components in the public cloud.

Data Formats

For different use-cases of analytical data, different ways of structuring and storing it are suitable. A star schema comprised of relational tables might be best suited for business analytics with typical BI tools. Near real-time processing, e.g., for fraud detection, is better supported by event streams. More complex use-cases of data analytics as well as machine learning can benefit from data in a simple flat file format or even semi-structured or unstructured data that is closer to its original form.

This raises the question whether analytical data should be stored in

  • a sanitized and canonical structured form (schema on write), or in
  • a raw and potentially unstructured form close to its origin (schema on read).

In the first case, misunderstandings in the use and interpretation of analytical data are reduced because the transformation happens close to the source and its experts. On the other hand, an early decision on which data is needed, and in which form, sacrifices a lot of flexibility and potentially even historical data. In the second case, the analytical data remains in its raw state, ready for new use-cases. However, consumers need to understand the intricacies of the analytical data in its raw form, misunderstandings can happen easily, and cleansing and preparing the analytical data requires effort that is repeated for every use-case.
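
The difference can be illustrated with a small Python sketch, here using pandas; the column names and the file path are hypothetical and only serve to contrast the two approaches.

  import json
  import pandas as pd

  # Schema on write: the producer validates and shapes the data before storing it,
  # so consumers can rely on a fixed, documented structure.
  def ingest_schema_on_write(raw_events: list) -> pd.DataFrame:
      df = pd.DataFrame(raw_events)
      df = df[["order_id", "state", "total_eur"]]      # keep only the agreed columns
      df["total_eur"] = df["total_eur"].astype(float)  # enforce types at write time
      return df                                        # ready to load into curated storage

  # Schema on read: the raw payload is stored as-is; every consumer imposes its own
  # structure (and repeats the cleansing effort) when reading.
  def read_schema_on_read(path: str) -> pd.DataFrame:
      with open(path) as f:
          raw = [json.loads(line) for line in f]       # raw, possibly inconsistent records
      return pd.json_normalize(raw)                    # structure is imposed only now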

Hybrid Cloud Environments

Whether through the evolution of a company’s technology stack, driven, e.g., by new use-cases, or through organisational changes such as mergers and acquisitions, it is not uncommon that analytical data is stored and processed by different technologies. Analytical data might even be distributed over an on-premise private cloud as well as IaaS and PaaS providers in the public cloud. More likely than not, the technological landscape will diversify further in the future.

A data strategy needs to ensure that analytical data across these technologies is discoverable, can be combined, and follows the same guidelines of governance.

The Danger of Shadow Copies

A data platform that does not embrace the distributed nature of its analytical data can easily become a bottleneck in the onboarding of new data sources or use-cases, or in the communication of unavoidable change between producers and consumers of data. It might also become too inflexible to incorporate new data technologies, new data formats, or new requirements of governance, e.g., due to compliance and regulations like the GDPR.

In this case, the company risks that shadow copies of analytical data pop up for pragmatic reasons, e.g., when someone cannot find a data set in the data platform, or when that data set is not accessible or not in the right format. In such cases, users are tempted to obtain copies of data “out of band”, e.g., from a colleague who is able to find, access, or transform the data. Such copies are outside the control of data governance, and their meaning and freshness soon become unclear. Thus, a data architecture that is too rigid for the dynamics and use-cases of a company can easily lead to a loss of control.

Classical Approaches for the Management of Analytical Data

Data Warehouses and Data Lakes are two classical and widely used approaches to store analytical data and make it accessible for analytics. A comparison is interesting because the two approaches are shaped by very different use-cases for analytical data. Moreover, the Data Lake approach reflects the growing complexity and dynamics of IT landscapes in companies since the earlier introduction of Data Warehouses.

Data Warehouse

Data Warehouses were introduced in the 1980s. Here, analytical data is collected and stored centrally in a highly structured and integrated way. There is usually a central Data Warehouse team that owns and governs a structured, interlinked and sanitized model of all data sources across domains, and also maintains ETL (Extract, Transform, Load) pipelines into the systems of the domains to gather the analytical data from its sources. Data Warehouses are usually relational and thus queried using SQL.
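
As a rough illustration of such a pipeline, the following sketch shows the classical ETL pattern in Python; the connection strings, tables, and transformation are hypothetical stand-ins for whatever operational systems and warehouse technology a company actually uses.

  import pandas as pd
  from sqlalchemy import create_engine

  # Hypothetical connections to an operational database and to the central Data Warehouse.
  source = create_engine("postgresql://user:pw@operational-db/orders")
  warehouse = create_engine("postgresql://user:pw@warehouse-db/analytics")

  def etl_orders() -> None:
      # Extract: read from the operational system.
      df = pd.read_sql("SELECT order_id, customer_id, state, total FROM orders", source)

      # Transform: integrate the data into the warehouse's sanitized, structured model
      # *before* loading; this is where the central team's modelling effort goes.
      facts = df[df["state"] == "completed"].rename(columns={"total": "total_eur"})

      # Load: append to a fact table of the star schema.
      facts.to_sql("fact_orders", warehouse, if_exists="append", index=False)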

The advantages of a Data Warehouse are its ease of use for business intelligence and reporting, and the ease of enforcing data governance.

On the other hand, Data Warehouses lack agility in a domain-oriented and dynamic IT environment – data sources in domains can change fast, making centrally owned ETL pipelines unreliable, and the Data Warehouse team can become a bottleneck between domains and business analysts when it comes to subtleties in the meaning and quality of analytical data.

A Data Warehouse can also turn out to be limiting for advanced data analytics use-cases. One example of such a use-case is AI, which would benefit from data that is less sanitized and not structured based on possibly premature assumptions about future use-cases. A Data Warehouse may also be too limiting for use-cases that require near real-time processing and are best supported by continuously ingested streams of events.

Data Lake

The Data Lake approach was introduced roughly 30 years after the Data Warehouse approach and is conceptually opposed to it. It is a highly scalable central storage for unstructured, semi-structured and structured data. In contrast to a Data Warehouse, data from the domains is not necessarily integrated into a coherent model but instead primarily stored as it is, and later transformed into a suitable integrated model for a use-case or a class of use-cases. Thus, in comparison to Data Warehouses, we are not talking about ETL pipelines but about ELT (Extract, Load, Transform) pipelines here.
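
For contrast with the ETL sketch above, a minimal ELT flow might look as follows; the lake paths are hypothetical, and the point is only that raw data is loaded first and transformed later, separately per use-case.

  import shutil
  import pandas as pd

  # Load: copy the raw export into the lake as-is; no modelling decisions are made yet.
  def load_raw(export_file: str) -> str:
      target = "/data-lake/raw/orders/" + export_file   # hypothetical raw zone of the lake
      shutil.copy(export_file, target)
      return target

  # Transform: happens later and separately for each use-case that needs the data.
  def transform_for_reporting(raw_path: str) -> pd.DataFrame:
      df = pd.read_json(raw_path, lines=True)           # structure is imposed at read time
      return df.groupby("state", as_index=False)["total"].sum()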

The advantages of a Data Lake over a Data Warehouse include greater scalability and flexibility. New sources of analytical data can be easily added as no immediate integration into a coherent model is necessary. New use-cases for analytical data also do not require changes to an already existing integrated model but can resort to the data in its original form.

From a technical perspective, Data Lakes typically keep storage and computation separate, so that new use-cases for analytical data do not have to use the same query engine as in a Data Warehouse but can instead use technologies of their own choice for processing.

A disadvantage of Data Lakes, however, is that the meaning and quality of data is easily lost before the data is even transformed or processed for a specific use-case, and cleansing it for a use-case can take a lot of time. This holds true especially because data collected over a long time inevitably goes through changes, e.g., in its structure, meaning, statistical distribution, or timeliness. Processing data in the Data Lake can be challenging due to the lack of transactions and their guarantees of atomicity, consistency, isolation, and durability (ACID). In contrast to a Data Warehouse, where a central team implicitly knows about the integrated data, a Data Lake requires much more effort to keep an overview of the available data and to ensure proper data governance.

There is a risk that a Data Lake degenerates into a so-called data swamp of hard-to-use and possibly even contradictory data with unclear semantics and quality.

New Approaches

In the last 15 years, new approaches for analytical data have been developed that aim to combine the advantages of Data Warehouses and Data Lakes while supporting an ever-growing diversity of use-cases and technologies for analytical data, highly agile, decentralized and domain-oriented IT landscapes, and increasing requirements on governance and compliance.

Technology: Data Lakehouse

A Data Lakehouse is a technological approach that builds on top of a Data Lake and provides uniformly structured and governed access to its data, even supporting unified queries across the Data Lake using a single query language like SQL. It is implemented as an additional metadata layer on top of the storage that imposes structure on unstructured and semi-structured data and even ensures ACID properties.
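
As one possible illustration, the sketch below uses PySpark with the Delta Lake table format, one common way to add such a transactional metadata layer on top of a Data Lake; the paths and table names are assumptions, and formats such as Apache Iceberg or Apache Hudi follow the same idea.

  from pyspark.sql import SparkSession

  # Assumes a Spark session that is already configured with the Delta Lake extension.
  spark = SparkSession.builder.appName("lakehouse-sketch").getOrCreate()

  # The files stay in the lake, but writes now go through a transactional,
  # schema-enforcing table format instead of plain files.
  raw = spark.read.json("/data-lake/raw/orders/")       # hypothetical raw zone
  raw.write.format("delta").mode("append").save("/data-lake/curated/orders")

  # Thanks to the metadata layer, the curated data can be queried with plain SQL,
  # with ACID guarantees for concurrent readers and writers.
  spark.sql("CREATE TABLE IF NOT EXISTS orders USING DELTA LOCATION '/data-lake/curated/orders'")
  spark.sql("SELECT state, COUNT(*) AS orders FROM orders GROUP BY state").show()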

Data Lakehouses combine the advantages of Data Warehouses and Data Lakes: Data can be analyzed in a structured and integrated way while preserving the scalability, flexibility and openness of a Data Lake. A central metadata layer also allows unified governance.

However, a Data Lakehouse still relies on a single storage layer and thus struggles to integrate heterogeneous and potentially legacy data technologies. Moreover, a Data Lakehouse as such does not yet give any guidelines on how to handle the autonomous evolution of distributed domains, their business logic, analytical data, and use-cases.

Data Fabric and Data Mesh propose answers to these questions of architecture and data culture.

Architecture: Data Fabric

Data Fabric is an architecture that proposes a decentralized and loosely coupled data platform. Data is shared and consumed in a graph of distributed services, ranging from raw data sources in operational systems to services with specialized and curated analytical data for certain use-cases. In contrast to a single Data Warehouse or Data Lake, services are spread over different technologies, potentially in a hybrid multi-cloud environment. This also facilitates the integration of legacy components.

An important component of a Data Fabric is a metadata layer that ensures discoverability and connectivity of services and that enforces governance across all services.
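
The sketch below illustrates, in simplified Python, what this metadata layer enables: a consumer asks the catalog for a data set by name and is routed to whichever technology actually hosts it. The catalog contents and reader logic are hypothetical; a real Data Fabric would rely on dedicated catalog and connector services.

  import pandas as pd

  # Hypothetical catalog entries: the metadata layer knows where each data set lives
  # and which technology serves it, even across clouds and legacy systems.
  CATALOG = {
      "orders":     {"technology": "warehouse", "location": "postgresql://warehouse-db/analytics"},
      "web_clicks": {"technology": "lake",      "location": "/data-lake/raw/clicks/part-0.json"},
  }

  def read_dataset(name: str) -> pd.DataFrame:
      entry = CATALOG[name]                              # discovery via metadata
      if entry["technology"] == "warehouse":
          return pd.read_sql("SELECT * FROM " + name, entry["location"])
      if entry["technology"] == "lake":
          return pd.read_json(entry["location"], lines=True)
      raise ValueError("No connector registered for " + entry["technology"])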

One risk of a Data Fabric is that the ownership of, and the dependencies between, the distributed services are unclear, so that knowledge about the analytical data is lost and its reliability suffers.

Culture: Data Mesh

Data Mesh was introduced in 2019 by Zhamak Dehghani. It assumes that the business of the company is organised in domains that are supported by cross-functional software development teams. Data Mesh proposes a decentralized and distributed graph of domain-owned analytical data that is shared in so-called data products.

Obviously, there is a certain overlap with the Data Fabric approach. In the context of Data Mesh, the focus shifts from architecture to data culture, emphasizing ownership and responsibility for analytical data, guided by principles of domain-driven design and product thinking.

The core elements of the Data Mesh approach can be summarized in the following four points:

  1. Domain ownership: Analytical data is owned and managed by the same cross-functional development teams that also handle the operational systems and operational data of the domains. The owning domain is the one with the relevant expertise, e.g., the owner of the operational system that is the origin of the analytical data, or the owner of a use-case that determines the requirements on the analytical data.
  2. Data as a product: Services for analytical data are intentionally designed and supported with long-term use by a variety of decoupled consumers with different use-cases in mind. To this end, data products are documented, versioned on changes, and guarantee SLAs on availability and data quality. They are designed for reusability by current and future consumers (see the sketch after this list).
  3. Federated computational governance: Unified governance is applied automatically to the decentralized data products across domains and finds a balance between overarching concerns like compliance on the one hand and autonomy as well as expert knowledge within the domains on the other.
  4. Self-serve data platform: A self-serve data platform is not only designed to provide end users with analytical data, but also to facilitate the standardized and interoperable development of data products by generalist developers within the domains.
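
As a simple illustration of points 2 and 3, the sketch below shows a hypothetical descriptor for a data product together with a small automated governance check; all field names and policies are assumptions chosen for the example.

  from dataclasses import dataclass

  @dataclass
  class DataProduct:
      """Illustrative descriptor a domain team publishes alongside its analytical data."""
      name: str
      domain: str
      owner: str
      version: str             # consumers can pin a version and migrate at their own pace
      description: str
      contains_pii: bool
      availability_slo: float  # e.g. 0.99, the promised availability
      freshness_hours: int     # promised maximum age of the data

  def governance_check(product: DataProduct) -> list:
      """Federated computational governance: rules defined centrally, applied automatically."""
      findings = []
      if not product.owner:
          findings.append("every data product needs an accountable owner")
      if product.contains_pii and product.freshness_hours > 24:
          findings.append("PII data products must be refreshed at least daily")  # example policy
      return findings
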
One particular risk of the Data Mesh approach is the organisational change of ownership and responsibility it requires. It is sometimes difficult to determine the boundaries between domains properly and to assign ownership of analytical data. Moreover, the changes in ownership also have to be considered in resource planning, prioritization, and training.

Summary

The design of a data platform depends on many factors and rarely happens in isolation. It depends on

  • already existing solutions for analytical data and operational systems,
  • current and future use-cases for analytical data,
  • and the structure of the company and its business logic.

Data Lakehouses can help to improve the governance and usability of an already existing Data Lake, whereas a Data Fabric provides the ground on which heterogeneous components, from operational systems up to already used technologies for analytical data across clouds, can be integrated. The cultural shift triggered by the Data Mesh approach can be especially advantageous in a company that uses, or aims to use, domain-driven design to tackle its complex and evolving business logic.

Of course, the three described approaches of Data Lakehouse, Data Fabric, and Data Mesh are not mutually exclusive – a specific data platform can utilize ideas from all three approaches. For example, this article explores the design of a new data platform for Engelbert Strauss that combines architectural principles of Data Fabric with the cultural framework of the Data Mesh approach.

Contact us if you envision a data platform that is tailored to your company. Together, we will assess your current data usage and needs to find a custom solution covering all aspects, from technology and architecture to governance and data culture.



Author © 2025: Dr. Lucas Heimberg – www.linkedin.com/in/dr-lucas-heimberg-69613638/
