
From Mapping to Blending – Clarifying Data Integration Terminology 

Explore key data integration processes like cleansing, mapping, and merging to enhance data quality for informed decision-making.

It’s no secret that the importance of data-driven decisions has increased and may well continue to increase. Most organizations have recognized the need to collect data from their business transactions and set it in the context of general industry information to gain valuable, objective insights for informed business decisions. These decisions can concern day-to-day business, e.g., integrating customer feedback data appropriately to improve customer satisfaction. But they can also concern medium- or long-term business plans, such as adjusting the pricing strategy based on the company’s revenue data or making merger and acquisition decisions based on the target’s market data.

Data Integration

In most cases, merely collecting and buying data is insufficient. A valid data set serving as a single source of truth is crucial for building a valuable business analysis. Therefore, something needs to happen to the raw data before models can successfully be applied.

This “something” is data integration, i.e., combining data from different sources and presenting it in a unified view. It’s a process that consists of various steps, including (but not limited to) cleansing, mapping, merging, reconciliation, consolidation, enrichment, and blending of the data. This article aims to give an overview of this process while disentangling these terms and stressing the value of data integration.

The need for a data integration process stems from the simple fact that data comes from various, potentially constantly changing sources. While the data in those sources may be related, most sources will have different structures. They might contain different aspects of the needed information or even contradictory claims. Companies need to be able to find the relations between the various data sources, clean the data, and combine all this information. Otherwise, their business analysis might be based on erroneous or incomplete data. Undetected duplicates in the data might also lead to wrong conclusions and, thus, bad business decisions.

1. Data Cleansing

Data cleansing is an obvious step when exploring a new data set. It involves handling missing data, identifying and handling inaccurate data, outlier detection, deduplication, standardization, etc. Depending on the use case, not all these methods will need to be applied before integrating the data (or at all).

Since data cleansing is almost always a prerequisite for incorporating data into analytical models, and is not specific to creating a consolidated data set, we won’t go into detail here. However, it’s an important step to mention: without it, data quality cannot be assured, and the output of any data model is questionable.
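
Still, a minimal sketch illustrates what such a step can look like, here in Python with pandas; the records, rules, and thresholds are hypothetical and chosen for illustration only:

```python
import pandas as pd

# Hypothetical raw customer records with typical quality issues
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "name": ["Alice", "Bob", "Bob", None, "Dana"],
    "country": ["de", "DE", "DE", "FR", "fr"],
    "age": [34, 29, 29, 41, 230],  # 230 is an implausible outlier
})

cleaned = (
    raw
    .drop_duplicates(subset="customer_id")                 # deduplication
    .dropna(subset=["name"])                               # handle missing data
    .assign(country=lambda df: df["country"].str.upper())  # standardization
)
cleaned = cleaned[cleaned["age"].between(0, 120)]          # simple outlier rule

print(cleaned)
```

Which of these steps are needed, and in which order, depends entirely on the data and the use case.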

2. Data Mapping

Data mapping is one of the first steps when combining multiple data sets into a single set. It refers to defining how data from one source relates to data in another. It involves creating a map between fields in different datasets or systems to accurately transform, migrate, or integrate data.

A simple example is two datasets containing customer information. Both have some basic information about the customer in common, such as an ID, the name, and the address, but they might use different names for those fields, e.g., CustomerID vs. client_id or FirstName and LastName vs. name. To map those fields, one needs to find out which of them contain similar information and whether this information is of the same format: do CustomerID and client_id come from the same ID universe, or do they refer to different things? Is the name field an aggregation of first and last name, or something else?
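
As an illustration, a minimal sketch of how such a field mapping could be expressed in Python with pandas, based on the example above; the assumption that name is simply first name, space, last name is exactly the kind of question a real mapping has to verify:

```python
import pandas as pd

# Source B uses different field names and aggregates first and last name
source_b = pd.DataFrame({
    "client_id": [101, 102],
    "name": ["Alice Smith", "Bob Jones"],
})

# Explicit field mapping from source B's schema to source A's schema
field_map = {"client_id": "CustomerID"}
mapped = source_b.rename(columns=field_map)

# Assumed here: the aggregated name splits into first and last name
mapped[["FirstName", "LastName"]] = mapped["name"].str.split(" ", n=1, expand=True)
mapped = mapped.drop(columns="name")

print(mapped)
```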

Proper data mapping is foundational for many data processes. It ensures that the structures of the data sets are aligned, which is essential before considering the individual entries of the data sets.

3. Data Merging

Through the mapping of the data, common attributes are discovered. Data merging means combining the data based on these common attributes. It provides a single coherent data set, making data analysis easier.

Typically, merging is based on a shared identifier. An example is joins in SQL (e.g., LEFT JOIN, INNER JOIN, etc.), which are based on primary and foreign keys and used to combine normalized data for analysis. Imagine a customer behavior analysis where data about each customer is distributed over several tables containing distinct information such as their orders and demographics.
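
As a sketch, the same kind of key-based join in Python with pandas, analogous to a SQL LEFT JOIN; the tables and columns are hypothetical:

```python
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "segment": ["retail", "wholesale", "retail"],
})
orders = pd.DataFrame({
    "order_id": [10, 11, 12],
    "customer_id": [1, 1, 3],
    "amount": [250.0, 80.0, 120.0],
})

# LEFT JOIN on the shared identifier: every customer is kept,
# even those without orders (customer 2 gets NaN order fields)
merged = customers.merge(orders, on="customer_id", how="left")
print(merged)
```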

4. Data Reconciliation

After mapping the data, some of it can easily be merged; in most cases, however, data reconciliation needs to take place. This means that corresponding entries from the different sources must not only be identified; importantly, discrepancies between them must also be found and resolved. Finding corresponding entries from different data sets is essentially a deduplication task on the combined data set; beyond that, mapping and reconciliation also unify the structures of the various data sets.

For example, when working with two data sets containing the sales of a company’s products, finding the corresponding entries might be an easy merge if an article ID is present. If not, it’s more challenging, and other attributes, such as an article description, might help. Conflicting sales numbers from different sources for the same article must be resolved. How to do this depends on the nature of the data: maybe one data set is more trustworthy than the other, or perhaps both contain valid data from different points in time, regions, etc.
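
As an illustration, a minimal sketch of one such resolution rule in pandas; the sources and the precedence rule (the ERP system wins on conflicts) are assumptions for this example, not a general recipe:

```python
import pandas as pd

# Sales figures for the same articles from two sources; they disagree on A2
erp = pd.DataFrame({"article_id": ["A1", "A2"], "units_sold": [100, 40], "source": "erp"})
webshop = pd.DataFrame({"article_id": ["A2", "A3"], "units_sold": [45, 70], "source": "webshop"})

combined = pd.concat([erp, webshop], ignore_index=True)

# Assumed resolution rule: the ERP system is more trustworthy and wins on conflicts
priority = {"erp": 0, "webshop": 1}
reconciled = (
    combined
    .sort_values("source", key=lambda s: s.map(priority))
    .drop_duplicates(subset="article_id", keep="first")
    .sort_values("article_id")
)
print(reconciled)
```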

The goal of data reconciliation is to ensure that data across different sources is consistent and accurate. It is therefore necessary both for consolidating data sets and when a data set (or a whole data vendor) is to be replaced with another.

5. Data Consolidation

Data consolidation means assembling data from multiple sources into a single dataset or database. Thus, it is usually preceded by data cleansing, deduplication, mapping, merging, and reconciliation.

Having a consolidated view of the data increases the efficiency of working with it. Since the data is in one place, there is no need to query different sources. Essential mappings, merges, and reconciliations are saved and don’t have to be recalculated every time the data is used. Having just one place that holds the relevant data, instead of many, can also make it easier to ensure compliance with data security regulations.
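
A minimal sketch of persisting a consolidated result, here into a local SQLite database via pandas; the table name, file name, and source extracts are hypothetical:

```python
import sqlite3
import pandas as pd

# Two already mapped, merged, and reconciled source extracts
source_a = pd.DataFrame({"customer_id": [1, 2], "city": ["Berlin", "Paris"]})
source_b = pd.DataFrame({"customer_id": [3], "city": ["Madrid"]})

consolidated = pd.concat([source_a, source_b], ignore_index=True)

# Persist once, query many times: downstream users hit a single store
with sqlite3.connect("consolidated.db") as conn:
    consolidated.to_sql("customers", conn, if_exists="replace", index=False)
    print(pd.read_sql("SELECT * FROM customers", conn))
```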


6. Data Enrichment

Data enrichment enhances existing data by adding information from external or internal sources. It can be a data consolidation task, adding further data to an existing consolidated data set instead of building a completely new one from scratch, or a data blending task (see the next section).

Its benefit relies on the assumption that a new data set provides different information than the existing one. That information might be additional fields for existing entries or new entries. In any case, some form of data mapping and reconciliation must be applied to ensure that the enriched data set remains consistent and valid.

An example use case for data enrichment is a company that already has a database for its suppliers and its interactions with these suppliers but wants to add additional information about the suppliers, such as their revenue, market capitalization, relations to other companies, etc.
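
A minimal sketch of this supplier example in pandas; the external market data and all field names are hypothetical, and the external set is assumed to be already mapped to the same identifier:

```python
import pandas as pd

# Existing consolidated supplier data
suppliers = pd.DataFrame({
    "supplier_id": [1, 2],
    "name": ["Acme GmbH", "Globex Ltd"],
})

# Hypothetical external data set with additional supplier attributes
market_data = pd.DataFrame({
    "supplier_id": [1, 2],
    "revenue_musd": [120.5, 340.0],
    "market_cap_musd": [900.0, 2100.0],
})

# Enrichment: new fields are added for the existing entries
enriched = suppliers.merge(market_data, on="supplier_id", how="left")
print(enriched)
```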

7. Data Blending

In contrast to data consolidation, data blending is more of an “on the fly” way of integrating data, meaning the combined data is not necessarily consolidated into a single database but used for ad hoc analysis. Not storing the merged data set in a database makes sense for exploratory analysis or analysis that happens so infrequently that the cost of storing a consolidated view is higher than blending the data during use. It also offers higher flexibility for the analyst in deciding how to use which data, which may be an advantage for exploration but also a risk for errors.

A typical example would be blending data from an Excel sheet with data from a SQL database in an analysis tool. This example illustrates the enrichment of an existing consolidated data source (SQL database) with additional information (Excel sheet), a common use case for data blending.
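
A minimal sketch of such an ad hoc blend in Python with pandas; an in-memory SQLite database and a plain DataFrame stand in for the real database and Excel sheet so the example is self-contained:

```python
import sqlite3
import pandas as pd

# Stand-in for the consolidated SQL database
conn = sqlite3.connect(":memory:")
pd.DataFrame({"article_id": ["A1", "A2"], "units_sold": [100, 45]}).to_sql(
    "sales", conn, index=False)

# Stand-in for the Excel sheet; in practice: pd.read_excel("campaigns.xlsx")
excel_sheet = pd.DataFrame({"article_id": ["A1", "A2"], "campaign": ["spring", "summer"]})

# The blend happens in memory, on the fly; nothing is written back to a store
db_sales = pd.read_sql("SELECT article_id, units_sold FROM sales", conn)
blended = db_sales.merge(excel_sheet, on="article_id", how="left")
print(blended.groupby("campaign")["units_sold"].sum())
conn.close()
```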

Conclusion

While this article disentangles the different processes and terminologies involved in data integration, most of these processes overlap and partially depend on one another. For example, data reconciliation cannot happen without data mapping and merging, and the terms are often used interchangeably. Enriching a data set is impossible without mapping, merging, and reconciliation. The aim of all these processes is usually to create a consolidated data set, so data consolidation is often used as a single term for the whole process. Data consolidation and merging are both about combining data into a single data set, but they have a different focus, i.e., shared data storage vs. merging individual data entries. Data cleansing can happen separately or as part of the other processes, and data blending can be based on consolidated or separate data sets. The seven steps presented in this article serve as a rule of thumb for the processes that need to be applied for data integration and the order in which to apply them.

The need for data reconciliation and consolidation (or whatever other term applies to a specific use case) appears clear and commonly accepted. The more significant challenge is how to get there. The data architecture or pipeline may vary; there might even be a ready-made software product that handles these questions. However, it is essential to note that combining data ultimately remains a data problem that requires customized solutions based on the actual data and business use case. CID can support you in finding a customized solution for your data problem.




Author © 2024: Lilli Kaufhold – www.linkedin.com/in/lilli-kaufhold-467659110/
