“So, you finally got it?”
“Well, your Data Hub!”
“Oh yeah, that’s great, we implemented it right after our Data Lake!”
“But then what are the benefits?”
“Well, first of all, imagine the Data Lake!”
“We put a lot of stuff in it, so we are able to do plenty of things with it, right?”
“OK, like what?”
“Well, many kinds of things… Things you can’t even imagine…”
It is funny to see how our job is often shaped by “buzzwords” that are sometimes poorly mastered or badly used. These terms accompany the emergence and adoption of more or less innovative technologies, and so it is with Big Data technologies and the implementation of these famous Data Lakes and Data Hubs.
But what are we actually talking about? Of course, new technologies and tools make it possible to overcome previous limitations (performance…) or to lower costs (storage…), which can change ways of working and lead to different solution designs, but without questioning what really makes these systems particular: their functionalities and their benefits to business users.
This is what is often forgotten, particularly because these systems can be seen as fads: “I need to set up my Data Hub”, because that is the current trend. The “initiative” may even be “sold” to business users because everyone is talking about it, but for what functionalities and benefits?
Indeed, data hubs did not wait for the arrival of Big Data to exist; architectures for operational data integration have been around for a long time and cover several functionalities, depending on the maturity of the solution and the involvement of the business stakeholders:
- Centralization of data flows: from an architectural perspective, this is obviously the primary objective of a data hub. We move from the (in)famous spaghetti dish of directly connected flows between systems to a centralized approach built from half incoming and half outgoing flows. Each half flow can operate in “push” or “pull” mode, depending on the systems involved.
- Storage: to support a half-flow strategy and historical recording needs, data persistence must be ensured in the Data Hub. This persistence can depend on a given “validity period” of the data, in accordance with the business processes that carry it. A “Data Lake” can also be used for this function, to minimize storage costs. The different layers of a Data Lake can then be set up to distinguish between incoming raw data and outgoing transformed data (see below), or even to allow direct access and use by business users.
- Data consolidation: as such a centralized system is meant to carry all the operational data exchanges, it is an opportunity to define all the business objects from a global and universal perspective. The definition of each object becomes unique for everyone and for all systems: the particularities of each supplier system are taken into account, and each consumer system picks the desired attributes from this complete vision and adapts them to its downstream needs. It is therefore important to qualify these business objects with business keys to ensure such a consolidation. Here we meet the concepts of MDM and Golden Record for reference data, which can also be supported by a Data Hub when no MDM system exists.
- Tracking and traceability: it is essential to ensure that all data in transit is correctly ingested and transmitted, so as to guarantee the consistency of the data flowing out of the Data Hub and, in turn, the reliability of downstream business processes.
- Data governance: another objective, along with the centralization and therefore the simplification of data flows, is to give the initiative and the responsibility for the exchanges, reliability and propagation of data back to the business users. Even if the implementation of a Data Hub can be seen as a “simple” technical overhaul of an existing system, the involvement of business users and the set-up of real data governance will certainly help avoid gradually corrupting the new system through initiatives that are too local or isolated, which would only lead to rebuilding data silos…
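The “half flow” idea behind the centralization point above can be sketched in a few lines of Python. This is only an illustration (the `DataHub` class, topic names and records are all hypothetical): source systems push into the hub and consumer systems pull from it, so no two systems are ever wired to each other directly.

```python
from collections import defaultdict

class DataHub:
    """Toy hub-and-spoke exchange: one half flow in, one half flow out."""

    def __init__(self):
        self._topics = defaultdict(list)  # topic name -> list of records

    def push(self, topic, record):
        """Incoming half flow: a source system pushes a record to the hub."""
        self._topics[topic].append(record)

    def pull(self, topic):
        """Outgoing half flow: a consumer system pulls the hub's records."""
        return list(self._topics[topic])

hub = DataHub()
hub.push("orders", {"order_id": "O-1", "amount": 120})
hub.push("orders", {"order_id": "O-2", "amount": 75})

# The ERP and the reporting tool both pull from the same central copy,
# instead of each maintaining a direct flow to the order-entry system.
print(hub.pull("orders"))
```

With N systems, each one maintains a single connection to the hub instead of up to N−1 point-to-point flows, which is the whole point of leaving the spaghetti dish behind.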
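The consolidation on business keys described above can also be illustrated with a minimal sketch (the feeds, attribute names and `consolidate` helper are hypothetical): two supplier systems describe the same customer with partial, overlapping attributes, and the hub keeps one merged “golden record” per business key.

```python
def consolidate(records, business_key):
    """Merge records sharing the same business key into one golden record."""
    golden = {}
    for rec in records:
        key = rec[business_key]
        merged = golden.setdefault(key, {})
        # Later non-empty values complete or override earlier ones.
        merged.update({k: v for k, v in rec.items() if v not in (None, "")})
    return golden

# Two supplier systems, each with a partial view of the same customer.
crm_feed = [{"customer_id": "C-42", "name": "ACME", "email": ""}]
billing_feed = [{"customer_id": "C-42", "email": "billing@acme.example", "vat": "FR123"}]

golden = consolidate(crm_feed + billing_feed, "customer_id")
print(golden["C-42"])
# Each consumer system then picks only the attributes it needs
# from this complete, consolidated vision.
```

A real MDM tool would add survivorship rules (which source wins per attribute), but the principle is the same: the business key is what makes records from different systems recognizable as the same object.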
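Finally, the tracking and traceability objective boils down to reconciling what entered the hub with what left it. A hypothetical sketch (the `reconcile` helper and the record ids are illustrative):

```python
def reconcile(ingested_ids, transmitted_ids):
    """Return the record ids that entered the hub but never left it."""
    return sorted(set(ingested_ids) - set(transmitted_ids))

# Ids recorded at ingestion versus ids recorded at transmission.
ingested = ["O-1", "O-2", "O-3"]
transmitted = ["O-1", "O-3"]

missing = reconcile(ingested, transmitted)
print(missing)  # a non-empty list means the flow must be replayed or fixed
```

In practice such a check would run per flow and per period, so that a gap is detected in the hub itself rather than discovered later in a downstream business process.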