Limitations

One of the fundamental assumptions of this methodology, and therefore one of the crucial points of the design phase, is the determination of the Business Keys (BKs) and their immutability across every process and source system under consideration: the definition of a BK and of its attributes must be shared by all the business areas in the organization and be the same in each process.

Yet, in some cases, this immutability can be very hard to establish (the different business areas have to agree on a common BK definition) and can sometimes lead to a dead end when no agreement can be reached between them.

Indeed, in some cases the knowledge about a business object improves as related data is acquired; for example, in customer data acquisition, where Contact / Customer knowledge builds up through Sales and Marketing actions. Data consolidation and a 360° view over the whole data lifecycle become more and more complex in such a case, simply because the information available at the very beginning of the lifecycle can be too fragmented to guarantee a unique definition of a Contact / Customer (for example, only an email address is known, which then changes as the primary contact information; or surnames and first names are supplied without a date of birth, and anyway, could a date of birth even guarantee the uniqueness of a Contact?).

In any case, we can encounter situations where no real BK exists, that is, one that fulfils its role (uniquely identifying an object and thereby guaranteeing that there are no duplicates), for as long as Contact / Customer knowledge is still improving through data acquisition.

And before that point is reached, what about “incomplete” historical data that is nevertheless essential for some indicators (for example, Marketing effectiveness on the first touchpoints with Contacts) and that has to be reconciled and aggregated to obtain a true 360° view?

In those cases, if we stick to a strict notion of the BK, which must be strictly defined, unique and shared between business areas, processes and systems, we run into an obvious limitation, since a BK must be fixed in order to ensure the full traceability of an object as well as its auditability.

Possible enhancements

However, it is possible to work around this limitation and this lack of flexibility simply by relying on existing Data Vault structures, without twisting or corrupting the methodology and its definitions in any way.

Remember that the main asset of Data Vault is the flexibility and scalability of its modeling, which rests mainly on Links and their ability to represent weakly constrained, scalable relationships (that is, N-N relationships).

A particular type of Link can then be used: the “Same-As” Link. As its name suggests, it is a simple reflexive Link stating that two technically different objects (with two different BKs) represent the same business object.
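To make the structure more concrete, here is a minimal Python sketch of a Contact Hub and its Same-As Link. Everything in it is an assumption made for illustration and not something prescribed by the text above: the class and field names (HubContact, SameAsLinkContact, master_contact_hk, duplicate_contact_hk), the SHA-256 hash-key convention, and the record source values.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


def hash_key(*parts: str) -> str:
    """Hash normalized key parts into a surrogate key (assumed convention)."""
    normalized = "||".join(part.strip().upper() for part in parts)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class HubContact:
    """One Hub row: a business key as seen in one source system."""
    contact_hk: str
    business_key: str
    record_source: str
    load_date: datetime


@dataclass(frozen=True)
class SameAsLinkContact:
    """Reflexive Link: both hash keys point to the same Hub."""
    same_as_hk: str
    master_contact_hk: str
    duplicate_contact_hk: str
    record_source: str
    load_date: datetime


def make_hub(business_key: str, record_source: str,
             load_date: datetime) -> HubContact:
    return HubContact(hash_key(business_key), business_key,
                      record_source, load_date)


def make_same_as(master: HubContact, duplicate: HubContact,
                 record_source: str, load_date: datetime) -> SameAsLinkContact:
    # A Same-As Link only makes sense between two different Hub rows.
    if master.contact_hk == duplicate.contact_hk:
        raise ValueError("a Same-As Link needs two different Hub rows")
    return SameAsLinkContact(
        same_as_hk=hash_key(master.contact_hk, duplicate.contact_hk),
        master_contact_hk=master.contact_hk,
        duplicate_contact_hk=duplicate.contact_hk,
        record_source=record_source,
        load_date=load_date,
    )


# Hypothetical data: a Marketing lead known only by email, later found in the CRM.
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
lead = make_hub("jane.doe@example.com", "MARKETING", now)
customer = make_hub("CRM-000123", "CRM", now)
link = make_same_as(master=customer, duplicate=lead,
                    record_source="DQ_MATCHING", load_date=now)
```

Both Hub rows are kept as loaded; the Link only adds the statement that they describe the same Contact, so the original BKs remain traceable.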

Implementing these Same-As Links adds some complexity, but it is far from insurmountable, provided that:

  • Technical information is shared between the systems that process a piece of data, whose identification is enriched as it is gradually propagated: technical identifiers are transmitted from an upstream system to a downstream system.
  • Technical identifiers are synced back upstream, which is mainly useful when deduplication processes exist in a downstream system. Keeping data history in the EDWH makes this point optional, since historization retains every change at any given point in time, and therefore the relationships between the BK and the corresponding technical identifiers in each source system.
  • A unique reference BK is determined, based for example on the precedence of systems and the creation date of the objects, so that the same technical identifiers are ultimately used in the dimensions or axes of analysis.
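The last condition, choosing one reference BK among objects already declared "the same", can be sketched as a simple ordering rule. This is only one possible reading of "precedence of systems and creation date": the system names, their ranking, the SourceObject type and the tie-break on the BK itself are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical ranking: a lower number means the system takes precedence.
SYSTEM_PRECEDENCE = {"CRM": 0, "ECOMMERCE": 1, "MARKETING": 2}


@dataclass(frozen=True)
class SourceObject:
    business_key: str
    source_system: str
    created_on: date


def choose_reference(candidates, precedence=SYSTEM_PRECEDENCE):
    """Pick the reference object among objects linked by Same-As Links.

    Order: system precedence first, then oldest creation date, then the
    business key itself so that the result is deterministic.
    """
    if not candidates:
        raise ValueError("no candidates to choose from")
    unknown_rank = len(precedence)  # unknown systems rank last
    return min(
        candidates,
        key=lambda c: (
            precedence.get(c.source_system, unknown_rank),
            c.created_on,
            c.business_key,
        ),
    )


# Hypothetical group of objects that Same-As Links declare identical.
group = [
    SourceObject("jane.doe@example.com", "MARKETING", date(2022, 3, 1)),
    SourceObject("CRM-000456", "CRM", date(2023, 6, 15)),
    SourceObject("CRM-000123", "CRM", date(2023, 1, 10)),
]
reference = choose_reference(group)
```

With this rule the CRM object wins over the older Marketing one because the system ranking is applied before the creation date; swapping the two sort criteria would give the opposite behavior, so the order is a design decision to agree with the business.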

This use of Same-As Links can of course be strengthened by Data Quality (DQ) processes that improve the identification of objects and make their deduplication more reliable; at the Business Data Vault level, the DQ processes will then supersede the syncing of technical identifiers between source systems, providing additional reconciliation criteria even if these are not part of the BK.
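As an illustration of such a reconciliation criterion, the sketch below proposes candidate Same-As pairs from an attribute that is not part of the BK, here a normalized email address. The record layout (dictionaries with business_key and email), the function names and the normalization rule are assumptions; as noted earlier, an email alone may not guarantee uniqueness, so the output is best treated as candidates to confirm rather than as established Links.

```python
from collections import defaultdict
from itertools import combinations


def normalize_email(email: str) -> str:
    """Very simple normalization: trim whitespace and lowercase."""
    return email.strip().lower()


def propose_same_as_pairs(contacts):
    """Return sorted pairs of business keys that share a normalized email.

    `contacts` is a list of dicts with 'business_key' and 'email' entries
    (hypothetical layout). Contacts without an email are ignored.
    """
    keys_by_email = defaultdict(set)
    for contact in contacts:
        email = contact.get("email")
        if email:
            keys_by_email[normalize_email(email)].add(contact["business_key"])

    pairs = []
    for keys in keys_by_email.values():
        # Every pair of distinct BKs behind the same email is a candidate.
        pairs.extend(combinations(sorted(keys), 2))
    return sorted(pairs)


# Hypothetical contacts coming from two source systems.
contacts = [
    {"business_key": "MKT-9", "email": " Ann@Example.com "},
    {"business_key": "CRM-1", "email": "ann@example.com"},
    {"business_key": "CRM-2", "email": "bob@example.com"},
    {"business_key": "MKT-7", "email": None},
]
candidate_pairs = propose_same_as_pairs(contacts)
```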

All of this is possible while preserving the base definition of Data Vault objects and the traceability constraint that the methodology demands!