
Data Science and Data Vault 2.0: Part 1, Gold-Panning, AI and Wild-West Data

This is the first part of a three-part blog series based on a keynote talk at this year’s World-Wide Data Vault Conference on the role of Data Vault 2.0 in supporting Data Science. This first post focuses on how to feed data to Data Science projects.


Data Vault 2.0 is a good fit for supporting Data Science projects. Its methodology delivers data quality, veracity and uniqueness as standard, addressing some of the common bug-bears that plague many Data Scientists. However, the relationship between the Data Warehouse and AI models is often one-sided. Many Data Science projects compromise the core principles of Data Governance and warehousing to hoover up data chaotically, and after those compromises they provide little opportunity for the warehouse to capture AI model outputs. But Data Science in the business is maturing rapidly, and it’s ready to play by the same rules as everyone else.


Data Vault 2.0’s success is built on its foundations in pattern-based architecture and implementation. Still, Data Science has proven the stubborn exception to many of the core rules of the methodology. This quirk of Data Vault 2.0 isn’t accidental. Data Scientists are unusual amongst data users in how they operate, preferring messy, raw data; large sandpits of server and disk space to play in; and expensive one-off data extracts for training. The standard Data Vault 2.0 answer is to give Data Scientists access to Persistent Staging Areas (PSAs) and the Raw Vault, where no other business user would be allowed access.

This PSA-access approach is born from a traditional view: Data Science should sift and pan through an entire Data Warehouse to find unexpected statistical nuggets of gold. However, Data Science has matured since Data Vault 2.0 was first established. These scientists are now usually tasked with far narrower objectives: improving existing models, developing specific knowledge products, or identifying operational improvements in individual business processes. Now that we’re past “AI gold-rush” Data Science, you don’t have to opt for a wild-west approach to data provision.


There is no need to push Data Scientists towards PSAs; there is even a danger in keeping them there. In the PSA, Data Scientists miss out on standard data cleaning and quality processes; data schema change management; and even the separation of test or incorrect data from the single-point-of-truth records found at the information mart layer.


Furthermore, in the absence of governance policies in the PSA, AI models can hoover up and train on sensitive data, and even regurgitate it under the right conditions or in malicious hands.


The better approach is to treat Data Scientists like any other business user and serve their narrower scope of work with data through information marts. These Information Marts may be denormalised or flattened “big” tables rather than star-schema Marts. Where they need raw data or metadata, Data Vault 2.0’s Data Marts and Meta-Marts can provide them with hands-off access to the information they need. Why let your Data Scientists spend all their time building their own pipelines when you can provide the right marts for them directly?
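
As a concrete illustration (not from the original talk), here is a minimal sketch of what mart-first access might look like in practice, assuming a relational warehouse reachable over SQLAlchemy. The connection string, schema, mart and column names are all hypothetical placeholders: the point is that the Data Scientist queries one governed, flattened table rather than assembling a pipeline over the PSA or Raw Vault.

```python
# A minimal sketch of mart-first data access for a Data Science project.
# All names (connection string, info_mart schema, customer_churn_features mart,
# column names) are hypothetical placeholders.
import pandas as pd
from sqlalchemy import create_engine

# Read-only, governed access to the warehouse for the Data Science team.
engine = create_engine("postgresql://ds_reader:change_me@warehouse/dv")

# One denormalised "big table" information mart replaces a hand-rolled
# pipeline of joins across Raw Vault hubs, links and satellites.
features = pd.read_sql(
    "SELECT customer_key, tenure_months, avg_monthly_spend, churned "
    "FROM info_mart.customer_churn_features "
    "WHERE snapshot_date = '2024-01-31'",
    engine,
)

# The scientist works with cleaned, single-point-of-truth records,
# not data panned out of the PSA.
X = features.drop(columns=["customer_key", "churned"])
y = features["churned"]
```

The design choice this sketch illustrates is simply the one argued above: the cleaning, deduplication and governance live in the warehouse layers, so the Data Science code shrinks to a query against a purpose-built mart.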


In short, Data Science no longer has the justification it once did to break the rules for data access in Data Vault 2.0. Treat Data Scientists like any other business user with mart access. They will benefit from your data cleaning and centralisation, while you won’t have to compromise your governance and regulatory compliance.
