
Data Science and Data Vault 2.0: Part 3, Applying lessons

Updated: Aug 27

This blog is the third and final part of the series based on a Keynote talk at this year’s World-Wide Data Vault Conference on the role of Data Vault 2.0 in supporting Data Science. This third post covers how to apply the patterns from the previous blog to a regular Data Vault project. In the last blog, we found that implementing AI into a limited case-study warehouse gave us a few patterns in our architecture:


  • AI projects should feed only on the information delivery layer.

  • We can and should treat production-ready AI like any other soft business rule.

  • We need to capture AI development metrics like any other business source for the data warehouse.
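To make the last point concrete, capturing model development metrics as a source means loading them into the vault like any other feed. The sketch below is a minimal illustration, not a prescribed implementation: the function names, the `DATA_SCIENCE` record-source code, and the column names are all hypothetical, and the hash-key convention (MD5 over upper-cased, delimited business keys) is just one common Data Vault style.

```python
import hashlib
import json
from datetime import datetime, timezone

def hash_key(*business_keys: str) -> str:
    # Data Vault style hash key: MD5 over trimmed, upper-cased keys
    # joined with a delimiter, so the same keys always hash the same.
    joined = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

def model_run_records(model_name: str, version: str, metrics: dict) -> tuple[dict, dict]:
    # Treat the Data Science team as just another source system:
    # a hub row identifies the model run, a satellite row carries its metrics.
    load_ts = datetime.now(timezone.utc).isoformat()
    hub = {
        "model_run_hk": hash_key(model_name, version),
        "model_name": model_name,
        "model_version": version,
        "load_ts": load_ts,
        "record_source": "DATA_SCIENCE",  # hypothetical source-system code
    }
    sat = {
        "model_run_hk": hub["model_run_hk"],  # satellite hangs off the hub key
        "metrics_json": json.dumps(metrics, sort_keys=True),
        "load_ts": load_ts,
        "record_source": "DATA_SCIENCE",
    }
    return hub, sat
```

With records shaped like this, training metrics flow through the same staging and loading patterns as any other business data, and end up queryable from the information delivery layer alongside everything else.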


The general trend is that we should treat Data Science like any other business area. While it requires unusual kinds of information, and more of it, it should still retrieve data from the same place as any other team. Its activities and outputs (metrics and models) should be treated no differently from an accounting team’s reports and accountancy rules.


Let’s take a look at the standard Data Vault 2.0 architecture pattern in the first figure. The current advice is that a Data Scientist accesses the Warehouse from a Persistent Staging Area (PSA) but may dip into everything from source systems to information marts. However, there are huge issues with allowing unfettered access (in red) to the entire system:


  • Data Quality measures are meaningless if a Data Scientist accesses source data or a PSA, and AI is highly vulnerable to data quality issues. Typically, a Data Scientist will have to hand-crank their own cleaning, duplicating effort and wasting time.

  • Data Scientists are unlikely to be familiar with the idiosyncrasies and issues in source data in a PSA. Would they be able to spot test data or corruption issues handled in our hard business rules?

  • Data Scientists specialise in feature extraction and ML modelling, not data management and engineering. Yet we give them ungoverned access to vast swaths of sensitive data in the PSA, when far more data-savvy colleagues would never be allowed such access.

A final nail in the coffin for the current approach is that there is little feedback and capture of value for the expensive and disruptive activities of a Data Science team.


Capture more, break fewer rules

AI has changed a lot in the last eight years. Existing approaches to AI in Data Vault are geared towards supporting a broad, exploratory role for Data Science – an increasingly outdated role. It is no longer necessary to break many of the fundamental rules of Data Vault and warehousing to accommodate Data Science activities.

Applying the patterns we developed in the last blog, we end up with the access behaviour in the figure below.


  • Data Scientists feed primarily from the information delivery layer. Where they need raw data or metadata, we can create raw marts and meta marts to serve them rather than granting direct access to the vault.

  • Models and metrics from data science activities are captured as a data source and ingested into the Warehouse.

  • We deploy operational AI as soft business rules rather than letting it operate in isolation outside the Warehouse. This pattern is the only security concern of the three, and it will need to be managed, versioned and tested like any other business rule to be used safely.
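The soft-business-rule pattern above can be sketched in code. This is an illustrative assumption, not a reference implementation: the `SoftRule` wrapper, the `churn_score` stand-in model, and the column names are all invented for the example. The point it demonstrates is that the model is applied on the way into a computed satellite, stamped with a name and version so its outputs are auditable like any other rule.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class SoftRule:
    # A versioned soft business rule; an operational model is just one of these.
    name: str
    version: str
    apply: Callable[[dict], dict]

def churn_score(row: dict) -> dict:
    # Stand-in for a real model's predict() call; a deployed rule would
    # invoke the versioned model artefact here instead.
    score = 0.8 if row.get("days_since_last_order", 0) > 90 else 0.2
    return {**row, "churn_score": score}

CHURN_RULE = SoftRule(name="churn_model", version="2.1.0", apply=churn_score)

def load_business_vault(rows: list[dict], rule: SoftRule) -> list[dict]:
    # Apply the rule during the Business Vault load, stamping each derived
    # row with the rule's name and version for auditability.
    return [
        {**rule.apply(r), "rule_name": rule.name, "rule_version": rule.version}
        for r in rows
    ]
```

Because every derived row records which rule version produced it, swapping in a retrained model is a rule-version change like any other, with the same testing and rollback story.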


At no point does the data science team read from the PSA, the Raw or Business Vault, source systems, or any information feed lacking data quality or governance measures. Data Scientists access data safely, with little data munging, scraping, or manipulation required.
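The raw marts and meta marts mentioned earlier can be thought of as governed projections of the delivery layer: only approved columns, only rows that pass data quality checks. A minimal sketch, with entirely hypothetical column names and check:

```python
from typing import Callable

def raw_mart(rows: list[dict], allowed_columns: list[str],
             dq_check: Callable[[dict], bool]) -> list[dict]:
    # Project the information delivery layer into a governed "raw mart":
    # keep only approved columns, and only rows passing a data quality check.
    return [
        {col: r[col] for col in allowed_columns}
        for r in rows
        if dq_check(r)
    ]

# Example: serve id and score to Data Science, but never ungoverned PII,
# and drop rows that fail a simple completeness check.
customers = [
    {"id": 1, "email": "a@example.com", "score": 5},
    {"id": 2, "email": None, "score": 7},
]
mart = raw_mart(customers, ["id", "score"], lambda r: r["email"] is not None)
```

In practice this projection would be a governed view or mart built by the warehouse team, so Data Scientists get the raw-shaped data they want without bypassing quality or access controls.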
