
Data Science and Data Vault 2.0: Part 2, Case Study

This is the second part of a three-part blog series based on a Keynote talk at this year’s World-Wide Data Vault Conference on the role of Data Vault 2.0 in supporting Data Science. The second blog explores lessons from a use case in academia.


In the first blog in the series, we saw that the management of AI (Artificial Intelligence) projects has changed since the inception of Data Vault 2.0. We can treat Data Science like any other information user, not a special case that breaks governance and data management policies and approaches. But what exactly does a new approach to data science in the Warehouse look like?


We’ll take an academic data vault and AI project as a case study to answer this question. Stepping into academia allows us to separate data governance, security, and feedback requirements from a warehouse’s Data Science requirements. We’ll then look at those simplified patterns of interaction between AI activities and the Data Vault.

In this academic vault, the design priorities are primarily source management, data fusion, and data integrity. Even in smaller projects, STEM (Science, Technology, Engineering and Mathematics) research often requires us to manage multiple sources and merge disparate, awkward technical data. We also need to prove the reproducibility of our data manipulation or risk accusations of academic fraud.


The case study: Volcanoes and AI

Volcanic eruptions are difficult to predict, but we can constrain the likelihood of an eruption from patterns in localised earthquakes caused by underground magma movement. Every volcano, and even every eruption at the same volcano, behaves differently. Any AI project to find generalised patterns would need to consume a variety of events from different sources. A significant problem is the lack of standardisation in seismological data collection, with massive variation in formats, sensors, and channels. Worse, many records are digitised scans of inked chart paper from analogue seismographs.


Three significant sources exist in the case study: data from Eritrea, Etna and Montserrat, each containing multiple episodes of activity. This data passes through several “academic” business rules (a sketch of these stages follows the list):


  • Feature extraction.

  • Picking positive events from background noise.

  • Classifying each event into one of several mechanisms (for example, rock fracture, magmatic movement, or deeper tremors).
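
A minimal sketch of these three stages, assuming waveform traces arrive as plain lists of samples. The function names, features, and thresholds are illustrative assumptions, not the project's actual rules.

```python
from statistics import pstdev

def extract_features(trace: list[float]) -> dict:
    """Stage 1: reduce a raw trace to a few summary features."""
    return {
        "peak": max(abs(s) for s in trace),
        "energy": sum(s * s for s in trace),
        "std": pstdev(trace),
    }

def is_event(features: dict, noise_floor: float = 0.05) -> bool:
    """Stage 2: pick positive events against background noise."""
    return features["peak"] > 3 * noise_floor

def classify_event(features: dict) -> str:
    """Stage 3: assign one of several illustrative mechanism labels."""
    if features["std"] > 1.0:
        return "rock_fracture"
    if features["energy"] > 10.0:
        return "magmatic_movement"
    return "deep_tremor"

trace = [0.01, 0.4, -1.2, 2.3, -0.8, 0.05]
features = extract_features(trace)
if is_event(features):
    print(classify_event(features), features)
```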


The ingestion of these sources into the academic data vault model is essentially the same as in a typical Data Vault project. However, derived values in the Business Vault are usually the result of intermediate AI: rather than sitting at the end of the pipeline as a series of outbound data consumers, trained AIs are used as versioned and managed business rules.
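
To make that concrete, here is a hedged sketch of a trained classifier applied as a versioned “soft” business rule whose outputs populate a Business Vault computed satellite. The rule name, version string, and column names are assumptions for illustration, not the project's actual implementation.

```python
import hashlib
from datetime import datetime, timezone

RULE_NAME = "classify_event"     # business rule identifier
RULE_VERSION = "v1.3"            # version of the trained model applied as the rule

def hash_key(*parts: str) -> str:
    """Standard hash key over the business key parts."""
    return hashlib.md5("||".join(parts).encode()).hexdigest()

def apply_rule(event_id: str, features: dict, model) -> dict:
    """Return one computed-satellite row for a classified event."""
    label = model(features)      # the trained AI acts as the business rule
    return {
        "event_hk": hash_key(event_id),
        "load_dts": datetime.now(timezone.utc).isoformat(),
        "record_source": f"BV.{RULE_NAME}.{RULE_VERSION}",
        "event_class": label,
    }

# toy stand-in for a trained model
model = lambda f: "rock_fracture" if f["std"] > 1.0 else "deep_tremor"
print(apply_rule("ETNA-2018-001", {"std": 1.4}, model))
```

Versioning the rule in the record source means the satellite records exactly which model produced each derived value, which is what makes the result reproducible.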


As AIs become operational, we store their weights, biases, and parameters in the Data Vault as another data source. Capturing this data improves the reproducibility of any model. Furthermore, we can avoid relying on shadow-IT data science tools such as Weights & Biases (wandb). The figure above shows a simpler standard data vault implementation alongside the academic case study, with AI components in green.
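
A minimal sketch of loading a trained model's parameters back into the vault as another source: a model hash key plus a satellite row whose hash diff detects when the weights change. The table and column names, and the example parameters, are illustrative assumptions.

```python
import hashlib
import json
from datetime import datetime, timezone

def hash_diff(payload: dict) -> str:
    """Deterministic digest over the parameter payload for change detection."""
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

def model_satellite_row(model_name: str, params: dict) -> dict:
    """One satellite row describing a trained model as a data source."""
    payload = {"params": params}
    return {
        "model_hk": hashlib.md5(model_name.encode()).hexdigest(),
        "load_dts": datetime.now(timezone.utc).isoformat(),
        "record_source": "DS.model_registry",
        "hash_diff": hash_diff(payload),
        "payload": json.dumps(payload),
    }

params = {"layers": [64, 32], "learning_rate": 0.001, "epochs": 40}
print(model_satellite_row("event_classifier", params))
```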


Patterns in supporting your AI projects from your Warehouse

Several patterns emerge once a Data Vault is focused on AI operations:

  1. AI consumes data from information marts, not from a PSA (Persistent Staging Area) or the raw vault.

  2. AI is also a business process. We don’t treat it any differently than any other “soft” rule.

  3. Some of the information AI needs exists in raw form, but much of it doesn't. Even in a limited example, AI still needs information marts to be adequately served.

  4. Model outputs are treated as a new warehouse source, not as discarded metrics (a sketch of this pattern, together with pattern 1, follows the list).
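
A hedged sketch of patterns 1 and 4 together: the model reads only from an information mart, and its outputs are staged as a new warehouse source rather than thrown away. The mart contents, the score() function, and the source name are assumptions for illustration.

```python
from datetime import datetime, timezone

def read_information_mart() -> list[dict]:
    """Stand-in for a query against an information mart, never the PSA or raw vault."""
    return [
        {"event_id": "MONT-1997-042", "peak": 2.3, "std": 1.1},
        {"event_id": "ERI-2011-007", "peak": 0.4, "std": 0.2},
    ]

def score(row: dict) -> dict:
    """Stand-in for the trained model scoring one mart row."""
    label = "rock_fracture" if row["std"] > 1.0 else "noise"
    return {"event_id": row["event_id"], "predicted_class": label}

def stage_as_new_source(predictions: list[dict]) -> list[dict]:
    """Model outputs become another source feeding back into the raw vault."""
    load_dts = datetime.now(timezone.utc).isoformat()
    return [{**p, "record_source": "AI.event_model", "load_dts": load_dts}
            for p in predictions]

predictions = [score(row) for row in read_information_mart()]
for staged_row in stage_as_new_source(predictions):
    print(staged_row)
```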

In the next blog, we’ll apply these patterns to a regular data vault implementation outside an academic case study.
