
SAS taught Patrick Cuba to survive in the data world and inspired a guidebook on how to build a Data Vault

Updated: Aug 27

Patrick Cuba has spent more than two decades as a consultant specialising in data modelling on SAS platforms, working in South Africa, Australia and the UK. Yet he tackled his first Data Vault project less than five years ago.

When the coronavirus pandemic struck just over a year ago, Patrick saw the need for an all-inclusive, step-by-step guide to the decisions and processes involved in creating an agile enterprise data warehouse, for both the Cloud and on-premise servers. He went on to self-publish ‘The Data Vault Guru: A Pragmatic Guide on Building a Data Vault’.


Datavault’s Andrew Griffin spoke to Patrick about his varied experiences in consultancy work, how the book came about, what he hoped to achieve by writing it – and the feedback he has received from around the world over the last seven months.


Q) You have some 20 years working in the data industry. How did you start?


A) Well, I was born in Brazil but grew up in South Africa, where my father was an electrical engineer. I studied for an IT diploma in Johannesburg before completing a bachelor’s degree in the subject. My first job was at IBM, where I worked as an IT specialist from 2000 to 2002 while I studied. After completing my studies I found I wanted more from my job – unfortunately, my role within IBM wasn’t going to change any time soon.

Through a student friend, I was hired as a SAS consultant without knowing what SAS was; after a week’s transition I was the SAS data warehouse administrator!

The clients varied, but I found I cut my teeth in consulting when placed at a debt collection company, working under Dr Kendal Jordi. He really moulded the way I think about consulting – he would leave a meeting having told the customer how they should run their business. He was that good.

Q) The odd thing about the business was that SAS was being used as an operational source?

A) Yes, that’s right – various source systems’ data was pushed to SAS, and the data warehouse would publish corrections back to the source systems daily.

Just before the global financial crisis, I moved to Australia – thanks to those SAS skills – and I spent two years in Perth as a SAS expert.

Those same skills enabled me to move to the UK for a couple of years with another consultancy, where I worked at Bank of America’s European credit division, based in Chester.


Q) So by the time you came to the UK you were very experienced in helping to build and maintain data warehouses?


A) Yes, when I was at Bank of America my initial role was as a system architect, but it evolved into a funded programme to test new features available in SAS. I saw an opportunity to improve my own understanding of other SAS features, such as hash-maps and hash lookups, on which to base a framework for dimensional model automation.

I designed an automation framework with a structured Excel sheet as the front-end: you configured a dimension or a fact table, and the SAS code used that configuration as input to deploy it into SAS SPDS from an overnight batch out of Dallas, Texas. It took about eight months to build out with rigorous testing, and we found some undocumented “features” within the tooling, some of which required regular phone calls with SAS’s R&D guys. It’s a pity the bug fixes – or should I say, features – were never given a label like “solved in Chester by Patrick & Co.’s team”. Just kidding. But at the end of it we turned a six-hour load into 45 minutes based on this framework, and queries that ran for at least 20 minutes came down to about two minutes using star-join optimization on SAS SPDS. And the framework and automation patterns made it possible for this platform to run without editing any SAS code – you simply configured it.
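As a rough illustration of that configure-don’t-code idea – a sketch only, in Python rather than SAS, with invented names – a table-driven loader can render one generic load template for whatever table the configuration describes:

```python
# Illustrative sketch only: the real framework was SAS code driven by a
# structured Excel sheet. Names and structure here are invented.
from dataclasses import dataclass, field


@dataclass
class TableConfig:
    name: str                    # target table, e.g. "dim_customer"
    table_type: str              # "dimension" or "fact"
    keys: list = field(default_factory=list)        # key columns
    attributes: list = field(default_factory=list)  # non-key columns


def generate_load_sql(cfg: TableConfig) -> str:
    """One generic template serves every table of the same type, so the
    configuration, not hand-written code, defines each load."""
    cols = cfg.keys + cfg.attributes
    key_match = " AND ".join(f"t.{k} = s.{k}" for k in cfg.keys)
    return (
        f"MERGE INTO {cfg.name} t USING staging_{cfg.name} s ON {key_match}\n"
        f"WHEN NOT MATCHED THEN INSERT ({', '.join(cols)})\n"
        f"VALUES ({', '.join('s.' + c for c in cols)});"
    )


print(generate_load_sql(TableConfig(
    name="dim_customer", table_type="dimension",
    keys=["customer_id"], attributes=["name", "segment"])))
```

The payoff, as Patrick describes, is that adding a new table becomes a configuration change rather than a code change.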


After that I moved back to Australia.


Q) How did you first come to build a data warehouse using the Data Vault 2.0 method?


A) At the time, I didn’t understand that there was Data Vault 1.0 and Data Vault 2.0.

However, through a Queensland-based consultancy we clubbed together a few contractors to go on a Data Vault course run by Genesee Academy.

Well, in the middle of the training it struck me – “All of this stuff can be automated really easily on SAS.”

The work I had done at Bank of America had shown me that when you have a limited set of table types to load, all you need to do is configure and load them – and I already had experience doing exactly that. I just needed to think about how to do it for a Data Vault on SAS.

I designed just such a pattern-based automation engine – SAS with an MS SQL Server back-end and an Excel front-end – and it formed the basis of my presentation at the SAS Global Forum in 2018.
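Because Data Vault reduces the warehouse to a handful of table types, one repeatable load pattern per type is all such an engine needs. Here is a minimal, hypothetical sketch of the hub pattern – Python with invented names, standing in for the SAS/MS SQL Server/Excel implementation described above:

```python
# Hypothetical hub-load pattern: every hub uses the same logic, and only
# the configuration (table name, business key column) changes.
import hashlib
from datetime import datetime, timezone


def hash_key(*business_keys: str) -> str:
    """Data Vault 2.0-style surrogate hash key: a deterministic digest
    of the standardised business key(s)."""
    composite = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(composite.encode("utf-8")).hexdigest()


def load_hub(hub: dict, staged_rows: list, key_col: str, source: str) -> None:
    """Insert-only load: add business keys not yet present in the hub."""
    load_dts = datetime.now(timezone.utc)
    for row in staged_rows:
        hk = hash_key(row[key_col])
        if hk not in hub:
            hub[hk] = {key_col: row[key_col],
                       "load_dts": load_dts,
                       "record_source": source}


hub_customer = {}
load_hub(hub_customer,
         [{"customer_id": "C-001"}, {"customer_id": "C-002"}],
         key_col="customer_id", source="crm")
print(len(hub_customer))  # 2 distinct business keys
```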


Q) What was the reaction to your presentation?


A) Well, the only problem is that the SAS Global Forum is always held in the US, yet they want the presentations to appeal to people from all over the world – SAS Australia sponsored the authors of accepted papers to be flown to the US. It was a huge event, with thousands of people attending and hundreds of presentations held over several days.

A few weeks before the event I received an email from SAS themselves stating that there seemed to be quite a bit of interest in my paper and that it would be a recorded session. So I did my piece, got asked a couple of questions and, given the attendance numbers, did not think too much more about it afterwards. I did what any traveller does: I spent a few days around the event exploring Colorado and the ski scene near Frisco, and then flew home.

At the next SAS user group meeting in Australia, I was surprised that a few people came up to me and said, “Oh, we heard your presentation went quite well.” And I received some very positive messages on LinkedIn and by email, even from my former manager at Bank of America. “You’ve come a long way since you did that work for us,” he said.

Now when you type in “SAS Data Vault” on Google it’s the first paper on the list!


Q) So from that point, were you sold on Data Vault?


A) Well, not yet. Over the years I have always had an interest in data modelling; I did some on SAS and even used other tools like IBM Data Architect, Erwin Data Modeler and some less exclusive products.


While working for another consultancy I got the call to join a consultancy that specialises in Data Vault 2.0. In my first week, they put me on Data Vault 2.0 training under Nols Ebersohn.


I feel lucky to have been paired with individuals who are so understanding and who value teaching the way they do. After training we were even placed on the same site, engaged on a scoped two-month piece of work.


With Nols as the proverbial echo-chamber, I learned a lot of the Data Vault 2.0 philosophy from him – and on the very rare occasion I even managed to change his mind on a few Data Vault 2.0 things.


There is a lot of free information in the market, and to be honest I didn’t see the real difference between Data Vault and Data Vault 2.0 until the last few years, through consulting.


I hear people compare Data Vault 2.0 to Kimball modelling when really that should not be the focus of the discussion. Worse, when some professionals describe Data Vault 2.0 as being Data Vault 1.0 with surrogate hash keys and dropped end-dates, they haven’t really been exposed to the philosophy behind Data Vault 2.0, or to why the old thinking on building (legacy) data warehouses no longer applies to the modern data warehouse.


Sure, you can buy the next best thing off the shelf – look at how Snowflake has re-engineered how a Cloud data warehouse should be delivered. But changing the dominant thinking from the data warehouse of old takes the discipline and investment to understand the philosophy behind the change. If you choose to skip ahead and breeze through the important parts (i.e. understanding the mechanics), all you are really doing is buying new toys but using them the same old way.


Q) So what was the main motivation for writing ‘The Data Vault Guru’?


A) I found that online there are a lot of conflicting ideas around Data Vault; some straddle the line between what is a Data Vault 1.0 versus a Data Vault 2.0 implementation – and in truth they are very different. Some practitioners, who have never been certified or trained on Data Vault 2.0, call their implementation Data Vault 2.0 simply because they adopted surrogate hash keys and dropped end-date columns in satellite tables. In fact, that is not the only difference – you could write a book about it.
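To make the satellite point concrete – a minimal, hypothetical sketch with invented column names, not a prescription from the book: dropping the end-date makes the satellite insert-only, so the current record is derived from the load date at query time instead of being maintained by updates in place:

```python
# Illustrative only, invented names: an insert-only satellite with no
# end-date column. Every change lands as a new row keyed by load date.
from datetime import datetime

satellite_rows = [
    {"hub_hk": "9f2c", "load_dts": datetime(2020, 1, 1), "balance": 100},
    {"hub_hk": "9f2c", "load_dts": datetime(2020, 6, 1), "balance": 250},
]


def current_record(rows: list, hub_hk: str):
    """The 'current' view is computed, not stored: take the latest load
    date per key rather than updating an end-date in place."""
    candidates = [r for r in rows if r["hub_hk"] == hub_hk]
    return max(candidates, key=lambda r: r["load_dts"], default=None)


print(current_record(satellite_rows, "9f2c"))  # balance: 250
```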

I felt what was needed was a guide – in other words, something that says, “Yes, build this way,” but also explains why you are building it that way.


Q) So once you had conceived the idea, how did you get to actually write it?

A) Well, thanks to the coronavirus pandemic, I found myself between consultancy projects and had the time to commit to the task.


So I started writing in February 2020 and was able to publish in September. I had unintentionally been practising for the book: I write articles on platforms like Medium and LinkedIn, and a few of them have even been published by the Data Vault Alliance. The reaction and feedback helped me solidify my thinking, and some articles have sparked a lot of debate – which is awesome. It helps get everyone on the same page with regard to implementation and architectural techniques.


‘The Data Vault Guru’ is based on a framework that is agnostic to the technology underneath, and it also emphasises the techniques available to make your Data Vault perform on your platform of choice.


Q) So what approach did you take to try and deal with those distinctions you felt needed to be made?


A) Practitioners of Data Vault 1.0 often over-modelled their solutions – think of bag-of-keys and keyed-instance hubs – and, in my opinion, built technical debt into the model.

Data Vault 2.0, on the other hand, focuses on the scalability of the solution.

To deliver a successful Data Vault 2.0 solution, you must keep the modelling simple while remaining cognisant that you are integrating multiple data sources into a single enterprise Data Vault model – and Data Vault 2.0 has the proven techniques to do exactly that.


Q) You chose to split the book into five different areas – can you summarise them?


A) Part one lays out the data landscape you have to play with, and suggests that the reader should at least understand the business architecture when building an analytics architecture to support it.

Part two is probably where readers will want to skip to: the classic raw vault and business vault, with all their variations and techniques.

Part three discusses automation of the Data Vault components – every single one with a proposed test framework – and a chapter is dedicated to automating timeline correction as well.


Part four discusses techniques and patterns for getting data out of the Data Vault. Make no mistake, you have added a lot of artefacts to the data warehouse – although they are based on just three table types – and you need to know how to get the data out.


Finally, part five gets into modelling scorecards and Data Vault implementation variations like schema vault and jira vault, and a chapter is even dedicated to building your own automation tool.


I would say the book attempts to get the reader to deliver a solution geared towards modern data platforms, and to tear the reader away from thinking in terms of how data warehouses were delivered in the ’90s and noughties.


Q) So after producing such a structured and comprehensive guide, do you think you missed anything?


A) Even after writing the book, I feel I could have added more. For example, I attended a webinar on a Data Vault 2.0 implementation on Snowflake, and the speaker mentioned that a top tip is to use Base64 to convert the hash keys into text for Power BI.

That is totally the wrong implementation. You should never be joining Data Vault tables inside your BI tool. Hence I continue to blog!


Q) So do you think it will be a case of adding a second edition or a revised edition in the not-too-distant future?


A) In the last month I have added an errata section on GitHub. So if I find I have made any mistakes, say in a section about testing, I can go there and change it straight away. I did find one while preparing for a presentation recently and thought, “That’s not right!” But I am not going to go back and change the book; I will just update the errata section, and hopefully it will remain pretty small. And given where I am in my knowledge of Data Vault 2.0 right now – and my exposure to it – I don’t think I could add to the book.

I do think though, there is going to be some evolution, as Dan Linstedt has alluded to, and I hope I get some exposure to that and learn from it as well.


Q) So would you also rule out writing a book on anything else you have learned from your 20 years of consultancy work?


A) Well, you have seen what John Giles has done with his books on data modelling and Data Vault – he hasn’t focused on hubs, links, satellites and all the nitty-gritty stuff. What he has done is say, “Well, actually, when you are modelling, you should consider the enterprise view of Data Vault modelling.” There’s a lot of exciting stuff going on in the market – and I guess I got really lucky with the exposure I have had.


Even as recently as three years ago, I was learning how distributed computing systems work – Apache Cassandra and Spark, and things like Presto, Apache Airflow and dbt. The pace of innovation in these technologies really blew my mind. There’s so much clever stuff going on out there.


I like to keep learning and progressing, and I am always interested in picking up new things, especially if they are data-driven. I saw Data Vault as one of those things that is business-driven and can be delivered as a data-driven solution if it’s done right. So if something new comes along and I am in the right position and have the time to pursue it, of course I would be interested – but I never want to write about something when I don’t really know what I am talking about.

The American economist Thomas Sowell once said that “kids these days are taught it’s very important to have an opinion, but what they are not taught is to know what they are talking about.” I have to agree with that whole-heartedly.


Q) What kind of feedback have you been getting since your book was published, and have you been pleased with the reaction from the Data Vault community?


A) Yes, the direct feedback has been very good, and I have seen some positive reviews online as well. I got one from Germany, which Amazon translated – it pinpointed the things I really focused on in the book, which was pleasing.

It also stated that the content is well-illustrated with great examples – again, exactly what I was aiming for. The reviewer also made the point that he hasn’t had an ad hoc question about Data Vault that he couldn’t find the answer to in the book.

A lot of the art I created I have uploaded to GitHub, along with Visio stencils you can even import into something like Lucidchart (which has a plugin for Confluence).

So if anyone who is working on a project or presentation wants to reuse any of them, I say, “Go for it.” It’s free – under Creative Commons.


Q) So what other books have you read that influenced your thinking, and which you might recommend?


A) There is a great book by Martin Kleppmann – ‘Designing Data-Intensive Applications’ – which gets into building systems based on some complex algorithms. He punctuated the content with the kind of old-style world maps the early explorers used, to illustrate keywords and concepts, which I really liked. Without plagiarising it, I thought I could use something similar in my book, so I added my own photography to delineate parts of the book.

Another book I found really intense was one on streaming systems – here you needed to think about the Kappa architecture, data in motion (unbounded data) versus data at rest, watermarks for handling data that arrives out of order, sliding windows, and so on.

Both books were recommended to me by Bruce McCartney – he’s based in Canada and is one of the authorised Data Vault 2.0 trainers.


Q) And when you are not deeply absorbed in designing Data Vaults or writing articles about them, what else do you like to do?


A) I play a fair bit of squash still. I have also renewed my interest in brewing my own beer thanks to COVID.

I first tried that when I lived in Perth, where I was paying something like $12 a pint at a pub – I guess that’s why Australia has had a craft brewing culture for so long. I bought a brew-making kit for my dad back in South Africa, and when we have Skype catch-ups we enjoy a pint and talk about different aspects of brewing. He’s massively into it.


CONCLUSION

So regardless of whether it is building a Data Vault or brewing beer, you can see a strong level of consistency in Patrick’s work. He researches the subject thoroughly, learning any new skills required; then he experiments and refines the process, maintaining and improving his knowledge along the way, before finally presenting his work to the public. So who knows – with all the interest in micro-brewing and craft beers, what would stop Patrick writing a book on that subject?


But in all seriousness, I think there is plenty more to come from Patrick on his thoughts about his first love… data-modelling – and building a Data Vault that first and foremost serves the needs of the business.

‘The Data Vault Guru: A Pragmatic Guide on Building a Data Vault’ is available on Amazon.
