The Elephant in The Fridge - John Giles
Foreword
Introduction
Acknowledgments
Chapter 1 Setting the Scene
An “owner’s manual” is a good start, but …
Did I hear a warning?
A vital context
Chapter 2 A Data Vault Primer
A bit of data warehouse history
Data Vault made (too) easy
Untangling “raw” and “business” Data Vault
Chapter 3 Task #1: Form the Enterprise View
Setting our compass bearing
Before we go any further, what is an ‘enterprise data model’?
Where can we find an enterprise data model?
How to build a “sufficient” enterprise data model yourself
Chapter 4 Task #2: Apply the Enterprise View to Design a Data Vault
The important bit: Getting business data out
Well-formed Hubs: Getting the right foundations
Well-formed Links versus “ugly” Links
What if we get it wrong?
Chapter 5 Task #3: Map Your Source Systems to the Data Vault Design
Understanding Data Vault by understanding data loading
Doing the source-to-Data Vault mapping
Chapter 6 Task #4: Using Business Rules to Close the Top-Down / Bottom-Up Gap
Business rules, and the “Transformation” bit
Applying business rules to “close the gap”
Chapter 7 Tying Off a Few Data Vault Loose Ends
A quick glance at some more advanced topics
Using a Data Vault to Shine Some Light on the “Process” World
A bit on projects
A few controversies
More uses for top-down models
In conclusion
Appendix Common Data Model Patterns – An Introduction
Pattern: Party & Party Role
Pattern: Position
Pattern: Agreement
Pattern: Resource
Pattern: Event
Pattern: Task
Pattern: Product (tangible goods, intangible services)
Pattern: Location
Pattern: Document
Pattern: Account
Pattern: Classification
Bibliography
Index
Table of Figures
Foreword
It was one year ago this month that I first met John Giles. I knew
almost immediately that I had met a kindred spirit—a lover of
nature and of data. I was in Canberra, Australia to participate in
Steve Hoberman’s Data Modeling Zone conference and had the
pleasure of visiting the Tidbinbilla Nature Reserve with John, Steve,
and Andrew Smailes. Even in the midst of a unique nature
experience (we saw wallabies, koalas, platypus, many colorful birds,
and much more) we slipped into data discussions. That’s the mark
of true data geeks!
Dave Wells
Director, Data Management Practice
Eckerson Group
Seattle, WA, March 2019
Introduction
The core aspects of Data Vault 2.0 are depicted in Figure 1. Perhaps
you already have a good grasp of these Data Vault fundamentals. If
so, I am genuinely pleased because it is essential to have these
foundations. If not, don’t despair – this book includes an optional
primer for those new to the wonderful world of Data Vault, though
as the picture suggests, the focus for this book is on how to shape
the model for overall Data Vault success.
Whether you come to this book with commendable prior
knowledge, or are on a steep learning curve and about to
acquire foundational knowledge, the fundamentals are essential.
However, there’s a serious warning: Data Vault project experience
suggests that just understanding the basics will most likely not be
sufficient to drive a project to success.
Data Vault projects can have the methodology and architecture
topics well covered, but they can run aground nonetheless due to
modeling challenges. Modeling can sound so easy – there are only
three components (Hubs, Links, and Satellites), and they can be
learned relatively quickly. They’re not the problem.
Dan Linstedt, the founder of Data Vault, has clearly warned of some
of the danger zones, yet there are still shipwrecks, which is really
sad, given the enormous potential of the Data Vault approach.
This book intends to assist in achieving Data Vault success by
erecting a few lighthouses around the Data Vault modeling shoals,
and by providing a few navigation maps (to push the sailing analogy
a bit harder). By avoiding some of the known dangers, it is my
sincere desire that you get to deliver measurable business value
from your Data Vault investment.
So imagine one of the world’s longest straight roads, with not a tree
in sight. The drive might be boring, but at least you’re pretty safe, as
there is plenty of traffic to help you if you get in trouble. (“Plenty” is
a relative term.)
I’ve driven the Nullarbor, but I’ve also ventured well off the beaten
track, and you need to heed the warnings as there’s no-one likely to
come your way for a long time if you’re stuck. Such as the time my
wife and I and some friends were heading along one of my infamous
“shortcuts”. We could have gone the long way round, but why tackle
dirt roads that are corrugated from all of the traffic (at least one car a
day, apart from the wet season when the roads are totally
impassable)? So we tackled my shortcut.
It was really nothing more than a disused track in the middle of
nowhere. There were two major problems. Some of the time we
couldn’t even find the track, so we had to stop the cars and fan out
looking for the next visible hint of where it might be. And when we
did find it, it had obviously been a number of years since anyone
had last come along, as there were small trees that had grown between the
wheel ruts – we had to push them over just to get through. One
stretch of 70 kilometers (40 miles) took a full two days. My shortcuts
haven’t made me popular with most of my travelling companions.
Finally we got to the end of the track and pulled up on a picture
perfect beach that I’d assured my friends would be wonderful. What
could be better than sun, surf, sand, and the whole beach to
ourselves – apart from the resident salt-water crocodiles. I wasn’t
popular. Again.
Here’s the message. Australia has some great places, and many
visitors who have ventured beyond the cities have loved the
experience. But there are very real dangers. Heed the warnings, and
you’ll have a great time. Ignore the warnings, and things can go
badly wrong.
Dan Linstedt, the founder of Data Vault, has also issued a number of
warnings. Not about the Aussie outback, but about how to build a
Data Vault. Some people ignore travel warnings about the Aussie
outback, and come unstuck (or stuck as the case may be). Some
people ignore Dan’s warnings, too, and wonder why things don’t go
so well.
A vital context
It’s easy to get lost in the detail of any book, and wonder how all the
pieces are going to fit together. To maximize your pleasure, and the
value you get from reading this book, I want to quickly lay out the
roadmap. It’s not too hard, really. Perform 4 tasks and you are on
your way to Data Vault success. Well, at least that’s the goal.
Task #1: Form the enterprise view, seeding the Data Vault project
We start by collecting knowledge that hopefully is freely available
within our organization.
This is at the heart of top-down design for a Data Vault. Dan quite
correctly says we need an “enterprise ontology” to shape the design.
Many data model professionals will talk of enterprise (or
“business”) data models, enterprise conceptual models, or
enterprise logical models. David Hay, in “Achieving Buzzword
Compliance”, provides precise definitions, and even goes so far as to
identify three types of conceptual models! Here, I am going for
something simpler. And for the sake of this book, I am pretty
flexible as to what we call this thing. Yes, it has an enterprise-wide
view, and it’s a model of the data we’re interested in. I guess that
makes it an enterprise data model.
But what does it look like? For us, it’s got two simple parts.
I like to start with a one-page diagram identifying the major data
subject areas of the enterprise, with an icon for each, typically
based on generic data model patterns.
I then do a drill-down from the generic data model patterns into
the specifics of the organization. This includes what some will
refer to as the “taxonomy” – the hierarchy, from supertypes (that
represent the patterns but often are too generic to represent the
business directly) down to subtypes (that do represent the
business).
It’s a light-weight framework, and it can be assembled in weeks (or
even days), assuming you’ve got a working familiarity with the
patterns.
Task #2: Design the Data Vault, based on the business view
We’ve now got a light-weight view of how the business sees their
data. Next we want to map that into our Data Vault design. Again,
it’s top-down.
Hubs are selected from a sweet-spot on the supertype / subtype
hierarchy – the taxonomy.
Links are constructed based on business Links between business
Hubs!
Business Satellites specify how the business wants to see their
data attributes, not how the source systems name and hold the
data items. These business-centric Satellites are likely to become
conformed Satellites, used to drive Data Marts a bit later. At this
stage, they are a first-cut capture of the target information the
business wants.
Please note something very important. For these two tasks, we
haven’t even spoken about source systems. We don’t want to build a
“source system Data Vault”. We want to deliver tangible value to the
business, and we want the Data Vault to reflect their business view,
not a technology-based view, so we start with the business, then of
course we must move to the hard reality of what data is available to
us – the next task.
Task #3: Bottom-up Source-to-Data Vault mapping
We’ve already got business-centric Hubs, Links and Satellites
defined. This is where business-centric Data Vault design
distinguishes itself from source-centric Data Vault design – we want
to map to the business objects, not to source-centric Hubs or Links
unless we absolutely have to. Satellites are a little different.
We will go into a lot more detail later, but let’s look quickly at this
topic.
If we’ve got 50 source data feeds with Customer data, we don’t want
50 Customer Hubs, one per source, with one Satellite each! Instead,
we can create Source-specific Satellites, attached to just one
business-centric Hub. You’ll notice that I’ve crossed out the Hub in
the source-centric box. It’s something we rarely if ever want.
Links can be a bit more tricky. The short story version is that if the
source feed has Links that map neatly to the business-centric Links
that have already been identified, that’s where they map. However,
you will often encounter Links that represent transactions / events
as presented by source system data feeds, and these may require
construction of their own Links. (And these Links may require
Satellites, too, but that’s a separate and sometimes contentious
topic.)
At this point we’ve done the mapping to populate probably all we
can from direct source feeds. I say “probably”, because, for example,
we might have a source-system feed whose data attributes match
directly to the business-centric Satellite design the information
consumers want. If that’s true, you should consider yourself lucky.
Most of the time, we need a little bit of data transformation to shape
our data ready for consumption, and that’s the next task.
Task #4: Define business rules
We’ve designed the business-centric Hubs, Links and Satellites as
our targets in Task #2, and we’ve done mapping to populate the Data
Vault in Task #3, but not all of the target objects have been included.
There are some gaps, and we need business rules to fill them.
The two most common forms of business rules are:
For those not yet familiar with the marvelous world of Data Vaults,
first let’s kick off with a Data Vault primer. If you are already well
acquainted with where Data Vault fits in relation to the other data
warehousing approaches of Ralph Kimball and Bill Inmon, and
comfortable with the Data Vault modeling fundamentals, you
may prefer to go straight past this entire Data Vault primer and get
your teeth into Dan’s “ontology” stuff in the next major section,
“Task #1 – Form the Enterprise View”.
The same operational systems all feed into one central “normalized”
repository. Data from there is then pushed out in forms ready for
easy consumption, including, of course, Data Marts.
One advantage of this approach is consistency – all reports are
sourced from one, clean, uniform version of the “truth”. Another
advantage appears over time – the data supplied from each single
source feed can potentially be reused to support multiple Data
Marts. But this approach does have a price tag. The first Data Mart to
be supported has an overhead not needed by the Kimball approach.
Then along comes Dan Linstedt with his Data Vault architecture.
Some may say, “We’ve already got turf wars between the followers
of the Kimball versus Inmon approaches, so why on earth do we
need a new player in the field?” Perhaps we can answer that by
looking at the origins of Data Vault.
They say that necessity is the mother of invention. Dan had a client
in the 1990s whose demands pushed the boundaries of Data
Warehousing. I understand that they wanted to store petabytes of
data. That’s a million gigabytes, if you’re not familiar with the
term. Storing that much data is one thing; getting timely results
from queries is another. Hence, Data Vault was born.
A more recent Data Vault implementation is at Micron, the
computer chip manufacturer. They add billions of rows to their Data
Vault each day. It’s probably fair to say that the title of the book by
Dan Linstedt and Michael Olschimke highlights one of the many
reasons people adopt Data Vault – they’re in the business of
“Building a Scalable Data Warehouse …”
But scale is only one reason for looking at Data Vaults. Frankly, most
of my clients have very modest requirements if judged by “big data”
considerations such as volume and velocity, but they may be
interested in another of the big data V’s - variety. When you’re
required to ingest not just predictable structured data, but also
unstructured and semi-structured data, Data Vaults’ ability to
accommodate all such forms is impressive.
Then there’s agility and extensibility. Wouldn’t we all love it if we
could quickly incorporate changes, or even better, if we got things
“right” the very first time? Data Vault isn’t some silver bullet that
solves all problems, but it certainly eases the pain we will all face.
For example, the philosophy that we always insert new stuff and
never update what’s gone before is a breath of fresh air, avoiding
tedious reloading.
The “sales pitch” for Data Vault goes on and on, and reflects ever-
changing demands. Some want real-time feeds. No problem. Some
want a service-oriented architecture that interacts dynamically with
operational systems and the Data Warehouse. Data Vaults can
accommodate that. Many people want full traceability of the data
lineage, going from the view presented to data consumers, right
back to the operational systems that sourced the data. And again,
Data Vault supports this need.
Some batch processing approaches are based on ETL (extract-
transform-load), while others adopt ELT (extract-load-transform).
The ETL approach often aims to clean up the data on the way into
their data repository. In contrast, Data Vault has the goal to load “all
the data, all the time” into the Data Vault, and then apply business
rules to clean it up. If we later decide we got the rules wrong, we can
change them and regenerate the derived Data Vault artifacts. We
haven’t discarded the “dirty” data on the way in. What’s more, the
dirty data often is very instructive.
Enough of the “sales pitch”. Let’s take a very brief look at some of
the Data Vault architectural components and their context:
A case study
I talk of “hypothetical” scenarios such as emergency response to
fires, floods, earthquakes and the like, but unfortunately in many
parts of the world these scenarios are far from hypothetical.
We start with Hubs. Like the name suggests, they are at the center of
data structures in a Data Vault. They are based on business
concepts, and hold business keys. The word that appeared twice in
the previous sentence was “business”. More on that later, but I do
want to draw your attention to the fact that building a Data Vault is
not just a technical exercise; it must involve the business.
For the sake of simplicity, we will assume that a Fire is a recognized
business concept, with a business key of Fire Reference Number.
Likewise, let’s assume that a Fire Truck is a recognized business
concept, with a business key of Registration Number.
So now we’ve got two Hubs, and they’ve both got a unique index for
the nominated business key. Relational databases typically expect a
single key to be nominated as their “primary key”. Every row in the
table must have a unique value for its primary key. We could
obviously make the business key the primary key for the table. After
all, it is unique. However, in Data Vault, we may choose to use
another mechanism, often called a “surrogate key”. We probably
don’t need to go into details of this style here. Suffice it to say that
it’s typically a unique but meaningless number generated by some
software. Here, in a Data Vault, one reason for using a surrogate is to
improve performance.
In Data Vault 2.0 (again, I am using the standard as published in
“Building a Scalable Data Warehouse with Data Vault 2.0”, and I
recognize that standards may be challenged and changed over time),
this “surrogate primary key” is a hash key – another technical
artifact. Put simply, the text string value of the natural business key
is thrown into a mathematical algorithm, and out pops some
number. The important thing is that the same text string will always
produce the same hash key value. The previous version of Data Vault
had a sequence number that served a similar purpose. But please
note that whether your Data Vault is Data Vault 1.0 or 2.0, the Hub’s
surrogate key is for Data Vault internal use only. It should not be
used as an alternative business key. In reality, I doubt that anyone
looking at the big, ugly structure of the value generated for a hash
key would want to use it that way!
Some may ask why we might use a hash key, and there may be
several good reasons. Put simply, this approach helps deliver good
performance in several situations. It may help for loading data
(including facilitating massive parallel processing if we need to go
that far), or with relational join performance on data retrieval, and,
as a bonus, assist with the joining of relational and non-relational
data. The last point on non-relational data is all a bit technical, and
not my area of strength, so I am happy to accept that there are
sound computer science reasons for this stuff, and then get back to
my modeling!
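To make that concrete, here is a minimal Python sketch of a Hub hash key being generated. The choice of MD5, and the trim-and-upper-case cleansing of the business key, are illustrative assumptions (common in Data Vault 2.0 implementations) rather than a standard mandated by this book:

```python
import hashlib

def hub_hash_key(business_key: str) -> str:
    # Cleansing rules are a local design decision; trimming and
    # upper-casing are assumed here so that "wf2018-123 " and
    # "WF2018-123" produce the same hash key.
    normalized = business_key.strip().upper()
    # MD5 is assumed purely for illustration; any stable hash works,
    # as long as the same input always yields the same output.
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# The same business key always produces the same hash key.
print(hub_hash_key("WF2018-123"))
print(hub_hash_key("wf2018-123 "))  # identical output after cleansing
```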
We’ve still got two more columns in the Hub tables. The Record
Source column notes the source system that first presented the
business key value to the Data Vault, and the Load Date / Time notes
the precise date and time that this occurred. These columns may
prove useful for audit and diagnostics purposes, but please don’t
get side-tracked; a Hub table fundamentally exists to hold the set of
business keys for a given business concept, and nothing more.
Let’s look at the attributes and their values that one instance in the
Fire Hub table (as shown in Figure 13) might have:
Fire Reference Number: “WF2018-123” (Wild Fire number 123 in
the year 2018).
Fire Hash Key: “27a3f042…”, being a hexadecimal representation
of the hash key generated by presenting the text string “WF2018-
123” to some hashing algorithm.
Load Date / Time: “10/10/2018 12:34:56.789” being the very
moment (to a nominated level of precision) the row was loaded
into the Data Vault table, not the moment when the row was
created in some source operational system.
Record Source: “ER/RA”, being an acronym (“ER” for the
Emergency Response IT operational system, and “RA” for
Resource Acquisition, the data feed providing load data for the
Data Vault). Note that the “source” can be a multi-level hierarchy,
and is typically finer-grained than just the IT system.
Instances in the Fire Hub never change after they are created. No
matter how many of the fire’s descriptive attributes such as Fire
Type (grass fire, forest fire …), Fire Name, Fire Severity, and so on
change, the Hub instance is rock solid. The Hub instance says that
the Data Vault first saw Fire “WF2018-123” on 10/10/2018, and that
this visibility of the nominated fire was provided by the Resource
Acquisition data feed from the Emergency Response IT operational
system. These facts will never change, nor will the instance in the
Fire Hub table. So where do we hold the Fire Type, Fire Severity and
other attributes? In a Satellite!
Satellites (on a Hub)
While most people think of satellites as things that zoom around in
the sky, the term is used in Data Vault for a very different purpose.
Typically, at the same time a Hub instance is created, we want to
capture the related attribute values as they existed at that first-seen
point in time for the business key. The first instance in the Fire
Satellite table (refer to Figure 13) might have values such as:
Fire Hash Key: “27a3f042…”. This is the Foreign Key, relating the
Satellite to its “parent” Hub for Fire Reference Number “WF2018-
123”. Of course, we could have designed the Hub to have the
natural business key (the Fire Reference Number) as its Primary
Key, and simply used that natural business key value here in the
Satellite as the Foreign Key. That might be easier to understand,
but it’s not the Data Vault 2.0 standard, and there are reasons for
using the hash key construct.
Load Date / Time: “10/10/2018 12:34:55.555” being the very
moment (to a nominated level of precision) the row was loaded to
the Data Vault table. Note that the load time for the Satellite is
not the same as the load time for its “parent” Hub. While as
human beings we might think that a Hub and its Satellite values
are loaded together, particularly as they may come from exactly
the same data feed, there might be one code module for loading
Hubs, and a separate code module for loading Satellites. If one
load succeeds and the other fails until a restart the next day, the
time difference can be significant, and that’s OK – the Load Date /
Time provides an audit of when the load actually occurred, not
when we might have expected it to occur. In this example, the
load time for the “child” Satellite is actually before the load time
for its “parent” Hub. The more technically minded, raised on
relational databases, might have expected referential integrity to be
enforced between the Hub and its Satellite, meaning that the
“parent” Hub had to be created first. On the entity-relationship
diagram I have presented in Figure 13 above, we could quite rightly
state that is in fact what I am implying, but in the Data Vault world,
we can have a “child” Satellite created before its “parent”. Note,
however, that when the entire data load is completed, there should
be no “children” without “parents”. This approach may be a real bonus if we want
to perform Massive Parallel Processing (MPP) to handle vast
volumes of data.
Record Source: “ER/RA” as an acronym, with “ER” for Emergency
Response (the IT operational system) and “RA” for Resource
Acquisition. In this example, this value is the same as the Record
Source value in the related Hub, as one source feed of data
provides data for the Fire Hub and the Fire Satellite, though they
may be processed in separate load modules as noted above.
Satellites primarily exist to hold the values of the Hub’s
non-business-key attributes. In this example, the data values at
the time of initial creation of the Hub could be:
Fire Type: “Forest Fire”
Fire Name: “Black Stump”
Fire Severity: “2”
Incident Controller: “Alex Anderson”
Planned Start Date of “NULL” in this case, as we didn’t plan to
have this fire. Where I live in Australia, we do have deliberate
burns in the cooler weather to reduce the fuel load, but this
hypothetical fire might have been started by a lightning strike.
Planned End Date is also “NULL” at this stage, but will later be
set to a target date once we’ve worked out how big the fire is
and what our plan of attack is to be.
Actual Start Date: “10/10/2018”
Actual End Date is also “NULL” at this stage.
The example above follows the Data Vault 2.0 standard, with its use
of hash keys. In contrast to Data Vault 2.0 where the “child” Satellite
can be created before its “parent” Hub, in Data Vault 1.0, the
Satellite load program might have to look up the “parent” Hub table
by doing a search against the natural business key and then find the
sequence number before the Satellite instance can be created.
OK, so our Satellite now has the snapshot of values for the fire at the
time of creating the matching Hub instance. That’s great, but values
change over time. A few hours after first reporting the fire, its Fire
Severity drops to “1”, and the Incident Controller is now “Brooke
Brown”. These new values are presented to the Data Vault. What
happens?
It’s simple, really. A new instance in the Fire Satellite is created for
the same “parent” Hub, but with a different Load Date / Time, and
of course the data values as they now apply. We’re starting to build
up the history we need.
Note that each instance in the Satellite is a complete snapshot of all
data values active at the time. Even if the Fire Type and Fire Name
haven’t changed, the new Satellite instance will hold their
(unchanged) values, alongside the changed values for Fire Severity
and Incident Controller.
[One little side note: In the above example, all the attributes from the
source were held in one Satellite. It is possible to split a Satellite so
that, for example, attributes whose values frequently change can go in
one Satellite, and attributes with relatively static values can go in
another.]
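As a thought experiment, here is a minimal Python sketch of that “full snapshot, inserted only on change” behavior. The attribute names, and the in-memory list standing in for the Satellite table, are illustrative assumptions only:

```python
from datetime import datetime

satellite_rows = []  # each element is a complete snapshot of all attributes

def load_satellite(hub_hash_key: str, incoming: dict) -> None:
    """Insert a full snapshot row only if any attribute value changed."""
    # Find the most recent snapshot for this Hub, if there is one.
    current = next(
        (r for r in reversed(satellite_rows) if r["hub_hash_key"] == hub_hash_key),
        None,
    )
    # Compare only the descriptive attributes, not the Data Vault columns.
    if current and {k: current[k] for k in incoming} == incoming:
        return  # nothing changed, so no new Satellite instance
    satellite_rows.append({
        "hub_hash_key": hub_hash_key,
        "load_dts": datetime.now(),
        **incoming,  # unchanged values are carried along with changed ones
    })

load_satellite("27a3f042", {"fire_type": "Forest Fire", "fire_severity": "2",
                            "incident_controller": "Alex Anderson"})
load_satellite("27a3f042", {"fire_type": "Forest Fire", "fire_severity": "1",
                            "incident_controller": "Brooke Brown"})
print(len(satellite_rows))  # 2 -- one snapshot per distinct state
```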
Links
Hubs represent business concepts, such as a Fire and a Fire Truck. One
of the roles of a Link is to represent business relationships such as the
assignment of a Fire Truck to a Fire.
Hub instances are identified by a business key, such as the
Registration Number of a Fire Truck or the Fire Reference Number
for a Fire. Link instances are identified by sets of business keys, such
as the combination of a Fire Truck’s Registration Number and the
Fire’s Fire Reference Number.
If we look at the Entity-Relationship Diagram for the Data Vault in
Figure 13, we see that there is a table named Fire-Truck-Assigned-
To-Fire Link, which is the short-hand way of saying we’ve got a Link
table that associates a Fire Truck with the Fire to which it is
Assigned. Taking just the association between Fire Truck “ABC-123”
and the Fire “WF2018-123” to which it is assigned as an example, the
instance in the Link table might have attribute values such as:
Fire Truck Hash Key: “f0983ba7…”. This is a Foreign Key, relating
the Link to one of the Hubs that participate in the Link’s
relationship. In this scenario, the Foreign Key points to the Hub for
the Fire Truck with Registration Number “ABC-123”.
Fire Hash Key: “27a3f042…”. This is a Foreign Key, relating the
Link to one of the Hubs that participate in the Link’s relationship.
In this scenario, the Foreign Key points to the Hub for the Fire with
Fire Reference Number “WF2018-123”.
Load Date / Time: “10/10/2018 12:44:55.555” being the very
moment (to a nominated level of precision) the row was loaded to
the Data Vault table, not the moment when the row was created in
some source operational system.
Record Source: As we continue with this example, we are
assuming that the data feed is the same as that of the two Hub
instances that are participating in this Link instance. The value is,
again, “ER/RA”.
Fire-Truck-Assigned-To-Fire Hash Key: “12ab34cd…”, being a
hexadecimal representation of the hash key generated by
presenting a text string that is itself formed by concatenating the
business keys of the participating Hubs (Fire Truck and Fire). The
text string to be hashed might, in this example, be “ABC-
123|WF2018-123”. Note that it is recommended that the text
string formed by concatenation of the Hub’s business keys have a
separator (such as the so-called pipe symbol, “|”) between the
component parts. Consider a Fire Truck with Registration
number “ABC-432” and a Fire with a Fire Reference Number “10-
Black Stump”. Without a separator, the concatenated string
would be “ABC-43210-Black Stump”. If we then consider the
admittedly unlikely scenario of a Fire Truck with Registration
number “ABC-43” and a Fire with a Fire Reference Number “210-
Black Stump”, then concatenate them without a separator, we
would get the same result as a text string, and hence exactly the
same hash key, which we don’t want – the same identifier for two
distinct relationships. Conversely, with a separator, the first
string would be “ABC-432|10-Black Stump”, and the second
“ABC-43|210-Black Stump”. To the human eye, they look almost
the same; to a hashing algorithm they are totally different and
will produce totally different hash results. Job done.
The formation of the text string combining the Hub’s business keys
must always follow the same order of concatenation. If one time, for
the relationships between a given Fire Truck and a Fire, we put the
Fire Truck business key first, then later for the exact same intended
relationship we reverse the order, we will not get a matching hash
key.
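Here is a minimal Python sketch of both points – the separator and the fixed ordering. As before, MD5 is an illustrative choice of algorithm, not a mandated standard:

```python
import hashlib

def link_hash_key(*business_keys: str, sep: str = "|") -> str:
    """Hash the participating Hubs' business keys, joined by a separator.

    Callers must always pass the keys in the same agreed order,
    or the same relationship will hash to different values.
    """
    return hashlib.md5(sep.join(business_keys).encode("utf-8")).hexdigest()

# Without a separator, these two distinct pairs collapse to the same string:
print("ABC-432" + "10-Black Stump" == "ABC-43" + "210-Black Stump")  # True

# With the pipe separator they hash differently, as they should:
print(link_hash_key("ABC-432", "10-Black Stump"))
print(link_hash_key("ABC-43", "210-Black Stump"))
```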
In a similar manner to Hubs, instances in the Fire-Truck-Assigned-
To-Fire Link never change after they are created. The Link instance
we’ve been studying says that the Data Vault first saw the
relationship between Fire Truck “ABC-123” and Fire “WF2018-123”
on 10/10/2018, and that this visibility of the nominated relationship
was provided by the Resource Acquisition data feed from the Emergency
Response IT operational system. These facts will never change, nor
will the instance in the Fire-Truck-Assigned-To-Fire Link table.
Also like Hubs, a Link can have associated information that does
change over time, and again this can be stored in a Satellite
belonging to the Link. This requirement is not shown in the
diagram, but it’s not uncommon for Links to have a ributes such as
the relationship start date-&-time and the relationship end date-&-
time. By now we’re unfortunately getting into some territory where
there are differing views on Data Vault modeling approaches. Just
for a moment, let’s stick with Dan’s Data Vault 2.0 standard, and
later we’ll discuss some other opinions.
Please note that in this very simple example of a Fire Truck assigned
to a Fire, there are only two Hubs involved. However, life isn’t
always that simple. Maybe the assignment transaction also required
nomination of the Employee authorizing the assignment. In such a
case, the Link may involve the Fire Truck Hub, the Fire Hub, and an
Employee Hub. A Link can have as many participating Hubs as are
required, but must always have at least two (or in the case of self-
referencing relationships, two or more Foreign Keys can point to the
same Hub table).
A drill-down into cardinality on Links
If we look back to Figure 12 at the Entity-Relationship Diagram for
the operational system that is the source for this hypothetical Data
Vault, at any point in time a Fire Truck can only ever be at one Fire.
That makes a lot of sense in this case study example. However, over
time, the same Fire Truck, say Registration Number “ABC-123”, can
be assigned to many Fires. At some time in the past it was at Fire
“PB2018-012” – a “Prescribed Burn” fire deliberately lit to reduce
fuel load. Today it’s at the Fire “WF2018-123” and in the future it
may be assigned to Fire “WF2018-456”.
Now if we look back at Figure 13 for the Entity-Relationship
Diagram for the Data Vault, those of us with a data modeling
background may notice that the structure for the Link is a many-to-
many resolution entity – that’s a technical term that says this
particular Link entity can accommodate the Fire Truck Hub and the
Fire Hub being in a many-to-many relationship. The Data Vault can
record all of the relationships over time, which is what we want.
This flexibility has another important feature: the ability of the Data
Vault to absorb changes in the cardinality in the operational systems
without having to have the Data Vault reconstructed. Let’s say we
want to record marriage relationships between people. I recognize
there are diverse and changing views about what constitutes a
“marriage”, but let’s look at a hypothetical country that has some
interesting twists and turns in the attitudes of its people and the
law-makers. I also recognize that there may be a difference between
what the medical profession records as a child’s sex at birth
compared to the gender an individual may later nominate for
themselves, but for the sake of simplicity, this hypothetical country
only recognizes males and females as declared at birth. My simple
request is that we put aside personal views as to what relationships
should or should not be recognized, and please let’s inspect the
hypothetical as that country’s views change over time (maybe too
much for some, and too little for others).
In this made-up scenario, the laws of this country originally defined
marriage as not only being between one man and one woman at any
point in time, but they went one step further. In their country, each
individual could only be married once in their entire lifetime. If your
partner left you, or even died, sorry, but you could not enter into
another legally recognized marriage. If this “business rule” was
enforced in the operational system, you would have a one-to-one
relationship between a man and a woman. Using the UML notation
so we don’t have to get into debates on primary keys and foreign
keys, the operational system might be represented as below.
Again using the UML notation, and suppressing all of the Data
Vault attributes (Load Date Time, Record Source, and keys), we
might have a Data Vault model as shown in Figure 15. We’ve got a
“Man” Hub and a “Woman” Hub in the Data Vault. And at this
point, the “Marriage” Link between these two Hubs only needs to
handle a one-to-one cardinality, even though the Data Vault can
happily cope with many-to-many.
Let’s pretend the country’s rules are loosened a little. Remarriage
after death of a spouse is now permitted. The operational system
might not need any change if it only records current status (still one-
to-one), but the Data Vault needs to accommodate many-to-many
over time. But it already can, so no change there either.
Have you ever seen the film, “Paint Your Wagon”? There was a man
with multiple wives who passed through a gold-mining frontier
town. If this hypothetical country likewise allowed a man to have
more than one wife, the UML representation changes in the
operational system.
Wind the clock back to the late 1970s. Projects were run using
waterfall approaches. Find the requirements, document them, and
get them signed off. Do a solution design, and get it signed off. Do
development, and document it. Do testing and write up the results.
Finally, put the system into production and train the users. Of
course, if any mistakes were made in analyzing the requirements,
but they weren’t detected until the solution was in production, there
could be costly rework.
I’m the project lead, and in those days, that meant being a jack-of-
all-trades – OK, I know how that phrase finishes! I’m the analyst, the
solution designer, a programmer, the lead tester, the trainer, and I
could make coffee for the team when needed. But here’s the first
problem. The client’s site is on the other side of Australia. If I had my
own private jet, it might be a quicker trip, but on commercial routes
hopping around the coast, it’s nearly 4,000 kilometers (well over
2,000 miles), and with poorly connecting flights, it took the best part
of a day.
Now here’s the next problem. There’s no Internet, and no such
thing as video conferencing. If I want to do waterfall-style analysis, I
have to fly up to the remote mine site. If I want to do further
interviews, or present the solution for sign-off, I have to fly up again.
So during the analysis stage, I try really hard to absolutely pin down
cardinality. Each mine-site project has only one prime supplier.
Sure, that same supplier can be the successful bidder for several
projects, but each project must have one and only one supplier that
takes overall responsibility.
We designed it that way, and at the end of the project, I fly up again
to do the training and handover. Thankfully, I take one of my best
developers, Dianne, with me. That’s a story in its own right. When
we got to the mine site, Di stayed in the single women’s quarters –
she was single. She temporarily raised the population of single
women by 33% - from three women to four. And single men? A few
thousand. Needless to say, her presence was noticed. But back to the
project. I start training, and one of the students says our system will
never work. Great! What’s the problem? I’m now told that there are
plenty of projects where the prime responsibility is shared. Each
project can have more than one “prime” supplier. Thankfully Di was
a brilliant developer. We paused the training, and she made the
changes over a couple of days, at the mine site. A bit of testing, and
it’s back in production and we are ready to go again.
The point is that the tightest of specifications on cardinality can
prove to be false. Changes can be challenging in the operational
system, especially if you don’t have a Di on the team. But a Data
Vault doesn’t flinch as it’s always capable of handling all
cardinalities.
Satellites (on a Link)
To me, the really sad thing was that after much time and money
being spent, the project-level solution was technically the same as
the enterprise-level design presented in the first iteration, just
replicated by project. If the decision makers had simply applied the
mantra of using an enterprise data model to shape the Data Vault
design, and not let technical issues relating to business key
structures get in the way of what the business said they wanted,
Iteration #1 could have been the final, and successful, solution.
Instead, those who wanted the simple elegance of a business-centric
view were moved off the project. The subsequent word on the street
was that not only was the project a failure, some without deep
knowledge also unjustly classified Data Vault itself as being a
failure.
I saw the whole saga as a missed opportunity. The project should
have succeeded.
Please let’s focus on the bit that gives the most leverage, and leads to success
There’s a diagram I want to share that I think provides some hints
about a framework for understanding the moving parts, and for
allowing us to focus on the one bit that really simplifies the whole
job of building a Data Vault, and makes it much more manageable:
Terminology
In the heading for this part of the book I have referred to an
enterprise view, implying an “enterprise data model”. That’s one of
many common phrases, and you may see it in the DMBOK (Data
Management Body of Knowledge) book. I like the phrase, but over
the years, there have been many classifications of types of layers in
data models. Some may remember talk coming out of the 1970s
about a three-schema model, comprising an External Schema, a
Conceptual Schema, and an Internal (or Physical) Schema.
Then there’s John Zachman’s six layers for the data area.
Another common classification scheme is to label a data model as
Conceptual, Logical, or Physical. Seeing a Data Vault design is
meant to be centered on business concepts, a Conceptual model
looks like it might meet our requirements. OK, so we might just
focus on Conceptual models.
But then there’s debate about what constitutes a “Conceptual”
model.
Steve Hoberman, Donna Burbank, and Chris Bradley in Data
Modeling for the Business[8] have two levels above the Logical and
Physical models – Very High-level data models, and High-level data
models. I actually like these classifications as they make sense to the
business.
Then along comes David Hay in his solidly founded and well-
researched book “Achieving Buzzword Compliance”[9] which tackles
head-on the whole topic of buzzwords in our industry, including
terms such as a “Conceptual” model. Steve and his friends had two
types of what some may call Conceptual models. David breaks
Conceptual models into three types. I can’t do his book justice in a
few bullet points, but my simplification of his three Conceptual
models goes something like this:
1. At the top (the most broad-brush view), he has Overview
Models. They’re great in giving the big picture to non-technical
audiences, and helping us understand the scope of the
following, more detailed, models.
2. Next come Semantic Models. While an enterprise may have
one Overview Model, it may have multiple Semantic Models,
each one describing a view of the enterprise from within one
work group, using their language. David labels these as
“divergent” models – each view may diverge from the
apparently simple, single, united Overview Model.
3. Finally we’ve got an Essential Model. Yes, just one of them. It is
“convergent” because it pulls together the diverse views from
the Semantic Models to create a new single, unified view.
I suggest that it is this single “Essential” enterprise view that should
contribute to the design of the business-centric Data Vault objects.
But didn’t Dan say what we need for that purpose is an “enterprise
ontology”? Here comes Wikipedia to the rescue. (OK, I know some
treat Wikipedia as providing less than the ultimate, authoritative
word on ma ers, but let’s accept that it is a great resource if used
with a touch of caution.) Under the topic of data models, Wikipedia
suggests (emphasis mine):
“In the 1970s entity relationship modeling emerged as a new type of
conceptual data modeling … This technique can describe any ontology, i.e.,
an overview and classification of concepts and their relationships, for a
certain area of interest.”[10]
So at last we’ve got it. We can do a Conceptual data model for the
enterprise, call it an ontology if we like, and we’ve got the
foundation for a business-centric Data Vault solution. [As a side note,
the Wikipedia quotation above seems to imply that a data model will use
one of the forms of an Entity-Relationship Diagram (ERD) notation, but I
suggest that use of the Unified Modeling Language (UML) notation may
also serve you well as an alternative.]
But what does one of these things really look like?
Sample snippets from enterprise data models
Given that there are so many definitions of what an enterprise
(Conceptual) data model might look like, I will share a few examples
that might give you a feel for their shape. Of course, others may
provide other examples that are quite different, and their samples
may well fit the definition, too.
Schematic
David Hay’s definition of types of Conceptual models starts with an
Overview Model. I often work with the business people to assemble
a single-page schematic. I have created a pictorial overview which
many have found helpful. An example of it follows, showing people
involved with the buying and selling of houses and land, placing
mortgages, and the like.
We will go into the details behind each icon in the Appendix when
we look at the technical details behind the patterns they represent,
but for now, in this setting:
Parties and their roles could include vendors (the sellers of
houses and land), the buyers, their solicitors, involved real estate
agents, banks and more.
Agreements could include the most obvious one – a contract
between the buyer(s) and seller(s) – but also the contract to
engage a real estate agent, the bank loan contract, and maybe a
guarantee agreement provided by loving parents to a young
couple who can’t quite borrow enough money in their own right.
Important Agreements such as the loan contract typically require
documentary evidence, managed by the Document icon. Note
that sometimes an agreement is nothing more than a handshake
and has no documentary evidence, or conversely, some of the
documents such as photographs of the property are not tied
directly to an Agreement.
Resources (also known as Assets), are likely to include houses
and land but also money.
Some of the Resources such as land have a Location (a position on
a map), while other Resources (such as money) don’t have a
Location.
Just to demonstrate the flexibility of these simple icons, let’s look at
one more enterprise-level schematic. This time, it’s for an
organization that is responsible for the registration of health
practitioners (doctors, nurses, chemists …), and for oversight of
their on-going work.
Some people love the Party and Party Role pattern, some don’t, but
before you write it off, I really do recommend you first read the book
co-authored by Len Silverston and Paul Agnew[11] on the different
levels for each pattern. Some levels are great for communicating
with non-technical people, other levels are more aimed at technical
implementation. David Hay[12] has a similar number of levels. Len
and Paul refer to their levels as spanning from specialized through
to generalized, while David talks of his levels as spanning from
concrete through to abstract. I think there is much in common.
The full Party pattern defines entities for things such as name,
address, phone number and so on.
A Party can play multiple roles. For example, an Employee of a
company may also be a Customer of the same company.
We have just looked at the health practitioner regulation scenario,
presented above as an overview schematic. Let’s now look at it in a
bit more detail.
The clear boxes at the top of the diagram represent a tiny fragment
of a somewhat standard Party and Party Role pattern. The
shaded boxes represent a few sample subtypes of “Party Role” in the
health practitioner scenario, and show how we can do some simple
drill-down into subtypes, and their subtypes, as deep as is helpful.
They both have parties, but the roles for the land transaction
include the sellers, the buyers, their solicitors, involved real
estate agents, banks and so on, whereas the roles for the mineral
resources scenario include exploration and mining
companies, the owners of the land, the government and more.
They both have agreements. The first scenario has mortgages,
caveats, easements, and loans. The second scenario has exploration
licenses and extraction licenses.
They both have resources, be it money and land in the first case and
gold and oil in the second.
Not only are there similarities between the chosen patterns (Party,
Agreement …), but there are some similarities between
associations. Both have parties acting as signatories to agreements,
and both have constraints on the use of resources defined in related
agreements. Yes, there are differences, but there are also similarities.
Over the development of many such high-level frameworks, a few
relationships begin to reappear. The following diagram captures
some of the most common reusable relationships.
There’s a point to these comparisons. The same components get
used again and again, though in different ways and different
configurations. I see an analogy with a car. I’ve got a simple 2-wheel
drive vehicle with one differential and one gearbox. I also used to
own a 4-wheel-drive with two differentials, one for the front axle
and one for the rear axle. And I currently own a 4-wheel-drive with three
differentials, one at the front, one at the back, and one in the middle.
The role of a differential is the same in all cases. A single drive shaft
comes in, and two drive shafts come out that can turn at slightly
different speeds. It’s just the way the differentials are placed in a
vehicle that changes.
In a similar way, the land titles office and the centralized register of
mineral exploration and extraction licenses use a number of
common components, and it won’t surprise me if your organization
reuses some of these patterns and their interrelationships,
though in a way that’s unique to you.
Sharing the understanding
I’ve introduced data model patterns to you, and thanks for hanging
in there with me. But if you’re using the list of Steps as a guide to
developing your own enterprise data model, please make sure the
whole team (the business folk and the technical team members) all
have a level of comfort with the ideas behind these patterns. I
suggest you go through the material above, and gently introduce the
ideas behind the patterns in the Appendix.
For those of you familiar with the UML, it’s called multiple
inheritance. A Person is a Resource, and a Person is also a Party.
Would we directly implement something like this? Possibly not,
especially as some object-oriented languages don’t seem to be too
comfortable with multiple inheritance.
But here’s an important message. This is a model about business
concepts, and at this level, some people see a person as a type of
party, and others see them as a resource. The model captures both
views, and if nothing else, will generate discussion when
implementation has to be faced, be it for a Data Vault or something
else that requires an enterprise view.
After a few twists and turns, we end up with a very high-level
schematic that looks a bit like the group model presented in Figure
46.
Next steps
Assuming prior familiarity with the patterns, and a good selection of
creative participants, it is possible to assemble this framework in
just days. It’s helpful, it gives the business and the technical people
a common vocabulary, and it is pattern-based and hence likely to be
extendable as new aspects are discovered. That’s all good news.
But it is only a schematic, and for many purposes, we are going to
need a bit more detail, though not so much that we drown in a never-
ending search for the last attribute, so we want to target our drill-
down efforts to where we will get the most value. Before I end the
workshop, I seek feedback on “data” problems that have caused
pain in the past. That can serve as a checklist for focus as we seek to
flesh out the model where it can first deliver value.
Along comes another source data feed, this time from the Resource
Management system. It handles a whole heap of Resources that can
be deployed to an emergency event such as a fire. This might
include water pumps, generators, bulldozers, chainsaws, and lots
more, including fire trucks! This means that (logically) the Resource
entity is a supertype, and Fire Truck is one of its subtypes. As a
source system data feed, it has a different grain. If we follow the
pattern we used for Fire Trucks, we might end up with a Hub and
Satellite like that below.
The Satellite has been included in part to represent those more-
common attributes such as Make, Model, and Year of Manufacture,
but the important reason for including it is to show the “Resource
Type” attribute that classifies Resource instances as a Generator, a
Bulldozer, or whatever. There may also be other Satellites (not
shown) that may hold specialized attributes peculiar to very specific
types of assets.
Now the challenge. We’ve got Fire Truck instances in the Fire Truck
Hub, and also in the Resource Hub. What do we do? Here’s one
approach, and it’s pretty simple:
1. We continue to load data into the Fire Truck Hub from source
system feeds at the grain of Fire Truck – these are the Fleet
Management system, and the Shared Resource Pool system
feed. This is a load of raw source data.
2. We also load data into the Resource Hub from the source
system that feeds at the grain of Resource – the Resource
Management system feed. This also is a load of raw source
data.
3. We then have a business rule whose pseudo code is something
like:
a. If the Resource Type in the Resource Hub’s Common
attributes Satellite is “FT” for Fire Truck, present the
Resource Hub’s Resource Code (for example, “CCC-333”) to
the Fire Truck Hub as a Registration Number.
b. If the presented Registration Number hasn’t yet been
loaded to the Fire Truck Hub, it is loaded as we would for
any Hub load!
Note: Just to reiterate, this is a load where the Record Source
records a business rule as the source, not a raw source system. Now
we have instances in the Fire Truck Hub that are “raw”, and some
instances in the same Hub that are “business”! We can’t say that the
Fire Truck Hub itself is a “raw Data Vault” table, nor can we say it is
a “business Data Vault” table. Instead, we note that instances in the
Fire Truck Hub can be either raw or business.
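To make steps 3a and 3b concrete, here is a minimal Python sketch of that business rule. The stand-in dictionaries, the function name, and the field names are all illustrative assumptions; only the “FT” code and the “CCC-333” example come from the scenario above:

```python
# Stand-ins for the Resource Hub's common-attributes Satellite values
# and the set of business keys already in the Fire Truck Hub.
resource_satellite = {"CCC-333": {"resource_type": "FT"},
                      "GG-001": {"resource_type": "Generator"}}
fire_truck_hub = {"ABC-123"}  # loaded earlier from raw source feeds

def apply_fire_truck_grain_rule() -> None:
    for resource_code, attrs in resource_satellite.items():
        if attrs["resource_type"] == "FT":             # step 3a: it's a Fire Truck
            if resource_code not in fire_truck_hub:    # step 3b: not yet loaded
                fire_truck_hub.add(resource_code)
                # In a real load, the Record Source for this instance would
                # name the business rule, not a raw operational system.

apply_fire_truck_grain_rule()
print(sorted(fire_truck_hub))  # ['ABC-123', 'CCC-333']
```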
Of course, there may be business rules to get Satellite data across
from the Resource to the Fire Truck as well, but those principles have
been covered in the preceding section.
That’s one example of having business rules relating to Hub
instances, where there are conflicting levels of “grain” (supertype /
subtype) across different source feeds. Another common example
relates to consolidation of Same-As Links. Sometimes Same-As
Links identify accidental duplicates in the same source – one of my
clients had customers who deliberately tried to have new customer
records created so they could avoid the consequences of unpaid
debts under their earlier customer identity! Other times, different
source systems create instances using different business keys, but
the Same-As Links record that the same real-world entity is known
by different key values across multiple systems.
Whatever the reason for ending up with multiple Hub instances that
represent the same real-world entity, the business may well ask that
the duplicates be logically consolidated before being consumed.
Please note that I stress the consolidation is logical, not physical.
The duplicates, with all of their history, are still retained as raw
source loads to the Data Vault. It’s just that business rules are used
to give the impression that there is only one Hub instance for every
real-world employee (or customer, or …).
Our data is now much better prepared for consumption by the end
users, which after all is our primary goal.
Preamble
When we set out to design Hubs, we wanted to hear from the
business about their business concepts, their business keys, and
their hierarchies of types. When designing Links, we are again well
advised to go to the business and ask them about the relationships
between their core concepts, rather than having our thinking
constrained by constructs in the source systems.
The Links the business wants to see for downstream analysis are
typically founded on these fundamental business-centric
relationships. If we’re half lucky, the raw data as supplied to the
Data Vault may directly reflect those business object relationships.
For example, we might get a data feed that is a simple declaration
that a certain employee is the account manager for a certain
customer. There are two Hubs (employee and customer), and a
simple Link. However, it’s more than likely we will get plenty of data
feeds that capture complex events or transactions, and the data may
be far from normalized. Before it can be consumed by end users, it
may need quite a bit of transformation, including untangling the
implied relationships. It’s interesting that some end users of this
data may also want to see the raw transactions as posted. Either way,
we simply can’t tackle mapping from the raw source data to any
enterprise view without having first defined the enterprise view, and
thought about how it might map to Hubs, and Links!
When the business and systems views happen to align
We’ve had a few scenarios to look at so far for the design of Hubs.
We’ve looked at Resources, especially Fire Trucks. We’ve also looked
at Emergency Events, including Fires and Floods. And we’ve spent
quite a bit of time looking at Employees. In one form or another,
we’ve designed Hubs for all of those subject areas, solidly founded
on what the business had decided is the right level. That’s a really
good start. Next we need to “link” them together, starting with Fires
and Employees.
When we talked to the business people about assigning employees
to emergencies, we heard two things. Firstly, the Human Resources
(HR) department can be approached to release employees into a
pool of available resources for a particular fire. Maybe they’ve been
asked for one logistics officer, two bulldozer drivers, and three
firefighters. They talk to candidate employees, and one by one
release them to the fire. The screen they use in the HR operational
system might look something like this:
I call this the “push” scenario. A resource from head office is
pushed out into the front line’s resource pool.
Now we look at a second scenario. Instead of pushing resources into
the pool, an administrative officer working on the wildfire response
thinks of particular people he/she would like assigned, and uses
his/her screen to log specific requests to pull the nominated people
from head office into the wildfire’s resource pool.
Interestingly, it’s really the same data structure in both cases, even
though sourced from two separate operational systems. Nominate
an employee (one Hub), nominate a wildfire (another Hub), provide
some date and percent assignment details, and the transaction is
completed. The enterprise data model takes a holistic view, not a
source system view, and represents both source-centric scenarios as
one Resource Pool Assignment relationship:
If you recall, we chose to design the Data Vault Hubs at the Wildfire
and Flood level of granularity rather than at the Emergency Event
supertype level. We have taken the relationship that an Employee
can be assigned to an Emergency Event, and pushed that supertype
relationship down to become a more specialized relationship
involving the Wildfire subtype as a Hub. Leaving aside the Data
Vault attributes for things like load dates, source systems, and hash
keys, a Data Vault model, including Satellites on the Links, might
look like this:
Note that the Hubs are a direct reflection of how the business sees
their core concepts, and that this particular Link is likewise aligned
to how the business sees their core data relationships. If only it were
always that easy!
When life is a bit more complex
In the real world, and in operational systems, events occur and
transactions are executed. Each event or transaction can involve
many business objects. Something as common as a sales transaction
can involve the customer, the store, the employee, and of course,
products.
If the source system is nicely normalized, and if we have Change
Data Capture (CDC) turned on in the source system’s database, we
might be presented with lots of tight little feeds to represent the one
sales transaction. For example, the allocation of a salesperson to a
store might be triggered by a Human Resources event, and only
refer to two Hubs – the salesperson (an employee), and the store
(another Hub). A totally separate feed might associate the order to
the customer, and yet another set of individual feeds might
associate individual line items to products. That simplicity,
especially if each part aligns with agreed business relationships
between business objects, was presented in the section above.
Now the harsh reality. Many data feeds triggered by events or
transactions in the operational systems deliver data that is not
“normalized”. Each data feed can reference many Hubs (Customer,
Store, Employee, and Product). That’s just one example. Another
typical example might be an extract from a source system where
multiple joins across multiple tables have been performed to get a
flat extract file. Let’s work through an example related to the
Resource Pool Assignment scenario presented in the “When the
business and systems views happen to align” section above, but
taken from the view of timesheet transactions.
Let’s start with the nice, clean enterprise data model view.
We’ve already seen the assignment of Employees to Emergency
Events. Now we can see the idea that large chunks of work called
Jobs are done in response to an emergency. These Jobs are broken
down into smaller pieces of work, known as Tasks. Timesheets are
filled in for Employees working on the Tasks, and signed by other
Employees to authorize payment for work done. Each Timesheet
Entry (a line on the timesheet) refers to a specific Task the Employee
worked on at the nominated time.
An extract from a hypothetical timesheet transaction file follows.
We could try to “normalize” the input on the way into the Data Vault,
breaking it into the relationships represented by the enterprise data
model. There are implied relationships between emergency events
and their jobs, between those jobs and the smaller tasks they
contain, and so on. But notice the highlighted bits of the third row.
Maybe these implied relationships contain bad data. Does the
Emergency Event Wildfire WF111 really contain tasks prefixed with
Job WF222?
This overly simple example carries a message. If we’ve got
transactions coming into the Data Vault that are not normalized
(and hence may reference lots of Hubs), it might be better to create a
Link & Link Satellite for storing a faithful, auditable copy of the
incoming raw transaction.
The model below holds source-centric Data Vault values. From there
we can generate more information as business-centric Data Vault
objects – read on.
Business-sourced Data Vault Link and Satellite instances
For the baseline shown in Figure 77 above, we can go on to use
business rules to populate business-structured Data Vault Links that
represent the enterprise data model’s fundamental business
relationships, as shown (logically) in Figure 75. These might include
the Wildfire-to-Job Link and the Job-to-Task Link, with Satellites as
appropriate. Might we also be able to imply the Employee-Assigned-
To-Wildfire Link? And could we (should we) create a Timesheet
Hub, implied by all of the timesheet transactions sharing common
parentage?
All of these are possibilities we might want to discuss with the
business. Here’s another one: what if the business later decides that
Jobs and Tasks are really the same thing, a bit like Dan Linstedt and
Michael Olschimke’s example of an aircraft engine,[31] and the
engine’s parts (turbine, compressor …) all being just items in the
Part Hub. Now we would have one consolidated Hub. Can we
populate that Hub from this data? Of course we can. Change the
business rules and we’re off and running.
Now we’ve also got some sample data we want to load to the Fire
Truck Hub. We begin with sample data from the Fleet Management
system. It is a simple extract of all of the rows in the source system’s
table. On January 1st, there are only two fire trucks in the entire Fleet
Management system. The extract looks like this:
The logic, or pseudo code, for loading any Hub goes something like
this for each consecutive row read from the source:
If the Business Key (in this case, the Registration Number of the
Fire Truck) does not already exist in the target Hub, create an
instance in the Hub with the sourced Business Key
Else (if the Business Key already exists in the target Hub), do
nothing with this row
Note that for loading the Fire Truck Hub, only the Registration
Number is considered; the other data attributes will come into play
when we consider Satellites, but for now we can ignore them.
Pretty simple?
Additional data is provided during the load, such as:
A computed Hash Key, using the Business Key as input to the
hashing algorithm (for the sake of simplicity in this case study,
the source text string is shown instead of the long and complex
hash result).
Load Date / Time (again for the sake of simplicity, the values only
capture hours and minutes, but not seconds, let alone decimal
parts of seconds).
A code to represent the Record Source (again simplified in this
example to the code FM/FT to represent the Fleet Management
system’s Fire Truck extract).
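Pulling the pseudo code and these additional load attributes together, here is a minimal sketch in Python. It is purely illustrative: the Hub table is simulated with a dictionary, the column names are my own hypothetical choices, and MD5 merely stands in for whatever hashing algorithm your project standard dictates.

import hashlib
from datetime import datetime

def load_hub(hub_table, source_feed, record_source):
    # hub_table: dict keyed by Business Key, simulating the Hub table
    for row in source_feed:
        business_key = row["registration_number"]
        if business_key not in hub_table:
            hub_table[business_key] = {
                "hash_key": hashlib.md5(business_key.encode()).hexdigest(),
                "load_datetime": datetime.now(),
                "record_source": record_source,
            }
        # else: the key already exists -- do nothing, so the Hub keeps
        # its original "first seen" Load Date / Time and Record Source

fire_truck_hub = {}
load_hub(fire_truck_hub,
         [{"registration_number": "AAA-111"},
          {"registration_number": "BBB-222"}],
         "FM/FT")

Run against the January 1st extract, this creates the two Hub rows; run again with later or overlapping feeds, it quietly skips any Registration Number it has already seen, which is exactly why the "first seen" Record Source depends on which system happens to load first.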
In this case study hypothetical, the Data Vault (and hence the Fire
Truck Hub table) was empty before the load process was run,
therefore every Registration Number in the source system must
cause the creation of a new instance in the Hub table. The resultant
contents in the Fire Truck Hub might now look as follows:
Easy so far? Now we come to the second day. The full extract from
the source data feed looks like the following (with some of the more
interesting entries highlighted):
Note:
Fire Truck AAA-111 has had no changes; the pseudo code will say
that the Business Key already exists in the Hub, so there is no
action required.
Fire Truck BBB-222 has had its Fuel Tank Capacity changed from
55 to 60. This can be expected to result in some action in a related
Satellite, but the pseudo code for the Hub only looks at the
Business Key, BBB-222, which already exists in the Hub, so again
there is no action required.
Fire Truck CCC-333 is new. The pseudo code notes that the
Business Key does not exist, so will add it.
The resultant contents in the Fire Truck Hub might now look as
follows, with the entry for Fire Truck CCC-333 showing it was loaded
to the Data Vault on January 2nd at 1:00am:
Note that the Load Date / Time for Fire Truck AAA-111 records the
“first seen” timestamp of January 1st; it is not updated to record the
fact it was later seen again on January 2nd.
OK, so we’ve performed a few Hub loads from one source system.
How do things change when we introduce a second source? Let’s
look at a data feed from the Shared Resource Pool system we spoke
about earlier. A full extract on January 2nd holds the following
information:
Let’s note that, as described in the case study introduction, there are
partial overlaps in the Fire Trucks represented by the two systems.
Fire Trucks AAA-111 and BBB-222 are in the Fleet Management
system, but have not (yet) been made available for deployment
via the Shared Resource Pool system.
Fire Truck CCC-333 is in the Fleet Management system and in the
Shared Resource Pool system.
Fire Truck DDD-444 may be a privately owned farmer’s fire truck.
It is available in the Shared Resource Pool system, but because it
is not owned or leased by the corporation, it will never appear in
the Fleet Management system.
Let’s think back to the pseudo code for loading a Hub. Even though
Fire Truck CCC-333 is presented to the Data Vault by a different
system, the Business Key already exists and no action is required.
Fire Truck DDD-444 has not yet been seen by the Data Vault, and it
is loaded. If the load time on January 2nd was 1:00am for the Fleet
Management system, and 2:00am for the Shared Resource Pool
system, the Fire Truck Hub might now look like:
Note that the Record Source for the Fire Truck DDD-444 is “POOL”
(a code for the Shared Resource Pool system).
Now here’s a trick question. What difference, if any, might be
observed if the load sequence on January 2nd was reversed, with the
Shared Resource Pool loading first at 1:00am and the Fleet
Management system loading afterwards at 2:00am? The answer is
below:
When the Shared Resource Pool load was executed at 1:00am, Fire
Truck CCC-333 did not exist, so would have been created. In this
variation, the “First Seen” information for Fire Truck CCC-333 would
note that the “First Seen” Record Source was “POOL” rather than
“FM/FT”.
Populating a Hub’s Satellites
Hubs are at the very heart of business-centric integration. If
multiple sources hold data about the same “thing”, all of their data
values are held in Satellites that hang off the one central, shared
Hub.
The “good practice” default design is to have (at least) one separate
Satellite for each source data feed. I say “at least one” because
sometimes a Satellite is split further, for example to move attributes
with frequently changing values into their own Satellite. But for the
sake of simplicity, we will look a bit deeper into how population of
Satellites works with one Satellite per source.
We’ve already seen some sample data, but it’s repeated here for
convenience, starting with the data that was used to commence the
population of the Fire Truck Hub.
When we used this data to create rows in the Fire Truck Hub, we
were only interested in the business key values from the
Registration Number column. Now the focus shifts to the current
values for the Fire Truck’s other attributes. We need a Satellite to
hold them.
The logic, or pseudo code, for loading any Satellite against a Hub
goes something like this for each consecutive row read from the
source:
If attribute values in the source load are different from what’s
already on file for the Business Key (either because the values
have changed since the last load, or because it’s the first load and
there’s nothing there yet!), create a row in the Satellite.
Else (if the same data values as are found in the source load
already exist in the Satellite for the current Business Key), do
nothing with this row
Again, the logic is pretty simple, and as for Hubs, we record the
Load Date / Time and the Record Source to aid in auditing. Note
that the Load Date / Time is part of the Satellite’s Primary Key, so
we end up with snapshots of the data values each time any of the
values presented to the Data Vault change.
Though not shown here, one of the optional extras for a Satellite
table is a Hash Difference attribute. The pre-processing in the
Staging Area can calculate a hash value across the entire
combination of values (in this case, the Vehicle Type plus the Fuel
Tank Capacity plus the Water Tank Capacity as a single text string). If
this is stored in the Satellite, next time a new source row comes in
(also with its hash value), instead of comparing every attribute to see
if anything has changed, comparing the hash value as a single
attribute achieves the same result.
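Continuing the same illustrative Python sketch (again with hypothetical names), the Satellite load compares the incoming payload, via a single Hash Difference, against the most recent snapshot on file for that Business Key:

import hashlib
from datetime import datetime

def load_satellite(sat_table, source_feed, record_source):
    # sat_table: dict of Business Key -> list of snapshot rows (newest last)
    for row in source_feed:
        key = row["registration_number"]
        values = (row["vehicle_type"],
                  row["fuel_tank_capacity"],
                  row["water_tank_capacity"])
        # One hash across the whole payload replaces attribute-by-attribute comparison
        hash_diff = hashlib.md5("|".join(str(v) for v in values).encode()).hexdigest()
        history = sat_table.setdefault(key, [])
        if not history or history[-1]["hash_diff"] != hash_diff:
            history.append({
                "hash_diff": hash_diff,
                "values": values,
                "load_datetime": datetime.now(),  # part of the Satellite's primary key
                "record_source": record_source,
            })
        # else: identical values are already on file, so do nothing

Fed the second day’s extract, this logic skips the unchanged AAA-111, writes a second snapshot for BBB-222 (Fuel Tank Capacity now 60), and creates a first snapshot for CCC-333.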
The Fire Truck Satellite for data sourced from the Fleet Management
system was empty before the first day’s processing began, but will
now look like this:
That was not too challenging – no data existed prior to the load, so
the “first seen” values are simply created. Now things get a little bit
more interesting as we load the second day’s data. Again for
convenience, the second day’s data from the Fleet Management
system is repeated here:
Let’s think through the logic of the pseudo code for each row.
Fire Truck AAA-111 has had no changes. The pseudo code will
conclude that all of the data attribute values for Business Key
AAA-111 already exist in the Satellite, so no action is required.
Fire Truck BBB-222 has had its Fuel Tank Capacity changed from
55 to 60. The pseudo code will note that at least one value has
changed, so will write a new row to the Satellite table.
Fire Truck CCC-333 is new. The pseudo code notes that the data
attribute values have changed (because none existed before!), so
will create a new Satellite row.
The resultant contents in the Fire Truck Satellite for data sourced
from the Fleet Management system might now look as follows. Note
that there are two rows for Fire Truck BBB-222, one being the initial
snapshot on January 1st, the other being the more recent snapshot
on January 2nd, with one data value (Fuel Tank Capacity) changed
from 55 to 60.
When we load Hubs, the data instances can overlap. For example,
Fire Truck CCC-333 is known in both systems. However, when we
load Satellites, if we follow the guidelines and we define a separate
Satellite for each source data feed, life is simpler. Our data model
might look like the model in Figure 83.
Loading the data sourced from the Shared Resource Pool system to
its Fire Truck Satellite would result in the following:
It is worth noting that the types of attributes held about fire trucks
are different between the two systems, and even where they appear
to be the same, they can have different names for the attributes.
Many practitioners suggest that the attributes in a Satellite loaded
from a raw source data feed should align with data structures in the
source systems rather than use the more business-friendly names
we would hope to see in business-centric Data Vault Satellite
structures. Two arguments I have encountered for this approach
that I think have real merit are:
1. Easier traceability of data back to the source systems (though
this should also be achievable by other data mapping
mechanisms).
2. Support for data consumers who already have familiarity with
source system data structures.
Populating Links
You might remember the skeleton Data Vault model we presented
earlier for the introductory case study. With the Satellites
removed, it looked like this:
We’ve been loading Hubs and their Satellites for Fire Trucks. Let’s
assume we have likewise been loading Hubs and their Satellites for
Fires. To bring the two Hubs together, we now need to load the Fire-
Truck-Assigned-To-Fire Link.
In the hypothetical case study, there are two ways an assignment of
a Fire Truck to a Fire can occur.
The first method is driven by a logistics officer at the Fire itself. He
or she sees a need, and submits a request to acquire a Fire Truck
from the shared resource pool. They try to “pull” a resource from
the pool so that they can put the fire truck under their direct control.
Their data entry screen might look something like:
In this case, the logistics officer has requested the assignment of Fire
Truck CCC-333 to the Fire WF123, starting from February 2nd, with
the assignment to be indefinite.
(For my northern hemisphere friends, please remember that February is in
our Australian summertime. The most recent fire that put my own house
at risk was one February when it reached 46 degrees Celsius, or 115
degrees Fahrenheit. And that was in the shade!)
You may remember that the source data for populating the Fire
Truck’s Hub had lots of data in addition to the Registration Number
business key, but for a Hub, we were only interested in the business
key, because the attribute values were managed by the Hub’s
Satellites. We’ve got a similar scenario here. The assignment
transaction provides two business keys (Fire Truck CCC-333 and Fire
WF123), and also two contextual data attributes (a “From” date and
a “To” date), but for populating the Link, we only need the
participating business keys; the attribute values are managed by the
Link’s Satellites.
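Sketched in the same illustrative Python style, the Link load needs nothing more than the pair of participating Business Keys; the “From” and “To” dates travel on to a Satellite on the Link:

import hashlib
from datetime import datetime

def load_link(link_table, fire_truck_key, fire_key, record_source):
    # link_table: dict keyed by the pair of Business Keys, simulating the Link
    pair = (fire_truck_key, fire_key)
    if pair not in link_table:
        link_table[pair] = {
            "hash_key": hashlib.md5("|".join(pair).encode()).hexdigest(),
            "load_datetime": datetime.now(),
            "record_source": record_source,
        }
    # The "From" and "To" dates are deliberately not stored here;
    # they belong in a Satellite hanging off this Link.

assignments = {}
load_link(assignments, "CCC-333", "WF123", "POOL")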
To spotlight some of the flexibility of a Data Vault, while we had
nightly batch loads populating the Fire Truck Hub, let’s assume the
resource assignment function needs to be much more dynamic. The
sample screen above could be a web-based screen, with the service
running on top of an enterprise service bus, in real time. If the
operation represented by Figure 85 were to execute at 12:33 on
February 2nd, with an almost immediate load into the Data Vault, the
resultant Fire-Truck-Assigned-To-Fire Link could look like:
The two Satellites hold different data that reflects what may be a
conflict in the real world as to when the Fire Truck’s assignment
ends, but they faithfully hold the facts as presented by the source
systems. That’s good. The Data Vault is now positioned to expose any
apparent inconsistencies, and action can be taken to resolve the
disagreements if required.
We now roll the clock forward. A month later, on March 3rd there’s a
new fire, WF456. In response, the manager responsible for the
Shared Resource Pool makes several decisions:
1. The large Fire Truck, CCC-333, with a highly experienced crew,
is reassigned from Fire WF123 and handed over to the new
threat, WF456. This action has two parts:
a. The assignment of Fire Truck CCC-333 to WF123 was
intended to be up until April 4th, but the assignment is cut
short, effective immediately.
b. The formal assignment of Fire Truck CCC-333 to WF456 is
also made.
2. To backfill the mopping-up action for the original Fire,
WF123, a smaller Fire Truck, DDD-444, is assigned to that Fire.
The resultant assignment transactions are loaded to the Data Vault:
Now let’s assume we have a source data feed from the Loan
Management operational system. It might look something like the
following.
Next we note that the two Hubs identified by the business also have
a Link identified by the business as a fundamental “business
relationship”. We are in luck. The raw source has the same level of
“grain” – the business-centric Link relates to two Hubs, and the raw
source data feed also relates to the same two Hubs. It is not always
so, but in this hand-picked example, we can map the raw source feed
directly to a business-centric Link.
Now we’ve got some data in the raw source that needs to go in a
Satellite somewhere. We’ve already designed a “business-centric”
Satellite for the Loan Hub, called the Loan Conformed Satellite. If
we could guarantee that the Loan Hub was only ever going to be fed
from one source, maybe we could treat the Conformed Satellite as if
it were the Satellite from a single source. However, we expect that
the Loan Hub will have data coming in from multiple sources, with
each source being given its own source-specific Satellite, so we
design a new Satellite just for the “LMgt” (Loan Management)
source, as shown in “Figure 93: Mapping raw data to a Satellite”.
Though not shown in Figure 93, we would do likewise for a Satellite
holding the Credit Rating attribute, hanging off the Purchase Hub.
It’s nice when life is simple, but it’s not always that way. If you’ve
ever bought or sold a house, there’s the exciting day when
“settlement” occurs (see Figure 94). And it can be pretty
complicated. Behind the scenes, you’ve got the buyer (“Purchaser”)
and the seller (“Vendor”). At the table, you might have the legal
representatives for each of these two parties, and maybe their two
banks. The vendor might reference a mortgage to be discharged,
and the purchaser might reference a new mortgage to be registered.
Then at the center is the land title being transferred. And maybe the
purchaser’s mum and dad as guarantors for the loan, and on and on.
It’s no wonder that there are times settlements “fall over” on the
day because some small but vital part doesn’t line up. A subset of
this complexity is presented below, where a source system data feed
has been marked up with Business Key (BK) tags.
In this feed alone, there are nine business keys identified. If we try
to find a Link to match, you may hear people refer to a Link with a
“grain” of nine. Realistically, the chances of finding a business-
centric Link that perfectly matches are low. The solution? One
approach we can adopt is to create a source-centric Link (with a
Satellite) that matches the event / transaction being presented to us
for mapping.
The diagram is detailed, ugly, and hard to read. But the message is
simple. If the business had already identified the Settlement
transaction as a business concept with its own business key, it may
have modelled Settlement as a Hub. But in this scenario, we have no
such luck. Settlement is merely a transaction with no business key,
and so it’s a Link, tied to a multitude of Hubs.
We’ve captured the data, and that’s good. But we may want to map
some of the data to objects that are more business centric. For
example, maybe the Loan Amount should be in a Satellite hanging
off the Loan Hub, and maybe we want to record the business
relationship between the Loan and the Purchaser in the business-
centric Loan Customer Link. How do we do this? By using business
rules to do some transformation of data. Please read on.
Chapter 6
Task #4: Using Business Rules to Close the Top-
Down / Bottom-Up Gap
Same-As Links
One commonly quoted usage of a Same-As Link is to associate
multiple customer instances in a Customer Hub that actually
represent the same real-world customer. Maybe a few years ago Sam
Smith dealt with the company, using Customer Number Cust123.
Now, some years later, Sam turns up again and starts buying. The
“correct” process may be to recognize Sam as the same person, and
get them to use the old Customer Number, but the process breaks
and Sam gets a new Customer Number Cust456.
The Data Vault ends up with history for purportedly two separate
customers. Subsequently, a cleanup initiative involving the source
systems spots the duplication, or maybe an operator identifies and
records the duplication. We don’t want to rewrite history by merging
data in the Data Vault. That breaks the trust in its audit trail. So
what do we do? It’s simple. We create a Same-As Link that has two
Foreign Key relationships, one to point to the “old” Customer
Number, and one to point to the “new” Customer Number.
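As an illustrative sketch only (the column names and the “STEWARD” record source code are invented), the resulting Same-As Link could be as simple as this:

# Each row pairs a "master" Business Key with a duplicate that represents
# the same real-world customer; the history under both keys is untouched.
same_as_customer = {
    ("Cust123", "Cust456"): {
        "load_datetime": "2019-03-01 02:00",  # when the duplication was recorded
        "record_source": "STEWARD",           # hypothetical manual-cleanup source code
    }
}

Downstream business rules can then present the two Customer Hub instances as one logical customer, without merging or rewriting a single row of the raw history.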
Reference data
Often our IT systems have what many call “reference data”.
Examples could include a collection of country codes and names, a
collection of order status codes and names, or a collection of
employee gender codes and names. There are a variety of ways to
manage reference data within a Data Vault. Let’s begin the
discussion by looking at an example from a banking scenario. In the
operational system, we’ve got Customers, and they are classified
according to some Customer Type reference data. A simple
reference table could look like this:
We’ve already decided that the Customer from the operational
system is to be loaded to a Data Vault Hub. But where do we load
the Customer Type reference data?
One of the first questions to ask is whether or not the history of
changes is to be kept. We could use the data structure from the
above operational table structure directly in our Data Vault if we
only want the current values for a given code. As an alternative, if we
want our Data Vault to hold the history of slowly changing values in
our reference table, the following table structure might do the trick.
Each time a Customer Type Name or Description changes, we can
add a new row holding a snapshot of the new values, while retaining
the previous values, too.
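Here is a minimal sketch of such a history-keeping structure in Python, with invented sample values. Because the Load Date / Time joins the Customer Type Code in the key, each change adds a snapshot rather than overwriting the previous one:

# (Customer Type Code, Load Date / Time) -> snapshot of the changeable values
customer_type_history = {
    ("RET", "2019-01-01 01:00"): {"name": "Retail",
                                  "description": "Personal banking customer"},
    ("RET", "2019-06-01 01:00"): {"name": "Retail & Private",
                                  "description": "Personal and private banking customer"},
}

def customer_type_as_at(history, code, as_at):
    # Return the latest snapshot for the code on or before the given time
    candidates = [(ts, snap) for (c, ts), snap in history.items()
                  if c == code and ts <= as_at]
    return max(candidates, key=lambda pair: pair[0])[1] if candidates else None

Asking for the value as at, say, March 2019 simply returns the January snapshot, while the newer June values remain available for later queries.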
A batch job fires up a bit after midnight each night to take a full
snapshot extract from the HR system. The Staging Area persists a
copy of “active last time” Employee Numbers. Its pre-processing
function for this batch noted that Employee 111 wasn’t in the last
extract, but is now. It generated a row of data for loading to the Data
Vault with a Status Code of “First Seen Date”, and its own Staging
Area run date-&-time to declare that Employee 111 was first seen by
the Staging Area at 1:00am on Tuesday January 2nd.
The intended Start Date for Employee 111 is just a piece of data that
can be mapped to a Satellite. The Staging Area’s pre-processing
simply assigns a Status Code of “Effective Start Date”. No End Date
is nominated for the Employee, so the logic simply ignores a NULL
value and does nothing in the way of generating an “Effective End
Date” row.
After loading a new instance in the Employee Hub (a given), we also
added rows to our special shared-records Satellite.
Now we swing our attention to the source system that issues passes
to permit access to buildings. Instead of using a batch job for a full
extract, this system runs on a database with the ability to generate
Change Data Capture rows. The HR system had created a record for
the new employee a week ahead of the expected start date. The
person responsible for creating new building access passes normally
waits until a new employee actually turns up on their first day. The
source system then, by default, issues cards for six months from the
actual date of the card’s creation. The Effective From and Effective
To dates are generated and stored.
However, the person who issues cards is taking Monday off, so
instead creates the access card ahead of time, on the Friday, and
forgets to override the Effective From date, giving the impression
that the new employee actually commenced on Friday the 5th. It’s a
data quality mistake, but one that the Data Vault will faithfully
record. On Friday January 5th, a new row is created in the Building
Access system, and a CDC “insert” row is automatically generated
by the CDC mechanism. Again we look at the Data Vault results,
and then discuss them in more detail.
The Staging Area pre-processing function generates a row of data
for loading to the Data Vault with a Status Code of “CDC Physical
Insert”, and the precise date and time, to a fraction of a second (not
shown), when the physical insert actually took place in the source
system. The Staging Area pre-processing also generates rows for the
Effective From and Effective To dates.
Most employees like to get paid, but that doesn’t usually happen on
day 1 of their employment. While company policy is to get the HR
records loaded before the new employee arrives, and to create the
building access card on their arrival, creation of the Payroll records
can wait a bit. On Tuesday January 9th, the Payroll records are
created, and exposed to the Staging Area early the next morning,
Wednesday January 10th. Just like the HR system, a full extract is
provided as a batch job. Like the HR system, the “First Seen Date” is
generated, as is the “Effective Start Date”, matching that of the HR
system. The Payroll worker also provided an “Effective To Date” of
Monday October 8th to reflect the 9-month employment contract.
Two more stories from my time working with Bill. The first was
where a data analysis showed a spike in errors for the IT systems’
data from one phone exchange. Bill dug a bit deeper, and uncovered
the fact that the staff at that particular exchange never updated the
central records because they had all of the information they needed
on a spreadsheet. Then there was the time when Bill’s dogged
pursuit of the truth uncovered an illegal hard-wired patch between
two telephone exchanges.
A number of us have seen less dramatic examples of where the IT
systems don’t capture the facts. Maybe a “standard” process doesn’t
work in fringe cases, so people “break the rules” to get the job done.
A recent client is involved in forensic analysis related to deaths that
the coroner wants investigated. I won’t confront you with the sad
background, but suffice it to say that the central IT system simply
can’t cope with some of the tragic real-world scenarios. To do their
job, workarounds are absolutely essential, and totally ethical. But
the key message is that if we were to load the data from operational
systems into a Data Vault and then analyze it in isolation from the
“real world”, we could come to some dangerously wrong
conclusions. To try to protect the community by learning from the
past, we must have solid analysis. In this forensics setting, lives
depend on it.
What’s the key message from all of this? That while the Data Vault
can facilitate data analytics by triggering lines of investigation, or
providing detailed insight as the analyst drills down, we need to
have a healthy skepticism and challenge what the data is telling us.
A capable human being is required to drive what Dan and Michael
call “gap analysis”. Don’t underestimate the value that the Data
Vault can add to the equation in providing an integrated view, and
dates that challenge how processes are really working, but I do
encourage you to dig deeper. Bill used to require his team to “ride
the trucks”, even if that meant crawling around a dirty, dusty phone
exchange to understand what was really happening.
A bit on projects
Agility
Data Vault has the potential to be “agile”, but agility doesn’t just
happen because someone thinks it’s a good idea. There’s a lot of
stuff that needs to be in place for agile to succeed. Just some of the
things we might need for any agile software development project
include:
Some idea of the goals. No, not hundreds of pages of
excruciatingly detailed requirements specifications, but a broad
idea of what success might look like.
Some tooling. What’s the development environment (Java, .NET
…)? Have we got an integrated development environment? What
about automated / regression testing tools? Can we easily do
DevOps rapid promotion to production? Have we got something
as fundamental as a database to store our data! And on and on.
A team. Pretty obvious? But in agile, it’s not about having a
bunch of technical propeller-heads locked away. The business
needs to be on board, too. And agile projects recommend the
team be co-located, which leads to …
Facilities – a work area, probably a whiteboard, some
collaboration tools (even if they’re Post-it notes), computers
already loaded with required software, … And a good coffee
machine?
Training. Sure we can learn on the job, but if we want to hit the
ground running, have at least some of the team proficient in the
languages, tools, and processes.
Permission. You might wonder why I raise this, but I’ve seen
projects threatened by people who say things like “You can’t have
access to that data” or “You can’t use those tools” (because
they’re not authorized by head office, so the resultant code will
not be permitted to be moved to production) or “You can’t have
Sam on the project because something more important has just
come up”.
Much of the list above applies to any agile project, whether it is for a
Data Vault build or not.
Platform & tooling projects versus the Data Vault project
There are two threats to Data Vault projects I’ve observed that relate
to muddying the waters on project scope.
The first threat is confusing the project to build required
infrastructure with the project to build a Data Vault. Maybe we want
Hadoop as part of our architecture? Fine. And we need a project to
get it in place in our organization? Fine again. And we want to fund
and manage the Hadoop implementation as part of the Data Vault
project? That’s where I see red lights and hear warning bells
clanging.
No matter what the new software components might be (Hadoop as
described above, or real-time mechanisms, or service-oriented
architectures, or …), if they’re part of our strategy, identify the
interdependencies, but consider funding and managing them
separately.
There’s an option we might want to consider: perhaps we can at
least start the Data Vault project on an interim platform, deliver
tangible value to the business, and accept we may have some
technical debt to pay back when the other projects catch up with us.
After all, much of the metadata we need (including for the design of
Hubs, Links, and Satellites, and source-to-target mapping) should
turn out to be relatively stable even if the underlying technology
changes.
The second threat is similar, but closer to home for the Data Vault
project, and it’s about confusing the Data Vault delivery project with
the tool evaluation / acquisition / build project. Sooner or later
(probably sooner), we will want to use a tool for moving the data
from layer to layer, doing extract/transform/load, though not
necessarily in that order!
We may choose to buy a tool, and that strategy has lots of merits. I
encourage you to check out what’s on the market. Evaluate the
candidates. Do we need to specifically model columns such as
Source and Load Date-&-Time stamps, or is the tool smart enough to
know these are part of the Data Vault package? Can we define,
manage, and execute business rules? Do they integrate with our
testing tools and DevOps strategies, or maybe they even include
them? Lots of questions. And if we can find a tool that meets our
requirements, or is at least “sufficient”, that’s great.
Alternatively, we might decide to build our own tool, or at least
hard-code some bits while we do a Data Vault proof-of-concept.
I consulted to one organization that was very altruistic. It wanted to
gift its entire Data Vault (less the data) to others in the same
industry, and not have them need to get a license for the Data Vault
toolset. They chose to develop the tools in-house so they could give
them away. I take my hat off to them for their generosity. But I simply
want to warn that building your own toolset does come at a price. And like
with establishing other software infrastructure, I recommend that
the build of a tool be separately funded and managed.
A few hints
OK, we’ve got everything in place. We’ve got the team, they’re
trained, we’ve got the complementary software, a tool, and we’re
ready to go. I’ve got a few final hints for us to consider as we build
our Data Vault:
At the very outset, check out the infrastructure bits we’ve been
given to work with by creating a “walking skeleton” – some tiny
bits of code that exercise the architecture, end-to-end. Yes, agile is
meant to deliver business value as part of each iteration, but treat
this as a technical spike to try to eliminate the possibility of
architectural show-stoppers.
Next, do something similar to give end-to-end visibility of some
business data. Pump a tiny bit in from a source, and push it all of
the way through to end-user consumption. Again, perhaps this
might provide minimal business value, but it shows the business
how the whole thing works. More importantly, it provides a
baseline on which we can build to progressively deliver
incremental business value.
Now for the remaining iterations, keep delivering end-to-end,
focused on tangible business value.
There’s one phrase that appeared in all of the above bullet points.
It’s “end-to-end”. Why do I see this as being important? I’ve
unfortunately seen tool vendors, and Data Vault consultants, who
impress with how quickly they can load data into the Data Vault, but
fail to give enough focus on actually delivering against business
expectations. The goodwill that might be won by demonstrating a
speedy load will sooner or later evaporate unless business value
pops out the other end of the sausage machine.
One more suggestion. In the earlier section titled “Is ‘agile
evolution’ an option?”, I shared a confronting story. The lack of a
data model hampered an agile project. The team took one iteration
to assemble a relatively stable data model. Subsequent iterations
resulted in a 12-fold increase in productivity.
My conclusion? For a Data Vault project, I suggest that the data
modeler(s) on the team try to think about what’s coming up, at least
one sprint ahead of the development.
A few controversies
There are a number of well-recognized voices in the Data Vault
community in addition to that of Dan Linstedt, the founder of Data
Vault. I’ve heard a joke that goes something like this: “If you’ve got
three data modelers in a room, you’ll get at least four opinions.”
Oops.
I take heart from the PhD of Graeme Simsion on a related matter, as
adapted for publication in his book, “Data Modeling Theory and
Practice”.[45] I can’t do an entire book justice in a few sentences, but
here’s my take-away from his work. If we’ve got a bunch of data
modelers, and each one thinks there can only be one “correct”
solution for a given problem, don’t be surprised if there is heated
disagreement. Conversely, if all accept that several variations may
actually “work”, a more balanced consideration of alternatives and
their relative merits is possible.
I think it is healthy if there is room for respectful discussion on
alternatives to Data Vault modeling standards. For me, my baseline
is the Data Vault 2.0 standard, but I welcome objective consideration
of variations, especially as they may suit specific circumstances. Dan
Linstedt also welcomes the standards being challenged. He just,
quite rightly, warns against changes without due consideration as to
unwelcome side effects.
Modeling a Link or a Hub?
We may encounter debates as to when something could or should
be modeled as a Hub versus modeling it as a Link. I know this is
contentious in some quarters. Part of me would prefer to not even
raise the topic, as whatever position I take, I suspect someone will
disagree. Nonetheless, avoidance of the issue won’t make it go away,
so I will stick my neck out, hopefully in a manner that is respectful
of the views of others.
I would like to share two scenarios that might help shed some light
on this topic.
In the first scenario, there is an accident on a highway. Cars are
damaged, drivers are upset, insurance companies may be faced with
claims, but thankfully no one has died. From the perspective of one
of my clients, there may be a police report with “links” to cars and
their registration numbers, drivers and their driver’s license
numbers, health practitioners who may assist, police force members
identified by their member number, and so on. The “key” message
(pun intended) is that there is no business key for the accident itself.
However, in our Data Vault, we can create a Link (and maybe a
Satellite – see the next “controversy” for comments on Links with
Satellites) to associate the accident with the cars, drivers, health
practitioners, police members, and more.
In the second scenario, a fatality occurs and forensic examination of
the deceased is required. A “Case Number” business key is
assigned. In the Data Vault, there is a Case Hub. We can create an
instance in the Case Hub, populate its Satellite, and also populate
Links to the cars, drivers, health practitioners, police members, and
so on.
Which model is correct, or are both models OK? Here are my
starting-point rule-of-thumb guidelines for choosing Hubs versus
Links, at least as they relate to business views of the world:
If the thing to be modeled is recognized by the business as a
tangible concept, with an identifiable business key, it’s a Hub.
If the thing to be modeled looks like an “event” or a “transaction”
that references several Hubs via their business keys, but doesn’t
have its own independent business key, it’s a Link.
There we are. Some nice, clean guidelines.
But what about an event or transaction that also has a clearly
recognized business key? It seems to span my two definitions; it’s
an event so it’s a Link, but it has an independent business key so it’s
a Hub! For example, maybe an engineering workshop has
“transactions” that are work orders, with a Work Order Number.
If the work orders are “recognized by the business as a tangible
concept, with an identifiable business key” (such as a Work Order
Number given to the customer so he/she can enquire on progress),
then to me, the work orders look like they belong to a Hub. (If
something walks like a duck, quacks like a duck, and swims like a
duck, maybe it is a duck.) But if the transaction didn’t have a Work
Order Number and was identified by nothing more than the
collection of business keys in related Hubs, the work orders look
like they belong to a Link.
Even if my guidelines are helpful, please recognize that, as noted by
Graeme Simsion’s research, there might still be times when
experienced modelers may have to agree to disagree, take a position,
and see how it works.
Roelant Vos also adds a valuable perspective. We can start by
modeling at a higher level of abstraction, capturing how the
business sees their world. From this position, if we have flexible
tools, we may be able to generate variations of the physical Data
Vault tables while having a single, agreed logical view. That
approach may actually defuse some of the tension in the following
topics. Thanks, Roelant!
Modeling a Satellite on a Link?
I’ve got another controversy to share at this point. Some suggest
that a Link should never have a Satellite, apart from maybe an
“Effectivity Satellite” that has a From Date and a To Date to record
the period the relationship was active. Those who hold this view
seem to suggest that if a “thing” has attributes, it has to be a Hub.
Conversely, others happily model Satellites against Links.
For a moment, I want to step back to traditional data modeling, then
to the Unified Modeling Language (UML) to see if we can get some
clues to help us decide which way to go.
In a traditional data model, we have relationships, and these
somewhat approximate a point-in-time equivalent of a Data Vault
Link. If an operational system has a one-to-one or one-to-many
relationship, the relationship itself is represented by the presence of
a Foreign Key in one entity. This type of relationship is unlikely to
have associated attributes, and if it does, they will be mixed in with
other attributes in the entity holding the Foreign Key. If data from
this operational system maps to a Data Vault Link, there is probably
no data to even put in a Satellite off the Link.
Conversely, if the relationship is a many-to-many relationship,
modeled as a resolution entity, it can have attributes in addition to
the Foreign Keys that point to the participating “parent” entities.
When we map this resolution entity into a Data Vault, I think most if
not all people are happy that the additional attributes go into a
Satellite. But is it a Satellite hanging off a Link, or a Satellite hanging
off a Hub?
Before we make a decision, let’s quickly look at the UML Class
Diagram notation – a modeling style with some correlation to a data
modeler’s Entity-Relationship Diagram. A UML class approximates
an ERD entity, and a UML association approximates an ERD
relationship. Now if the association has its own attributes, a new
type of object appears – an “association class”. It’s an association (a
relationship) that can have its own attributes in a dedicated class
(entity). It looks something like a data modeler’s resolution entity,
but can hold attributes for an association (relationship) even if the
cardinality is not many-to-many.
So where does all of this lead us? One answer is that there are many
answers! Dan Linstedt’s Data Vault 2.0 standard of allowing
Satellites to hang off a Link looks like it’s got good company with
data modeling resolution entities, and with UML association classes.
Others may argue that the UML association class gives us a hint that
the moment an association / relationship has attributes, it needs a
class, and does that hint at a Hub in the Data Vault world?
OK, I will share my opinion; for me, I like to start by considering
whether a “thing” looks like a Hub or a Link (see the section
immediately above), independent of whether it has its own attributes
or not. And if it turns out to look like a good Link, and it has
attributes, I am happy to create a Satellite on the Link. I think this
approach is clean, simple, consistent, and it aligns with much of
what I observe in both data modeling and class modeling (UML)
practices.
Having said that, you may reach a different conclusion, and
hopefully we can still be friends!
Normalization for Links, and Links-on-Links
We may encounter guidelines on Links that encourage the structure
to be “de-normalized”. A purist data modeler may look at the
examples, and suggest that in fact the structures are being
“normalized”.
In some ways, it’s simply not worth the debate. Very few can quote
precise definitions for first, second and third normal form, let alone
Boyce-Codd normal form, or fourth or fifth normal form. And I
suspect few care about such abstract terminology anyhow. It is
arguable that a working knowledge of these levels of normalization
is more important than their definitions. In another way, precision
as to whether something is normalized or not does matter. But
before we get to that, let me note that the examples I’ve seen on
Data Vault modeling sometimes touch on the more advanced levels
of normalization, such as 5th Normal Form (5NF). That may sound
scary to some, so let me please share a very simple example from a
cut-down model for the airline industry. The examples given in the
text below the diagram are not meant to be necessarily correct, but
just indicative of the types of information that could be held.
The large circles represent core business concepts for airlines. They
could be relational tables in an operational source system, or Hubs
in a Data Vault.
The Airport table/Hub holds instances for the airports of interest
– Melbourne, in Australia (MEL), Los Angeles in the USA (LAX),
Heathrow in the UK (LHR), and even the tiny airport I’ve used in
a remote part of Australia, the Olympic Dam airport (OLP).
The Airline table/Hub holds instances for participating airlines
such as Qantas, United, and British Airways.
The Aircraft Type table/Hub holds instances for types of aircraft
such as Airbus A380, Boeing 747, Boeing 767, and maybe the
Cessna 172.
Forming an outer ring are three boxes with arrows to the core
entities. They could be many-to-many resolution entities in the
operational system’s relational database, or Links in a Data Vault.
The Terminal Rights table/Link holds pairs of codes from the
Airport and Airline entities, defining which Airlines have
terminal rights at which Airports. Examples might include that
Qantas has terminal rights in Melbourne and Los Angeles, United
has rights in Los Angeles, and British Airways has rights in
Heathrow and Melbourne.
The Landing Authorization table/Link holds pairs of codes from
the Airport and Aircraft Type entities, defining which Aircraft
Types are authorized to land at which Airports. Examples might
include that Airbus A380s and Boeing 747s are authorized to land
at Melbourne airport. Not in the list might be Cessna 172s at
Heathrow (emergencies aside, it’s too busy to permit regular use
by light aircraft), and Boeing 767s at Olympic Dam (it’s just a dirt
airstrip next to an outback mine).
The Ownership table/Link holds pairs of codes from the Airline
and Aircraft Type entities, defining which Aircraft Types are
owned by nominated Airlines. Qantas owns A380s, …
Each of the above three tables relates to just two “parent” tables.
They are normalized. If we tried to have a single three-way table that
held the Cartesian product of all combinations from the three
separate tables, it would not only cease to be normalized, but it may
hold combinations that are not valid. For example:
Maybe Qantas has terminal rights in Melbourne (MEL), Los
Angeles (LAX), and Heathrow (LHR).
Maybe Qantas owns Boeing 747s, 767s and Airbus A380s.
Maybe Melbourne (MEL), Los Angeles (LAX), and Heathrow
(LHR) can handle Boeing 747s, 767s and Airbus A380s.
But a Cartesian product of all of these data sets may be misleading.
Perhaps Qantas chooses to only operate its A380s between
Melbourne and Los Angeles (not its 767s). Maybe it operates 767s
between Melbourne and Heathrow (not its A380s). It is misleading
to create a 3-way Link between the three Hubs, based on possible
combinations; retaining the set of three 2-way Links is actually
holding the relationships in a normalized form.
Now we turn the example on its head.
The single 3-way Operations table/Link holds triplets of codes from
the Airport, Aircraft Type, and Airline entities, defining which
Airlines actually operate which Aircraft Types out of which
Airports. One of the examples above was Qantas operating its A380s
into and out of Los Angeles.
It was dangerous to assume we could take all combinations of the
three 2-way Links to generate a single 3-way Link to represent which
airline flew what aircraft types in and out of what airports. Likewise,
we can’t simply take the single 3-way Link and assume it represents
all combinations for the other 2-way Links. Maybe no airlines
actually operate out of Olympic Dam with a Cessna 172, but the
Airport is capable of handling them. This time, retaining the single
3-way Link is actually holding the relationships in a normalized
form!
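For readers who like to see the point run, here is a tiny Python illustration with invented data. The set of actual operations is a strict subset of everything the three 2-way Links would permit, so neither the 3-way Link nor the set of 2-way Links can be derived from the other:

# The three 2-way relationships (normalized), with invented data
terminal_rights = {("QF", "MEL"), ("QF", "LAX"), ("QF", "LHR")}
ownership       = {("QF", "A380"), ("QF", "B767")}
landing_auth    = {("MEL", "A380"), ("MEL", "B767"),
                   ("LAX", "A380"), ("LAX", "B767"),
                   ("LHR", "A380"), ("LHR", "B767")}

# Every combination merely *permitted* by the three 2-way Links
permitted = {(airline, airport, aircraft)
             for (airline, airport) in terminal_rights
             for (owner, aircraft) in ownership
             if owner == airline and (airport, aircraft) in landing_auth}

# What the airline *actually* operates -- a separate business fact
operations = {("QF", "MEL", "A380"), ("QF", "LAX", "A380"),
              ("QF", "MEL", "B767"), ("QF", "LHR", "B767")}

assert operations < permitted  # strictly fewer facts than permitted combinations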
We could have included another Link on the diagram, showing
Flights. It would be a 4-way Link, involving the Airport table/Hub
twice (From and To), plus the Airline and the Aircraft Type. It too
would be normalized for this purpose. So what are the take-away
messages?
1. The logical model of the business concepts and the business
relationships should be assembled as a normalized view.
2. The Data Vault model can have multiple Links that in part
share common “parent” Hubs, and that’s fine. There is no need
to have Link-to-Link relationships – keep them separate, just
like the 3-way Link and the set of 2-way Links have
independent lives as shown in the diagram.
Closing comments on controversies
I’ve seen divisive debates on alternative approaches. My own
opinion is that while the differences are important, there is much in
common. Let’s leverage off what is truly common, and respectfully
and objectively evaluate the differences.
More uses for top-down models
Remember that Dan Linstedt said that if we have an enterprise
ontology, we should use it? That’s great if we have one. But what if
we don’t? The section of this book titled “Task #1: Form the
Enterprise View” was written to help people who didn’t already have
an enterprise data model to create one quickly, without compromising
the quality.
But can the same enterprise data model be used for other purposes?
The good news is there may be many initiatives within an
organization where an enterprise data model can potentially make a
contribution. Some of these are listed below.
Strategic planning: I’ve participated in several projects where
strategic planning, for the entire enterprise or just for the IT
department, was assisted by the development and application of
an enterprise data model. In each case, a “sufficient” enterprise
(top-down) model was developed in a few weeks.
Enterprise data integration: One of my clients was involved in the
merging of 83 separate organizations. The management did an
amazing job given the pressing time frames mandated by an act
of parliament. But after the dust settled, an enterprise data model
gave them a long-term framework for integrating the multitude of
data to provide a single, consistent view.
Master Data Management (MDM) and Reference Data
Management (RDM): It’s probably a self-evident fact to say that
management of data as a corporate asset requires a corporate
vision! An enterprise data model can provide the consistent
framework for Master Data Management and Reference Data
Management initiatives.
Benchmarking candidate IT package solutions: A company
engaged me to develop an independent view of their data. They
didn’t want their target model to be influenced by assumptions
consciously or unconsciously incorporated in any vendor software
packages that had been shortlisted for selection. The top-of-the-
list package was ruled out when compared to the essential
features of the benchmark model. It was estimated that the cost
savings from this insight could be measured in millions of
dollars.
Facilitating IT in-house development solutions: One client wanted
an enterprise data model to drive the design for their service-
oriented architecture, particularly the data structure of the data
payloads in XML. Another wanted a UML class model to drive a
Java development, and the enterprise model (using the UML
notation) was the kick-start they needed.
Facilitating communication: Most if not all of the above uses have
a technical aspect. What I have seen again and again is the
participation of business and IT people in the development of an
enterprise data model delivering a massive beneficial side effect –
the two sides now talk the same language!
Process modeling: Further to the above, the existence of a
common language can facilitate the development of process
definitions.
… and last but not least, we return to the catalyst for this book –
using the enterprise data model to shape a Data Vault design.
So an enterprise data model can be used for many purposes. But
what happens when it is used for more than one of these reasons, in
a single organization?
Even if our motivation for pursuing top-down big-picture enterprise
modeling had been for only one of the above reasons, that’s enough.
But if the model is used across two or more such initiatives, the
multiplier effect kicks in as the investment in one area contributes
to another related area. For example, if several agile in-house
development projects also produce source data that will be required
in a Data Vault, why not base all such projects on a common data
architecture? Everybody wins.
In conclusion
Australia has some cities that grew somewhat organically, and their
origin is reflected in some chaotic aspects! Australia also has some
major cities that were planned from the outset, where the designers
articulated a vision for the end-state, and the authorities defined a
clear path to get there.
Some Data Vault practitioners appear to follow a simplistic bottom-
up raw source-centric Data Vault design and hope the business-
centric Data Vault solution will magically appear. Like the source-
driven tool vendors, they can impress with their speedy delivery of
something, but subsequent delivery of tangible business value may
be much slower, and the project, and therefore the Data Vault
initiative, may fail.
Of course, the Data Vault is flexible, adaptive, and agile. You can
create source specific Hubs and Links, and later create business-
centric Data Vault artifacts to prepare the raw data for business
consumption. But if you head down this route, don’t be surprised if
you experience an explosion of source-specific Hubs, followed by a
matching explosion of source-specific Links.
Data Vault is about integration. Just because Data Vault can use the
business rules layer to untangle a messy gap between source-
specific Hubs and Links, and ready-for-consumption artifacts, why
should it? Why not create a small number of clean, agreed, business-
centric Hubs, and then define business-centric Links around them to
represent the more stable business relationships? If you take this
approach of identifying business-centric Hubs and Links, the
source-specific Data Vault Satellites can now be loaded happily
against this framework, immediately providing a useful level of
integration.
The good news is that this can be done. By properly designing the
Data Vault, from the very outset, based on an enterprise data model,
you can expect to reduce “business debt”. And hopefully this book
will help you achieve tangible business value both initially and in
the longer term.
And now a parting thought. While Data Vault is capable of
contributing to cleaning up problems in the source operational
systems, let’s not forget the role of data governance in moving
resolutions back to the source where reasonably possible.
Operational staff who use these source systems will thank you. But
to guide such an endeavor, you’ll benefit if you start with a top-down
enterprise model.
Appendix
Common Data Model Patterns – An Introduction
Pattern: Position
Some simplistic models for reporting hierarchies within
organizations depict employees reporting to employees. For
example, if Alex is the manager of Brooke, an Employee class can
have a simple self-referencing relationship from the subordinate
employee’s record to that of their manager.
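To make the self-referencing relationship concrete, here is a minimal sketch in Python; the Employee class and its manager relationship come from the pattern, while the attribute names themselves are illustrative assumptions rather than part of the book's model:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Employee:
        name: str
        # The self-referencing relationship: the subordinate's record
        # points to the record of their manager (None at the top).
        manager: Optional["Employee"] = None

    alex = Employee("Alex")
    brooke = Employee("Brooke", manager=alex)   # Brooke reports to Alex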
Pattern: Agreement
An Agreement (or “contract”) represents some formal or informal
arrangement between parties. Examples of formal Agreements
might include lease contracts for vehicles, or employment contracts.
An example of an informal Agreement might be the recording of
Dan’s willingness to chair tomorrow’s design review meeting.
An Agreement may be “typed” via the Agreement Type class, or if specific attributes are required, it may be typed as a subclass, such as the Employment Contract shown as an example.
An Agreement may be further detailed through its Agreement Item
class. For example, an Agreement to purchase some products may
have items to detail each intended product purchase.
Agreements have an interesting relationship with the Document
class, which is responsible for managing instances of hard-copy or
electronic documents. Not all Agreements require collection of
“documentary evidence” for the Agreement, but where they do, the
Document class comes into play. The Document class has its own pattern, and is described separately. The Agreement class is responsible for managing the structured data related to Agreements. Examples of attributes recorded might include date signed, status, reference number, and the involved parties and their roles.
As shown in Figure 115, Agreements are typically associated with
the Party class or the Party Role class that records the parties
involved in the Agreement, and their roles. For example, an
employment contract could be expected to have the employer and
the employee involved. A transfer-of-land agreement might involve
the vendor(s), the buyer(s), solicitors for both sides, banks for both
sides, perhaps a guarantor, a witness, and so on.
As explained in the Party Role pattern (refer to “Figure 37: Party Role pattern”, and its supplementary text), the participants in the
Agreement can have declared roles such as the role of the real estate
agent, or contextual roles such as the witness to the signature.
Agreements can also be associated with other Agreements. For example, a subcontract may be associated with the larger, overarching contract, or one Agreement may be the updated replacement for a now-obsolete version.
The Agreement To Agreement Type class defines allowable types of
interrelationships, and the roles of each participant. For example, if
the type of relationship between Agreements is a “Replacement”
relationship (one Agreement is the replacement for another), the
Participant 1 Role might be defined as “Obsolete Version” and the
Participant 2 Role might be defined as “Replacement Version”.
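As a minimal sketch of these core classes in Python (the class and role names follow the pattern; the other attribute names are illustrative assumptions):

    from dataclasses import dataclass

    @dataclass
    class AgreementType:
        name: str                       # e.g. "Employment Contract"

    @dataclass
    class Agreement:
        reference_number: str
        status: str
        agreement_type: AgreementType   # "typing" via the Agreement Type class

    @dataclass
    class AgreementToAgreementType:
        name: str                       # e.g. "Replacement"
        participant_1_role: str         # e.g. "Obsolete Version"
        participant_2_role: str         # e.g. "Replacement Version"

    @dataclass
    class AgreementToAgreement:
        relationship_type: AgreementToAgreementType
        participant_1: Agreement        # plays the Participant 1 Role
        participant_2: Agreement        # plays the Participant 2 Role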
Pattern: Resource
A Resource is sometimes also known as an asset.
Examples could include buildings, computers, company cars,
consumables, and much more. The model below gives several
examples from a wildfire emergency response scenario. It is
interesting to note that from a wildfire logistic officer’s perspective,
people could be seen as “just” resources! The Human Resources
people might disagree.
Typically, the Resource class is subclassed when there is a requirement for specific attributes and/or associations. It must be
noted that some Resources do not require any specialization and
hence may be treated using the generic Resource class (with a
resource type – see the associated Resource Type class).
The concept of a Resource may be relatively straightforward, but its
relationships may be more complex. One of the inter-relationships
between types of Resources is shown below, where there is a
relationship between a slip-on four-wheel-drive tray vehicle and the slip-on tank-&-pump unit that is attached to the vehicle in summer.
Other types of relationships for a Resource might include a number of types of objects, all of which are patterns in their own right:
An association with its geospatial Location.
Its association with Events, such as a fire truck involved in an
accident.
Its involvement with Agreements, Accounts, and Parties (e.g. the
association of a helicopter leased from a United States supplier).
Its assignment to certain Tasks, such as a fire truck assigned to work on firefighting activities.
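A minimal Python sketch of the pattern core might look as follows; the fire truck subclass and its attribute are illustrative assumptions from the wildfire scenario, not prescriptions:

    from dataclasses import dataclass

    @dataclass
    class ResourceType:
        name: str                     # e.g. "Slip-on tank-&-pump unit"

    @dataclass
    class Resource:
        identifier: str
        resource_type: ResourceType   # generic typing via the Resource Type class

    @dataclass
    class FireTruck(Resource):
        # Subclass only where specific attributes/associations are required.
        water_capacity_litres: int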
Pattern: Event
The idea behind an Event is “something noteworthy” that happens
at a point in time. Maybe going for a cup of coffee is an event of
importance to me, but it is probably not considered to be
noteworthy from a corporate perspective. In an emergency response setting for wildfires, the outbreak of a fire is a major event. A
hopefully infrequent but still vitally important event is an
occupational health and safety “personal injury” event if a
firefighter is hurt.
When something happens that the business deems to be
“noteworthy”, a record of that business event is created.
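In code, the core of the pattern is little more than a typed record with a point-in-time stamp; a minimal sketch, with attribute names assumed for illustration:

    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class Event:
        event_type: str         # e.g. "Fire outbreak", "Personal injury"
        occurred_at: datetime   # an Event happens at a point in time
        description: str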
Pattern: Task
The idea behind a Task is typically some work to be done. A Task can
represent a small unit of work. An example might be a simple
checklist item (“make sure that someone rings the client back to
notify them of a delay”, “perform a credit check on the prospect”
…). A Task can also represent a large and complex unit of work such
as a major project or even an entire program of work. Of course,
some of these may require specialized attributes. For example, a
project may have a budget and a project manager. Nonetheless, they
are all represented by a generic Task, which can be extended by
appropriate subclassing.
The Task class has two subclasses, namely the Specific Task class and
the Template Task class:
Template Tasks define an overall profile of Tasks that occur
frequently. For example, it would be reasonable to expect that a
template would be defined that describes a profile for day-to-day
responsibilities such as responding to a product enquiry.
Templates have a profile, but are not tied to any calendar dates/times, and may describe generic deployment of types of resources without typically nominating specific resources.
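A minimal Python sketch of the two subclasses follows; the assumption that a Specific Task carries the concrete dates and resources a template deliberately omits is mine, inferred from the template description above:

    from dataclasses import dataclass, field
    from datetime import datetime
    from typing import List, Optional

    @dataclass
    class Task:
        name: str   # the generic Task; extend by subclassing where needed

    @dataclass
    class TemplateTask(Task):
        # A template has a profile but no calendar dates/times, and names
        # types of resources rather than specific resources.
        resource_types_required: List[str] = field(default_factory=list)

    @dataclass
    class SpecificTask(Task):
        # Assumption: a Specific Task adds the concrete details that a
        # template deliberately omits.
        template: Optional[TemplateTask] = None
        scheduled_start: Optional[datetime] = None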
Pattern: Product (tangible goods, intangible services)
For Acme Air, Product Type would have three entries, each instance
representing one of the company’s product types as clients might
expect to find in the company’s product catalogue. And Acme’s
Product Item would have 900 entries, each instance representing
one product instance as it relates to one customer, for example
Alex’s Acme “evaporative cooler” model air conditioner with the
serial number 12-34-56.
Len’s Product model distinguishes between tangible (physical)
goods such as an air conditioner, and intangible services such as the
installation, or servicing, of the air conditioner. The extended model
now looks like:
If we look at Acme Air, we started with three Product Types, but
Acme Air also sells two services – the installation of their air
conditioners, and the servicing of those air conditioners. We now
have five Product Types.
What the above model is saying is that one Product Type can be
either one Goods Type (such as Acme Air’s evaporative cooler), or
one Services Type (such as servicing). That might be enough for some scenarios, but a more complex scenario demands more flexibility. For example, a telecommunications company might have the following amongst its range of product types:
Mobile Phone XYZ: Comprised of one mobile phone handset (a
goods type), a battery charger (another goods type), a missed-call
answering facility (a service type), and 12 months international
roaming (another service type).
Basic Home Entertainment Pack: Comprised of one set-top-box (a
goods type), and access to a library of golden-oldies movies (a
service type).
New Home Starters Pack: Comprised of two Mobile Phone XYZ
product types plus one Basic Home Entertainment Pack product
type.
The first two Product Types are each made up from multiple Goods
Types and Services Types. The third Product Type is a package made
up of other Product Types. A model to support this diversity
follows.
Note:
A Product Item can be made up from one or more tangible Goods
Items and/or one or more intangible Services Items and/or one or
more other Product Items.
Similarly, a Product Type can be made up from one or more
tangible Goods Types (as specified via the Product Goods
Component many-to-many resolution class) and/or one or more
intangible Services Types (as specified via the Product Services
Component many-to-many resolution class) and/or one or more
other Product Types (as specified via the Product Sub-product
Component many-to-many resolution class).
It is to be noted that one Goods Type (such as a type of set-top-
box) can be included in the construction of multiple Product
Types. Similarly, Service Types can be used again and again in the
construction of varying Product Types. And of course, these
flexibly constructed Product Types can themselves be re-packaged
again and again to create bundled Product Types.
That’s a lot of flexibility for constructing a product catalogue.
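A minimal Python sketch of this flexible structure follows; the three many-to-many resolution classes are collapsed into simple lists, and the sample values echo the telecommunications example above:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class GoodsType:
        name: str               # e.g. "Mobile phone handset"

    @dataclass
    class ServicesType:
        name: str               # e.g. "12 months international roaming"

    @dataclass
    class ProductType:
        name: str
        # One Goods Type or Services Type can appear in many Product Types,
        # and Product Types can themselves be re-packaged as components.
        goods_components: List[GoodsType] = field(default_factory=list)
        services_components: List[ServicesType] = field(default_factory=list)
        sub_product_components: List["ProductType"] = field(default_factory=list)

    phone_xyz = ProductType(
        "Mobile Phone XYZ",
        goods_components=[GoodsType("Mobile phone handset"),
                          GoodsType("Battery charger")],
        services_components=[ServicesType("Missed-call answering facility"),
                             ServicesType("12 months international roaming")])
    starters_pack = ProductType(
        "New Home Starters Pack",
        sub_product_components=[phone_xyz, phone_xyz])  # two Mobile Phone XYZ;
        # the Basic Home Entertainment Pack component is omitted for brevity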
Pattern: Location
This data model pattern represents data constructs typically found in a Geographic Information System (GIS) for mapping items to a position on the earth's surface. We could develop our own GIS, but for most of us this facility would be purchased as an off-the-shelf component rather than home-built. Nonetheless, the pattern is useful for inclusion in a high-level enterprise model as it represents concepts the business may need, even if its physical implementation is via a purchased solution.
In the wildfire emergency response scenario, recording data about
locations is vital. What is the point at which a fire started? What line
on a map represents the current fire front? What area has already been burned, and based on fire behavior predictions, what area is likely to be burned in the near future?
The Open Geospatial Consortium (OGC) has rich definitions for the
structuring of data about geospatial locations. However, the
following model may be a sufficient approximation of location data
structures for top-down enterprise modeling.
There is a separation of concerns between the Geometry class, which manages positioning of the location on a map (for example, the boundary of a shire council's area of responsibility), and the Geospatial Object class, which records structured data (attributes and relationships) for the various types of mapped objects (for example, the name of the Council, a list of all contacts within the Council, statements on intended zoning changes, and so on).
The Geometry class is subclassed into Point (for example, the
position where an event occurred), Line (for example, the position of
a road or a river), and Polygon (for example, the area covered by the
base of a building or defined by a land title). This can be extended to
include 3-dimensional shapes also.
The model diagrams in this book use the UML notation, but usually only show the class and its attributes. For the Geometry class, the Get Implicit Proximity operation is included. This operation is noted to highlight the fact that a Geometry within a Geographic Information System (GIS) will have functions to determine one geospatial object's proximity to other geospatial objects. Examples include determining what a given object is contained within, what it contains, what it shares a partial boundary with, and what it partially overlaps.
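A minimal Python sketch of the Geometry subclasses and the operation follows; the coordinate representations are assumptions for illustration, and the proximity computation itself is left as a placeholder:

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class Geometry:
        def get_implicit_proximity(self, other: "Geometry") -> str:
            # In a real GIS this is computed from the coordinates;
            # shown here only as a placeholder for the operation.
            raise NotImplementedError

    @dataclass
    class Point(Geometry):
        latitude: float
        longitude: float

    @dataclass
    class Line(Geometry):
        vertices: List[Tuple[float, float]]   # e.g. a road or a river

    @dataclass
    class Polygon(Geometry):
        boundary: List[Tuple[float, float]]   # e.g. the area of a land title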
The Geospatial Object class captures structured data (for the shapes
that can be presented on a map via the Geometry class). These
Geospatial Objects are classified according to entries in the
Geospatial Object Type class, which approximates the classification
of “layers” on a map (the “roads” layer, the “rivers” layer, and so
on). Where these different types of Geospatial Objects require their own attributes and/or associations, the Geospatial Object class may
be subclassed – National Park and Geopolitical Zone are shown as
examples.
The Geometry class had the Get Implicit Proximity operation to
dynamically compute the proximity of one shape to another. In
contrast, the Geospatial Object class delegates this responsibility to
the associated Geospatial Explicit Geometry class to record manually
determined proximities between a pair of Geospatial Objects. The
Proximity Type attribute can record classifications such as
“Complete containment”, “Partial overlap”, “Sharing of a partial
boundary”, or “Close”. These are not necessarily precise, and these
classifications may be supplemented by a Proximity Description
such as “Proceed from the first object in a northerly direction until
you encounter the black stump, then turn left for another 50 metres
...”!
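Continuing the sketch above (it reuses the Geometry class defined there), the delegation to manually recorded proximities might look like this, with attribute names again assumed for illustration:

    from dataclasses import dataclass

    @dataclass
    class GeospatialObject:
        name: str
        object_type: str       # approximates a map "layer", e.g. "roads"
        geometry: Geometry     # the shape drawn via the earlier sketch

    @dataclass
    class GeospatialExplicitGeometry:
        # Records a manually determined proximity between two objects.
        object_1: GeospatialObject
        object_2: GeospatialObject
        proximity_type: str         # e.g. "Partial overlap", "Close"
        proximity_description: str  # free-text supplement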
Pattern: Document
This data model pattern represents data constructs typically found
in a document / records management system, used to safely store
and retrieve important documents. Even though such software is
typically purchased as an off-the-shelf software component, this
pattern is useful for inclusion in a high-level enterprise model as it
represents concepts the business may need, even if its physical
implementation is via a purchased solution.
Chapter 11 of Hay’s “Data Model Patterns” book elaborates on the
concept. My simplified version follows.
The Document class is responsible for managing storage of
documents, be they paper-based Physical Documents, or Electronic
Documents. Electronic Documents in turn may be Unstructured
Electronic Documents (such as a smart phone video clip) or they can
be Structured Electronic Documents that contain structured,
machine-readable information (for example, in an XML document).
The Document Format class records the type of storage form, and
for some forms may correlate to typical file extensions (“doc”, “jpg”,
“xls”, etc.).
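A minimal Python sketch of the subclass hierarchy follows; the attribute chosen for each subclass is an illustrative assumption:

    from dataclasses import dataclass

    @dataclass
    class DocumentFormat:
        name: str               # storage form; may map to "doc", "jpg", "xls"

    @dataclass
    class Document:
        title: str

    @dataclass
    class PhysicalDocument(Document):
        filing_reference: str   # assumption: where the paper copy is stored

    @dataclass
    class ElectronicDocument(Document):
        document_format: DocumentFormat

    @dataclass
    class UnstructuredElectronicDocument(ElectronicDocument):
        pass                    # e.g. a smart phone video clip

    @dataclass
    class StructuredElectronicDocument(ElectronicDocument):
        schema_reference: str   # machine-readable, e.g. an XML document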
Pattern: Account
This data model pattern represents data constructs typically found
in an accounting package. Even though such software is typically
purchased as an off-the-shelf software component, this pattern is
useful for inclusion in a high-level enterprise model as it represents
concepts the business may need, even if its physical implementation
is via a purchased solution.
David Hay and Len Silverston offer multiple versions across their
books. My model below is intended to present an easy-to-
understand, light-weight simplification that may be a sufficient
framework for high-level modeling.
Each instance in the Account class represents one account within
one of the company’s financial ledgers.
Subclasses of Accounts Payable and Accounts Receivable are shown as examples of “typing” of Accounts. Alternatively, the Account Type class can be used to logically represent types and subtypes of Accounts.
The Account class has a self-referencing hierarchical relationship to
enable Accounts to be defined as sub-Accounts of other
consolidation Accounts.
Each instance in the Accounting Transaction class represents one
accounting transaction such as an invoice issued or a payment
received. Some of these types of Accounting Transactions may
contain many “items”. For example, an accounting transaction for a
customer invoice may contain many invoice lines, each being an
itemized charge as it relates to a discrete product purchase.
Similarly, a payment might itemize the particulars of what is being
paid. Each instance in the Accounting Transaction Item class
represents one such item.
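To close, a minimal Python sketch of the Account pattern core, with attribute names assumed for illustration:

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Account:
        name: str                  # one account within a financial ledger
        # Self-referencing hierarchy: a sub-Account rolls up into its
        # consolidation Account (None for a top-level Account).
        parent: Optional["Account"] = None

    @dataclass
    class AccountingTransactionItem:
        description: str           # e.g. one itemized invoice line
        amount: float

    @dataclass
    class AccountingTransaction:
        account: Account
        transaction_type: str      # e.g. "Invoice issued", "Payment received"
        items: List[AccountingTransactionItem] = field(default_factory=list)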
Account Pattern
and associated parties, 279
Account pattern core, 278
Account, Pattern, 277–79
Achieving Buzzword Compliance (Hay), 66
Agility, 26
Agreement, 257–60
Agreement pattern core, 258
Agreement pattern inter-relationships, 259
Agreements pattern and associated Parties, 259
associated Parties, and Account Pattern, 279
Bill Inmon schematic of, 25
Bradley, Chris, 66
Bridge tables, 230
Building a Scalable Data Warehouse with Data Vault 2.0, 29
Burbank, Donna, 66
business key, 34
child Satellite, 36
load time for, 35
Classification Codes, 282
Classification of Data, 280
Classification pattern, 281
Classification Scheme class, 281
Classification, Pattern, 280–83
conflicting designs, case study, 55–59
Consultants
and source system Data Vault, 53–55
Customer Hub design, iteration, 56, 57, 59, 61, 63
Customer Same-As-Link, 215
Customer Type
no history, 215
slowly changing history, 216
Dan Linstedt schematic, 27
Data Marts, 25
data model patterns, 253
Pattern Account, 277–79
Pattern Agreement, 257–60
Pattern Classification, 280–83
Pattern Document, 275–77
Pattern Event, 262–64
Pattern Location, 272–75
Pattern Position, 253–83
Pattern Product, 269–72
Pattern Resource, 260–61
Pattern Task, 264–69
Pattern, party and party roles, 254
Data Model Patterns (Hay), 269, 275
Data Model Resource Book, The (Silverston & Agnew), 280
Data Modeling for the Business (Hoberman, Burbank & Bradley), 66
Data Vault
business, 26
Hubs in, 32–34
origins of, 26
raw, 26, 27
source system, 52–53
source system and consultants, 53–55
Data Vault 2.0 components, 9
Data Vault modeling, 13, 51
Data Vault schematic, 27
Data Vault standards, 28–29
Data Warehouse, 31
data warehouse, history, 23–28
David Hay
Achieving Buzzword Compliance, 66
Document pattern, 276
document storage, 276
Electronic Document Location, 276
ELT (extract-load-transform), 26
Emergency Response (ER), 35, 45
Employee Hierarchy
Data Vault, 214
operational, 214
enterprise data model, 65–67
enterprise ontology, 14, 52
Entity-Relationship Diagram, 39
Entity-Relationship Diagram (ERD) notation, 31, 51
ER/RA, 35
ETL (extract-transform-load), 26
Event pattern, 262
Event pattern & Task pattern synergy, 263
Extensibility, 26
Fire data model, Data Vault, 31
Fire Hash Key, 34
Fire Hub, 34
Fire Reference Number, 34
Fire Truck Hub, 178, 180, 181, 182
Fire Truck Satellite, 184, 185, 186, 187
“Fire” data model, 188
Operational source system, 30
Fire-Truck-Assigned-To-Fire Link, 190
Fleet Management, 180, 183, 185
Fleet Management source data, 179
Foreign Key, 34, 37
Geographic Information System (GIS), 272, 273
Geospatial Object class, 274
Get Implicit Proximity operation, 273, 275
hash keys, 32, 36
purpose of, 33
Hay, David, 68, 75, 277
Data Model Patterns, 269, 275
Hoberman, Steve et al
Data Modeling for the Business, 66
Hub table, purpose of, 33
Hub, and Satellites, 34–36
Hubs, 32–34
Hultgren, Hans, 50
Inmon, Bill, 24
Kimball, Ralph, 23
Link Satellites, 44
Links, 37–39
cardinality on, 39–44
populating,
Same-As, 214
Satellites on, 44–47
Linstedt, Dan, 25–26, 29, 226
Load Date / Time, 33, 34
load job, restarting, 196
Location pattern, 274
Man-&-Woman Marriage, 41
marriage scenario, 40, 41, 42
Massive Parallel Processing (MPP), 35
natural business key, 32, 34
nominated business key, 32
Object Types
types and, 268
Olschimke, Michael, 26
ontology, 13–14, 51, 67
enterprise, 14
Open Geospatial Consortium (OGC), 273
overlapping data
fire trucks and, 177
parent Hub, 36
load time for, 35
Pattern
Classification, 280–83
Pattern Account, 277–79
Pattern Agreement, 257–60
Pattern Document, 275–77
Pattern Event, 262–64
Pattern for Complex products, 272
Pattern for Products, 270
Pattern Location, 272–75
Pattern Product, 269–72
Pattern Resource, 260–61
Pattern Task, 264–69
Pattern, Position, 254–56
Person-to-Person Marriage, 43
Point-In-Time (PIT) sample data, 227
Point-In-Time = end, 229
Position pattern, 255
primary key, 32
Product pattern core, 270
Product, Pattern, 269–72
purpose-built data marts, 24
Ralph Kimball
schematic, 24
Range End Break Rule, 283
Range Start Break Rule, 283
raw Data Vault, 26, 27
Record Source, 35, 45
Record Source column, 33
Relative data volumes, 64
reporting hierarchies, 254
Resource Acquisition, 45
Resource Acquisition screen, 189
Resource and Event subtypes relationships, 74
Resource assignment Satellites, 193
Resource pattern, 261
Resource Release function, 191
Same-As Links, 214
Same-As-Links
Customer, 215
Satellite for Emergency Response, 194
Satellite for Shared Resource Pool, 194, 195
Satellites, 34–36
on a Link, 44–47
purpose of, 35–36, 35
Scalable Data Warehouse
and Data Vaults, 26
Shared Resource Pool, 181, 186
Silverston & Agnew
The Data Model Resource Book, 280
Silverston, Len, 269, 277
source system Data Vault, 52–53
consultants and, 53–55
standard information / data model (SID), 270
surrogate primary key, 32
Task pattern, 265
Task, features of, 266–67
Task, Objects assigned to a, 267
Third Normal Form (3NF) Data Warehouse, 24
tool vendors, 52–53
Tracking Satellite
access load on, 223
Building Access and, 225
HR Load and, 222
HR load on, 224
Payroll load and, 225
Payroll load on, 224
transaction-processing systems, 23
Unified Modeling Language (UML) notation, 253
Utility patterns, 280
Zachman, John, 66
[1] https://danlinstedt.com/allposts/datavaultcat/datavault-models-business-purpose-data-as-an-asset/.
[2] Linstedt D. & Olschimke M. (2016) “Building a Scalable Data Warehouse with Data Vault 2.0”.
[3] http://roelantvos.com/blog/?p=1986.
[4] Hultgren H. (2012) “Modeling the Agile Data Warehouse with Data Vault”.
[5] https://danlinstedt.com/allposts/datavaultcat/datavault-models-business-purpose-data-as-an-asset/.
[6] https://en.oxforddictionaries.com/definition/ontology.
[7] https://danlinstedt.com/allposts/datavaultcat/datavault-models-business-purpose-data-as-an-asset/.
[8] Hoberman S., Burbank D., & Bradley C. (2009) “Data Modeling for the Business”.
[9] Hay D. (2018) “Achieving Buzzword Compliance”.
[10] https://en.wikipedia.org/wiki/Data_model, August 2018.
[11] Silverston L. & Agnew P. (2009) “The Data Model Resource Book, Volume 3: Universal Patterns for Data Modeling”.
[12] Hay D. (2011) “Enterprise Model Patterns: Describing the World”.
[13] Boehm B. & Turner R. (2003) “Balancing Agility and Discipline”, page 40.
[14] Hay D. (2011) “Enterprise Model Patterns: Describing the World”.
[15] Silverston L. (2001) “The Data Model Resource Book, Volume 2: A Library of Universal Data Models by Industry Types”.
[16] Silverston L. (2001) “The Data Model Resource Book, Volume 2: A Library of Universal Data Models by Industry Types”, pages xiii to xiv.
[17] Hay D. (1996) “Data Model Patterns: Conventions of Thought”.
[18] Fowler M. (1997) “Analysis Patterns: Reusable Object Models”, pages 2-3.
[19] Gamma E., Helm R., Johnson R. & Vlissides J. (1995) “Design Patterns: Elements of Reusable Object-Oriented Software”, page 11.
[20] Hay D. (2011) “Enterprise Model Patterns: Describing the World”.
[21] Silverston L. & Agnew P. (2009) “The Data Model Resource Book, Volume 3: Universal Patterns for Data Modeling”.
[22] Hay D. (1996) “Data Model Patterns: Conventions of Thought”.
[23] Silverston L. & Agnew P. (2009) “The Data Model Resource Book, Volume 3: Universal Patterns for Data Modeling”.
[24] Fowler M. (1997) “Analysis Patterns: Reusable Object Models”.
[25] Hay D. (1996) “Data Model Patterns: Conventions of Thought”.
[26] Linstedt D. (2011) “Super Charge your Data Warehouse: Invaluable Data Modeling Rules to Implement Your Data Vault”, page 7.
[27] Hultgren H. (2012) “Modeling the Agile Data Warehouse with Data Vault”, page 60.
[28] https://danlinstedt.com/allposts/datavaultcat/datavault-models-business-purpose-data-as-an-asset/.
[29] Linstedt D. & Olschimke M. (2016) “Building a Scalable Data Warehouse with Data Vault 2.0”.
[30] Hultgren H. (2012) “Modeling the Agile Data Warehouse with Data Vault”.
[31] Linstedt D. & Olschimke M. (2016) “Building a Scalable Data Warehouse with Data Vault 2.0”, page 131.
[32] Hay D. (1996) “Data Model Patterns: Conventions of Thought”.
[33] Hay D. (2011) “Enterprise Model Patterns: Describing the World”.
[34] Silverston L. (2001) “The Data Model Resource Book, Volume 1: A Library of Universal Data Models for All Enterprises”.
[35] Silverston L. (2001) “The Data Model Resource Book, Volume 2: A Library of Universal Data Models by Industry Types”.
[36] Silverston L. & Agnew P. (2009) “The Data Model Resource Book, Volume 3: Universal Patterns for Data Modeling”.
[37] Linstedt D. & Olschimke M. (2016) “Building a Scalable Data Warehouse with Data Vault 2.0”.
[38] Linstedt D. & Olschimke M. (2016) “Building a Scalable Data Warehouse with Data Vault 2.0”, pages 143-149.
[39] https://danlinstedt.com/allposts/datavaultcat/datavault-models-business-purpose-data-as-an-asset/.
[40] Linstedt D. & Olschimke M. (2016) “Building a Scalable Data Warehouse with Data Vault 2.0”.
[41] Linstedt D. (2011) “Super Charge your Data Warehouse: Invaluable Data Modeling Rules to Implement Your Data Vault”, page 7.
[42] https://danlinstedt.com/allposts/datavaultcat/datavault-models-business-purpose-data-as-an-asset/.
[43] Linstedt D. & Olschimke M. (2016) “Building a Scalable Data Warehouse with Data Vault 2.0”, pages 521-523.
[44] Linstedt D. & Olschimke M. (2016) “Building a Scalable Data Warehouse with Data Vault 2.0”, page 522.
[45] Simsion G. (2007) “Data Modeling Theory and Practice”.
[46] Silverston L. (2001) “The Data Model Resource Book, Volume 1: A Library of Universal Data Models for All Enterprises”.
[47] Hay D. (1996) “Data Model Patterns: Conventions of Thought”.