
5 BEST PRACTICES FOR DATA WAREHOUSE DEVELOPMENT


Author: Kent Graziano
Introduction
Create a data model
Adopt an Agile data warehouse methodology
Favor ELT over ETL
Adopt a data warehouse automation tool
Train your staff on new approaches
Summary
INTRODUCTION
Cloud data warehouse technology has revolutionized how businesses store, access, and analyze data. Whether your organization is creating a new data warehouse from scratch or re-engineering a legacy warehouse system to take advantage of new capabilities, a handful of guidelines and best practices will help ensure your project's success. Some of those best practices may seem obvious, but all too often, businesses fail to spend time up front setting and documenting these decision points, resulting in headaches and inefficiency down the road.

In this ebook, we outline five recommendations for putting structure around your data strategy and getting alignment across your business, so the data warehouse you create meets both current and future business needs. These best practices for data warehouse development will increase the chance that all business stakeholders will derive greater value from the data warehouse you create, as well as lay the groundwork for a data warehouse that can grow and adapt as your business needs change.
CHAMPION GUIDES
1. CREATE A DATA MODEL
The first key step in data warehouse development is to create a data model: an abstract representation that organizes elements of data and describes how they relate to one another and to properties of their real-world entities. A data model establishes a common understanding and definition of what information is important to the business, as well as the business's overall data landscape. Having a data model gives you a way to document the data sets that will be incorporated into the data warehouse, the relationship between those data sets, and the business requirements of the data warehouse.

Could you create a data warehouse without a data model? Yes, but when you decide not to take this basic step, you lose many valuable insights. Creating a comprehensive data model is often an eye-opening exercise for businesses, as it forces different functional teams to agree on the definition and delineation of data assets and business requirements of the data warehouse before development is underway.

A well-defined data model drives a positive impact long after the data warehouse is live. For example, a data model establishes data lineage for all the objects in the data warehouse, making it easier to onboard new team members or to bring new data objects into the data warehouse as business needs change.

Figure 1: A typical 3NF (third normal form) logical data model (entities shown: Person, City, and Phone)
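As a rough sketch of what a logical model like the one in Figure 1 might become as physical tables, the snippet below uses SQLite purely for illustration; the column names are inferred from the figure and the data types are assumptions, not part of the original model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database for the sketch

# Each entity in the logical model becomes one normalized table;
# foreign keys carry the relationships (a Person lives in a City
# and can have multiple Phones).
conn.executescript("""
    CREATE TABLE city (
        city_name TEXT PRIMARY KEY
    );
    CREATE TABLE person (
        name      TEXT PRIMARY KEY,
        age       INTEGER,
        gender    TEXT,
        city_name TEXT REFERENCES city(city_name)
    );
    CREATE TABLE phone (
        area_code         TEXT,
        subscriber_number TEXT,
        person_name       TEXT REFERENCES person(name),
        PRIMARY KEY (area_code, subscriber_number)
    );
""")

conn.execute("INSERT INTO city VALUES ('Denver')")
conn.execute("INSERT INTO person VALUES ('Kent', 30, 'M', 'Denver')")
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
```

Because every fact lives in exactly one table, a change such as renaming a city touches one row rather than every person record, which is the duplication-reduction property 3NF is designed to deliver.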

The data model also provides clear documentation of
the data warehouse’s content, context, and sources.
This makes it easier to audit or to comply with new
data requirements, such as those presented by
GDPR (the EU’s General Data Protection Regulation
framework that sets guidelines for the collection and
processing of personal information from individuals).
Having a strong data model also helps prevent
confusion and costly re-engineering down the road.
It’s always a good idea to incorporate a source-
agnostic integration layer that enables analysis
across multiple data sets based on the data sets’
commonalities.
A data warehouse brings together many different
sources and types of data, including traditional data
sets such as customer relationship management
(CRM) data and enterprise resource planning (ERP)
data, as well as data sets like blogs, Twitter feeds, IoT
data, and even data sets that have yet to be invented.
This is why having a flexible integration layer that isn’t
too tightly tied to any single system will help future-
proof your data warehouse.
A highly effective data model should employ
definitions and semantic structures that are defined
by the business, not by the specific definitions of any
single source system. For example, one CRM system
may refer to customers as “cust,” while another
refers to “cust_ID.” Establishing a business-wide
semantic rule for how users should name, access, and
analyze that data across data sets is key to the data
warehouse's success.

Figure 2: An example data warehouse data model using the Data Vault modeling approach
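A business-wide semantic rule of this kind can be sketched in a few lines. In the example below, the source names ("cust", "cust_ID") come from the text above, while the business term `customer_id` and the source-system labels are hypothetical names chosen for illustration:

```python
# Business-defined semantic layer: every source system's name for the
# "customer identifier" concept maps to one agreed-upon business term.
SEMANTIC_MAP = {
    "crm_a": {"cust": "customer_id"},
    "crm_b": {"cust_ID": "customer_id"},
}

def to_business_terms(source: str, record: dict) -> dict:
    """Rename source-specific fields to their business-defined names."""
    mapping = SEMANTIC_MAP.get(source, {})
    return {mapping.get(field, field): value for field, value in record.items()}

# Records from two different CRM systems now line up under one vocabulary.
a = to_business_terms("crm_a", {"cust": 42, "region": "EMEA"})
b = to_business_terms("crm_b", {"cust_ID": 42, "region": "EMEA"})
```

Once both records speak the same vocabulary, cross-system joins and reports can be written once, against the business terms, rather than once per source system.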

As a company goes through changes, mergers, and acquisitions, the CRM system it is using today may be replaced by a different CRM system. If your data warehouse model is tightly coupled to a specific source system, then you will have a lot of re-engineering to do in order to integrate the second source system that replaces the legacy one. A source-agnostic integration layer makes data mapping much easier, so you can swap an old source system for a new source system without affecting downstream reports or having to change user behavior.

Within the data model, it's critical to select a standard approach. The main types of data modeling standards used in data warehouse design include:

• 3NF: 3NF, which stands for "third normal form," is an architectural standard designed to reduce the duplication of data and ensure the referential integrity of the database.¹

• Star schema: The simplest and most widely used architecture for developing data warehouses and dimensional data marts, the star schema consists of one or more fact tables referencing any number of dimension tables.²

• Data Vault (DV): Developed specifically to address the agility, flexibility, and scalability issues found in other approaches, DV modeling was created as a granular, non-volatile, auditable, easily extensible, historical repository of enterprise data. It is highly normalized and combines elements of 3NF and star models.³

Each architecture has its advantages, but the choice of which to adopt will depend on the business needs of the organization. Frankly, more important than which architecture your organization selects is that it selects, documents, and continually supports that architecture as part of developing a data model for the warehouse. Doing so will enable future efficiency, allowing for a single support and troubleshooting methodology that will make it easier for new team members to ramp up quickly.
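To make the Data Vault bullet concrete, here is a deliberately minimal sketch of its three core structures: hubs hold business keys, links hold relationships between hubs, and satellites hold descriptive attributes historized by load date. The class shapes and example keys are illustrative assumptions, not the full Data Vault specification:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Hub:        # one row per unique business key
    business_key: str
    load_date: datetime
    record_source: str

@dataclass
class Link:       # a relationship between two or more hubs
    hub_keys: tuple
    load_date: datetime
    record_source: str

@dataclass
class Satellite:  # descriptive, historized attributes of a hub or link
    parent_key: str
    attributes: dict
    load_date: datetime
    record_source: str

now = datetime.now(timezone.utc)
customer = Hub("CUST-42", now, "crm_a")
order = Hub("ORD-9001", now, "erp_b")
placed_order = Link((customer.business_key, order.business_key), now, "erp_b")
customer_details = Satellite(customer.business_key,
                             {"name": "Acme Corp", "tier": "gold"}, now, "crm_a")
```

Because every row carries a load date and record source, history accumulates rather than being overwritten, which is what makes the structure non-volatile and auditable, and adding a new source system means adding rows and satellites rather than restructuring existing tables.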

1. https://en.wikipedia.org/wiki/Third_normal_form
2. https://en.wikipedia.org/wiki/Star_schema
3. https://www.snowflake.com/blog/data-vault-modeling-and-snowflake/
2. ADOPT AN AGILE DATA WAREHOUSE METHODOLOGY
In the past, data warehouse creation was a large, monolithic, multi-quarter (or multi-year) effort, subject to the traditional "waterfall" process. In the modern age, that's no longer the norm, as many organizations are choosing to adopt a more flexible and iterative, or Agile, design approach.

With business needs changing faster than ever, and new data sources coming online more quickly, businesses need to be able to adapt and leverage these inputs concisely and rapidly. That means learning to build data warehouse and analytic solutions in an incremental and Agile fashion. With proper planning that aligns to a single source-agnostic integration layer, data warehouse projects can now be broken down into smaller pieces that can be delivered more frequently, thus providing incremental business value much more quickly.

Data warehousing architects are adopting the Agile methodology, which first appeared in the software development world, to achieve this goal. In the Agile methodology, requirements and solutions evolve through the collaborative effort of self-organizing, cross-functional teams and customers. When applied to data warehouse conception and construction, the Agile methodology enables businesses to activate new data sets and solve new business challenges more quickly.⁴

Within the Agile worldview, a variety of approaches have emerged to help deliver value faster, including:

• Scrum: Named for the rugby formation in which forwards interlock arms and advance, Scrum is the most widely used process framework for Agile development. A lightweight framework, Scrum emphasizes daily communication and the flexible reassessment of plans, carried out in short, iterative phases of work.⁵ Ralph Hughes codified Scrum's application to data warehousing in a series of seminal works that are useful to businesses adopting this approach.

• Kanban: Kanban is a method for managing the creation of products with an emphasis on continuous delivery without overburdening the development team. Like Scrum, Kanban is a process designed to help teams work together more effectively. Named for the "kanban" cards that track production in a factory, Kanban was created by Taiichi Ohno, an industrial engineer at Toyota, to improve manufacturing efficiency.

• BEAM: BEAM, or Business Event Analysis and Modelling, was introduced by Lawrence Corr and Jim Stagnitto in their groundbreaking work, Agile Data Warehouse Design. BEAM focuses on business events, rather than on known reporting requirements, in order to model the whole business process area. It leverages seven dimensional types (the seven Ws: who, what, when, where, how, how many, and why) to identify and then elaborate on business events.⁶

4. http://www.Agiledata.org/essays/dataWarehousingBestPractices.html
5. https://www.scrum.org/resources/what-is-scrum
6. https://www.bisystembuilders.com/beam/
In order to leverage the benefits of Agile development more fully, an Agile data warehousing platform is very helpful. Cloud-based data warehouses provide that structural flexibility and elasticity, enabling rapid scaling as business needs evolve. Cloud-based data warehouses require less effort, maintenance, and administration to be useful, and they can grow and adapt to changing business requirements. By leveraging a cloud-based data warehousing environment, teams can spend less time tuning queries and provisioning storage, and more time addressing immediate business challenges and delivering business value.

Leveraging Agile methodologies and structures is no small undertaking. It requires a cultural commitment within the organization and is often a significant shift in mindset and workflow from traditional data warehousing workflows. Retooling an IT team to work comfortably in an Agile environment can take six to 12 months, which may seem paradoxical given the Agile methodology's goal of delivering value more quickly. This shift can be accelerated by engaging a seasoned Agile coach. But once the shift is made, teams can begin to deliver new incremental changes to the data warehouse in weeks instead of months.
3. FAVOR ELT OVER ETL
In the past, data warehousing development took an extract-transform-load (ETL) approach: extracting the data to be imported into the data warehouse from the source systems, cleaning it or applying business rules to it on an external server, and then loading it into the target data warehouse. Increased data warehousing computing power and capabilities have yielded a new preferred approach: extract-load-transform (ELT).

In the ELT approach, raw data is extracted from the source and loaded, relatively unchanged, into the staging area of the data warehouse. Metadata, load date, or source information may be added to the data, and then it is brought directly into the data warehouse. Once inside the data warehouse, businesses can use the power of the database to perform transformations, whether that's changing the structure of the data (that is, applying a data model), applying business rules, or performing data quality measures to cleanse the data (for example, correcting incomplete addresses, standardizing data field names, or resolving duplicates).

The ELT approach has two distinct advantages: cost savings and greater traceability. ELT helps realize cost savings because it allows businesses to leverage the power of the data warehouse to transform data, instead of using an external server. Cloud-based computing power is typically much less expensive than performing transformations and data manipulation on an external server, so moving data to the warehouse directly is faster and cheaper. The ELT approach also makes it easier to audit and trace the data in the future, because it provides an image of the original source data directly within the data warehouse. In this way, the data warehouse itself can play the role of what has come to be known as a "data lake," where raw data is stored persistently.
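The ELT flow described here can be sketched end to end in a few lines. The snippet uses SQLite as a stand-in for a cloud data warehouse, and the table names, column names, and cleansing rule are all assumptions for illustration: raw rows land in staging with load metadata attached, and the transformation runs inside the database engine rather than on an external server.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")  # stand-in for the warehouse

# EXTRACT + LOAD: raw source rows land in a staging table unchanged,
# except for load-date and source metadata added on the way in.
conn.execute(
    "CREATE TABLE stg_customers (raw_name TEXT, load_date TEXT, source TEXT)")
load_date = datetime.now(timezone.utc).isoformat()
rows = [("  alice ", load_date, "crm_a"), ("BOB", load_date, "crm_a")]
conn.executemany("INSERT INTO stg_customers VALUES (?, ?, ?)", rows)

# TRANSFORM: cleansing and restructuring happen inside the database,
# using its own SQL engine instead of an external ETL server.
conn.execute("""
    CREATE TABLE dim_customer AS
    SELECT UPPER(TRIM(raw_name)) AS customer_name, load_date, source
    FROM stg_customers
""")
names = [r[0] for r in conn.execute(
    "SELECT customer_name FROM dim_customer ORDER BY customer_name")]
```

Note that `stg_customers` still holds the untouched source rows after the transform runs, which is exactly the traceability benefit described above: the staging area doubles as a persistent image of the raw data.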

4. ADOPT A DATA WAREHOUSE AUTOMATION TOOL
The goal of the data warehouse is to activate and deliver data more quickly so it can inform business decisions and drive greater value. One way to increase speed of delivery is to adopt the Agile methodology. Another is to adopt automation tools that can help develop and deploy code more quickly. Because many data warehouse methodologies are pattern-based, the coding required for loading and structuring data is often repeatable, which means it can be automated. A number of tools on the market automate some or even all of the design and build tasks, and the list grows daily.

Automation allows businesses to leverage their IT resources more fully, iterate faster, and enforce coding standards more easily. It enables the creation of standardized code, which is incredibly useful in organizations where the ETL code and data models are developed by hand. Automation provides a documented standard for these different artifacts, as well as an enforcement and quality assurance (QA) mechanism to monitor that all developers and designers are following that standard.

Automation tools that use templates to generate code are especially helpful, because they enforce standards by making them the preferred properties within the templates themselves. This makes onboarding faster, as new developers and designers will use these standard tools, guaranteeing consistent implementation and shortening the learning curve. A consistent implementation has the added benefit of being easier to test and debug, because code is developed using the same standards.

Iteration also becomes faster with these tools, because automated code generators don't tend to make syntax errors. Updating code typically means adding a new object to the tool or changing the templates' properties at the global level, generating new code that is immediately available for deployment in the data warehouse environment for testing and validation.
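The template-driven pattern these tools rely on can be reduced to a small sketch: a declarative table specification plus one standard template yields uniform load code. The spec format, template text, and table names below are invented for illustration; real automation tools carry much richer metadata models.

```python
# Declarative spec: describes WHAT to load, not HOW to load it.
SPEC = {
    "target": "dim_customer",
    "source": "stg_customer",
    "columns": ["customer_id", "name", "region"],
}

# A single standard template encodes the house coding standard once,
# so every generated load statement follows it automatically.
LOAD_TEMPLATE = "INSERT INTO {target} ({cols}) SELECT {cols} FROM {source};"

def generate_load_sql(spec: dict) -> str:
    """Generate a standardized load statement from a table specification."""
    cols = ", ".join(spec["columns"])
    return LOAD_TEMPLATE.format(target=spec["target"],
                                source=spec["source"], cols=cols)

sql = generate_load_sql(SPEC)
```

Changing the template in one place regenerates consistent code for every table spec, which is why template updates at the global level, rather than hand edits, are the usual iteration path with these tools.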

5. TRAIN YOUR STAFF ON NEW APPROACHES
A move to the Agile methodology or automated code development isn't just a shift in skill sets; it's a shift in mindset. Training and education are required to ensure a data warehouse team is leveraging these new approaches and technologies effectively. This may mean bringing in external experts to train teams on Scrum best practices, or educating teams on the benefits, rules, and best practices of whatever standard architecture the business has adopted for its data warehouse.

Many industry resources are available to help manage the transition to the Agile methodology. The Agile Alliance, a global nonprofit member organization dedicated to promoting the concepts of Agile software development as outlined in the Agile Manifesto, offers many training options for introducing Agile concepts. The Scrum Alliance offers certifications for foundational and advanced Scrum training. Likewise, Data Vault bootcamps and certification are offered by selected partners, such as Scalefree and Performance G2.

As with any new process and cultural change, organizations should manage the adoption curve to ensure a consistent and effective shift to the new approach in day-to-day operations. Identifying pilot or proof-of-concept projects for initiating teams into the new approaches lets practitioners build and master these skills in protected, yet real, scenarios, accelerating their competence with the new techniques.

SUMMARY
All of the best practices outlined in this ebook require an upfront investment in order to achieve the long-term business value they can deliver. But the return on that investment is twofold: It lays the foundation for a successful data warehouse at the outset, and it accelerates the delivery of incremental business value to your data warehouse environment long after the first production release.

ABOUT THE AUTHOR
Kent Graziano is the Chief Technical Evangelist for Snowflake Inc. He is an award-winning author and recognized
expert in the areas of data modeling, data architecture, and Agile data warehousing. He is an Oracle ACE Director
- Alumni, a member of the OakTable Network, a certified Data Vault Master and Data Vault 2.0 Practitioner
(CDVP2), and an expert data modeler and solution architect with more than 30 years of experience.

ABOUT SNOWFLAKE
Snowflake Cloud Data Platform shatters the barriers that prevent organizations from unleashing the true
value from their data. Thousands of customers deploy Snowflake to advance their businesses beyond
what was once possible by deriving all the insights from all their data by all their business users. Snowflake
equips organizations with a single, integrated platform that offers the only data warehouse built for
any cloud; instant, secure, and governed access to their entire network of data; and a core architecture
to enable many other types of data workloads, including a single platform for developing modern data
applications. Snowflake: Data without limits. Find out more at snowflake.com.

© 2019 Snowflake, Inc. All rights reserved. snowflake.com #YourDataNoLimits
