Data Lakes


Data is Big
"Space is big," Douglas Adams mused in The Hitchhiker's
Guide to the Galaxy. "Really big." The same can be said of
data: It's big. Really big. You might think you have a lot of
data in your financial system, but that's just peanuts to
big data.
If you're saying to yourself, "I don't have Big Data, so a Data Lake doesn't
apply to me," please keep reading. The principles of the Data Lake and
Modern Data Architecture apply well beyond sheer volumes of data.

We've all probably read at least one headline or article in the
past year talking about big data. The author may have written
paragraphs describing numerous big data challenges, quoted
astonishing numbers representing the worldwide growth
of data, or given specific examples of projects or organizations
that are generating huge volumes of data, all in order to
hammer home the growing challenge of successfully handling
previously unheard-of amounts of information.
Big data means something different to everyone.
Every organization has data, and in many cases, it
is larger, more variable or more complex than most
reporting platforms and teams can handle.





Ultimately, leaders are looking for proven
techniques to quickly and easily deliver the
data to the people who need it.

Answering the Needs
of the Business
Business analytics have been around as long as business itself. Analytics began the first time
someone tabulated two columns of numbers and used the difference between them to determine
a profit or loss. As business evolved and companies collected more data, it became possible, and
important, to create reports and analyses on different facets of an organization. Fast forward
a few thousand years and the concepts of data warehousing and business intelligence became
the norm. These disciplines promoted a single, central version of the truth for an organization:
a repository that gathers and integrates data so that reports can be created quickly and easily.

The progress of analytics was a response to evolving business
needs. Today's business leaders understand that data still
holds the key to understanding the patterns of their customers,
competitors and markets. Only by analyzing this information
can they take action and make educated and supportable decisions.

The Need for
Something More

The traditional data warehouse/business intelligence approach has
done a great job of simplifying data access and reporting, as well
as combining data from many sources, in order to answer all of the
questions an organization may have. But it's impossible to anticipate
every question a business might ask and every report they might need.
Metrics change from year to year, month to month and sometimes even
day to day.

In addition, there is a flood of new data types. Information
from the web, social media, servers, sensors, documents,
comments and devices has caused an explosion in the
volume of data that organizations are trying to understand.
For example, 15 years ago companies never expected to
have to keep track of things such as social media "likes."

In a traditional data warehouse solution, organizations would probably
ignore most of these external data sources because they are either too
voluminous or in a format that is difficult to manipulate and store. If
companies used any of it, it was probably for an edge reporting need.
Such limitations often result in potentially valuable data and insights being
inaccessible and possibly lost forever.

In recent years, this data explosion has spawned a
new set of technologies and techniques.

Apache Hadoop and the Hadoop data lake are at the center of the big data
movement. A data lake is a repository that stores vast amounts of raw data
for future use. With all the media hype, it is difficult to sift through the
buzzwords and understand where, and even if, these new technologies
make sense for your analytics needs. Many people believe that implementing
a Hadoop data lake means throwing away their investment in a data
warehouse. This perception ends up either sending them down the wrong
path or causing them to sideline big data as a future project.

The good news? Hadoop, big data and the data lake
don't replace a company's existing investment in
analytics. In fact, they complement it very nicely. By
building a Modern Data Architecture, organizations
can continue to leverage their existing investments in
analytics while collecting all of the data they have been
ignoring or throwing away, and while enabling analysts
to get company data and insights faster.

Introducing the Modern
Data Architecture

Big data technologies support and enhance modern analytics but do not necessarily replace traditional
analytics systems. Building a Modern Data Architecture that incorporates all of the benefits of a data lake,
combined with the high-speed query and analytics provided by traditional relational data warehouse and
online analytical processing (OLAP) engines, supports data consumption at all levels of the business. It
also provides all classes of data consumers with the capabilities they require.

All data, regardless of form, is collected into the Persistent layer of
the Data Lake
Data from all internal and external source systems, including structured, semi-structured
and unstructured data as well as streaming sources, is gathered in a single Persistent
layer in the data lake. Not all data in the Persistent layer is promoted to subsequent layers;
much of it is simply collected for future analytics use cases. Data scientists and analysts are granted
access to the data at this layer in order to perform discovery and experimentation in an
Analytics Sandbox set aside for their use. As these analysts identify new data sources that
may provide additional business insight, they will help to shape and Curate this data to
provide self-service analytics to a broader audience.

Analysts and data scientists help shape and Curate the data for
business use
As self-service analysts continue to refine the use of Curated data sources, they will work
with the data management team to Operationalize data to be presented to the broadest
audience of the business. Since these data artifacts are generally consumed through the
highest levels of the organization and are required for day-to-day decision making, they will
ultimately reside in the high-speed query engines of the Enterprise Data Warehouse (EDW)
and OLAP layers to support typical Business Intelligence functions.

The EDW supports a subset of data (generally governed by time). The Hadoop data lake
provides the opportunity to create an Active Archive to store additional historical data and
make it available for query for extended analytics use cases.

Self-service analysts continue to refine the Curated data into an
Operational layer for broader use
Data moves through the architecture by means of a strong Integration framework. Data
must be ingested from source systems, organized in the data layers, transformed, enhanced
and ultimately loaded to the interfaces that provide data to end users. In a traditional data
warehouse architecture, this process was known as ETL (Extract-Transform-Load), and in fact
this approach is still very viable. In many cases an alternative, ELT (Extract-Load-Transform),
is preferable because the data lake lends itself to loading data prior to transformation.
With ELT, all transformation and integration is done in the layers of the data lake. Regardless of
the methodology, it is important to choose appropriate tools that can be automated and

Maintaining control and records of the content stored in the various layers
of the data lake is very important. Having a strong but flexible governance
policy and mechanism for metadata and content management to support
discovery, standardization, master data management and security is a key
factor in the success of implementing a big data strategy.
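
As a rough illustration of what such a metadata mechanism might record, the sketch below keeps a small in-memory catalog of data sets. The layer names come from the functional model above; the class names, fields and methods (CatalogEntry, Catalog, promote, find) are hypothetical and not taken from any particular governance product.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Layer names follow the functional model in the text; everything else
# (field names, the promote/find methods) is a hypothetical illustration.
LAYERS = ("persistent", "curated", "operational")


@dataclass
class CatalogEntry:
    """Metadata describing one data set stored in the lake."""
    name: str
    source_system: str
    owner: str
    layer: str = "persistent"
    tags: List[str] = field(default_factory=list)
    history: List[str] = field(default_factory=list)

    def promote(self, approved_by: str) -> None:
        """Move the data set to the next layer and record who approved it."""
        position = LAYERS.index(self.layer)
        if position == len(LAYERS) - 1:
            raise ValueError(f"{self.name} is already in the final layer")
        self.layer = LAYERS[position + 1]
        self.history.append(f"promoted to {self.layer} by {approved_by}")


class Catalog:
    """In-memory registry that supports discovery by layer or tag."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        if entry.name in self._entries:
            raise ValueError(f"{entry.name} is already registered")
        self._entries[entry.name] = entry

    def find(self, layer: Optional[str] = None, tag: Optional[str] = None) -> List[str]:
        """Return the names of entries matching the given layer and/or tag."""
        return sorted(
            entry.name
            for entry in self._entries.values()
            if (layer is None or entry.layer == layer)
            and (tag is None or tag in entry.tags)
        )


# Hypothetical usage: two data sets land in the Persistent layer,
# and one of them is later promoted for broader use.
catalog = Catalog()
weblogs = CatalogEntry("web_logs", "web servers", owner="analytics", tags=["clickstream"])
orders = CatalogEntry("orders", "order system", owner="finance", tags=["sales"])
catalog.register(weblogs)
catalog.register(orders)
orders.promote(approved_by="data management team")
```

A real deployment would persist this information and tie it to security policies; the point here is only that every data set carries a record of where it came from, which layer it sits in and who approved its promotion.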

How Does a Data Lake
Differ from a Data Warehouse?
Wikipedia [1] defines data warehouses as:
Central repositories of integrated data from one or more
disparate sources. They store current and historical data and
are used for creating trending reports for senior management
reporting such as annual and quarterly comparisons.
This is a very high-level definition that describes the
purpose of a data warehouse, but doesn't explain
how the purpose is achieved.

A data warehouse also has the following properties:

It represents an abstracted picture of the business organized by subject area.
It is highly transformed and structured.
Data is not added to the data warehouse until the use for it has been defined.
It generally follows a methodology such as those defined by Ralph Kimball [2], an original data warehousing architect, and Bill Inmon [3], whom many refer to as the father of data warehousing.
Data warehouse development is
characterized by requiring lots of
discovery, planning and development
work before any data makes it into
the warehouse.

By way of contrast, the term "data lake" was coined by Pentaho CTO James
Dixon. He describes a data mart (a subset of a data warehouse) as akin to a
bottle of water, "cleansed, packaged and structured for easy consumption,"
while a data lake is more like a body of water in its natural state. Data flows
from the streams (the source systems) to the lake. Users have access to the
lake to examine, take samples or dive in.

This is also a fairly imprecise definition. Let's add a few
specific properties of a data lake:
All data is loaded from source systems. No data is turned away.
Data is stored at the leaf level in an untransformed or nearly untransformed state.
Data is transformed and schema is applied to fulfill the needs of analysis.

So, to summarize, a data warehouse is a highly structured store of the data that
the business has deemed important, while a data lake is a more organic store of
all data without regard for the perceived value or structure of the data.

Let's add some more specific details on the
differences between a data lake and a data warehouse.
1 Data Lakes
Retain All Data
During the development of a data warehouse, a considerable amount of time is spent
analyzing data sources, understanding business processes and profiling data. The result is
a highly structured data model designed for reporting. A large part of this process includes
making decisions about what data to include in the warehouse and what to leave out. Generally,
if data isn't used to answer specific questions or in a defined report, it may be excluded from
the warehouse. This is usually done to simplify the data model and also to conserve space on
expensive disk storage that is used to make the data warehouse performant.

In contrast, the data lake retains ALL data: not just data that is in use today, but
data that may be used someday and even data that may never be used at all,
just in case. Data is also kept for all time so organizations can go back to any
point in time to do analysis.

This approach becomes possible because the hardware for a data lake usually differs greatly
from that used for a data warehouse. Commodity, off-the-shelf servers combined with cheap
storage make scaling a data lake to terabytes and petabytes fairly economical.

2 Data Lakes
Support All
Types of Data

Data warehouses generally hold data extracted from transactional
systems, made up of quantitative metrics and the attributes that
describe them. Non-traditional data sources such as web server logs,
sensor data, social network activity, text and images are largely ignored.
New uses for these data types continue to be found, but consuming and
storing them can be expensive and difficult.

The data lake approach embraces these non-traditional data types. Data
lakes store all data, regardless of source and structure. Data is kept in its
raw form and only transformed when it is ready for use. This approach is
known as "Schema on Read," in contrast to the "Schema on Write" approach
used in the data warehouse.
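
The contrast between the two approaches can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any specific warehouse or Hadoop tool works: the sample events, the WAREHOUSE_COLUMNS schema and the two helper functions are all hypothetical.

```python
import json

# Hypothetical raw events as they might land in a data lake: one JSON
# document per line, stored exactly as received.
RAW_EVENTS = [
    '{"user": "ann", "amount": "19.99", "channel": "web"}',
    '{"user": "bob", "amount": "5", "likes": 3}',
    '{"user": "cy", "comment": "great service"}',
]

# Schema on write: a fixed set of typed columns is enforced before a
# record is stored. Records that do not fit are turned away.
WAREHOUSE_COLUMNS = {"user": str, "amount": float, "channel": str}


def write_with_schema(lines):
    """Return (stored rows, rejected lines) under the fixed warehouse schema."""
    table, rejected = [], []
    for line in lines:
        record = json.loads(line)
        if set(record) != set(WAREHOUSE_COLUMNS):
            rejected.append(line)
            continue
        table.append({col: cast(record[col]) for col, cast in WAREHOUSE_COLUMNS.items()})
    return table, rejected


def read_with_schema(lines, schema):
    """Schema on read: keep every line, apply a schema only at query time."""
    rows = []
    for line in lines:
        record = json.loads(line)
        if all(name in record for name in schema):
            rows.append({name: cast(record[name]) for name, cast in schema.items()})
    return rows


table, rejected = write_with_schema(RAW_EVENTS)
spend = read_with_schema(RAW_EVENTS, {"user": str, "amount": float})
likes = read_with_schema(RAW_EVENTS, {"user": str, "likes": int})
```

In this sketch the schema-on-write path keeps one record and rejects two, while the schema-on-read path answers two different questions (spend and likes) from the same untouched raw lines.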

3 Data Lakes
Support All Users
In most organizations, 80 percent or more of users are operational. They want to get their
reports, see their key performance metrics or slice the same set of data in a spreadsheet
every day. The data warehouse is usually ideal for these users because it is well structured,
easy to use and understand, and purpose-built to answer their questions.

The next 10 percent or so do more analysis on the data. They use the data warehouse
as a source but often go back to source systems to get data that is not included in the
warehouse and sometimes bring in data from outside the organization. Their favorite tool
is the spreadsheet and they create new reports that are often distributed throughout the
organization. The data warehouse is their go-to source for data, but they often go beyond its bounds.

Finally, the remaining users do deep analysis. They may create totally new data sources
based on research. They mash up many different types of data and come up with entirely
new questions to be answered. These users may use the data warehouse but often ignore it
as they are usually charged with going beyond its capabilities. These users include the Data
Scientists and they may use advanced analytic tools and capabilities like statistical analysis
and predictive modeling.

The data lake approach supports all of these users
equally well. The data scientists can go to the lake
and work with the very large and varied data sets
they need while other users make use of more
structured views of the data provided for their use.

4 Data Lakes
Adapt Easily to Changes
One of the chief complaints about data warehouses is how
long it takes to change them.

Considerable time is spent up front during development getting the
warehouse's structure right. A good warehouse design can adapt to change,
but because of the complexity of the data loading process and the work done
to make analysis and reporting easy, these changes will necessarily consume
some developer resources and take some time.

Many business questions can't wait for the data warehouse team to
adapt their system for answers. This ever-increasing need for faster
answers has given rise to the concept of self-service business
intelligence.
In the data lake, on the other hand, since all data is
stored in its raw form and is always accessible to
someone who needs to use it, users are empowered
to go beyond the structure of the warehouse
to explore data in novel ways and answer their
questions at their pace.

If the result of an exploration is shown to be useful
and there is a desire to repeat it, then a more formal
schema can be applied to it, and automation and
reusability can be developed to help extend the
results to a broader audience. If it is determined
that the result is not useful, it can be discarded:
no changes to the data structures have been
made and no development resources have been consumed.

5 Data Lakes
Provide Faster Insights
This last difference is really the result of the other four. Because a data lake contains all data
and data types, and because it lets users access data before it has been transformed,
cleansed and structured, users can get to their results faster than with the traditional
data warehouse approach.

However, this early access to the data comes at a price. The work typically done by the data
warehouse development team may not be done for some or all of the data sources required
to do an analysis. This leaves users in the driver's seat to explore and use the data as they see
fit. However, the operational users referenced earlier may not want to do that work. They
still just want their reports and KPIs.

These operational report consumers will make use of the more structured data views in
the data lake, those that resemble what they had in the data warehouse. The difference is
that these views exist primarily as metadata that sits over the data in the lake rather than
physically rigid tables that require a developer to change.
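
One way to picture a view that "exists primarily as metadata" is shown below. It is a simplified sketch, assuming a view is nothing more than a mapping from report column names to raw field names; the sample data, the VIEWS dictionary and the query_view function are hypothetical.

```python
# Raw records kept unchanged in the lake (hypothetical sample data).
RAW_SALES = [
    {"cust_id": "C1", "amt": 120.0, "region": "east", "ts": "2015-01-03"},
    {"cust_id": "C2", "amt": 80.0, "region": "west", "ts": "2015-01-04"},
    {"cust_id": "C1", "amt": 40.0, "region": "east", "ts": "2015-02-10"},
]

# A view is only a description: which raw fields to expose and under
# what names. No data is copied or restructured when it is defined.
VIEWS = {
    "sales_report": {"customer": "cust_id", "revenue": "amt"},
}


def query_view(view_name, raw_rows, views):
    """Apply a view definition to the raw rows at query time."""
    mapping = views[view_name]
    return [
        {alias: row[raw_field] for alias, raw_field in mapping.items()}
        for row in raw_rows
    ]


# Adding a new report means adding metadata, not reloading a table.
VIEWS["sales_by_region"] = {"region": "region", "revenue": "amt", "date": "ts"}

report = query_view("sales_report", RAW_SALES, VIEWS)
by_region = query_view("sales_by_region", RAW_SALES, VIEWS)
```

Both reports are computed from the same raw rows, and the raw rows themselves are never modified when a view is added or changed.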

Just Add Technology
The Modern Data Architecture described above is a functional
model. It describes layers within which data will be ingested,
organized and presented to the business, but it doesn't
specifically call out technologies that will be used to build
these layers. This functional model aligns to physical
layers within a final technical deployment.

The diagram below and the paragraphs
that follow describe these physical layers.

Data Acquisition
This layer refers to the ingestion and initial movement of data from the
source systems, whether they are traditional relational/transactional
systems, user-generated data, unstructured or semi-structured data,
external data or streaming data.

Data Curation
In the Modern Data Architecture, Apache Hadoop plays a key role as a
data storage and curation layer. Using the data lake approach, all data,
no matter what type, is stored in the data lake and is organized, shaped
and made available for consumption by other layers. A variety of Hadoop
technologies are brought to bear in the curation layer to support the
required analytic and data processing workloads.

Data Provisioning
Operational reporting and analytics are best served by more traditional
data stores. The high-speed query capabilities of relational database
systems make them ideal for serving data to support interactive query
and analytics. Depending on the scale and needs of the organization, an
Enterprise Data Warehouse built on a relational database platform may be
coupled with several subject-oriented data marts to serve various reporting
needs. In addition, an Online Analytical Processing (OLAP) engine can help
facilitate complex, interactive queries.

Data Consumption
This layer represents all end-user interfaces. A wide variety of tools and
technologies are available to fill the roles defined in this model. It should
be mentioned that although these physical layers may imply that there is
no direct flow of data from the Curation layer to the Consumption layer,
in some cases there is. The functional model supports the ability for some
users to connect directly to the data lake as needed.

To Cloud
or not to Cloud?
At least as popular as the topic of big data is the topic of cloud computing. Cloud
service providers give organizations the choice to avoid the costs of building and
managing a data center on premises by moving storage, compute and networking
to hosted solutions. Because a Modern Data Architecture involves a wide variety
of technologies and can represent a significant investment in both hardware and
software, a careful analysis of the options and costs is required before embarking
down the path of building a solution.

There are several considerations you should
take into account when making the decision
to deploy on-premises or in the cloud:

Do you have capacity?

If you have unused server hardware or capacity on existing virtual hosts, it may make sense for you to
start standing up an architecture in your own environment. This may be a great approach for a
proof-of-concept or a point solution, but you also need to factor scalability into the equation. Big data
solutions have a tendency to grow, and can need to grow rapidly. If your data center and infrastructure teams
aren't prepared for this type of growth, you may want to consider cloud. If you don't have capacity, then
leveraging a cloud provider makes it easy to get started quickly. You can start very small for your
proof-of-concept and then scale the infrastructure out as needed.

Do you prefer cap-ex or op-ex?

Some organizations budget for capital expenditures. They prefer large investments that they can
depreciate over time. If this is the case for you, you may want to consider an on-premises infrastructure.
You can decide on your needs for the year and plan for the coming two or three years and budget
accordingly. On the other hand, if you prefer operational expenditures, then cloud services will be to your
liking. You can pay month to month and adjust as needed.

Is your data already in the cloud?
If all of your data is on premises, then you may be thinking it will be a challenge to move that data to
a data center outside of your four walls. However, if you are using hosted services for some or all of
your systems, your data may already be in the cloud. In this case, network bandwidth may be less of a
consideration and it may be easier for you to get your data to a cloud provider. A related consideration
is whether your company is global. If you have data centers all over the world, you may already have
bandwidth concerns moving data between your own data centers. In this case, it may actually be easier
to get your data to a cloud data center that is local to your own data centers than it is to centralize your
data in one of your own premises.

Do you like managed services?

Building an on-premises deployment of a Modern Data Architecture will require you to have or develop
skillsets on a wide variety of technologies. If you're not doing data warehousing or big data today,
chances are you don't have all of the skills you need in house to support such an infrastructure. In
order to support a new range of unfamiliar software and hardware you will either need to train or hire
the appropriate talent. Cloud services providers support many of the components of the Modern Data
Architecture as Platform as a Service (PaaS) offerings. In other words, they have things like Hadoop,
Data Warehouse and Business Intelligence applications that are deployed and supported for you. This
alleviates the need for you to have full-time resources dedicated to the support and maintenance of
specialized software and hardware. On the other hand, using managed services may reduce your ability
to customize and configure the solution. If you want to add a component or upgrade to a newer version,
you may be dependent on the cloud service provider.

In a survey conducted in early 2015 by Gartner [5], CIOs named Business
Intelligence and Analytics as their top investment priority, followed closely
by Infrastructure and Data Center and Cloud. When looking through the
other items on the list, such as Digitization/Digital Marketing and Customer
Relationship/Experience, it's clear they also fall under the data management
heading. Without good data management, the marketing organization can't
decide who to market to and the customer service organization doesn't
know what their customers are thinking. Because so much of the business is
driven by data, building a solid foundation for data analysis must be a high
priority for any organization that wants to make informed decisions.

From CIOs name BI and Analytics No. 1 investment priority of 2015:

If you are looking to bring in new approaches, combined
with proven techniques, to support decision making at all
levels of your organization, let us help.

