Introduction to Big Data

■ Introduction to Big Data:

■ What is Big Data, overview of Big Data Analytics,
traditional database systems vs Big Data systems, 5
V's of Big Data, importance of Big Data and real
world challenges.
■ Architecture of Big Data systems, Big Data
applications, Data Analytics Life Cycle.

Data is created constantly, and at an ever-increasing rate.

Sources of Big Data:

1. Mobile phones, social media, and imaging technologies
all create new data, and that data must be stored
somewhere for some purpose.

2. Devices and sensors automatically generate diagnostic
information that needs to be stored and processed in
real time.

Examples of big data
■ Photos and video footage uploaded to the World Wide Web
■ Video surveillance, such as the thousands of video
cameras spread across a city
■ Mobile devices, which provide geospatial location data
of the users, as well as metadata about text messages,
phone calls, and application usage on smart phones
■ Smart devices, which provide sensor-based collection of
information from smart electric grids, smart buildings,
and many other public and industry infrastructures

Statistics of Big Data
■ We constantly generate data. On Google alone, we submit
over 40,000 search queries per second. That amounts to 1.2
trillion searches yearly!

■ Each minute, 300 new hours of video show up on YouTube.
That’s why there’s more than 1 billion gigabytes (1 exabyte)
of data on its servers!
[An exabyte is equal to one quintillion
(1,000,000,000,000,000,000) bytes.]

■ People share more than 100 terabytes of data on Facebook

daily. Every minute, users send 31 million messages and
■ Big Data usage statistics indicate that people take
about 80% of photos on their smartphones.
Considering that over 1.4 billion devices will be
shipped worldwide this year alone, we can only
expect this percentage to grow.

■ Smart devices (for example, fitness trackers and sensors)
produce 5 quintillion bytes of data daily. In 5 years,
we can expect the number of these gadgets to exceed
50 billion!
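
The search and exabyte figures above can be sanity-checked with simple arithmetic. The short Python sketch below assumes a 365-day year and decimal (SI) units; the variable names are illustrative only.

```python
# Rough check of two statistics quoted above (assumes a 365-day year).
QUERIES_PER_SECOND = 40_000
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

queries_per_year = QUERIES_PER_SECOND * SECONDS_PER_YEAR
print(f"{queries_per_year:,} searches per year")

# Decimal (SI) units: 1 gigabyte = 10**9 bytes, 1 exabyte = 10**18 bytes.
BYTES_PER_GIGABYTE = 10**9
BYTES_PER_EXABYTE = 10**18
gigabytes_per_exabyte = BYTES_PER_EXABYTE // BYTES_PER_GIGABYTE
print(f"{gigabytes_per_exabyte:,} gigabytes in one exabyte")
```

The yearly total comes out at roughly 1.26 trillion, in line with the rounded figure of 1.2 trillion quoted above, and one exabyte is indeed one billion gigabytes.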

Merely keeping up with this huge influx of data is
difficult, but substantially more challenging is analyzing
vast amounts of it, especially when it does not conform to
traditional notions of data structure, to identify
meaningful patterns and extract useful information.


■ Big Data is data whose scale, distribution, diversity,
and/or timeliness require the use of new technical
architectures and analytics to enable insights that
unlock new sources of business value.

Overview of Big Data Analytics

■ Big data analytics is the often complex process of
examining large and varied data sets, or big data, to
uncover information (such as hidden patterns,
unknown correlations, market trends and customer
preferences) that can help organizations make
informed business decisions.

UNIT I- Introduction to Big Data 12
Healthcare application

Big Data Analytics (contd)

■ Big Data Analytics is a form of advanced analytics,
which involves complex applications with elements
such as
■ predictive models,
■ statistical algorithms and
■ what-if analysis
powered by high-performance analytics systems.

[What-if analysis (contd)]

■ Example of what-if analysis:

Suppose a student plans to score an average of 80 in the
semester exams. She scored 82, 70, 83 and 76 in English,
Mathematics, Computer Science and Mechanics respectively.

The Statistics exam is due shortly, and we want to calculate
the marks she needs to score in Statistics to achieve an
average of 80 for the semester.
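
The what-if question above can be worked out directly: the total needed for the target average, minus the marks already scored. The Python sketch below is one way to express it; the function name `required_score` is an illustrative choice, not part of the original material.

```python
def required_score(target_average, known_scores, total_subjects):
    """Marks needed in the one remaining subject to hit the target average."""
    if total_subjects - len(known_scores) != 1:
        raise ValueError("exactly one subject must remain")
    return target_average * total_subjects - sum(known_scores)

# English, Mathematics, Computer Science, Mechanics
scores = [82, 70, 83, 76]
needed = required_score(80, scores, total_subjects=5)
print(needed)  # 5 * 80 - 311 = 89

# "What if" she aims for an average of 82 instead?
needed_for_82 = required_score(82, scores, total_subjects=5)
print(needed_for_82)  # 5 * 82 - 311 = 99
```

Changing the target and re-running the calculation is the essence of what-if analysis: the inputs are varied to see how the required outcome changes.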

Importance of Big Data

Driven by specialized analytics systems and software,
as well as high-powered computing systems, Big Data
Analytics offers various business benefits, including:

■ New revenue opportunities
■ More effective marketing
■ Better customer service
■ Improved operational efficiency
■ Competitive advantages over rivals

■ Big data can deliver value in almost any area of business
or society:

1. It helps companies to better understand and serve
customers:

Examples include the recommendations made by Amazon or
Netflix.

2. It allows companies to optimize their processes:

Uber is able to predict demand, dynamically price
journeys and send the closest driver to the customers.

3. It improves healthcare:

Government agencies can now predict flu outbreaks and
track them in real time, and pharmaceutical companies
are able to use Big Data Analytics to fast-track drug
development.

4. It helps us to improve security:

Government and law enforcement agencies use Big
Data to foil terrorist attacks and detect cyber crime.

5. It allows sport stars to boost their performance:

Sensors in balls and GPS trackers on clothing allow
athletes to analyze and improve upon what they do.

Real world challenges

■ 1. Exploiting the opportunities that Big Data presents
requires new data architectures, including analytic
sandboxes, new ways of working, and people with new
skill sets.

■ These drivers are causing organizations to set up
analytic sandboxes and build Data Science teams.

■ 2. Potential pitfalls of big data analytics initiatives
include a lack of internal analytics skills and the high
cost of hiring experienced personnel to fill the gaps.

■ 3. Although some organizations are fortunate to have
data scientists (most do not), there is a growing
talent gap that makes it difficult to find and hire
data scientists in a timely manner.

Big Data Applications

Traditional database systems
Big Data systems

5 V’s of Big Data

1. Volume:

■ Big data first and foremost has to be “big,” and size in
this case is measured as volume.

■ From clinical data associated with lab tests and physician
visits, to the administrative data surrounding payments,
this well of information is already expanding.

■ When that data is coupled with greater use of precision
medicine, there will be a big data explosion in health
care, especially as genomic and environmental data
become more ubiquitous.

2. Velocity:
■ Velocity in the context of big data refers to two related concepts
familiar to anyone in healthcare: the rapidly increasing speed at which
new data is being created by technological advances, and the
corresponding need for that data to be digested and analyzed in near
real time.

■ For example, as more and more medical devices are designed to
monitor patients and collect data, there is great demand to be able to
analyze that data and then to transmit it back to clinicians and others.

■ This “internet of things” of healthcare will only lead to increasing
velocity of big data in healthcare.

3. Variety:
■ With increasing volume and velocity comes increasing variety. This
third “V” describes just what you’d think: the huge diversity of data
types that healthcare organizations see every day.

■ Scenario: Electronic health records and medical devices.

■ Each one might collect a different kind of data, which in turn might
be interpreted differently by different physicians—or made
available to a specialist but not a primary care provider.

■ Challenges:
■ Standardizing and distributing all of that information so that
everyone involved is on the same page.
■ With increasing adoption of population health and big data
analytics, we are seeing a greater variety of data, combining
✓ traditional clinical and administrative data with
✓ unstructured notes,
✓ socioeconomic data and even
✓ social media data.

4. Variability

• The way care is provided to any given patient depends on
all kinds of factors, and the way the care is delivered and,
more importantly, the way the data is captured may vary
from time to time or place to place.

• Such variability means data can only be meaningfully
interpreted when the care setting and delivery process are
taken into account.

■ For example, a diagnosis of “CP” may mean “chest
pain” when entered by a cardiologist but may mean
“cerebral palsy” when entered by a neurologist.

■ Because true interoperability is still somewhat elusive
in health care data, variability remains a constant
challenge.
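
The “CP” example above can be sketched in a few lines of Python: the same code only becomes meaningful once the context (here, the specialty of the person who entered it) is known. The lookup table and the `interpret` function are hypothetical and exist only to illustrate the idea.

```python
# Hypothetical lookup: the meaning of an abbreviation depends on who entered it.
ABBREVIATIONS = {
    ("CP", "cardiology"): "chest pain",
    ("CP", "neurology"): "cerebral palsy",
}

def interpret(code, specialty):
    """Resolve an abbreviation using the care setting as context."""
    meaning = ABBREVIATIONS.get((code, specialty))
    if meaning is None:
        return f"{code} (ambiguous without context)"
    return meaning

print(interpret("CP", "cardiology"))  # chest pain
print(interpret("CP", "neurology"))   # cerebral palsy
print(interpret("CP", "unknown"))     # CP (ambiguous without context)
```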

5. Value
■ Last but not least, Big Data must have value.

■ Organizations might use the same tools and technologies
for gathering and analyzing the data they have available,
but how they then put that data to work is ultimately up to
them.

“Information is the oil of the 21st
century, and analytics is the
combustion engine”

– Peter Sondergaard

Architecture of Big Data Systems

1. Data Sources

1. For data sources to be loaded into the data warehouse,
data needs to be well understood, organized and
normalized with the appropriate data type definitions.

2. Data typically must go through significant
preprocessing and checkpoints before it can enter this
sort of controlled environment.

A checkpoint is a validation point that compares
the current value of specified properties, or the current
state of an object, with the expected value. It can
be inserted at any point in the script.
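
As a minimal sketch of the checkpoint idea described above, the Python code below compares current values against expected values before a batch of rows is allowed to load. The `checkpoint` helper, the sample rows and the property names are all assumptions made for illustration.

```python
def checkpoint(name, actual, expected):
    """Compare the current value of a property with its expected value."""
    return {"name": name, "passed": actual == expected,
            "actual": actual, "expected": expected}

# Hypothetical batch of rows waiting to be loaded into the warehouse.
batch = [
    {"patient_id": 1, "weight_kg": 70.5},
    {"patient_id": 2, "weight_kg": 82.0},
]

results = [
    checkpoint("row count", len(batch), 2),
    checkpoint("weight_kg is float",
               all(isinstance(row["weight_kg"], float) for row in batch),
               True),
]

# The batch enters the controlled environment only if every checkpoint passes.
ready_to_load = all(result["passed"] for result in results)
print(ready_to_load)
```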

■ Although this kind of centralization enables security,
backup, and failover of highly critical data, it also
means that data typically must go through significant
preprocessing and checkpoints before it can enter this
sort of controlled environment, which does not lend
itself to data exploration and iterative analytics.

2. Departmental Warehouse

Data is read by additional applications across the
enterprise for BI and reporting purposes.

These are high-priority operational processes getting
critical data feeds from the repositories.

(Enterprise Data Warehouse)
■ EDWs are central repositories of integrated data (from
one or more sources).

■ An EDW is a system for reporting and data analysis.

■ It is considered a core component of Business
Intelligence.

■ It stores current and historical data in a single place,
which is used for creating analytical reports.

■ Example: analysis - visualization / summary


EDW (contd)
As a result of this level of control on the EDW, additional local
systems may emerge in the form of departmental warehouses and
local data marts that business users create to accommodate their
need for flexible analysis.

These local data marts may not have the same constraints for
security and structure as the main EDW and allow users to do
some level of more in-depth analysis.

However, these one-off systems reside in isolation, often are not
synchronized or integrated with other data stores, and may not be
backed up.

A data mart is a structure / access pattern specific to data
warehouse environments, used to retrieve client-facing
data.

The data mart is a subset of the data warehouse and is
usually oriented to a specific business line or team.

Whereas data warehouses have an enterprise-wide
depth, the information in data marts pertains to a single
department.
■ What is a Data Lake?
A data lake is a centralized storage repository that
holds a massive amount of structured and
unstructured data.

■ What is Data Warehouse?

Data warehousing is about the collection of data from
varied sources for meaningful business insights. An
electronic storage of a massive amount of information,
it is a blend of technologies that enable the strategic
use of data!
■ At the end of this workflow, analysts get data provisioned
for their downstream analytics.

■ Because users generally are not allowed to run custom or
intensive analytics on production databases, analysts
create data extracts from the EDW to analyze data offline
in R or other local analytical tools.

■ Many times these tools are limited to in-memory
analytics on desktops analyzing samples of data, rather
than the entire population of a dataset.
■ Because these analyses are based on data extracts,
they reside in a separate location, and the results of
the analysis, and any insights on the quality of the data
or anomalies, rarely are fed back into the main data
repository.

Data Analytics Life Cycle

■ The Data Analytics Lifecycle defines analytics process
best practices spanning discovery to project completion.

Main phases of Data Analytics

Phase 1- Discovery
■ In Phase 1, the team learns the business domain,
including relevant history such as whether the
organization or business unit has attempted similar
projects in the past from which they can learn.

■ The team assesses the resources available to support the
project in terms of
■ people,
■ technology,
■ time and
■ data.

Phase 1 (contd)

■ Important activities in this phase include framing the
business problem as an analytics challenge that can be
addressed in subsequent phases and formulating
Initial Hypotheses (IHs) to test and begin learning the
data.

Phase 1 (contd)
[Hypothesis: a proposed explanation for a phenomenon]

■ If children sleep for 8 hours daily, then it results in
good health.
■ If students study consistently for 3 hours daily, then
their exam scores improve by 5% over others.

Result of testing a hypothesis: confirmed (true) or
falsified (false / nullified)

Phase 2- Data preparation
■ Phase 2 requires the presence of an analytic sandbox, in
which the team can work with data and perform analytics
for the duration of the project.

■ The team needs to execute Extract, Load, and Transform
(ELT) or Extract, Transform and Load (ETL) to get data
into the sandbox.

■ The ELT and ETL are sometimes abbreviated as ETLT.

Extract, Load, Transform (ELT) is a data integration process for
transferring raw data from a source server to a data system (such as
a data warehouse or data lake) on a target server and then preparing
the information for downstream uses.

ELT consists of a data pipeline in which three operations are
performed on data: extract, load, and transform.

■ The biggest difference between ETL and ELT is how, when
and where the data transformations are performed.

■ With ETL: the raw data is not available in the data
warehouse because it is transformed before it is
loaded.
■ With ELT: the raw data is loaded into the data
warehouse and transformations occur on the stored
data.

ELT vs ETL (contd.)
■ ETL is Extract, Transform and Load while ELT is Extract, Load,
and Transform of data.
■ In ETL, data moves from the data source, to staging, into the data
warehouse.
■ ELT leverages the data warehouse to do basic transformations. No
data staging is needed.
■ ETL can help with data privacy and compliance, cleansing
sensitive & secure data even before loading into the data
warehouse.
■ ETL can perform sophisticated data transformations and can be
more cost effective than ELT.
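
The privacy point above (cleansing sensitive data before loading) can be illustrated with a small ETL sketch in Python. It uses an in-memory SQLite database as a stand-in for the warehouse; the sample rows, the `mask` helper and the table layout are assumptions for illustration, and a truncated hash is not a complete anonymization scheme.

```python
import hashlib
import sqlite3

# Hypothetical source rows: (patient_id, email, blood_pressure_reading)
SOURCE_ROWS = [
    ("P001", "alice@example.com", 120),
    ("P002", "bob@example.com", 135),
]

def mask(value):
    """Replace a sensitive value with a short one-way hash."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]

def transform(rows):
    """Transform step: runs in staging, BEFORE anything is loaded."""
    return [(pid, mask(email), reading) for pid, email, reading in rows]

warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE readings (patient_id TEXT, email_hash TEXT, reading INTEGER)"
)

staged = transform(SOURCE_ROWS)  # Extract + Transform
warehouse.executemany("INSERT INTO readings VALUES (?, ?, ?)", staged)  # Load

stored = warehouse.execute("SELECT * FROM readings").fetchall()
print(stored)  # raw e-mail addresses never reach the warehouse
```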

ELT (contd.)

The first step is to Extract the data. Extracting data is the
process of identifying and reading data from one or more
source systems, which may be databases, files, archives,
ERP, CRM or any other viable source of useful data.

The second step is to Load the extracted data. Loading is the
process of adding the extracted data to the target database.

The third step is to Transform the data. Data
transformation is the process of converting data from its
source format to the format required for analysis.

ELT (contd.)

Transformation is typically based on rules that define
how the data should be converted for usage and analysis
in the target data store.

Examples of transformations include:

• Replacing codes with values
• Aggregating numerical sums
• Applying mathematical functions
• Converting data types
• Modifying text strings
• Combining data from different tables and databases
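
The three ELT steps and several of the transformations listed above can be sketched with Python's built-in sqlite3 module, using an in-memory database as a stand-in for the target data store. The sample rows, table names and subject codes are made up for illustration.

```python
import sqlite3

# Hypothetical source rows: (student, subject_code, marks stored as text)
SOURCE_ROWS = [
    ("asha", "EN", "82"),
    ("asha", "MA", "70"),
    ("ravi", "EN", "64"),
]
SUBJECT_NAMES = {"EN": "English", "MA": "Mathematics"}

def extract():
    """Extract: read rows from the source system."""
    return list(SOURCE_ROWS)

def load(conn, rows):
    """Load: store the raw, untransformed rows in the target."""
    conn.execute(
        "CREATE TABLE raw_marks (student TEXT, subject_code TEXT, marks TEXT)"
    )
    conn.executemany("INSERT INTO raw_marks VALUES (?, ?, ?)", rows)

def transform(conn):
    """Transform: runs inside the target, on the stored raw data."""
    conn.execute("CREATE TABLE subjects (code TEXT PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO subjects VALUES (?, ?)", SUBJECT_NAMES.items())
    conn.execute("""
        CREATE TABLE marks AS
        SELECT upper(substr(r.student, 1, 1)) || substr(r.student, 2) AS student,
               s.name AS subject,                -- replace code with value
               CAST(r.marks AS INTEGER) AS marks -- convert data type
        FROM raw_marks r
        JOIN subjects s ON s.code = r.subject_code  -- combine tables
    """)

conn = sqlite3.connect(":memory:")
load(conn, extract())
transform(conn)

# Aggregate numerical sums per student.
totals = conn.execute(
    "SELECT student, SUM(marks) FROM marks GROUP BY student ORDER BY student"
).fetchall()
print(totals)  # [('Asha', 152), ('Ravi', 64)]
```

Because the raw table is kept alongside the transformed one, the original data remains available in the target for later re-processing.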

■ ETL is valuable when it comes to data quality, data
security, and data compliance.
■ However, ETL is slow when ingesting unstructured
data.

■ ELT is fast when ingesting large amounts of raw,
unstructured data.
■ However, ELT sacrifices data quality, security, and
compliance in many cases.

ETLT framework (extract, transform, load,
transform):
A new approach to refreshing the Enterprise Data
Warehouse

The idea is to combine ETL and ELT to exploit the strengths
of each approach to achieve optimum performance and

Phase 3-Model planning

■ Phase 3 is model planning, where the team determines:

■ methods,
■ techniques, and
■ workflow it intends to follow for the subsequent model
building phase.

■ The team explores the data to learn about the
relationships between variables and subsequently selects
key variables and the most suitable models.

■ A method is a habitual, logical, or prescribed
practice of achieving certain end results with
accuracy and efficiency, usually in a preordained
sequence of steps.

■ A technique is a systematic procedure, formula,
or routine by which a task is accomplished.

■ A workflow is a sequence of tasks that processes a
set of data.
Workflows occur across every kind of business and
industry.

[relationships between variables ]
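
One common way to explore the relationship between two variables during model planning is the Pearson correlation coefficient. The Python sketch below computes it by hand; the study-hours and exam-score figures are invented purely for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up sample data: daily study hours and exam scores for five students.
study_hours = [1, 2, 3, 4, 5]
exam_scores = [52, 58, 61, 70, 74]

r = pearson(study_hours, exam_scores)
print(round(r, 2))  # close to +1: a strong positive linear relationship
```

A value near +1 or -1 suggests a strong linear relationship, which makes the variable a candidate for inclusion in the model; a value near 0 suggests little linear relationship.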

■ A machine learning model is a program that has been
trained to recognize certain types of patterns.
■ You train a model over a set of data, providing it an
algorithm that it can use to reason over and learn from
those data.

Phase 4-Model building

■ In Phase 4, the team develops data sets for

■ testing,
■ training and
■ production purposes.
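
A minimal sketch of how a team might carve training and testing sets out of one dataset is shown below. The `train_test_split` function here is a hand-written illustration using only the standard library (not any particular library's API), and the 80/20 ratio and fixed seed are assumptions.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows reproducibly and split them into train and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Stand-in dataset: ten record identifiers.
records = list(range(10))
train, test = train_test_split(records)

print(len(train), len(test))  # 8 2
```

The model is fitted on the training set and evaluated on the testing set, so that its performance is measured on data it has not seen.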

■ Also, the team builds and executes models based on
the work done in the model planning phase.

■ The team also considers whether its existing tools will
suffice for running the models, or if it will need a
more robust environment for executing models and
workflows
(for example, fast hardware and parallel processing, if
applicable).

Phase 5-Communicate results

■ In Phase 5, the team, in collaboration with major
stakeholders, determines if the results of the project are a
success or a failure based on the criteria developed in
Phase 1.

■ The team should:

■ identify key findings,
■ quantify the business value,
■ develop a narrative to summarize and convey findings
to stakeholders.

Communicate findings
through dashboard

Phase 6 - Operationalize

■ In Phase 6, the team delivers

■ final reports,
■ briefings,
■ code and
■ technical documents.

■ In addition, the team may run a pilot project to
implement the models in a production environment.

[pilot project]

■ A pilot study, pilot project, pilot test, or pilot
experiment is a small-scale preliminary study
conducted to evaluate feasibility, duration, cost and
adverse events, and to improve upon the study design
prior to performance of a full-scale research project.

UNIT 1 ends…
