
BIG DATA ANALYTICS

UNIT I

Introduction to Big Data

UNIT I- Introduction to Big Data 1


Syllabus

■ Introduction to Big Data:


■ What is Big Data, overview of Big Data Analytics, traditional database systems vs Big Data systems, the 5 V's of Big Data, importance of Big Data and real-world challenges.
■ Architecture of Big Data systems, Big Data applications, Data Analytics Life Cycle.



INTRODUCTION TO
BIG DATA

Data is created constantly, and at an ever-increasing rate.

Sources of Big Data:

1. Mobile phones, social media, imaging technologies: all these and more create new data that must be stored somewhere, for some purpose.

2. Devices and sensors automatically generate diagnostic information that needs to be stored and processed in real time.
Examples of big data
■ Photos and video footage uploaded to the World Wide Web.
■ Video surveillance, such as the thousands of video cameras spread across a city.
■ Mobile devices, which provide geospatial location data of the users, as well as metadata about text messages, phone calls, and application usage on smartphones.
■ Smart devices, which provide sensor-based collection of information from smart electric grids, smart buildings, and many other public and industry infrastructures.



Statistics of Big Data
■ We constantly generate data. On Google alone, we submit over 40,000 search queries per second. That amounts to 1.2 trillion searches yearly!

■ Each minute, 300 new hours of video show up on YouTube. That's why there's more than 1 billion gigabytes (1 exabyte) of data on its servers!
[an exabyte is equal to one quintillion (1,000,000,000,000,000,000) bytes]

■ People share more than 100 terabytes of data on Facebook daily. Every minute, users send over 31 million messages.
■ Big Data usage statistics indicate people take about 80% of photos on their smartphones. Considering that over 1.4 billion devices will be shipped worldwide this year alone, we can only expect this percentage to grow.

■ Smart devices (for example, fitness trackers and sensors) produce 5 quintillion bytes of data daily. In 5 years, we can expect the number of these gadgets to exceed 50 billion!


Merely keeping up with this huge volume of data is difficult, but substantially more challenging is analyzing vast amounts of it, especially when it does not conform to traditional notions of data structure, to identify meaningful patterns and extract useful information.
DEFINITION OF BIG DATA

■ Big Data is data whose scale, distribution, diversity, and/or timeliness require the use of new technical architectures and analytics to enable insights that unlock new sources of business value.



Overview of Big Data Analytics

■ Big data analytics is the often complex process of examining large and varied data sets, or big data, to uncover information -- such as hidden patterns, unknown correlations, market trends and customer preferences -- that can help organizations make informed business decisions.



Example:
Healthcare application


Big Data Analytics (contd.)

■ Big Data Analytics is a form of advanced analytics, which involves complex applications with elements such as
■ predictive models,
■ statistical algorithms and
■ what-if analysis,
powered by high-performance analytics systems.



[What-if analysis (contd.)]

■ Example of what-if analysis:

Suppose a student plans to score an average of 80 in the semester exams. She scored 82, 70, 83 and 76 in the subjects English, Mathematics, Computer Science and Mechanics respectively.

The Statistics exam is due shortly; we want to calculate the marks she needs to score in Statistics to achieve an average of 80 for the semester.
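The calculation above can be sketched in a few lines of Python (the function name is illustrative, not from any particular tool):

```python
# What-if analysis sketch: the marks needed in the remaining subject to
# reach a target semester average, using the scores from the example above.

def required_score(target_average, scores, total_subjects):
    """Return the score needed in the last subject to hit the target average."""
    return target_average * total_subjects - sum(scores)

scores = [82, 70, 83, 76]  # English, Mathematics, Computer Science, Mechanics
needed = required_score(80, scores, 5)
print(needed)  # → 89
```

Changing the target average (the "what if") simply re-runs the same formula with a different input.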



Importance of Big Data

Driven by specialized analytics systems and software, as well as high-powered computing systems, Big Data Analytics offers various business benefits, including:

■ New revenue opportunities
■ More effective marketing
■ Better customer service
■ Improved operational efficiency
■ Competitive advantages over rivals



■ Big data can deliver value in almost any area of business or society:



1. It helps companies to better understand and serve customers:

■ Examples include the recommendations made by Amazon or Netflix.


2. It allows companies to optimize their processes:

■ Uber is able to predict demand, dynamically price journeys and send the closest driver to the customers.


3. It improves healthcare:

■ Government agencies can now predict flu outbreaks and track them in real time, and pharmaceutical companies are able to use Big Data Analytics to fast-track drug development.


4. It helps us to improve security:

■ Government and law enforcement agencies use Big Data to foil terrorist attacks and detect cyber crime.


5. It allows sports stars to boost their performance:

■ Sensors in balls and GPS trackers on their clothes allow athletes to analyze and improve upon what they do.


Real-world challenges

■ 1. Exploiting the opportunities that Big Data presents requires new data architectures (including analytic sandboxes), new ways of working, and people with new skill sets.

■ These drivers are causing organizations to set up analytic sandboxes and build Data Science teams.


■ 2. Potential pitfalls of big data analytics initiatives include a lack of internal analytics skills and the high cost of hiring experienced personnel to fill the gaps.

■ 3. Although some organizations are fortunate enough to have data scientists, most are not: there is a growing talent gap that makes finding and hiring data scientists in a timely manner difficult.


Big Data Applications



Traditional database systems
vs
Big Data systems

5 V’s of Big Data



1. Volume:

■ Big data first and foremost has to be "big," and size in this case is measured as volume.
Example:

■ From clinical data associated with lab tests and physician visits, to the administrative data surrounding payments, this well of information is already expanding.

■ When that data is coupled with greater use of precision medicine, there will be a big data explosion in healthcare, especially as genomic and environmental data become more ubiquitous.


2. Velocity:

■ Velocity in the context of big data refers to two related concepts familiar to anyone in healthcare: the rapidly increasing speed at which new data is being created by technological advances, and the corresponding need for that data to be digested and analyzed in near real-time.

■ For example, as more and more medical devices are designed to monitor patients and collect data, there is great demand to be able to analyze that data and then to transmit it back to clinicians and others.

■ This "internet of things" of healthcare will only lead to increasing velocity of big data in healthcare.


3. Variety:

■ With increasing volume and velocity comes increasing variety. This third "V" describes just what you'd think: the huge diversity of data types that healthcare organizations see every day.

■ Scenario: Electronic health records and medical devices.

■ Each one might collect a different kind of data, which in turn might be interpreted differently by different physicians, or made available to a specialist but not a primary care provider.

■ Challenges:
■ Standardizing and distributing all of that information so that everyone involved is on the same page.

■ With increasing adoption of population health and big data analytics, we are seeing greater variety of data by combining
✓ traditional clinical and administrative data with
✓ unstructured notes,
✓ socioeconomic data and even
✓ social media data.


4. Variability

• The way care is provided to any given patient depends on all kinds of factors, and the way the care is delivered, and more importantly the way the data is captured, may vary from time to time or place to place.

• Such variability means data can only be meaningfully interpreted when the care setting and delivery process are taken into context.


■ For example, a diagnosis of "CP" may mean "chest pain" when entered by a cardiologist but "cerebral palsy" when entered by a neurologist.

■ Because true interoperability is still somewhat elusive in healthcare data, variability remains a constant challenge.


5. Value

■ Last but not least, Big Data must have value.

■ Organizations might use the same tools and technologies for gathering and analyzing the data they have available, but how they then put that data to work is ultimately up to them.


“Information is the oil of the 21st century, and analytics is the combustion engine”

– Peter Sondergaard


Architecture of Big Data Systems

1. Data Sources

1. For data sources to be loaded into the data warehouse, data needs to be well understood, organized and normalized with the appropriate data type definitions.

2. Data typically must go through significant preprocessing and checkpoints before it can enter this sort of controlled environment.


[
Checkpoints refer to validation points that compare the current value of specified properties, or the current state of an object, with the expected value; they can be inserted at any point in the script.
]
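As a rough illustration of such a checkpoint, the sketch below validates a record against expected properties before it would enter the warehouse; the record fields and rules are hypothetical, not from any particular tool:

```python
# Minimal data-checkpoint sketch: compare specified properties of a record
# with expected values/types and report which checks fail.

def checkpoint(record, expected):
    """Return the list of fields whose current value fails its expected check."""
    failures = []
    for field, check in expected.items():
        value = record.get(field)
        if not check(value):
            failures.append(field)
    return failures  # an empty list means the checkpoint passed

record = {"patient_id": "P001", "age": 42, "visit_date": "2020-01-15"}
expected = {
    "patient_id": lambda v: isinstance(v, str) and v.startswith("P"),
    "age": lambda v: isinstance(v, int) and 0 <= v <= 120,
}
print(checkpoint(record, expected))  # → []
```

A record that fails a checkpoint would be routed back for preprocessing rather than loaded into the controlled environment.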



■ Although this kind of centralization enables security, backup, and failover of highly critical data, it also means that data typically must go through significant preprocessing and checkpoints before it can enter this sort of controlled environment, which does not lend itself to data exploration and iterative analytics.


2. Departmental Warehouse

Data is read by additional applications across the enterprise for BI and reporting purposes.

These are high-priority operational processes getting critical data feeds from the repositories.


EDW (Enterprise Data Warehouse)

■ EDWs are central repositories of integrated data (from one or more sources).

■ An EDW is a system for reporting and data analysis.

■ It is considered a core component of Business Intelligence.

■ It stores current and historical data in one single place, which is used for creating analytical reports.


■ Example: analysis (visualization / summary)


EDW (contd.)

As a result of this level of control on the EDW, additional local systems may emerge in the form of departmental warehouses and local data marts that business users create to accommodate their need for flexible analysis.

These local data marts may not have the same constraints for security and structure as the main EDW and allow users to do some level of more in-depth analysis.

However, these one-off systems reside in isolation, often are not synchronized or integrated with other data stores, and may not be backed up.


[
A data mart is a structure / access pattern specific to data warehouse environments, used to retrieve client-facing data.

The data mart is a subset of the data warehouse and is usually oriented to a specific business line or team.

Whereas data warehouses have an enterprise-wide depth, the information in data marts pertains to a single department.
]
■ What is a Data Lake?
A data lake is a centralized storage repository that holds a massive amount of structured and unstructured data.

■ What is a Data Warehouse?
Data warehousing is about the collection of data from varied sources for meaningful business insights. An electronic storage of a massive amount of information, it is a blend of technologies that enables the strategic use of data.
■ At the end of this workflow, analysts get data provisioned for their downstream analytics.

■ Because users generally are not allowed to run custom or intensive analytics on production databases, analysts create data extracts from the EDW to analyze data offline in R or other local analytical tools.

■ Many times these tools are limited to in-memory analytics on desktops, analyzing samples of data rather than the entire population of a dataset.

■ Because these analyses are based on data extracts, they reside in a separate location, and the results of the analysis, and any insights on the quality of the data or anomalies, rarely are fed back into the main data repository.


Data Analytics Life Cycle

■ The Data Analytics Lifecycle defines analytics process best practices, spanning discovery to project completion.


Main phases of the Data Analytics Lifecycle
Phase 1 - Discovery

■ In Phase 1, the team learns the business domain, including relevant history such as whether the organization or business unit has attempted similar projects in the past from which it can learn.

■ The team assesses the resources available to support the project in terms of
■ people,
■ technology,
■ time and
■ data.


Phase 1 (contd.)

■ Important activities in this phase include framing the business problem as an analytics challenge that can be addressed in subsequent phases, and formulating Initial Hypotheses (IHs) to test and begin learning the data.


Phase 1 (contd.)

[Hypothesis: a proposed explanation for a phenomenon]

E.g.:
■ If children sleep for 8 hours daily, then it results in good health.
■ If students study consistently for 3 hours daily, then their exam scores improve by 5% over others.

Result of a hypothesis: true / falsified (false / nullified)
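An initial hypothesis like the second example can be checked against sample data with a simple comparison of group means. The scores below are invented purely for illustration:

```python
# Toy sketch of testing an initial hypothesis: do students who study
# consistently score at least 5% higher on average than the others?

def mean(xs):
    return sum(xs) / len(xs)

consistent = [78, 85, 90, 82]   # scores of students who study 3 hours daily
others = [70, 75, 80, 72]       # scores of the rest

improvement = (mean(consistent) - mean(others)) / mean(others) * 100
print(round(improvement, 1))    # → 12.8
print(improvement >= 5)         # → True: hypothesis supported on this sample
```

On a real project the team would use a proper statistical test and a far larger sample before accepting or falsifying the hypothesis.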



Phase 2 - Data preparation

■ Phase 2 requires the presence of an analytic sandbox, in which the team can work with data and perform analytics for the duration of the project.

■ The team needs to execute Extract, Load, and Transform (ELT) or Extract, Transform, and Load (ETL) to get data into the sandbox.

■ ELT and ETL together are sometimes abbreviated as ETLT.


ELT

■ Extract, Load, Transform (ELT) is a data integration process for transferring raw data from a source server to a data system (such as a data warehouse or data lake) on a target server, and then preparing the information for downstream uses.

■ ELT comprises a data pipeline with three different operations being performed on the data:


ELT vs ETL

■ The biggest determinant is how, when and where the data transformations are performed.

■ With ETL: the raw data is not available in the data warehouse because it is transformed before it is loaded.

■ With ELT: the raw data is loaded into the data warehouse and transformations occur on the stored data.


ELT vs ETL (contd.)

■ ETL is Extract, Transform and Load, while ELT is Extract, Load, and Transform of data.
■ In ETL, data moves from the data source, to staging, into the data warehouse.
■ ELT leverages the data warehouse to do basic transformations. No data staging is needed.
■ ETL can help with data privacy and compliance, cleansing sensitive and secure data even before loading it into the data warehouse.
■ ETL can perform sophisticated data transformations and can be more cost-effective than ELT.


ELT (contd.)

■ The first step is to Extract the data. Extracting data is the process of identifying and reading data from one or more source systems, which may be databases, files, archives, ERP, CRM or any other viable source of useful data.

■ The second step is to Load the extracted data. Loading is the process of adding the extracted data to the target database.

■ The third step is to Transform the data. Data transformation is the process of converting data from its source format to the format required for analysis.
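The three steps can be sketched in plain Python. The source records, field names and code-to-value mapping below are invented for illustration (the mapping reuses the earlier "CP" diagnosis example):

```python
# Minimal ELT sketch: extract rows from a source, load them raw into a
# target store, then transform inside the target store.

source = [
    {"code": "CP", "dept": "cardiology"},
    {"code": "CP", "dept": "neurology"},
]

def extract(src):
    """Extract: identify and read records from the source system."""
    return list(src)

def load(rows, target):
    """Load: add the extracted data, unchanged, to the target store."""
    target.extend(rows)

def transform(target):
    """Transform: convert stored data to the format required for analysis,
    here by resolving an ambiguous code depending on its context."""
    meanings = {"cardiology": "chest pain", "neurology": "cerebral palsy"}
    for row in target:
        row["diagnosis"] = meanings.get(row["dept"], row["code"])

warehouse = []
load(extract(source), warehouse)   # raw data lands in the warehouse first
transform(warehouse)               # transformations run on the stored data
print(warehouse[0]["diagnosis"])   # → chest pain
```

Note the ELT-specific ordering: `load` runs before `transform`, so the raw records remain available in the target store.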



ELT (contd.)

■ Transformation is typically based on rules that define how the data should be converted for usage and analysis in the target data store.

■ Examples of transformations include:
• Replacing codes with values
• Aggregating numerical sums
• Applying mathematical functions
• Converting data types
• Modifying text strings
• Combining data from different tables and databases


■ ETL is valuable when it comes to data quality, data security, and data compliance.
■ However, ETL is slow when ingesting unstructured data.

■ ELT is fast when ingesting large amounts of raw, unstructured data.
■ However, ELT sacrifices data quality, security, and compliance in many cases.


ETLT framework (Extract, Transform, Load, Transform) -
a new approach to refreshing the Enterprise Data Warehouse.

The idea is to combine ETL and ELT to exploit the strengths of each approach, achieving optimum performance and scalability.


Phase 3 - Model planning

■ Phase 3 is model planning, where the team determines the
■ methods,
■ techniques, and
■ workflow
it intends to follow for the subsequent model building phase.

■ The team explores the data to learn about the relationships between variables, and subsequently selects key variables and the most suitable models.


[
■ A method is a habitual, logical, or prescribed practice of achieving certain end results with accuracy and efficiency, usually in a preordained sequence of steps.

■ A technique is a systematic procedure, formula, or routine by which a task is accomplished.
]


[
A workflow is a sequence of tasks that processes a set of data.
Workflows occur across every kind of business and industry.
]


[relationships between variables]
[model]
■ A machine learning model is a program that has been trained to recognize certain types of patterns.
■ You train a model over a set of data, providing it an algorithm that it can use to reason over and learn from those data.


Phase 4 - Model building

■ In Phase 4, the team develops data sets for
■ testing,
■ training and
■ production purposes.
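One common way to develop testing and training sets is a random split of the available records; this is a minimal standard-library sketch, not a prescription from the referenced text:

```python
# Random train/test split using only the Python standard library.
import random

def train_test_split(data, test_ratio=0.3, seed=42):
    """Shuffle the data and split it into training and testing sets."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

records = list(range(10))          # stand-in for real data records
train, test = train_test_split(records)
print(len(train), len(test))       # → 7 3
```

The training set is used to fit models, the testing set to evaluate them on data the model has not seen; production data is what the deployed model eventually scores.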

■ Also, the team builds and executes models based on the work done in the model planning phase.

■ The team also considers whether its existing tools will suffice for running the models, or if it will need a more robust environment for executing models and workflows (for example, fast hardware and parallel processing, if applicable).


Phase 5 - Communicate results

■ In Phase 5, the team, in collaboration with major stakeholders, determines if the results of the project are a success or a failure based on the criteria developed in Phase 1.

■ The team should:
■ identify key findings,
■ quantify the business value,
■ develop a narrative to summarize, and
■ convey findings to stakeholders.


[stakeholders]

Communicate findings through a dashboard


Phase 6 - Operationalize

■ In Phase 6, the team delivers
■ final reports,
■ briefings,
■ code and
■ technical documents.

■ In addition, the team may run a pilot project to implement the models in a production environment.




[pilot project]

■ A pilot study, pilot project, pilot test, or pilot experiment is a small-scale preliminary study conducted to evaluate feasibility, duration, cost, and adverse events, and to improve upon the study design prior to performance of a full-scale research project.


References

■ David Dietrich, Barry Heller. Data Science and Big Data Analytics. EMC Education Services, Wiley Publications, 2015. ISBN 0-07-120413-X.
■ G. Sudha Sadhasivam, Thirumahal Rajkumar. Big Data Analytics. Oxford University Press.
■ Kevin Roebuck. Storing and Managing Big Data - NoSQL, HADOOP and More. Emereo Pty Limited. ISBN: 1743045743, 9781743045749.
■ https://www.researchgate.net/figure/The-five-Vs-of-Big-Data-Adapted-from-IBM-big-data-platform-Bringing-big-data-to-the_fig1_281404634 [image]
■ https://informationcatalyst.com [image]
■ https://www.slideshare.net/hktripathy/lecture2-big-data-life-cycle [image]
■ https://best-excel-tutorial.com/55-advanced/217-whatif-analysis [image]


■ https://olap.com/learn-bi-olap/olap-bi-definitions/business-intelligence/
■ https://www.blue-granite.com/blog/advantages-of-the-analytics-sandbox-for-data-lakes
■ https://searchdatamanagement.techtarget.com/definition/Extract-Load-Transform-ELT
■ https://www.linkedin.com/pulse/data-analytics-big-leap-future-mitwpu-scet/


UNIT 1 ends…
Unit taught by:
Dr. Bhavana Tiple ([email protected]) for A and B batches,
Prof. Suja Panickar ([email protected]) for C and D batches
