
UNIT

INTRODUCTION TO BIG DATA

1.1 INTRODUCTION

We are now living in the Big Data era.

A few years ago, systems, organizations and applications worked almost entirely with structured data, that is, data in the form of rows and columns. It was easy to use relational databases (RDBMS) and older tools to store, manage, process and report this data.

Recently, however, the nature of data has changed. Systems, organizations and applications now generate huge amounts of data in a variety of formats at a very fast rate.

This data is no longer simple structured data in rows and columns. Much of it has no proper format at all; it is just raw data. It is very difficult, and often not possible, to use older technologies, traditional relational databases and tools to store, manage, process and report this data. Traditional databases cannot store, process and analyze this kind of data.

How, then, do we solve this problem? This is where Big Data solutions come into the picture. They are designed to solve exactly these problems.

Let us start by understanding what Big Data is and how important it is in our lives.

There is no single straightforward definition of Big Data, so we will try to answer the question in different ways.

In simple words, Big Data is a technique for solving data problems that cannot be solved using traditional databases and tools.

Put another way, Big Data does not mean just a huge amount of data. It means a huge amount of data generated at a very fast rate in different formats.

Big Data is a technique to "store, process, manage, analyze and report" a huge amount of varied data, at the required speed and within the required time, to allow real-time analysis and reaction.

The term Big Data refers to all the data that is being generated across the globe at an unprecedented rate. This data can be either structured or unstructured. Today's business enterprises owe a huge part of their success to an economy that is firmly knowledge-oriented. Data drives the modern organizations of the world, so making sense of this data, unraveling its patterns and revealing unseen connections within the vast sea of data becomes a critical and hugely rewarding endeavor.

There is a need to convert Big Data into business intelligence that enterprises can readily deploy. Better data leads to better decision making and an improved way to strategize for organizations, regardless of their size, geography, market share, customer segmentation and other such categorizations. Hadoop is a platform of choice for working with extremely large volumes of data.

Big Data is a blanket term for the non-traditional strategies and technologies needed to gather, organize, process, and gain insights from large datasets. While the problem of working with data that exceeds the computing power or storage of a single computer is not new, the pervasiveness, scale, and value of this type of computing have greatly expanded in recent years.

Why Data is Important

We are living in the Data Era, or Information Era. Data is the most important factor for all organizations, for the following reasons:
Data is useful in decision making.
Data reveals customer preferences, so that organizations can improve their business.
Data provides the right information for the business.
By analyzing data, we can optimize our systems.
More data, more analysis, more results, more profits.
BIG DATA ANALYTICS (COMP., DBATU) (1.2) INTRODUCTION TO BIG DATA

Data is effective in improving business value.
Data analysis reveals customer likes and dislikes.

1.2 WHY BIG DATA IS SO IMPORTANT

Nowadays, Big Data is very important for medium-size to large-size organizations because it enables them to gather, store, manage and manipulate "extremely large amounts of data, extremely high velocity of data and extremely wide variety of data":
At the right speed
At the right time
To get the required business value

By following this Big Data 4Vs paradigm, we get many benefits, as shown below:

[Figure labels: Volume, Velocity, Variety, Veracity; Who, What, When, Where, How; Business Value]
Fig. 1.1: Big Data 4Vs

By using the Big Data 4Vs paradigm, organizations can gain many benefits by answering "What, Who, When, Where, How" questions:
What business decisions need to be made?
What insight can we derive from the information?
How accurate is that data in predicting business value?
Who could benefit from the information that we are capturing?
When do they need to know in order to make a more informed decision?
How to improve our business value?
How to improve our profits?
Where do we have more profits?

The importance of big data does not revolve around how much data a company has but around how the company utilizes the collected data. Every company uses data in its own way; the more efficiently a company uses its data, the more potential it has to grow. The company can take data from any source and analyze it to find answers which will enable:

Cost Savings: Some Big Data tools, like Hadoop and cloud-based analytics, can bring cost advantages to a business when large amounts of data are to be stored. These tools also help in identifying more efficient ways of doing business.

Time Reductions: The high speed of tools like Hadoop and in-memory analytics can easily identify new sources of data, which helps businesses analyze data immediately and make quick decisions based on the learning.

Understand the Market Conditions: By analyzing big data you can get a better understanding of current market conditions. For example, by analyzing customers' purchasing behaviors, a company can find out which products sell the most and produce products according to this trend. In this way, it can get ahead of its competitors.

Control Online Reputation: Big data tools can do sentiment analysis, so you can get feedback about who is saying what about your company. If you want to monitor and improve the online presence of your business, big data tools can help with all of this.

Using Big Data Analytics to Boost Customer Acquisition and Retention

The customer is the most important asset any business depends on. No business can claim success without first establishing a solid customer base. However, even with a customer base, a business cannot afford to disregard the high competition it faces. If a business is slow to learn what customers are looking for, it is very easy to begin offering poor quality products. In the end, loss of clientele will result, and this has an adverse overall effect on business success.

The use of big data allows businesses to observe various customer-related patterns and trends. Observing customer behavior is important to trigger loyalty.

Using Big Data Analytics to Solve Advertisers' Problem and Offer Marketing Insights

Big data analytics can help change all business operations. This includes the ability to match customer expectations, changing the company's product line and, of course, ensuring that the marketing campaigns are powerful.

Big Data Analytics As a Driver of Innovations and Product Development

Another huge advantage of big data is the ability to help companies innovate and redevelop their products.

Best Examples of Big Data

The best examples of big data can be found in both the public and the private sector: from targeted advertising, education and the massive industries already mentioned (healthcare, insurance, manufacturing or banking) to real-life scenarios in guest service or entertainment. With an estimated 1.7 megabytes of data generated every second for every person on the planet by the year 2020, the potential for data-driven organizational growth in the hospitality sector is enormous.

Big data can serve to deliver benefits in some surprising areas.

Big Data in Education Industry

The following are some of the fields in the education industry that have been transformed by big-data-motivated changes:
Customized and dynamic learning programs
Reframing course material
Grading systems
Career prediction

Big Data in Insurance Industry

The insurance industry holds importance not only for individuals but also for business companies. Insurance holds a significant place because it supports people during times of adversity and uncertainty. The data collected from these sources are of varying formats and change at tremendous speeds.

Collecting Information
As big data refers to gathering data from disparate sources, this feature creates a crucial use case for the insurance industry to pounce on.
Example: When a customer intends to buy car insurance in Kenya, the companies can obtain information from which they can calculate the safety levels for driving in the buyer's vicinity and his past driving records. On the basis of this they can effectively calculate the cost of car insurance as well.

Gaining Customer Insight
Determining customer experience and making customers the center of a company's attention is of prime importance to organizations.

Fraud Detection
Insurance fraud is a common occurrence, and the big data use case for reducing fraud is highly effective.

Threat Mapping
When an insurance agency sells a policy, it wants to be aware of all the possibilities of things going unfavorably for its customer and making them file a claim.

Big Data in Government Industry

Along with many other areas, big data in government can have an enormous impact at the local, national and global levels. With so many complex issues on the table today, governments have their work cut out trying to make sense of all the information they receive and make vital decisions that affect millions of people.

Governments of every country come face to face with a very large amount of data on an almost daily basis, because they have to keep track of various records and databases regarding their citizens. The proper study and analysis of this data helps governments in endless ways. A few of them are:
Welfare schemes
Cyber security

Big Data in Banking Sector

The amount of data in the banking sector is skyrocketing every second. According to GDC prognosis, this data is estimated to grow 700% by 2020.

Study and analysis of big data can help detect:
The misuse of credit cards
Misuse of debit cards
Venture credit hazard treatment
Business clarity
Customer statistics alteration
Money laundering
Risk mitigation

Real-Time Big Data Analytics Tools

More and more tools offer the possibility of real-time processing of Big Data.
Storm
Storm, which is now owned by Twitter, is a real-time distributed computation system.

Cloudera
Cloudera offers the Cloudera Enterprise RTQ tool, which provides real-time, interactive analytical queries on data stored in HBase or HDFS.

GridGain
GridGain is an enterprise open source grid computing system made for Java. It is compatible with Hadoop DFS and offers an alternative to Hadoop's MapReduce.

SpaceCurve
The technology that SpaceCurve is developing can discover underlying patterns in multidimensional geodata.

Big Data: Data Formats

In the Big Data 3V paradigm, one V refers to Variety. It means generating or receiving data in different formats. In the Data Era, people, systems, devices and organizations generate or receive the following types of data formats.

Structured Data

Structured data means data that is in the form of rows and columns, so it is very easy to store, even in relational databases. In simple words, anything that can be stored in the form of rows and columns is structured data.
For example: relational database data (online subscriptions, transactional data, etc.).

    Empno  Empname  Sal    Deptno
    1001   Abc      24500  10
    1002   Xyz      45000  20
    1003   Pqr      10000  30

Semi-Structured Data

Semi-structured data means data that is formatted in some way, but not in the form of rows and columns. It can also be stored in relational databases, but it is somewhat complex to manage and gives poor performance.

For example:

Log Files
In log files, columns are separated by "whitespace" characters (characters used to align things either horizontally or vertically, for instance a space, a tab, a new line, etc.).
Observe the following JBoss server log file:

    09:20:01,054 INFO [org.jboss.modules] (main) JBoss Modules version 1.3
    09:20:01,652 INFO [org.jboss.as.process.Host Controller.status] (main) JBAS012017: Starting process 'Host Controller'
    09:20:05,079 INFO [org.jboss.as.process.Server:myserver.status] (ProcessController-threads - 10) JBAS012017: Starting process 'Server:myserver'
    17:01:58,833 INFO [org.jboss.as.process] (Shutdown thread) JBAS012016: Shutting down process controller
    17:02:03,408 INFO [org.jboss.as.process.Host Controller.status] (Shutdown thread) JBAS012018: Stopping process 'Host Controller'
    17:02:15,246 INFO [org.jboss.as.process.Server:myserver.status] (ProcessController-threads - 9) JBAS012018: Stopping process 'Server:myserver'
    17:03:02,990 INFO [org.jboss.as.process.Server:myserver.status] (reaper for Server:myserver) JBAS012010: Process 'Server:myserver' finished with an exit status of 0
    17:03:13,170 INFO [org.jboss.as.process.Host Controller.status] (reaper for Host Controller) JBAS012010: Process 'Host Controller' finished with an exit status of 0
    17:03:13,195 INFO [org.jboss.as.process] (Shutdown thread) JBAS012015: All processes finished; exiting

If we observe the above log file, the first column (which contains the timestamp) is separated by whitespace from the second column (which contains the logging level). It is semi-formatted, not fully formatted, text.

XML Documents
Observe the following XML document. It is formatted with XML start and end tags.

    <?xml version="1.0" encoding="UTF-8"?>
    <employees>
      <emp id="1001">
        <empname>Abc</empname>
        <sal>24500</sal>
        <deptno>10</deptno>
      </emp>
    </employees>

Un-Structured Data

Un-structured data means data that is not formatted in any way. It is not possible to store this data in relational databases.
For example: audio files, videos, text typed by call centre executives, photos, sensor data, web data, mobile data, GPS data, social media data, etc. are un-structured data.
If we open any image file (for instance, a JPEG file) in a text editor, we see only binary data, which is not formatted in any form.

Nowadays, people, machines, devices, organizations and the Internet are generating multi-structured data, which means a combination of structured, semi-structured and un-structured data. It is not possible to store and manage this kind of data using traditional technologies, databases and tools.

Multi-Structured Data = Structured Data + Semi-Structured Data + Un-Structured Data

Here Big Data solutions solve this problem in an efficient and cost-effective way.

Big Data Advantages

If we use Big Data solutions to store, manage, process and report our data, we get the following benefits:
Store data of all types and sizes at low cost
Efficiently store, process and manage our data
Provides a cost-effective way to manage our data
Provides better-performing solutions
Provides highly scalable solutions
Produces the right business value
Increases productivity
Increases profits

Big Data Solutions

The following is a list of the most popular Big Data solutions available in the market:
Apache Hadoop Big Data Solution
Amazon Web Services (AWS) Big Data Solutions
Google Cloud Big Data Solutions
Microsoft Big Data Solutions
Cloudera Big Data Solutions
IBM Big Data Solutions
Oracle Big Data Solutions

Big Data Use Cases

Most organizations are using or moving to Big Data, so it is not possible to list all Big Data organizations or customers here. We will mention only some popular organizations that are using and benefiting from Big Data solutions.

Facebook
Facebook is one of the most popular social networking websites. Worldwide, around 1000 million users use the Facebook application. It collects around 500 TB (terabytes) per day from user subscriptions, likes, posts, relationship information, audio, videos, pictures, etc.

Google
Google uses its Big Data cloud platform to manage the data of its applications, such as Gmail, Google+, Google Search and YouTube.

Aadhaar India
In India, UIDAI (Unique Identification Authority of India) manages all Aadhaar card information. It also uses Big Data solutions to manage that huge amount of data.

RedBus
RedBus is India's largest online bus ticket and hotel booking organization. It also uses Big Data solutions to manage a huge amount of data with a very high traffic rate.

eBay and Amazon
The two world-famous online shopping giants, eBay and Amazon, also use Big Data solutions to manage their customer data, product information, etc.

Airline Industry
Many airlines (for example, British Airways and Singapore Airlines) today use Big Data solutions to store and manage their aircraft and customer information.
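The difference between semi-structured and structured data described under Data Formats can be made concrete with a short sketch. It turns one JBoss-style log line and a small employees XML document into row-and-column records. The regular expression, the function names and the sample strings are illustrative assumptions written for this example; they are not part of JBoss or of any Big Data product.

```python
import re
import xml.etree.ElementTree as ET

# Assumed layout of one log line: time, level, [category], (thread), message.
LOG_LINE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"\[(?P<category>[^\]]+)\]\s+"
    r"\((?P<thread>[^)]+)\)\s+"
    r"(?P<message>.*)$"
)


def parse_log_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def employees_from_xml(text):
    """Flatten <emp> elements into row-like dicts (Empno, Empname, Sal, Deptno)."""
    root = ET.fromstring(text)
    rows = []
    for emp in root.findall("emp"):
        rows.append({
            "empno": int(emp.get("id")),
            "empname": emp.findtext("empname"),
            "sal": int(emp.findtext("sal")),
            "deptno": int(emp.findtext("deptno")),
        })
    return rows


sample_log = (
    "17:01:58,833 INFO [org.jboss.as.process] (Shutdown thread) "
    "JBAS012016: Shutting down process controller"
)
sample_xml = """<employees>
  <emp id="1001">
    <empname>Abc</empname>
    <sal>24500</sal>
    <deptno>10</deptno>
  </emp>
</employees>"""

print(parse_log_line(sample_log))
print(employees_from_xml(sample_xml))
```

Once parsed, each record has the same shape as a row of the Empno/Empname/Sal/Deptno table, which is why semi-structured data can be loaded into a relational database, at the price of this extra parsing step.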
Yahoo
Yahoo also uses its Big Data cloud platform solutions to manage the data of its applications, such as Yahoo Mail, Yahoo Search and Flickr.

Safari Books Online
Safari Books Online is an online subscription service that gives individuals and organizations access to online books, tutorials and videos.

New York Stock Exchange
The New York Stock Exchange is one of the most famous stock exchanges in the world. It generates about 5 TB (terabytes) of data per day.

Big Data is data that has the following three characteristics:
Extremely large volumes of data.
Extremely high velocity of data.
Extremely wide variety of data.

1.3 BIG DATA CHARACTERISTICS

The following three are known as the "Big Data Characteristics":
1. Volume
2. Velocity
3. Variety

1. Volume:
Volume means "how much data is generated". Nowadays, organizations, human beings and systems generate or receive vast amounts of data, from terabytes (TB) to petabytes (PB) to exabytes (EB) and more.
The name Big Data itself relates to a size that is enormous. The size of data plays a crucial role in determining the value that can be drawn from it. Whether a particular data set can actually be considered Big Data also depends on its volume. Hence, volume is one characteristic that needs to be considered while dealing with Big Data.
VOLUME = Very large amount of data

2. Velocity:
Velocity means "how fast data is produced". Nowadays, organizations, human beings and systems generate huge amounts of data at a very fast rate. The term "velocity" refers to the speed of generation of data. How fast the data is generated and processed to meet demand determines the real potential in the data.
Big Data velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.
VELOCITY = Produce data at very fast rate

3. Variety:
Variety means "different forms of data". Nowadays, organizations, human beings and systems generate huge amounts of data at a very fast rate in different formats. The different data formats are discussed in detail in the section on data formats. Variety refers to heterogeneous sources and the nature of data, both structured and unstructured.
In earlier days, spreadsheets and databases were the only sources of data considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining and analyzing data.
VARIETY = Produce data in different formats

Big Data refers to the 3V (VVV) paradigm:

[Figure labels: Volume, Velocity, Variety]
Fig. 1.2: Big Data - 3Vs

The three "Vs" paradigm (Volume, Velocity, Variety) of Big Data was defined by Doug Laney in 2001.

If our organization's data fits this 3Vs paradigm, we have a Big Data problem, and we should use Big Data solutions to solve it.

The 3Vs paradigm alone is not enough to get better value from our Big Data. There is another V (the 4th V), which is the most important for every Big Data problem.
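To give a feel for the Volume characteristic, the daily figures quoted in the use cases (about 500 TB per day for Facebook and about 5 TB per day for the New York Stock Exchange) can be converted into yearly totals. The sketch below is only illustrative arithmetic: it assumes decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes) and a 365-day year, and the function name is made up for this example.

```python
# Decimal (SI) storage units, expressed in bytes.
TB = 10**12
PB = 10**15
EB = 10**18


def yearly_volume_pb(tb_per_day, days=365):
    """Convert a daily volume given in TB into a yearly volume in PB."""
    return tb_per_day * TB * days / PB


# Daily volumes as quoted in the chapter's use cases.
daily_tb = {"Facebook": 500, "New York Stock Exchange": 5}

for source, tb in daily_tb.items():
    print(f"{source}: {tb} TB/day is about {yearly_volume_pb(tb):.1f} PB/year")
```

Under these assumptions, a source producing a few hundred terabytes per day accumulates well over a hundred petabytes in a year, which is the scale at which a single machine or a traditional database stops being practical.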
4th V: Veracity

Veracity means "the quality, correctness or accuracy of captured data". Of the 4Vs, it is the most important V for any Big Data solution, because without correct data there is no use in storing a large amount of data at a fast rate and in different formats. The data should give correct business value.
Veracity also refers to the inconsistency that data can show at times, which hampers the ability to handle and manage the data effectively.
VERACITY = The correctness of data

So this 4th V answers the following questions:
1. How accurate is that data in predicting business value?
2. Do the results of a big data analysis actually make sense?

Big Data 4Vs in Simple Terminology:
V (Volume): The amount of data
V (Variety): The number of types of data
V (Velocity): The speed of data processing
V (Veracity): The correctness of data

1.4 CHALLENGES AND APPLICATIONS OF BIG DATA

1.4.1 Challenges of Big Data

The handling of big data is very complex. Some challenges faced during its integration include uncertainty of data management, the big data talent gap, getting data into a big data structure, syncing across data sources, getting useful information out of the big data, volume, skill availability, solution cost, etc.

1. The Uncertainty of Data Management:
One disruptive facet of big data management is the use of a wide range of innovative data management tools and frameworks whose designs are dedicated to supporting operational and analytical processing. NoSQL (not only SQL) frameworks differ from traditional relational database management systems and are largely designed to fulfill the performance demands of big data applications, such as managing a large amount of data and providing quick response times.
There are a variety of NoSQL approaches, such as hierarchical object representation (for example JSON, XML and BSON) and the concept of key-value storage. The wide range of NoSQL tools and developers, and the status of the market, create uncertainty in data management.

2. Talent Gap in Big Data:
It is difficult to follow technology media and analysts without being bombarded with content touting the value of big data analysis and the corresponding reliance on a wide range of disruptive technologies.
The new tools that have evolved in this sector range from traditional relational database tools with alternative data layouts designed to maximize access speed while reducing the storage footprint, to NoSQL data management frameworks, in-memory analytics and the broad Hadoop ecosystem.
The reality is that there is a lack of skills available in the market for big data technologies. The typical expert has gained experience through tool implementation and the use of a tool as a programming model, rather than through the data management aspects of big data.

3. Getting Data into Big Data Structure:
It might be obvious that the intent of big data management involves analyzing and processing a large amount of data. Many people have raised expectations about analyzing huge data sets on a big data platform. They may not be aware of the complexity behind the transmission, access and delivery of data and information from a wide range of resources, and then loading these data into a big data platform.
The intricate aspects of data transmission, access and loading are only part of the challenge. The requirement to navigate transformation and extraction is not limited to conventional relational data sets.

4. Syncing Across Data Sources:
Once you import data into big data platforms, you may also realize that data copies migrated from a wide range of sources at different rates and on different schedules can rapidly get out of synchronization with the originating system. Synchronization implies that the data coming from one source is not out of date compared with the data coming from another source. It also means commonality of data definitions, concepts, metadata and the like.
In traditional data management and data warehouses, the sequence of data transformation, extraction and migration gives rise to situations in which there is a risk of data becoming unsynchronized.

5. Extracting Information from the Data in Big Data Integration:
The most practical use cases for big data involve the availability of data, augmenting existing storage of data, and allowing end users access through business intelligence tools for the purpose of data discovery. This business intelligence must be able to connect different big data platforms and also provide transparency to the data consumers, to eliminate the requirement for custom coding.
At the same time, as the number of data consumers grows, there is a need to support an increasing number of simultaneous user accesses. This demand may also spike at any time in reaction to different aspects of business process cycles. Ensuring right-time data availability to the data consumers therefore also becomes a challenge in big data integration.

6. Miscellaneous Challenges:
Other challenges may occur while integrating big data. They include integration of data, skill availability, solution cost, the volume of data, the rate of transformation of data, and the veracity and validity of data.
One challenge is the ability to merge data that is dissimilar in source or structure, and to do so at a reasonable cost and in reasonable time. It is also a challenge to process a large amount of data at a reasonable speed so that information is available to data consumers when they need it. The data set must also be validated while data is transferred from one source to another or to consumers.
These are the challenges of big data integration that one can face during implementation. They must be considered and taken care of if you are going to manage any big data platform.

7. Dealing with Data Growth
The most obvious challenge associated with big data is simply storing and analyzing all that information. In its Digital Universe report, IDC estimates that the amount of information stored in the world's IT systems is doubling about every two years. By 2020, the total amount will be enough to fill a stack of tablets that reaches from the earth to the moon 6.6 times. And enterprises have responsibility or liability for about 85 percent of that information.
Much of that data is unstructured, meaning that it doesn't reside in a database. Documents, photos, audio, videos and other unstructured data can be difficult to search and analyze.
It's no surprise, then, that the IDG report found, "Managing unstructured data is growing as a challenge, rising from 31 percent in 2015 to 45 percent in 2016."
In order to deal with data growth, organizations are turning to a number of different technologies. When it comes to storage, converged and hyperconverged infrastructure and software-defined storage can make it easier for companies to scale their hardware. And technologies like compression, deduplication and tiering can reduce the amount of space and the costs associated with big data storage.
On the management and analysis side, enterprises are using tools like NoSQL databases, Hadoop, Spark, big data analytics software, business intelligence applications, artificial intelligence and machine learning to help them comb through their big data stores to find the insights their companies need.

8. Generating Insights in a Timely Manner
Of course, organizations don't just want to store their big data; they want to use that big data to achieve business goals. According to the NewVantage Partners survey, the most common goals associated with big data projects included the following:
Decreasing expenses through operational cost efficiencies
Establishing a data-driven culture
Creating new avenues for innovation and disruption
Accelerating the speed with which new capabilities and services are deployed
Launching new product and service offerings
All of those goals can help organizations become more competitive, but only if they can extract insights from their big data and then act on those insights quickly. PwC's Global Data and Analytics Survey 2016 found, "Everyone wants decision-making to be faster, especially in banking, insurance, and healthcare."
To achieve that speed, some organizations are looking to a new generation of ETL and analytics tools that dramatically reduce the time it takes to generate reports. They are investing in software with real-time analytics capabilities that allows them to respond to developments in the marketplace immediately.

9. Recruiting and Retaining Big Data Talent
But in order to develop, manage and run those applications that generate insights, organizations need professionals with big data skills. That has driven up demand for big data experts, and big data salaries have increased dramatically as a result.
The 2017 Robert Half Technology Salary Guide reported that big data engineers were earning between $135,000 and $196,000 on average, while data scientist salaries ranged from $116,000 to $163,500. Even business intelligence analysts were very well paid, making $118,000 to $138,750 per year.
In order to deal with talent shortages, organizations have several options. First, many are increasing their budgets and their recruitment and retention efforts. Second, they are offering more training opportunities to their current staff members in an attempt to develop the talent they need from within.
Third, many organizations are looking to technology. They are buying analytics solutions with self-service and/or machine learning capabilities. Designed to be used by professionals without a data science degree, these tools may help organizations achieve their big data goals even if they do not have a lot of big data experts on staff.

10. Integrating Disparate Data Sources
The variety associated with big data leads to challenges in data integration. Big data comes from a lot of different places: enterprise applications, social media streams, email systems, employee-created documents, etc. Combining all that data and reconciling it so that it can be used to create reports can be incredibly difficult. Vendors offer a variety of ETL and data integration tools designed to make the process easier, but many enterprises say that they have not solved the data integration problem yet.
In response, many enterprises are turning to new technology solutions. In the IDG report, 89 percent of those surveyed said that their companies planned to invest in new big data tools in the next 12 to 18 months. When asked which kind of tools they were planning to purchase, integration technology was second on the list, behind data analytics software.

11. Validating Data
Closely related to the idea of data integration is the idea of data validation. Often organizations are getting similar pieces of data from different systems, and the data in those different systems doesn't always agree. For example, the ecommerce system may show daily sales at a certain level while the Enterprise Resource Planning (ERP) system has a slightly different number. Or a hospital's Electronic Health Record (EHR) system may have one address for a patient, while a partner pharmacy has a different address on record.
The process of getting those records to agree, as well as making sure the records are accurate, usable and secure, is called data governance. And in the AtScale 2016 Big Data Maturity Survey, the fastest-growing area of concern cited by respondents was data governance.
Solving data governance challenges is very complex and usually requires a combination of policy changes and technology. Organizations often set up a group of people to oversee data governance and write a set of policies and procedures. They may also invest in data management solutions designed to simplify data governance and help ensure the accuracy of big data stores and the insights derived from them.

12. Securing Big Data
Security is also a big concern for organizations with big data stores. After all, some big data stores can be attractive targets for hackers or Advanced Persistent Threats (APTs).
However, most organizations seem to believe that their existing data security methods are sufficient for their big data needs as well. In the IDG survey, less than half of those surveyed (39 percent) said that they were using additional security measures for their big data repositories or analyses. Among those who do use additional measures, the most popular include identity and access control (59 percent), data encryption (52 percent) and data segregation (42 percent).

13. Organizational Resistance
It is not only the technological aspects of big data that can be challenging; people can be an issue too.
In the NewVantage Partners survey, 85.5 percent of those surveyed said that their firms were committed to creating a data-driven culture, but only 37.1 percent said they had been successful with those efforts. When asked about the impediments to that culture shift, respondents pointed to three big obstacles within their organizations:

(i) Insufficient organizational alignment (4.6 percent)
(ii) Lack of middle management adoption and understanding (41.0 percent)
(iii) Business resistance or lack of understanding (41.0 percent)

In order for organizations to capitalize on the opportunities offered by big data, they are going to have to do some things differently. And that sort of change can be tremendously difficult for large organizations.

1.4.2 Applications of Big Data

The primary goal of big data analytics is to help companies make more informed business decisions by enabling data scientists, predictive modelers, and other analytics professionals to analyze large volumes of transactional data, as well as other forms of data that may be untapped by more conventional Business Intelligence (BI) programs.

That could include web server logs and Internet click-stream data, social media content and social network activity reports, text from customer emails and survey responses, mobile phone call detail records and machine data captured by sensors connected to the Internet of Things.

Fig. 1.3: Big Data Applications (Banking and Securities, Media and Entertainment, Insurance, Health care, Transportation, Education, Energy and Utilities, Manufacturing)

Big data has found many applications in various fields today. The major fields where big data is being used are as follows:

Government

Big data analytics has proven to be very useful in the government sector. Big data analysis played a large role in Barack Obama's successful 2012 re-election campaign. More recently, big data analysis was majorly responsible for the BJP and its allies winning a highly successful Indian General Election in 2014. The Indian Government utilizes numerous techniques to ascertain how the Indian electorate is responding to government action, as well as ideas for policy augmentation.

Social Media Analytics

The advent of social media has led to an outburst of big data. Various solutions have been built to analyze social media activity. For example, IBM's Cognos Consumer Insights, a point solution running on IBM's BigInsights Big Data platform, can make sense of the chatter. Social media can provide valuable real-time insights into how the market is responding to products and campaigns. With the help of these insights, companies can adjust their pricing, promotion, and campaign placements accordingly. Before the big data can be used, some preprocessing has to be done on it in order to derive intelligent and valuable results. Thus, to understand the consumer mindset, the application of intelligent decisions derived from big data is necessary.

Technology

The technological applications of big data include the following companies, which deal with huge amounts of data every day and put it to use for business decisions as well.

For example, eBay.com uses two data warehouses at 7.5 petabytes and 40 PB as well as a 40 PB Hadoop cluster for search, consumer recommendations, and merchandising (see "Inside eBay's 90PB data warehouse").

Amazon.com handles millions of backend operations every day as well as queries from more than half a million third-party sellers. The core technology that keeps Amazon running is Linux-based and, as of 2005, they had the world's three largest Linux databases, with capacities of 7.8 TB, 18.5 TB, and 24.7 TB.
Facebook handles 50 billion photos from its user base.

Windermere Real Estate uses anonymous GPS signals from nearly 100 million drivers to help new home buyers determine their typical drive times to and from work throughout various times of the day.

Fraud Detection

For businesses whose operations involve any type of claims or transaction processing, fraud detection is one of the most compelling Big Data application examples. Historically, fraud detection on the fly has proven an elusive goal. In most cases, fraud is discovered long after the fact, at which point the damage has been done and all that's left is to minimize the harm and adjust policies to prevent it from happening again.

Big Data platforms that can analyze claims and transactions in real time, identifying large-scale patterns across many transactions or detecting anomalous behavior from an individual user, can change the fraud detection game.

Call Center Analytics

Now we turn to the customer-facing Big Data application examples, of which call center analytics are particularly powerful. What's going on in a customer's call center is often a great barometer and influencer of market sentiment, but without a Big Data solution, much of the insight that a call center can provide will be overlooked or discovered too late.

Big Data solutions can help identify recurring problems or customer and staff behavior patterns on the fly, not only by making sense of time/quality resolution metrics but also by capturing and processing call content itself.

Banking

The use of customer data invariably raises privacy issues. By uncovering hidden connections between seemingly unrelated pieces of data, big data analytics could potentially reveal sensitive personal information. Research indicates that 62% of bankers are cautious in their use of big data due to privacy issues. Further, outsourcing of data analysis activities or distribution of customer data across departments for the generation of richer insights also amplifies security risks.

In some incidents, sensitive information such as customers' earnings, savings, mortgages, and insurance policies has ended up in the wrong hands. Such incidents reinforce concerns about data privacy and discourage customers from sharing personal information in exchange for customized offers.

Agriculture

A biotechnology firm uses sensor data to optimize crop efficiency. It plants test crops and runs simulations to measure how plants react to various changes in condition. Its data environment constantly adjusts to changes in the attributes of various data it collects, including temperature, water levels, soil composition, growth, output, and gene sequencing of each plant in the test bed. These simulations allow it to discover the optimal environmental conditions for specific gene types.

Marketing

Marketers have begun to use facial recognition software to learn how well their advertising succeeds or fails at stimulating interest in their products. A recent study published in the Harvard Business Review looked at what kinds of advertisements compelled viewers to continue watching and what turned viewers off. Among their tools was "a system that analyses facial expressions to reveal what viewers are feeling." The research was designed to discover what kinds of promotions induced watchers to share the ads with their social network, helping marketers create ads most likely to "go viral" and improve sales.

Smart Phones

Perhaps more impressive, people now carry facial recognition technology in their pockets. Users of iPhone and Android smart phones have applications at their fingertips that use facial recognition technology for various tasks. For example, Android users with the Remember app can snap a photo of someone, then bring up stored information about that person based on their image when their own memory lets them down, a potential boon for salespeople.

Telecom

Nowadays big data is used in many different fields, and it also plays an important role in telecom. Operators face an uphill challenge when they need to deliver new, compelling, revenue-generating services without overloading their networks while keeping their running costs under control.

The market demands a new set of data management and analysis capabilities that can help service providers
make accurate decisions by taking into account customer, network context and other critical aspects of their businesses. Most of these decisions must be made in real time, placing additional pressure on the operators.

Real-time predictive analytics can help operators leverage the data that resides in their multitude of systems, make it immediately accessible and correlate that data to generate insight that can help them drive their business forward.

Healthcare

Traditionally, the healthcare industry has lagged behind other industries in the use of big data. Part of the problem stems from resistance to change: providers are accustomed to making treatment decisions independently, using their own clinical judgment, rather than relying on protocols based on big data. Other obstacles are more structural in nature. This makes healthcare one of the best places to set an example for Big Data applications. Even within a single hospital, payor, or pharmaceutical company, important information often remains siloed within one group or department because organizations lack procedures for integrating data and communicating findings.

Health care stakeholders now have access to promising new threads of knowledge. This information is a form of "big data," so called not only for its sheer volume but for its complexity, diversity, and timeliness. Pharmaceutical industry experts, payers, and providers are now beginning to analyze big data to obtain insights. Recent technological advances in the industry have improved their ability to work with such data, even though the files are enormous and often have different database structures and technical characteristics.

1.5 ENABLING TECHNOLOGIES FOR BIG DATA

Big data analytics technology is a combination of several techniques and processing methods. What makes them effective is their collective use by enterprises to obtain relevant results for strategic management and implementation.

In spite of the investment, enthusiasm, and ambition to leverage the power of data to transform the enterprise, results vary in terms of success. Organizations still struggle to forge what would be considered a "data-driven" culture. Of the executives who report starting such a project, only 40.2% report having success. Big transformations take time, and while the vast majority of firms aspire to being "data-driven", a much smaller percentage have realized this ambition. Cultural transformations seldom occur overnight.

At this point in the evolution of big data, the challenges for most companies are not related to technology. The biggest impediments to adoption relate to cultural challenges: organizational alignment, resistance or lack of understanding, and change management.

Here are some key technologies that enable Big Data for Businesses:

Fig. 1.4: Key Technologies that Enable Big Data Analytics for Business (Predictive Analytics, NoSQL Databases, Knowledge Discovery Tools, Stream Analytics, In-Memory Data Fabric, Distributed Storage, Data Virtualization, Data Integration, Data Preprocessing, Data Quality)
1. Predictive Analytics

Predictive analytics is one of the prime tools businesses use to reduce risk in decision making. Predictive analytics hardware and software solutions can be utilized for discovery, evaluation and deployment of predictive scenarios by processing big data. Such analysis can help companies be prepared for what is to come and help solve problems by analyzing and understanding them.

2. NoSQL Databases

These databases are utilized for reliable and efficient data management across a scalable number of storage nodes. NoSQL databases store data as JSON documents, key-value pairs, wide-column tables or graphs rather than as relational database tables.

3. Knowledge Discovery Tools

These are tools that allow businesses to mine big data (structured and unstructured) which is stored on multiple sources. These sources can be different file systems, APIs, DBMS or similar platforms. With search and knowledge discovery tools, businesses can isolate and utilize the information to their benefit.

4. Stream Analytics

Sometimes the data an organization needs to process can be stored on multiple platforms and in multiple formats. Stream analytics software is highly useful for filtering, aggregation, and analysis of such big data. Stream analytics also allows connection to external data sources and their integration into the application flow.

5. In-Memory Data Fabric

This technology helps in distribution of large quantities of data across system resources such as Dynamic RAM, Flash Storage or Solid State Storage Drives, which in turn enables low latency access and processing of big data on the connected nodes.

6. Distributed Storage

As a way to counter independent node failures and loss or corruption of big data sources, distributed file stores contain replicated data. Sometimes the data is also replicated for low latency quick access on large computer networks. These are generally non-relational databases.

7. Data Virtualization

It enables applications to retrieve data without having to deal with technical restrictions such as data formats, the physical location of data, etc. Used by Apache Hadoop and other distributed data stores for real-time or near real-time access to data stored on various platforms, data virtualization is one of the most used big data technologies.

8. Data Integration

A key operational challenge for most organizations handling big data is to process terabytes (or petabytes) of data in a way that can be useful for customer deliverables. Data integration tools allow businesses to streamline data across a number of big data solutions such as Amazon EMR, Apache Hive, Apache Pig, Apache Spark, Hadoop, MapReduce, MongoDB and Couchbase.

9. Data Preprocessing

These software solutions are used for manipulation of data into a format that is consistent and can be used for further analysis. The data preparation tools accelerate the data sharing process by formatting and cleansing unstructured data sets. A limitation of data preprocessing is that not all of its tasks can be automated; some require human oversight, which can be tedious and time-consuming.

10. Data Quality

An important parameter for big data processing is the data quality. Data quality software can conduct cleansing and enrichment of large data sets by utilizing parallel processing. This software is widely used for getting consistent and reliable outputs from big data processing.

In conclusion, Big Data is already being used to improve operational efficiency, and the ability to make informed decisions based on the very latest up-to-the-moment information is rapidly becoming the mainstream norm.

1.6 BIG DATA STACK

The Data Layer

At the bottom of the stack are technologies that store masses of raw data, which comes from traditional sources like OLTP databases, and newer, less structured sources like log files, sensors, web analytics,
document and media archives. Increasingly, storage happens in the cloud or on virtualized local resources. Organizations are moving away from legacy storage, towards commoditized hardware, and more recently to managed services like Amazon S3.

Data Storage Systems

A few examples:

Hadoop HDFS: The classic big data file system. It became popular due to its robustness and limitless scale on commodity hardware. However, it requires a specialized skill set and complex integration of a myriad open source components.

Amazon S3: Create buckets and load data using a variety of integrations. S3 is designed for 99.999999999% durability. It is simple, secure, and provides a quick and cheap solution for storing limitless amounts of big data.

MongoDB: A mature open source document-based database, built to handle data at scale with proven performance. However, some have criticized its use as a first-class data storage system, due to its limited analytical capabilities and, in earlier versions, no support for multi-document transactions.

The Data Ingestion and Integration Layer

To create a big data store, you'll need to import data from its original sources into the data layer. In many cases, to enable analysis, you'll need to ingest data into specialized tools, such as data warehouses. This won't happen without a data pipeline. You can leverage a rich ecosystem of big data integration tools, including powerful open source integration tools, to pull data from sources, transform it, and load it to a target system of your choice.

Fig. 1.5: The Big Data stack, from top to bottom: Analytics and BI Layer (analysis), Processing Layer (crunching), Integration and Ingestion Layer (plumbing), Data Layer (storage)

Big Data Ingestion Tools

A few examples:

Stitch: A lightweight ETL (Extract, Transform, Load) tool which pulls data from multiple pre-integrated data sources, transforms and cleans it as necessary. Stitch is easy to set up, seamless and integrates with dozens of data sources. However, it does not support data transformations.

Blendo: A cloud data integration tool that lets you connect data sources with a few clicks, and pipe them to Amazon Redshift, PostgreSQL, MS SQL Server and Panoply's automated data warehouse. Blendo provides schemas and optimization for email marketing, eCommerce and other big data use cases.

Apache Kafka: An open source streaming messaging bus that creates a feed from your data sources, partitions the data, and streams it to a passive listener. Apache Kafka is a mature and powerful solution used in production at huge scale. However, it is complex to implement, and uses a messaging paradigm most data engineers are not familiar with.

The Data Processing Layer

Thanks to the plumbing, data arrives at its destination. You now need a technology that can crunch the numbers to facilitate analysis. Analysts and data scientists want to run SQL queries against your big data, some of which will require enormous computing power to execute. The data processing layer should optimize the data to facilitate more efficient analysis, and provide a compute engine to run the queries. Data warehouse tools are optimal for processing data at scale, while a data lake is more appropriate for storage, requiring other technologies to assist when data needs to be processed and analyzed.

Data Processing Tools

A few examples:

Apache Spark: Like the old Map/Reduce but up to 100X faster for in-memory workloads. Runs parallelized queries on unstructured distributed data in Hadoop, Mesos, Kubernetes and elsewhere. Spark also provides a SQL interface, but is not natively a SQL engine.

PostgreSQL: Many organizations pipe their data to good old Postgres to facilitate queries. PostgreSQL can be scaled by sharding or partitioning and is very reliable.
Amazon Redshift: Darling of the data industry, a cloud-based petabyte-scale data warehouse that offers blazing query speeds and can be used as a relational database.

Analytics and BI Layer

You've bought the groceries, whipped up a cake and baked it; now you get to eat it! The data layer collected the raw materials for your analysis, the integration layer mixed them all together, and the data processing layer optimized and organized the data and executed the queries. The analytics and BI layer is the real thing: using the data to enable data-driven decisions.

Using the technology in this layer, you can run queries to answer questions the business is asking, slice and dice the data, build dashboards and create beautiful visualizations, using one of many advanced BI tools. Your objective? Answer business questions and provide actionable data which can help the business.

BI/Analytics Tools

A few examples:

Tableau: Powerful BI and data visualization tool, which connects to your data and allows you to drill down, perform complex analysis, and build charts and dashboards.

Chartio: Cloud BI service allowing you to connect data sources, explore data, build SQL queries and transform the data as needed, and create live auto-refreshing dashboards.

Looker: Cloud-based BI platform that lets you query and analyze large data sets via SQL, define metrics once, and set up visualizations that tell a story with your data.

1.7 BIG DATA DISTRIBUTION PACKAGE

Hadoop is the open source software framework at the heart of much of the Big Data and analytics revolution. It provides solutions for enterprise data storage and analytics with almost unlimited scalability. Since its 1.0 release in 2011 it has rapidly grown in popularity, and a strong ecosystem of distributors, vendors and consultants has emerged to support its use across industry.

At its core, Hadoop is an Open Source system, which, among other considerations, means it is essentially free for anyone to use. However, the requirement for it to be aligned to the needs of individual organizations has resulted in the emergence of many commercial distributions. These generally come packaged with support or additional features designed to streamline its deployment or allow users to build additional analytics, security or data handling into their framework.

Competition in this market is fierce and the landscape is constantly shifting. For example, all the top distributions now include the Apache Spark parallel processing framework, whereas a few years ago this was not the case. The growing prominence of Spark has resulted in many vendors increasing the resources dedicated to Spark deployment and support.

One important factor to consider in choosing a Hadoop distribution is whether you want an on-premises or cloud-based solution. If there is no room to compromise when it comes to maintaining complete control and ownership of your data, an on-site solution still theoretically offers the highest level of security. In recent years, though, cloud solutions have become less expensive, more flexible and easier to scale.

Most of the vendor products here can be installed on a cloud or on-premises. However, some cannot be run on-site. These are generally products from web service providers, such as Amazon or Microsoft, running either Hadoop distributions from other, platform-focused vendors such as Hortonworks or MapR, or their own distributions.

Beyond that, all of the top distributions have subtle differences which could make them more or less suitable for your business. Here's a non-exhaustive guide to some of the most popular on the market today.

Cloudera

Cloudera was the first vendor to offer Hadoop as a package and continues to be a leader in the industry. Its Cloudera CDH distribution, which contains all the open source components, is the most popular Hadoop distribution. Cloudera is known for acting quickly to innovate with additions to the core framework: it was the first to offer SQL-for-Hadoop with its Impala query engine. Other additions include a user interface, security and interfaces for integration with third party applications. It offers support for the whole of the
distribution through its Cloudera Enterprise subscription service.

Hortonworks

Hortonworks' platform is entirely open source; in fact, the company is known for making acquisitions of other companies with useful code and releasing it into the open source community. What some have seen as the start of a trend towards consolidation in the market has prompted a growth in popularity of Hortonworks' product. Recently Pivotal stopped development of its own distribution, and both Amazon and IBM are now offering Hortonworks as options on their own cloud platforms, alongside their own Hadoop distributions. Hortonworks' platform is also at the core of the Open Data Platform Initiative, a group looking to simplify and standardize specifications in the Big Data ecosphere. In the long run this is likely to mean it will become even more widely supported.

MapR

Like Hortonworks and Cloudera, MapR is a platform-focused provider, rather than a managed service provider like Amazon or Microsoft (see below). MapR integrates its own database system, MapR-DB, which it claims is between four and seven times faster than the stock Hadoop database HBase running on competing distributions. Due to its power and speed, MapR is often seen as a good choice for the biggest of Big Data projects.

Amazon Elastic Map Reduce

Amazon offers a cloud-only Hadoop-as-a-service platform through its Amazon Web Services arm. A key advantage of the pay-as-you-go model offered by cloud-only service providers is the scalability offered, with storage and data processing able to be ramped up or wound down as demands change. Amazon has recently announced that customers can now use the Apache Flink stream processing framework for real-time data analytics on the platform, along with other popular tools such as Kafka and Presto. It also seamlessly connects (as you would expect) with Amazon's other cloud services infrastructure, such as EC2 for cloud processing, Amazon S3 and DynamoDB for storage and AWS IoT to collect data from Internet of Things-enabled devices.

Microsoft

Microsoft's Azure HDInsight platform is a cloud-only service which offers managed installations of several open source Hadoop distributions including Hortonworks, Cloudera and MapR. It integrates them with its own Azure Data Lake platform to offer a complete solution for cloud-based storage and analytics. As well as the core Hadoop framework, HDInsight provides Spark, Hive, Kafka and Storm services, and its own cloud security framework.

Altiscale

Acquired recently by SAP for $125 million, Altiscale is another company offering cloud-based, managed Hadoop-as-a-service. It continues to offer its Altiscale Data Cloud product, which includes additional operational services like automation, security, scaling and performance-tuning alongside the core Hadoop framework. Data Cloud also provides managed Spark, Hive and Pig services like most of the other products here but, unlike the other as-a-service offerings, uses its own Hadoop distribution rather than that of one of the platform-focused vendors such as Hortonworks or MapR.

EXERCISE

1. Define the term Big Data.
2. Why is Big Data so important?
3. State and explain various sources of big data.
4. Give examples of Big Data formats.
5. What are the advantages of Big Data?
6. Mention any 5 use cases of Big Data.
7. Describe the characteristics of Big Data with suitable examples.
8. Mention any 6 challenges of Big Data.
9. Describe any 5 applications of Big Data.
10. Explain in brief any 3 key technologies that enable Big Data for businesses.
11. Draw and explain the Big Data Stack.
12. Mention various Big Data Distribution packages.
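
As a companion to Exercise 11, the four layers of the Big Data Stack from Section 1.6 can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not how any of the tools named in this unit are implemented: the sample records, the source names and the functions `ingest`, `process` and `analyze` are all hypothetical and were invented for this example.

```python
from collections import defaultdict

# Data layer: raw records as they might arrive from two hypothetical sources.
RAW_SOURCES = {
    "web_log": ["2024-01-01,alice,view", "2024-01-01,bob,purchase", "bad line"],
    "crm_export": ["2024-01-02,ALICE,purchase", "2024-01-02,carol,view"],
}


def ingest(raw_sources):
    """Ingestion layer: extract lines, transform them into dicts, skip malformed rows."""
    records = []
    for source, lines in raw_sources.items():
        for line in lines:
            parts = line.split(",")
            if len(parts) != 3:
                continue  # cleansing step: drop rows that do not have three fields
            date, user, event = parts
            records.append(
                {"source": source, "date": date, "user": user.lower(), "event": event}
            )
    return records


def process(records):
    """Processing layer: a map/reduce-style count of events per (user, event) key."""
    counts = defaultdict(int)
    for rec in records:
        key = (rec["user"], rec["event"])  # "map": derive a key for each record
        counts[key] += 1                   # "reduce": sum the occurrences per key
    return dict(counts)


def analyze(counts):
    """Analytics layer: answer one business question, who purchased and how often."""
    return {user: n for (user, event), n in counts.items() if event == "purchase"}


records = ingest(RAW_SOURCES)
purchases = analyze(process(records))
print(purchases)
```

Each function stands in for one layer: the dictionary of raw lines plays the role of the data layer, `ingest` does the plumbing, `process` does the crunching and `analyze` answers the business question. In a real deployment each step would be handled by dedicated tools of the kind listed in Section 1.6.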
