UNIT 1

Introduction:
Introduction to big data
Data that is very large in size is called Big Data. Normally we work with data of size
MB (Word documents, Excel sheets) or at most GB (movies, source code), but data on the scale of
petabytes, i.e. 10^15 bytes, is called Big Data. It is stated that almost 90% of today's data has
been generated in the past three years.
Sources of Big Data
This data comes from many sources, such as:
Social networking sites: Facebook, Google, and LinkedIn generate huge amounts of data on a
day-to-day basis, as they have billions of users worldwide.
E-commerce sites: Sites like Amazon, Flipkart, and Alibaba generate huge volumes of logs from
which users' buying trends can be traced.
Weather stations: Weather stations and satellites produce very large amounts of data, which are
stored and processed to forecast the weather.
Telecom companies: Telecom giants like Airtel and Vodafone study user trends and publish their
plans accordingly, and for this they store the data of millions of users.
Share market: Stock exchanges across the world generate huge amounts of data through their
daily transactions.
3V's of Big Data
Velocity: The data is increasing at a very fast rate. It is estimated that the volume of data
doubles every two years.
Variety: Nowadays data is not stored only in rows and columns. Data is structured as well as
unstructured. Log files and CCTV footage are unstructured data; data that can be saved in tables,
like a bank's transaction data, is structured data.
Volume: The amount of data we deal with is very large, on the order of petabytes.

Use case
An e-commerce site XYZ (having 100 million users) wants to offer a gift voucher of $100 to
its top 10 customers who spent the most in the previous year. Moreover, it wants to find the
buying trends of these customers so that the company can suggest more items related to them.
Issues
A huge amount of mostly unstructured data needs to be stored, processed, and analyzed.
Solution
Storage: To store this huge amount of data, Hadoop uses HDFS (Hadoop Distributed File System),
which uses commodity hardware to form clusters and stores data in a distributed fashion. It
works on the write once, read many times principle.
Processing: The MapReduce paradigm is applied to the data distributed over the network to
compute the required output, as sketched below.
Analyze: Pig and Hive can be used to analyze the data.
Cost: Hadoop is open source, so cost is no longer an issue.
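To make the processing step concrete, here is a minimal, self-contained Python sketch of the MapReduce idea applied to the use case above: a map phase emits (customer_id, amount) pairs from raw order records, and a reduce phase sums the amounts per customer and picks the top spenders. The record format and field names are assumptions made for illustration; on a real cluster the same mapper/reducer logic would typically run via Hadoop Streaming over HDFS rather than in a single process.

```python
from collections import defaultdict

# Assumed raw order records: "customer_id<TAB>amount" (one per line).
raw_orders = [
    "C101\t250.00",
    "C102\t75.50",
    "C101\t120.00",
    "C103\t980.25",
    "C102\t410.00",
]

def map_phase(lines):
    """Map: parse each record and emit (customer_id, amount) pairs."""
    for line in lines:
        customer_id, amount = line.split("\t")
        yield customer_id, float(amount)

def reduce_phase(pairs, top_n=10):
    """Reduce: sum amounts per customer and keep the top spenders."""
    totals = defaultdict(float)
    for customer_id, amount in pairs:
        totals[customer_id] += amount
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

if __name__ == "__main__":
    for customer_id, total in reduce_phase(map_phase(raw_orders)):
        print(customer_id, total)
```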
Introduction to Big Data Platform
The constant stream of information from various sources is becoming more intense[4],
especially with advances in technology. This is where big data platforms come in: they
store and analyze the ever-increasing mass of information.
A big data platform is an integrated computing solution that combines numerous software
systems, tools, and hardware for big data management. It is a one-stop architecture that
addresses all the data needs of a business regardless of the volume and size of the data at hand.
Because of their efficiency in data management, enterprises are increasingly adopting big data
platforms to gather large amounts of data and convert them into structured, actionable business
insights[5].
Currently, the marketplace is flooded with numerous open-source and commercially available
big data platforms. They offer different features and capabilities for use in a big data
environment.
Challenges of Conventional Systems
Uncertainty of the data management landscape: Because big data is continuously expanding,
new companies and technologies are being developed every day. A big challenge for companies is
to find out which technology works best for them without introducing new risks and problems.
The big data talent gap: While big data is a growing field, there are very few experts available
in it. This is because big data is a complex field, and people who understand its complexity and
intricate nature are few and far between.
Getting data into the big data platform: Data is increasing every single day. This means that
companies have to handle a near-limitless amount of data on a regular basis. The scale and
variety of data available today can overwhelm any data practitioner, which is why it is important
to make data accessibility simple and convenient for brand managers and owners.
Need for synchronisation across data sources: As data sets become more diverse, there is a need
to incorporate them into an analytical platform. If this is ignored, it can create gaps and lead
to wrong insights and messages.
Getting important insights through the use of big data analytics: It is important that companies
gain proper insights from big data analytics, and it is important that the correct department has
access to this information. A major challenge in big data analytics is bridging this gap in an
effective fashion.
Intelligent data analysis
• Intelligent data analysis is the process of finding and identifying the meaning of data.
• The main advantage of visual representations is to discover, make sense of, and
communicate data.
Data
Data is anything that is known or assumed; facts from which conclusions can be drawn.
Data Analysis
• Breaking data up into parts and examining those parts to learn about their nature,
proportion, function, interrelationships, etc.
• A process in which the analyst moves laterally and recursively between three modes:
describing data (profiling, correlating, summarizing), assembling data (scrubbing,
translating, synthesizing, filtering), and creating data (deriving, formulating,
simulating).
• In short, it is a way of making sense of data: the process of finding and identifying its meaning.
Data Visualization
• It is a process of revealing already existing data and/or its features (origin, metadata,
allocation), using everything from tables to charts and multidimensional animation
(Min Yao, 2014).
• It forms a mental image of something that is not directly visible.
• Visual data analysis is another form of data analysis, in which some or all forms of data
visualization may be used to give feedback to the analyst. Visual cues such as charts,
interactive browsing, and workflow process cues help the analyst move through the
modes of data analysis.
• The main advantage of visual representations is to discover, make sense of, and communicate
data. Data visualization is a central part of, and an essential means to carry out, data
analysis; once the important meanings have been identified and understood, it is easy to
communicate them to others.
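As a small illustration of these ideas, the Python sketch below (using matplotlib, with entirely made-up daily sales figures) draws a simple chart so that the trend and any outliers can be spotted by eye before any formal analysis is attempted.

```python
import random
import matplotlib.pyplot as plt

# Hypothetical daily sales figures for one month.
days = list(range(1, 31))
sales = [100 + d * 3 + random.gauss(0, 10) for d in days]

plt.plot(days, sales, marker="o")
plt.xlabel("Day of month")
plt.ylabel("Sales")
plt.title("Daily sales - visual inspection of trend and outliers")
plt.show()
```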
Importance of IDA:
Intelligent Data Analysis (IDA) is one of the major topics in artificial intelligence and
information science. Intelligent data analysis discloses hidden facts that were not previously
known and provides potentially important information or facts from large quantities of data
(White, 2008). It also helps in decision making. Based mainly on machine learning, artificial
intelligence, pattern recognition, and records and visualization technology, IDA helps to obtain
useful information, necessary data and interesting models from the large amounts of data available
online, in order to make the right choices.
Intelligent data analysis also helps with problems that are already solved as a matter of routine:
if data is collected for past cases together with the results that were finally achieved, such
data can be used to revise and optimize the currently used strategy for arriving at a conclusion.
In other cases, when a question arises for the first time and only a little is known about it,
data from related situations helps to solve the new problem, or unknown relationships can be
discovered from the data to gain knowledge in an unfamiliar area.
Steps Involved In IDA:
IDA, in general, includes three stages: (1) data preparation; (2) data mining; (3) data
validation and explanation (Keim & Ward, 2007). Data preparation involves selecting the
required data from the relevant data source and incorporating it into a data set that can be
used for data mining.
The main goal of intelligent data analysis is to obtain knowledge. Data analysis is a combination
of extracting data from a data set, analyzing it, classifying it, organizing it, reasoning about
it, and so on. It is challenging to choose methods that suit the complexity of the process.
Regarding the term visualization, we have moved away from it in favour of the term charting.
The term analysis is used for the process of integrating, manipulating, filtering and scrubbing
the data, which certainly includes, but is not limited to, interacting with the data through charts.
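The three stages can be sketched in a few lines of Python. The example below uses pandas and scikit-learn (assumed to be installed) and an invented toy customer table: stage 1 prepares the data, stage 2 mines it with a simple clustering model, and stage 3 inspects the resulting segments as a basic validation and explanation step.

```python
import pandas as pd
from sklearn.cluster import KMeans

# 1) Data preparation: select the required columns and drop incomplete rows.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "total_spend": [120.0, 80.0, 300.0, None, 150.0, 90.0],
    "orders":      [3, 2, 9, 1, 5, 2],
})
prepared = df.dropna(subset=["total_spend"])[["total_spend", "orders"]]

# 2) Data mining: cluster customers into spending segments.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(prepared)
prepared = prepared.assign(segment=model.labels_)

# 3) Validation and explanation: inspect each segment to judge whether it makes sense.
print(prepared.groupby("segment").mean())
```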
The Goal of Data Analysis:
Data analysis does not necessarily involve advanced arithmetic or statistics. While it is true
that analysis often involves one or both, and that many analytical pursuits cannot be handled
without them, much of the data analysis people perform in the course of their work involves
mathematics no more complicated than calculating the mean of a set of values. The essential
activity of analysis is comparison (of values, patterns, etc.), which can often be done simply
by using our eyes.
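For instance, the whole "analysis" below is nothing more than computing two means and comparing them (the spend figures are invented):

```python
from statistics import mean

# Hypothetical average order values for the same customers in two years.
last_year = [120.0, 95.0, 180.0, 60.0]
this_year = [150.0, 110.0, 160.0, 90.0]

difference = mean(this_year) - mean(last_year)
print(f"average spend changed by {difference:+.2f}")  # positive => spend went up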
The aim of analysis is not merely to find interesting information in the data; that is only
a part of the process (Berthold & Hand, 2003). The aim is to make sense of the data (i.e., to
understand what it means) and then to make decisions based on that understanding. Information
in and of itself is not useful. Even understanding information, in and of itself, is not useful.
The aim of data analysis is to make better decisions.
The process of data analysis starts with the collection of data that can add to the solution of
any given problem, and with the organization of that data in some regular form. It involves
identifying and applying a statistical or deterministic schema or model of the data that can be
manipulated for explanatory or predictive purposes. It then involves an interactive or automated
solution that explores the structured data in order to extract information – a solution to the
business problem – from the data.
The Goal of Visualization
The basic idea of visual data mining is to present the data in some visual form, allowing the
user to gain insight into the data, draw conclusions, and directly interact with the data. Visual
data analysis techniques have proven to be of high value in exploratory data analysis. Visual
data mining is especially helpful when little is known about the data and the exploration goals
are vague.
The main advantages of visual data exploration over automatic data analysis methods are:
• Visual data exploration can easily deal with highly non-homogeneous and noisy data.
• Visual data exploration is intuitive and requires no knowledge of complex mathematical
or statistical algorithms or parameters.
• Visualization can present a qualitative overview of the data, allowing data phenomena to
be isolated for further quantitative analysis. Accordingly, visual data exploration
usually allows a quicker data investigation and often provides interesting results,
especially in cases where automatic algorithms fail.
• Visual data exploration techniques provide a much higher degree of confidence in the
findings of the exploration.
Conclusion
The examination of large data sets is an important but complicated problem. Information
visualization techniques can be helpful in solving it. Visual data investigation is useful for
many purposes, such as fraud detection, and data mining can make use of data visualization
technology for improved data analysis.
Nature of Data
Big Data Characteristics
Big Data involves large amounts of data that cannot be processed by traditional data storage
or processing units. It is used by many multinational companies to process data and run the
business of many organizations. The global data flow is expected to exceed 150 exabytes per day
before replication.
There are five V's of Big Data that describe its characteristics.
5 V's of Big Data
o Volume
o Veracity
o Variety
o Value
o Velocity

Volume
The name Big Data itself relates to enormous size. Big Data refers to the vast 'volumes' of data
generated daily from many sources, such as business processes, machines, social media
platforms, networks, human interactions, and many more.
Facebook alone generates approximately a billion messages a day, the "Like" button is recorded
around 4.5 billion times, and more than 350 million new posts are uploaded each day. Big data
technologies are designed to handle such large amounts of data.
Variety
Big Data can be structured, unstructured, or semi-structured, collected from different sources.
In the past, data was collected only from databases and spreadsheets, but these days data comes
in an array of forms: PDFs, emails, audio, social media posts, photos, videos, etc.

The data is categorized as below:


a. Structured data: Data that follows a fixed schema, with all the required columns, and is in
tabular form. Structured data is stored in a relational database management system; OLTP
(Online Transaction Processing) systems are built to work with this kind of data, stored in
relations, i.e., tables.
b. Semi-structured data: Data whose schema is not strictly defined, e.g., JSON, XML, CSV, TSV,
and email.
c. Unstructured data: Unstructured files such as log files, audio files, and image files. Some
organizations have a lot of this data available, but they do not know how to derive value from
it because the data is raw.
d. Quasi-structured data: Textual data with inconsistent formats that can be structured only
with some effort, time, and tools.
Example: web server logs, i.e., log files created and maintained by a server that contain a
list of activities.
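To illustrate the difference in practice, the short Python sketch below (standard library only, with made-up records) parses a semi-structured JSON order directly, while a quasi-structured web-server log line has to be teased apart with a regular expression before its fields can be used.

```python
import json
import re

# Semi-structured: a JSON record carries its own (loose) schema.
order = json.loads('{"customer_id": "C42", "amount": 99.5, "items": ["book", "pen"]}')

# Quasi-structured: a web-server log line has a format, but it must be extracted with a pattern.
log_line = '127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /cart HTTP/1.1" 200 2326'
match = re.search(r'"(?P<method>\w+) (?P<path>\S+) [^"]+" (?P<status>\d{3})', log_line)

print(order["customer_id"], order["amount"])
print(match.group("method"), match.group("path"), match.group("status"))
```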
Veracity
Veracity refers to how reliable the data is, and to the many ways of filtering or translating
the data. Veracity is about being able to handle and manage data efficiently, which is essential
in business development.
A typical example is Facebook posts with hashtags.
Value
Value is an essential characteristic of big data. It is not the sheer amount of data we process
or store that matters; it is the valuable and reliable data that we store, process, and analyze.

Velocity
Velocity plays an important role compared to the other characteristics. Velocity refers to the
speed at which data is created in real time. It covers the speed of incoming data sets, the rate
of change, and activity bursts. A primary aspect of Big Data is providing the demanded data
rapidly.
Big data velocity deals with the speed at which data flows in from sources like application logs,
business processes, networks, social media sites, sensors, mobile devices, etc.
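For a rough feel of what velocity means in code, the self-contained Python sketch below simulates a stream of incoming events and measures how many events per second a trivial consumer can absorb (the event shape and counts are invented):

```python
import random
import time

def event_stream(n):
    """Simulate incoming events, e.g. application log records or sensor readings."""
    for _ in range(n):
        yield {"user": random.randint(1, 100), "bytes": random.randint(100, 1500)}

start = time.perf_counter()
count = sum(1 for _ in event_stream(100_000))  # trivial consumer: just count events
elapsed = time.perf_counter() - start
print(f"processed {count} events at {count / elapsed:,.0f} events/second")
```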

Analytic Processes and Tools

There are hundreds of data analytics tools in the market today, but selecting the right tool
depends on your business needs, goals, and the variety of your data, so that it takes the
business in the right direction. Now, let's look at the top 10 big data analytics tools.
1. APACHE Hadoop
It’s a Java-based open-source platform that is being used to store and process big data. It is
built on a cluster system that allows the system to process data efficiently and let the data run
parallel. It can process both structured and unstructured data from one server to multiple
computers. Hadoop also offers cross-platform support for its users. Today, it is the best big
data analytic tool and is popularly used by many tech giants such as Amazon, Microsoft,
IBM, etc.
Features of Apache Hadoop:
• Free to use and offers an efficient storage solution for businesses.
• Offers quick access via HDFS (Hadoop Distributed File System).
• Highly flexible and can be easily implemented with MySQL, and JSON.
• Highly scalable as it can distribute a large amount of data in small segments.
• It works on small commodity hardware like JBOD or a bunch of disks.
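As a small illustration of storing and reading a file in HDFS from Python, the sketch below uses the third-party hdfs package (HdfsCLI) against WebHDFS; the NameNode URL, port, user, and paths are assumptions that depend on your cluster setup.

```python
from hdfs import InsecureClient  # pip install hdfs

# Assumed WebHDFS endpoint and user; adjust for your cluster (port 9870 on Hadoop 3.x).
client = InsecureClient("http://namenode:9870", user="hadoop")

# Write a small text file into HDFS, then list the directory and read the file back.
client.write("/user/hadoop/demo/orders.txt",
             data="C101\t250.00\nC102\t75.50\n", overwrite=True)
print(client.list("/user/hadoop/demo"))

with client.read("/user/hadoop/demo/orders.txt") as reader:
    print(reader.read().decode("utf-8"))
```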
2. Cassandra
APACHE Cassandra is an open-source NoSQL distributed database used to handle large amounts of
data. It is one of the most popular tools for data analytics and has been praised by many tech
companies for its high scalability and availability without compromising speed and performance.
It can deliver thousands of operations every second and can handle petabytes of data with almost
zero downtime. It was created at Facebook and released publicly in 2008.
Features of APACHE Cassandra:
• Data storage flexibility: It supports all forms of data, i.e. structured, unstructured, and
semi-structured, and allows users to change the data model as their needs change.
• Data distribution system: Data is easy to distribute by replicating it across multiple
data centers.
• Fast processing: Cassandra is designed to run on efficient commodity hardware and offers
fast storage and data processing.
• Fault tolerance: If any node fails, it is replaced without delay.
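A minimal sketch of talking to Cassandra from Python with the DataStax cassandra-driver package is shown below; the contact point, keyspace, and table are invented for illustration and assume a single local node.

```python
from cassandra.cluster import Cluster  # pip install cassandra-driver

cluster = Cluster(["127.0.0.1"])  # assumed local Cassandra node
session = cluster.connect()

session.execute(
    "CREATE KEYSPACE IF NOT EXISTS shop "
    "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}"
)
session.set_keyspace("shop")
session.execute(
    "CREATE TABLE IF NOT EXISTS orders ("
    "customer_id text, order_id uuid, amount double, "
    "PRIMARY KEY (customer_id, order_id))"
)

# Insert one order and read back all orders for that customer.
session.execute(
    "INSERT INTO orders (customer_id, order_id, amount) VALUES (%s, uuid(), %s)",
    ("C101", 250.0),
)
for row in session.execute("SELECT * FROM orders WHERE customer_id = %s", ("C101",)):
    print(row.customer_id, row.order_id, row.amount)

cluster.shutdown()
```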
3. Qubole
It’s an open-source big data tool that helps in fetching data in a value of chain using ad-hoc
analysis in machine learning. Qubole is a data lake platform that offers end-to-end service
with reduced time and effort which are required in moving data pipelines. It is capable of
configuring multi-cloud services such as AWS, Azure, and Google Cloud. Besides, it also
helps in lowering the cost of cloud computing by 50%.
Features of Qubole:
• Supports ETL process: It allows companies to migrate data from multiple sources in one
place.
• Real-time Insight: It monitors user’s systems and allows them to view real-time insights
• Predictive Analysis: Qubole offers predictive analysis so that companies can take actions
accordingly for targeting more acquisitions.
• Advanced Security System: To protect users’ data in the cloud, Qubole uses an advanced
security system and also ensures to protect any future breaches. Besides, it also allows
encrypting cloud data from any potential threat.
4. Xplenty
It is a data analytics tool for building data pipelines with minimal code. It offers a wide
range of solutions for sales, marketing, and support. With the help of its interactive graphical
interface, it provides solutions for ETL, ELT, etc. The best part of using Xplenty is its low
investment in hardware and software, and it offers support via email, chat, telephone, and
virtual meetings. Xplenty is a platform to process data for analytics over the cloud and brings
all the data together.
Features of Xplenty:
• REST API: A user can do almost anything by using the REST API.
• Flexibility: Data can be sent and pulled to databases, warehouses, and Salesforce.
• Data security: It offers SSL/TLS encryption, and the platform is capable of verifying
algorithms and certificates regularly.
• Deployment: It offers integration apps for both cloud and on-premises use and supports
deployment of integration apps over the cloud.
5. Spark
APACHE Spark is another framework used to process data and perform numerous tasks on a large
scale. It processes data across multiple computers using distributed computing. It is widely
used among data analysts as it offers easy-to-use APIs with simple data-pulling methods, and it
is capable of handling multiple petabytes of data as well. Spark set a sort benchmark record by
processing 100 terabytes of data in just 23 minutes, beating Hadoop's previous world record of
71 minutes. This is a key reason why big tech companies are moving towards Spark, and why it is
highly suitable for ML and AI today.
Features of APACHE Spark:
• Ease of use: It allows users to program in their preferred language (Java, Python, etc.).
• Real-time processing: Spark can handle real-time streaming via Spark Streaming.
• Flexible: It can run on Mesos, Kubernetes, or in the cloud.
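The PySpark snippet below is a minimal sketch of the earlier top-customers use case on Spark; the input path and the column names (customer_id, amount) are assumptions about the data layout.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("TopCustomers").getOrCreate()

# Assumed JSON-lines input with fields "customer_id" and "amount".
orders = spark.read.json("hdfs:///data/orders.json")

top10 = (
    orders.groupBy("customer_id")
    .agg(F.sum("amount").alias("total_spent"))
    .orderBy(F.desc("total_spent"))
    .limit(10)
)
top10.show()

spark.stop()
```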
6. Mongo DB
MongoDB, which came into the limelight around 2010, is a free, open-source, document-oriented
(NoSQL) database used to store high volumes of data. It uses collections and documents for
storage, and a document consists of key-value pairs, which are the basic unit of MongoDB. It is
popular among developers because drivers are available for multiple programming languages such
as Python, JavaScript, and Ruby.
Features of MongoDB:
• Written in C++: It is a schema-less database and can hold a variety of documents.
• Simplifies the stack: With MongoDB, a user can easily store files without disturbing the stack.
• Master-slave replication: It can write/read data from the master, and the data can be called
back for backup.
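Below is a minimal sketch of using MongoDB from Python via the pymongo driver; the connection string, database, collection, and document fields are all invented for illustration.

```python
from pymongo import MongoClient  # pip install pymongo

client = MongoClient("mongodb://localhost:27017")  # assumed local instance
orders = client["shop"]["orders"]

# Documents are schema-less key-value structures.
orders.insert_many([
    {"customer_id": "C101", "amount": 250.0, "items": ["book", "pen"]},
    {"customer_id": "C102", "amount": 75.5, "items": ["mug"]},
    {"customer_id": "C101", "amount": 120.0, "items": ["lamp"]},
])

# Aggregate total spend per customer, highest first.
pipeline = [
    {"$group": {"_id": "$customer_id", "total": {"$sum": "$amount"}}},
    {"$sort": {"total": -1}},
    {"$limit": 10},
]
for doc in orders.aggregate(pipeline):
    print(doc["_id"], doc["total"])
```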
7. Apache Storm
Apache Storm is a robust, user-friendly tool used for data analytics, especially in small
companies. The best part about Storm is that it has no programming-language barrier and can work
with many languages. It was designed to handle pools of large data with fault tolerance and
horizontal scalability. When we talk about real-time data processing, Storm leads the chart
because of its distributed real-time processing system, which is why many tech giants use
APACHE Storm in their systems. Some of the most notable names are Twitter, Zendesk, NaviSite, etc.
Features of Storm:
• Data processing: Storm processes data even if a node gets disconnected.
• Highly scalable: It maintains performance even as the load increases.
• Fast: APACHE Storm is impressively fast and can process up to 1 million messages of 100 bytes
each on a single node.
8. SAS
Today it is one of the best tools for statistical modeling used by data analysts. Using SAS, a
data scientist can mine, manage, extract, or update data in different variants from different
sources. SAS (Statistical Analysis System) allows a user to access data in almost any format
(SAS tables, Excel worksheets, etc.). Besides that, it also offers a cloud platform for business
analytics called SAS Viya, and to get a strong grip on AI and ML, SAS has introduced new tools
and products.
Features of SAS:
• Flexible programming language: It offers easy-to-learn syntax and vast libraries, which make
it suitable for non-programmers.
• Vast data format support: It provides support for many programming languages, including SQL,
and can read data from almost any format.
• Encryption: It provides end-to-end security with a feature called SAS/SECURE.
9. Data Pine
Datapine is an analytics tool used for BI and was founded in 2012 in Berlin, Germany. In a short
period of time it has gained popularity in a number of countries, and it is mainly used for data
extraction (for small and medium-sized companies fetching data for close monitoring). With the
help of its enhanced UI design, anyone can view and check the data as needed; it is offered in
four different price brackets, starting at $249 per month. Datapine also offers dashboards by
function, industry, and platform.
Features of Datapine:
• Automation: To cut down on manual work, datapine offers a wide array of AI assistant and
BI tools.
• Predictive tool: datapine provides forecasting/predictive analytics; using historical and
current data, it derives future outcomes.
• Add-ons: It also offers intuitive widgets, visual analytics and discovery, ad hoc reporting, etc.
10. Rapid Miner
It’s a fully automated visual workflow design tool used for data analytics. It’s a no-code
platform and users aren’t required to code for segregating data. Today, it is being heavily
used in many industries such as ed-tech, training, research, etc. Though it’s an open-source
platform but has a limitation of adding 10000 data rows and a single logical processor.
With the help of Rapid Miner, one can easily deploy their ML models to the web or mobile
(only when the user interface is ready to collect real-time figures).
Features of Rapid Miner:
• Accessibility: It allows users to access 40+ types of files (SAS, ARFF, etc.) via URL
• Storage: Users can access cloud storage facilities such as AWS and dropbox
• Data validation: Rapid miner enables the visual display of multiple results in history for
better evaluation.
Analysis vs Reporting

Analytics: Analytics is the method of examining and analyzing summarized data to make business
decisions. Questioning the data, understanding it, investigating it, and presenting it to the
end users are all part of analytics. The purpose of analytics is to draw conclusions based on
data. Analytics is used by data analysts, scientists, and business people to make effective
decisions.

Reporting: Reporting is an action that includes all the needed information and data, put together
in an organized way. Identifying business events, gathering the required information, organizing,
summarizing, and presenting existing data are all part of reporting. The purpose of reporting is
to organize the data into meaningful information. Reporting is provided to the appropriate
business leaders so that they can perform effectively and efficiently within a firm.