Unit 1
Introduction to Big Data
Data that is very large in size is called Big Data. We normally work with data on the scale of
MB (Word documents, Excel sheets) or at most GB (movies, code repositories), but data on the
scale of petabytes, i.e. 10^15 bytes, is called Big Data. It is often stated that almost 90% of
today's data has been generated in the past three years.
Sources of Big Data
These data come from many sources:
Social networking sites: Facebook, Google, and LinkedIn generate huge amounts of data every
day, as they have billions of users worldwide.
E-commerce sites: Sites like Amazon, Flipkart, and Alibaba generate huge volumes of logs from
which users' buying trends can be traced.
Weather stations: Weather stations and satellites produce very large volumes of data, which
are stored and processed to forecast the weather.
Telecom companies: Telecom giants like Airtel and Vodafone study user trends and publish
their plans accordingly; for this they store the data of millions of users.
Share markets: Stock exchanges across the world generate huge amounts of data through their
daily transactions.
3V's of Big Data
Velocity: Data is growing at a very fast rate. It is estimated that the volume of data doubles
every two years.
Variety: Nowadays data is not stored only in rows and columns; it is both structured and
unstructured. Log files and CCTV footage are unstructured data, while data that can be stored
in tables, such as bank transaction records, is structured.
Volume: The amount of data we deal with is very large, on the order of petabytes.
Use case
An e-commerce site XYZ (with 100 million users) wants to offer a gift voucher of $100 to its
top 10 customers, i.e. those who spent the most in the previous year. Moreover, it wants to find
the buying trends of these customers so that the company can suggest more items related to
their interests.
Issues
A huge amount of unstructured data needs to be stored, processed, and analyzed.
Solution
Storage: To store this huge amount of data, Hadoop uses HDFS (Hadoop Distributed File
System), which uses commodity hardware to form clusters and stores data in a distributed
fashion. It works on the "write once, read many times" principle.
Processing: The MapReduce paradigm is applied to the data distributed over the network to
compute the required output.
Analysis: Pig and Hive can be used to analyze the data.
Cost: Hadoop is open source, so cost is no longer an issue.
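The top-10-customers use case above maps naturally onto the MapReduce paradigm named in the solution. Below is a minimal sketch in plain Python that simulates the map and reduce phases on a single machine (the transaction records and field layout are invented for illustration); on a real cluster the same mapper/reducer logic would run in parallel over HDFS blocks, e.g. via Hadoop Streaming.

```python
from collections import defaultdict
import heapq

def map_phase(transactions):
    """Map: emit a (customer_id, amount) pair for each transaction record."""
    for customer_id, amount in transactions:
        yield customer_id, amount

def reduce_phase(pairs, top_n=10):
    """Reduce: sum the spend per customer, then keep the top-N spenders."""
    totals = defaultdict(float)
    for customer_id, amount in pairs:
        totals[customer_id] += amount
    return heapq.nlargest(top_n, totals.items(), key=lambda kv: kv[1])

# Toy transaction log: (customer_id, amount spent).
transactions = [("u1", 250.0), ("u2", 90.0), ("u1", 120.0), ("u3", 400.0)]
top_customers = reduce_phase(map_phase(transactions), top_n=2)
print(top_customers)  # [('u3', 400.0), ('u1', 370.0)]
```

The split matters because the map phase can run independently on each data block, so the work scales out with the cluster.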
Introduction to Big Data Platform
The constant stream of information from various sources is becoming more intense[4],
especially with advances in technology. This is where big data platforms come in: they store
and analyze the ever-increasing mass of information.
A big data platform is an integrated computing solution that combines numerous software
systems, tools, and hardware for big data management. It is a one-stop architecture that solves
all the data needs of a business regardless of the volume of the data at hand. Due to their
efficiency in data management, enterprises are increasingly adopting big data platforms to
gather large amounts of data and convert them into structured, actionable business insights[5].
Currently, the marketplace is flooded with numerous open-source and commercially available
big data platforms. They boast different features and capabilities for use in a big data
environment.
Challenges of Conventional Systems
Uncertainty of Data Management Landscape: Because big data is continuously expanding,
there are new companies and technologies that are being developed everyday. A big challenge
for companies is to find out which technology works bests for them without the introduction
of new risks and problems.The Big Data Talent Gap: While Big Data is a growing field, there
are very few experts available in this field. This is because Big data is a complex field and
people who understand the complexity and intricate nature of this field are far few and between.
Another major challenge in the field is the talent gap that exists in the industryGetting data into
the big data platform: Data is increasing every single day. This means that companies have to
tackle limitless amount of data on a regular basis. The scale and variety of data that is available
today can overwhelm any data practitioner and that is why it is important to make data
accessibility simple and convenient for brand mangers and owners.Need for synchronisation
across data sources: As data sets become more diverse, there is a need to incorporate them into
an analytical platform. If this is ignored, it can create gaps and lead to wrong insights and
messages.Getting important insights through the use of Big data analytics: It is important that
companies gain proper insights from big data analytics and it is important that the correct
department has access to this information. A major challenge in the big data analytics is
bridging this gap in an effective fashion.
Intelligent data analysis
• The process of finding and identifying the meaning of data.
• The main advantage of visual representations is that they help us discover patterns, make
sense of data, and communicate it.
Data
Data is anything known or assumed: facts from which conclusions can be drawn.
Data Analysis
• Breaking up data into parts, i.e. the examination of these parts to learn about their
nature, proportion, function, interrelationships, etc.
• A process in which the analyst moves laterally and recursively between three modes:
describing data (profiling, correlating, summarizing), assembling data (scrubbing,
translating, synthesizing, filtering), and creating data (deriving, formulating,
simulating).
• The act of making sense of data: the process of finding and identifying its meaning.
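The three modes above (describing, assembling, creating) can be sketched in plain Python on a toy dataset (the records and column meanings are invented for illustration):

```python
# Toy dataset: (product, units_sold, unit_price).
rows = [("pen", 10, 1.5), ("book", 4, 12.0), ("pen", -3, 1.5), ("lamp", 2, 25.0)]

# Assembling (scrubbing/filtering): drop invalid records with negative quantities.
clean = [r for r in rows if r[1] >= 0]

# Creating (deriving): add a new field, revenue = units_sold * unit_price.
with_revenue = [(p, u, pr, u * pr) for p, u, pr in clean]

# Describing (summarizing): total revenue across all valid records.
total = sum(rev for *_, rev in with_revenue)
print(round(total, 2))  # 113.0
```

In practice the analyst cycles between these modes rather than running them once in a fixed order.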
Data Visualization
• It is the process of revealing existing data and/or its features (origin, metadata,
distribution), using everything from tables to charts and multidimensional animation
(Min Yao, 2014).
• It forms a mental image of something not directly visible to the eye.
• Visual data analysis is a form of data analysis in which some or all forms of data
visualization are used to give feedback to the analyst, using visual cues such as charts,
interactive browsing, and workflow process cues to help the analyst move through the
modes of data analysis.
• The main advantage of visual representations is that they help us discover patterns,
make sense of data, and communicate it. Data visualization is a central part of, and an
essential means to carry out, data analysis; once the meanings have been identified and
understood, it is easy to communicate them to others.
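As a minimal illustration of moving from a table to a visual form, the sketch below renders a toy label/value table as a horizontal text bar chart (the sales figures are invented); in practice the same comparison would be drawn with a charting library.

```python
def bar_chart(data, width=20):
    """Render (label, value) pairs as a horizontal text bar chart."""
    peak = max(value for _, value in data)
    lines = []
    for label, value in data:
        bar = "#" * round(width * value / peak)  # scale bars to the largest value
        lines.append(f"{label:<8}{bar} {value}")
    return "\n".join(lines)

sales = [("Q1", 120), ("Q2", 300), ("Q3", 180)]
print(bar_chart(sales))
```

Even this crude chart makes the comparison between quarters immediate, which is the point of the bullets above: the eye does the analysis.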
Importance of IDA:
Intelligent Data Analysis (IDA) is one of the major topics in artificial intelligence and
information science. IDA discloses hidden facts that were not previously known and provides
potentially important information from large quantities of data (White, 2008). It also helps in
decision making. Drawing mainly on machine learning, artificial intelligence, pattern
recognition, and records and visualization technology, IDA helps to obtain useful information,
necessary data, and interesting models from the large amounts of data available online, in
order to make the right choices.
Intelligent data analysis helps to solve problems that are already solved as a matter of routine.
If data has been collected for past cases together with the results that were finally achieved,
such data can be used to revise and optimize the current strategy for arriving at a conclusion.
In other cases, when a question arises for the first time and only a little knowledge about it is
available, data from related situations can help to solve the new problem, or previously
unknown relationships can be discovered from the data to gain knowledge in an unfamiliar area.
Steps Involved In IDA:
IDA, in general, includes three stages: (1) data preparation; (2) data mining; (3) data
validation and explanation (Keim & Ward, 2007). Data preparation involves selecting the
required data from the relevant data sources and integrating it into a data set that can be
used for data mining.
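The three stages can be sketched as a toy pipeline in Python (the records, the trivial "rule" learned in the mining stage, and the validation check are all invented stand-ins for real data-mining algorithms):

```python
# Stage 1 - data preparation: select usable records from the raw source.
raw = [{"age": 25, "bought": 1}, {"age": None, "bought": 0},
       {"age": 40, "bought": 1}, {"age": 19, "bought": 0}]
prepared = [r for r in raw if r["age"] is not None]

# Stage 2 - data mining: learn a trivial rule, "buyers are at least this old".
threshold = min(r["age"] for r in prepared if r["bought"])

# Stage 3 - validation and explanation: check how well the rule fits the data.
correct = sum((r["age"] >= threshold) == bool(r["bought"]) for r in prepared)
accuracy = correct / len(prepared)
print(f"rule: age >= {threshold}, accuracy: {accuracy:.2f}")
```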
The main goal of intelligent data analysis is to obtain knowledge. Data analysis combines
extracting data from a data set, analyzing it, classifying it, organizing it, reasoning about it,
and so on. It is challenging to choose methods suited to the complexity of the process.
Regarding the term visualization, we have moved away from it in favour of the term charting.
The term analysis is used for the process of integrating, manipulating, filtering, and scrubbing
the data, which certainly includes, but is not limited to, interacting with the data through
charts.
The Goal of Data Analysis:
Data analysis does not necessarily involve arithmetic or statistics. While analysis often
involves one or both, and many analytical tasks cannot be handled without them, much of the
data analysis that people perform in the course of their work involves mathematics no more
complicated than calculating the mean of a set of values. The essential activity of analysis is
comparison (of values, patterns, etc.), which can often be done simply by using our eyes.
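Consistent with this point, the sketch below performs a complete (if tiny) piece of analysis using nothing beyond a mean and a comparison (the two series of values are invented):

```python
def mean(values):
    """Arithmetic mean of a sequence of numbers."""
    return sum(values) / len(values)

this_year = [102, 98, 110, 95]   # e.g. monthly sales figures
last_year = [90, 85, 88, 92]

# The essential act of analysis: a comparison of two summary values.
difference = mean(this_year) - mean(last_year)
print(difference > 0)  # True: this year's average exceeds last year's
```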
The aim of analysis is not merely to find interesting information in the data; that is only one
part of the process (Berthold & Hand, 2003). The aim is to make sense of the data (i.e., to
understand what it means) and then to make decisions based on the understanding achieved.
Information in and of itself is not useful; even understanding information in and of itself is
not useful. The aim of data analysis is to make better decisions.
The process of data analysis starts with the collection of data that can add to the solution of
any given problem, and with the organization of that data in some regular form. It involves
identifying and applying a statistical or deterministic schema or model of the data that can be
manipulated for explanatory or predictive purposes. It then involves an interactive or automated
solution that explores the structured data in order to extract information – a solution to the
business problem – from the data.
The Goal of Visualization
The basic idea of visual data mining is to present the data in some visual form, allowing the
user to gain insight into the data, draw conclusions, and directly interact with the data. Visual
data analysis techniques have proven to be of high value in exploratory data analysis. Visual
data mining is mainly helpful when little is known about the data and the exploration goals
are vague.
The main advantages of visual data exploration over automatic data analysis methods are:
• Visual data exploration can easily deal with highly non-homogeneous and noisy data.
• Visual data exploration is intuitive and requires no knowledge of complex
mathematical or statistical algorithms or parameters.
• Visualization can present a qualitative overview of the data, letting data phenomena
be isolated for further quantitative analysis. Accordingly, visual data exploration
usually allows a quicker data investigation and often provides interesting results,
especially in cases where automatic algorithms fail.
• Visual data exploration techniques provide a much higher degree of confidence in the
findings of the exploration.
Conclusion
The examination of large data sets is an important but complicated problem. Information
visualization techniques can help solve it. Visual data exploration is useful for many purposes;
for example, fraud detection systems and data mining can make use of data visualization
technology for improved data analysis.
Nature of Data
Big Data Characteristics
Big Data refers to large volumes of data that cannot be processed by traditional data storage
or processing units. It is used by many multinational companies to process data and run the
operations of many organizations. Global data flow is estimated to exceed 150 exabytes per
day before replication.
There are five v's of Big Data that explains the characteristics.
5 V's of Big Data
o Volume
o Veracity
o Variety
o Value
o Velocity
Volume
The name Big Data itself refers to enormous size. Big Data means vast volumes of data
generated daily from many sources, such as business processes, machines, social media
platforms, networks, human interactions, and more.
Facebook alone generates approximately a billion messages, records roughly 4.5 billion clicks
of the "Like" button, and receives more than 350 million new posts each day. Big data
technologies can handle such large amounts of data.
Variety
Big Data can be structured, unstructured, or semi-structured, and is collected from different
sources. In the past, data was collected only from databases and spreadsheets, but these days
data arrives in an array of forms: PDFs, emails, audio, social media posts, photos, videos, etc.
Velocity
Velocity plays an important role compared with the other characteristics. Velocity is the speed
at which data is created in real time. It covers the rates of incoming data streams, rates of
change, and activity bursts. A primary aspect of Big Data is to provide the demanded data
rapidly.
Big data velocity deals with the speed at which data flows from sources such as application
logs, business processes, networks, social media sites, sensors, mobile devices, etc.
Analytics Reporting