Harnessing Big Data
Big Data Tech Stack
Redundant Physical Infrastructure
• Redundant physical infrastructure is fundamental to the operation
and scalability of a big data architecture.
• To support an unanticipated or unpredictable volume of data, a
physical infrastructure for big data has to be different from that for
traditional data.
• The physical infrastructure is based on a distributed computing
model. This means that data may be physically stored in many
different locations and can be linked together through networks, the
use of a distributed file system, and various big data analytic tools
and applications.
• Redundancy is important because a big data system ingests so much data
from so many different sources, and it comes in many forms. Internally,
redundant resources may be used in a private cloud for load balancing.
• Redundancy may also take the form of external cloud services that
augment an organization's internal resources. In some cases, this
redundancy comes in the form of Software as a Service (SaaS).
The overall process of extracting insights from big data can be broken down into five stages.
These five stages form the two main sub-processes: data management and analytics. Data
management involves processes and supporting technologies to acquire and store data and
to prepare and retrieve it for analysis. Analytics, on the other hand, refers to techniques
used to analyze and acquire intelligence from big data. Thus, big data analytics can be
viewed as a sub-process in the overall process of ‘insight extraction’ from big data.
Big data management
MapReduce model
[Figure: Distributed grep as a MapReduce job. The very big input is split; each split is scanned by grep (the map step), producing matches; cat (the reduce step) concatenates them into all matches.]
[Figure: MapReduce dataflow. Map tasks feed a partitioning function, which routes intermediate pairs to reduce tasks that produce the result.]
Map:
– Accepts an input key/value pair
– Emits intermediate key/value pairs
Reduce:
– Accepts an intermediate key and the list of values for that key
– Emits output key/value pairs
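The map and reduce contracts above can be sketched in plain Python. This is a toy, single-process sketch of the model, not a distributed implementation; the function names are illustrative:

```python
from collections import defaultdict

def map_phase(records, map_fn):
    """Apply the user map function to each input key/value pair."""
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))
    return intermediate

def shuffle(intermediate):
    """Group intermediate pairs by key (the partitioning step)."""
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    """Apply the user reduce function to each key and its value list."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Classic word count: map emits (word, 1); reduce sums the counts.
def wc_map(_, line):
    return [(word, 1) for word in line.split()]

def wc_reduce(_, counts):
    return sum(counts)

records = [(0, "big data big ideas"), (1, "data everywhere")]
result = reduce_phase(shuffle(map_phase(records, wc_map)), wc_reduce)
# result == {"big": 2, "data": 2, "ideas": 1, "everywhere": 1}
```

One map/reduce pair performs exactly one level of aggregation, which is why complex jobs chain several such rounds.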
Directed Acyclic Graph model
The MapReduce model states that distributed computation on a large dataset
can be boiled down to two kinds of computation steps: a map step and a
reduce step. One pair of map and reduce performs one level of aggregation
over the data. Complex computations typically require multiple such steps,
and a sequence of such steps essentially forms a DAG of operations. The
DAG execution model is therefore a generalization of the MapReduce model.
Computations expressed in MapReduce boil down to multiple iterations of:
• Read data from HDFS,
• Apply the map and reduce functions,
• Write the results back to HDFS.
Each map-reduce round is completely independent of the others, and Hadoop
has no global knowledge of which MR steps will follow each MR step. For
many iterative algorithms this is inefficient, because the data between
each map-reduce pair is written to and read back from the filesystem.
Newer systems such as Spark and Tez improve performance over Hadoop by
considering the whole DAG of map-reduce steps and optimizing it globally
(e.g., pipelining consecutive map steps into one and not writing
intermediate data to HDFS). This avoids writing data back and forth after
every reduce.
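The pipelining optimization described above can be illustrated with a small sketch: fusing consecutive map steps into a single pass over the data, rather than materializing an intermediate dataset after every step. This is a simplified illustration of the idea, not Spark's or Tez's actual machinery:

```python
def fuse_maps(map_fns):
    """Pipeline consecutive map steps into one pass, so no intermediate
    dataset is materialized between them (the DAG-level optimization)."""
    def fused(record):
        for fn in map_fns:
            record = fn(record)
        return record
    return fused

def run_materialized(data, map_fns):
    """Hadoop-style plan: fully materialize the data after every step
    (each loop iteration stands in for a write/read of HDFS)."""
    for fn in map_fns:
        data = [fn(x) for x in data]
    return data

def run_fused(data, map_fns):
    """DAG-optimized plan: one fused pass over the data."""
    fused = fuse_maps(map_fns)
    return [fused(x) for x in data]

steps = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
# Both plans compute the same result; the fused plan touches the data once.
assert run_materialized([1, 2, 3], steps) == run_fused([1, 2, 3], steps)
```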
Directed Acyclic Graph model
One method for representing graphs uses the W3C’s Resource Description Framework, called RDF. In RDF, you specify
triples that follow a subject-predicate-object format, such as “Larry Brown is-employed-by Millipede Electronics.” In this
example, “Larry Brown” is the subject, “Millipede Electronics” is the object, and “is-employed-by” is the predicate that
relates the subject to the object.
In effect, an RDF triple defines a directed edge from the subject to the object. A collection of RDF triples can
therefore capture the structure of a graph. Within a graph database, various data structures represent the graph. Some are
straightforward, such as Java objects linked with pointers; others are optimized using different types of data
structures, such as a sparse matrix.
Graphs are queried using a graph query language such as SPARQL -- a recursive acronym for SPARQL Protocol and
RDF Query Language. SPARQL is a semantic query language, allowing queries based on the attributes of the vertices,
the attributes of the edges, and the structure of the connections. For example, you can query the graph for “all individuals
who have three outgoing edges that connect to companies that have more than 500 employees.”
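A minimal sketch of both ideas, storing triples as Python tuples and answering a SPARQL-like structural query directly, without an RDF library. The entity names and the 500-employee threshold mirror the examples in the text and are purely illustrative:

```python
# Each triple is (subject, predicate, object).
triples = {
    ("Larry Brown", "is-employed-by", "Millipede Electronics"),
    ("Ana Diaz", "is-employed-by", "Millipede Electronics"),
    ("Bo Chen", "is-employed-by", "Tiny Co"),
    ("Millipede Electronics", "employee-count", 600),
    ("Tiny Co", "employee-count", 40),
}

def subjects(predicate, obj):
    """All subjects with a `predicate` edge pointing at `obj`."""
    return {s for s, p, o in triples if p == predicate and o == obj}

def employees_of_large_companies(threshold=500):
    """Structural query: everyone employed by a company whose
    employee-count attribute exceeds the threshold."""
    large = {s for s, p, o in triples
             if p == "employee-count" and o > threshold}
    return {s for company in large
            for s in subjects("is-employed-by", company)}

workers = employees_of_large_companies()
# workers == {"Larry Brown", "Ana Diaz"}
```

A real SPARQL engine evaluates such queries as graph pattern matching over the triple store; this sketch just makes the subject/predicate/object mechanics concrete.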
Search is just one analytics algorithm that you can apply to graphs. Others include:
-- Partitioning isolates portions of the graph into smaller pieces with particular properties. Example uses include
partitioning a telecommunications network based on serving particular geographic regions or organizing sales
territories by the location of the sales staff.
-- Shortest path analysis seeks the most efficient path between two nodes. A good example is examining all the delivery
points for a parcel delivery company each day to determine routes that require the least amount of fuel.
-- Connected-component analysis locates subgraphs in which every vertex can be reached from any other
member of the component. Connected components often represent distinct clusters of entities, and you
can use them for segmentation.
-- Page rank characterizes the importance of a vertex within the graph. The algorithm, named after Google founder Larry
Page, is part of how search engines rank websites based on their content and connections to other websites.
-- Centrality is another algorithm used to identify the most important or most influential entities within the graph.
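Of the algorithms listed above, PageRank lends itself to a compact sketch. The following is a straightforward iterative implementation over an edge list, with dangling-node rank spread uniformly; the damping factor and iteration count are conventional defaults, not prescribed by the text:

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Iterative PageRank over a directed graph given as (src, dst) edges."""
    nodes = {n for edge in edges for n in edge}
    out_links = {n: [d for s, d in edges if s == n] for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            targets = out_links[n]
            if targets:
                share = damping * rank[n] / len(targets)
                for t in targets:
                    new[t] += share
            else:  # dangling node: spread its rank over all nodes
                for t in nodes:
                    new[t] += damping * rank[n] / len(nodes)
        rank = new
    return rank

# A vertex pointed at by everyone should end up with the highest rank.
edges = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("hub", "a")]
ranks = pagerank(edges)
assert max(ranks, key=ranks.get) == "hub"
```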
BSP Model
The Bulk Synchronous Parallel (BSP) abstract computer is a model for designing parallel
algorithms. An important part of analysing a BSP algorithm rests on quantifying the
synchronization and communication needed.
A BSP computer consists of
• Components capable of processing and/or local memory transactions (i.e., processors),
• A network that routes messages between pairs of such components, and
• A hardware facility that allows for the synchronization of all or a subset of components.
This is commonly interpreted as a set of processors that may follow different threads of
computation, each equipped with fast local memory and interconnected by a
communication network. A BSP algorithm relies heavily on the third feature: a computation
proceeds in a series of global supersteps, each of which consists of three components:
Concurrent computation: Every participating processor may perform local computations,
i.e., each process can only use values stored in the local fast memory of its
processor. The computations occur asynchronously with respect to one another but may
overlap with communication.
Communication: The processes exchange data between themselves to facilitate remote
data storage capabilities.
Barrier synchronization: When a process reaches this point (the barrier), it waits until all
other processes have reached the same barrier.
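The three components of a superstep can be sketched with Python threads standing in for BSP processors, a shared list standing in for the one-sided "put" target, and `threading.Barrier` providing the barrier synchronization. This is a single-machine illustration of the model under those stand-in assumptions, not a distributed BSP runtime:

```python
import threading

def bsp_sum(partitions):
    """One BSP superstep: local compute, one-sided put into a mailbox,
    then a barrier before anyone reads the communicated values."""
    n = len(partitions)
    mailbox = [None] * n              # shared target of the one-sided puts
    barrier = threading.Barrier(n)
    totals = [None] * n

    def worker(pid):
        # Concurrent computation: purely local work on this partition.
        local = sum(partitions[pid])
        # Communication: put the local result where others can get it.
        mailbox[pid] = local
        # Barrier synchronization: wait until every put has completed.
        barrier.wait()
        # The next superstep may now safely read all communicated values.
        totals[pid] = sum(mailbox)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    assert all(t == totals[0] for t in totals)  # every worker agrees
    return totals[0]

# bsp_sum([[1, 2], [3, 4], [5]]) sums three partitions to 15.
```

Without the barrier, a fast worker could read the mailbox before a slow worker's put had landed; the barrier is precisely what makes the communicated values safe to read.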
The computation and communication actions do not have to be ordered in time.
Communication typically takes the form of the one-sided put and get Direct
Remote Memory Access (DRMA) calls, rather than paired two-sided send and
receive message passing calls. The barrier synchronization concludes the
superstep. It ensures that all one-sided communications are properly
concluded. Systems based on two-sided communication include this
synchronization cost implicitly for every message sent. The method for barrier
synchronization relies on the hardware facility of the BSP computer. This facility
periodically checks if the end of the current superstep is reached globally.
The BSP model is also well-suited to enable automatic memory management
for distributed-memory computing through overdecomposition of the problem
and oversubscription of the processors. The computation is divided into more
logical processes than there are physical processors, and processes are
randomly assigned to processors. This strategy leads to almost perfect load
balancing, both of work and communication.
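The overdecomposition argument can be made concrete with a small simulation: randomly assigning many equal-cost logical processes to a few processors yields a maximum load very close to the average, whereas one process per processor balances poorly. The task and processor counts here are arbitrary choices for illustration:

```python
import random

def max_load_ratio(num_tasks, num_procs, seed=0):
    """Randomly assign `num_tasks` equal-cost logical processes to
    `num_procs` processors; return max load divided by average load
    (1.0 means perfect balance)."""
    rng = random.Random(seed)
    load = [0] * num_procs
    for _ in range(num_tasks):
        load[rng.randrange(num_procs)] += 1
    avg = num_tasks / num_procs
    return max(load) / avg

few = max_load_ratio(num_tasks=8, num_procs=8)      # no overdecomposition
many = max_load_ratio(num_tasks=8000, num_procs=8)  # 1000 processes each
# `many` ends up very close to 1.0; `few` is typically much worse,
# because random assignment only averages out over many small processes.
```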