
CS8091 BIG DATA ANALYTICS

COUNTING ONES IN A WINDOW


Suppose we have a window of length N on a binary stream. We want at all times to be able to
answer queries of the form “how many 1’s are there in the last k bits?” for any k ≤ N.

The Cost of Exact Counts


To begin, suppose we want to be able to count exactly the number of 1’s in the last k bits for any
k ≤ N. Then it is necessary to store all ‘N’ bits of the window, as any representation that used
fewer than ‘N’ bits could not work.

In proof, suppose we have a representation that uses fewer than ‘N’ bits to represent the ‘N’ bits
in the window. Since there are 2^N sequences of N bits, but fewer than 2^N representations, there
must be two different bit strings ‘w’ and ‘x’ that have the same representation. Since w ≠ x, they
must differ in at least one bit. Let the last k−1 bits of ‘w’ and ‘x’ agree, but let them differ on the
k-th bit from the right end.

Example

If w = 0101 and x = 1010, then k = 1, since scanning from the right, they first disagree at position
1. If w = 1001 and x = 0101, then k = 3, because they first disagree at the third position from the
right.

Now consider the query “how many 1’s are there in the last k bits?”. Since ‘w’ and ‘x’ agree on the
last k−1 bits but differ in the k-th bit from the right, the correct answers for ‘w’ and ‘x’ differ; yet,
because they share the same representation, the algorithm would return the same answer for both.
This contradiction shows that fewer than ‘N’ bits cannot suffice.

The Datar-Gionis-Indyk-Motwani Algorithm


The DGIM algorithm uses O(log² N) bits to represent a window of N bits, and allows us to
estimate the number of 1’s in the window with an error of no more than 50%. Later, we shall see
an improvement of the method that limits the error to any fraction ε > 0, and still uses only
O(log² N) bits.

To begin, each bit of the stream has a timestamp, the position in which it arrives. The first bit has
timestamp 1, the second has timestamp 2, and so on. Since we only need to distinguish positions
within the window of length N, we shall represent timestamps modulo N, so they can be
represented by log2 N bits.

If we also store the total number of bits ever seen in the stream (i.e., the most recent timestamp)
modulo N, then we can determine from a timestamp modulo N where in the current window the
bit with that timestamp is. We divide the window into buckets, each of which is represented by:

• The timestamp of its right (most recent) end.


• The number of 1’s in the bucket. This number must be a power of 2, and we refer to the
number of 1’s as the size of the bucket.

To represent a bucket, we need log2 N bits to represent the timestamp (modulo N) of its right
end.
To represent the number of 1’s, we only need log2 log2 N bits. The reason is that we know this
number i is a power of 2, say 2^j, so we can represent ‘i’ by coding ‘j’ in binary.
Since ‘j’ is at most log2 N, it requires log2 log2 N bits. Thus, O(log N) bits suffice to represent a
bucket. There are six rules that must be followed when representing a stream by buckets.

• The right end of a bucket is always a position with a 1.


• Every position with a 1 is in some bucket.
• No position is in more than one bucket.
• There are one or two buckets of any given size, up to some maximum size.
• All sizes must be a power of 2.
• Buckets cannot decrease in size as we move to the left (back in time).

Figure: A bit-stream divided into buckets following the DGIM rules
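To make the bucket rules concrete, here is a minimal Python sketch (not part of the original text) that keeps buckets as (timestamp, size) pairs, ordered most recent first, and checks the rules that can be verified from the bucket list alone; the function name and the sample bucket list are illustrative assumptions. The remaining rules (the right end of a bucket is a 1, every 1 is in exactly one bucket) can only be checked against the stream itself.

```python
# A minimal sketch, assuming buckets are (timestamp, size) pairs, most recent first.
def satisfies_dgim_rules(buckets):
    """Check the list-only DGIM rules: sizes are powers of 2, at most two
    buckets of each size, and sizes never decrease going back in time (left)."""
    counts = {}
    prev_size = 0
    for _, size in buckets:
        if size <= 0 or size & (size - 1) != 0:   # size must be a power of 2
            return False
        if size < prev_size:                      # sizes may only grow going left
            return False
        prev_size = size
        counts[size] = counts.get(size, 0) + 1
        if counts[size] > 2:                      # at most two buckets of one size
            return False
    return True

# Example: two buckets of size 1, one of size 2, two of size 4.
print(satisfies_dgim_rules([(99, 1), (98, 1), (96, 2), (92, 4), (87, 4)]))  # True
```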

Storage Requirements for the DGIM Algorithm


There are O(log N) buckets, because there can be at most two buckets of each size and the sizes
are powers of 2 no greater than N, so there are only O(log N) distinct sizes. Since each bucket can
be represented in O(log N) bits, the total space required for all the buckets representing a window
of size N is O(log² N).

Query Answering in the DGIM Algorithm


Suppose we are asked how many 1’s there are in the last k bits of the window, for some 1 ≤ k ≤
N. Find the bucket ‘b’ with the earliest timestamp that includes at least some of the ‘k’ most
recent bits. Estimate the number of 1’s to be the sum of the sizes of all the buckets to the right
of (more recent than) bucket ‘b’, plus half the size of ‘b’ itself.
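The following Python sketch illustrates this estimation rule. It assumes buckets are kept as (timestamp, size) pairs ordered from most recent to oldest, with absolute timestamps rather than timestamps modulo N, to keep the arithmetic obvious; the function name is an illustrative assumption. The final lines reproduce the worked example that follows (k = 10, estimate 6).

```python
# A minimal sketch, assuming (timestamp, size) buckets, most recent first.
def estimate_ones(buckets, k, current_time):
    """Estimate the number of 1's among the last k bits."""
    cutoff = current_time - k + 1                    # oldest position in the query range
    in_range = [size for ts, size in buckets if ts >= cutoff]
    if not in_range:
        return 0
    # every in-range bucket except the earliest is counted fully;
    # the earliest (leftmost) in-range bucket contributes half its size
    return sum(in_range[:-1]) + in_range[-1] // 2

# Worked example from the text: buckets of size 1 at t-1 and t-2, size 2 at t-4,
# size 4 at t-8; with k = 10 the estimate is 1 + 1 + 2 + 4//2 = 6.
t = 100
print(estimate_ones([(t - 1, 1), (t - 2, 1), (t - 4, 2), (t - 8, 4)], 10, t))  # 6
```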

Example


Suppose the stream is that of the figure, and k = 10. Then the query asks for the number of 1’s in
the ten rightmost bits, which happen to be 0110010110. Let the current timestamp (time of the
rightmost bit) be ‘t’. Then the two buckets with one 1, having timestamps ‘t−1’ and ‘t−2’, are
completely included in the answer. The bucket of size 2, with timestamp ‘t−4’, is also completely
included. However, the rightmost bucket of size 4, with timestamp ‘t−8’, is only partly included.
We know it is the last bucket to contribute to the answer, because the next bucket to its left has
timestamp less than ‘t−9’ and thus lies completely outside the range of the query.

On the other hand, we know the buckets to its right are completely inside the range of the query
because of the existence of a bucket to their left with timestamp ‘t−9’ or greater. Our estimate of
the number of 1’s in the last ten positions is thus 6. This number is the two buckets of size 1, the
bucket of size 2, and half the bucket of size 4 that is partially within range. Of course the correct
answer is 5.

Maintaining the DGIM Conditions


Suppose we have a window of length ‘N’ properly represented by buckets that satisfy the DGIM
conditions. When a new bit comes in, we may need to modify the buckets, so they continue to
represent the window and continue to satisfy the DGIM conditions.
First, whenever a new bit enters:
• Check the leftmost (earliest) bucket. If its timestamp has now reached the current
timestamp minus N, then this bucket no longer has any of its 1’s in the window.
Therefore, drop it from the list of buckets.

Now, we must consider whether the new bit is 0 or 1. If it is 0, then no further change to the
buckets is needed. If the new bit is a 1, however, we may need to make several changes.
First:
• Create a new bucket with the current timestamp and size 1. If there was only one bucket
of size 1, then nothing more needs to be done. However, if there are now three buckets of
size 1, that is one too many. We fix this problem by combining the leftmost (earliest) two
buckets of size 1.
• To combine any two adjacent buckets of the same size, replace them by one bucket of
twice the size. The timestamp of the new bucket is the timestamp of the rightmost (later
in time) of the two buckets.


Combining two buckets of size 1 may create a third bucket of size 2. If so, we combine the
leftmost two buckets of size 2 into a bucket of size 4. That, in turn, may create a third bucket of
size 4, and if so combine the leftmost two into a bucket of size 8. This process may ripple
through the bucket sizes, but there are at most log2 N different sizes, and the combination of two
adjacent buckets of the same size only requires constant time. As a result, any new bit can be
processed in O(log N) time.
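The Python sketch below illustrates this update step, including the cascade of merges. It reuses the same illustrative assumptions as the earlier sketches: buckets are (timestamp, size) pairs, most recent first, and timestamps are kept as absolute positions rather than modulo N for readability.

```python
# A minimal sketch of the DGIM update step (assumptions as noted above).
def add_bit(buckets, bit, current_time, N):
    """current_time is the timestamp of the arriving bit; buckets is modified in place."""
    # 1. Drop the leftmost (earliest) bucket if it has fallen out of the window.
    if buckets and buckets[-1][0] <= current_time - N:
        buckets.pop()
    # 2. If the new bit is 0, nothing more changes.
    if bit == 0:
        return buckets
    # 3. A 1 arrives: create a new bucket of size 1 with the current timestamp.
    buckets.insert(0, (current_time, 1))
    # 4. While three buckets of some size exist, merge the two leftmost (earliest)
    #    of that size into one bucket of twice the size; this may ripple upward.
    size = 1
    while True:
        idx = [i for i, (_, s) in enumerate(buckets) if s == size]
        if len(idx) <= 2:
            break
        i, j = idx[-2], idx[-1]                 # the two earliest buckets of this size
        buckets[i] = (buckets[i][0], size * 2)  # keep the more recent (right) timestamp
        del buckets[j]
        size *= 2
    return buckets
```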
Reducing the Error
Instead of allowing either one or two buckets of each size, suppose we allow either r−1 or r of
each of the exponentially growing sizes 1,2,4,..., for some integer r > 2. By picking r sufficiently
large, we can limit the error to any desired level.

DECAYING WINDOWS
Exponentially decaying windows are quite useful in finding the most common “recent”
elements.
The Problem of Most-Common Elements
Suppose we have a stream whose elements are the movie tickets purchased all over the world,
with the name of the movie as part of the element. We want to keep a summary of the stream that
tells us the most popular movies “currently.” While the notion of “currently” is imprecise, intuitively, we
want to discount the popularity of a movie like Star Wars–Episode 4, which sold many tickets,
but most of these were sold decades ago. On the other hand, a movie that sold n tickets in each
of the last 10 weeks is probably more popular than a movie that sold 2n tickets last week but
nothing in previous weeks.

One solution would be to imagine a bit stream for each movie. The ith bit has value 1 if the ith
ticket is for that movie, and 0 otherwise. Pick a window size N, which is the number of most
recent tickets that would be considered in evaluating popularity. Then, use the method of
counting ones in a window to estimate the number of tickets for each movie, and rank movies
by their estimated counts. This technique might work for movies, because there are only
thousands of movies, but it would fail if we were instead recording the popularity of items sold at
Amazon, or the rate at which different Twitter-users tweet, because there are too many Amazon
products and too many tweeters. Further, it only offers approximate answers.

Definition of the Decaying Window


An alternative approach that overcomes these drawbacks is to compute a smooth aggregation of
all the 1’s ever seen in the stream, with decaying weights, so that the further back in the stream an
element is, the less weight it is given.

Formally, let a stream currently consist of the elements a1, a2, ..., at, where a1 is the first element to
arrive and at is the current element. Let c be a small constant, such as 10^−6 or 10^−9. Define the
exponentially decaying window for this stream to be the sum

Σ_{i=0}^{t−1} a_{t−i} (1 − c)^i


The effect of this definition is to spread out the weights of the stream elements as far back in time
as the stream goes. In contrast, a fixed window with the same sum of the weights, 1/c, would put
equal weight 1 on each of the most recent 1/c elements to arrive and weight 0 on all previous
elements.
Figure: A decaying window and a fixed-length window of equal weight

It is much easier to adjust the sum in an exponentially decaying window than in a sliding
window of fixed length. In the sliding window, we have to worry about the element that falls out
of the window each time a new element arrives. That forces us to keep the exact elements along
with the sum, or to use an approximation scheme such as DGIM. However, when a new element
at+1 arrives at the stream input, all we need to do is:

1. Multiply the current sum by 1−c.


2. Add at+1.
The reason this method works is that each of the previous elements has now moved one position
further from the current element, so its weight is multiplied by 1−c. Further, the weight on the
current element is (1−c)^0 = 1, so adding at+1 is the correct way to include the new element’s
contribution.
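A minimal Python sketch of this update follows; the function name, the decay constant used in the demonstration, and the tiny example stream are illustrative assumptions.

```python
# A minimal sketch of maintaining an exponentially decaying sum.
def update_decaying_sum(current_sum, new_element, c=1e-6):
    # every older element's weight is multiplied by (1 - c);
    # the new element enters with weight (1 - c)^0 = 1
    return current_sum * (1 - c) + new_element

# Example: feed a short 0/1 stream, e.g. tickets for one movie.
s = 0.0
for bit in [1, 0, 0, 1, 1]:
    s = update_decaying_sum(s, bit, c=0.1)
print(s)
```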

Finding the Most Popular Elements


Consider the problem of finding the most popular movies in a stream of ticket sales.
Use an exponentially decaying window with a constant c, which you might think of as 10^−9. That
is, we approximate a sliding window holding the last one billion ticket sales. For each movie, we
imagine a separate stream with a 1 each time a ticket for that movie appears in the stream, and a
0 each time a ticket for some other movie arrives.

The decaying sum of the 1’s measures the current popularity of the movie. We imagine that the
number of possible movies in the stream is huge, so we do not want to record values for the
unpopular movies. We establish a threshold, say 1/2, so that if the popularity score for a movie
goes below this number, its score is dropped from the counting. For reasons that will become
obvious, the threshold must be less than 1, although it can be any number less than 1.

When a new ticket arrives on the stream, do the following:


• For each movie whose score we are currently maintaining, multiply its score by (1−c).
• Suppose the new ticket is for movie M. If there is currently a score for M, add 1 to that
score. If there is no score for M, create one and initialize it to 1.
• If any score is below the threshold 1/2, drop that score.
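A minimal Python sketch of this procedure, assuming the stream is simply a sequence of movie names; the function name, the example titles, and the value of c used in the demonstration are illustrative assumptions.

```python
# A minimal sketch of maintaining decaying popularity scores with a drop threshold.
def process_ticket(scores, movie, c=1e-9, threshold=0.5):
    # decay every score we are currently maintaining
    for m in list(scores):
        scores[m] *= (1 - c)
    # credit the movie named on the new ticket
    scores[movie] = scores.get(movie, 0.0) + 1.0
    # drop any score that has fallen below the threshold
    for m in list(scores):
        if scores[m] < threshold:
            del scores[m]
    return scores

scores = {}
for ticket in ["StarWars", "Dune", "Dune", "Oppenheimer", "Dune"]:
    process_ticket(scores, ticket, c=0.001)
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```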

It may not be obvious that the number of movies whose scores are maintained at any time is
limited. However, note that the sum of all scores is 1/c. There cannot be more than 2/c movies
with score of 1/2 or more, or else the sum of the scores would exceed 1/c. Thus, 2/c is a limit on
the number of movies being counted at any time.

USING GRAPH ANALYTICS FOR BIG DATA


Graph analytics is an analytics alternative that uses an abstraction called a graph model.

Characteristics

• The simplicity of this model allows for rapidly absorbing and connecting large volumes of
data from many sources in ways that work around the limitations of the source structures.

• Graph analytics is an alternative to the traditional data warehouse model as a framework
for absorbing both structured and unstructured data from various sources to enable
analysts to probe the data in an undirected manner.

• The graph model allows us to tightly couple the meaning of entity relationships as part of
the representation of the relationship. This effectively embeds the semantics of
relationships among different entities within the structure, providing an ability to both
invoke traditional-style queries (to answer typical “search” queries modeled after known
patterns) and enable more sophisticated undirected analyses.

What is graph analytics?


To have a thorough knowledge of the graph model, we must be familiar with the following points:
• What constitutes graph analytics?
• Types of problems that are suited to graph analytics.
• Types of questions that are addressed using graph analytics.
• Types of graphs that are commonly encountered.
• The degree of prevalence within big data analytics problems.

The simplicity of the graph model


Graph analytics is based on a model of representing individual entities and numerous kinds of
relationships that connect those entities. It employs the graph abstraction for representing
connectivity, consisting of a collection of vertices (which are also referred to as nodes or points)
that represent the modeled entities, connected by edges (which are also referred to as links,
connections, or relationships) that capture the way that two entities are related.

Examples


The flexibility of the model is based on its simplicity. A simple unlabeled undirected graph, in
which the edges between vertices neither reflect the nature of the relationship nor indicate their
direction, has limited utility. Adding context by labeling the vertices and edges enhances the
meanings embedded within the graph, and by extension, the entire representation of a network.

The enhancements that enrich the meaning of the nodes and edges represented in the graph
model are:
• Vertices can be labeled to indicate the types of entities that are related.
• Edges can be labeled with the nature of the relationship.
• Edges can be directed to indicate the “flow” of the relationship.
• Weights can be added to the relationships represented by the edges.
• Additional properties can be attributed to both edges and vertices.
• Multiple edges can reflect multiple relationships between pairs of vertices.

Representation as triples
A directed graph can be represented using a triple format consisting of a subject (the source
point of the relationship), an object (the target), and a predicate (that models the type of the
relationship).
Table: Triples Derived from the Example

A collection of these triples is called a semantic database, and this kind of database can capture
additional properties of each triple relationship as attributes of the triple.
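Since the example figure and its triple table are not reproduced here, the sketch below uses hypothetical entities to show the subject-predicate-object format and a simple “search”-style query over the triples; all names are illustrative assumptions.

```python
# A minimal sketch: a directed, labeled graph stored as (subject, predicate, object) triples.
triples = [
    ("Alice",     "works_for",  "Acme Corp"),
    ("Alice",     "knows",      "Bob"),
    ("Bob",       "lives_in",   "Chennai"),
    ("Acme Corp", "located_in", "Chennai"),
]

# A simple "search"-style query: whom does Alice know?
print([obj for subj, pred, obj in triples if subj == "Alice" and pred == "knows"])
```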


Graphs and network organization


One of the benefits of the graph model is the ability to detect patterns or organization that are
inherent within the represented network, such as:

• Embedded micro networks: Looking for small collections of entities that form
embedded “micro communities.” Some examples include determining the originating
sources for a hot new purchasing trend, identifying a terrorist cell based on patterns of
communication across a broad swath of call detail records, or sets of individuals within a
particular tiny geographic region with similar political views.
• Communication models: Modeling communication across a community triggered by a
specific event, such as monitoring the “buzz” across a social media channel associated
with the rumored release of a new product, evaluating best methods for communicating
news releases, or correlation between travel delays and increased mobile telephony
activity.
• Collaborative communities: Isolating groups of individuals that share similar interests,
such as groups of health care professionals working in the same area of specialty,
purchasers with similar product tastes, or individuals with a precise set of employment
skills.
• Influence modeling: Looking for entities holding influential positions within a network
for intermittent periods of time, such as computer nodes that have been hijacked and put
to work as proxies for distributed denial of service attacks or for emerging cybersecurity
threats, or individuals that are recognized as authorities within a particular area
• Distance modeling: Analyzing the distances between sets of entities, such as looking for
strong correlations between occurrences of sets of statistically improbable phrases among
large sets of search engine queries, or the amount of effort necessary to propagate a
message among a set of different communities.

Choosing graph analytics


Characteristics and factors of business problems that indicate graph analytics is a suitable
solution:
• Connectivity: The solution to the business problem requires the analysis of relationships
and connectivity between a variety of different types of entities.
• Undirected discovery: Solving the business problem involves iterative undirected
analysis to seek out as-yet-unidentified patterns.
• Absence of structure: Multiple datasets to be subjected to the analysis are provided
without any inherent imposed structure.
• Flexible semantics: The business problem exhibits dependence on contextual semantics
that can be attributed to the connections and corresponding relationships.
• Extensibility: Because additional data can add to the knowledge embedded within the
graph, there is a need for the ability to quickly add in new data sources or streaming data
as needed for further interactive analysis
• Knowledge is embedded in the network: Solving the business problem involves the
ability to exploit critical features of the embedded relationships that can be inferred from
the provided data.
• Ad hoc nature of the analysis: There is a need to run ad hoc queries to follow lines of
reasoning.

• Predictable interactive performance: The ad hoc nature of the analysis creates a need
for high performance because discovery in big data is a collaborative man/machine
undertaking, and predictability is critical when the results are used for operational
decision making.

Graph analytics use cases


• Health care quality analytics in which patient health encounter histories, diagnoses,
procedures, treatment approaches, and results of clinical trials contribute to the ability to
analyze the comparative effectiveness of different health care options. The process
requires the absorption of many different collections of medical histories, clinical
records, laboratory results, and prescription records from many different sources and
systems. In turn, the application must guide health care providers in a timely and efficient
manner by enabling the rapid assessment of therapies used for other patients with similar
characteristics (such as age, clinical history, and associated risk factors) that have the
most positive outcomes. This analysis creates a graph representing the corresponding
relationships and may also combine health histories with additional data sources that can
be easily integrated into the existing graph model.

• Concept-based correlations that seek to organize large bodies of knowledge. Some
examples include looking for contextual relationships between scientific health care
research and particular types of pharmaceuticals, investigative journalists seeking
connections among individuals referred to in a variety of news sources, fraud analysts
evaluating financial irregularities across multiple related organizations, or even
correlations between corporate behavior and increased health risks. Each of these
examples involves the absorption of many content artifacts from multiple varied data
sources, the need to combine isolated pieces of information extracted from among the
corpus of documents, and a discovery activity looking for correlations that are currently
hidden or unknown.

• Cybersecurity: The numbers of cyber-attacks are increasing, and their sophistication is
expanding well beyond distributed denial of service (DDoS), which is more frequently
being used as a pretext and distraction while more insidious attacks seek to traverse
corporate firewalls, extract critical business information, or incrementally drain
individual financial accounts while operating completely under the radar. Monitoring for
cybersecurity events is a process in which a wide variety of massive streaming datasets
(such as network logs, NetFlow, DNS, and IDS data) need to be rapidly captured and
integrated into a model that allows for both the identification of known attack patterns and
the discovery of new patterns that emerge as the attacks become more elaborate.
A graph analytics approach can address the challenge by capturing and loading the entire
volume and breadth of the available datasets, coupled with an evolving model that can
quickly log connections and relationships that are used to identify new patterns of attack.
This graph-based approach allows analysts to rapidly get results from many ad hoc
queries requested from a model managing massive amounts of data representing
thousands of interconnected entities. This approach enables the analysts to quickly
identify potential cyber-threats in minutes or even seconds, so that defensive actions can
be taken quickly.

Graph analytics algorithms and solution approaches


Some of the algorithmic approaches used in graph analytics include:
• Community and network analysis, in which the graph structures are traversed in search
of groups of entities connected in particularly “close” ways. One example is a collection
of entities that are completely connected (i.e., each member of the set is connected to all
other members of the set).
• Path analysis, which analyzes the shapes and distances of the different paths that connect
entities within the graph.
• Clustering, which examines the properties of the vertices and edges to identify
characteristics of entities that can be used to group them together.
• Pattern detection and pattern analysis, or methods for identifying anomalous or
unexpected patterns requiring further investigation.
• Probabilistic graphical models such as Bayesian networks or Markov networks for
various applications such as medical diagnosis, protein structure prediction, speech
recognition, or assessment of default risk for credit applications.
• Graph metrics that are applied to measurements associated with the network itself,
including the degree of the vertices (i.e., the number of edges in and out of the vertex), or
centrality and distance (including the degree to which particular vertices are “centrally
located” in the graph, or how close vertices are to each other based on the length of the
paths between them).
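As a small illustration of a few of these measures, the sketch below uses the open-source networkx library; the choice of library and the toy graph are assumptions, not something prescribed by the text.

```python
# A minimal sketch of a few graph measures, assuming networkx is installed.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("a", "b"), ("a", "c"), ("b", "c"),   # a completely connected trio (a clique)
    ("c", "d"), ("d", "e"),
])

print(dict(G.degree()))                      # degree of each vertex
print(nx.closeness_centrality(G))            # how "centrally located" each vertex is
print(nx.shortest_path_length(G, "a", "e"))  # path analysis: distance between two entities
print(list(nx.find_cliques(G)))              # completely connected groups of entities
```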

Technical complexity of analyzing graphs


The factors that might introduce performance penalties that are not easily addressed on standard
hardware architectures include:
• Unpredictability of graph memory accesses: The types of discovery analyses in graph
analytics often require the simultaneous traversal of multiple paths within a network to
find the interesting patterns for further review or optimal solutions to a problem. The
graph model is represented using data structures that incorporate the links between the
represented entities, and this differs substantially from a traditional database model. In a
parallel environment, many traversals can be triggered at the same time, but each graph
traversal is inherently dependent on the ability to follow the links from the subject to the
target in the representation. Unlike queries to structured databases, the memory access
patterns are not predictable, limiting the ability to reduce data access latency by
efficiently streaming prefetched data through the different levels of the memory hierarchy
• Graph growth models: As more information is introduced into an environment, real-
world networks grow in an interesting way. The larger the graph becomes, the more it
exhibits what is called “preferential connectivity”—newly introduced entities are more
likely to connect to already existing ones. Existing nodes with a high degree of
connectivity are more likely to continue to be “popular” in that they will continue to
attract new connections. This means that graphs continue to grow, but apparently that
growth does not scale equally across the data structure.


• Dynamic interactions with graphs: As with any big data application, graphs to be
analyzed are populated from a wide variety of data sources of large or massive data
volumes, streamed at varying rates. But while the graph must constantly absorb many
rapidly changing data streams and process the incorporation of connections and
relationships into the persistent representation of the graph, the environment must also
satisfy the need to rapidly respond to many simultaneous discovery analyses. A high-
performance graph analytics solution must accommodate the dynamic nature without
allowing for any substantial degradation in analytics performance.

• Complexity of graph partitioning: Graphs tend to cluster among centers of high
connectivity, as many networks naturally have “hubs” consisting of a small group of
nodes with many connections. A benefit of having hubs is that, in general, they shorten the
distances among collections of nodes in the graph. The hubs often showcase entities that
may be of particular interest because of their perceived centrality. But while one of the
benefits of big data platforms is the expectation of distribution of data and computation,
the existence of hubs makes it difficult to partition the graph among different processing
units because of the multiplicity of connections. Arbitrarily distributing the data
structures in ways that span or cross multiple partitions will lead to increased cross-
partition network traffic, which effectively eliminates the perceived performance benefit
of data distribution.

Features of a graph analytics platform


Ease of development on a graph analytics platform is enabled through the adoption of industry
standards as well as support for the general requirements of big data applications, such as:

• Seamless data intake: Providing a seamless capability to easily absorb and fuse data
from a variety of different sources.
• Data integration: A semantics-based approach is necessary for graph analytics to
integrate different sets of data that do not have a predetermined structure. Similar to other
NoSQL approaches, the schema-less semantic model must provide the flexibility not
provided by relational database models.
• Inferencing: The application platform should provide methods for inferencing and
deduction of new information and insights derived from the embedded relationships and
connectivity.
• Standards-based representation: Any graph analytics platform must employ the
Resource Description Framework (RDF) standard to use triples for representing the graph.
Representing the graph using RDF allows for the use of the SPARQL query language
standard for executing queries against a triple-based data management environment.
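A minimal sketch of this idea using the open-source rdflib package for Python (an assumption; any RDF store with a SPARQL endpoint would serve): a few triples are loaded and then queried with SPARQL. The namespace and entity names are hypothetical.

```python
# A minimal sketch of RDF triples queried with SPARQL, assuming rdflib is installed.
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.Alice, EX.worksFor, EX.AcmeCorp))
g.add((EX.Bob, EX.worksFor, EX.AcmeCorp))

# SPARQL query: who works for AcmeCorp?
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?person WHERE { ?person ex:worksFor ex:AcmeCorp . }
""")
for row in results:
    print(row.person)
```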

Considerations for the operational aspects of the graph analytics platform:


• Workflow integration: Providing a graph analytics platform that is segregated from the
existing reporting and analytics environments will have limited value when there are gaps
in incorporating results from across the different environments. Make sure that the graph
analytics platform is aligned with the analysis and decision-making workflows for the
associated business processes.


• Visualization: Presenting discoveries using visualization tools is critical to highlight their
value. Look for platforms that have tight integration with visualization services.
• “Complementariness”: A graph analytics platform augments an organization’s analytics
capability and is not intended to be a replacement. Any graph analytics capabilities must
complement existing data warehouses, data marts, OLAP engines, and Hadoop analytic
environments.

Important criteria for graph analytics due to the nature of the representation and the types of
discovery analyses performed are as follows:
• High-speed I/O: Meeting this expectation for real-time integration requires a scalable
infrastructure, particularly with respect to high speed I/O channels, which will speed the
intake of multiple data streams and thereby support graphs that are rapidly changing as
new data is absorbed.
• High-bandwidth network: As a significant burden of data access may cross node
barriers, employing a high-speed/high-bandwidth network interconnect will also help
reduce data latency delays, especially for chasing pointers across the graph.
• Multithreading: Fine-grained multithreading allows the simultaneous exploration of
different paths, and efficiently creating, managing, and allocating threads to the available
nodes of a massively parallel processing architecture can alleviate the challenge posed by
unpredictable memory accesses.
• Large memory: A very large memory shared across multiple processors reduces the
need to partition the graph across independent environments. This helps to reduce the
performance impacts associated with graph partitioning. Large memories are useful in
keeping a large part, if not all, of the graph resident in memory. Migrating allocated
tasks to where the data resides in memory reduces the impact of data access latency
delays.

Dedicated appliances for graph analytics


There are different emerging methods of incorporating graph analytics into the enterprise.

One class is purely a software approach, providing an ability to create, populate, and query
graphs. This approach enables the necessary functionality and may provide ease of
implementation and deployment. Most, if not all, software implementations will use industry
standards, such as RDF and SPARQL, and may even leverage complementary tools for
inferencing and deduction. The performance of a software-only implementation is limited by the
hardware available to it, and running on commodity servers does not necessarily enable it to
natively exploit performance optimizations.

Another class is the use of a dedicated appliance for graph analytics. From an algorithmic
perspective, this approach is just as capable as one that relies solely on software. From a
performance perspective, there is no doubt that a dedicated platform will take advantage of high-
performance I/O, high-bandwidth networking, in-memory computation, and native
multithreading to provide the optimal performance for creating, growing, and analyzing graphs
that are built from multiple high-volume data streams.


Software approaches may be satisfactory for smaller graph analytics problems, but as data
volumes and network complexity grow, the most effective means for return on investment may
necessitate the transition to a dedicated graph analytics appliance.
