Counting Ones in a Window: The Cost of Exact Counts
To see why an exact count requires N bits, suppose we have a representation that uses fewer than N bits to represent the N bits in the window. Since there are 2^N sequences of N bits but fewer than 2^N representations, there must be two different bit strings w and x that have the same representation. Since w ≠ x, they must differ in at least one bit. Let the last k−1 bits of w and x agree, but let them differ on the k-th bit from the right end.
Example
If w = 0101 and x = 1010, then k = 1, since, scanning from the right, they first disagree at position 1. If w = 1001 and x = 0101, then k = 3, because they first disagree at the third position from the right.

Since w and x have the same representation, the algorithm must give the same answer to the query "how many 1's are in the last k bits?" whether the window holds w or x. But the correct answers differ: the two windows agree on their last k−1 bits and differ on the k-th, so their counts of 1's differ by exactly 1. One of the two identical answers must therefore be wrong, so fewer than N bits cannot support exact counts.
To begin, each bit of the stream has a timestamp, the position in which it arrives. The first bit has timestamp 1, the second has timestamp 2, and so on. Since we only need to distinguish positions within the window of length N, we shall represent timestamps modulo N, so each can be represented by log₂ N bits.
If we also store the total number of bits ever seen in the stream (i.e., the most recent timestamp) modulo N, then we can determine from a timestamp modulo N where in the current window the bit with that timestamp is. The idea is to divide the window into buckets, each consisting of:
• The timestamp of its right (most recent) end.
• The number of 1's in the bucket. This number must be a power of 2, and we refer to it as the size of the bucket.
To represent a bucket, we need log₂ N bits for the timestamp (modulo N) of its right end. To represent the number of 1's, we need only log₂ log₂ N bits. The reason is that we know this number i is a power of 2, say 2^j, so we can represent i by coding j in binary.
Since j is at most log₂ N, it requires log₂ log₂ N bits. Thus, O(log N) bits suffice to represent a bucket. There are six rules that must be followed when representing a stream by buckets:
• The right end of a bucket is always a position with a 1.
• Every position with a 1 is in some bucket.
• No position is in more than one bucket.
• There are one or two buckets of any given size, up to some maximum size.
• All bucket sizes must be a power of 2.
• Buckets cannot decrease in size as we move to the left (back in time).
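To make the representation concrete, here is a minimal Python sketch of a bucket under the assumptions above; the names Bucket and log_size are illustrative, not from the text. Storing the exponent j rather than the size itself is exactly the log₂ log₂ N trick just described.

```python
from dataclasses import dataclass

@dataclass
class Bucket:
    timestamp: int   # position of the bucket's right end (mod N in a real system)
    log_size: int    # the exponent j; the bucket contains 2**j ones

    @property
    def size(self) -> int:
        return 1 << self.log_size   # recover the count 2**j from j
```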
Example
Suppose k = 10, so the query asks for the number of 1's in the ten rightmost bits of the stream, which happen to be 0110010110. Let the current timestamp (the time of the rightmost bit) be t. Then the two buckets with one 1, having timestamps t−1 and t−2, are completely included in the answer. The bucket of size 2, with timestamp t−4, is also completely included. However, the rightmost bucket of size 4, with timestamp t−8, is only partly included. We know it is the last bucket to contribute to the answer, because the next bucket to its left has a timestamp less than t−9 and thus lies completely outside the window. On the other hand, we know the buckets to its right are completely inside the range of the query because of the existence of a bucket to their left with timestamp t−9 or greater. Our estimate of the number of 1's in the last ten positions is thus 1 + 1 + 2 + 4/2 = 6: the two buckets of size 1, the bucket of size 2, and half the bucket of size 4 that is partially within range. The correct answer is 5.
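As an illustrative sketch of the query just walked through, the function below estimates the count of 1's in the last k bits from a list of the Bucket objects defined earlier, ordered oldest to newest. For clarity it uses absolute timestamps rather than timestamps modulo N; the function name and arguments are assumptions for this example.

```python
def estimate_ones(buckets, t, k):
    """Estimate the number of 1's among the last k bits at current time t.

    buckets is ordered oldest (leftmost) to newest (rightmost); only
    buckets whose right end lies inside the window contribute.
    """
    total = 0
    leftmost_size = None
    for b in buckets:
        if t - b.timestamp < k:          # right end is inside the window
            total += b.size
            if leftmost_size is None:    # first in-window bucket seen is
                leftmost_size = b.size   # the possibly partial, leftmost one
    if leftmost_size is None:
        return 0
    # Count only half of the leftmost bucket, which may stick out of the window.
    return total - leftmost_size // 2
```

On the example above (in-window buckets of sizes 4, 2, 1, 1), this returns (4 + 2 + 1 + 1) − 4/2 = 6, matching the estimate in the text, while the true answer is 5.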
When a new bit arrives with timestamp t, we first check the leftmost (earliest) bucket: if its timestamp now lies outside the window, that is, N or more time units in the past, we drop that bucket from the list. Next, we must consider whether the new bit is 0 or 1. If it is 0, then no further change to the buckets is needed. If the new bit is a 1, however, we may need to make several changes. First:
• Create a new bucket with the current timestamp and size 1. If there was only one bucket
of size 1, then nothing more needs to be done. However, if there are now three buckets of
size 1, that is one too many. We fix this problem by combining the leftmost (earliest) two
buckets of size 1.
• To combine any two adjacent buckets of the same size, replace them by one bucket of
twice the size. The timestamp of the new bucket is the timestamp of the rightmost (later
in time) of the two buckets.
Combining two buckets of size 1 may create a third bucket of size 2. If so, we combine the leftmost two buckets of size 2 into a bucket of size 4. That, in turn, may create a third bucket of size 4, and if so, we combine the leftmost two into a bucket of size 8. This process may ripple through the bucket sizes, but there are at most log₂ N different sizes, and the combination of two adjacent buckets of the same size requires only constant time. As a result, any new bit can be processed in O(log N) time.
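The sketch below implements this update rule, including the ripple of combinations. It is illustrative only: it uses absolute timestamps and a plain Python list of the Bucket objects from above. The max_per_size parameter is 2 for the basic algorithm; raising it to r anticipates the error-reduction idea in the next subsection.

```python
def add_bit(buckets, bit, t, N, max_per_size=2):
    """Process the bit arriving at time t (DGIM update, illustrative)."""
    # Drop the earliest bucket if its right end has left the window.
    if buckets and buckets[0].timestamp <= t - N:
        buckets.pop(0)
    if bit == 0:
        return                                  # a 0 changes nothing
    # A 1: create a new bucket of size 1 with the current timestamp.
    buckets.append(Bucket(timestamp=t, log_size=0))
    j = 0
    while True:
        same = [i for i, b in enumerate(buckets) if b.log_size == j]
        if len(same) <= max_per_size:
            break                               # no overflow at this size
        # Combine the two earliest (leftmost) buckets of size 2**j into one
        # of size 2**(j+1), keeping the later of the two timestamps.
        i1, i2 = same[0], same[1]
        merged = Bucket(timestamp=buckets[i2].timestamp, log_size=j + 1)
        buckets[i1:i2 + 1] = [merged]
        j += 1                                  # the merge may ripple upward
```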
Reducing the Error
Instead of allowing either one or two buckets of each size, suppose we allow either r−1 or r buckets of each of the exponentially growing sizes 1, 2, 4, ..., for some integer r > 2. By picking r sufficiently large, we can limit the error to any desired level.
DECAYING WINDOWS
Exponentially decaying windows are quite useful in finding the most common “recent”
elements.
The Problem of Most-Common Elements
Suppose we have a stream whose elements are the movie tickets purchased all over the world, with the name of the movie as part of the element. We want to keep a summary of the stream that identifies the most popular movies "currently." While the notion of "currently" is imprecise, intuitively we want to discount the popularity of a movie like Star Wars–Episode 4, which sold many tickets, but most of them decades ago. On the other hand, a movie that sold n tickets in each of the last 10 weeks is probably more popular than a movie that sold 2n tickets last week but nothing in previous weeks.
One solution would be to imagine a bit stream for each movie. The ith bit has value 1 if the ith
ticket is for that movie, and 0 otherwise. Pick a window size N, which is the number of most
recent tickets that would be considered in evaluating popularity. Then, use the method of
counting ones in a window to estimate the number of tickets for each movie, and rank movies by their estimated counts. This technique might work for movies, because there are only thousands of movies, but it would fail if we were instead recording the popularity of items sold at Amazon, or the rate at which different Twitter users tweet, because there are too many Amazon
products and too many tweeters. Further, it only offers approximate answers.
Formally, let a stream currently consist of the elements a_1, a_2, ..., a_t, where a_1 is the first element to arrive and a_t is the current element. Let c be a small constant, such as 10^−6 or 10^−9. Define the exponentially decaying window for this stream to be the sum

    Σ_{i=0}^{t−1} a_{t−i} (1 − c)^i
The effect of this definition is to spread out the weights of the stream elements as far back in time
as the stream goes. In contrast, a fixed window with the same sum of the weights, 1/c, would put
equal weight 1 on each of the most recent 1/c elements to arrive and weight 0 on all previous
elements.
A decaying window and a fixed-length window of equal weight
It is much easier to adjust the sum in an exponentially decaying window than in a sliding window of fixed length. In the sliding window, we have to worry about the element that falls out of the window each time a new element arrives. That forces us to keep the exact elements along with the sum, or to use an approximation scheme such as DGIM. However, when a new element a_{t+1} arrives at the stream input, all we need to do is:
• Multiply the current sum by 1 − c.
• Add a_{t+1}.
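A one-line sketch of this update (the names are illustrative; c is the decay constant from the definition):

```python
def update_decaying_sum(s, a_next, c=1e-6):
    # Multiply the old sum by (1 - c), then add the newly arrived element.
    return s * (1 - c) + a_next
```

Multiplying by (1 − c) ages every previous element by one more decay factor at once, so no per-element bookkeeping is needed.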
The decaying sum of the 1's measures the current popularity of the movie. We imagine that the number of possible movies in the stream is huge, so we do not want to record values for the unpopular movies. We therefore establish a threshold, say 1/2, so that if the popularity score for a movie goes below this number, its score is dropped from the counting. For reasons that will become obvious, the threshold can be any number less than 1.
It may not be obvious that the number of movies whose scores are maintained at any time is limited. However, note that the sum of all scores is 1/c. There cannot be more than 2/c movies with a score of 1/2 or more, or else the sum of the scores would exceed 1/c. Thus, 2/c is a limit on the number of movies being counted at any time.
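Putting the pieces together for the movie example, here is a small sketch that maintains a decaying score per movie and drops scores that fall below the 1/2 threshold; the dictionary-based bookkeeping is an illustrative choice, not prescribed by the text.

```python
C = 1e-6            # the decay constant c
THRESHOLD = 0.5     # scores below this are dropped

scores = {}         # movie name -> current decaying count

def process_ticket(movie):
    # Age every maintained score by one factor of (1 - c),
    # dropping any movie whose score falls below the threshold.
    for m in list(scores):
        scores[m] *= (1 - C)
        if scores[m] < THRESHOLD:
            del scores[m]
    # The new ticket contributes 1 to its movie's score.
    scores[movie] = scores.get(movie, 0.0) + 1.0
```

Because the scores sum to at most 1/c, at most 2/c entries can have a score of 1/2 or more, which bounds the size of the dictionary as argued above.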
GRAPH ANALYTICS
Characteristics
• The simplicity of this model allows for rapidly absorbing and connecting large volumes of data from many sources in ways that sidestep the limitations of the source structures.
• The graph model allows us to tightly couple the meaning of entity relationships as part of the representation of the relationship. This effectively embeds the semantics of relationships among different entities within the structure, providing the ability both to invoke traditional-style queries (to answer typical "search" queries modeled after known patterns) and to enable more sophisticated undirected analyses.
Examples
The flexibility of the model is based on its simplicity. A simple unlabeled, undirected graph, in which the edges between vertices neither reflect the nature of the relationship nor indicate their direction, has limited utility. Adding context by labeling the vertices and edges enhances the meanings embedded within the graph and, by extension, the entire representation of a network. The enhancements that enrich the meaning of the nodes and edges represented in the graph model are listed below (a small sketch follows the list):
• Vertices can be labeled to indicate the types of entities that are related.
• Edges can be labeled with the nature of the relationship.
• Edges can be directed to indicate the “flow” of the relationship.
• Weights can be added to the relationships represented by the edges.
• Additional properties can be attributed to both edges and vertices.
• Multiple edges can reflect multiple relationships between pairs of vertices.
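As a small illustration of these enhancements, the sketch below stores a directed, labeled, weighted multigraph as plain Python records; all names and values are illustrative.

```python
# Each vertex carries a type label and optional properties; each edge records
# direction (source -> target), a relationship label, a weight, and arbitrary
# extra properties. Using a list permits multiple edges per pair of vertices.
vertices = {
    "alice": {"label": "Person", "age": 34},
    "acme":  {"label": "Company"},
}
edges = [
    {"source": "alice", "target": "acme", "label": "WORKS_FOR",
     "weight": 1.0, "since": 2019},
    {"source": "alice", "target": "acme", "label": "SHAREHOLDER_OF",
     "weight": 0.3},   # a second edge between the same pair of vertices
]
```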
Representation as triples
A directed graph can be represented using a triple format consisting of a subject (the source point of the relationship), an object (the target), and a predicate (which models the type of the relationship).
Triples Derived from the Example
A collection of these triples is called a semantic database, and this kind of database can capture
additional properties of each triple relationship as attributes of the triple.
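A minimal sketch of such a triple collection, with per-triple attributes and a naive pattern query in which None acts as a wildcard; the helper names are assumptions for this example.

```python
# (subject, predicate, object) plus optional attributes of the triple itself.
triples = [
    ("alice", "works_for", "acme", {"since": 2019}),
    ("acme",  "located_in", "boston", {}),
]

def match(subject=None, predicate=None, obj=None):
    """Return the triples matching the pattern; None matches anything."""
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]

print(match(predicate="works_for"))   # all "works_for" relationships
```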
• Embedded micro networks: Looking for small collections of entities that form embedded "micro communities." Some examples include determining the originating sources for a hot new purchasing trend, identifying a terrorist cell based on patterns of communication across a broad swath of call detail records, or finding sets of individuals within a particular small geographic region who share similar political views.
• Communication models: Modeling communication across a community triggered by a specific event, such as monitoring the "buzz" across a social media channel associated with the rumored release of a new product, evaluating the best methods for communicating news releases, or correlating travel delays with increased mobile telephony activity.
• Collaborative communities: Isolating groups of individuals that share similar interests,
such as groups of health care professionals working in the same area of specialty,
purchasers with similar product tastes, or individuals with a precise set of employment
skills.
• Influence modeling: Looking for entities that hold influential positions within a network for intermittent periods of time, such as computer nodes that have been hijacked and put to work as proxies for distributed denial-of-service attacks or other emerging cybersecurity threats, or individuals who are recognized as authorities within a particular area.
• Distance modeling: Analyzing the distances between sets of entities, such as looking for strong correlations between occurrences of sets of statistically improbable phrases among large sets of search engine queries, or measuring the effort necessary to propagate a message among a set of different communities.
• Predictable interactive performance: The ad hoc nature of the analysis creates a need
for high performance because discovery in big data is a collaborative man/machine
undertaking, and predictability is critical when the results are used for operational
decision making.
• Dynamic interactions with graphs: As with any big data application, graphs to be
analyzed are populated from a wide variety of data sources of large or massive data
volumes, streamed at varying rates. But while the graph must constantly absorb many
rapidly changing data streams and process the incorporation of connections and
relationships into the persistent representation of the graph, the environment must also
satisfy the need to rapidly respond to many simultaneous discovery analyses. A high-performance graph analytics solution must accommodate this dynamic nature without any substantial degradation in analytics performance.
• Seamless data intake: Providing a seamless capability to easily absorb and fuse data
from a variety of different sources.
• Data integration: A semantics-based approach is necessary for graph analytics to integrate different sets of data that do not have a predetermined structure. As with other NoSQL approaches, the schema-less semantic model must provide the flexibility that relational database models do not.
• Inferencing: The application platform should provide methods for inferencing and
deduction of new information and insights derived from the embedded relationships and
connectivity.
• Standards-based representation: Any graph analytics platform must employ the Resource Description Framework (RDF) standard, which uses triples for representing the graph. Representing the graph using RDF allows the SPARQL query language standard to be used for executing queries against a triple-based data management environment.
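As a hedged illustration using one common open-source library (rdflib, an assumption; the text does not name a tool), the sketch below builds a small RDF graph and runs a SPARQL query against it.

```python
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")   # illustrative namespace
g = Graph()
g.add((EX.alice, EX.works_for, EX.acme))   # one (subject, predicate, object) triple
g.add((EX.bob,   EX.works_for, EX.acme))

# SPARQL: find everyone who works for acme.
query = """
    SELECT ?who
    WHERE { ?who <http://example.org/works_for> <http://example.org/acme> }
"""
for row in g.query(query):
    print(row.who)
```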
Important criteria for graph analytics due to the nature of the representation and the types of
discovery analyses performed are as follows:
• High-speed I/O: Meeting this expectation for real-time integration requires a scalable
infrastructure, particularly with respect to high speed I/O channels, which will speed the
intake of multiple data streams and thereby support graphs that are rapidly changing as
new data is absorbed.
• High-bandwidth network: As a significant burden of data access may cross node
barriers, employing a high-speed/high-bandwidth network interconnect will also help
reduce data latency delays, especially for chasing pointers across the graph.
• Multithreading: Fine-grained multithreading allows the simultaneous exploration of different paths, and efficiently creating, managing, and allocating threads to the available nodes of a massively parallel processing architecture can alleviate the challenge of predictability.
• Large memory: A very large memory shared across multiple processors reduces the need to partition the graph across independent environments, which in turn reduces the performance impacts associated with graph partitioning. Large memories are useful for keeping a large part, if not all, of the graph resident in memory. Migrating allocated tasks to where the data resides in memory reduces the impact of data-access latency.
One class of solution is a purely software approach, providing the ability to create, populate, and query graphs. This approach delivers the necessary functionality and may provide ease of implementation and deployment. Most, if not all, software implementations will use industry standards, such as RDF and SPARQL, and may even leverage complementary tools for inferencing and deduction. The performance of a software-only implementation, however, is limited by the available hardware, and running on commodity servers does not necessarily allow it to natively exploit hardware performance optimizations.
Another class is the use of a dedicated appliance for graph analytics. From an algorithmic perspective, this approach is just as capable as one that relies solely on software. From a
performance perspective, there is no doubt that a dedicated platform will take advantage of high-
performance I/O, high-bandwidth networking, in-memory computation, and native
multithreading to provide the optimal performance for creating, growing, and analyzing graphs
that are built from multiple high-volume data streams.
Software approaches may be satisfactory for smaller graph analytics problems, but as data
volumes and network complexity grow, the most effective means for return on investment may
necessitate the transition to a dedicated graph analytics appliance.