08 CoreText M6 KTP1-Introduction To Big Data
Module Overview:
Driven by the increasing power and availability of analytic technology, the scale and scope of data available to oversight organizations for audit work continue to increase at a rapid pace. Obtaining access to data, analyzing it, and developing insights will remain an essential part of the work that Supreme Audit Institutions (SAIs) do, both now and in the future.
This module provides an overview of big data: what the characteristics of big data are, what kinds of data sources can be distinguished, how these data can be categorized, and how data should be processed and stored. We will also discuss how data analytics helps auditors. A common model for performing data analytics will be introduced, and examples of audit practice using the respective data analysis technologies will be presented. We end with a discussion of the opportunities and challenges of data analytics that SAIs may face in the era of big data.
At the end of the module, participants will be able to describe big data and data analytics, covering data sources, data types and the process of data analytics, as evaluated by the mentors.
6.1.1 Overview
Big data is a popular term used to describe the exponential growth and availability of data created by people, applications, and organizations. Wikipedia defines the term as a collection of data sets so large and complex that they become difficult to process using on-hand database management tools or traditional data processing applications. The proliferation of structured and unstructured data, combined with technical advances in storage, processing power, and analytic tools, has enabled big data to become a competitive advantage for governments that use it to gain insights into national governance and to assist in rational decision-making.
Given that big data is closely related to national governance and sustainable development in many countries, and that the mission of SAIs is to enhance accountability, promote good governance and monitor the fulfillment of sustainable development goals, embracing big data and strengthening big-data-assisted auditing is both a development trend and a practical necessity. It also, however, raises the level of knowledge and skills auditors need to work with big data effectively.
This session provides an overview of big data: what the characteristics of big data are, what kinds of data sources can be distinguished, how these data can be categorized, and how data should be processed and stored. Big data is commonly characterized by four Vs - volume, variety, velocity and veracity - which we discuss in turn.
Volume refers to the quantity of data: big data is frequently defined in terms of massive data sets, with measures such as petabytes and zettabytes commonly referenced. These vast amounts of data are generated every second. The more data we acquire, the more insights we can extract from it; thus the volume of data plays a crucial role in determining the value that can be derived from it.
Variety refers to the increasingly diversified sources and types of data requiring management and analysis. We used to store data from sources like spreadsheets and databases. Now data comes in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. So we need to integrate these complex and multiple data types - structured and unstructured - from an array of systems and sources, both internal and external. The more varied the data we acquire, the more multifaceted a view we can develop. However, the variety of unstructured data creates problems for storing, mining and analyzing the data.
Velocity in big data refers to the accelerating speed at which data flows in from sources like business transactions, machines, networks and human interaction with things like social media applications and mobile devices. The flow of data is massive and continuous. Rapid data ingestion and rapid analysis capabilities provide us with timely and correct insights.
Veracity refers to the biases, noise and abnormality in the data being generated. Is the data that is being stored and mined meaningful to the problem being analyzed? Given the increasing volume of data generated at an unprecedented rate, and in ever more diverse forms, there is a clear need to manage the uncertainty associated with particular types of data.
As systems become more efficient and the need to process data faster continues to grow, the original data management dimensions have been expanded to include other characteristics unique to big data: validity, variability, visualization, volatility and value.
Validity, like veracity, concerns whether the data are correct and accurate for their intended use. Big data sources and the subsequent analysis must be valid if we are to use the results for decision making.
Variability is different from variety. It refers to data whose meaning is constantly changing. This
is particularly the case when gathering data relies on language processing.
Visualization. Analytics results from big data are often hard to interpret. Therefore, translating
vast amounts of data into readily presentable graphics and charts that are easy to understand is
critical to end-user satisfaction and may highlight additional insights.
Volatility refers to how long the data is valid and how long it should be stored. In this world of
real-time data, we need to determine at what point the data is no longer relevant to the current
analysis.
Value. Last, but arguably the most important of all, is value. The other characteristics of big data are meaningless if we don't derive value from the data. Organizations, societies and governments can all benefit from big data. Value is generated when new insights are translated into actions that create positive outcomes.
Since the data collected by the SAI comes from different sources (inside and outside government), in multiple data types (structured and unstructured) and in huge quantities, audit data naturally has the characteristics of big data.
combined with modern big data tooling, like Hadoop. This combination of technologies makes it possible to deal with the challenges posed by different types of data sources, and with specific challenges in data volume and variety. Within these data sources, we distinguish between internal versus external data sources and structured versus unstructured data sources. The different combinations of internal versus external and structured versus unstructured data are shown in Figure 6.1.
Figure 6.1: Two Dimensions of Data: Data Source and Data Type
information may, for example, involve the relationships among people. Credit card firms such as Mastercard and Visa have huge customer populations and good knowledge of whether one consumer is related to another, gained by analyzing the use and repayment of their credit cards. This could be quite useful for auditors in detecting fraud and corruption.
Another type of fast-growing external data is data from social media. These data come from parties like Facebook, Twitter, LinkedIn, and WeChat, all of which have huge user bases, so that often a substantial proportion of the people connected with an audited entity participate in these social media. Linking these data to the audit data environment is quite a challenge and not yet common practice, mainly because of the highly unstructured way these data are created. Some companies collect these data using platforms like Radian6 for social media monitoring. Such data could also be used to perceive social trends and to assist the strategic planning of the SAI.
Internal data sources are already present within audited entities and may include financial statement data, transaction data, invoice data, contact data, and usage data. These internal data sources are gathered and stored in the audited entity's information systems. Internal data form the basis for auditors' work, since the audited entity's behavior is recorded there and can be described and understood by extracting insights from these data.
Historically, the majority of data stored within audited entities has been structured and maintained within relational - or even legacy hierarchical or flat-file - databases. This allows for repeatable queries, as much of the data is held in relational tables, and it is often easier to control than unstructured data, due to defined ownership and vendor-supported database solutions. However, the use of unstructured data is growing and becoming more common within audited entities. This type of data is not confined to traditional data structures or constraints. It is typically more difficult to manage, due to its evolving and unpredictable nature, and it is usually sourced from large, disparate, and often external data sources. Consequently, new solutions have been developed to manage and analyze this type of data. See Figure 6.2 for a diagram that shows the difference between structured and unstructured data.
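The practical difference can be illustrated with a minimal Python sketch (the invoice rows and the email text below are invented for illustration): querying structured data is a direct field lookup, while extracting the same facts from unstructured text first requires pattern matching.

```python
import csv
import io
import re

# Structured data: rows conform to a fixed schema, so a simple
# field lookup answers the question directly.
structured = io.StringIO(
    "invoice_id,supplier,amount\n"
    "1001,Acme Ltd,250.00\n"
    "1002,Beta Co,975.50\n"
)
rows = list(csv.DictReader(structured))
total = sum(float(r["amount"]) for r in rows)

# Unstructured data: the same facts buried in free text must be
# extracted with pattern matching before they can be analyzed.
email = "Please process invoice 1003 from Acme Ltd for the amount of 420.75 by Friday."
match = re.search(r"invoice (\d+).*amount of ([\d.]+)", email)
invoice_id, amount = match.group(1), float(match.group(2))

print(total)                # 1225.5
print(invoice_id, amount)   # 1003 420.75
```

Real unstructured sources (emails, PDFs, audio) are far messier than this one sentence, which is why they demand the specialized tooling discussed above.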
Figure 6.3: Data Warehouse
Data lakes (see Figure 6.4) are becoming an increasingly popular solution to support big data storage and data discovery. Data lakes are similar to data warehouses in that they store large amounts of data from various sources, but they can store data in its natural format, which facilitates the collocation of data in various schemata and structural forms. The idea of a data lake is to have a single store of all data in the enterprise, ranging from raw data (an exact copy of source system data) to transformed data used for various tasks including reporting, visualization, analytics and machine learning. The data lake includes structured data from relational databases and unstructured data from different sources, creating a centralized data store that accommodates all forms of data. This technology gives organizations the ability to store and use big data, allowing them to embrace non-traditional data types. Hadoop, an open source big data framework that has grown exponentially over the last decade, aligns well with the data lake model because of the sheer amount of data it can handle.
Figure 6.4: Data Lake
Table 6.1 below highlights a few of the key differences between a data warehouse and a data lake.
Data. A data warehouse only stores data that has been modeled/structured, while a data lake is no respecter of data: it stores all of it, structured and unstructured.
Processing. Before we can load data into a data warehouse, we first need to give it some shape and structure - i.e., we need to model it. That's called schema-on-write. With a data lake, you just load in the raw data, and then when you're ready to use the data, that's when you give it shape and structure. That's called schema-on-read.
Storage. One of the primary features of big data technologies like Hadoop is that the cost of data storage is relatively low compared to a data warehouse. There are two key reasons for this: first, Hadoop is open source software, so licensing and community support are free; and second, Hadoop is designed to be installed on low-cost commodity hardware.
Agility. A data warehouse is a highly-structured repository, by definition. It’s not technically
hard to change the structure, but it can be very time-consuming given all the processes that
are tied to it. A data lake, on the other hand, lacks the structure of a data warehouse, which
gives developers the ability to easily configure and reconfigure their models, queries, and
apps.
Security. Data warehouse technologies have been around for decades, while big data
technologies are relatively new. Thus, the ability to secure data in a data warehouse is much
more mature than securing data in a data lake. It should be noted, however, that there’s a
significant effort being placed on security right now in the big data industry. It’s not a
question of if, but when.
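The schema-on-write versus schema-on-read distinction described above can be sketched in a few lines of Python (a toy illustration, not any particular warehouse or lake product):

```python
import json

# Schema-on-write (data warehouse): data is validated and shaped
# *before* it is stored; only records that fit the model get in.
def write_to_warehouse(store, record):
    shaped = {"id": int(record["id"]), "amount": float(record["amount"])}
    store.append(shaped)

# Schema-on-read (data lake): raw records are stored as-is ...
def write_to_lake(store, raw_line):
    store.append(raw_line)  # keep the raw form, extras and all

# ... and structure is applied only when the data is read for analysis.
def read_from_lake(store):
    for raw_line in store:
        rec = json.loads(raw_line)
        yield {"id": int(rec["id"]), "amount": float(rec["amount"])}

warehouse, lake = [], []
write_to_warehouse(warehouse, {"id": "1", "amount": "99.5"})
write_to_lake(lake, '{"id": "1", "amount": "99.5", "note": "raw extras kept"}')
print(warehouse[0])                   # {'id': 1, 'amount': 99.5}
print(next(read_from_lake(lake)))     # {'id': 1, 'amount': 99.5}
```

Both paths yield the same shaped record in the end; the difference is that the lake still holds the raw line (including the extra "note" field), so a later analysis can apply a different schema to the same data.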
An important element of good database design is that the data are arranged in such a way that they can easily be retrieved by users. The most useful database structure may, however, depend on the user. The director of a specific division in a hospital may be interested in the performance of subordinate doctors, while an inventory manager likely wants to know which kinds of medicine are most often prescribed. An auditor focusing on procurement is most likely interested in supplier details, such as quantities and prices. To serve these different needs, relational database structures are currently the standard. Relational databases use key variables that link several tables to each other. For example, in a patient management context the patient id and the doctor id are usually key variables. In an inventory context the doctor id can again be a key variable, as can the medicine id. Based on the type of information required, a primary key variable, a secondary key variable, etc. can be distinguished. In a patient database, the patient id would be the primary key variable, while in an inventory database the medicine id is the primary key variable.
Table 6.2 shows an example of the structure of a patient database. In this example, the patient id is the primary key variable. The patient name provides further information on that patient. The doctor id is a secondary key variable, and the disease id is another. One can easily add more key variables, such as prescription id and medicine id. From this database, one can in principle derive a database with the other, non-primary key variables as key variables. For example, if one wanted to take the disease id as a key variable, a database could be created through aggregation and transformation procedures which shows how many patients suffer from a specific disease and which doctor is most experienced in treating it (see Table 6.3).
Table 6.2 Examples of simple data table with patient as the central element
Table 6.3 Examples of disease data table derived from the patient database
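This kind of derivation can be sketched with a small relational example using Python's built-in sqlite3 module (the ids and names below are invented for illustration):

```python
import sqlite3

# In-memory database with a patient table like Table 6.2:
# patient_id is the primary key; doctor_id and disease_id are
# secondary key variables linking to other tables.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute(
    "CREATE TABLE patients ("
    "patient_id INTEGER PRIMARY KEY, name TEXT, "
    "doctor_id INTEGER, disease_id INTEGER)"
)
cur.executemany(
    "INSERT INTO patients VALUES (?, ?, ?, ?)",
    [(1, "Smith", 10, 100), (2, "Jones", 10, 100), (3, "Brown", 11, 101)],
)

# Deriving a disease-centred table (like Table 6.3): aggregate the
# patient table with disease_id as the key variable.
cur.execute(
    "SELECT disease_id, COUNT(*) FROM patients "
    "GROUP BY disease_id ORDER BY disease_id"
)
disease_counts = cur.fetchall()
print(disease_counts)   # [(100, 2), (101, 1)]
```

The GROUP BY aggregation is exactly the "aggregation and transformation procedure" described above: the same underlying records, re-keyed on a different key variable.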
If integration of databases is not done frequently, a recent product purchase or a recent defection might not yet have been included, leading to unreliable information. Such mistakes can have strongly negative consequences for the audited entity. For example, a social security department may continue paying a pension to someone who has recently passed away. Moreover, out-of-date data at the audited entity may also lead to incorrect audit findings. There are several options available for solving data quality problems. One option is to use reference databases for cleaning historical data quality issues in the databases; these can also be used as reference tables during data entry. Another option is a software approach to data cleaning, for example using tools that recognize duplicates within a dataset, or that recognize different ways of registering a specific address.
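A minimal sketch of the software approach to duplicate detection, using Python's standard difflib to flag addresses that were entered differently but probably refer to the same supplier (the records and the 0.8 similarity threshold are illustrative assumptions, not a production rule):

```python
import difflib

# Hypothetical supplier records with inconsistently entered addresses.
records = [
    ("S001", "12 High Street, Springfield"),
    ("S002", "12 High St., Springfield"),
    ("S003", "45 Oak Avenue, Rivertown"),
]

def likely_duplicates(recs, threshold=0.8):
    """Flag pairs of records whose addresses are suspiciously similar."""
    pairs = []
    for i in range(len(recs)):
        for j in range(i + 1, len(recs)):
            ratio = difflib.SequenceMatcher(
                None, recs[i][1].lower(), recs[j][1].lower()
            ).ratio()
            if ratio >= threshold:
                pairs.append((recs[i][0], recs[j][0]))
    return pairs

print(likely_duplicates(records))
```

Here S001 and S002 are flagged as a likely duplicate pair, while S003 is not. Real cleaning tools add address normalization and reference tables on top of this kind of fuzzy matching.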
Data will never be perfect. One important problem auditors may face is missing values for variables in a dataset. For example, there may be no contact number or address for some of an audited entity's suppliers. Missing values are a common occurrence; they reduce the representativeness of the dataset and can distort inferences and conclusions drawn from the data. One easy way to deal with missing data is to throw away observations with missing values. This may, however, cause sampling problems, especially when missing values occur frequently in a non-random fashion - for example, mainly for suppliers related to a specific project. Data may be missing because suppliers were not asked to provide it, but missingness may also carry meaning. For instance, suppliers may deliberately withhold data because the quality of the products they provided was poor, or because fraud lies behind the purchase. In the latter case missing values can potentially be predictors of behavior; for example, auditors might assume that suppliers with missing values are more likely to commit fraud. In general, therefore, auditors should carefully analyze the reasons for multiple missing values and consider whether they occur in a random or non-random fashion. Understanding the reasons for and the nature of missing values is important for handling the remaining data appropriately. If the missingness is a random event, one could probably delete observations with missing values or replace them with, for example, the mean, median or mode of the available values.
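The two basic options for handling randomly missing values - dropping incomplete observations or imputing them with a summary statistic - can be sketched as follows (the contract amounts are invented for illustration):

```python
import statistics

# Hypothetical supplier records; None marks a missing contract amount.
amounts = [120.0, None, 95.0, None, 130.0, 115.0]

# Option 1: drop observations with missing values (defensible only
# when the values are missing at random).
complete = [a for a in amounts if a is not None]

# Option 2: impute the missing values, e.g. with the mean (the median
# or mode of the available values would work the same way).
mean_value = statistics.mean(complete)
imputed = [a if a is not None else mean_value for a in amounts]

print(complete)     # [120.0, 95.0, 130.0, 115.0]
print(mean_value)   # 115.0
print(imputed)      # [120.0, 115.0, 95.0, 115.0, 130.0, 115.0]
```

If the missingness is non-random - say, concentrated among suppliers on one project - neither option is safe, and the pattern of missingness itself becomes an audit signal, as noted above.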
6.1.6 Summary
In this session, we started with the characteristics of big data: four common Vs and five additional Vs that define it. We also made the distinction between different data sources, grouping them in two ways: structured versus unstructured and internal versus external. We stressed that big data is not just about internal structured sources, but also has to deal with the other sources. The description of database structures and of possible issues with data quality and missing values should help in tackling practical issues around working with big data.