
Business Understanding

Business understanding is the essential and mandatory first phase in any data
mining or data analytics project.

It involves identifying and describing the fundamental aims of the project from a business
perspective. This may involve solving a key business problem or exploring a particular business
opportunity. Such problems might be:

 Establishing whether the business has been performing or under-performing and in which
areas
 Monitoring and controlling performance against targets or budgets
 Identifying areas where efficiency and effectiveness in business processes can be improved
 Understanding customer behaviour to identify trends, patterns and relationships
 Predicting sales volumes at given prices
 Detecting and preventing fraud more easily
 Using scarce resources most profitably
 Optimising sales or profits
Having identified the aims of the project to address the business problem or opportunity, the next
step is to establish a set of project objectives and requirements. These are then used to inform the
development of a project plan. The plan will detail the steps to be performed over the course of the
rest of the project and should cover the following:

 Deciding which data needs to be selected from internal or external sources;


 Acquiring suitable data;
 Defining the criteria that will determine whether or not the project has been a success;
 Developing an understanding of the acquired data;
 Cleaning and preparing the data for modelling;
 Selecting suitable tools and techniques for modelling;
 Creating appropriate models from the data;
 Evaluating the created models;
 Visualising the information obtained from the data;
 Implementing a solution or proposal that achieves the original business objective.
 Evaluating whether the project has been a success using the predetermined criteria.

Data Understanding
The second phase of the CRISP-DM process involves obtaining and exploring the data identified as
part of the previous phase and has three separate steps, each resulting in the production of a report.

Data acquisition
This step involves retrieving the data from their respective sources and producing a data
acquisition report that lists the sources of data, along with their provenance and the tools or
techniques used to acquire them. It should also document any issues that arose during the
acquisition, along with how they were resolved. This report will make it easier to replicate the data
acquisition process if the project is repeated in the future.

Data description
The next step requires loading the data and performing a rudimentary examination of the data to aid
in the production of a data quality report. This report should describe the data that has been
acquired.

It should detail the number of attributes and the type of data they contain. For quantitative data, this
should include descriptive statistics such as minimum and maximum values, as well as the mean,
median and other statistical measures. For qualitative data, the summary should include
the number of distinct values, known as the cardinality of the data, and how many instances of each
value exist. The first step is to describe the raw data. For instance, if analysing a purchases ledger,
you would at this stage produce counts of the number of transactions for each department and cost
centre, the minimum, mean and maximum amounts, and so on. Relationships between variables are
examined in the data exploration step (e.g. by calculating correlations). For both types of data, the
report should also detail the number of missing or invalid values in each of the attributes.

If there are multiple sources of data, the report should state on which common attributes these
sources will be joined. Finally, the report should include a statement as to whether the data acquired
is complete and satisfies the requirements outlined during the business understanding phase.
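As a rough sketch, the kind of summary that feeds this report could be produced in Python with pandas; the purchases.csv file and the column names used here are hypothetical:

# Minimal sketch of producing material for a data description/quality report.
# Assumes a hypothetical 'purchases.csv' file; column names are illustrative.
import pandas as pd

df = pd.read_csv("purchases.csv")

# Number of attributes and the type of data each contains
print(df.shape)
print(df.dtypes)

# Descriptive statistics for quantitative attributes (min, max, mean, etc.)
print(df.describe())
print(df["amount"].median())

# Cardinality and frequency of each value for a qualitative attribute
print(df["department"].nunique())
print(df["department"].value_counts())

# Missing values in each attribute
print(df.isna().sum())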

Data exploration
This step builds on the data description and involves using statistical and visualisation techniques to
develop a deeper understanding of the data and their suitability for the analysis.

These may include:

 Performing basic aggregations;


 Studying the distribution of the data, either by producing descriptive statistics such as
means, medians and standard deviations or by plotting histograms;
 Examining the relationships between pairs of attributes, e.g. by calculating correlations for
numeric data, or by using regression analysis or chi-square tests; and
 Exploring the distribution and relationships in significant subsets of the data
These exploratory data analysis techniques can help provide an indication of the likely outcome of
the analysis and may uncover patterns in the data that are worth subjecting to further
examination.
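A minimal sketch of these exploration steps in Python with pandas and matplotlib, again using the hypothetical purchases.csv file and illustrative column names:

# Minimal sketch of exploratory data analysis.
# Assumes a hypothetical 'purchases.csv' with illustrative columns
# 'department', 'amount' and 'quantity'.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("purchases.csv")

# Basic aggregation: total amount by department
print(df.groupby("department")["amount"].sum())

# Distribution of a numeric attribute: descriptive statistics and a histogram
print(df["amount"].agg(["mean", "median", "std"]))
df["amount"].hist(bins=30)
plt.show()

# Relationship between a pair of numeric attributes
print(df[["amount", "quantity"]].corr())

# Distribution within a significant subset of the data
print(df[df["department"] == "Sales"]["amount"].describe())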

The results of the exploratory data analysis should be presented as part of a data exploration
report that should also detail any initial findings.
Data Preparation
As with the data understanding phase, the data preparation phase is composed of multiple steps. It is
about ensuring that the correct data is used, in the correct form, so that the data analytics model
can work effectively:

Data selection

The first step in data preparation is to determine the data that will be used in the analysis. This decision
will be informed by the reports produced in the data understanding phase but may also be based on the
relevance of particular datasets or attributes to the objectives of the data mining project, as well as the
capabilities of the tools and systems used to build analytical models. There are two distinct types of data
selection, both of which may be used as part of this step.

Feature selection is the process of eliminating features or variables which exhibit little predictive value
or those that are highly correlated with others and retaining those that are the most relevant to the
process of building analytical models such as:

 Multiple linear regression, where the correlation between multiple independent variables and the
dependent variable is used to model the relationship between them;
 Decision trees, which simulate human approaches to solving problems by dividing the set of
predictors into smaller and smaller subsets and associating an outcome with each one; and
 Neural networks, a naïve simulation of multiple interconnected brain cells that can be configured to
learn and recognise patterns.
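As a rough sketch of one simple feature selection approach (by no means the only one), the following Python function drops numeric features that are highly correlated with a feature already retained; the DataFrame X and the 0.9 threshold are illustrative assumptions:

# Minimal sketch of dropping numeric features that are highly correlated with
# a feature already retained. X and the threshold are illustrative assumptions.
import numpy as np
import pandas as pd

def drop_highly_correlated(X: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    corr = X.corr().abs()
    # Keep only the upper triangle so each pair of features is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop)

# Usage: X_reduced = drop_highly_correlated(X, threshold=0.9)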

Sampling may be needed if the amount of data exceeds the capabilities of the tools or systems used to
build the model. This normally involves retaining a random selection of rows as a predetermined
percentage of the total number of rows. Surprisingly small samples can often give reasonably reliable
information about the wider population of data, as voter exit polls in local and national elections
demonstrate.
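A minimal sketch of this kind of sampling in Python with pandas, assuming a DataFrame df holding the full dataset; the 10% fraction and the fixed random_state are illustrative choices:

# Retain a random 10% sample of rows; random_state makes the sample repeatable.
import pandas as pd

df = pd.read_csv("purchases.csv")          # hypothetical full dataset
sample = df.sample(frac=0.10, random_state=42)
print(len(sample), "of", len(df), "rows retained")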

Any decisions taken during this step should be documented, along with a description of the reasons for
eliminating non-significant variables or selecting samples of data from a wider population of such data.

Data cleaning
Data cleaning is the process of ensuring the data can be used effectively in the analytical model. It
involves processing the missing and erroneous data identified during the data understanding
phase. Erroneous data, i.e. values outside reasonably expected ranges, are generally set as missing.

Missing values in each feature are then replaced either by using simple rules of thumb, such as setting
them equal to the mean or median of the data in the feature, or by building models that represent the
patterns of missing data and using those models to 'predict' the missing values.

Other data cleaning tasks include transforming dates into a common format and removing non-
alphanumeric characters from text. The activities undertaken, and decisions made during this step
should be documented in a data cleaning report.
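As a rough sketch, the cleaning tasks described above might look as follows in Python with pandas; the column names and the expected range for amounts are illustrative assumptions:

# Minimal sketch of basic data cleaning; columns and ranges are illustrative.
import numpy as np
import pandas as pd

df = pd.read_csv("purchases.csv")

# Treat values outside a reasonably expected range as missing
df.loc[(df["amount"] < 0) | (df["amount"] > 100_000), "amount"] = np.nan

# Replace missing values using a simple rule of thumb (the median of the feature)
df["amount"] = df["amount"].fillna(df["amount"].median())

# Transform dates into a common format
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

# Remove non-alphanumeric characters from a text field
df["customer_name"] = df["customer_name"].str.replace(r"[^A-Za-z0-9 ]", "", regex=True)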

Data integration

Data mining algorithms expect a single source of data to be organised into rows and columns. If multiple
sources of data are to be used in the analysis, it is necessary to combine them. This involves using
common features in each dataset to join the datasets together. For example, a dataset of customer
details may be combined with records of their purchases. The resulting joined dataset will have one row
for each purchase containing attributes of the purchase combined with attributes related to the
customer.
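A minimal sketch of this kind of join in Python with pandas, assuming hypothetical customers.csv and purchases.csv files that share a customer_id attribute:

# Join two sources on a common attribute; file and column names are illustrative.
import pandas as pd

customers = pd.read_csv("customers.csv")
purchases = pd.read_csv("purchases.csv")

# One row per purchase, with the customer's attributes joined on
combined = purchases.merge(customers, on="customer_id", how="left")
print(combined.head())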

Feature engineering

This optional step involves creating new variables or derived attributes from the features originally
included, in order to improve the model’s capability. It is frequently performed when the data analyst
believes that the derived attribute or new feature is likely to make a positive contribution to the
modelling process and where it captures a complex relationship that the model is unlikely to infer by
itself.

An example of a derived feature might be adding attributes such as the amount a customer spends
on different products in a given time period, how soon they pay and how often they return goods, in
order to assess the profitability of that customer more reliably, rather than just measuring the gross
profit generated by the customer based on sales values.
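A rough sketch of deriving such customer-level features in Python with pandas, assuming a hypothetical purchases file with illustrative column names (amount, invoice_date, payment_date and a 0/1 returned flag):

# Minimal sketch of feature engineering: derive per-customer attributes.
import pandas as pd

purchases = pd.read_csv("purchases.csv",
                        parse_dates=["invoice_date", "payment_date"])

# Derived attribute: how soon each invoice was paid
purchases["days_to_pay"] = (purchases["payment_date"] - purchases["invoice_date"]).dt.days

# Derived features per customer: total spend, average days to pay, return rate
customer_features = purchases.groupby("customer_id").agg(
    total_spend=("amount", "sum"),
    avg_days_to_pay=("days_to_pay", "mean"),
    return_rate=("returned", "mean"),   # 'returned' assumed to be a 0/1 flag
)
print(customer_features.head())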
Modelling
This key part of the data mining process involves creating generalised, concise representations of
the data. These are frequently mathematical in nature and are used later to generate predictions
from new, previously unseen data.

Determine the modelling techniques to be used


The first step in creating models is to choose the modelling techniques which are the most
appropriate, given both the nature of the analysis and of the data used. Many modelling methods
make assumptions about the nature of the data. For example, some methods can perform well in the
presence of missing data whereas others will fail to produce a valid model.

Design a testing strategy


Before proceeding to build a data analytics model, you will need to determine how you are going to
assess the quality of the model's predictive ability, in other words, how well the model will perform on
data it hasn't yet seen. This involves keeping aside a subset of the data for this purpose and using it
to evaluate how far the model's predictions of the dependent variable are from the actual values in
that data.
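A minimal sketch of such a hold-out testing strategy in Python with scikit-learn, assuming the features X and the dependent variable y have already been prepared in the earlier phases; the 80/20 split and the use of linear regression are illustrative choices:

# Hold-out testing strategy: train on 80% of the data, evaluate on the unseen 20%.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the model on the training data only
model = LinearRegression().fit(X_train, y_train)

# Judge quality on the data the model has not yet seen
predictions = model.predict(X_test)
print("Mean absolute error:", mean_absolute_error(y_test, predictions))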

Evaluation
At this stage in the project, you need to verify and document that the results you have obtained from
modelling are reliable enough for you to confirm or reject the hypotheses formulated in the
business understanding stage. For example, if you have performed a multiple regression analysis to
predict sales based on weather patterns, are you sure that the results you have obtained are
statistically significant enough for you to implement the solution, and have you checked that there are
no other intermediate variables linked to the X and Y variables in your relationship which provide a more
direct causal link?

Before proceeding to final deployment of the model, it is important to evaluate it thoroughly and to
review the steps executed to create it, to be certain the model properly achieves the business
objectives. A key aim is to determine whether there is some important business issue that has not
been sufficiently considered. At the end of this phase, a decision on the use of the data mining
results should be reached.

At this stage, you will determine whether it is feasible to move on to the final phase, deployment, or whether
it is preferable to return to and refine some of the earlier steps. The outcome of this phase should be
a document providing an overview of the evaluation and details of the final decision together with a
supporting rationale for proceeding.
Deployment
During this final phase, the outcome of the evaluation will be used to establish a timetable and
strategy for the deployment of the data mining models, detailing the required steps and how they
should be implemented.

Data mining projects are rarely 'set it and forget it' in nature. At this time, you will need to develop a
comprehensive plan for the monitoring of the deployed models as well as their future maintenance.

This should take the form of a detailed document. Once the project has been completed there
should be a final written report, re-stating and re-affirming the project objectives, identifying the
deliverables, providing a summary of the results and identifying any problems encountered and how
they were dealt with.

Depending on the requirements, the deployment phase can be as simple as generating a report and
presenting it to the sponsors or as complex as implementing a repeatable data mining process
across the enterprise.

In many cases, it is the customer, not the data analyst, who carries out the deployment steps.
However, even if the analyst does carry out the deployment, it is important for the customer to
clearly understand which actions need to be carried out in order to actually make use of the created
models.

This is where data visualisation is most important as the data analyst hands over the findings from
the modelling to the sponsor or the end user and these should be presented and communicated in a
form which is easily understood.

The 3 V’s of Big Data


The main focus in big data and the digital revolution is not so much the quantity of data,
although this is a big advantage, but rather the speed and currency of the data and the
variety of forms in which it is made available. Sophisticated data analytics is about accessing data that is
useful for decision making, and the three things that big data brings to improve the quality of
decision making are:

Volume
In data analytics, the amount of data can make a difference. With big data, you will often have to
process large amounts of data, most of it unstructured and with low information density.

The term volume is used to refer to these massive quantities of data. Most of this may have no value, but
you will not know until you try to structure and use it. This data can come from a wide
range of sources, as we will see later, and could include social media data, hits on a website, or results
from surveys or approval ratings given by consumers.

The main benefit from the volume of big data is the additional reliability it gives the data analyst. As any
statistician knows, the more data you have, the more reliable your analysis becomes and the more
confident you are about using the results you obtain to inform decision-making.

For some organisations the quantity of this data will be enormous, and will be difficult to collect, store
and manage without the correct infrastructure, including adequate storage and processing capacity.

Velocity

Velocity refers to the rate at which data is received, stored and used. In today’s world transactions are
conducted and recorded in real time. Every time you scan your goods at a supermarket, the store knows
instantly how much inventory it still has available and so it knows as soon as possible when it needs to
re-order each item.

Similarly, as people shop with debit and credit cards using their phone apps, these transactions are
updated immediately. The bank knows immediately that funds have gone out of your account. The
business also knows that funds have been transferred into their account – all in real time.

Variety

In the past, the data we collected electronically came in the familiar rows-and-columns form
encountered in any database or spreadsheet. With the advent of the Internet and, more recently, the
World Wide Web, the forms data comes in have broadened significantly. Variety refers to the multitude
of types of data that are available for analysis as a result of these changes. Thanks to rapid
developments in communications technology, the data we store increasingly comes in forms
which possess far less structure. Examples include numerical data, plain text, audio, pictures and
videos.
With the increasingly prevalent use of mobile internet access, sensor data also counts as a new data
type. We still also have the more traditional data types; those that are highly structured such as data
held in relational databases. Corporate information systems such as Enterprise Resource Planning,
Customer Relationship Management and financial accounting functions employ such database systems
and the data these systems contain are a valuable resource for data analysts to work with.

Unstructured data require significant additional processing, as in the data preparation stage of the CRISP-DM
framework, to transform them into meaningful and useful data which can support decision-
making. Being able to access and use them, however, provides richer information, which can make the
results of the analysis more relevant and significant than larger amounts of data from more structured
sources.

The value and lessons to be learned from Big Data
Big data has become an important form of organisational capital. For some of the world’s biggest
tech companies, such as Facebook, a large part of the value they offer comes from their data, which
they’re constantly analysing to produce more efficiency and develop new revenue streams.

However, the impact of big data and data reliance doesn't stop with the tech giants. Data is
increasingly considered by many enterprises to be a key business asset with significant potential
value.

Data which is not used or analysed has no real value. However, value can be added to data as it is
cleaned, processed, transformed and analysed. The data collected can be considered to be the raw
material, as in a manufacturing process, and is frequently referred to as 'raw data'.

Some of this raw material is unrefined, such as unstructured data, and some refined, as is the case with
structured data. Such data needs to be stored in a virtual warehouse, such as a cloud storage provider or
an on-premise storage solution.

The cleaning and transformation of the data into a suitable form for analysis is really where the value is
being added, so that the data can become the finished product - the useful information which needs to
be delivered or communicated to the user. Reliable, timely and relevant information is what the
customer wants.
What about the veracity of your data?
Deriving value from big data isn’t only about analysing it. It is a discovery process that requires
insightful analysts, business users and managers who ask the right questions, recognise patterns,
make informed assumptions, and predict behaviour.

If the original assumptions are wrong, the interpretation of the original business question or issue is
incorrect, or the integrity of the data used in the analysis is suspect, the data analysis may yield
unreliable or irrelevant information. A data analyst must be sceptical of the information that comes
out of the data analytics process and properly challenge or verify what it is saying.

Recent technological breakthroughs have dramatically reduced the cost of data storage and
computing, making it easier and less expensive to store and process more data than ever before. As
handling big data becomes cheaper and more accessible, it is possible to make
more accurate and informed business decisions, as long as the big data is stored, processed and
interpreted appropriately.

Platforms for Big Data storage and processing


HDFS

The Hadoop Distributed File System allows the storage of extremely large files in a highly redundant
manner, using a cluster of computers, in this case built using ‘off-the-shelf’ commodity hardware.

MapReduce

This is a divide and conquer approach to big data processing, allowing processing of data to be
distributed across multiple computers in a Hadoop cluster.
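The idea can be illustrated, very loosely, with the canonical word-count example written in plain Python rather than Hadoop's own API: each document is mapped independently to (word, 1) pairs, which are then reduced by key into totals; in a Hadoop cluster the map and reduce steps run in parallel across many machines.

# Plain-Python illustration of the map/reduce idea (not Hadoop's actual API).
from collections import defaultdict

documents = ["big data needs big storage", "data drives decisions"]

# Map step: emit a (key, value) pair for every word in every document
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Reduce step: combine all values that share the same key
counts = defaultdict(int)
for word, n in mapped:
    counts[word] += n

print(dict(counts))  # e.g. {'big': 2, 'data': 2, ...}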

Hive

Hive is a query tool used to analyse large sets of data stored on HDFS. It uses a SQL-like language,
the Hive Query Language (HiveQL). It is a declarative language – in other words, you specify what you
want, not how to retrieve it.

Pig

Another high-level programming language used to query large data sets stored on HDFS. It is a data-flow
language, specifying the flow of data from one task to another.

HBase

A NoSQL database that runs on Hadoop clusters. NoSQL stands for 'Not Only SQL' and is a pattern of data
access better suited to larger data stores. It differs from relational databases in a number of ways,
not least in that it stores data on disk organised by column rather than by row.
Drill

A data processing environment for large-scale data projects where data is spread across thousands of
nodes in a cluster and the volume of data is in the petabytes.

Descriptive Analytics
Descriptive analytics takes raw data and summarises or describes it in order to provide useful
information about the past. In essence, this type of analytics attempts to answer the question 'What
has happened?'.

Descriptive analytics does exactly what the name implies: it 'describes' raw data and
allows the user to see and analyse data which has been classified and presented in some logical
way. It is analytics that describes the past. The past refers to any point in time at which an event
occurred, whether that was a second ago or a year ago.

Descriptive analytics are useful because they allow analysts to learn from past behaviours, and
understand how they might influence future outcomes. Spreadsheet tools such as filtering and pivot
tables are an excellent way to view and analyse historic data in a variety of ways.

Descriptive statistics can be used to show many different types of business data, such as total
sales by volume or value, cost breakdowns, average amounts spent per customer and profitability
per product.

Descriptive Analytics
An example of this kind of descriptive analytics is a retailer tracking the sales, cost of sales (COS)
and gross profit margin (GP) of a range of five products in each of six retail outlets over time, to
establish trends and/or to detect potential fraud or loss.

By looking at the overall figures for the company as a whole, by individual product across
the company, or for a store as a whole, the business leader may not notice from a chart or graph of
these measures any unusual trends or departures from the expected levels.

Only by analysing and charting these trends more closely by product, in each individual store (for example by
using pivot tables), could the business leader detect if and where there is any specific fraud or loss.
Such discrepancies become more apparent when this type of micro-level descriptive analysis is
undertaken.

This database shows a data table with three products sold by a business: how
much each product sells for, how much it costs to make or purchase, and what cost is associated
with the return of each unit of the product by customers. The database itself shows, for each of the
30 customers, how much of each product they have purchased and how many units they have returned,
together with an analysis of how much sales revenue and gross profit each customer has generated. The
database also shows the totals, means and medians of each measure.

Using data analytics, this type of descriptive analysis could help the business understand more
about the following business questions:

 Which products are customers buying?


 Which customers bring in the most revenue?
 Which customers are the most profitable?
 Which customers return the most or fewest goods?
 Which products are being returned the most or the least?
This kind of descriptive analytics can help the business manager understand their customers
and their buying behaviour so that they can improve their marketing and promotion with
these customers and target their communications to them more effectively. The business can
also gain a greater understanding of its costs, such as the costs of returns associated with
different products and customers, and try to find out why some products or some customers
cost more due to returns, and address these issues.
The spreadsheet above uses filters at the top of each column so that the analyst can sort the
data in any way they choose. For example, they might wish to see customers listed
in order of sales, profitability or the number of returns they process.
A powerful tool to use in descriptive analytics is the pivot table in Excel. Pivot tables allow the
original data table in a spreadsheet to be presented in a number of different ways, where the
rows and columns can be interchanged or where only certain fields or data are displayed.
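The same kind of reshaping can be sketched outside Excel, for example in Python with pandas; the sales.csv file and its column names here are purely illustrative:

# Minimal sketch of a pivot-table style summary with pandas.
import pandas as pd

sales = pd.read_csv("sales.csv")   # hypothetical transaction-level data

# Total returned quantity per customer, with one column per product
pivot = pd.pivot_table(
    sales,
    values="quantity_returned",
    index="customer",
    columns="product",
    aggfunc="sum",
    fill_value=0,
)
print(pivot)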
By examining this, it seems that the total numbers of returns of products 1 and 3 are very
similar across all customers, but what does this really tell us? To get more insight, it
would be necessary to compare these returns figures with the actual quantities of each
product sold to each customer, to identify what percentage of each product was returned
overall. More relevant still would be an analysis of the percentage of each product
returned by individual customers, to establish which customers sent back the greatest
proportion of each product type, requiring an even more targeted analysis.
An increasingly popular area in which to apply descriptive data analytics is finance, using
externally available information from the stock markets to help inform and support investment
decisions. Many analysts source their data from a range of external sources such as Yahoo
Finance, Google Finance or other easily accessible, free-to-use databases. This means that
historical data on share prices and stock market indices is readily and widely available for
anyone to use.
As an example, finance analysts often need to calculate the riskiness of stocks in order to
estimate the equity cost of capital and to inform their investment plans.
Suppose an analyst wants to estimate the beta of Amazon shares against the Standard and Poor's
(S&P) 100 stock index. The beta measures how volatile the periodic returns of this share
have been relative to the S&P index as a whole.
To do this, the analyst would access the financial data from an external website and
download it into their spreadsheet.
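As a rough sketch of the calculation itself, assuming the monthly closing prices have already been downloaded into a hypothetical prices.csv with amzn_close and index_close columns, the beta can be estimated in Python as the covariance of the share's returns with the index divided by the variance of the index's returns:

# Minimal sketch of estimating a share's beta against an index.
# File name and column names are illustrative assumptions.
import pandas as pd

prices = pd.read_csv("prices.csv")

# Periodic (monthly) returns for the share and for the index
returns = prices[["amzn_close", "index_close"]].pct_change().dropna()

# Beta = cov(share returns, index returns) / var(index returns)
beta = returns["amzn_close"].cov(returns["index_close"]) / returns["index_close"].var()
print("Estimated beta:", beta)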
Predictive Analytics
Predictive analytics builds statistical models from processed raw data with the aim of being able to
forecast future outcomes. It attempts to answer the question 'What will happen?'

This type of analytics is about understanding the future. Predictive analytics provides businesses
with valuable insights based on data, allowing analysts to extrapolate from the past to anticipate
behaviours and outcomes in the future. It is important to remember that data analytics cannot be
relied upon to exactly "predict" the future with complete certainty. Business managers should
therefore be sceptical, recognise the limitations of such analytics, and remember that any prediction
can only be based on reasonable probabilities and assumptions.

These analytics use historical (descriptive) data and statistical techniques to estimate future
outcomes based on observed relationships between attributes or variables. They identify patterns in
the data and apply statistical models and algorithms to capture relationships between various data
sets. Predictive analytics can be used throughout the organisation, from forecasting customer
behaviour and purchasing patterns to identifying trends in manufacturing processes and the
predicted impact on quality control.

Regression
Regression analysis is a popular method of predicting a continuous numeric value. A simple
example in a business context would be using past data on sales volumes and advertising spend to
build a regression model that allows managers to predict future sales volumes on the basis of the
projected or planned advertising spend. Using a single predictor or independent variable to forecast
the value of a target or dependent variable is known as simple regression. The inclusion of multiple
independent variables is more typical of real-world applications and is known as multiple regression.

The simplest regression models, such as those produced by Microsoft Excel, assume that the
relationship between the independent variables and the dependent variable is strictly linear. It is
possible to accommodate a limited range of alternative possible relationships by transforming the
variables using logarithms or by raising them to a power. More sophisticated algorithms can model
curved or even arbitrarily-shaped relationships between the variables.

The performance of a regression model is determined by how far its predictions are from the
actual values. If the magnitude of the errors is a particular concern, the squared differences are used;
otherwise the absolute differences are used. In Excel, this information is given by a regression
output table which describes the relationship between the independent variable(s) and the dependent
variable. The key statistic is R², which ranges from 0, indicating a completely random association, to 1,
indicating a perfect fit. The statistical significance of the relationships can also be confirmed by
looking at the p-values and the Significance F, which should be sufficiently small to allow greater
confidence.
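For comparison, a similar regression output (R², coefficient p-values and the equivalent of Significance F) can be produced in Python with statsmodels; the advertising and sales figures below are purely illustrative:

# Minimal sketch of a simple regression whose output mirrors Excel's
# regression table. The data are illustrative assumptions.
import pandas as pd
import statsmodels.api as sm

data = pd.DataFrame({
    "advertising": [10, 12, 15, 18, 20, 24, 27, 30],
    "sales":       [105, 118, 130, 152, 161, 180, 196, 212],
})

X = sm.add_constant(data["advertising"])   # adds the intercept term
model = sm.OLS(data["sales"], X).fit()

print(model.rsquared)    # R-squared
print(model.pvalues)     # p-values for the intercept and the coefficient
print(model.f_pvalue)    # equivalent of Excel's Significance F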

One common application most people are familiar with is the use of predictive analytics to estimate
sales of a product based on different factors such as the weather.

The following spreadsheet includes data on monthly barbecue sales and how these are potentially
influenced by:
 Wet days in the month
 Average monthly temperature
 Monthly hours of sunshine

To enable data analysis tools such as Regression and Solver in Excel on your device, you
need to go to:
File > Options > Add-Ins
Then, at the bottom, click Go and check all the available boxes.
