
PREDICTIVE ANALYTICS

Class : III B.Sc. CS./IT/CT/BCA


Introduction to Predictive Analytics
• Analytics is the process of using computational methods to discover and report influential patterns in data.
• The goal of analytics is to gain insight and often to affect decisions.
• Data is necessarily a measure of historic information; so, by definition, analytics examines historic data.
• The ideas behind analytics are not new at all but have been represented by different terms throughout the
decades, including cybernetics, data analysis, neural networks, pattern recognition, statistics, knowledge
discovery, data mining, and now even data science.
What Is Predictive Analytics?
• Predictive analytics is the process of discovering interesting and meaningful patterns in data.
• It draws from several related disciplines, some of which have been used to discover patterns in data for
more than 100 years, including pattern recognition, statistics, machine learning, artificial intelligence, and
data mining
What differentiates predictive analytics from other types of analytics?
First, predictive analytics is data-driven, meaning that the algorithms derive the key characteristics of the
models from the data itself; in other words, data-driven algorithms induce models from the data.
Second, predictive analytics algorithms automate the process of finding the patterns from the data.
Powerful induction algorithms not only discover coefficients or weights for the models, but also the
very form of the models.
Decision tree algorithms, for example, learn which of the candidate inputs best predict a target
variable, in addition to identifying which values of those variables to use in building predictions.
Other algorithms can be modified to perform searches, using exhaustive or greedy searches to find
the best set of inputs and model parameters.
If the variable helps reduce model error, the variable is included in the model. Otherwise, if the
variable does not help to reduce model error, it is eliminated.
• Predictive analytics doesn’t do anything that any analyst couldn’t accomplish with pencil and paper or
a spreadsheet if given enough time; the algorithms, while powerful, have no common sense. Consider
a supervised learning data set with 50 inputs and a single binary target variable with values 0 and 1.
• One way to identify which of the inputs is most related to the target variable is to plot each
variable, one at a time, in a histogram, with the target variable superimposed on each histogram.
With 50 inputs, you need to look at 50 histograms. This is not uncommon for predictive modelers to do.
• If the patterns require examining two variables at a time, you can do so with a scatter plot. For 50
variables, there are 1,225 possible scatter plots to examine.
• A dedicated predictive modeler might actually do this, although it will take some time. However, if the
patterns require that you examine three variables simultaneously, you would need to examine 19,600 3D
scatter plots in order to cover all the possible three-way combinations. Even the most dedicated modelers
will be hard-pressed to spend the time needed to examine so many plots.
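As a side note (not part of the original slides), these plot counts are simply binomial coefficients; a couple of lines of standard-library Python confirm them:

```python
# Counting how many plots are needed for 50 inputs;
# math.comb is in the Python standard library.
from math import comb

n_inputs = 50
print(comb(n_inputs, 1))  # 50 one-variable histograms
print(comb(n_inputs, 2))  # 1,225 two-variable scatter plots
print(comb(n_inputs, 3))  # 19,600 three-way combinations
```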
Supervised Vs Unsupervised
Supervised learning is a machine learning approach that’s defined by its use of labeled datasets.
These datasets are designed to train or “supervise” algorithms into classifying data or predicting outcomes
accurately.
Using labeled inputs and outputs, the model can measure its accuracy and learn over time.
Supervised learning can be separated into two types of problems when data mining: classification and
regression:
• Classification problems use an algorithm to accurately assign test data into specific categories, such as
separating apples from oranges. Or, in the real world, supervised learning algorithms can be used to classify
spam in a separate folder from your inbox.
• Linear classifiers, support vector machines, decision trees and random forest are all common types of
classification algorithms.
• Regression is another type of supervised learning method that uses an algorithm to understand the
relationship between dependent and independent variables.
• Regression models are helpful for predicting numerical values based on different data points, such as sales
revenue projections for a given business.
• Some popular regression algorithms are linear regression, logistic regression and polynomial regression.
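For concreteness, here is a minimal sketch of both supervised problem types, assuming scikit-learn is available; the data sets are synthetic and the model choices (logistic regression for classification, linear regression for estimation) are illustrative, not prescribed by the slides.

```python
# Minimal sketch: one classifier and one regressor trained on labeled data.
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LogisticRegression, LinearRegression

# Classification: assign each row to one of two categories (e.g. spam / not spam).
Xc, yc = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(Xc, yc)
print("Predicted classes:", clf.predict(Xc[:5]))

# Regression: predict a continuous value, such as projected revenue.
Xr, yr = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
reg = LinearRegression().fit(Xr, yr)
print("Predicted values:", reg.predict(Xr[:5]))
```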
Unsupervised learning uses machine learning algorithms to analyze and cluster unlabeled data sets. These
algorithms discover hidden patterns in data without the need for human intervention (hence, they are
“unsupervised”).
Unsupervised learning models are used for three main tasks: clustering, association and dimensionality
reduction:
• Clustering is a data mining technique for grouping unlabeled data based on their similarities or
differences.
• For example, K-means clustering algorithms assign similar data points into groups, where the K value
represents the number of groups to find and therefore the granularity of the grouping (see the sketch
after this list).
• This technique is helpful for market segmentation, image compression, etc.
• Association is another type of unsupervised learning method that uses different rules to find
relationships between variables in a given dataset.
• These methods are frequently used for market basket analysis and recommendation engines, along the
lines of “Customers Who Bought This Item Also Bought” recommendations.
• Dimensionality reduction is a learning technique used when the number of features  (or dimensions) in
a given dataset is too high.
• It reduces the number of data inputs to a manageable size while also preserving the data integrity.
• Often, this technique is used in the preprocessing data stage, such as when autoencoders remove noise
from visual data to improve picture quality.
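A minimal clustering sketch, assuming scikit-learn; the two-group synthetic data and the choice of K = 2 are purely illustrative.

```python
# Minimal sketch of unsupervised clustering with K-means;
# K is the number of clusters we ask the algorithm to find.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),    # one group of points
               rng.normal(5, 1, (50, 2))])   # a second, separated group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])       # cluster assignment for the first 10 rows
print(kmeans.cluster_centers_)   # the two discovered group centers
```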
Parametric and Non-Parametric Methods
• Parametric methods use a fixed set of parameters to determine the probability model that is used in
machine learning.
• Parametric methods are those for which we know a priori that the population is normal or, if not, that it
can be closely approximated by a normal distribution, which is possible by invoking the Central Limit
Theorem. The parameters for using the normal distribution are as follows:
• Mean
• Standard Deviation
• Whether a method is classified as parametric depends entirely on the assumptions that are made about
the population.
• There are many parametric methods available; some of them are:
• The confidence interval for a population mean with known standard deviation.
• The confidence interval for a population mean with unknown standard deviation.
• The confidence interval for a population variance.
• The confidence interval for the difference of two means, with unknown standard deviations.
Nonparametric Methods:
• Non-parametric methods are gaining popularity and influence; some of the reasons behind this are:
• The main reason is that we are not bound by the strict requirements of parametric methods.
• The second important reason is that we do not need to make as many assumptions about the population
we are working with.
• Most of the nonparametric methods available are very easy to apply and to understand; that is, their
complexity is very low.
Parametric Methods vs. Non-Parametric Methods
• Parametric methods use a fixed number of parameters to build the model; non-parametric methods use a
flexible number of parameters.
• A parametric analysis tests group means; a non-parametric analysis tests medians.
• Parametric methods are applicable only to variables; non-parametric methods are applicable to both
variables and attributes.
• Parametric methods make strong assumptions about the data; non-parametric methods generally make
fewer assumptions.
• Parametric methods require less data than non-parametric methods.
• Parametric methods assume a normal distribution; non-parametric methods assume no particular
distribution.
• Parametric methods handle interval or ratio data; non-parametric methods also handle ordinal (ranked)
data.
• Results from parametric methods can be seriously affected by outliers; results from non-parametric
methods are much less affected by outliers.
• Parametric methods can perform well in many situations, but their performance peaks when the spread of
each group is different; non-parametric methods perform at their peak when the spread of each group is
the same.
• Parametric methods have more statistical power than non-parametric methods.
• Parametric methods are computationally faster than non-parametric methods.
• Examples of parametric methods: logistic regression, the Naïve Bayes model, etc. Examples of
non-parametric methods: KNN, decision tree models, etc.
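To make the contrast concrete, the sketch below runs a parametric test (the t-test, which compares means and assumes normality) and a non-parametric test (the Mann-Whitney U test, which makes no normality assumption) on the same synthetic samples; it assumes NumPy and SciPy are available.

```python
# Minimal sketch contrasting a parametric and a non-parametric test
# on the same two synthetic samples.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(loc=10.0, scale=2.0, size=30)
group_b = rng.normal(loc=11.0, scale=2.0, size=30)

# Parametric: the t-test compares group means and assumes normality.
t_stat, t_p = stats.ttest_ind(group_a, group_b)

# Non-parametric: Mann-Whitney U compares ranks and needs no normality assumption.
u_stat, u_p = stats.mannwhitneyu(group_a, group_b)

print(f"t-test p-value:       {t_p:.4f}")
print(f"Mann-Whitney p-value: {u_p:.4f}")
```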
Business Intelligence
• Business intelligence (BI) leverages software and services to transform
data into actionable insights that inform an organization’s strategic
and tactical business decisions.
• BI tools access and analyze data sets and present analytical findings
in reports, summaries, dashboards, graphs, charts and maps to
provide users with detailed intelligence about the state of the
business.
• The term business intelligence often also refers to a range of tools
that provide quick, easy-to-digest access to insights about an
organization’s current state, based on available data.
Business Intelligence vs. Business Analytics
• Business intelligence examines past and present data to drive current business needs; business analytics
analyses past data to drive future business planning.
• Business intelligence is used to run current business operations; business analytics is used to change
business operations and improve productivity.
• Business intelligence serves current business operations; business analytics serves future business
operations.
• Business intelligence tools include SAP Business Objects, QlikSense, TIBCO, Power BI, etc.; business
analytics tools include word processing, Google Docs, MS Visio, MS Office tools, etc.
• Business intelligence applies to all large-scale companies to run current business operations; business
analytics applies to companies where future growth and productivity are the objective.
• Business intelligence comes under business analytics; business analytics encompasses the data warehouse,
data administration, etc.
• Key skills for business intelligence: data collection and management, data warehouse concepts,
understanding of diverse data sources and exchange applications, and domain and business knowledge.
Key skills for business analytics: understanding your objectives, good verbal communication skills, the
ability to run stakeholder meetings, being a good listener, and honing your presentation skills.
Predictive Analytics vs. Statistics
Statistics and Analytics
• Statistics and analytics are two branches of data science that share
many of their early heroes, so the occasional beer is still dedicated to
lively debate about where to draw the boundary between them.
Practically, however, modern training programs bearing those names
emphasize completely different pursuits.
Predictive Analytics vs. Data Mining
• Predictive analytics refers to the use of both new and historical data, statistical algorithms, and machine
learning techniques to forecast future activity, patterns, and trends. Data mining refers to the
computational technique of discovering patterns in huge data sets using methods at the intersection of AI.
• Predictive analytics helps to make predictions about future events; data mining helps to better understand
the information that has been gathered.
• Business analysts and other subject-matter experts (SMEs) perform predictive analytics; statisticians and
engineers perform data mining.
• Predictive analytics applies business knowledge to the discovered patterns to make valid business
predictions; data mining applies algorithms such as classification and regression to the gathered
information to find hidden patterns.
• To restate the distinction: predictive analytics refers to the use of both new and historical data, statistical
algorithms, and machine learning techniques to forecast future activity, patterns, and trends.
• The primary objective is to go beyond knowing what has happened to better assess what will happen in
the future. On the other hand, data mining refers to the computational technique of discovering patterns
in huge data sets using methods at the intersection of AI.
• The main objective of data mining is to extract useful information from the data warehouse and
transform it into an understandable and usable form.
Who Uses Predictive Analytics?
• Detecting fraud. Combining multiple analytics methods can improve pattern detection and prevent
criminal behavior. As cybersecurity becomes a growing concern, high-performance behavioral
analytics examines all actions on a network in real time to spot abnormalities that may indicate fraud,
zero-day vulnerabilities and advanced persistent threats.
• Optimizing marketing campaigns. Predictive analytics are used to determine customer responses or
purchases, as well as promote cross-sell opportunities. Predictive models help businesses attract, retain
and grow their most profitable customers. 
• Improving operations. Many companies use predictive models to forecast inventory and manage
resources. Airlines use predictive analytics to set ticket prices. Hotels try to predict the number of
guests for any given night to maximize occupancy and increase revenue. Predictive analytics enables
organizations to function more efficiently.
• Reducing risk. Credit scores are used to assess a buyer’s likelihood of default for purchases and are a
well-known example of predictive analytics. A credit score is a number generated by a predictive
model that incorporates all data relevant to a person’s creditworthiness. Other risk-related uses include
insurance claims and collections.
Challenges in Using Predictive Analytics

• Obstacles in Management
• Obstacles with Data
• Obstacles with Modeling
• Obstacles in Deployment
• Data preparation
• Data cleansing.
• Identifying important columns.
• Recognizing correlations.
• Understanding how different algorithms (math) work.
• Choosing the right algorithm for the right problem.
• Deciding the right properties for the algorithm.
• Ensuring the data format is correct.
Setting Up the Problem
Predictive Analytics Processing Steps: CRISP-DM
• CRISP-DM, which stands for Cross-Industry Standard Process for
Data Mining, is an industry-proven way to guide your data mining
efforts.
• As a methodology, it includes descriptions of the typical phases of a
project, the tasks involved with each phase, and an explanation of the
relationships between these tasks.
• As a process model, CRISP-DM provides an overview of the data
mining life cycle.
• The life cycle model consists of six phases with arrows indicating the most important
and frequent dependencies between phases.
• The CRoss Industry Standard Process for Data Mining (CRISP-DM) is a process
model that serves as the base for a data science process.
• It has six sequential phases:
1.Business understanding – What does the business need?
2.Data understanding – What data do we have / need? Is it clean?
3.Data preparation – How do we organize the data for modeling?
4.Modeling – What modeling techniques should we apply?
5.Evaluation – Which model best meets the business objectives?
6.Deployment – How do stakeholders access the results?
Business Understanding

• The Business Understanding phase focuses on understanding the objectives and requirements of the project.
Aside from the third task, the three other tasks in this phase are foundational project management activities
that are universal to most projects:

1. Determine business objectives: You should first “thoroughly understand, from a business perspective, what
the customer really wants to accomplish.” (CRISP-DM Guide) and then define business success criteria.

2. Assess situation: Determine resources availability, project requirements, assess risks and contingencies,
and conduct a cost-benefit analysis.

3. Determine data mining goals: In addition to defining the business objectives, you should also define what
success looks like from a technical data mining perspective.

4. Produce project plan: Select technologies and tools and define detailed plans for each project phase.

• The three legs of the stool:

– Domain expert – frames the problem.

– Database expert – identifies what data is available for predictive modeling.

– Predictive modeling expert – accesses and normalizes the data and builds the model.
Data Understanding

• Next is the Data Understanding phase. Adding to the foundation of Business Understanding, it drives the
focus to identify, collect, and analyze the data sets that can help you accomplish the project goals. This phase
also has four tasks:

1. Collect initial data: Acquire the necessary data and (if necessary) load it into your analysis tool.

2. Describe data: Examine the data and document its surface properties like data format, number of records,
or field identities.

3. Explore data: Dig deeper into the data. Query it, visualize it, and identify relationships among the data.

4. Verify data quality: How clean/dirty is the data? Document any quality issues.
Data Preparation

• A common rule of thumb is that 80% of the project is data preparation.

• This phase, which is often referred to as “data munging”, prepares the final data set(s) for modeling. It has five tasks:

1. Select data: Determine which data sets will be used and document reasons for inclusion/exclusion.

2. Clean data: Often this is the lengthiest task. Without it, you’ll likely fall victim to garbage-in, garbage-out. A
common practice during this task is to correct, impute, or remove erroneous values.

3. Construct data: Derive new attributes that will be helpful. For example, derive someone’s body mass index from
height and weight fields.

4. Integrate data: Create new data sets by combining data from multiple sources.

5. Format data: Re-format data as necessary. For example, you might convert string values that store numbers to
numeric values so that you can perform mathematical operations.
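A minimal sketch of the clean / construct / format tasks above, assuming pandas; the column names (height_cm, weight_kg, revenue) and values are hypothetical.

```python
# Minimal sketch of common data preparation steps on a tiny, made-up table.
import pandas as pd

df = pd.DataFrame({
    "height_cm": [170, 165, None, 180],
    "weight_kg": [70, 60, 80, 90],
    "revenue":   ["1000", "2500", "bad", "4000"],   # numbers stored as strings
})

# Clean data: impute the missing height with the column median.
df["height_cm"] = df["height_cm"].fillna(df["height_cm"].median())

# Format data: convert the string column to numeric (invalid values become NaN).
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")

# Construct data: derive body mass index from the height and weight fields.
df["bmi"] = df["weight_kg"] / (df["height_cm"] / 100) ** 2

print(df)
```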
Modeling

1. Select modeling techniques: Determine which algorithms to try (e.g. regression, neural net).

2. Generate test design: Depending on your modeling approach, you might need to split the data into training, test,
and validation sets.

3. Build model: As glamorous as this might sound, this might just be executing a few lines of code like “reg =
LinearRegression().fit(X, y)”.

4. Assess model: Generally, multiple models are competing against each other, and the data scientist needs to
interpret the model results based on domain knowledge, the pre-defined success criteria, and the test design.
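Expanding the "few lines of code" mentioned above into a runnable sketch, assuming scikit-learn and synthetic data; the train/test split and the two competing models are illustrative choices, not part of the original slides.

```python
# Minimal sketch of the modeling tasks: split the data, build two competing
# models, and assess them on held-out data.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=500, n_features=10, noise=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Build models (including the "reg = LinearRegression().fit(X, y)" line from the text).
reg = LinearRegression().fit(X_train, y_train)
tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X_train, y_train)

# Assess models against each other on the test set.
print("Linear regression MSE:", mean_squared_error(y_test, reg.predict(X_test)))
print("Decision tree MSE:    ", mean_squared_error(y_test, tree.predict(X_test)))
```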
Evaluation
• Whereas the Assess Model task of the Modeling phase focuses on technical
model assessment, the Evaluation phase looks more broadly at which model
best meets the business objectives and what to do next. This phase has three tasks:

• Evaluate results: Do the models meet the business success criteria? Which
one(s) should we approve for the business?
• Review process: Review the work accomplished. Was anything overlooked?
Were all steps properly executed? Summarize findings and correct anything if
needed.
• Determine next steps: Based on the previous three tasks, determine whether
to proceed to deployment, iterate further, or initiate new projects.
Deployment
• A model is not particularly useful unless the customer can access its results. The
complexity of this phase varies widely. This final phase has four tasks:

• Plan deployment: Develop and document a plan for deploying the model.
• Plan monitoring and maintenance: Develop a thorough monitoring and
maintenance plan to avoid issues during the operational phase (or post-project
phase) of a model.
• Produce final report: The project team documents a summary of the project
which might include a final presentation of data mining results.
• Review project: Conduct a project retrospective about what went well, what
could have been better, and how to improve in the future.
Defining Data for Predictive Modeling
• Data consists of 2-D data (rows and columns)
• Row – the unit of analysis
• Data formats – delimited flat files, fixed-width flat files, customized flat files, binary files, etc.
• Defining columns as measures – also called attributes, descriptors, variables, fields, features, or columns
• Defining the unit of analysis – each row should be an independent observation
Defining Target Values:
• For continuous-valued estimation problems, metrics often used for
assessing models are R^2, average error, Mean Squared Error (MSE),
median error, average absolute error, and median absolute error.
Defining Measures of Success for
Predictive Models
• Success Criteria for Classification
• Success Criteria for Estimation
• Other Customized Success Criteria
Success Criteria for Classification

• For classification problems, the most frequently used metric to assess model accuracy is Percent Correct
Classification (PCC).
• PCC measures overall accuracy without regard to what kind of errors are made; every error has the same
weight. Confusion matrices are also commonly used for classification and provide a summary of the
different kinds of errors, called Type I and Type II errors, precision and recall, false alarms and false
dismissals, or specificity and sensitivity; these are merely different ways of characterizing the errors the
classifier makes.
• PCC and the confusion matrix metrics are good when an entire population must be scored and acted on. For
example, if customers who visit a web site are to be served customized content on the site based on their
browsing behavior, every visitor will need a model score and a treatment based on that score.
• If one will treat only a subset of the population (a selected population), sorting the population by model
score and acting on only a portion of the entities in that selected group can be assessed through metrics
such as Lift, Gain, ROC curves, and the Area Under the Curve (AUC).
• These are popular in customer analytics, where the model selects a sub-population to contact with a
marketing message, or in fraud analytics, where the model identifies transactions that are good candidates
for further investigation.
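A minimal sketch of these classification criteria (PCC, the confusion matrix, and AUC), assuming scikit-learn; the imbalanced synthetic data set and logistic regression model are illustrative only.

```python
# Minimal sketch of PCC (accuracy), the confusion matrix, and AUC for a classifier.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

X, y = make_classification(n_samples=1000, n_features=10, weights=[0.9, 0.1],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = clf.predict(X_test)
scores = clf.predict_proba(X_test)[:, 1]   # scores used for rank-ordered metrics

print("PCC (accuracy):", accuracy_score(y_test, pred))
print("Confusion matrix:\n", confusion_matrix(y_test, pred))  # rows: actual, cols: predicted
print("AUC:", roc_auc_score(y_test, scores))
```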
Success Criteria for Estimation

• For continuous-valued estimation problems, metrics often used for assessing models are R^2, average
error, Mean Squared Error (MSE), median error, average absolute error, and median absolute error.
• In each of these metrics, one first computes the error of an estimate (the actual value minus the
predicted estimate) and then computes the appropriate statistic based on those errors. The values are
then summed over all the records in the data.
• Average errors can be useful in determining whether the models are biased toward positive or negative
errors.
• Average absolute errors are useful in estimating the magnitude of the errors (whether positive or
negative).
• Analysts most often examine not only the overall value of the success criterion, but also examine the
entire range of predicted values by considering scatter plots of actual versus predicted values or actual
versus residuals (errors).
• In principle, one can also include rank-ordered metrics such as AUC and Gain as candidate success
criteria, though they are often not included in predictive analytics software for estimation problems. In
these instances, one needs to create a customized success criterion.
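A minimal sketch of the estimation metrics listed above, assuming NumPy; the actual and predicted values are made up for illustration.

```python
# Minimal sketch of common estimation (regression) error metrics.
import numpy as np

actual    = np.array([10.0, 12.5, 9.0, 14.0, 11.0])
predicted = np.array([ 9.5, 13.0, 8.0, 15.5, 10.0])

errors = actual - predicted   # error = actual value minus predicted estimate

print("Average error:          ", errors.mean())          # sign reveals bias
print("Mean Squared Error (MSE):", (errors ** 2).mean())
print("Median error:           ", np.median(errors))
print("Average absolute error: ", np.abs(errors).mean())   # magnitude of errors
print("Median absolute error:  ", np.median(np.abs(errors)))
```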
Other Customized Success Criteria

• Sometimes none of the typical success criteria are sufficient to evaluate predictive
models because they do not match the business objective. Consider the invoice
fraud example described earlier.
• Let’s assume that the purpose of the model is to identify 100 invoices per month
to investigate from the hundreds of thousands of invoices submitted.
• If one builds a classification model and selects a model that maximizes PCC, we
can be fooled into thinking that the best model as assessed by PCC is good, even
though none of the top 100 invoices are good candidates for investigation. How is
this possible?
• If there are 100,000 invoices submitted in a month, we are selecting only 0.1
percent of them for investigation. The model could be perfect for 99.9 percent of
the population and miss what we care about the most, the top 100.
• In situations where there are specific needs of the organization that lead to building models, it may
be best to consider customized cost functions.
• In the fraud example, we want to identify a population of 100 invoices that is maximally productive
for the investigators.
• If the worst scenario for the investigators is to pursue a false alarm, a case that turns out to not be
fraudulent at all, the model should reflect this cost in the ranking.
• What modeling metric does this? No metric addresses this directly, though ROC curves are close to
the idea; one could therefore select models that maximize the area under the ROC curve at the
depth of 100 invoices.
• However, this considers true alerts and false alarms as equally positive or negative.
• One solution is to consider the cost of false alarms greater than the benefit of a true alert; one may
penalize false alarms ten times as much as a true alert.
• The actual cost values are domain specific, derived either empirically or defined by domain experts.
• Another candidate for customized scoring of models includes Return On Investment (ROI) or profit,
where there is a fixed or variable cost associated with the treatment of a customer or transaction (a
record in the data), and a fixed or variable return or benefit if the customer responds favorably.
• For example, if one is building a customer acquisition model, the cost is typically a fixed cost
associated with mailing or calling the individual; the return is the estimated value of acquiring a new
customer.
• For fraud detection, there is a cost associated with investigating the invoice or claim, and a gain
associated with the successful recovery of the fraudulent dollar amount.

• Note that for many customized success criteria, the actual predicted values are not nearly as
important as the rank order of the predicted values.
• If one computes the cumulative net revenue as a customized cost function associated with a model,
the predicted probability may never enter into the final report, except as a means to threshold the
population into the “select” group (that is to be treated) and the “nonselect” group.
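A minimal sketch of such a customized criterion, with hypothetical scores, fraud labels, and costs: it selects the top 100 invoices by model score and penalizes false alarms ten times as much as a true alert's benefit, as suggested above. Only the rank order of the scores matters here, not their actual values.

```python
# Minimal sketch of a customized cost function for the invoice-fraud example.
import numpy as np

rng = np.random.default_rng(0)
n_invoices = 100_000
is_fraud = rng.random(n_invoices) < 0.002          # rare, hypothetical fraud labels
score = rng.random(n_invoices) + 0.5 * is_fraud    # stand-in model score (rank order matters)

# Select the 100 highest-scoring invoices for investigation.
top_100 = np.argsort(score)[::-1][:100]
true_alerts = int(is_fraud[top_100].sum())
false_alarms = 100 - true_alerts

benefit_per_alert = 1.0
cost_per_false_alarm = 10.0                        # penalize false alarms ten times as much
net_value = true_alerts * benefit_per_alert - false_alarms * cost_per_false_alarm
print(f"True alerts: {true_alerts}, false alarms: {false_alarms}, net value: {net_value}")
```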
