Unit I Predictive Analytics
Parametric vs. Non-Parametric Methods
• Applicability: parametric methods apply only to variables; non-parametric methods apply to both variables and attributes.
• Assumptions: parametric methods make strong assumptions about the data; non-parametric methods generally make fewer assumptions.
• Data requirements: parametric methods require less data; non-parametric methods require much more.
• Distribution: parametric methods assume a normal distribution; non-parametric methods assume no particular distribution.
• Data types: parametric methods handle interval or ratio data; non-parametric methods handle nominal or ordinal data.
• Outliers: results from parametric methods are easily affected by outliers; non-parametric results are not seriously affected.
• Spread: both can perform well in many situations, but parametric methods perform at their peak when the spread of each group is different, while non-parametric methods perform at their peak when the spread of each group is the same.
• Power: parametric methods have more statistical power than non-parametric methods.
• Computation: parametric methods are computationally faster than non-parametric methods.
• Examples: parametric – Logistic Regression, Naïve Bayes; non-parametric – KNN, Decision Trees.
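The contrast above can be sketched in scikit-learn by fitting one model from each family on the same data. The dataset is synthetic and the accuracy comparison is illustrative only, not a benchmark.

```python
# Parametric model (logistic regression, fixed functional form) vs.
# non-parametric model (k-nearest neighbours, no assumed form) on
# synthetic classification data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

param_model = LogisticRegression().fit(X_tr, y_tr)                    # strong assumptions
nonparam_model = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)  # fewer assumptions

print("Logistic Regression accuracy:", round(param_model.score(X_te, y_te), 3))
print("KNN accuracy:", round(nonparam_model.score(X_te, y_te), 3))
```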
Business Intelligence
• Business intelligence (BI) leverages software and services to transform
data into actionable insights that inform an organization’s strategic
and tactical business decisions.
• BI tools access and analyze data sets and present analytical findings
in reports, summaries, dashboards, graphs, charts and maps to
provide users with detailed intelligence about the state of the
business.
• The term business intelligence often also refers to a range of tools
that provide quick, easy-to-digest access to insights about an
organization’s current state, based on available data.
Business Intelligence vs. Business Analytics
• Focus: BI examines past and present data to drive current business needs; BA analyzes past data to drive future business planning.
• Purpose: BI is used to run current business operations; BA is used to change business operations and improve productivity.
• Tools: BI tools include SAP Business Objects, QlikSense, TIBCO, Power BI, etc.; BA tools include word processing, Google Docs, MS Visio, MS Office tools, etc.
• Scope: BI applies to all large-scale companies running current business operations; BA applies to companies whose objective is future growth and productivity.
• Relationship: BI comes under business analytics, which also contains data warehousing, data administration, etc.
• Key skills: for BI – data collection and management, data warehouse concepts, understanding of diverse data sources and transactional applications, and domain and business knowledge; for BA – understanding your objectives, good verbal communication skills, the capacity to run stakeholder meetings, good listening, and strong presentation skills.
Predictive Analytics vs. Statistics
Statistics and Analytics
• Statistics and analytics are two branches of data science that share
many of their early heroes, so the occasional beer is still dedicated to
lively debate about where to draw the boundary between them.
Practically, however, modern training programs bearing those names
emphasize completely different pursuits.
Predictive Analytics vs. Data Mining
• Predictive analytics refers to the use of both new and historical data, statistical algorithms, and machine learning techniques to forecast future activity, patterns, and trends.
• Data mining refers to the computational technique of discovering patterns in huge data sets, involving methods at the intersection of AI, statistics, and database systems.
• Obstacles in Management
• Obstacles with Data
• Obstacles with Modeling
• Obstacles in Deployment
• Data preparation
• Data cleansing.
• Identifying important columns.
• Recognizing correlations.
• Understanding how different algorithms (math) work.
• Choosing the right algorithm for the right problem.
• Deciding the right properties for the algorithm.
• Ensuring the data format is correct.
Setting Up the Problem
Predictive Analytics Processing Steps: CRISP-DM
• CRISP-DM, which stands for Cross-Industry Standard Process for
Data Mining, is an industry-proven way to guide your data mining
efforts.
• As a methodology, it includes descriptions of the typical phases of a
project, the tasks involved with each phase, and an explanation of the
relationships between these tasks.
• As a process model, CRISP-DM provides an overview of the data
mining life cycle.
• The life cycle model consists of six phases with arrows indicating the most important
and frequent dependencies between phases.
• The CRoss Industry Standard Process for Data Mining (CRISP-DM) is a process
model that serves as the base for a data science process.
• It has six sequential phases:
1.Business understanding – What does the business need?
2.Data understanding – What data do we have / need? Is it clean?
3.Data preparation – How do we organize the data for modeling?
4.Modeling – What modeling techniques should we apply?
5.Evaluation – Which model best meets the business objectives?
6.Deployment – How do stakeholders access the results?
Business Understanding
• The Business Understanding phase focuses on understanding the objectives and requirements of the project.
Aside from the third task, the three other tasks in this phase are foundational project management activities
that are universal to most projects:
1. Determine business objectives: You should first “thoroughly understand, from a business perspective, what
the customer really wants to accomplish.” (CRISP-DM Guide) and then define business success criteria.
2. Assess situation: Determine resources availability, project requirements, assess risks and contingencies,
and conduct a cost-benefit analysis.
3. Determine data mining goals: In addition to defining the business objectives, you should also define what
success looks like from a technical data mining perspective.
4. Produce project plan: Select technologies and tools and define detailed plans for each project phase.
Predictive modeling expert – accesses and normalizes the data, then builds the model.
Data Understanding
• Next is the Data Understanding phase. Adding to the foundation of Business Understanding, it drives the
focus to identify, collect, and analyze the data sets that can help you accomplish the project goals. This phase
also has four tasks:
1. Collect initial data: Acquire the necessary data and (if necessary) load it into your analysis tool.
2. Describe data: Examine the data and document its surface properties like data format, number of records,
or field identities.
3. Explore data: Dig deeper into the data. Query it, visualize it, and identify relationships among the data.
4. Verify data quality: How clean or dirty is the data? Document any quality issues.
Data Preparation
• This phase, which is often referred to as “data munging”, prepares the final data set(s) for modeling. It has five tasks:
1. Select data: Determine which data sets will be used and document reasons for inclusion/exclusion.
2. Clean data: Often this is the lengthiest task. Without it, you’ll likely fall victim to garbage-in, garbage-out. A
common practice during this task is to correct, impute, or remove erroneous values.
3. Construct data: Derive new attributes that will be helpful. For example, derive someone’s body mass index from
height and weight fields.
4. Integrate data: Create new data sets by combining data from multiple sources.
5. Format data: Re-format data as necessary. For example, you might convert string values that store numbers to
numeric values so that you can perform mathematical operations.
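The Construct and Format tasks above can be sketched in pandas; the column names and values here are hypothetical.

```python
# Two Data Preparation tasks from the text: constructing a derived
# attribute (BMI from height and weight) and re-formatting a string
# column to numeric so mathematical operations work on it.
import pandas as pd

df = pd.DataFrame({
    "height_m": [1.75, 1.62, 1.80],
    "weight_kg": [70.0, 55.5, 92.0],
    "income": ["52000", "48000", "61000"],  # numbers stored as strings
})

# Construct data: derive body mass index from height and weight.
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

# Format data: convert string values to numeric values.
df["income"] = pd.to_numeric(df["income"])

print(df.round(1))
```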
Modeling
1. Select modeling techniques: Determine which algorithms to try (e.g. regression, neural net).
2. Generate test design: Depending on your modeling approach, you might need to split the data into training, test,
and validation sets.
3. Build model: As glamorous as this might sound, this might just be executing a few lines of code like “reg =
LinearRegression().fit(X, y)”.
4. Assess model: Generally, multiple models are competing against each other, and the data scientist needs to
interpret the model results based on domain knowledge, the pre-defined success criteria, and the test design.
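The Modeling tasks above can be compressed into a small scikit-learn sketch: split the data (test design), fit the model the text quotes, and assess it on held-out data. The synthetic dataset and the MSE criterion are assumptions for illustration.

```python
# Test design, model build, and model assessment in a few lines.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)          # generate test design

reg = LinearRegression().fit(X_train, y_train)     # build model
mse = mean_squared_error(y_test, reg.predict(X_test))  # assess model
print("Test MSE:", round(mse, 2))
```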
Evaluation
• Whereas the Assess Model task of the Modeling phase focuses on technical
model assessment, the Evaluation phase looks more broadly at which model
best meets the business objectives and what to do next. This phase has three tasks:
• Evaluate results: Do the models meet the business success criteria? Which
one(s) should we approve for the business?
• Review process: Review the work accomplished. Was anything overlooked?
Were all steps properly executed? Summarize findings and correct anything if
needed.
• Determine next steps: Based on the previous three tasks, determine whether
to proceed to deployment, iterate further, or initiate new projects.
Deployment
• A model is not particularly useful unless the customer can access its results. The
complexity of this phase varies widely. This final phase has four tasks:
• Plan deployment: Develop and document a plan for deploying the model.
• Plan monitoring and maintenance: Develop a thorough monitoring and
maintenance plan to avoid issues during the operational phase (or post-project
phase) of a model.
• Produce final report: The project team documents a summary of the project
which might include a final presentation of data mining results.
• Review project: Conduct a project retrospective about what went well, what
could have been better, and how to improve in the future.
Defining Data for Predictive Modeling
• Consists of 2D data (rows & columns)
• Row – the unit of analysis
• Data format – delimited flat files, fixed-width flat files, customized flat files, binary files, etc.
• Columns as measures – also called attributes, descriptors, variables, fields, or features
Success Criteria for Classification
• For classification problems, the most frequently used metric to assess model accuracy is Percent Correct
Classification (PCC).
• PCC measures overall accuracy without regard to what kind of errors are made; every error has the same
weight. Confusion matrices are also commonly used for classification and provide a summary of the
different kinds of errors, called Type I and Type II errors, precision and recall, false alarms and false
dismissals, or specificity and sensitivity; these are merely different ways of describing different ways the
classifier makes errors.
• PCC and the confusion matrix metrics are good when an entire population must be scored and acted on. For
example, if customers who visit a web site are to be served customized content on the site based on their
browsing behavior, every visitor will need a model score and a treatment based on that score.
• If one will treat a subset of the population, a selected population, sorting the population by model score
and acting on only a portion of those entities in the selected group can be accomplished through metrics
such as Lift, Gain, ROC, and Area Under the Curve (AUC).
• These are popular in customer analytics, where the model selects a sub-population to contact with a
marketing message, or in fraud analytics when the model identifies transactions that are good candidates
for further investigation.
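The classification criteria above can be sketched in two parts: PCC and confusion-matrix terms for a fully scored population, then AUC and lift for a selected population. All labels and scores below are synthetic.

```python
# Part 1: PCC (overall accuracy) and the confusion-matrix terms.
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])

pcc = accuracy_score(y_true, y_pred)        # every error weighted equally
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)                # recall / true positive rate
specificity = tn / (tn + fp)                # true negative rate
print(f"PCC={pcc:.2f} TP={tp} FP={fp} FN={fn} TN={tn}")

# Part 2: rank-order metrics for a selected population: AUC over the
# whole ranking, and lift in the top-scored 20% vs. the base rate.
rng = np.random.default_rng(0)
scores = rng.random(1000)                                  # model scores
y = (rng.random(1000) < 0.1 + 0.3 * scores).astype(int)    # score-correlated responders
auc = roc_auc_score(y, scores)
top = y[np.argsort(-scores)[:200]]                         # top 20% by score
lift = top.mean() / y.mean()
print(f"AUC={auc:.3f} lift@20%={lift:.2f}")
```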
Success Criteria for Estimation
• For continuous-valued estimation problems, metrics often used for assessing models are R^2, average
error, Mean Squared Error (MSE), median error, average absolute error, and median absolute error.
• In each of these metrics, one first computes the error of an estimate—the actual value minus the
predicted estimate—and then computes the appropriate statistic based on those errors. The values are
then summed over all the records in the data.
• Average errors can be useful in determining whether the models are biased toward positive or negative
errors.
• Average absolute errors are useful in estimating the magnitude of the errors (whether positive or
negative).
• Analysts most often examine not only the overall value of the success criterion, but also examine the
entire range of predicted values by considering scatter plots of actual versus predicted values or actual
versus residuals (errors).
• In principle, one can also include rank-ordered metrics such as AUC and Gain as candidates to estimate
the success criteria, though they often are not included in predictive analytics software for estimation
problems. In these instances, one needs to create a customized success criterion.
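The estimation metrics listed above can be computed directly; the actual and predicted values below are made up for illustration, and error is defined as the text defines it, actual minus predicted.

```python
# Average error, MSE, median error, average absolute error, and
# median absolute error for a continuous-valued estimation problem.
import numpy as np

actual    = np.array([10.0, 12.5, 9.0, 15.0, 11.0])
predicted = np.array([ 9.5, 13.0, 9.5, 14.0, 11.5])

errors = actual - predicted                   # actual minus predicted estimate

avg_error        = errors.mean()              # reveals positive/negative bias
mse              = (errors ** 2).mean()       # Mean Squared Error
median_error     = np.median(errors)
avg_abs_error    = np.abs(errors).mean()      # magnitude regardless of sign
median_abs_error = np.median(np.abs(errors))

print(avg_error, mse, median_error, avg_abs_error, median_abs_error)
```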
Other Customized Success Criteria
• Sometimes none of the typical success criteria are sufficient to evaluate predictive
models because they do not match the business objective. Consider the invoice
fraud example described earlier.
• Let’s assume that the purpose of the model is to identify 100 invoices per month
to investigate from the hundreds of thousands of invoices submitted.
• If one builds a classification model and selects a model that maximizes PCC, we
can be fooled into thinking that the best model as assessed by PCC is good, even
though none of the top 100 invoices are good candidates for investigation. How is
this possible?
• If there are 100,000 invoices submitted in a month, we are selecting only 0.1
percent of them for investigation. The model could be perfect for 99.9 percent of
the population and miss what we care about the most, the top 100.
• In situations where there are specific needs of the organization that lead to building models, it may
be best to consider customized cost functions.
• In the fraud example, we want to identify a population of 100 invoices that is maximally productive
for the investigators.
• If the worst scenario for the investigators is to pursue a false alarm, a case that turns out to not be
fraudulent at all, the model should reflect this cost in the ranking.
• What modeling metric does this? No metric addresses this directly, though ROC curves are close to
the idea; one could therefore select models that maximize the area under the ROC curve at the
depth of 100 invoices.
• However, this considers true alerts and false alarms as equally positive or negative.
• One solution is to consider the cost of false alarms greater than the benefit of a true alert; one may
penalize false alarms ten times as much as a true alert.
• The actual cost values are domain specific, derived either empirically or defined by domain experts.
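The ten-to-one penalty can be sketched as a customized scoring function over the top 100 ranked invoices. The fraud rate, scores, and cost values below are illustrative assumptions, not domain-derived figures.

```python
# Customized success criterion: net score of the top 100 invoices,
# penalizing each false alarm ten times the benefit of a true alert.
import numpy as np

rng = np.random.default_rng(1)
scores   = rng.random(100_000)            # hypothetical model scores
is_fraud = rng.random(100_000) < 0.002    # rare fraud labels (synthetic)

TRUE_ALERT_BENEFIT = 1.0
FALSE_ALARM_COST   = 10.0                 # ten times the benefit, per the text

top100 = np.argsort(-scores)[:100]        # invoices sent to investigators
true_alerts  = int(is_fraud[top100].sum())
false_alarms = 100 - true_alerts

net_score = true_alerts * TRUE_ALERT_BENEFIT - false_alarms * FALSE_ALARM_COST
print(f"true alerts={true_alerts} false alarms={false_alarms} net={net_score}")
```

Comparing `net_score` across candidate models ranks them by this business-specific criterion rather than by PCC.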
• Another candidate for customized scoring of models includes Return On Investment (ROI) or profit,
where there is a fixed or variable cost associated with the treatment of a customer or transaction (a
record in the data), and a fixed or variable return or benefit if the customer responds favorably.
• For example, if one is building a customer acquisition model, the cost is typically a fixed cost
associated with mailing or calling the individual; the return is the estimated value of acquiring a new
customer.
• For fraud detection, there is a cost associated with investigating the invoice or claim, and a gain
associated with the successful recovery of the fraudulent dollar amount.
• Note that for many customized success criteria, the actual predicted values are not nearly as
important as the rank order of the predicted values.
• If one computes the cumulative net revenue as a customized cost function associated with a model,
the predicted probability may never enter into the final report, except as a means to threshold the
population into the “select” group (that is to be treated) and the “nonselect” group.
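Cumulative net revenue as a customized criterion can be sketched as follows; note that only the rank order of the scores enters the computation. The per-contact cost, per-responder return, and response data are hypothetical.

```python
# Cumulative net revenue by model rank: contact customers in score
# order, pay a fixed cost per contact, earn a return per responder,
# and pick the depth where cumulative net revenue peaks.
import numpy as np

rng = np.random.default_rng(2)
scores   = rng.random(5000)
responds = rng.random(5000) < 0.05 + 0.1 * scores   # score-correlated responses

COST_PER_CONTACT     = 1.0    # e.g. a mailing
RETURN_PER_RESPONDER = 40.0   # estimated value of an acquired customer

order = np.argsort(-scores)                          # rank order is all that matters
net = responds[order] * RETURN_PER_RESPONDER - COST_PER_CONTACT
cumulative_net = np.cumsum(net)

best_depth = int(cumulative_net.argmax()) + 1        # size of the "select" group
print(f"contact the top {best_depth} customers; "
      f"peak net revenue={cumulative_net[best_depth - 1]:.0f}")
```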