
Data Mining Methods Basics

Welcome to Data Mining!


In this course, you will learn about the concepts of data mining, its applications,
and the knowledge discovery process. You will also be introduced to various data
mining techniques.

So, What is Data Mining?


Data Mining is the process of extracting valid, useful, previously unknown and
comprehensible information from data and using it to make proactive,
knowledge-driven business decisions. Data mining uses statistical procedures
to find unexpected patterns in data and to identify associations between
variables.

Concept of Data Mining


Here is an expert talk demystifying the concept of data mining and explaining why it
will continue to grow in popularity.

The concept of data mining is growing in popularity in the realm of commerce
and business activities in general. But it's kind of a misconceived or
misunderstood topic, and I want to give you an idea of what Data Mining is all
about. Basically, we're in the information economy, and what you have is more
and more data being generated in every aspect you can think of. Every time you
swipe your grocery card to get a discount for buying whatever products, that
data is being downloaded to a database; on most transactions you do, there is
some sort of data download. Organizations are storing, processing and analyzing
data more than at any time in history, and that trend is going to continue to grow.

So what is Data Mining?

Data mining is the incorporation of quantitative methods - we'll call them
mathematical methods - that may include mathematical equations and algorithms.
Some of the prominent methodologies are traditional logistic regression, neural
networks, segmentation, classification and clustering. Those are all methods that
utilize mathematics. Data mining is applicable across industry sectors.
Generally, wherever you have processes and wherever you have data, it is the
application of these powerful mathematical techniques, in combination with some
statistical inference testing, that will extract trends and patterns.
I teach a course in Data Mining for managers and over the first half of the
course, I give students a very good understanding of what data mining is.
Because, to be honest, many people don't quite understand it. It takes a full half
of the course to provide that understanding: what are these mathematical
techniques? But, just as important, in the second half of the course, I say, now
you understand these techniques. Now let's use them in the business world.
Let's apply it to advertising and marketing effectiveness. Let's apply it to
e-commerce initiatives. Let's apply it even to health care processes and supply
chain processes. There are just a number of businesses that can be mined with these
techniques. Simply put, any organization that has data and processes can be
analyzed with data mining. And the result is extracting actionable information
from these data resources, so that organizations can fine-tune their processes,
increase productivity and increase efficiency.

So the data mining topic, this whole idea or concept, is going to grow in
popularity. Why? Because data continues to grow. Think about social
networking - LinkedIn, Twitter, Facebook. What is it? It's more data, and it's
data that describes people: what they do, what they like, who they are, when
they're out buying or doing whatever. As far as using services, just conducting
your daily lives, more and more there's data gathering and data capturing. And
in the information economy, the way to extract strategic information from that
data is what Data Mining is.

Data Mining Tasks


Let's now move on to the common classes of Data Mining tasks - Anomaly
Detection, Association Learning, Cluster Detection, Classification and Regression.

Anomaly Detection
Anomaly Detection refers to identifying items, events or observations that do not
adhere to the expected pattern or the other items in the dataset.

Anomaly Detection Example


A good example is how the tax department models typical tax returns and then
identifies returns that differ from this model using anomaly detection. This is
used for audits and reviews.
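
To make the idea concrete, here is a minimal Python sketch of one simple anomaly detection approach - flagging values that sit unusually far from the mean. The deduction amounts and the z-score threshold are illustrative assumptions, not part of the original example.

    import numpy as np

    # Hypothetical deduction amounts claimed on a batch of tax returns.
    deductions = np.array([1200, 1350, 1100, 1280, 1420, 1150, 9800, 1330, 1260, 1190])

    # Score each return by how many standard deviations it sits from the mean.
    z_scores = (deductions - deductions.mean()) / deductions.std()

    # Flag returns with unusually large scores for audit and review.
    # (A threshold of 2.5 is used because the sample is tiny; 3 is common on larger data.)
    anomalies = np.where(np.abs(z_scores) > 2.5)[0]
    print("Flagged returns (row indices):", anomalies)   # expected: [6]

Real systems typically use richer models (for example, isolation forests or density estimates), but the principle of scoring deviation from an expected pattern is the same.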

Association Learning
Association learning is the ability to learn and remember relationships
between otherwise unrelated items, stimuli or behaviors.

Association Learning Example


Association learning is the type of data mining that drives the recommendation
engines in major sites like Amazon and Netflix. This would let you know that
customers who bought a particular item also bought another item.
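
As a rough illustration of the "customers who bought X also bought Y" idea, the sketch below counts how often pairs of items appear in the same basket. The baskets are made-up examples, and real recommendation engines layer support and confidence measures (covered later in this course) on top of counts like these.

    from collections import Counter
    from itertools import combinations

    # Hypothetical purchase baskets; each set is one customer's order.
    baskets = [
        {"laptop", "mouse", "laptop bag"},
        {"laptop", "mouse"},
        {"phone", "phone case"},
        {"laptop", "laptop bag"},
        {"phone", "phone case", "screen protector"},
    ]

    # Count how often each pair of items is bought together.
    pair_counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1

    # The most frequent pairs drive "customers who bought X also bought Y" suggestions.
    for (item_a, item_b), count in pair_counts.most_common(3):
        print(f"Bought together {count} times: {item_a} + {item_b}")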

Cluster Detection
Cluster Detection is a type of pattern recognition particularly useful in
recognizing distinct clusters or sub-categories within the data.

Cluster Detection Example


The purchasing habits of hobbyists like gardeners, artists and model builders
would look quite different. By analyzing the purchasing behavior using
clustering algorithms, one can detect the various subgroups within the dataset.

Classification
Classification - If an existing structure is already known, you can use data
mining to classify new cases into these pre-determined categories.

Classification Example
The algorithms can be trained to detect systematic differences between items in
each group by learning from a large set of pre-classified examples. The
algorithm can then apply these rules to new classification problems. For
instance, a classifier can predict which borrowers are likely to default on loan payments.

Regression/Prediction
Regression/Prediction uses the historical relationship between a dependent and
one or more independent variables to predict values of the dependent variable.

Regression Example
It is a common practice for businesses to use regression to predict stock prices,
currency exchange rates, sales, productivity gains and so on. For example, a
company might use regression to get insights on how past advertising
expenditure has impacted sales. Here, the dependent variable is sales, and
the independent variables are advertising expenditure, the number of sales reps
and the commission paid.

Data Mining Tasks - Summary


Keep in mind that not all patterns inferred by data mining algorithms are
necessarily valid. The patterns detected during the data mining process are
often tested against a test set of data to validate their accuracy. Once the
desired standard is achieved, these algorithms are used to predict
outcomes. Data mining, in this way, can grant immense inferential power.

Data Mining Process


In this video, we will learn about the data mining process, starting from raw data to
the point of knowledge discovery.

Hello! My name is Thales Sehn Körting and I will present very briefly how data
mining works. Most of the time, when people search about big data mining,
what they are interested in is the whole process, in which data mining is just
one step. This video could be called "How knowledge discovery in databases
works". That is the real title of this video.

To show how knowledge discovery in databases works, we present the steps
that start from raw data until we obtain knowledge about the data we have.
When we do this using our tools - computational and algorithmic tools - we
start from the raw data and obtain what we can call knowledge.

The first step is the conversion from raw data to target data, and this is what we
call "Selection of Data". Suppose we have lots of information about a certain
phenomenon and we want to derive some knowledge about it. Sometimes we
have data that is not useful, data that is not ready to be used, and data in a
different format. In these cases, the very basic processing that we have to do is
called selection, where we reduce the data to the target data.

With the target data, we can do "pre-processing". One of the important
operations that we do here is to detect, for example, 'Outliers'. Suppose we have
two variables here in this data distribution; we can see, or we can use algorithms
to detect, that the red point is an outlier. In some cases, some algorithms may not
work properly if we have data which is very different from the rest of the
distribution. So this is called an outlier. We can try to remove these points and
get pre-processed data without outliers.
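
A minimal sketch of the outlier idea described above, assuming two numeric variables: points that lie much farther from the centroid of the data than is typical are flagged. The numbers and the threshold are illustrative assumptions.

    import numpy as np

    # Hypothetical two-variable dataset; the last row is far from the rest.
    data = np.array([
        [2.0, 3.1], [2.2, 2.9], [1.9, 3.0], [2.1, 3.2],
        [2.0, 2.8], [2.3, 3.1], [9.5, 9.0],   # candidate outlier (the "red point")
    ])

    # Distance of every point from the centroid of the distribution.
    centroid = data.mean(axis=0)
    distances = np.linalg.norm(data - centroid, axis=1)

    # Flag points much farther from the centroid than is typical.
    threshold = distances.mean() + 2 * distances.std()
    outliers = np.where(distances > threshold)[0]
    print("Outlier rows:", outliers)   # expected: [6]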

Another thing that we can do here is to 'Detect Missing Values'. Suppose we
have this data distribution here; we can use some algorithms to estimate what
could fill those two holes we have here. Suppose we have this estimation using
that green line; we could interpolate from the other data in order to see what
the data in those holes could be. These are two of the most well-known
pre-processing steps that are performed on data, independent of the application
we are doing.
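
The missing-value estimation described here can be sketched with a simple linear interpolation, which is just one of many possible estimators. The series below is made up for illustration.

    import numpy as np

    # Hypothetical series with two missing observations (np.nan) - the "holes".
    x = np.arange(10)
    y = np.array([1.0, 2.1, 2.9, np.nan, 5.2, 6.0, np.nan, 8.1, 8.9, 10.2])

    # Estimate each missing value from the surrounding known points.
    missing = np.isnan(y)
    y[missing] = np.interp(x[missing], x[~missing], y[~missing])
    print(y)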

After pre-processing, we have to apply the "Transformation of the data". One
thing that we do in this case is to normalize the data, because sometimes we
have data that ranges from zero to one, other data that is textual, and other
data that ranges from 0 to infinity. Most algorithms are created to use data in a
similar range, so one of the steps is to "Normalize". Another step is to find
correlated variables. Suppose we have two variables that have a high
correlation; using both of these variables is largely redundant.

So what can we do with these? We can apply a transformation on the data
to make these variables uncorrelated, so that we can extract the most
information in the next step. Once we have the transformed data, we want
to apply the main topic of this whole process, called "Data Mining": from the
transformed data, we can get the patterns.
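
As a rough sketch of the two transformation steps just mentioned - normalization and removing correlation - the code below rescales two features to [0, 1] and then rotates them onto principal components so the resulting variables are uncorrelated. The synthetic data is an assumption for illustration; the lecture does not prescribe this exact recipe.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical dataset: feature 0 lies in [0, 1]; feature 1 is a larger, strongly correlated copy.
    f0 = rng.random(100)
    f1 = 1000.0 * f0 + rng.normal(0.0, 50.0, 100)
    data = np.column_stack([f0, f1])

    # Step 1: normalize each column to [0, 1] so the scales are comparable.
    normalized = (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))
    print("correlation before:", np.corrcoef(normalized[:, 0], normalized[:, 1])[0, 1])

    # Step 2: rotate onto principal components so the new variables are uncorrelated.
    centered = normalized - normalized.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    transformed = centered @ vt.T
    print("correlation after:", np.corrcoef(transformed[:, 0], transformed[:, 1])[0, 1])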

How do we do this? We can apply several classification algorithms. You will be
able to see several data mining algorithms in the description of this video. In
this case, we can apply algorithms such as k-nearest neighbors, decision trees
or support vector machines. These are possible data mining algorithms, or
classification algorithms, that we apply to the data to obtain patterns. So the
data will start to be divided into patterns. The last step in this process is the
interpretation of these patterns. This is not an automatic procedure: the user
looks at the patterns and applies interpretation in order to obtain the knowledge
given by those patterns. The user can look at the discovered patterns and try to
see if there are some redundant or irrelevant patterns, and with this in mind, we
obtain knowledge from the data.

It is important to say that we have all these green arrows, which means that we
can return to any of the previous steps in order to improve our notion of the
patterns and also our notion of the knowledge. That's why we have such an
interconnected procedure here. It is also important to say that this explanation
of knowledge discovery in databases is based on the main reference from
Fayyad, Piatetsky-Shapiro and Smyth in 1996. Thanks for your invitation, and
this is how knowledge discovery in databases works.

Knowledge Discovery Process


Now that you have an idea of how data is processed to create knowledge, let's
learn about various stages of Knowledge Discovery Process: Problem Definition
> Data Preparation > Data Mining > Data Analysis > Knowledge Assimilation

Problem Definition Stage

The problem definition stage is the initial phase of a data mining project. It
focuses on understanding the project objectives and requirements and defining the
data mining problem. Based on this, you can identify the data requirements and
models.

Data Preparation Stage


This stage involves three key activities and requires more than 70% of the total
data mining effort.

1. 'Data Selection': We identify the sources of information and select a subset of data required for analysis.
2. 'Data Pre-processing': We join data from various tables and resolve issues such as data conflicts, outliers, and missing data.
3. 'Data Transformation': We use conversions and combinations to generate new data fields, such as ratios and discretized continuous values (a small illustrative sketch follows below).
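
A minimal sketch of the data transformation activity, assuming a small, made-up customer table: one new field is created as a ratio of existing fields, and another by discretizing a continuous value into labelled bands.

    import pandas as pd

    # Hypothetical customer records used only to illustrate the transformation step.
    df = pd.DataFrame({
        "income": [32000, 54000, 21000, 88000, 47000],
        "debt":   [8000, 27000, 15000, 22000, 4700],
        "age":    [23, 41, 35, 52, 29],
    })

    # New field as a ratio of two existing fields.
    df["debt_to_income"] = df["debt"] / df["income"]

    # Discretize a continuous value into labelled bands.
    df["age_band"] = pd.cut(df["age"], bins=[0, 30, 45, 120], labels=["young", "middle", "senior"])

    print(df)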

Data Mining Stage


In this stage, we identify the data mining technique, algorithm and tools to be
used. Then, we apply the algorithm on the sample data set (also known as
training data) and tune the control parameters of the algorithm until we get a
satisfying result. Later, we validate the model by running the algorithm against
the actual data (also known as test data).
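
The sketch below illustrates this stage under simplifying assumptions: synthetic data stands in for the prepared dataset, a decision tree stands in for the chosen algorithm, its depth is tuned on the training data using cross-validation, and the tuned model is then validated on held-out test data.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical sample data standing in for the prepared mining dataset.
    X, y = make_classification(n_samples=500, n_features=8, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Tune a control parameter (tree depth) using only the training data.
    best_depth, best_score = None, 0.0
    for depth in (2, 3, 5, 8):
        model = DecisionTreeClassifier(max_depth=depth, random_state=42)
        score = cross_val_score(model, X_train, y_train, cv=5).mean()
        if score > best_score:
            best_depth, best_score = depth, score

    # Validate the chosen model against the held-out test data.
    final = DecisionTreeClassifier(max_depth=best_depth, random_state=42).fit(X_train, y_train)
    print("chosen depth:", best_depth, "| test accuracy:", final.score(X_test, y_test))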

Data Analysis Stage


In this stage, we evaluate the mined patterns with respect to the defined goals.
We interpret the Data Mining output – in the form of rules or patterns to find new
and potentially useful knowledge. This is the Holy Grail of the Knowledge
Discovery!

Knowledge Assimilation Stage


In this stage, we implement the business insights derived from the Data Mining
process in the organization’s system for further action. The knowledge becomes
active, which means that we can make changes to the system, and measure
the impact of the changes. The success of this step determines how effective
the Knowledge Discovery process is.

Knowledge Discovery Process - Summary


The final deployment would involve building computerized systems to capture
relevant data and to make real time recommendations to business. Also, Data
Mining Models need to be continuously monitored and refined as several
economic factors, business changes and competitor initiatives could impact the
performance of the model.

Data Mining team


Let's understand the typical team composition required for Data Mining projects.
These projects require people with not just great minds, but also a great eye for
data. A Data Mining team typically involves:

- Domain Expert
- Database Administrator
- Statistician
- Mining Specialist

Domain Experts
Domain Experts are usually people in higher business management functions
who know the business environment, processes, customers, and competition.

Database Administrator
Database Administrators come with a good understanding of company data,
where it is stored, how it is stored, how to access it and how to relate it to other
data sources.

Statisticians
Statisticians validate and analyze datasets. Their key tasks include analysis,
interpretation, and presentation of statistical outputs.

Data Miner
Data Miners apply data mining techniques and technically interpret the results.
They usually have a background in data analysis and statistics.

Roles in Data Mining Summary


In Data Mining projects, Data Miners play a central role: they establish
relationships with Domain Experts for business guidance on their results, with
DBAs for access to the data required for their activities, and with Statisticians
for validating analysis and interpreting statistical outputs.

Supervised Learning
Supervised learning is the most common technique for training neural networks and
decision trees. Let us watch this video to learn about Supervised Learning.

This class is divided into three subclasses or three parts. They are, supervised
learning, unsupervised learning and reinforcement learning.

P1. So, what do you think supervised learning is? P2. I think of supervised
learning as being the problem of taking labelled data sets, gleaning information
from it so that you can label new data sets. P1. That's fair. I call that function
approximation. Here's an example of supervised learning. I'm going to give you
an input and an output, and I'm going to give them to you as pairs, and I want
you to guess what the function is. Okay? P2. Okay. P1. 1 -> 1 2 -> 4 P2. Wait,
hang on, is 1 the input and 1 the output? P1. Yes. P2. And 2 the input, and 4
the output? P1. Correct. All right. P2. I think I am on to you. P1. 3 -> 9 4 -> 16 5
-> 25 6 -> 36 7 -> 49 P2. Nice. This is a very hip data set. P1. It is. What's the
function? P2. It's hip to be squared. P1. Exactly. Maybe. So, if you believe that's
true, then tell me if the input is 10, what's the output? P2. 100. P1. And that's
right, if it turns out, in fact, that the function is x squared. But the truth is, we
have no idea whether this function is x squared. P2. Not really. I have a pretty
good idea. P1. You do? Well, where's that idea come from? P2. It comes from
having spoken with you over a long period of time. And plus, you know, math.
You can't say I'm wrong. P1. You're wrong. P2. You just said I was wrong. P1.
No, you've talked to me for a long time, and plus math. I agree with that. P2.
Okay. P1. But, I'm going to claim that you're making a leap of faith, despite
being a scientist, by deciding that the input is 10 and the output is 100. P2.
Sure. I would agree with that. P1. What's that leap of faith? P2. Well, I mean,
from what you told me, it's still consistent with lots of other mappings from input
to output, like 10 gets mapped to 11. P1. Right, or everything is x squared
except 10. P2. Sure. P1. Or everything is x squared up to 10. P2. Right, that
would be mean. P1. That would be mean. P2. But it's not logically impossible.
P1. What would be the median? P2. A-ha. P1. Thank you very much. I was
saving that one up.
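
The dialogue's point can be replayed in a few lines of Python: fit a function to the labelled pairs and then take the "leap of faith" on an unseen input. Fitting a degree-2 polynomial is just one of the many functions consistent with the data, which is exactly the caveat raised above.

    import numpy as np

    # The labelled pairs from the dialogue: inputs and their outputs.
    inputs  = np.array([1, 2, 3, 4, 5, 6, 7])
    outputs = np.array([1, 4, 9, 16, 25, 36, 49])

    # Fit one candidate function (a degree-2 polynomial) to the pairs.
    predict = np.poly1d(np.polyfit(inputs, outputs, deg=2))

    # The leap of faith: assume the learned function generalizes to the unseen input 10.
    print(round(predict(10)))   # 100, if the underlying function really is x squared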

Unsupervised Learning
Unsupervised Learning is a very powerful technique for learning from unlabeled
data. Let us watch this video to learn about Unsupervised Learning.

P1. What about unsupervised learning? P2. Right, so unsupervised learning we


don't get those examples. We have just essentially something like input, and we
have to derive some structure from them, just by looking at the relationship
between the inputs themselves. P1. Right . So, give me an example of that. P2.
So, when you're studying different kinds of animals, say, even as a kid. You
might start to say, oh, there's these animals that all look kind of the same.
They're all four-legged. I'm going to call all of them dogs. Even if they happen to
be horses, or cows, or whatever. But I have developed, without anyone telling
me, this sort of notion that all these belong in the same class. And it's different
from things like trees. P1. Which don't have four legs. P2. Well some do, but I
mean, they have, they both bark, is all I'm saying. P1. Did I really set you up for
that? Not on purpose. I'm sorry, I want to apologize to each and every one of
you for that. But that was pretty good. Michael is very good at word play. Which
I guess is often unsupervised as well. P2. No, I get a lot of that. P1. You
certainly get a lot of feedback. P2. Yeah, that's right. So I say, please stop doing
that. P1. So, if supervised learning is about function approximation, then
unsupervised learning is about description. It's about taking a set of data and
figuring out how you might divide it up in one way or the other. P2. Or maybe
even summarization, it's not just the description but it's a shorter description. It's
usually concise, compressed, compact description. P1. So, I might take a bunch
of pixels like I have here and it might say, male. P2. Wait, wait, wait, wait, I’m pixels
now? P1. As far as we can tell. P2. That's fine. P1. I however, am not pixels. I
know I'm not pixels. I'm pretty sure the rest of you are pixels. That's right. So I
have a bunch of pixels, and I might say male, or I might say female, or I might
say dog, or I might say tree. But the point is, I don't have a bunch of labels that
say dog, tree, male, or female. I just decide that pixels like this belong with
pixels like this. As opposed to, pixels like something else that I'm pointing to
behind me. P2. Yeah we're living in a world right now that is devoid of any other
objects. Oh, chairs! P1. Chairs! Right. So these pixels are very different than
those pixels because of where they are relative to the other pixels. Say, right?
P2. I'm not sure that's helping me understand unsupervised learning. P1. Go
out and look at a crowd of people and try to decide how you might divide them
up. Maybe you'll divide them up by ethnicity, maybe you'll divide them up by
whether they have purposefully shaven their hair in order to mock the bald or
whether they have curly hair. Maybe you'll divide them up by whether they have
goatees, or whether they have grey hair, there's lots of things that you might do
in order. P2. Did you just point at me and say grey hair? P1. I was pointing and
your head happened to be there. P2. All right. P1. Okay. So, imagine you're
dividing the world up that way, you could divide it up male, female. You could
divide it up short, tall, wears hats, doesn't wear hats, all kinds of ways you can
divide it up. And no one's telling you the right way to divide it up. At least not
directly. That's unsupervised learning. That's description, because now-rather
than having to send pixels of everyone, or having to do a complete description
of this crowd, you can say, there were 57 males and 23 females, or there are
mostly people with beards, Or whatever. P1. I like summarization for that. Yeah.
It's a nice concise description. P2. That's unsupervised learning. P1. Good.
Very good. P2. And that's different from supervised learning in a couple of
ways. One way that it's different is, all of those ways that we could have just
divided up the world are, in some sense, equally valid. So, I could divide up by
sex, or I could divide up by height, or I could divide up by clothing, or whatever.
And they're all equally good, absent some other signal later telling you, how you
should be dividing up the world. But supervised learning directly tells you,
there's a signal, this is what it ought to be, and that's how you train. P2. Now,
but I could see ways that unsupervised learning could be helpful in the
supervised setting, right? So, if I do get a nice description, and it's the right kind
of description, it may help me do the function approximation better. P1. Right,
so instead of taking pixels as input, and labels like, male or female. I could just
simply take a summarization of you like how much hair do you have, your
relative height, the weight, and various things like that might help me do it.
That's right. And by the way, in practice this turns out to be things like density
estimation. We do end up turning it into statistics at the end of the day. Often.
P2. But it's statistics from the beginning. But when you say density estimation,
are you saying I'm stupid? P1. No. P2. All right so what is density estimation?
P1. Well they'll have to take the class to find out. P2. I see. Okay.

Introduction to Data Mining Techniques


We will now learn about some of the widely used Data Mining Techniques.

Classification
The classification technique is based on machine learning. Here, we classify
each item in a dataset into one of a predefined set of classes or groups.
Classification methods incorporate mathematical techniques such as statistics,
neural networks, decision trees and linear programming.

Classification
This video will help us learn details about Classification.

So, what kind of predictors are we going to look at, or what kind of prediction
tasks will be covered in this course? We will basically cover three: we will cover
classification, we will cover regression and we will cover clustering tasks. So
I'm going to show them kind of pictorially, just to get everyone warmed up.

So in classification, what do you have? - You have a collection of individuals. So


these are the people that we've seen before and they are represented in some
way. And in a classification task, in addition to this collection of individuals,
somebody comes along and gives us some labels for some of these individuals.
So he says that, maybe this one is an F and this one is an M and now as a
learning algorithm, you actually have no idea what those labels mean and you
really have no idea what these things are. But somebody comes along and
attaches labels to some data points. So, what does a classification algorithm try
to do in a situation like that?

What it tries to do is, it builds that predictor. And in classification, the predictor
takes on a particularly simple reduced form. So the whole predictor takes the
form of something that we call a Decision Boundary. The decision boundary is
an imaginary line that goes through our space and cuts the space into two
parts. One part is going to be the part where our algorithm thinks the M's live
and the other part is where the F's live. It tries to draw a boundary in the space
of data points such that all the F's are on one side and all the M's are on
the other side. So that's what it tries to do, and that's what the decision
boundary is - a fundamental concept in classification. We will delve deeper
into what decision boundaries are and what they look like geometrically.
That'll come a few lectures down the line. For now, you just have some sort
of boundary, where M is on one side and F is on the other side. So this red
line is the function, there's nothing more to it, and that is your predictor.
And how does it predict stuff?

Well, if you fall on one side of the boundary, it'll predict an M for you and if you
fall on the other side, it'll predict an F. So maybe M is the market going up and
F is the market going down, or maybe M means the individual is male and F means
the individual is female. So, you just build a predictor to detect gender based on
however you represent that individual. So, that's Classification.

Now the important thing to keep in mind is, this classification or decision
boundary only looks that way because of the labels we put on the data. So, we
could take exactly the same dataset, exactly the same set of individuals, put
some different labels, maybe a couple of yes and no labels, and what you hope for
is that your learning algorithm will produce a different prediction, a different
decision. So, maybe this decision boundary reflects whether you are going to
loan money to that particular individual or not. So, you had some examples of
people who paid up and an example of a person who didn't pay up. So, that's how
you decide to draw a boundary. And again the function, the predictor is just the
decision boundary. So the prediction is which side of the boundary you are
falling on. So, that's Classification!
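
Here is a small sketch of the decision boundary idea, assuming a made-up two-dimensional representation of individuals with M/F labels attached: a linear classifier fits a line through the space, and prediction is simply which side of that line a new point falls on. (The lecture does not commit to a particular algorithm; logistic regression is used here only because its boundary is easy to read off.)

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical 2-D representation of individuals, with labels attached by "somebody".
    X = np.array([[1.0, 1.2], [1.5, 0.8], [2.0, 1.0],    # labelled M
                  [4.0, 4.2], [4.5, 3.8], [5.0, 4.0]])   # labelled F
    y = np.array(["M", "M", "M", "F", "F", "F"])

    # The fitted model's decision boundary is the line w0*x0 + w1*x1 + b = 0.
    clf = LogisticRegression().fit(X, y)
    print("boundary weights:", clf.coef_[0], "intercept:", clf.intercept_[0])

    # Prediction is just which side of the boundary a new point falls on.
    print(clf.predict([[1.8, 1.1], [4.2, 4.1]]))   # expected: ['M', 'F']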

Regression Analysis
Regression is a predictive modeling technique. It explores the relationship
between a dependent variable, which is referred to as the target, and one or
more independent variables, which are referred to as the predictors. We use the
regression technique for forecasting, time series modeling and finding the
cause-and-effect relationship between the variables.

Regression Analysis
In this video we will learn more about Regression Analysis with an example.

This tutorial is an introduction to Regression. There's an X variable and a Y
variable; in this case, the independent variable is on the x-axis and the
dependent variable is on the y-axis. We try to form a relationship between
these two variables and draw a line, in this case a straight line. Over the next
series of videos, I’ll explain what all this means.
What we try to understand is: as the independent variable is moving or
changing, what happens to the dependent variable? Does it go up? Does it go
down? How does it change? If they move in the same direction - if the
independent variable increases and the dependent variable increases as well -
we say there is a positive relationship. If, on the other hand, as the independent
variable increases the dependent variable decreases, we say there is a negative
relationship. The line would look like this, going downward. In linear regression,
the key is the word "line" right there - a straight line. You can also fit curved
lines, but for this topic, it is all straight lines when we actually conduct regression.

I take observations and plot them, and more observations may come in somewhat
at random like that. And I try to find a straight line that fits all these different
points. This is called my regression line, and it's based upon the least squares
method. In the end, I want to minimize the difference between the estimated value
and the actual value - I want to minimize my errors. This line would have a lot of
errors if I compare the actual to the estimated values. Again, the point is to
minimize these errors or make them as small as possible. Now, let's imagine I put
study time on the x-axis, or make that my independent variable, and the dependent
variable becomes grades or GPA. As study time increases, grades should go up.
There is a positive relationship. In regression, we develop equations like this.

In this case, y hat is estimated grades, and it's based upon, or equal to, B0
plus B1 times X, where X is study time. B0 is derived mathematically and is
the y-intercept. B1 is also derived mathematically - I'll do that in a later video -
and it's the slope of the line. In this case, the slope is positive. In the next video,
I'll discuss how you develop these equations. Now, if I change the x-axis to time on
Facebook, we see a negative relationship. More time on Facebook, and grades will
suffer and go down - a negative relationship. What we're estimating is still
grades: estimated grades is equal to B0 minus B1 times X, where X is time on
Facebook. B0 is still the y-intercept and it is a calculated value. The slope of the
line is B1; because it's downward sloping, it is a negative relationship, and as I said
before, I will show you how to calculate this equation in the next video. The X is
the independent variable and the Y is the dependent variable. X is what we
control, what we manipulate, what we change and the dependent variable is the
outcome. So, study time is the independent variable, that’s what we control and
your grades are dependent upon how much you study. Now this looks really
ugly, and it's what I’ll talk about in the next video. But I’ll walk you through it
step by step and I hope to make it simple for you.
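
The B0 and B1 the video promises to derive later can be computed with the standard least squares formulas. The study-time and GPA numbers below are made up for illustration.

    import numpy as np

    # Hypothetical study time (hours per week) and resulting GPA for a few students.
    study_time = np.array([2, 5, 8, 10, 14, 18])
    gpa        = np.array([2.1, 2.5, 2.9, 3.1, 3.5, 3.8])

    # Least squares estimates: B1 (slope) and B0 (y-intercept).
    x_mean, y_mean = study_time.mean(), gpa.mean()
    b1 = np.sum((study_time - x_mean) * (gpa - y_mean)) / np.sum((study_time - x_mean) ** 2)
    b0 = y_mean - b1 * x_mean

    # The regression equation: estimated GPA = B0 + B1 * study time.
    print(f"y_hat = {b0:.2f} + {b1:.3f} * x")
    print("predicted GPA for 12 hours of study:", round(b0 + b1 * 12, 2))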

Decision Trees
Decision Tree is one of the widely used and easy-to-understand techniques.
The root of the decision tree is a condition or a question which has several
answers. Each answer points to a set of questions or conditions that help in
determining the data that can help make the final decision.

Decision Trees
This video gives us some more details on Decision Trees.

In today's session, we will talk about a very popular data mining technique called
Decision Trees. This technique is liked by Data Miners and Analysts the world over
because of its intuitive nature and user-friendly results. Let us take some time to
understand how this technique works.

We will take the example of a credit card company that has a set of customers.
Some of them are profitable. Some of them are not. Customers who do not use
their credit cards frequently or those who use the card, but diligently pay their
bills on time are examples of customers who are not profitable for credit card
companies. Customers who carry balances on the cards, i.e. customers who do
not make their card payments in full or on time are examples of customers who
are profitable for the company.

On our slide, we will denote profitable customers with red dots and unprofitable
customers with blue crosses. In our simplified example, let us assume the
company has five profitable and five unprofitable customers. This box here
represents the company's customer base. It has five red dots, i.e. five profitable
customers and five blue crosses, i.e. five unprofitable customers. These are the
company's existing customers. Outside this box is a large population of
potential customers. Potential customers are people who are not the customers
of this company, but the company can market to these customers so they have
the potential to be its customers. These customers are denoted by green
squares. The company doesn't yet know if these customers will be profitable
once they become customers. Now the credit card company has a fixed
marketing budget that allows it to market its products to a limited set of people
out of this large population of potential customers. The company wants to utilize
its marketing budget in such a way that it attracts the maximum number of
profitable customers. In essence, the company is saying, I have 10 customers,
5 of whom are profitable and 5 are unprofitable. I want to add 10 more
customers to my customer base. But I want all or most of them to be profitable.
So in effect, the company wants to focus its marketing budget only on those
people who are likely to be profitable, if they become the company's customers.
This is an interesting problem.

How can the credit card company predict if a person will be a profitable
customer or not, before the person even becomes a customer? This is where
analytics and the power of historical data come in. The company has certain
information available about its potential customers. For example age, gender,
marital status and the number of credit cards they already own. It wants to see if
any of these variables can help predict the profitability of a potential customer.
How will the company find this out? For this, let us examine the company's
existing customer base. The same information is available to the company
about its current customer base also. It knows their age, gender, marital status
and the number of cards already owned.

Please examine this table in some detail. In the existing customer base, 5 of the
customers are profitable and 5 are unprofitable. Hence the profitability rate of
the total customer base is 50%. Now, let us partition the data into two segments
based on the age variable. Let us put those who are 35 and above in the left
segment and those below 35 in the right segment. Examine the profitability rate
of the two segments.

The left segment has 4 profitable customers and 2 unprofitable customers. That
is, a profitability rate of 66%. In other words, two-thirds of the customers who
are above 35 are profitable customers. Compare this with the overall population
profitability rate of 50%. And we have an important insight. People who are 35
and above tend to be more profitable customers for the credit card company
than the average population. This means that if the company markets its products
only to people who are 35 and above, it will end up with a more profitable customer
base.

Now, let's see if we can further segment this population into smaller segments,
some of which have an even higher profitability. We will segment this
population of people over 35 by the marital status variable, that is, whether a
person is single or married. The population is segmented into two separate
groups: one comprising married people and the other made up of single
people. Notice the left-hand box now. It has four customers, all of whom are
profitable. This segment of the population - people who are 35 and above and
married - has a profitability rate of 100%. We have now identified a small
segment of the population that is highly profitable for the credit card company. We
have also learned that the credit card company needs to focus its marketing
efforts on people who are above 35 and married, as these people are likely to
be profitable customers. This is an example of a business using historical data
of its existing customers to predict the behavior of potential future customers in
order to build a more profitable customer base.

In particular, we have seen how the decision tree technique is used in predictive
modelling. This is a simplified example with four variables and 10 records. In
this example, we first segmented the data on the age variable, and we used
age greater than or equal to 35 as the splitting criterion. How do we know which
variable to use for the split, at which point, and how do we know what level to split
the variable at? In a real business situation, you will be dealing with hundreds of
variables and thousands or millions of records. How do you make these
decisions in such a scenario? This is where, Decision Trees come in. There are
various Decision Tree algorithms that allow the analyst to choose the right
variable from thousands of available variables and split the variable at the most
optimal value. In the following slides we will learn more about the decision tree
technique and the various algorithms underlying this technique.
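
The age and marital-status splits from this example can be reproduced on a toy table. The row-level values below are illustrative assumptions; only the segment totals (10 customers, 5 profitable, 4 of 6 older customers profitable, all 4 older married customers profitable) follow the lecture. A decision tree algorithm automates exactly this search for the best variable and split point.

    import pandas as pd

    # Toy customer base: 10 customers, 5 profitable (individual rows are illustrative).
    customers = pd.DataFrame({
        "age":        [38, 41, 52, 36, 39, 47, 23, 28, 31, 26],
        "married":    [1, 1, 1, 1, 0, 0, 0, 1, 0, 0],
        "profitable": [1, 1, 1, 1, 0, 0, 1, 0, 0, 0],
    })

    # Overall profitability rate of the existing customer base (50%).
    print("overall:", customers["profitable"].mean())

    # First split, age >= 35: the older segment is more profitable (4 of 6, about 66%).
    older = customers[customers["age"] >= 35]
    print("age >= 35:", older["profitable"].mean())

    # Second split on marital status: customers who are 35+ and married are 100% profitable.
    print("age >= 35 and married:", older[older["married"] == 1]["profitable"].mean())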

Neural Networks
Neural Networks are well suited for identifying patterns and forecasting. A neural
network is a set of connected input/output units where each connection has an
associated weight. In the learning phase, the network learns to predict the class
of the input tuples by adjusting these weights.

Neural Networks
This video gives us some more details about Neural Networks.

We all use computers every day. But sometimes computers fail us, and this
upsets a lot of people. What we'd like is for our computers to be smarter and
more user friendly. So, some people think we should try to make them more
human. This would involve making computers think more like people. But to do
this, we first need to understand how humans think and how our brains work.
First though, let's look at computers as they are today. Despite everything they
can do, they're pretty simple. They take some inputs, perform calculations and
produce outputs. The human brain, however, is extremely complicated, and a lot
of very smart scientists are still struggling to understand how the whole thing
works. One thing we have known for a while is that the tiniest components of
your brain that make it think and do smart things are special cells called
neurons. Your brain has billions of these neurons, and they talk to each other
using electrical impulses across connections called synapses. This massive
network of synapses is what is responsible for making your brain think and have
a consciousness.

Some computer scientists had the idea that we can make a computer that is
modelled after this system of neuron connections. They called their idea Neural
Networks. The idea behind neural networks, or neural nets for short, is that we
have nodes that have some connections between them. This is similar to the
neurons in your brain and the synapses they form. To get a neuron to do
something, we trigger a node with some input, and that node in turn triggers the
nodes it is connected to. But this alone is not very useful. So we usually organize
the neurons in a way that makes it easy to produce good results. Since we're used
to the computer model of computation, we'd like to have well-defined input and
output nodes. We also like to have directed connections so that we know which
way information is going. Not only that, we want our connections to have different
values - that is, some connections should be more important than others. Here
the connection values, called weights, are represented by the thickness of their
arrows. The purpose of having different connection weights is that it allows our
nodes to behave more like real neurons. When a node is stimulated by two
different nodes, it can decide which of the two is more important to it by their
connection weights.

Here, nodes A and B have been given the values green and orange, and they
try to pass on these values to node C since they are connected to it. Since the
connection weight between B and C is much larger than the connection weight
between A and C, node C decides B is more important to it and takes its value.
More often though, we design nodes to take the sum or an average of the
nodes that trigger them. Here, node C takes a sort of yellow color, but notice
the shade is much closer to orange than it is to green, since node B's
connection weight is large compared to node A's.
In some cases, we'd like our nodes to be able to decide whether they want
to accept their triggers at all. So each node gets to think about what it will do. To
decide, each node is given what is called a transfer function to judge its inputs.
Since, in the real world, computers treat all data as numbers, the transfer
function is a math equation. It's usually not that complicated. After the node
makes a decision, it sets its value and then it can trigger the next set of nodes
with that value. Choosing whether or not to accept the trigger value is most
useful for the output nodes, since these are the nodes that produce the result
that we actually want. Usually though, the transfer function will return a value
that is a combination of the node's current value and the trigger value. So, using
the connection weights and transfer functions, the neural network takes inputs
and produces outputs. This is the same task that a computer would do, but it's
done in a way that is similar to the way neurons work. Since the input and output
nodes are the ones that matter to us, we consider the nodes in the middle as
hidden nodes. They do most of the work, but get the least credit.

Now that we know the basics, we need to ask some questions. One of the most
important questions is, how are the connections determined?

Well, it turns out that neural networks can learn them. Does this make neural
networks smart? Sort of. But it also turns out that neural networks are
very slow learners, as we will soon see. But the question is, how do they learn?
It's done through a process called back propagation. We start with random
connection weights. Then, for a given set of inputs, we decide on a set of desired
outputs. Using the random weights, we first let the network calculate some
outputs. Then we compare the output that the neural net calculated to the
desired output that we defined. Since we gave the network random weights, we
obviously cannot expect the two outputs to be equal. So we find their difference.
We call this difference the error in the network. This is difficult to illustrate with
colors, but you have to trust me that I did it correctly.

For an easier-to-follow example, I've given each node a numerical value. You
can see that we find the error by simply subtracting, and then we can have
negative error as well. Now that we have the errors, we need to adjust the
connections to produce smaller errors. This is where the back propagation comes
in. The output nodes tell the hidden nodes they are connected to about their
error, and together they decide how to adjust the connection weights
between them. The new weight is calculated using an equation based on the
old weight, the node's input value, the error and something called the learning
rate. We'll get back to the learning rate later. With the weights adjusted, the
hidden nodes calculate their own error using a similar formula to before. Then
these nodes, with their newly calculated errors, push the errors back through the
nodes behind them and adjust those weights. This goes on until all the
weights have been adjusted and all the nodes have been assigned an error.
The idea is to determine which nodes are most to blame for the error in the
output and try to adjust their weights the most.

Now that all the weights have been updated, the network tries out the original
inputs again and tries to calculate some outputs. The calculated output should
be closer to the desired outputs than before. But there will still be some error.
So the whole process is repeated again and again and again. Remember, how I
said that the neural nets are slow learners. Well, the neural network has to do
all that for each different input set and there's usually a lot of those. But the idea
works and eventually the network will produce the desired outputs. To try to
produce the desired outputs more quickly, we can try adjusting the learning rate
and we can change the number of nodes as well. But it will still take millions of
attempts for the neural network to get the desired output for even a simple
problem. So, can neural networks make computers think more like humans?
Probably not. But it's a good baby step.
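
Below is a minimal numerical sketch of the forward pass, error calculation and back propagation just described, using a tiny network trained on the XOR problem. The architecture, learning rate and iteration count are illustrative choices, not something specified in the video.

    import numpy as np

    rng = np.random.default_rng(1)

    def sigmoid(z):
        # The transfer function each node uses to judge its input.
        return 1.0 / (1.0 + np.exp(-z))

    # Inputs and desired outputs (the XOR function).
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)
    X_b = np.hstack([X, np.ones((4, 1))])            # extra column acts as a bias input

    # Random starting connection weights: 2 inputs (+bias) -> 4 hidden nodes (+bias) -> 1 output.
    w_hidden = rng.normal(size=(3, 4))
    w_output = rng.normal(size=(5, 1))
    learning_rate = 0.5

    for _ in range(20000):
        # Forward pass: inputs trigger the hidden nodes, hidden nodes trigger the output.
        hidden = sigmoid(X_b @ w_hidden)
        hidden_b = np.hstack([hidden, np.ones((4, 1))])
        output = sigmoid(hidden_b @ w_output)

        # Error: the difference between the desired output and the calculated output.
        error = y - output

        # Back propagation: push the error back and adjust the weights behind each layer.
        delta_out = error * output * (1 - output)
        delta_hid = (delta_out @ w_output[:4].T) * hidden * (1 - hidden)
        w_output += learning_rate * hidden_b.T @ delta_out
        w_hidden += learning_rate * X_b.T @ delta_hid

    print(np.round(output, 2))   # should approach [[0], [1], [1], [0]] after many iterations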

Well that's my presentation on neural nets. I hope you liked it and maybe learnt
something as well. Feel free to email me if you have any questions. Thanks for
watching!

Clustering
Clustering is a data mining technique that identifies a cluster of objects having
similar characteristics. At a simple level, clustering uses one or more attributes
as the basis for identifying a cluster of correlating results.

Clustering
This video gives us some more details about Clustering.

Clustering is the process of breaking down a large population or a large
dataset into smaller groups. As an analyst, you will often face the question of
how to organize the data that you're observing into some kind of meaningful
structure or pattern, and this is where clustering comes in handy.
Clustering allows you to break a population into smaller groups where each
observation within a group is more similar to the other observations in that group
than to observations in other groups. So the idea is to group similar
observations together into smaller groups and thus break down the large
heterogeneous population that you're seeing into smaller, more homogeneous groups.

Let's take an example to understand how clustering works exactly. Imagine that
you own a chain of ice cream shops. You have a number of ice cream shops
spread across the country. Say you have 8 of them and you sell two flavors of
ice cream. You sell chocolate ice cream and you sell vanilla ice-cream. Now in
this table here you can see the sales of both chocolate and vanilla ice cream
across your 8 stores. The units are not important. The timeframe is not
important for what you're doing. But just imagine that this is the data that you're
looking at.

Now there are many different ways you can make sense of this data. You can
look at summary statistics. You can calculate the mean, median, spread of the
variables and dispersion in order to get a better sense of this data. One very
intuitive way of doing this is to plot this data on a graph. So here we have
plotted the sales of both chocolate and vanilla ice cream for each of these 8
stores. So you can see 8 dots here. Each of these dots represents a store and
on the y-axis you have chocolate sales, on the x axis you have vanilla sales. So
you've mapped these 8 stores by their chocolate and vanilla sales and you've
created a scatterplot. This is a very intuitive way of looking at this data to
understand what this data is saying.

Now when you look at this graph, there is one very clear insight that has come
from this and that is that, you can divide your stores into two distinct groups.
You have one large group of stores, a group of 5 stores here and you have
another group of stores which has 3 stores. So essentially, your 8 stores can be
divided into two different groups that behave slightly differently in terms of their
chocolate and vanilla sales. The difference essentially is in terms of the
magnitude of the sales. In group 1, you can see that sales of both chocolate
and vanilla ice cream are lower than in group 2. So, what we've done is, we've
just looked at sales of 8 stores for these two flavors of ice cream and we have
plotted them on a graph and then we have just divided the stores into 2 groups
based on where they were on the graph and their proximity to each other. So,
this is essentially how clustering works. This is a very simple two dimensional
example of how clustering works. But this accurately explains how the
algorithm really works.

Now, to better understand this algorithm, let's just look at one more thing that
we've done here quite intuitively, actually without even realizing it. When we were
grouping these stores, when we created these two groups, what have we
done? At a very intuitive level, what we've done, between these three
stores - if I look at the cluster on the top - is taken one imaginary point
somewhere in the center and drawn a circle around it. Similarly, for group
2, we've taken an imaginary point and we've drawn a circle around it, and all the
observations that fall within that circle are grouped together into one cluster. So
that's essentially how clustering works.

Now, this was an example where we have 2 flavors of ice cream and we have 8
stores. Now imagine that you have expanded your chain of ice cream stores
and, instead of 2 flavors, you're selling 30 different flavors. You're selling banana
ice cream, dark chocolate ice cream, Belgian chocolate - you're selling all kinds
of flavors, 30 different flavors. So, how will you plot this information on a graph
now? You can't draw a 30-dimensional graph; there's no way we can visualize a
30-dimensional graph. And imagine if, instead of 8 stores, you've now grown to
500 stores. So instead of 8 points, you will have 500 different points on the
graph. That's still possible to visualize, but if you have something like a million
records, then you have a million different points, and if you have thousands of
variables, then you have thousands of different dimensions. So, there is a
mathematical way of dealing with such complexity, and that is how cluster
analysis works. Let's understand how clustering algorithms work in a little more detail.
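
Here is a small sketch of the ice cream example using the k-means algorithm, one common clustering method. The sales figures are made-up numbers with the same shape as the example: a lower-selling group of 5 stores and a higher-selling group of 3.

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical chocolate and vanilla sales for the 8 stores (units do not matter).
    sales = np.array([
        [120, 95], [130, 110], [115, 100], [125, 105], [118, 98],   # lower-selling group
        [310, 280], [295, 300], [320, 290],                          # higher-selling group
    ])

    # Ask for 2 clusters: k-means places a centre in each group (the "imaginary point")
    # and assigns every store to its nearest centre (the "circle" drawn around it).
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(sales)
    print("cluster of each store:", kmeans.labels_)
    print("cluster centres:", kmeans.cluster_centers_)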

Association Rule Mining


Association Rule Mining is one of the best-known data mining techniques. Here,
we discover a pattern based on the relationship between items in the same
transaction. In market basket analysis, we use this association technique to
identify products that customers frequently purchase together.

Association Rule Mining


This video will help us learn more about Association Rule Mining.
The following is a production of the Metro Applied Research Club and the
Department of Computer Information Systems at the Metropolitan State
University of Denver, with special thanks to our corporate sponsor Nebraska
Aluminium Castings - quality, custom aluminium die-casting in Hastings,
Nebraska.

Welcome to the lecture, Association Rules: The Basics. The goals of this lecture
are to introduce the student to key concepts of association rules, including what
association rules are, how they can be applied and how they can be
interpreted; to introduce the student to important terms and definitions central
to association rules; and to demonstrate how associations can be created and
assessed using a simple dataset. Some of the key words and terms you will
want to pay special attention to include: Association Rules, Support,
Confidence, If-then, Antecedent, Consequent and Item set.

Today we're going to be exploring association rules, and to follow along with me,
you are probably going to want to have two files. The first one is an Excel file
called credit risk association 2 workbook final; that Excel file should be available
to you, and if not, you can go to ww.joehasley.com. The other file, which you might
want to be able to find, is a lecture from the MIT website. It's called Discovering
association rules in transaction database. To explore association rules, you are
going to be looking at a relatively simple problem. Data in the file
"credit_risk_assn2_workbook_final" basically allows us to determine "Credit
Standing". Credit Standing is going to be our dependent variable, and then we'll
be looking at several independent variables and trying to figure out if we can
predict the outcome of the dependent variable based on the independent variables.

So, for example, the independent variable "Checking Acct" tells us: do they
even have a checking account? This person has no checking account, and this
person has no balance - so they have an account but no balance. These are all
relative; we are not given a dollar amount, just told relative values. "Credit
history" is basically: are they current on their bills? Does the bank have to pay
something, or are they behind on it? Those types of things. We're told the "Purpose"
of the previous loan, which was for a small appliance or furniture or maybe a new
car. "Savings Acct" tells us, as opposed to a checking account, about a savings
account and the amount that they have in it; they may have no account. Then there
is their "Employment" - how long have they been at their job - their "Gender", the
"Marital Status" of the applicant, the "Housing" status of the applicant, what type
of job they have, whether they have a telephone, whether they are foreign-born,
their current age, and ultimately the base issue - do we give them a loan or not.
So, "Credit Standing" is the dependent variable that we're looking at. We're going
to try and predict that based on these independent variables.

In the case of this dataset, I have 425 instances or examples and again our task
is to determine some rules based on this data that will help us make decisions
about who to loan money to or not. So, let's talk a little bit about Association
Rules. Association Rules provide information in formal “if - then” statements.
These rules are computed from the data and unlike the “if - then” rules of logic,
association rules are probabilistic in nature. In addition to the antecedent, that’s
the - if part, and the consequent - that's then part. An association rule has two
numbers that express the degree of uncertainty about the rule. What this means
is, rules are not hard-and-fast. Instead of stating a rule such as, "It is cloudy so
it will definitely rain today", a probabilistic rule would say "It is cloudy so it will
rain today is true about 40% of the time. “Or whatever percent.

In association analysis, the antecedent and the consequent are sets of items,
called item sets, that are disjoint. That means they do not have any items in
common. For example, if we consider marital status, we see that the options are
single, divorced or married. Now, in real life you could be separated, or there
could be other options, but in this dataset we have three values. Those values
are mutually exclusive - you can only be in one category - and they are
collectively exhaustive, meaning those are all the possible options for this dataset.

Let's go ahead and examine a few simple association rules together. For
example, a simple association rule would be: if a person is single, then they
have good credit. Another association rule would be: if a person is single, then
they have bad credit. Our task is to somehow examine these and say, "are these
good rules?" When we examine a rule, we will judge it by two numbers. The first
number is called the support for the rule. The support is simply the number of
transactions that include all items in the antecedent and consequent parts of the
rule. The support is sometimes expressed as a percentage of the total number
of records in the database. For example, I can run a simple Excel formula,
COUNTIF, and I can say: out of my 425 records, tell me how many of them have
a marital status that is single. Well, it turned out to be 233. So that's, out of the
425 records, how many have a marital status of single.
The next thing I want to calculate is: of the individuals with the marital status
single, how many of them have good credit? Well, that's my confidence. Put
simply, confidence tells us that of the 233 people in our dataset with the marital
status single, 130 also had a credit standing of "Good". So about 55.8% of the
233 single people end up also having good credit. You can also see that of the
233 single people, 103 of them ended up having bad credit. So about 55.8% of
the single people had good credit and 44.2% had bad credit. Support tells us,
out of the whole dataset, how often this rule holds true - whether single people
had good credit or whether single people had bad credit. This says that, out of
the 425 people, about 30% of them were single people with good credit and
about 24% of them were single people with bad credit.
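
The same support and confidence figures can be reproduced directly from the counts quoted in the lecture; the short calculation below uses only those numbers.

    # Counts from the lecture: 425 records, 233 single, 130 single with good credit.
    total_records = 425
    single = 233
    single_and_good = 130

    # Support: how often the whole rule (single AND good credit) holds in the dataset.
    support = single_and_good / total_records      # about 0.306, i.e. roughly 30%

    # Confidence: of the single customers, the share that also have good credit.
    confidence = single_and_good / single          # about 0.558, i.e. roughly 55.8%

    print(f"IF single THEN good credit -> support {support:.1%}, confidence {confidence:.1%}")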

Let's consider a second rule: if divorced, then credit is good. The confidence for
this rule is roughly 41.66%. What that says is, if you are divorced, then there is a
41.6% chance that your credit standing is good. On the flip side, if you are
divorced, then the probability that your credit standing is bad is 0.5833 - about
58.3% of divorced people had bad credit. Now, it's worth noting that if someone
is married, I can't really tell much about them. They have almost a 50-50 chance
of having good or bad credit. Finding out that someone is married doesn't give
me much of a statistical advantage over just flipping a coin. That is to say, if you
had access to this information and someone else didn't, and that someone
offered you a bet - "I'll bet you ten dollars that a married person is a bad credit
risk" - the information is not giving you much of an advantage. Compare this to
the information that a person is divorced. If you had access to this information
and someone else did not, and that someone offered you a bet - "I'll bet you ten
dollars that a divorced person is a good credit risk" - you would know to take the
bet, so that you are betting that a divorced person is a bad credit risk. Then the
odds are significantly in your favor.

This is a clear way to quantify the information utility, or information value, of the
association. It is a direct application of the value of perfect information that
you learn about in your management science class. This marks the end of part
1. Please join us in part two as we do another example and wrap up the lecture.

Common Data Mining Problems


We have almost reached the last mile! This video will help us understand some
of the Common Data Mining Problems that we come across in real life.

Data Mining Methods Basics Summary

In this course, you have learnt:
What is Data Mining
Knowledge Discovery Process
Roles involved in Data Mining
Commonly used Data Mining Techniques
Applications of Data Mining
