
Machine Learning Part

INTRODUCTION

Domain overview

Machine learning aims to predict the future from past data. Machine learning (ML) is a
type of artificial intelligence (AI) that gives computers the ability to learn without being
explicitly programmed. It focuses on the development of computer programs that can adapt
when exposed to new data; this project covers the basics of machine learning and the
implementation of simple machine learning algorithms in Python. The process of training and
prediction relies on specialized algorithms: the training data is fed to an algorithm, and the
algorithm uses this training data to make predictions on new test data. Machine learning can
be roughly separated into three categories: supervised learning, unsupervised learning and
reinforcement learning. In supervised learning the program is given both the input data and
the corresponding labels, so the data has to be labeled by a human being beforehand. In
unsupervised learning no labels are provided to the learning algorithm; the algorithm has to
discover the structure or clustering of the input data on its own. Finally, in reinforcement
learning the program dynamically interacts with its environment and receives positive or
negative feedback to improve its performance.
Data scientists use many different kinds of machine learning algorithms to discover
patterns in data that lead to actionable insights. At a high level, these algorithms can be
classified into two groups based on the way they “learn” about data to make predictions:
supervised and unsupervised learning. Classification is the process of predicting the class of
given data points. Classes are sometimes called targets, labels or categories. Classification
predictive modeling is the task of approximating a mapping function from input variables (X)
to discrete output variables (y). In machine learning and statistics, classification is a
supervised learning approach in which the computer program learns from the data input
given to it and then uses this learning to classify new observations. The data set may simply
be bi-class (for example, identifying whether a person is male or female, or whether a mail is
spam or not spam) or it may be multi-class. Some examples of classification problems are
speech recognition, handwriting recognition, biometric identification and document
classification.

Fig: Process of Machine Learning (a past dataset trains the machine learning model, which analyses new data and predicts the result)

Supervised machine learning accounts for the majority of practical machine learning
applications. In supervised learning there are input variables (X) and an output variable (y),
and an algorithm is used to learn the mapping function from the input to the output,
y = f(X). The goal is to approximate the mapping function so well that, given new input data
(X), the output variable (y) can be predicted for that data. Techniques of supervised machine
learning include logistic regression, multi-class classification, decision trees and support
vector machines. Supervised learning requires that the data used to train the algorithm is
already labeled with the correct answers. Supervised learning problems can be further
grouped into classification problems. The goal of such a problem is the construction of a
succinct model that can predict the value of the dependent attribute from the attribute
variables. The difference between regression and classification is that the dependent attribute
is numerical for regression and categorical for classification. A classification model attempts
to draw a conclusion from observed values: given one or more inputs, it tries to predict the
value of one or more outcomes. A classification problem is one where the output variable is a
category, such as “red” or “blue”.

OUTLINE OF THE PROJECT

Overview of the system

The system has to find the accuracy of the training dataset, the accuracy of the testing
dataset, the specificity, the false positive rate, the precision and the recall by comparing
algorithms using Python code. The steps involved are:

 Define a problem
 Preparing data
 Evaluating algorithms
 Improving results
 Predicting results

Exploratory Data Analysis

This analysis is not meant to provide a final conclusion on the reasons behind delays in
the railway sector, as it does not involve any inferential statistics techniques or machine
learning algorithms. Machine learning supervised classification algorithms will then be used
on the travel class dataset to extract patterns, which would help in predicting whether a
passenger is likely to be affected by delays or not, thereby supporting better decisions in the
future. Multiple datasets from different sources would be combined to form a generalized
dataset, and then different machine learning algorithms would be applied to extract patterns
and to obtain results with maximum accuracy.

At the dawn of artificial intelligence it was discovered that problems which could be
formally described by a list of mathematical rules could be easily solved by machines. This
enabled computers to solve logical problems that were difficult for humans. The first
successes in artificial intelligence mostly took place in formal environments where the
program did not need much knowledge about the rest of the world. This showed that artificial
intelligence excelled at problems that could be described mathematically, although it would
struggle with less formal problems. This project applies machine learning algorithms to
railway passenger travel choice prediction based on delay faults.

Data Wrangling
In this section of the report the data is loaded, checked for cleanliness, and then the
given dataset is trimmed and cleaned for analysis. The steps are documented carefully and
the cleaning decisions are justified.

Data collection
The data set collected for predicting passengers is split into a training set and a test set.
Generally, a 7:3 ratio is applied to split the training set and the test set. The data models
created using the Random Forest, Logistic Regression, Decision Tree, K-Nearest Neighbor
(KNN) and Support Vector Classifier (SVC) algorithms are applied on the training set and,
based on the test result accuracy, the test set prediction is done. A minimal splitting sketch
follows.
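
The following is a minimal sketch of the 7:3 split described above, assuming the dataset has already been loaded into a pandas DataFrame; the file name and the 'target' column name are placeholders, not taken from the actual project files:

import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder file and column names; replace with the real dataset.
df = pd.read_csv("passenger_data.csv")
X = df.drop(columns=["target"])   # feature columns
y = df["target"]                  # label column

# 7:3 split of the data into training and test sets, as described above.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(X_train.shape, X_test.shape)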

Preprocessing
The data which was collected might contain missing values that may lead to
inconsistency. To gain better results the data needs to be preprocessed so as to improve the
efficiency of the algorithm. The outliers have to be removed and variable conversion needs to
be done. Based on the correlation among attributes it was observed that the attributes that are
significant individually include property area, education, loan amount and, lastly, credit
history, which is the strongest of all. Some variables such as applicant income and co-applicant
income are not significant alone, which is strange since intuition suggests they are important.
The correlation among attributes can be identified using the plot diagrams in the data
visualization process. Data preprocessing is the most time consuming phase of a data mining
process. Data cleaning of the loan data removed several attributes that have no significance
for the behavior of a customer. Data integration, data reduction and data transformation are
also applied to the loan data. For easy analysis, the data is reduced to a minimum number of
records. Initially the attributes which are critical to make a loan credibility prediction are
identified using information gain as the attribute evaluator and Ranker as the search method.
A short preprocessing sketch follows.
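
A minimal preprocessing sketch along the lines described above, assuming the pandas DataFrame df from the previous sketch with a mix of numeric and categorical columns (the exact handling in the project may differ):

from sklearn.preprocessing import LabelEncoder

# Fill missing numeric values with the column median, fill missing categorical
# values with the most frequent value, then encode categories as integers.
for col in df.columns:
    if df[col].dtype == "object":
        df[col] = df[col].fillna(df[col].mode()[0])
        df[col] = LabelEncoder().fit_transform(df[col])   # variable conversion
    else:
        df[col] = df[col].fillna(df[col].median())

print(df.isnull().sum())   # confirm that no missing values remain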

Construction of a Predictive Model


Machine learning needs data gathering, which requires a lot of past data. Data gathering
provides sufficient historical and raw data. Raw data cannot be used directly; it is first
preprocessed, and then a suitable kind of algorithm and model is chosen. The model is trained
and tested so that it works and predicts correctly with minimum error. The model is then
tuned from time to time to improve its accuracy.
The steps involved in building the data model are depicted below.

Data collection (splitting training set & test set) → Pre-processing → Building the classification model → Prediction

Fig: Data flow diagram for the machine learning model

Project Goals
 Exploration data analysis of variable identification
 Loading the given dataset
 Import required library packages
 Analyze the general properties
 Find duplicate and missing values
 Checking unique and count values
 Uni-variate data analysis
 Rename, add and drop data
 Specify data types
 Exploration data analysis of bi-variate and multi-variate relationships
 Plot diagrams: pairplot, heatmap, bar chart and histogram
 Method of outlier detection with feature engineering
 Pre-processing the given dataset
 Splitting the test and training dataset
 Comparing the decision tree, logistic regression and random forest models, etc.
 Comparing algorithms to predict the result based on the best accuracy

Training the Dataset


 The first step imports the iris data set, which is already predefined in the sklearn module;
the raw data set is basically a table which contains information about the various varieties.
 For example, an algorithm and the train_test_split class are imported from sklearn,
together with the numpy module, for use in this program.
 The dataset-loading method is encapsulated in a dataset variable. The dataset is then
divided into training data and test data using the train_test_split method. The X prefix in a
variable denotes the feature values and the y prefix denotes the target values.
 This method divides the dataset into training and test data randomly, in a ratio of roughly
67:33 or 70:30. Then any algorithm is encapsulated.
 In the next step, the training data is fitted into this algorithm so that the computer can get
trained using this data. Now the training part is complete; a minimal sketch of these steps is
shown below.
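
A minimal training sketch along the lines of the steps above, using sklearn's built-in iris dataset and a decision tree as the example algorithm (any other classifier could be substituted):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the predefined iris dataset; X holds the feature values, y the target values.
iris = load_iris()
X, y = iris.data, iris.target

# Randomly divide the dataset into training and test data (roughly 67:33).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

# Fit the training data so the model gets trained.
model = DecisionTreeClassifier()
model.fit(X_train, y_train)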

Testing the Dataset


 Now, the feature values of a new sample are placed in a numpy array called ‘n’, and the
species of this sample is predicted using the predict method, which takes this array as input
and returns the predicted target value as output.
 So the predicted target value comes out to be 0. Finally, the test score is found, which is
the ratio of the number of predictions found correct to the total number of predictions made;
the accuracy score method basically compares the actual values of the test set with the
predicted values. A minimal testing sketch is shown below.
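
Continuing the training sketch above, the following shows the prediction and scoring steps; the sample values in n are illustrative only:

from sklearn.metrics import accuracy_score

# A new, unseen sample (illustrative feature values for one iris flower).
n = np.array([[5.0, 3.4, 1.5, 0.2]])
print(model.predict(n))            # predicted target value, e.g. 0

# Test score: the fraction of test-set predictions that match the actual labels.
y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))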

Input details → Data processing → Training dataset / Test dataset → Classification ML algorithm → Model

Fig: Architecture of the proposed model


Data Gathering → Data Pre-Processing → Choose model → Train model → Test model → Tune model → Prediction

Fig: Process of dataflow diagram

General Properties
Create cells freely to explore the given data, but do not perform too many operations in
each cell. One option is to do a lot of exploration in an initial notebook. These cells do not
have to be organized, but enough comments should be used to make the purpose of each code
cell clear. Then, after the analysis is done, create a duplicate notebook where the excess is
trimmed and the steps are organized so that the report is flowing and cohesive, and the reader
is informed of the steps taken in the investigation. Follow every code cell, or every set of
related code cells, with a markdown cell describing what was found in the preceding cell, so
that the reader can understand what they will see in the following cell.
Project Requirements

General:

Requirements are the basic constraints that are required to develop a system.
Requirements are collected while designing the system. The following are the requirements
that are to be discussed:

1. Functional requirements

2. Non-Functional requirements

3. Environment requirements

A. Hardware requirements

B. software requirements

Functional requirements:

The software requirements specification is a technical specification of requirements for
the software product. It is the first step in the requirements analysis process. It lists the
requirements of a particular software system. The system uses the following special libraries:
sk-learn, pandas, numpy, matplotlib and seaborn.

Non-Functional Requirements:

Process of functional steps:

1. Define the problem
2. Prepare the data
3. Evaluate algorithms
4. Improve results
5. Predict the result

Environmental Requirements:

1. Software Requirements:

Operating System : Windows

Tool : Anaconda with Jupyter Notebook

2. Hardware requirements:

Processor : Pentium IV/III

Hard disk : minimum 80 GB


RAM : minimum 2 GB

Library Packages:

 Pandas
 Numpy
 Matplotlib
 Seaborn
 Sk-learn
 Tkinter

Frontend: Anaconda with jupyter notebook using python language

Backend: GUI Application using python language

Software Description

Anaconda is a free and open-source distribution of the Python and R programming


languages for scientific computing (data science, machine learning applications, large-scale
data processing, predictive analytics, etc.), that aims to simplify package management and
deployment. Package versions are managed by the package management
system “Conda”. The Anaconda distribution is used by over 12 million users and includes
more than 1,400 popular data-science packages suitable for Windows, Linux and macOS. The
Anaconda distribution comes with more than 1,400 packages, the Conda package and virtual
environment manager, and a desktop GUI called Anaconda Navigator, which eliminates the
need to learn to install each library independently. The open source packages can be individually
installed from the Anaconda repository with the  conda install  command or using the  pip
install  command that is installed with Anaconda. Pip packages provide many of the features
of conda packages and in most cases they can work together. Custom packages can be made
using the  conda build  command, and can be shared with others by uploading them to
Anaconda Cloud, PyPI or other repositories. The default installation of Anaconda2 includes
Python 2.7 and Anaconda3 includes Python 3.7. However, you can create new environments
that include any version of Python packaged with conda.

Anaconda Navigator
Anaconda Navigator is a desktop graphical user interface (GUI) included in
Anaconda distribution that allows users to launch applications and manage conda packages,
environments and channels without using command-line commands. Navigator can search for
packages on Anaconda Cloud or in a local Anaconda Repository, install them in an
environment, run the packages and update them. It is available
for Windows, macOS and Linux.
The following applications are available by default in Navigator:
 JupyterLab
 Jupyter Notebook
 QtConsole
 Spyder
 Glueviz
 Orange
 Rstudio
 Visual Studio Code

Conda:
Conda is an open source, cross-platform, language-agnostic package manager and
environment management system that installs, runs and updates packages and their
dependencies. It was created for Python programs, but it can package and distribute software
for any language (e.g., R), including multi-languages. The Conda package and environment
manager is included in all versions of Anaconda, Miniconda, and Anaconda Repository.

The Jupyter Notebook

The Jupyter Notebook is an open-source web application that allows you to create and
share documents that contain live code, equations, visualizations and narrative text. Uses
include: data cleaning and transformation, numerical simulation, statistical modeling, data
visualization, machine learning, and much more.

Notebook document:
Notebook documents (or “notebooks”, all lower case) are documents produced by
the Jupyter Notebook App, which contain both computer code (e.g. python) and rich text
elements (paragraph, equations, figures, links, etc…). Notebook documents are both human-
readable documents containing the analysis description and the results (figures, tables, etc.)
as well as executable documents which can be run to perform data analysis.

Jupyter Notebook App:


The Jupyter Notebook App is a server-client application that allows editing and
running notebook documents via a web browser. The Jupyter Notebook App can be executed
on a local desktop requiring no internet access (as described in this document) or can be
installed on a remote server and accessed through the internet. In addition to
displaying/editing/running notebook documents, the Jupyter Notebook App has a
“Dashboard” (Notebook Dashboard), a “control panel” showing local files and allowing users
to open notebook documents or shut down their kernels.
kernel:
A notebook kernel is a “computational engine” that executes the code contained in
a Notebook document. The ipython kernel, referenced in this guide, executes python code.
Kernels for many other languages exist (official kernels). When you open a Notebook
document, the associated kernel is automatically launched. When the notebook
is executed (either cell-by-cell or with menu Cell -> Run All), the kernel performs the
computation and produces the results. Depending on the type of computations,
the kernel may consume significant CPU and RAM. Note that the RAM is not released until
the kernel is shut down.

Notebook Dashboard:
The Notebook Dashboard is the component which is shown first when you
launch Jupyter Notebook App. The Notebook Dashboard is mainly used to open notebook
documents, and to manage the running kernels (visualize and shutdown). The Notebook
Dashboard has other features similar to a file manager, namely navigating folders and
renaming/deleting files.

Working Process:

 Download and install Anaconda and get the most useful packages for machine learning
in Python.
 Load a dataset and understand its structure using statistical summaries and data
visualization.
 Build machine learning models, pick the best one and build confidence that the accuracy
is reliable.

Python is a popular and powerful interpreted language. Unlike R, Python is a complete
language and platform that can be used both for research and for developing production
systems. There are also a lot of modules and libraries to choose from, providing multiple
ways to do each task, which can feel overwhelming.

The best way to get started using Python for machine learning is to complete a project.

 It will force you to install and start the Python interpreter (at the very least).
 It will give you a bird’s eye view of how to step through a small project.
 It will give you confidence, maybe to go on to your own small projects.

When you are applying machine learning to your own datasets, you are working on a
project. A machine learning project may not be linear, but it has a number of well-known
steps:

 Define Problem.
 Prepare Data.
 Evaluate Algorithms.
 Improve Results.
 Present Results.

The best way to really come to terms with a new platform or tool is to work through a
machine learning project end-to-end and cover the key steps. Namely, from loading data,
summarizing data, evaluating algorithms and making some predictions.

Here is an overview of what we are going to cover:

1. Installing the Python anaconda platform.


2. Loading the dataset.
3. Summarizing the dataset.
4. Visualizing the dataset.
5. Evaluating some algorithms.
6. Making some predictions.

Work flow diagram

Source Data → Data Processing and Cleaning → Training Dataset / Testing Dataset → Classification ML Algorithms → Best Model by Accuracy → Finding prediction results

Fig: Workflow Diagram


List of Modules: (Only Machine Learning Part)

 Data validation Process


 Data pre-processing process
 Exploration data analysis of visualization
 Outlier detection process
 Comparing Algorithm with prediction in the form of best accuracy result
 GUI Application

Variable Identification Process/ Data validation process

Validation techniques in machine learning are used to estimate the error rate of the
machine learning (ML) model, which can be considered close to the true error rate on the
dataset. If the data volume is large enough to be representative of the population, validation
techniques may not be needed. However, in real-world scenarios we work with samples of
data that may not be a true representative of the population of the given dataset. Validation
includes finding the missing values, the duplicate values and the description of each data
type, whether it is a float or an integer variable. The validation sample is used to provide an
unbiased evaluation of a model fit on the training dataset while tuning the model
hyper-parameters.

The evaluation becomes more biased as skill on the validation dataset is incorporated
into the model configuration. The validation set is used to evaluate a given model, but this is
for frequent evaluation; machine learning engineers use this data to fine-tune the model
hyper-parameters. Data collection, data analysis, and the process of addressing data content,
quality and structure can add up to a time-consuming to-do list. The process of data
identification helps to understand the data and its properties, and this knowledge helps in
choosing which algorithm to use to build the model. For example, time series data can be
analyzed by regression algorithms, while classification algorithms can be used to analyze
discrete data; as part of this step the data type format of the given dataset is shown. A number
of different data cleaning tasks are carried out using Python’s Pandas library, focusing on
probably the biggest data cleaning task, missing values, so that data can be cleaned more
quickly and less time is spent cleaning and more time exploring and modeling. Some missing
values are just simple random mistakes; other times there is a deeper reason why data is
missing. It is important to understand these different types of missing data from a statistics
point of view, because the type of missing data influences how the missing values are
detected and filled in, from basic imputation to more detailed statistical approaches. Before
jumping into the code, it is important to understand the sources of missing data. Here are
some typical reasons why data is missing:

 User forgot to fill in a field.


 Data was lost while transferring manually from a legacy database.

 There was a programming error.

 Users chose not to fill out a field tied to their beliefs about how the results would be
used or interpreted.

Variable identification with Uni-variate, Bi-variate and Multi-variate analysis:

 Import libraries for access and functional purposes and read the given dataset
 Analyze the general properties of the given dataset
 Display the given dataset in the form of a data frame
 Show the columns
 Show the shape of the data frame
 Describe the data frame
 Check data types and information about the dataset
 Check for duplicate data
 Check missing values of the data frame
 Check unique values of the data frame
 Check count values of the data frame
 Rename and drop columns of the given data frame
 Specify the type of values
 Create extra columns
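
A minimal sketch of the inspection steps listed above, using pandas on the DataFrame df loaded earlier (the column names in the last two lines are placeholders):

print(df.head())               # display the data frame
print(df.columns)              # show the columns
print(df.shape)                # shape of the data frame
print(df.describe())           # describe the data frame
df.info()                      # data types and information about the dataset
print(df.duplicated().sum())   # number of duplicate rows
print(df.isnull().sum())       # missing values per column
print(df.nunique())            # unique values per column
df = df.rename(columns={"old_name": "new_name"})   # placeholder column names
df = df.drop(columns=["unwanted_column"])          # placeholder column name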

Data Validation/ Cleaning/Preparing Process


The library packages are imported and the given dataset is loaded. Variable
identification is carried out by analyzing the data shape and data types and by evaluating the
missing and duplicate values. A validation dataset is a sample of data held back from training
the model that is used to give an estimate of model skill while tuning the model; there are
procedures for making the best use of validation and test datasets when evaluating models.
Data cleaning/preparing includes renaming columns of the given dataset, dropping columns
and so on, in order to analyze the uni-variate, bi-variate and multi-variate processes. The steps
and techniques for data cleaning will vary from dataset to dataset. The primary goal of data
cleaning is to detect and remove errors and anomalies to increase the value of data in
analytics and decision making.

Exploration data analysis of visualization

Data visualization is an important skill in applied statistics and machine learning.
Statistics does indeed focus on quantitative descriptions and estimations of data, while data
visualization provides an important suite of tools for gaining a qualitative understanding. This
can be helpful when exploring and getting to know a dataset and can help with identifying
patterns, corrupt data, outliers and much more. With a little domain knowledge, data
visualizations can be used to express and demonstrate key relationships in plots and charts
that are more visceral to stakeholders than measures of association or significance. Data
visualization and exploratory data analysis are whole fields in themselves, and a deeper dive
into some of the books mentioned at the end is recommended.

Sometimes data does not make sense until it is looked at in a visual form, such as with
charts and plots. Being able to quickly visualize data samples is an important skill both in
applied statistics and in applied machine learning. The following are the main types of plots
needed when visualizing data in Python, and how to use them to better understand the data:

 How to chart time series data with line plots and categorical quantities with bar charts.
 How to summarize data distributions with histograms and box plots.
 How to summarize the relationship between variables with scatter plots.
There are many excellent plotting libraries in Python, and it is recommended to explore
them in order to create presentable graphics. For quick and dirty plots intended for personal
use, the matplotlib library is recommended. It is the foundation for many other plotting
libraries and for the plotting support in higher-level libraries such as Pandas. Matplotlib
provides a context in which one or more plots can be drawn before the image is shown or
saved to a file, and the context can be accessed via functions on pyplot. A short plotting
sketch is given below.
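
A minimal visualization sketch using matplotlib and seaborn on the DataFrame df from the earlier sketches; the column names 'age' and 'target' are placeholders:

import matplotlib.pyplot as plt
import seaborn as sns

# Histogram of one numeric column (placeholder name 'age').
df["age"].plot(kind="hist", bins=20, title="Distribution of age")
plt.show()

# Heatmap of pairwise correlations between the numeric columns.
sns.heatmap(df.select_dtypes("number").corr(), annot=True, cmap="coolwarm")
plt.show()

# Pairplot of the numeric columns, coloured by the target (placeholder name 'target').
sns.pairplot(df, hue="target")
plt.show()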

Outlier detection process

Many machine learning algorithms are sensitive to the range and distribution of
attribute values in the input data. Outliers in input data can skew and mislead the training
process of machine learning algorithms resulting in longer training times, less accurate
models and ultimately poorer results.
Even before predictive models are prepared on training data, outliers can result in
misleading representations and in turn misleading interpretations of collected data. Outliers
can skew the summary distribution of attribute values in descriptive statistics like mean and
standard deviation and in plots such as histograms and scatterplots, compressing the body of
the data. Finally, outliers can represent examples of data instances that are relevant to the
problem such as anomalies in the case of fraud detection and computer security.
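
A common outlier detection sketch based on the interquartile range (IQR); this is one standard approach, not necessarily the exact method used in the project, and the column name 'fare' is a placeholder:

import pandas as pd

def iqr_outliers(series: pd.Series) -> pd.Series:
    """Return a boolean mask marking values outside 1.5 * IQR."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return (series < lower) | (series > upper)

mask = iqr_outliers(df["fare"])       # placeholder column name
print("Outliers found:", mask.sum())
df_clean = df[~mask]                  # drop the outlier rows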

False Positives (FP): A person who will pay is predicted as a defaulter. This is when the
actual class is no and the predicted class is yes, e.g. the actual class says this passenger did
not survive but the predicted class tells you that this passenger will survive.

False Negatives (FN): A person who defaults is predicted as a payer. This is when the actual
class is yes but the predicted class is no, e.g. the actual class value indicates that this
passenger survived but the predicted class tells you that the passenger will die.

True Positives (TP): A person who defaults is predicted as a defaulter. These are the correctly
predicted positive values, which means that the value of the actual class is yes and the value
of the predicted class is also yes, e.g. the actual class value indicates that this passenger
survived and the predicted class tells you the same thing.

True Negatives (TN): A person who will pay is predicted as a payer. These are the correctly
predicted negative values, which means that the value of the actual class is no and the value
of the predicted class is also no, e.g. the actual class says this passenger did not survive and
the predicted class tells you the same thing.
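
These four counts can be read directly from a confusion matrix; a minimal scikit-learn sketch, assuming binary actual and predicted labels in the placeholder variables y_test and y_pred:

from sklearn.metrics import confusion_matrix

# For a binary problem, ravel() unpacks the 2x2 matrix as TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)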

Comparing Algorithm with prediction in the form of best accuracy result

It is important to compare the performance of multiple different machine learning
algorithms consistently, which is done by creating a test harness to compare multiple different
machine learning algorithms in Python with scikit-learn. This test harness can be used as a
template for other machine learning problems, adding more and different algorithms to
compare. Each model will have different performance characteristics. Using resampling
methods like cross validation, an estimate can be obtained of how accurate each model may
be on unseen data. These estimates are then used to choose one or two of the best models
from the suite of models that have been created. When given a new dataset, it is a good idea
to visualize the data using different techniques in order to look at the data from different
perspectives. The same idea applies to model selection: a number of different ways of looking
at the estimated accuracy of the machine learning algorithms should be used in order to
choose the one or two to finalize. A way to do this is to use different visualization methods to
show the average accuracy, variance and other properties of the distribution of model
accuracies.

In the example below 5 different algorithms are compared:

 Logistic Regression
 Random Forest
 K-Nearest Neighbors
 Decision tree
 Support Vector Machines

The K-fold cross validation procedure is used to evaluate each algorithm, importantly
configured with the same random seed to ensure that the same splits of the training data are
performed and that each algorithm is evaluated in precisely the same way. Before comparing
the algorithms, a machine learning model is built using the scikit-learn library. This library
package provides preprocessing, a linear model with the logistic regression method, cross
validation with the KFold method, an ensemble with the random forest method and a tree
with the decision tree classifier. Additionally, the train set and test set are split, and the result
is predicted by comparing accuracy. A comparison sketch is given below.
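
A minimal test-harness sketch along these lines, comparing the five algorithms with 10-fold cross validation and a shared random seed, reusing the X and y placeholders from the earlier sketches:

from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=7),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=7),
    "Support Vector Machine": SVC(),
}

# The same KFold splits (same seed) so every algorithm is evaluated identically.
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=kfold, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f} (std = {scores.std():.3f})")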

Prediction result by accuracy:


Logistic regression also uses a linear equation with independent predictors to predict a
value. The predicted value can be anywhere between negative infinity and positive infinity,
but the output of the algorithm needs to be a class variable, so the linear output is mapped to a
probability using the logistic function. In this project the higher-accuracy prediction result
comes from the logistic regression model when comparing the best accuracies.

Over-fitting is a common problem in machine learning which can occur in most models.
K-fold cross-validation can be conducted to verify that the model is not over-fitted. In this
method, the data set is randomly partitioned into k mutually exclusive subsets of
approximately equal size; one is kept for testing while the others are used for training. This
process is iterated through all k folds.

True Positive Rate(TPR) = TP / (TP + FN)

False Positive rate(FPR) = FP / (FP + TN)

Accuracy: The proportion of the total number of predictions that are correct; in other words,
how often the model correctly predicts defaulters and non-defaulters.

Accuracy calculation:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Accuracy is the most intuitive performance measure and is simply the ratio of correctly
predicted observations to the total observations. One may think that high accuracy means the
model is best. Accuracy is indeed a great measure, but only for symmetric datasets where the
numbers of false positives and false negatives are almost the same.

Precision: The proportion of positive predictions that are actually correct. (When the model
predicts default, how often is it correct?)

Precision = TP / (TP + FP)

Precision is the ratio of correctly predicted positive observations to the total predicted
positive observations. The question this metric answers is: of all the passengers labeled as
survived, how many actually survived? High precision relates to a low false positive rate. A
precision of 0.788 was obtained, which is pretty good.

Recall: The proportion of positive observed values correctly predicted. (The proportion of
actual defaulters that the model will correctly predict)

Recall = TP / (TP + FN)

Recall (sensitivity) is the ratio of correctly predicted positive observations to all the
observations in the actual class “yes”.

F1 Score is the weighted average of Precision and Recall. Therefore, this score takes both
false positives and false negatives into account. Intuitively it is not as easy to understand as
accuracy, but F1 is usually more useful than accuracy, especially if you have an uneven class
distribution. Accuracy works best if false positives and false negatives have similar cost. If
the cost of false positives and false negatives are very different, it’s better to look at both
Precision and Recall.

General Formula:

F- Measure = 2TP / (2TP + FP + FN)

F1-Score Formula:

F1 Score = 2*(Recall * Precision) / (Recall + Precision)
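
A minimal sketch computing these metrics with scikit-learn, again assuming binary labels in the placeholder variables y_test and y_pred:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))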

ALGORITHM AND TECHNIQUES

Algorithm Explanation

In machine learning and statistics, classification is a supervised learning approach in
which the computer program learns from the data input given to it and then uses this learning
to classify new observations. The data set may simply be bi-class (such as identifying whether
a person is male or female, or whether a mail is spam or not spam) or it may be multi-class.
Some examples of classification problems are speech recognition, handwriting recognition,
biometric identification and document classification. In supervised learning, algorithms learn
from labeled data. After understanding the data, the algorithm determines which label should
be given to new data by finding patterns and associating those patterns with the unlabeled
new data.

Used Python Packages:

sklearn:
 In python, sklearn is a machine learning package which include a lot of ML
algorithms.
 Here, some of its modules are used, such as train_test_split,
DecisionTreeClassifier, LogisticRegression and accuracy_score.
NumPy:
 It is a numeric python module which provides fast maths functions for
calculations.
 It is used to read data in numpy arrays and for manipulation purpose.
Pandas:
 Used to read and write different files.
 Data manipulation can be done easily with data frames.
Matplotlib:
 Data visualization is a useful way to help identify patterns in the given dataset.
 It is used to plot the charts, such as histograms, bar charts and scatter plots.
Logistic Regression 

It is a statistical method for analysing a data set in which there are one or more
independent variables that determine an outcome. The outcome is measured with a
dichotomous variable (in which there are only two possible outcomes). The goal of logistic
regression is to find the best fitting model to describe the relationship between the
dichotomous characteristic of interest (dependent variable = response or outcome variable)
and a set of independent (predictor or explanatory) variables. Logistic regression is a Machine
Learning classification algorithm that is used to predict the probability of a categorical
dependent variable. In logistic regression, the dependent variable is a binary variable that
contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.).

In other words, the logistic regression model predicts P(Y=1) as a function of X.


Logistic regression Assumptions:

 Binary logistic regression requires the dependent variable to be binary.

 For a binary regression, the factor level 1 of the dependent variable should represent
the desired outcome.

 Only the meaningful variables should be included.

 The independent variables should be independent of each other; that is, the model
should have little or no multicollinearity.

 The independent variables are linearly related to the log odds.

 Logistic regression requires quite large sample sizes.
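
A minimal logistic regression sketch with scikit-learn, reusing the train/test placeholders from the earlier sketches and assuming a binary target:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Fit the logistic regression model on the training data.
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

# Predicted classes and predicted probabilities P(Y=1) for the test data.
y_pred = logreg.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("P(Y=1) for the first samples:", logreg.predict_proba(X_test)[:5, 1])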

Decision Tree 

It is one of the most powerful and popular algorithms. The decision-tree algorithm falls under
the category of supervised learning algorithms. It works for both continuous and categorical
output variables. Assumptions of the decision tree:

 At the beginning, we consider the whole training set as the root.


 Attributes are assumed to be categorical for information gain and, for the Gini index,
attributes are assumed to be continuous.
 On the basis of attribute values records are distributed recursively.
 We use statistical methods for ordering attributes as root or internal node.
Decision tree builds classification or regression models in the form of a tree structure.
It breaks down a data set into smaller and smaller subsets while at the same time an associated
decision tree is incrementally developed. A decision node has two or more branches and a leaf
node represents a classification or decision. The topmost decision node in a tree which
corresponds to the best predictor called root node. Decision trees can handle both categorical
and numerical data. Decision tree builds classification or regression models in the form of a
tree structure. It utilizes an if-then rule set which is mutually exclusive and exhaustive for
classification. The rules are learned sequentially using the training data one at a time. Each
time a rule is learned, the tuples covered by the rules are removed.
This process is continued on the training set until meeting a termination condition. It is
constructed in a top-down recursive divide-and-conquer manner. All the attributes should be
categorical; otherwise, they should be discretized in advance. Attributes at the top of the tree
have more impact on the classification and they are identified using the information gain
concept. A decision tree can easily be over-fitted, generating too many branches, and may
reflect anomalies due to noise or outliers.
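
A minimal decision tree sketch with scikit-learn, reusing the earlier train/test placeholders; the max_depth value is an illustrative choice to limit the over-fitting noted above:

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Entropy uses the information gain concept; limiting depth reduces over-fitting.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=5, random_state=0)
tree.fit(X_train, y_train)

y_pred = tree.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))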

K-Nearest Neighbor (KNN/KNC)


K-Nearest Neighbor is a supervised machine learning algorithm which stores all
instances corresponding to the training data points in an n-dimensional space. When unknown
discrete data is received, it analyzes the closest k saved instances (the nearest neighbors) and
returns the most common class as the prediction; for real-valued data it returns the mean of
the k nearest neighbors. In the distance-weighted nearest neighbor algorithm, the contribution
of each of the k neighbors is weighted according to its distance from the query point, giving
greater weight to the closest neighbors.
Usually KNN is robust to noisy data since it averages over the k nearest neighbors. The
k-nearest-neighbors algorithm is a classification algorithm, and it is supervised: it takes a
bunch of labeled points and uses them to learn how to label other points. To label a new
point, it looks at the labeled points closest to that new point (its nearest neighbors) and has
those neighbors vote, so whichever label most of the neighbors have is the label for the new
point (the “k” is the number of neighbors it checks). It makes predictions about the validation
set using the entire training set: KNN predicts the class of a new instance by searching
through the entire training set to find the k “closest” instances, where “closeness” is
determined using a proximity measure (for example, Euclidean distance) across all features.
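
A minimal KNN sketch with scikit-learn, reusing the earlier placeholders; k = 5 and Euclidean distance are the library defaults:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# k = 5 neighbors with Euclidean distance (the defaults for KNeighborsClassifier).
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))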

Random Forest
Random forests or random decision forests are an ensemble learning method for
classification, regression and other tasks, that operate by constructing a multitude of decision
trees at training time and outputting the class that is the mode of the classes (classification) or
mean prediction (regression) of the individual trees. Random decision forests correct for
decision trees’ habit of over-fitting to their training set. Random forest is a type of supervised
machine learning algorithm based on ensemble learning. Ensemble learning is a type of
learning where different types of algorithms, or the same algorithm multiple times, are joined
to form a more powerful prediction model. The random forest algorithm combines multiple
algorithms of the same type, i.e. multiple decision trees, resulting in a forest of trees, hence
the name "Random Forest". The random forest algorithm can be used for both regression and
classification tasks.
The following are the basic steps involved in performing the random forest algorithm:
 Pick N random records from the dataset.
 Build a decision tree based on these N records.
 Choose the number of trees you want in your algorithm and repeat steps 1 and 2.
 In case of a regression problem, for a new record, each tree in the forest predicts a
value for Y (output). The final value can be calculated by taking the average of all the
values predicted by all the trees in forest. Or, in case of a classification problem, each
tree in the forest predicts the category to which the new record belongs. Finally, the
new record is assigned to the category that wins the majority vote.
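
A minimal random forest sketch with scikit-learn, reusing the earlier placeholders; 100 trees is an illustrative choice:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# 100 decision trees; each tree votes and the majority class wins.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

y_pred = forest.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))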

Support Vector Machines

A classifier that categorizes the data set by setting an optimal hyperplane between the
data. This classifier was chosen as it is incredibly versatile in the number of different kernel
functions that can be applied, and this model can yield a high predictability rate. Support
Vector Machines are perhaps one of the most popular and talked about machine learning
algorithms. They were extremely popular around the time they were developed in the 1990s
and continue to be a go-to method for a high-performing algorithm with little tuning.

 How to disentangle the many names used to refer to support vector machines.
 The representation used by SVM when the model is actually stored on disk.
 How a learned SVM model representation can be used to make predictions for new
data.
 How to learn an SVM model from training data.
 How to best prepare your data for the SVM algorithm.
 Where you might look to get more information on SVM.
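
A minimal support vector machine sketch with scikit-learn, reusing the earlier placeholders; the RBF kernel shown here is one common kernel choice:

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Support vector classifier with a radial basis function (RBF) kernel.
svm = SVC(kernel="rbf", C=1.0, gamma="scale")
svm.fit(X_train, y_train)

y_pred = svm.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))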
