Internship
CHAPTER 1
INTRODUCTION
1.1 COURSE OBJECTIVES
The objective of Machine Learning is to discover patterns in user data, including
intricate ones, and then make predictions based on these patterns to answer business
questions and solve business problems.
Learning is one thing, but taking those skills into the workforce and applying them is a
great way to explore different career paths and specializations that suit individual
interests.
2. Gain experience and increase marketability.
Having an internship gives you experience in the career field you want to pursue. Not
only does this give individuals an edge over other candidates when applying for jobs,
but it also prepares them for what to expect in their field and increases confidence in
their work.
3. Networking.
Having an internship benefits you in the working environment, and it also builds your
professional network. Internships provide a great environment to meet professionals in
the career field you want to pursue, as well as other interns who have similar interests.
4. Professionalism.
Internships can provide students with the soft skills needed in the workplace and in
leadership positions. Skills, such as communication, leadership, problem-solving, and
teamwork can all be learned through an internship and utilized beyond that experience.
5. Learn how a professional workplace operates.
Internships help students learn all about workplace culture, employee relations, and
leadership structure, which should help them onboard into their first professional job with
more ease than if they had no professional experience.
c. Excellent opportunity to see how the theoretical aspects learned in classes are
integrated into the practical world. On-floor experience provides much more
professional experience which is often worth more than classroom teaching.
d. Helps them decide if the industry and the profession is the best career option to
pursue.
k. Creating network and social circle and developing relationships with industry
people.
Livewire is a division of CADD Centre Training Services, headquartered in Chennai, India.
CADD Centre was formed as a training services company in 1988 and has since established
several brands across various domains, focusing on technical skill development of students
and professionals. Livewire was established in 2013 under CADD Centre Training Services
to bring all specializations in the Electronics and IT domains under one umbrella.
Livewire delivers NSDC-approved training and is also authorized by MSME to deliver
training and internships to students. As for the training delivery methodology, CADD
Centre and its brands are ISO 9001:2015/29990:2010 certified for the quality of training
delivery methods and standards.
CHAPTER 2
The process of learning begins with observations or data, such as examples, direct
experience, or instruction, in order to look for patterns in data and make better decisions in
the future based on the examples that we provide. The primary aim is to allow computers
to learn automatically, without human intervention or assistance, and adjust their actions
accordingly.
The types of machine learning algorithms differ in their approach, the type of data they
input and output, and the type of task or problem that they are intended to solve. Broadly,
Machine Learning can be divided into four categories.
1. Supervised Learning
Supervised Learning is a type of learning in which we are given a data set and already
know what the correct outputs should look like, with the idea that there is a
relationship between the input and the output. Basically, it is the task of learning a
function that maps an input to an output based on example input-output pairs. It infers a
function from labelled training data consisting of a set of training examples. Supervised
learning problems are categorized into regression and classification problems. In
supervised learning, an AI system is presented with labelled data, which means that each
data point is tagged with the correct label. The goal is to approximate the mapping
function so well that, when you have new input data (x), you can predict the output
variable (Y) for that data.
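As a minimal sketch of learning a mapping from labelled examples, the snippet below fits a
straight line to a handful of (x, Y) pairs by ordinary least squares and then predicts Y for
a new x. The data values and the function names fit_line and predict are illustrative
assumptions and are not part of the internship project.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the learned mapping to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Labelled training data (made-up values): inputs x with known outputs Y.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]   # generated by Y = 2x + 1

model = fit_line(train_x, train_y)
new_prediction = predict(model, 5.0)   # prediction for an unseen input
```

Because the toy data lie exactly on a line, the fitted function reproduces it; with real,
noisy data the fit would only approximate the underlying relationship.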
2. Unsupervised Learning
3. Semi-Supervised Learning
4. Reinforcement Learning
performing correctly and penalties for performing incorrectly. The agent learns without
intervention from a human by maximizing its reward and minimizing its penalty. It is a
type of dynamic programming that trains algorithms using a system of reward and
punishment.
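The reward-and-penalty idea can be illustrated with a very small sketch. The example below
assumes the simplest possible setting, a two-action task with fixed rewards, and uses an
epsilon-greedy rule to estimate the value of each action. The task, the reward values, and
the function name train_agent are hypothetical and chosen only for illustration.

```python
import random


def train_agent(rewards, steps=200, epsilon=0.1, seed=0):
    """Epsilon-greedy agent: estimate each action's value from observed rewards."""
    rng = random.Random(seed)
    values = [0.0] * len(rewards)   # running average reward per action
    counts = [0] * len(rewards)     # how often each action was tried
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(rewards))                        # explore
        else:
            action = max(range(len(rewards)), key=lambda a: values[a])  # exploit
        reward = rewards[action]
        counts[action] += 1
        # Incremental update of the average reward for the chosen action
        values[action] += (reward - values[action]) / counts[action]
    return values


# Hypothetical task: action 1 is "correct" (+1), action 0 is "incorrect" (-1).
learned_values = train_agent(rewards=[-1.0, 1.0])
best_action = max(range(len(learned_values)), key=lambda a: learned_values[a])
```

After training, the agent prefers the rewarded action without ever being told which action
is correct.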
2.1.2 OBJECTIVES
➢ To understand the method of data analysis, algorithms and mathematical models to train
the sample data in Machine Learning.
➢ To discover patterns in user data, including intricate ones, and then make predictions
based on these patterns to answer business questions and solve business problems.
➢ To come up with computer programs that have the capability to improve themselves
based on new data without requiring any explicit programming for the same.
➢ To build websites and software, automate tasks, and conduct data analysis and develop
prototypes.
2.1.3 APPLICATIONS
1. Virtual Personal Assistants: Names like Siri and Alexa bring to mind the capabilities of
virtual assistants. You can ask Siri to make a call or play music, or ask Alexa for today’s
weather forecast. You can even set an alarm or send an SMS. What makes this easy is that
you only need to speak, and the assistant will listen to your command, which is especially
helpful for differently abled users. Such assistants take note of how you interact with
them and use that to make your next experience better.
2. Online Customer Support: Websites such as education and shopping platforms will often
pop up a live chat to help you with your questions. A visitor with a head full of questions is
more likely to leave than to stay and possibly make a purchase. Some websites use a chatbot
instead to pull information from the website and try to address the customer’s queries.
3. Online Fraud Detection: PayPal, for example, uses machine learning to defend against
illegal acts like money laundering. By comparing millions of transactions, it can
distinguish legitimate transactions from illegitimate ones.
4. Product Recommendations: Shopping platforms like Amazon and Jabong notice what
products you look at and suggest similar products to you. If a suggestion leads you to a
product you like and you buy it, that is a win for the platform. For this, the platform also
uses your wish-list and cart contents.
5. Automatic Translation: Machine Learning lets us translate text into another language.
The ML algorithm figures out how words fit together and then uses this information to
improve the quality of a translation.
NumPy
NumPy, short for Numerical Python, is the foundational package for scientific computing
in Python. The majority of this book will be based on NumPy and libraries built on top
of NumPy. It provides, among other things
Pandas
Pandas provides rich data structures and functions designed to make working with structured
data fast, easy, and expressive. It is one of the critical ingredients enabling Python to be
a powerful and productive data analysis environment. The primary object in pandas that will
be used is the DataFrame, a two-dimensional, tabular, column-oriented data structure with
both row and column labels. pandas combines the high-performance array computing features of
NumPy with the flexible data manipulation capabilities of spreadsheets and relational
databases (such as SQL). It provides sophisticated indexing functionality to make it easy to
reshape, slice and dice, perform aggregations, and select subsets of data. pandas is the
primary tool used for data handling in this work. For financial users, pandas features rich,
high-performance time series functionality and tools well-suited for working with financial
data. The pandas name itself is derived from panel data, an econometrics term for
multidimensional structured datasets, and from Python data analysis itself.
Scikit-learn (sklearn)
Scikit-learn, imported in Python as sklearn, is a library that provides many unsupervised
and supervised learning algorithms. It is built on NumPy and SciPy and works well alongside
libraries you might already be familiar with, such as pandas and Matplotlib.
Matplotlib
Matplotlib is a Python library used to create 2D graphs and plots from Python scripts. It
has a module named pyplot which makes plotting easy by providing features to control line
styles, font properties, axis formatting, etc. It supports a very wide variety of graphs and
plots, such as histograms, bar charts, power spectra, and error charts. It is used along
with NumPy to provide an environment that is an effective open-source alternative to
MATLAB. It can also be used with graphics toolkits like PyQt and wxPython.
Anaconda
Anaconda distribution is a free and open-source platform for the Python programming language.
It can be easily installed on operating systems such as Windows, Linux, and macOS. It provides
more than 1500 Python and data science packages which are suitable for developing machine
learning and deep learning models. The Anaconda distribution installs Python together with
tools such as Jupyter Notebook, Spyder, and Anaconda Prompt. Hence it is a very convenient
packaged solution which you can easily download and install on your computer. It will
automatically install Python and some basic IDEs and libraries with it.
Jupyter Notebook
The Jupyter Notebook is the original web application for creating and sharing computational
documents. It offers a simple, streamlined, document-centric experience. The Jupyter Notebook is
an open-source web application that you can use to create and share documents that contain live
code, equations, visualizations, and text. Jupyter Notebook is maintained by the people at Project
Jupyter.
Jupyter Interface
Now you’re in the Jupyter Notebook interface, and you can see all the files in your current
directory. All Jupyter Notebooks are identifiable by the notebook icon next to their name. If you
already have a Jupyter Notebook in your current directory that you want to view, find it in your
files list and click it to open.
FLASK
A web framework is an architecture containing tools, libraries, and functionalities suitable for
building and maintaining large web projects quickly and efficiently. Web frameworks are designed
to streamline programs and promote code reuse. To create the server side of a web application,
you need to use a server-side language. Python is home to numerous such frameworks, the best
known of which are Django and Flask. Flask is a lightweight micro-framework for Python. It is
called a micro-framework because it aims to keep its core functionality small yet extensible, so
that it can cover an array of small and large applications. Flask depends on two external
libraries: the Jinja2 template engine and the Werkzeug WSGI toolkit. Even though we have a
plethora of web frameworks at our disposal, Flask tends to be better suited due to -
Dept. of CSE, SVCE 2021-2022 Page | 13
MACHINE LEARNING WITH FLASK
CHAPTER 3
WORK CARRIED OUT
3.1 PROBLEMS/CHALLENGES
The work carried out during the internship is summarized in the table below.
3.2 METHODOLOGY
The credit risk classification dataset is collected from the Kaggle online repository. This
dataset contains customer transaction and demographic data. It labels customers as Risky or
Not Risky for specific banking products.
• Features of dataset
payment_data.csv: customer’s card payment history.
id: customer id
OVD_t1: number of times overdue type 1
OVD_t2: number of times overdue type 2
OVD_t3: number of times overdue type 3
OVD_sum: total overdue days
pay_normal: number of times normal payment
prod_code: credit product code
prod_limit: credit limit of product
update_date: account update date
new_balance: current balance of product
highest_balance: highest balance in history
report_date: date of recent payment
customer_data.csv: customer’s demographic data and category attributes which have been encoded.
Category features are fea_1, fea_3, fea_5, fea_6, fea_7, fea_9.
label = 1: the customer is in high credit risk
label = 0: the customer is in low credit risk
The data collection process involves the selection of quality data for analysis. Here we used
the credit risk classification dataset taken from Kaggle. We explored different ways and
sources for collecting relevant and comprehensive data, interpreting it, and analyzing the
results with the help of statistical techniques.
Transformation: This involves converting data from one format to another to make it
easier to work with, using techniques such as normalization, smoothing,
generalization, and aggregation.
Integration: The data to be processed may not come from a single source; it can come
from several. If these sources are not integrated, problems can arise during
processing, so integration is an important phase of data pre-processing, and several
issues have to be considered when combining the sources.
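The normalization step mentioned under Transformation can be sketched as follows. This is a
minimal example assuming min-max scaling to the range [0, 1]; the function name
min_max_normalize and the balance values are made up for illustration. Scaling matters for
distance-based methods such as KNN, because a feature with large raw values would otherwise
dominate the distance.

```python
def min_max_normalize(values):
    """Rescale a list of numbers linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up balances on a much larger scale than count-like features.
balances = [1000.0, 2500.0, 4000.0]
normalized = min_max_normalize(balances)
```

The smallest value maps to 0, the largest to 1, and everything else falls proportionally in
between.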
A dataset used for machine learning should be partitioned into three subsets: training,
test, and validation sets.
Training set: A data scientist uses a training set to train a model and to define the
optimal parameters the model has to learn from the data.
Test set: A test set is needed to evaluate the trained model and its capability for
generalization, that is, the model’s ability to identify patterns in new, unseen data
after having been trained on the training data. It is crucial to use different subsets for
training and testing so that overfitting, the incapacity for generalization mentioned
above, can be detected.
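A simple way to picture the partitioning is a reproducible shuffle followed by a cut. The
sketch below assumes a plain (non-stratified) split written with the standard library only;
the function name split_train_test is hypothetical, while the test fraction of 0.4 and the
seed 1234 mirror the values used in the source code listing of Chapter 4.

```python
import random


def split_train_test(rows, test_size=0.4, seed=1234):
    """Shuffle rows reproducibly and split them into train and test lists."""
    rng = random.Random(seed)
    shuffled = rows[:]              # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))              # stand-in for 10 customer records
train_rows, test_rows = split_train_test(rows)
```

Every record ends up in exactly one of the two subsets, so the model is never evaluated on
data it was trained on.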
After a data scientist has preprocessed the collected data and split it into train and test
sets, they can proceed with model training. This process entails “feeding” the algorithm
with training data. The algorithm processes the data and outputs a model that is able to
find a target value (attribute) in new data, that is, the answer you want to get with
predictive analysis. The purpose of model training is to develop a model.
The K Nearest Neighbors (KNN) algorithm measures the distance between a query
scenario and a set of scenarios in the data set. We can compute the distance between
two scenarios using some distance function d(x,y), where x,y are scenarios composed
of N features, such that x={x1,…,xN}, y={y1,…,yN} .
The model for KNN is the entire training dataset. When a prediction is required for an
unseen data instance, the KNN algorithm searches through the training dataset for
the k most similar instances. The prediction attribute of these most similar instances
is summarized and returned as the prediction for the unseen instance.
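The procedure described above can be sketched in a few lines of plain Python. The example
assumes Euclidean distance as d(x, y) and a majority vote as the way of summarizing the
neighbours' labels; the function names and the toy data are illustrative and do not come
from the project code, which uses KNeighborsClassifier from scikit-learn.

```python
import math
from collections import Counter


def euclidean(x, y):
    """Distance d(x, y) between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def knn_predict(train_x, train_y, query, k=3):
    """Return the majority label among the k training rows nearest to query."""
    order = sorted(range(len(train_x)), key=lambda i: euclidean(train_x[i], query))
    nearest_labels = [train_y[i] for i in order[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy data (made up): two features per customer, label 1 = high risk, 0 = low risk.
train_x = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
train_y = [0, 0, 0, 1, 1, 1]

prediction_near_low = knn_predict(train_x, train_y, [1.1, 1.0])
prediction_near_high = knn_predict(train_x, train_y, [5.1, 5.0])
```

Note that there is no separate training step: the "model" is simply the stored training
data, and all the work happens at prediction time.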
The similarity measure depends on the type of data. For real-valued data, the
Euclidean distance can be used. For other types of data, such as categorical or binary
data, Hamming distance can be used.
The goal of this step is to develop the simplest model which is able to formulate a target
value fast and well enough. A data scientist can achieve this goal through model tuning.
That’s the optimization of model parameters to achieve an algorithm’s best
performance.
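For KNN, the most obvious parameter to tune is k. The sketch below assumes a simple
strategy: try a few candidate values of k and keep the one with the highest accuracy on a
held-out validation set. The function names, the candidate values, and the one-feature toy
data are assumptions made for illustration only.

```python
from collections import Counter


def knn_predict(train, query, k):
    """train is a list of (features, label) pairs; vote among the k nearest."""
    by_distance = sorted(
        train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], query))
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


def validation_accuracy(train, validation, k):
    """Share of validation rows whose label is predicted correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in validation)
    return hits / len(validation)


def choose_k(train, validation, candidates):
    """Return the candidate k with the highest validation accuracy."""
    return max(candidates, key=lambda k: validation_accuracy(train, validation, k))


# Made-up one-feature data; the training point at 2.6 carries a noisy label.
train = [([1.0], 0), ([1.5], 0), ([2.0], 0), ([2.6], 1),
         ([6.0], 1), ([6.5], 1), ([7.0], 1)]
validation = [([2.4], 0), ([2.8], 0), ([6.2], 1)]

best_k = choose_k(train, validation, candidates=[1, 3, 5])
```

With k = 1 the noisy point misleads nearby predictions, while a larger k lets its
neighbours outvote it; when several values tie, the smallest candidate listed first is kept.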
3.3.8 Accuracy
Classification accuracy is what we usually mean when we use the term accuracy. It is
the ratio of the number of correct predictions to the total number of input samples. In
this project, the prediction was obtained by applying a KNN algorithm to the customers’
attribute values. The accuracy obtained is 80%; the features can be tuned for higher
accuracy.
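The ratio described above can be written directly as a small function. The labels below are
made up so that 8 of 10 predictions are correct; they are not the project's actual
predictions, and the function name accuracy is chosen for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels: 8 of the 10 predictions agree with the true label.
y_true = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 1, 1, 0]
score = accuracy(y_true, y_pred)
```

Accuracy alone can be misleading when one class is much more common than the other, which
is why the source code also imports a classification report and a confusion matrix.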
CHAPTER 4
RESULTS AND DISCUSSIONS
4.1 IMPLEMENTATION
Source Code
➢ Importing libraries
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
➢ Data pre-processing
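# Load the two source files described in Section 3.2 (paths assumed to be relative)
customer_df = pd.read_csv('customer_data.csv')
payment_df = pd.read_csv('payment_data.csv')
# Inspect column names and the number of distinct customers in each file
customer_df.columns
payment_df.columns
payment_df['id'].nunique()
customer_df['id'].nunique()
# Fill missing values: column mean for fea_2, zero for highest_balance
customer_df['fea_2'] = customer_df['fea_2'].fillna(customer_df['fea_2'].mean())
payment_df['highest_balance'] = payment_df['highest_balance'].fillna(0)
# Join demographic and payment records on the customer id
final_df = pd.merge(customer_df, payment_df, how='inner', on='id')
final_df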
➢ Data visualization
# k_df is the working DataFrame used for modelling; its construction is not shown in this report
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# Number of records per report date, split by risk label
sns.countplot(x='report_date', hue='label', data=k_df)
plt.figure(figsize=(14, 7))
# Left panel: pie chart of the label distribution
plt.subplot(121)
k_df["label"].value_counts().plot.pie(autopct="%1.0f%%",
    colors=sns.color_palette("prism", 7), startangle=60, labels=["0", "1"],
    wedgeprops={"linewidth": 2, "edgecolor": "k"}, explode=[.1, 0], shadow=True)
plt.title("Distribution of Target variable")
# Right panel: horizontal bar chart annotated with the count of each label
plt.subplot(122)
ax = k_df["label"].value_counts().plot(kind="barh")
for i, j in enumerate(k_df["label"].value_counts().values):
    ax.text(.7, i, j, weight="bold", fontsize=20)
plt.title("Count of Target variable")
plt.show()
➢ Splitting of dataset
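# Feature columns used for prediction
x = k_df[['fea_2', 'fea_4', 'fea_8', 'fea_10',
          'fea_11', 'OVD_t1', 'OVD_t2',
          'OVD_sum', 'prod_code', 'pay_normal', 'prod_limit',
          'new_balance', 'highest_balance']]
x.shape
# Target column; y is not defined in the listing and is assumed here to be the risk label
y = k_df['label']
# 60/40 stratified split with a fixed seed for reproducibility
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, stratify=y,
                                                    random_state=1234)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)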
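➢ Flask application
import pickle
from flask import Flask, render_template, request

app1 = Flask(__name__)
# `file` is an open binary handle to the pickled model; its path is not shown in this report
rf = pickle.load(file)

@app1.route("/", methods=['GET'])
def home():
    return render_template("index1.html")

@app1.route('/predict', methods=['POST'])
def predict():
    # Read the 13 model features from the submitted form
    a = float(request.form['fea_2'])
    b = int(request.form['fea_4'])
    c = int(request.form['fea_8'])
    d = int(request.form['fea_10'])
    e = float(request.form['fea_11'])
    f = int(request.form['OVD_t1'])
    g = int(request.form['OVD_t2'])
    h = int(request.form['OVD_sum'])
    i = int(request.form['prod_code'])
    j = int(request.form['pay_normal'])
    # Form field name assumed to match the prod_limit feature
    k = float(request.form['prod_limit'])
    l = float(request.form['new_balance'])
    m = float(request.form['highest_balance'])
    # Features are passed in the same order as the training columns
    y_pred = rf.predict([[a, b, c, d, e, f, g, h, i, j, k, l, m]])
    print(y_pred)
    # label 1 = high credit risk, label 0 = low credit risk (see Section 3.2)
    if y_pred[0] == 1:
        return render_template("index1.html",
                               prediction_value="The customer is in high credit risk")
    else:
        return render_template("index1.html",
                               prediction_value="The customer is in low credit risk")

if __name__ == "__main__":
    app1.run(debug=True)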
CONCLUSION
These five weeks of internship at LIVEWIRE improved our overall understanding of, and
gave us insight into, Python and Machine Learning. The work carried out during the
internship focused on writing Python code for various logic and problem statements. We
learned Machine Learning using Python and developed a mini project on credit risk
prediction using Machine Learning techniques. This internship at Livewire has helped
in overall personality development through interaction with many members. It has helped
with integrating conceptual knowledge with real-life applications. It provided working
experience with real-life professionals, which will certainly help us in our careers ahead.