Employee Satisfaction and
Retention Analysis
Business Analytics
Division E
Group 3
Name Roll no.

Muskan Bhatia E003
Arjun Chawla E013
Muskan Singhal E023
Dahir Sharma E033
Devashish Sharma E043
Mohit Kumar E063
Contents
Business Problem
Dataset Used
Tools Used
Data Cleaning
Analysis
Department-wise satisfaction scores
Effect of Performance Rating on Employee Retention
Correlation
Clustering Employees into groups
Building Predictive Models
Logistic Regression
Recommendations
References
Business Problem
We set out to determine the factors that affect employee satisfaction in an IT organization
and the factors that influence the likelihood of an employee leaving it. As part of this, we
also estimate the organization's attrition rate and examine how it varies with the parameters
that describe an employee's behaviour and tendencies.
Dataset Used
We took a dataset of almost 15,000 records containing details of employees working in an IT
company. Each record includes parameters such as salary, working hours, satisfaction level,
accidents at the workplace, experience, promotion, role and last evaluation. A sample of the
dataset is provided below:
The first step in working with a dataset of this size is cleaning and processing. We removed
noise and null values, then identified and removed extreme outliers that could not reasonably
occur in the data. The remaining anomalies were normalized in line with the rest of the data.
Once the data was cleaned and prepared, we moved on to the analysis.
Tools Used
We used Python and Excel as the main tools to process and visualize the data and to generate
the results and inferences. In Python we used several libraries, including:
• scikit-learn (sklearn)
• NumPy
• pandas
• Matplotlib
We used Python's statistical tools to analyse each parameter separately. This was followed by
an in-depth analysis of the data using predictive modelling techniques to determine the
strongest relationships.
Data Cleaning
The first step in analytics is to gather the data and clean it thoroughly. Our main aims while
cleaning the data were to:
• Remove duplicate data
• Remove corrupt data
• Remove noise
• Fix incorrect data
• Extrapolate incomplete data
• Correct the formatting of the dataset
The first part of the data cleaning was done manually in Excel using its built-in functions.
We then ran the dataset through Python code to identify misleading or incorrect data points,
which we either removed or corrected. We performed null checks, infinity checks, data type
checks and similar checks, and rectified all anomalies.
Next, the data had to be transformed into a format suitable for analysis. Some columns in the
dataset contained string data. To use Excel's built-in functions, or to feed the data to a
machine learning algorithm such as logistic regression, the string data first had to be
converted to numerical data. Hence, we encoded the string values as numbers.
Some examples are as follows:
salary_groups = {'low': 0, 'medium': 1, 'high': 2}
department_groups = {'sales': 1,
'marketing': 2,
'product_mng': 3,
'technical': 4,
'IT': 5,
'RandD': 6,
'accounting': 7,
'hr': 8,
'support': 9,
'management': 10
}
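The null, infinity and data type checks described above can be sketched roughly as follows. This is a minimal illustration on a small made-up frame, not the cleaning script used for the report: the `sample` frame and its values are hypothetical, while the column names and the `salary_groups` mapping are taken from the report.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; the real data comes from the HR dataset.
sample = pd.DataFrame({
    'satisfaction_level': [0.38, np.nan, 0.72, np.inf],
    'salary': ['low', 'medium', 'high', 'low'],
})

# Null check: count missing values per column.
null_counts = sample.isnull().sum()

# Infinity check: turn +/-inf into NaN so one rule handles both cases.
sample = sample.replace([np.inf, -np.inf], np.nan)

# Drop the rows that still contain missing values.
clean = sample.dropna().copy()

# Encode the string column as numbers, then inspect the data types.
salary_groups = {'low': 0, 'medium': 1, 'high': 2}
clean['salary'] = clean['salary'].map(salary_groups)
print(null_counts)
print(clean.dtypes)
print(clean)
```

Dropping rows is only one option; imputing a value instead would be an equally reasonable choice depending on how much data is affected.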
Analysis
The dataset provided us with information about the following parameters to analyse the
employee satisfaction and attrition:
• Employee satisfaction level
• Last evaluation
• Number of projects
• Average monthly hours
• Employee’s experience at the company
• Whether they have had a work accident
• Whether they have had a promotion in the last 5 years
• Role
• Salary
• Whether the employee has left
First, we loaded the data and looked at its basic details.
import pandas

hr_data = pandas.read_csv('HR_Data.csv')
print('Number of people employed in the company over the period of time: ' + str(len(hr_data)))
employees_left = hr_data.loc[hr_data['left'] == 1]
print('Number of people left over this period: ' + str(len(employees_left)))
departments = hr_data['role']
print(set(departments))
Output:
Number of people employed in the company over the period of time: 14999
Number of people left over this period: 3571
{'IT', 'hr', 'marketing', 'sales', 'product_mng', 'accounting', 'technical', 'support', 'management', 'RandD'}
Over the period covered by the data, 3,571 of the 14,999 employees (23.8%) left their jobs.
Hence, the attrition rate for the organization over this period stands at 23.8% across its
departments.
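The attrition figure follows directly from the two counts printed in the output above. A short check, using only those two numbers:

```python
total_employees = 14999   # rows in the dataset (from the output above)
employees_left = 3571     # rows with left == 1 (from the output above)

attrition_rate = employees_left / total_employees
print(f"Attrition rate: {attrition_rate:.1%}")
```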
Department-wise satisfaction scores:

Next, we carried out a department-wise analysis to check whether employees in any particular
department are less satisfied. For this, we computed the mean and median satisfaction scores
for each department. In the merged table produced by the code below, satisfaction_level_x is
the mean and satisfaction_level_y is the median.
import matplotlib.pyplot as plt

mean_satisfaction = hr_data.groupby('role', as_index=False)['satisfaction_level'].mean()
print("mean\n", mean_satisfaction)
median_satisfaction = hr_data.groupby('role', as_index=False)['satisfaction_level'].median()
print("median\n", median_satisfaction)
mean_medium = pandas.merge(mean_satisfaction, median_satisfaction, how='outer', on=['role'])
print("mean medium\n", mean_medium)
The following graph represents the above scores relative to each other:
[Figure: Satisfaction with each department. Chart of satisfaction_level_x (mean) and
satisfaction_level_y (median) per department; vertical axis from 0.54 to 0.68.]
From the above graph, it is evident that the highest satisfaction scores are in the IT and
Management roles, whereas the Accounting and HR roles show low satisfaction scores. This could
be related to the type of work involved: since this is an IT organization, it is to be expected
that IT jobs would have the highest satisfaction. The less satisfied roles should be examined
in more detail to improve employee happiness.
Effect of Performance Rating on Employee Retention:

Next, we look at the relationship between an employee's performance rating and their leaving
the organization. This is an important parameter because most of an employee's future
prospects, including salary increments, depend on their ratings. Lower ratings may also reduce
the employee's job security.
def label_race(row):
    # Map the last evaluation score (0-1) to a performance band.
    # Only the bands shown in the original listing are reproduced here.
    score = row['last_evaluation']
    if 0.05 < score <= 0.15: return 1
    if 0.15 < score <= 0.25: return 2
    if 0.95 < score <= 1: return 9
    return 0

hr_data['eval'] = hr_data.apply(label_race, axis=1)
# Count employees per performance band and per outcome (0 = stayed, 1 = left)
good_better_best = pandas.DataFrame({'count': hr_data.groupby(['eval', 'left']).size()}).reset_index()
count_stay = good_better_best[good_better_best.left == 0].copy()
count_leave = good_better_best[good_better_best.left == 1].copy()
count_stay['Key'] = 'stay'
count_leave['Key'] = 'leave'
stay_leave = pandas.concat([count_stay, count_leave], keys=['stay', 'leave'])
print(stay_leave)
stay_leave_group = stay_leave.groupby(['eval', 'Key'])
stay_leave_group[['count']].sum().unstack('Key').plot(kind='bar')
plt.show()
Here, we have encoded the last evaluation score, which ranges from 0 to 1, on a scale of 0 to 9
using 0.1 intervals.
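The banding idea can also be expressed compactly with `pandas.cut`. The sketch below is an assumption-laden illustration rather than the encoding used in the report: it uses ten equal-width bands with edges at 0.0, 0.1, ..., 1.0, whereas the listing above uses band edges offset by 0.05 and shows only some of the bands. The `scores` values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical evaluation scores in the range 0-1.
scores = pd.Series([0.0, 0.05, 0.37, 0.55, 0.99, 1.0])

# Ten equal-width bands: (0.0, 0.1] -> 0, (0.1, 0.2] -> 1, ..., (0.9, 1.0] -> 9.
edges = np.linspace(0, 1, 11)
bands = pd.cut(scores, bins=edges, labels=list(range(10)),
               include_lowest=True).astype(int)
print(bands.tolist())
```

With explicit bin edges, every score receives a band and no range is silently mapped to 0, which makes the resulting bar chart easier to interpret.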
The resulting graph yields some interesting results. At the lowest performance bands,
employees are much more likely to quit their jobs, and this rate decreases as the performance
band increases. Interestingly, as the band approaches the maximum levels, the number of people
quitting rises again. The most stable employees are found in the central performance bands.
One plausible explanation is that employees in the lower bands switch to other organizations
for better prospects, reputation and job security, whereas employees in the higher bands are
highly skilled and move on for better pay or promotions.
Correlation
The next part of the analysis is to determine the correlation between the parameters provided,
to understand how the variables depend on one another. This tells us which variables are
crucial and which ones can be ignored.
Our results showed a strong correlation between the performance band and employee
satisfaction, which in turn influences whether employees leave or stay.
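A correlation matrix of this kind can be computed with `DataFrame.corr()`. The sketch below runs on synthetic data, so the numbers it prints say nothing about the HR dataset: the `demo` frame is hypothetical, and leaving is deliberately made more likely at low satisfaction purely to give the matrix some structure.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
satisfaction = rng.uniform(0, 1, n)

# Synthetic stand-in for the HR data (assumption, for illustration only).
demo = pd.DataFrame({
    'satisfaction_level': satisfaction,
    'last_evaluation': rng.uniform(0, 1, n),
    'left': (rng.uniform(0, 1, n) > satisfaction).astype(int),
})

# Pearson correlation between every pair of numeric columns.
corr = demo.corr()
print(corr.round(2))

# Rank the other variables by their correlation with leaving.
print(corr['left'].drop('left').sort_values())
```

On the real data the same two calls would be applied to `hr_data` after the string columns have been encoded as numbers.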
Clustering Employees into groups

Having identified the determining factors that affect the employees, we decided to divide the
employees into 3 clusters. These were:
1. High Satisfaction and High Evaluation: Performers
2. Average Satisfaction and Average Evaluation: Satisfied
3. Low Satisfaction and High Evaluation: Frustrated
For this, we used the KMeans clustering technique to group the employees, and gauged the ratio
of performing to satisfied to frustrated employees from the density of the clusters.
from sklearn.cluster import KMeans

# Keep only the employees who left and drop the columns not used for clustering
kmeans_df = hr_data[hr_data.left == 1].drop(['number_project', 'average_montly_hours',
    'exp_in_company', 'Work_accident', 'left', 'promotion_last_5years', 'role', 'salary'], axis=1)
kmeans = KMeans(n_clusters=3, random_state=0).fit(kmeans_df)
print(kmeans.cluster_centers_)
people_left = hr_data[hr_data.left == 1].copy()
people_left['label'] = kmeans.labels_
# Plot satisfaction against last evaluation, one marker and colour per cluster
plt.plot(people_left.satisfaction_level[people_left.label==0],
    people_left.last_evaluation[people_left.label==0], 'o', alpha=0.2, color='r')
plt.plot(people_left.satisfaction_level[people_left.label==1],
    people_left.last_evaluation[people_left.label==1], 'x', alpha=0.2, color='g')
plt.plot(people_left.satisfaction_level[people_left.label==2],
    people_left.last_evaluation[people_left.label==2], '*', alpha=0.2, color='b')
plt.legend(['Performers', 'Frustrated', 'Satisfied'], loc=3, fontsize=15, frameon=True)
plt.show()
From the graph, we can see that there is a large number of performers (in red), whereas the
satisfied employees are relatively few (in green). The number of frustrated employees (in
blue) is also quite high. This indicates that the administration should act quickly to address
the frustrated employees and convert them into satisfied ones.
Building Predictive Models

After analysing the variables separately with different techniques, we now build a predictive
model that uses these insights to quantify the risk of an employee leaving the organization.
For this purpose, we tried several classifiers. To evaluate each classifier, we used the AUC
(area under the ROC curve) score, which summarizes the trade-off between the true positive
rate and the false positive rate. The higher the AUC score, the better the classifier
separates employees who leave from those who stay using the given parameters. The algorithms
we tried were:
• RandomForestClassifier
• AdaBoostClassifier
• ExtraTreesClassifier
• KNeighborsClassifier
• DecisionTreeClassifier
• ExtraTreeClassifier
• LogisticRegression
• GaussianNB
• BernoulliNB
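As a small aside on the metric itself, the AUC can be read as the fraction of (leaver, stayer) pairs in which the leaver receives the higher predicted score. The toy labels and scores below are made up for illustration and are not taken from the HR data.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical labels (1 = left) and predicted scores for four employees.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Of the four (leaver, stayer) pairs, three are ranked correctly,
# so the AUC is 3/4.
auc = roc_auc_score(y_true, y_score)
print(auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why it is a convenient single number for comparing the classifiers listed above.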
import numpy
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB, BernoulliNB

classifiers = [('RandomForestClassifierG', RandomForestClassifier(n_jobs=-1, criterion='gini')),
               ('AdaBoostClassifier', AdaBoostClassifier()),
               ('ExtraTreesClassifier', ExtraTreesClassifier(n_jobs=-1)),
               ('KNeighborsClassifier', KNeighborsClassifier(n_jobs=-1)),
               ('DecisionTreeClassifier', DecisionTreeClassifier()),
               ('ExtraTreeClassifier', ExtraTreeClassifier()),
               ('LogisticRegression', LogisticRegression()),
               ('GaussianNB', GaussianNB()),
               ('BernoulliNB', BernoulliNB())]
allscores = []

# Encode the string columns as numbers before fitting
salary_groups = {'low': 0, 'medium': 1, 'high': 2}
department_groups = {'sales': 1, 'marketing': 2, 'product_mng': 3,
                     'technical': 4, 'IT': 5, 'RandD': 6, 'accounting': 7, 'hr': 8,
                     'support': 9, 'management': 10}
hr_data.salary = hr_data.salary.map(salary_groups)
hr_data['department'] = hr_data.role.map(department_groups)
hr_data = hr_data.drop('role', axis=1)
x, Y = hr_data.drop('left', axis=1), hr_data['left']

for name, classifier in classifiers:
    scores = []
    for i in range(3):  # three runs
        # 20-fold cross-validated AUC for this classifier
        roc = cross_val_score(classifier, x, Y, scoring='roc_auc', cv=20)
        scores.extend(list(roc))
    scores = numpy.array(scores)
    print(name, scores.mean())
    new_data = [(name, score) for score in scores]
    allscores.extend(new_data)
Output:
RandomForestClassifierG 0.9943171341832512
AdaBoostClassifier 0.9818277892812185
ExtraTreesClassifier 0.993756003995646
KNeighborsClassifier 0.9729448123168295
DecisionTreeClassifier 0.9784380690668585
ExtraTreeClassifier 0.9686484011809469
LogisticRegression 0.7922131258561164
GaussianNB 0.8541697565717896
BernoulliNB 0.6969364866732521
From the above result, we can see that the RandomForestClassifier from sklearn.ensemble
achieves the highest mean AUC (about 0.994), closely followed by the ExtraTreesClassifier. We
therefore trained a model with it and saved it using pickle so that it can be used to predict
employee exit probabilities.
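Saving and reusing a trained model with pickle might look roughly like the sketch below. It is an illustration under stated assumptions: the training data is synthetic, the two feature columns and the `new_employee` values are hypothetical, and the model is serialized to an in-memory byte string rather than to the file the report's authors used.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: two features, label 1 when the first is low.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(200, 2))
y_demo = (X_demo[:, 0] < 0.4).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_demo, y_demo)

# Serialize the fitted model and restore it again.
blob = pickle.dumps(model)
restored = pickle.loads(blob)

# Probability of class 1 ("left") for one hypothetical employee.
new_employee = np.array([[0.2, 0.9]])
exit_probability = restored.predict_proba(new_employee)[0, 1]
print(exit_probability)
```

To persist the model on disk instead, `pickle.dump(model, f)` and `pickle.load(f)` with a file opened in binary mode do the same job. Pickle files should only be loaded from trusted sources.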
Logistic Regression
In addition to the above analysis, we also ran logistic regression on our data to build a
predictive model. Logistic regression is suitable when the outcome is categorical (here,
whether the employee left) and the data contains a few parameters that are binary or ordinal.
Here we use the confusion matrix to assess accuracy, unlike linear/multiple regression, where
measures such as the R-squared value or the mean absolute percentage error (MAPE) are used. In
the given dataset, parameters like "salary" and "work accident" are categorical. Hence,
logistic regression was found to be suitable for the model.
After the data cleaning, we divided the data into two parts: 75% for training and 25% for
testing. We combined logistic regression with backward elimination, which removes the
parameters with a high p-value (p > 0.05) so that only the significant parameters remain in
the model. We ran the logistic regression iteratively, each time calculating and analysing the
coefficients, the intercept and the confusion matrix. We then removed the variable that least
affected the regression, i.e., the one with the highest p-value above 0.05 (a 5% significance
level). This process continued until all the remaining variables had p-values below 0.05. We
then went on to develop the regression equation.
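The backward elimination loop can be sketched as follows. The report does not say which tool produced the p-values, so this sketch makes its own assumptions: the model is fitted with a hand-written Newton-Raphson routine in NumPy, p-values come from the Wald z-test, and the data, feature names (`x1`, `x2`, `noise`) and helper functions (`fit_logit`, `backward_elimination`) are all hypothetical.

```python
import math

import numpy as np

def fit_logit(X, y, n_iter=25):
    """Fit logistic regression by Newton-Raphson; return coefficients and two-sided p-values."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        hessian = X.T @ (X * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hessian, X.T @ (y - p))
    # Standard errors from the inverse Hessian at the fitted coefficients.
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    hessian = X.T @ (X * (p * (1.0 - p))[:, None])
    std_err = np.sqrt(np.diag(np.linalg.inv(hessian)))
    z = beta / std_err
    p_values = np.array([math.erfc(abs(v) / math.sqrt(2)) for v in z])
    return beta, p_values

def backward_elimination(X, y, names, threshold=0.05):
    """Drop the least significant feature until every p-value is <= threshold."""
    keep = list(range(X.shape[1]))
    while True:
        beta, p_values = fit_logit(X[:, keep], y)
        if len(keep) == 1:
            break
        # Column 0 is the intercept and is never dropped.
        worst = 1 + int(np.argmax(p_values[1:]))
        if p_values[worst] <= threshold:
            break
        keep.pop(worst)
    return [names[i] for i in keep], beta

# Synthetic data: x1 and x2 drive the outcome, noise does not.
rng = np.random.default_rng(0)
n = 2000
x1, x2, noise = rng.normal(size=(3, n))
true_z = -0.5 + 2.0 * x1 - 1.5 * x2
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-true_z))).astype(float)
X = np.column_stack([np.ones(n), x1, x2, noise])

kept, coefficients = backward_elimination(X, y, ['intercept', 'x1', 'x2', 'noise'])
print(kept, coefficients.round(2))
```

Removing one variable per iteration and refitting matters because p-values change once a correlated variable leaves the model.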
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Hold out 25% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(x, Y, test_size=0.25, random_state=0)

# Fitting Logistic Regression to the Training set
logmodel = LogisticRegression()
logmodel.fit(X_train, y_train)
predictions = logmodel.predict(X_test)

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_test, predictions))
Confusion Matrix:
            0 (NO)    1 (YES)
0 (NO)      2656      225
1 (YES)     554       315
Accuracy = (2656+315)/ (2656+315+225+554) = 79.2%
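Accuracy is only one of the measures that can be read off the confusion matrix. The sketch below also derives precision and recall for the "left" class from the same four counts. It assumes scikit-learn's convention for `confusion_matrix(y_test, predictions)`, in which rows are actual classes and columns are predicted classes.

```python
# Counts from the confusion matrix above
# (assumed layout: rows = actual, columns = predicted).
tn, fp = 2656, 225   # actual NO:  predicted NO, predicted YES
fn, tp = 554, 315    # actual YES: predicted NO, predicted YES

accuracy = (tn + tp) / (tn + fp + fn + tp)
precision = tp / (tp + fp)   # share of predicted leavers who actually left
recall = tp / (tp + fn)      # share of actual leavers the model caught

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

Under that assumption the model finds roughly 36% of the employees who actually left, so accuracy on its own gives an optimistic picture when the goal is to flag likely leavers.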
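Probability (of leaving the job) equation from the logistic regression:
\[
P = \frac{e^{z}}{1 + e^{z}}
\]
where
\[
\begin{aligned}
z = {} & -0.639 - 4.15\,(\text{satisfaction level}) - 0.727\,(\text{last evaluation}) - 0.33\,(\text{number of projects}) \\
& - 0.0046\,(\text{avg monthly hours}) + 0.27\,(\text{exp in company}) - 1.44\,(\text{work accident}) \\
& - 1.046\,(\text{promotion in last 5 years}) - 0.655\,(\text{R\&D}) + 0.17\,(\text{hr}) - 0.45\,(\text{management}) \\
& + 1.8\,(\text{low salary}) + 1.26\,(\text{medium salary})
\end{aligned}
\]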
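To show how such an equation is used, the sketch below plugs the coefficients reported above into the standard logistic form P = exp(z) / (1 + exp(z)). The `exit_probability` helper and the example employee are hypothetical, and the department and salary terms are treated as 0/1 indicator variables, which is an assumption about how the authors encoded them.

```python
import math

# Intercept and coefficients as reported in the equation above.
intercept = -0.639
coefficients = {
    'satisfaction_level': -4.15,
    'last_evaluation': -0.727,
    'number_project': -0.33,
    'average_monthly_hours': -0.0046,
    'exp_in_company': 0.27,
    'work_accident': -1.44,
    'promotion_last_5years': -1.046,
    'dept_RandD': -0.655,
    'dept_hr': 0.17,
    'dept_management': -0.45,
    'salary_low': 1.8,
    'salary_medium': 1.26,
}

def exit_probability(employee):
    """Return P(leaving) for a dict of feature values; missing features count as 0."""
    z = intercept + sum(coef * employee.get(name, 0.0)
                        for name, coef in coefficients.items())
    return 1.0 / (1.0 + math.exp(-z))  # same as exp(z) / (1 + exp(z))

# Hypothetical employee: low salary, no accident, no promotion, other department.
employee = {
    'satisfaction_level': 0.4,
    'last_evaluation': 0.5,
    'number_project': 3,
    'average_monthly_hours': 200,
    'exp_in_company': 3,
    'salary_low': 1,
}
p_leave = exit_probability(employee)
print(round(p_leave, 3))

# Raising satisfaction lowers the predicted probability of leaving.
p_happier = exit_probability({**employee, 'satisfaction_level': 0.9})
print(round(p_happier, 3))
```

The second call illustrates the sign of the satisfaction coefficient: with everything else fixed, a more satisfied employee gets a lower predicted exit probability.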
Both models are useful tools for predicting the likelihood of an employee staying or leaving
from the other parameters.
For the dataset we used, parameters like employee satisfaction, last evaluation, work
accident, promotion and salary were among the most important in affecting employee retention.
Recommendations
Based on the insights we gained, the organization should focus strongly on employee
satisfaction. Some departments had poor satisfaction scores, so those departments need a
thorough review to understand the employees' problems and solve them.
Performance ratings also play a major role in employee satisfaction and in whether employees
leave or stay. Hence, fair and transparent appraisal systems must be in place. Continuous
feedback and appraisal can also be introduced, so that employees do not have to wait until the
end of the cycle to receive their feedback. Clear-cut goals, continuous feedback and
transparency in the process are the keys to achieving satisfaction.
We also found that a major portion of the employees were frustrated with their jobs, so it is
important for the management to attend to the needs of these employees. Since the number of
dissatisfied employees is so large, even an organization-wide cultural change might be
required. There are also good performers in the organization, and the organization should aim
to bring them to the satisfied status as well. Learning and development programmes, mentoring
sessions and a clear path for growth can be major steps towards employee satisfaction and
loyalty.
We identified the parameters that directly affect an employee, and the management should use
these insights to work carefully on retention. Finally, the organization should stay vigilant
and not miss any opportunity to retain a talented employee. Predictive models similar to the
ones we built could be used to monitor and predict employee behaviour, so that corrective
measures can be taken before the damage occurs. This would save the organization valuable
human and financial resources.
References
(n.d.). Retrieved from Stackoverflow: stackoverflow.com
(n.d.). Retrieved from Kaggle: https://www.kaggle.com/datasets
Edureka. (n.d.). Youtube. Retrieved from Machine Learning Tutorial:
https://www.youtube.com/watch?v=GwIo3gDZCVQ
Ng, A. (n.d.). AI for Everyone. Retrieved from Coursera:
https://www.coursera.org/learn/ai-for-everyone
StanfordOnline. (n.d.). Youtube. Retrieved from Machine Learning:
https://www.youtube.com/watch?v=jGwO_UgTS7I&list=PLoROMvodv4rMiGQp3WXShtMGgzqpfVfbU
