Facebook Comment Volume Prediction

“FACEBOOK COMMENT VOLUME PREDICTION”
Partial submission (Project Notes – I)
Post Graduate Program in Business Analytics and Business Intelligence
Capstone Project Report
Submitted to
Submitted by:
Yogesh Sharma
Under the guidance of
(Mr. Anirban Dey)
Batch- PGPBABI. Jan’19
Date of submission: 9th Nov, 19

ACKNOWLEDGEMENT
I would like to convey my sincere gratitude to my mentor, Mr. Anirban Dey, for his able guidance and
mentorship. I expect that his deep understanding of the use case and business intellect shall help me
in charting the right approach and deploying the appropriate models for the analytics problem at
hand.
I would also like to thank the Great Lakes management for giving me the opportunity to work on a real
case scenario, which will surely help me apply the learning in practice.
NOTE
This is a ‘Work in progress’ document, submitted in partial fulfillment of the requirements of
‘Project Notes-I’ only. The document will be enriched further in the next phases and as per mentor
comments.
Table of Contents
1. INTRODUCTION
2. PROJECT BACKGROUND
3. PROJECT OBJECTIVE
4. APPROACH AND METHODOLOGY
5. TECHNIQUES, TOOLS, DOMAIN
6. EXPLORATORY DATA ANALYSIS
7. APPENDIX
1. Introduction:
The growth of social networking has drawn considerable public attention over the past two decades.
For both small businesses and large corporations, social media plays a key role in brand building and
customer communication. Facebook is one of the social networking sites that firms use to make
themselves visible to customers. Facebook's advertising revenue in 2018 is estimated at 14.89 billion USD
in the United States against 18.95 billion USD elsewhere. Other categories such as news,
communication, commenting, marketing, banking and entertainment also generate huge volumes of social
media content every minute.
According to a 2018 Forbes survey, Facebook has 2 billion active users, making it the largest social media
platform.
Here are some more intriguing Facebook statistics:
- 1.5 billion people are active on Facebook daily

- Europe has more than 307 million people on Facebook
- There are five new Facebook profiles created every second!
- More than 300 million photos get uploaded per day
- Every minute there are 510,000 comments posted and 293,000 statuses updated
2. Project Background:
In this project, we use the most active social networking service, Facebook, and specifically ‘Facebook
Pages’, for analysis. Our research is oriented towards estimating the comment volume that a post is
expected to receive in the next few hours. Before continuing to the problem of comment volume prediction,
some domain-specific concepts are discussed below:
- Public Group/Facebook Page: A public profile specifically created for businesses, brands,
celebrities etc.
- Post/Feed: The individual stories published on a page by the administrators of the page.
- Comment: An important activity on social sites that gives a post the potential to become a discussion
forum. The extent to which readers are inspired to leave comments on a post is one measure of the
popularity of, or interest in, that post.
3. Project Objective:
Based on the training dataset provided (‘Facebook comment volume prediction’), the goal is to predict how
many comments a user-generated post is expected to receive in a given number of hours. We need to model
the user comment pattern over the set of variables provided and predict the number of comments for
each post with the minimum possible error.
Here, the comment volume prediction is made based on page category, i.e., a post on a page of a particular
category is expected to receive a certain number of comments. In order to predict the comment volume for
each page and to find which page category receives the most comments, I shall use ‘Decision tree’ and
‘regression techniques’. I shall also model the user comment pattern with respect to
Page Likes and Popularity, Page Category and Time.
As part of Project Notes – I, we shall focus on the following:
• Data Report
o Visual inspection of data (rows, columns, descriptive details)
o Understanding of attributes (variable info, renaming if required)
• Initial Exploratory Data Analysis
o Univariate analysis (distribution and spread for every continuous attribute, distribution of the
data in categories for categorical ones)
o Bivariate analysis (relationship between different variables, correlations)
Dataset and attribute information is attached in the Appendix section.
4. Approach and Methodology:
As the number of comments (the target in the Facebook dataset) is a numeric variable rather than a class
label, we shall perform regression analysis to determine the relationship between the target value and the
predictors. We shall also look at the distribution and spread of the variables using histograms and box plots.
We shall also evaluate Decision Tree, LASSO, K-Nearest Neighbor (KNN) and Random Forest models,
compared on root mean squared error (RMSE), to arrive at an effective prediction.
In our experimental setup, the data set is split into training and testing sets before modelling and then
converted into vector form to be fed to the prediction model; the model that yields the minimum error is
taken as the result.
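The split-fit-score loop described above can be sketched in R as follows. This is only a minimal illustration under stated assumptions: the data frame `fb` is synthetic (made-up values under column names taken from this report), the 70/30 split ratio is an assumption, and a plain linear regression stands in for whichever model is being compared.

```r
# Synthetic stand-in for the Facebook data; the column names follow the report,
# the values are made up purely for illustration.
set.seed(42)
n <- 500
fb <- data.frame(
  Page.likes       = rpois(n, 1000),
  CC1              = rpois(n, 20),
  Post.Share.Count = rpois(n, 5)
)
fb$Target.Variable <- rpois(n, lambda = 2 + 0.3 * fb$CC1)

# 70/30 split into training and test sets (the ratio is an assumption)
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- fb[train_idx, ]
test  <- fb[-train_idx, ]

# Fit a linear regression on the training set only
fit <- lm(Target.Variable ~ ., data = train)

# Root mean squared error on the held-out test set
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$Target.Variable - pred)^2))
rmse
```

The same `rmse` calculation can be reused for each candidate model, so that all models are compared on the same held-out rows.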
The process followed in each phase is shown in the diagram below:
Figure - 1
5. Techniques, Tools, Domain:
Techniques    EDA, Regression and Decision tree
Tools         R
Domain        Social Media
6. Exploratory Data Analysis:
6.1 First Level Insights
The data set used is a ‘Facebook comment volume’ record captured over a period of time, containing 32,759
lines and 43 variables.
Of the 43 variables, one is the target value for each post; the remaining features are categorized below
based on their relation to the target variable.
1) Page Features: These describe the popularity/likes of a page, its check-ins and its category. Page
Likes describes user interest in the page category, expressed through statuses, wall posts, photos,
profile pictures, shares or pages.
2) Essential Features: The pattern of comments from different users on the post at various time intervals
relative to a randomly selected base time/date. CC1 to CC5 cover this.
Figure - 2
3) Weekday Features: These cover the complete week and identify the weekday on which the post was
published and the weekday of the selected base time/date.
4) Other basic Features: The remaining features that help to predict the comment volume for each
page category, including those documenting the source of the page and the date/time window of the next
H hours.
5) Without using Parameter: The prediction behaves as expected when the model is fitted without
specifying any parameters. This default regression gave the best prediction results among the runs,
compared with those that used specified parameters:
glm(formula = Target.Variable ~ ., data = train_scale)
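To make the "no parameters" case concrete, the sketch below fits `glm()` with its defaults on a small synthetic data set. It is an illustration only: the data values are made up, and the way `train_scale` is built here (standardising the predictors with `scale()` and leaving the target untouched) is an assumption, since the report does not show that step.

```r
# Synthetic training data; the values are made up for illustration.
set.seed(1)
train <- data.frame(CC1 = rpois(200, 20), Page.likes = rpois(200, 1000))
train$Target.Variable <- rpois(200, lambda = 1 + 0.2 * train$CC1)

# Standardise the predictors only; the target stays on its original scale
# (how train_scale was actually produced is an assumption).
predictors <- setdiff(names(train), "Target.Variable")
train_scale <- train
train_scale[predictors] <- scale(train[predictors])

# With no 'family' argument, glm() uses gaussian(), i.e. ordinary least squares
fit_glm <- glm(Target.Variable ~ ., data = train_scale)
fit_lm  <- lm(Target.Variable ~ ., data = train_scale)

family(fit_glm)$family
all.equal(coef(fit_glm), coef(fit_lm))
```

Writing the response as a bare column name (`Target.Variable ~ .`) matters: the `.` then expands to every other column in `data`, so the target is not accidentally included among its own predictors.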
6.2 Variable Rationalization
In order to study the data better, we performed a preliminary variable reduction at the outset.
At this stage, we classified the variables according to the following criteria:
▪ Redundant Variables
▪ Business relevance
▪ Correlated Variables
▪ Target Variable
Variable name              Type of Variable
Page Popularity/likes      Business relevance
Page Check-ins             Business relevance
Page talking about         Business relevance
Page Category              Business relevance
Feature 5 – Feature 29     Redundant Variable
CC1                        Correlated Variable
CC4                        Correlated Variable
CC5                        Correlated Variable
Base time                  Business relevance
Post length                Redundant Variable
Post Share Count           Business relevance
Post Promotion Status      Business relevance
H Local                    Business relevance
Post published weekday     Business relevance
Base Date Time weekday     Business relevance
Comments                   Target Variable
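One possible way to identify the ‘Correlated Variables’ group is to scan the correlation matrix for pairs above a cut-off. The sketch below is illustrative only: the data are synthetic (CC4 is deliberately built to track CC1), and the 0.9 cut-off is an assumption, not a value taken from this project.

```r
# Synthetic data: CC4 is constructed to be almost identical to CC1.
set.seed(7)
n <- 300
cc1 <- rpois(n, 30)
fb <- data.frame(
  CC1        = cc1,
  CC4        = cc1 + rpois(n, 1),
  Page.likes = rpois(n, 1000)
)

corr <- cor(fb)

# Keep each pair once by blanking the diagonal and the lower triangle
corr[lower.tri(corr, diag = TRUE)] <- NA

# Flag pairs whose absolute correlation exceeds the (assumed) 0.9 cut-off
high <- which(abs(corr) > 0.9, arr.ind = TRUE)
flagged <- data.frame(
  var1 = rownames(corr)[high[, "row"]],
  var2 = colnames(corr)[high[, "col"]],
  r    = corr[high]
)
flagged
```

For each flagged pair, one of the two variables can then be reviewed for removal or kept on grounds of business relevance.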
6.3 Data Preparation and Analysis
Let us start the analysis using R:
## Set Working Directory
setwd("C:/Users/Yogesh Sharma/Desktop/Capstone")
getwd()
## Read Input data
Comments = read.csv("Facebook.csv", header = TRUE)
## View column names
names(Comments)
## View Structure and Summary of Input data
str(Comments)
summary(Comments)
OBSERVATIONS:
1. Dependent Variable: Target.Variable
2. All independent variables are numeric or integer except Post.published.weekday and
Base.DateTime.weekday
3. Missing values present in Page.likes, Page.Checkins, Page.talking.about, Page.Category, CC1-CC5
and the derived features
4. Max value for some key variables is high compared to the 3rd quartile - possibility of outliers?
5. Similar outlier possibility found in Page.likes, Page.Checkins, Page.talking.about, CC1-CC5 and
Post.Share.Count
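The missing-value and outlier observations above can be quantified per column. The sketch below is a minimal illustration on a tiny made-up data frame; the 1.5 x IQR rule is one common convention for flagging outliers and is an assumption here, as is the helper name `count_outliers`.

```r
# Tiny made-up data frame with one NA per column and one extreme value
fb <- data.frame(
  Page.likes = c(120, 340, NA, 560, 410, 98000),
  CC1        = c(0, 2, 5, NA, 3, 1)
)

# Number of missing values in each column
na_counts <- colSums(is.na(fb))

# Hypothetical helper: count values outside the 1.5 * IQR fences
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence, na.rm = TRUE)
}
outlier_counts <- sapply(fb, count_outliers)

na_counts
outlier_counts
```

On the real data the same two calls can be run on the numeric columns of `Comments` to replace the visual check of `summary()` output with counts.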
## Examine Dependent Variable 'Target.Variable'
attach(Comments)
## Build Histogram for Target.Variable to understand its distribution
hist(Target.Variable)
OBSERVATIONS: Possibly outlier(s) affecting the histogram
boxplot(Target.Variable)
OBSERVATIONS:
Most of the comment counts are at the lower end - one outlier very far out
For now, let us examine only low comment counts (< 1100)
library(dplyr)
## Keep the filtered rows in a new data frame so the original 'Comments' stays intact
Comments_low = filter(Comments, Target.Variable < 1100)
hist(Comments_low$Target.Variable)
OBSERVATIONS:
Number of obs reduced from 32760 to 32757 - therefore, there were 3 outliers
Filtered comment counts resemble a Normal Distribution
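One way to check how closely a filtered target follows a normal distribution is to compare its mean and median and compute its skewness. The sketch below uses made-up right-skewed counts in place of the real target, and the moment-based `skewness` helper is defined here for illustration (base R does not ship one).

```r
# Made-up, right-skewed counts standing in for Target.Variable
set.seed(3)
target <- rnbinom(1000, size = 0.5, mu = 7)
low <- target[target < 1100]

# Moment-based skewness: about 0 for a symmetric (e.g. normal) distribution
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

c(mean = mean(low), median = median(low), skewness = skewness(low))
```

A mean well above the median together with a clearly positive skewness would point to a right-skewed distribution, in which case a transformation such as `log1p()` may be worth examining before regression.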
## Let us now examine the Integer Independent variables using the original dataset
names(Comments)
hist(Page.likes)
hist(Page.talking.about)
hist(Page.Category)
hist(CC1)
hist(CC2)
hist(CC3)
hist(CC4)
hist(CC5)
hist(H.local)
boxplot(Page.Category)
## Now let us examine the Categorical Variables
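table(Post.published.weekday)
> table(Post.published.weekday)
Post.published.weekday
             Friday    Monday  Saturday    Sunday  Thursday   Tuesday Wednesday
        1      4813      4693      4437      4043      4692      4920      5161
>
OBSERVATIONS: The first count (1) has no label, which suggests one record with a blank weekday
## barplot(table(...)) works whether the column is read as factor or character
barplot(table(Post.published.weekday))
table(Base.DateTime.weekday)
barplot(table(Base.DateTime.weekday))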
6.4 Visual representation (Univariate and Bi-variate)
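As a hedged illustration of the kind of univariate and bivariate views intended for this section, the sketch below summarises a synthetic data set. The values are made up, only the column names follow this report, and the plots shown are one possible choice rather than the final figures.

```r
# Synthetic data under the report's column names; values are made up.
set.seed(11)
days <- c("Monday", "Tuesday", "Wednesday", "Thursday",
          "Friday", "Saturday", "Sunday")
fb <- data.frame(
  Post.published.weekday = factor(sample(days, 700, replace = TRUE),
                                  levels = days),
  CC1 = rpois(700, 20)
)
fb$Target.Variable <- rpois(700, lambda = 1 + 0.2 * fb$CC1)

# Univariate: number of posts per weekday
weekday_counts <- table(fb$Post.published.weekday)

# Bivariate (categorical vs target): mean comments per weekday
mean_by_day <- tapply(fb$Target.Variable, fb$Post.published.weekday, mean)

# Bivariate (numeric vs target): correlation of CC1 with the target
r_cc1 <- cor(fb$CC1, fb$Target.Variable)

barplot(mean_by_day, las = 2, ylab = "Mean comments")
boxplot(Target.Variable ~ Post.published.weekday, data = fb, las = 2)
```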
************* This is WIP document for Project Notes-1 only ******************
7. Appendix:
Title                  Artifact/Location                                                                   Remarks
Source of Data         https://olympus.greatlearning.in/courses/4012/files/459750?module_item_id=265652
List of Variable       Data Dictionary.docx                                                                Variables description and rationale behind selection
R Code for Reference   R File for Facebook