Fake News Detection Project Report

Fake News Detection
Ritika Nair, Shubham Rastogi, Tridiv Nandi

Northeastern University
Abstract
In our modern era, where the internet is ubiquitous, everyone relies on various
online resources for news. With the increasing use of social media platforms
such as Facebook and Twitter, news spreads rapidly among millions of users
within a very short span of time. The spread of fake news has far-reaching
consequences, from the creation of biased opinions to the swaying of election
outcomes for the benefit of certain candidates. Moreover, spammers use
appealing news headlines as click-bait to generate advertising revenue. In this
project, we aim to perform a binary classification of various news articles
available online with the help of concepts pertaining to Artificial
Intelligence, Natural Language Processing and Machine Learning.

1 Introduction
With the growing popularity of mobile technology and social media, information
is accessible at one's fingertips. Mobile applications and social media
platforms have overthrown traditional print media in the dissemination of news
and information. It is only natural that, with the convenience and speed that
digital media offers, people prefer it for their daily information needs. Not
only has it empowered consumers with faster access to diverse data, it has also
provided profit-seeking parties with a strong platform to capture a wider
audience.

With the outburst of information, it is tedious for a layman to distinguish
whether the news he consumes is real or fake. Fake news is typically published
with an intent to mislead or create bias in order to acquire political or
financial gains. Hence it tends to have luring headlines or interesting content
to increase viewership.

In the recent elections in the United States, there has been much debate
regarding the authenticity of various news reports favoring certain candidates
and the political motives behind them. Amidst such growing concerns, the
detection of fake news gains utmost importance in preventing its negative
impact on individuals and society.

The most common algorithms used by fake news detection systems include machine
learning algorithms such as Support Vector Machines, Random Forests, Decision
Trees, Stochastic Gradient Descent, Logistic Regression and so on. In this
project we have attempted to implement two of these algorithms to train and
test our results. We have used a combination of off-the-shelf datasets and
additional data obtained by crawling content on the web. The main challenge
throughout the project has been to build a uniform, clean dataset and to tune
the parameters of our algorithms to attain the maximum accuracy.

We observed that the Logistic Regression algorithm with a simple term
frequency-inverse document frequency vector performed the best out of the four
algorithms we tested. In section 2 we describe the data collection, building of
the dataset and the text preprocessing techniques used. In section 3, the
models and their algorithms are discussed. Section 4 discusses evaluation of
results and the scope for future enhancements.

2 Data Collection
The dataset for this project was built with a mix of both real and fake news.
Most of the data was manually crawled and extracted, whereas some was used off
the shelf. The entire dataset amounted to 125,600 news articles, of which
15,600 were fake news and 110,000 were real news.

To efficiently collect such a large amount of data, we created a multi-threaded
web crawler. We ran the crawler using up to 100 threads at a time and
downloaded the raw HTML body content of the crawled pages.

The sources of real news include Yahoo News, AOL, Reuters, Bloomberg and The
Guardian, among many others. Sources of fake news include TheOnion,
UsaNewsFlash, Truth-Out, The Controversial Files and so on.

To extract important content from the crawled pages we used two strategies. The
first was to reduce noise by removing insignificant and irrelevant information
like images, tables, headers, footers, special symbols, navigation bars etc. The
second strategy was to extract HTML div tags from the remaining content whose
id property was ‘content’ or some variation of ‘content’. With this we were
able to extract most of the important information across many webpages. Since
each website has its own layout and parameters, no single rigid extraction rule
would have worked for all of them, and hence we leveraged this generic
approach.

The collected data was processed using various text preprocessing measures, as
explained later, and stored in CSV files. The real and fake data were then
merged and shuffled to get a CSV file containing a consolidated randomized
dataset.

From the consolidated randomized dataset we picked 20845 records at random,
which contained approximately 50% real news and 50% fake news articles. Of
these records, 80% was used for training the detection model and 20% was
reserved for testing the model.

2.1 Real News
The News Aggregator Dataset from the UCI Machine Learning Repository was used
to extract real news. This dataset consists of links to the news articles as
originally published on their websites. We extracted these URLs and crawled
them to download the news content using BeautifulSoup.

We extracted the body content of the articles by removing unnecessary
information such as headers, footers, images, advertisements, tables etc.
Further, we extracted the text from the div tags holding the content and
performed preprocessing steps on it before saving it into CSV files.

2.2 Fake News
For fake news we used Kaggle’s ‘Getting Real about Fake News’ dataset. The CSV
file with the data was available off the shelf for use, and we had to perform
only minimal text processing on this data.

2.3 Text Preprocessing
Since most of the data was crawled and extracted manually, we first had to go
through the data to understand the organization and formatting of the text. The
data was made uniform and comparable by converting it into UTF-8 encoding. In
some cases we encountered stray symbols and letters incompatible with the
character set, which had to be removed. We noticed that the data from news
articles was often organized into paragraphs, so we performed trimming to get
rid of extra spaces and empty lines in the text.

4. Feature Generation
For generation of features from the given data, we first performed tokenization
on the raw text of the articles. We then generated tf-idf feature vectors as
described below.

4.1 Term Frequency - Inverse Document Frequency
The tf-idf is a statistical measure that reflects the importance of a
particular word with respect to a document in a corpus. It is often used in
information retrieval and text mining as one of the components for scoring
documents and performing searches. It is a weighted measure of how often a word
occurs in a document relative to how often it occurs across all documents in
the corpus. Term frequency is the number of times a term occurs in a document.
Inverse document frequency is an inverse function of the number of documents in
which the term occurs.

Figure 1. TF-IDF formula for weight of term i in document j. (Source:
researchgate.net)

Hence a term like “the” that is common across a collection will have a lower
tf-idf value, as its weight is diminished by the idf component. The weight
computed by tf-idf thus represents the importance of a term inside a document.

The tokenized data was used to generate a sparse matrix of tf-idf features for
representation. This represented our feature vector and was used in the
subsequent prediction algorithms.

3. Prediction Algorithms
We implemented two different algorithms from scratch for the prediction model:
the Logistic Regression model and the Naïve Bayes classifier model. The
algorithms and the details of their implementation are explained in the
sections below. In addition to these, we also trained and tested our dataset on
two other models: the Random Forests model and the Support Vector Machine
model. Given the short time frame of the project, the last two algorithms were
prudently implemented with the help of scikit-learn libraries.
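
As a concrete illustration of the tf-idf features that feed all four models, the following minimal sketch computes the weights by hand. It assumes the common form w(i, j) = tf(i, j) * log(N / df(i)), where N is the number of documents and df(i) is the number of documents containing term i. The function name `tfidf` and the three toy documents are hypothetical and are not taken from the project code. Library implementations may use a different variant (scikit-learn's TfidfVectorizer, for instance, smooths the idf by default), so their values will not match this sketch exactly.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed weighting: w(i, j) = tf(i, j) * log(N / df(i)).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw count of each term in this document
        weights.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return weights


# Hypothetical toy corpus, already tokenized by whitespace.
docs = [
    "the senate passed the bill".split(),
    "the aliens endorsed the candidate".split(),
    "the market closed higher".split(),
]
w = tfidf(docs)
```

Because "the" occurs in every toy document, its idf is log(3/3) = 0 and its weight vanishes, whereas a term such as "senate" that occurs in a single document receives the weight log(3).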
3.1 Logistic Regression
Logistic Regression is a Machine Learning technique used to estimate
relationships among variables using statistical methods. This algorithm is
great for binary classification problems as it deals with predicting the
probabilities of classes, hence our decision to choose it as our baseline run.
It relies on fitting the probability of true scenarios to the proportion of
actual true scenarios observed. Also, this algorithm does not require large
sample sizes to start giving fairly good results.

Figure 2. Logistic Regression Model (Source: towardsdatascience.com)

Figure 3. Logistic Regression Pseudocode (Source: whatbeg.com)

The Logistic Regression algorithm assigns observations to a discrete set of
classes: it transforms a weighted combination of the input features using a
sigmoid function to give a probability value, which can then be mapped to a
discrete class.

3.2 Naïve Bayes Classifier
This is a simple yet powerful classification model that often works remarkably
well. It uses the probabilities of the elements belonging to each class to form
a prediction. The underlying assumption in the Naïve Bayes model is that the
probability of an attribute belonging to a class is independent of the other
attributes of that class. Hence the name ‘Naive’.

Figure 4. Naïve Bayes formula (Source: techleer.com)

Figure 5. Naïve Bayes Pseudocode (Source: web.stanford.edu)
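
The Naïve Bayes pseudocode referenced in Figure 5 is not reproduced in the text, so the following is a minimal sketch of a multinomial Naïve Bayes text classifier in that spirit. It is an illustration under stated assumptions, not the project's implementation: the names `train_nb` and `predict_nb`, the add-one smoothing value `alpha=1.0` and the four toy headlines are all hypothetical. Log probabilities are summed instead of multiplying raw probabilities, which avoids numerical underflow on long documents.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Fit class log-priors and smoothed per-class word log-likelihoods."""
    vocab = {word for doc in docs for word in doc}
    class_docs = Counter(labels)          # number of documents per class
    word_counts = defaultdict(Counter)    # word counts per class
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_like = {}
    for c in class_docs:
        # Additive smoothing: every vocabulary word gets alpha pseudo-counts.
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {
            word: math.log((word_counts[c][word] + alpha) / total)
            for word in vocab
        }
    return log_prior, log_like, vocab


def predict_nb(model, doc):
    """Return the class with the highest log-posterior score for a document."""
    log_prior, log_like, vocab = model
    scores = {
        c: log_prior[c] + sum(log_like[c][word] for word in doc if word in vocab)
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Hypothetical toy training set.
train_docs = [
    "shocking miracle cure doctors hate".split(),
    "you will not believe this shocking secret".split(),
    "senate passes budget bill".split(),
    "central bank raises interest rates".split(),
]
train_labels = ["fake", "fake", "real", "real"]

model = train_nb(train_docs, train_labels)
prediction = predict_nb(model, "shocking secret cure".split())
```

Words that never appeared in training are simply skipped at prediction time in this sketch; other treatments of unseen words are possible.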
In the Naïve Bayes model we multiply the conditional probabilities of each
attribute given the class value to get the probability of the test data
belonging to that class. We arrive at the final prediction by selecting the
class with the highest probability for the instance.

The advantages of using Naïve Bayes are that it is simple to compute and that
it works well in categorizing data, as we are using ratios for computation. The
formula used for this model is given in Figure 4.

3.3 Random Forest Classifier
Random Forests are a machine learning method of classification that works by
building several decision trees while training the model. It is a kind of
additive model that makes predictions from a combination of decisions from base
models. Individual decision trees can grow very deep and tend to overfit.
Random forest utilizes multiple decision trees to average out the results.

The Random Forest classifier creates a set of decision trees, each from a
subset of the training data. It aggregates the results from the different
decision trees and then decides the final classification of the test data. The
subsets of data used in the decision trees may overlap.

Figure 6. Random Forest Model (Source: globalsupportsoftware.com)

3.4 Support Vector Machine
Support Vector Machines are machine learning models that perform supervised
learning on data for classification and regression. When given a labeled
training dataset, an SVM computes the optimal hyperplane that categorizes the
test data.

Figure 7. Support Vector Machine (Source: analyticsvidhya.com)

Data points are plotted in a multidimensional space, where the dimension is
determined by the number of features at our disposal. The value of each feature
gives the coordinate of a data point along one axis. The algorithm then
performs classification by finding a hyperplane that differentiates the two
classes well. The hyperplane having the maximum margin between the two classes
is chosen.

The advantages of the SVM model are that it performs very well in
high-dimensional spaces and also creates a clear margin of separation between
data points. The disadvantage of using SVM was that it takes more time to train
than the other models, especially when the dataset is large.

4 Evaluation Metrics
We used the following three metrics for the evaluation of our results. The use
of more than one metric helped us evaluate the performance of the models from
different perspectives.

4.1 Classification Accuracy
This depicts the number of accurate predictions made out of the total number of
predictions made.
Classification accuracy is calculated by dividing the total number of correct
results by the total number of test data records and multiplying by 100 to get
the percentage.

Accuracy = (No. of Correct Predictions / Total No. of Test Records) × 100

4.2 Confusion Matrix
This is a great visual way to depict the predictions as four categories:
1. False Positive: Predicted as fake news but is actually true news.
2. False Negative: Predicted as true news but is actually fake news.
3. True Positive: Predicted as fake news and is actually fake news.
4. True Negative: Predicted as true news and is actually true news.

4.3 Precision and Recall
Precision, which is also known as the positive predictive value, is the ratio
of relevant retrieved instances to all retrieved instances.

Precision = No. of True Positives / (No. of True Positives + No. of False Positives)

Recall, which is also known as sensitivity, is the proportion of relevant
instances retrieved among the total number of relevant instances.

Recall = No. of True Positives / (No. of True Positives + No. of False Negatives)

5 Optimization

5.1 Feature Selection and Extraction
Feature selection was a major part of our text-based classification problem. We
used tf-idf vectorization of the news data we collected, but the dimensionality
of the resulting vectors was quite high, which caused models like SVM and
Logistic Regression to run for a very long time on large datasets. To resolve
the issue we used text-based transformation techniques such as stopping (stop
word removal). We passed a list of stop words generated using the NLTK library
as a parameter to the sklearn TFIDF vectorization class. We also set the
max_features parameter to 50000, i.e. the TFIDF class generates a vocabulary
and considers only the top max_features terms ordered by term frequency across
the corpus. This was only sensible after stop word removal, since stop words
are the most frequent words in the documents and would otherwise have dominated
the vocabulary.

5.2 Naïve Bayes
For the Naïve Bayes model, a smoothing parameter was added and log
probabilities were taken for accurate calculations.

5.3 Logistic Regression
In the implementation of Logistic Regression, we collected the loss in each
iteration in an array. Then we plotted the curve of the loss in each iteration
against the number of iterations. The resulting curve is shown in Figure 8.

Figure 8. Loss vs Iteration Curve

5.4 Support Vector Machine
In SVM, we used GridSearchCV from sklearn to find the optimized parameters that
give us the most accurate predictions. The grid search produces better results
but also slows down the training process.
The parameters used are as follows:
parameters = {'C': [0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10], 'kernel': ['linear'], 'random_state': [1]}

5.5 Random Forest
For the Random Forest model no tuning was required.
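
To make the loss tracking of Section 5.3 and the metrics of Section 4 concrete, the sketch below trains a small logistic regression by batch gradient descent, stores the cross-entropy loss of every iteration in a list, and then derives accuracy, precision and recall from confusion-matrix counts. It is a minimal illustration and not the project's code: the function names, the learning rate of 0.5, the 1000 iterations and the six-point toy dataset are all assumptions. The last line applies the same metric function to the Logistic Regression counts reported in Section 6.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.5, iterations=1000):
    """Batch gradient descent; returns weights, bias and the loss per iteration."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    losses = []  # one mean cross-entropy value per iteration
    for _ in range(iterations):
        probs = [sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) for x in X]
        loss = -sum(
            yi * math.log(p + 1e-12) + (1 - yi) * math.log(1 - p + 1e-12)
            for yi, p in zip(y, probs)
        ) / n
        losses.append(loss)
        # Gradient step on the weights and the bias.
        errors = [p - yi for p, yi in zip(probs, y)]
        for j in range(d):
            w[j] -= lr * sum(e * x[j] for e, x in zip(errors, X)) / n
        b -= lr * sum(errors) / n
    return w, b, losses


def predict(w, b, x):
    """Map the sigmoid probability to a class: 1 (fake) if p >= 0.5, else 0."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) with label 1 (fake news) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def metrics(tp, tn, fp, fn):
    """Accuracy in percent, plus precision and recall as fractions."""
    accuracy = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Hypothetical toy data: two numeric features per article, label 1 = fake.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.9, 1.0], [1.0, 0.8], [0.8, 0.9]]
y = [0, 0, 0, 1, 1, 1]

w, b, losses = train_logistic(X, y)
toy_metrics = metrics(*confusion_counts(y, [predict(w, b, x) for x in X]))

# Logistic Regression counts reported in Section 6: TP, TN, FP, FN.
reported = metrics(2155, 1750, 54, 41)
```

Plotting `losses` against the iteration index would give a curve of the kind captioned in Figure 8. The `reported` tuple evaluates to roughly 97.6% accuracy with precision and recall of about 0.98, in line with the Logistic Regression column of the results tables.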
6 Observed Results
The results observed in terms of evaluation metrics are as follows:

                                  MODELS
METRICS      Naïve Bayes   Logistic Regression   SVM    Random Forest
Accuracy     57%           98%                   97%    97%
Precision    0.59          0.98                  0.98   0.98
Recall       0.70          0.98                  0.97   0.96

N.B: The above results are for evaluating the models on a data set of size
20,000 (model trained on 80% and tested on 20% of the data).

The confusion matrix data for each model run on a data set of size 20,000,
split into 80% training and 20% test data, are as follows:

                                       MODELS
METRICS           Naïve Bayes   Logistic Regression   SVM    Random Forest
True Positive     1538          2155                  2125   2114
True Negative     756           1750                  1754   1756
False Positive    1048          54                    45     48
False Negative    658           41                    76     82
Total Test Data   4000          4000                  4000   4000

Scatter plots for the various models on a small set of data, i.e. 1000 records,
are shown below. In each figure the first subplot shows the predicted values
plotted in red, whereas the second subplot shows the actual values plotted in
green.

Naïve Bayes:
Figure 9. Scatter plot for Naïve Bayes Classifier

Logistic Regression:
Figure 10. Scatter plot for Logistic Regression model

Support Vector Machine:
Figure 11. Scatter plot for Support Vector Machine

Random Forest:
Figure 12. Scatter plot for Random Forest Classifier

7 Conclusion
Naïve Bayes performed very poorly in the bag-of-words model. Logistic
Regression with the tf-idf vector as a feature improved this result
substantially.

Logistic Regression performed surprisingly well: as observed from the above
results, it performed slightly better than the Support Vector Machine and
Random Forest models.

The result can be further improved if n-grams are used to generate the tf-idf
vectors that serve as features.

8 Contributions
The contributions of individual group members were as follows:

Ritika was responsible for extraction of the fake news dataset, preprocessing
the text, and the collection and randomization of the cleaned data into CSV
files. She also implemented the training of data using the Random Forest
classifier and the Support Vector Machine classifier using the sklearn library
and tested the results for various input sizes using the evaluation measures
described.

Shubham was responsible for extraction of real news data by creating a
multithreaded web crawler and for analyzing and pre-processing the text. He was
responsible for the collection of this data into CSV files, merging it with the
fake news data to form a consolidated dataset, and cleaning up the data into
usable formats. He was also responsible for the implementation of the Naïve
Bayes classifier from scratch and the evaluation and visualization of results
for the same.

Tridiv was responsible for analysis and cleaning of the consolidated dataset,
randomizing it and segregating it into training and testing data. He was also
responsible for the implementation of the Logistic Regression model from
scratch. He trained and tested the data using this model, evaluated the results
using the described evaluation measures and graphically plotted the results.

All three members were involved in the topic selection, analysis, design,
algorithm selection and documentation phases.

References
https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/

www.analyticsvidhya.com

[K Gunasekaran et al, 2016] Fake News Detection in Social Media. Vol 1, No. 1,
Article 1, January 2016.

[Victoria L Rubin, Yimin Chen, and Niall J Conroy] Deception detection for
news: three types of fakes. Proceedings of the Association for Information
Science and Technology, 52(1):1–4, 2015.

[Niall J Conroy, Victoria L Rubin, and Yimin Chen] Automatic deception
detection: methods for finding fake news. Proceedings of the Association for
Information Science and Technology, 52(1):1–4, 2015.

En.wikipedia.org. Support vector machine.
https://en.wikipedia.org/wiki/Support_vector_machine

Stat.berkeley.edu. Random forests - classification description.
https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm

[Nebel, 2000] Bernhard Nebel. On the compilability and expressive power of
propositional planning formalisms. Journal of Artificial Intelligence Research,
12:271–315, 2000.

[Daniel Jurafsky, James H. Martin] Naïve Bayes and Sentiment Classification,
Speech and Language Processing, Chapter 6.
https://web.stanford.edu/~jurafsky/slp3/6.pdf

http://scikit-learn.org/stable/

http://jmlr.csail.mit.edu/papers/v12/pedregosa11a.html

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

http://scikit-learn.org/stable/modules/svm.html

Lecture Notes (Logistic Regression) of CS 6140: Machine Learning, CCIS,
Northeastern University by Prof. Bilal Ahmed & Prof. Yu Wen.

Lecture Notes (Logistic Regression) of CS 5100: Fundamentals of Artificial
Intelligence, CCIS, Northeastern University by Prof. David Smith.

Lecture Notes (Naïve Bayes) of CS 5100: Fundamentals of Artificial
Intelligence, CCIS, Northeastern University by Prof. David Smith.

Datasets:
https://www.kaggle.com/mrisdal/fake-news

[Dua, D. and Karra Taniskidou, E. (2017).] UCI Machine Learning Repository
[http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School
of Information and Computer Science.

NLTK:
[Bird, Steven, Edward Loper and Ewan Klein (2009).] Natural Language Processing
with Python. O'Reilly Media Inc.
