
A Deep Learning Based Fast Fake News Detection Method For Cyber Physical Social Services

CHAPTER 1

INTRODUCTION

1.1 Overview
Fake news today causes problems ranging from satirical articles to fabricated stories and planted government propaganda in some outlets. Fake news and the resulting lack of trust in the media are growing problems with huge ramifications for our society. A purposely misleading story is obviously "fake news", but lately the discourse on social media is changing the term's definition: some now use it to dismiss facts that run counter to their preferred viewpoints. The role of disinformation in American political discourse received heavy attention, particularly following the 2016 American presidential election. The term 'fake news' became common parlance for the issue, particularly to describe factually incorrect and misleading articles published mostly for the purpose of making money through page views. This project aims to produce a model that can accurately predict the likelihood that a given article is fake news. Facebook has been at the epicenter of much critique following media attention. It has already implemented a feature that lets users flag fake news on the site, and it has said publicly that it is working on distinguishing such articles in an automated way. Certainly, this is not an easy task. A given algorithm must be politically unbiased, since fake news exists on both ends of the spectrum, and must also give equal treatment to legitimate news sources on either end of the spectrum. In addition, the question of legitimacy is a difficult one. To solve this problem, it is first necessary to understand what fake news is; then we can look into how techniques from the fields of machine learning and natural language processing help us detect it.

The progression and advancement of hand-held devices and high-speed Internet have exponentially increased the number of digital media users. According to the Digital 2020 global report, the number of digital media users reached 4.75 billion, and the number of social media users reached 301 million in 2020. This digitalization has turned the world into a global village. Thanks to this advancement, individuals are just one click away from information worldwide. Despite its many advantages, this transformation has also raised challenges, and fake news is one of the challenges the digital community faces today.

Fake news is pervasive propaganda that spreads misinformation online, using social media such as Facebook, Twitter, and Snapchat to manipulate public perception. Social media has two sides for news consumption: it can be used to update the community about the latest news, but it can also be a source of spreading false news. Social media offers low-cost, quick access and fast distribution of news and information about what is happening worldwide. However, due to its simplicity and the lack of control on the Internet, it also allows "fake news" to spread widely.

Fake news has become a focal point of discussion in the media over the past three years due to its impact on the 2016 US presidential election. Reports showed that humans' ability to detect deception without special assistance is only 54%. Therefore, there is a need for an automated way to classify fake and real news accurately. Some studies have been conducted, but the problem still needs further attention and exploration. The proposed study attempts to curb the spread of rumors and fake news and helps people judge whether a news source is trustworthy by automatically classifying the news.

In the Internet age, most people spend the majority of their time on their mobile phones. The younger generation consumes news through social media or online news blogs, while the older generation spends leisure time watching the news on TV or reading it in the newspaper. Since news is now readily available with a single click, there is no longer a need to purchase a newspaper to read it.

However, with such latitude, we have seen an all-time surge in the prevalence of fake news on the Internet and social media. Anyone on the internet or social media may publish whatever they want, making traditional fact-checking nearly impossible. Along with modern-day journalism, this has resulted in an increase in fake news, which is easily accessible on the Internet for anybody to read.

Fake news, as defined by various authors, comprises misleading content that may deceive readers and fabricated stories that appear to come from legitimate sources. Fake news is mainly propagated via the internet, either through websites or social media. These websites attempt to appear legitimate by naming themselves after legitimate websites, and some of them vanish as soon as the intended results are obtained through fake news promotion.

The rise of new information technologies has produced many new forms of consumption, such as online shopping, multimedia entertainment, gaming, advertising, and learning. One of the sectors greatly impacted by this new paradigm is the information industry. In recent years, there has been an exodus of users from traditional media such as newspapers, radio, and television to new formats: social networks, YouTube, podcasts, online journals, news applications, etc. The main cause of this decline of traditional media is the growing ability, thanks to the Internet, to get instant and free access to a wide variety of information sources, not to mention the numerous services that allow sharing news with millions of people around the world.

As a result, the media have started to react to this change. Some, for example, have begun to prioritize their online presence or have adopted new distribution channels such as videos or podcasts. Their current challenge is finding a way to make these new distribution formats, which have always been free for the end user, profitable. Most of these media have decided to monetize their content through advertising embedded in their articles, videos, and so on. One of the most frequent methods is publishing articles with flashy headlines and photos designed to be shared on social networks (known as clickbait) so that users navigate to their websites, thus maximizing revenue. However, this kind of approach can lead to dangerous situations: the large amount of information that people access is usually unverified and generally assumed to be true.


It is at this point that the term fake news arises. The problem reached its peak of notoriety after the 2016 US electoral campaign, when it was determined that fake news had been crucial in polarizing society and promoting the triumph of Donald Trump.
The big technology companies (Twitter, Facebook, Google) have spotted this danger and have already begun to work on systems to detect fake news on their platforms. All the same, even though these methods evolve fast, this is still a very complicated problem that needs further investigation.

Fake news has certain standard features or characteristics that make it easier to determine whether an item is fake or real. These include low facticity, such as misleading or deceptive material; journalistic styling, such as structural elements like a headline, text, and body; and an intention to deceive for personal benefit, whether financial, political, or to provoke someone. Other characteristics include grammatical and spelling errors, or content that is not attributed to any original news source and is not verified by a reputable fact checker.

Many websites, such as PolitiFact and Snopes, frequently check the news for legitimacy to keep the public informed about which news is fake or real, and many researchers have been developing repositories to identify which web pages are fraudulent or genuine. Any news that comes out of a publishing house contains information content that reaches consumers of daily news. The news is then put up on the internet by bloggers, online news agencies, and social media platforms. Sometimes news is fabricated on social media or anonymous blogs, causing upheaval or riots in modern society. As a result, it is vital to keep bogus news in check. However, the traditional fact-checking approach is labor-intensive, time-consuming, and inefficient. We therefore require a more robust method, such as the use of advanced machine learning algorithms, to help classify news as false or genuine based on its semantics or attributes.

1.2 Problem Statement



In the era of digital information proliferation, the spread of fake news poses a
significant threat to societal well-being, public discourse, and democratic processes.
Traditional methods of detecting misinformation often lack efficiency and scalability
in coping with the sheer volume and rapid dissemination of deceptive content across
various online platforms. Hence, there is a pressing need for a deep learning-based
fast fake news detection method that can accurately and swiftly identify misleading
information amidst the vast expanse of online data. This project aims to develop such
a solution by leveraging advanced deep learning techniques to analyze textual, visual,
and contextual cues, thereby empowering users and platforms with a robust tool to
combat the proliferation of fake news in real-time.

1.3 Objectives of the Project

1: Develop a Robust Deep Learning Model

The primary objective of this project is to design and implement a robust deep learning model capable of detecting fake news with high accuracy and efficiency. This entails exploring various deep learning architectures such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), and transformers, and optimizing their configurations for the task of fake news detection. By leveraging the expressive power of deep learning, the aim is to create a model that can effectively capture the complex patterns and semantic cues indicative of fake news across different types of textual and multimedia content.

2: Dataset Curation and Preprocessing

Another crucial objective is to curate and preprocess a comprehensive dataset comprising both real and fake news articles, social media posts, images, and videos. This dataset will serve as the foundation for training and evaluating the deep learning model. Special attention will be given to ensuring the dataset's diversity, balance, and representativeness across various domains, languages, and socio-political contexts. Furthermore, extensive preprocessing techniques will be applied to clean and standardize the data, including text normalization, tokenization, and feature extraction, to facilitate effective model training.


3: Feature Engineering and Representation Learning

In addition to leveraging raw textual and multimedia content, this project aims to explore advanced feature engineering and representation learning techniques to enhance the model's discriminatory power. This includes extracting linguistic, semantic, and socio-contextual features from text, images, and videos using techniques such as word embeddings, visual embeddings, and audio embeddings. By incorporating rich feature representations, the model can better discern subtle cues and nuances indicative of fake news, thereby improving its detection performance across diverse media formats and languages.

4: Real-time and Scalable Implementation

To address the practical requirements of real-time fake news detection in online platforms and social media networks, a key objective is to develop an efficient and scalable implementation of the deep learning model. This involves optimizing the model's architecture, inference algorithms, and deployment infrastructure to minimize latency and resource consumption while maintaining high accuracy. Furthermore, considerations will be given to scalability, allowing the model to handle large volumes of data and adapt to evolving news trends and propagation dynamics in dynamic online environments.

5: Evaluation and Benchmarking

Lastly, this project aims to rigorously evaluate the performance of the developed deep learning-based fake news detection method through comprehensive benchmarking against existing state-of-the-art approaches and baselines. Evaluation metrics such as precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC-ROC) will be employed to assess the model's effectiveness, robustness, and generalization across different datasets and evaluation scenarios. Additionally, qualitative analyses will be conducted to gain insights into the model's strengths, weaknesses, and areas for improvement, guiding future research directions and practical applications.


CHAPTER 2

LITERATURE SURVEY

Research in the field of fake news detection has been intense in recent years. However, most of the work in the area focuses on studying and detecting hoaxes on their main spreading channel: social media. Examples are [5] and [6], where the probability that a given post is false is studied using its own characteristics, such as likes, followers, and shares, through classical machine learning methods (classification trees, SVMs, etc.). Among these approaches, the best results are obtained in [7], where clickbait news is detected with an accuracy of 93%.

Other works, such as [8], use graph-based approaches to study the relations between users who share news and the path the shared content follows, in order to stop it and mitigate its potential deceptive effects. Although the general trend is to analyze the way hoaxes spread, alternatives focused on analyzing the content of the news have begun to appear. In [9], besides the features of the user who shares the news, the text itself is used to discriminate fake news. In [10], on the other hand, the statements in an article are studied in order to detect false facts in the content.

Regarding the use of modern deep learning algorithms, the company Fabula.ai, recently acquired by Twitter, uses a method that takes into account both the content of the news and features extracted from the social networks, achieving an AUC of 0.93 [11]. In [12], the performance of several algorithms (both classical and deep learning) is compared when categorizing news as true or fake, achieving an accuracy of 95%.


Finally, using only the content of the news, [13] proposes a convolutional neural network based technique to detect fake news from the title and the heading image, obtaining an accuracy of 92%.

One of the earlier studies on fake news detection and automatic fact-checking with more than a thousand samples was done by [4] using the LIAR dataset. The dataset contains 12.8K human-labeled short statements from POLITIFACT.COM, labeled in six categories: pants-fire, false, barely true, half true, mostly true, and true. The study used several classifiers, including logistic regression (LR), a support vector machine (SVM), a bidirectional long short-term memory (Bi-LSTM) network, and a convolutional neural network (CNN).

For LR and SVM, the study used the LIBSHORTTEXT toolkit, which has shown significant performance on short-text classification problems. The study compared several techniques using text features only and achieved accuracies of 0.204 and 0.208 on the validation and test sets. Due to overfitting, the Bi-LSTM did not perform well.

However, the CNN outperformed all models, achieving an accuracy of 0.270 on the held-out data. Similarly, another study compared three datasets: the LIAR dataset, the fake-or-real news dataset [5], and a dataset generated by collecting fake and real news from the Internet [6]. The study compared various conventional machine learning models, namely SVM, LR, decision tree (DT), AdaBoost (AB), Naive Bayes (NB), and K-nearest neighbors (KNN), using lexical, sentiment, unigram, and bigram features with term frequency-inverse document frequency (TF-IDF). Furthermore, several neural models, including a feedforward NN, CNN, LSTM, Bi-LSTM, hierarchical attention network (HAN), convolutional HAN, and character-level C-LSTM, were trained with GloVe and character embeddings. They found that the performance of the LSTM model depends heavily on the size of the dataset. The results showed that NB with n-gram (bigram TF-IDF) features produced the best outcome, approximately 0.94 accuracy, on the combined corpus dataset.

Conversely, the study by [4] indicated that the CNN model performed best on the LIAR dataset, whereas the study by [6] found the CNN model to be second-best across all datasets. The NB model showed the best performance on the LIAR dataset, with 0.60 accuracy and 0.59 F1-score. On the fake-or-real news dataset, the character-level C-LSTM performed best, with 0.95 accuracy and 0.95 F1-score. LSTM-based models showed the best outcome on the combined corpus dataset, where both Bi-LSTM and C-LSTM produced an accuracy of 0.95 and an F1-score of 0.95. Furthermore, Girgis et al. [3] studied the spread of fake news and used recurrent neural network (RNN) models (vanilla RNN, gated recurrent unit (GRU)) and long short-term memory (LSTM) networks on the LIAR dataset to predict fake news. They compared their results with Wang's [4] findings. Although similar results were achieved, the GRU (0.217) outperformed their other models. Nevertheless, in comparison with Wang's findings, they found that the CNN is better in terms of both speed and outcomes. Similarly, the authors in [7] used an LSTM model on the LIAR dataset and found that adding the speaker profile enhances the performance of the algorithm; the model achieved an accuracy of 0.415.

Moreover, the study by [8] proposed a novel approach to fake news detection using two metaheuristic algorithms, salp swarm optimization (SSO) and grey wolf optimization (GWO). The study performed experiments on three datasets: BuzzFeed Political News, Random Political News, and the LIAR benchmark. The results showed that the GWO algorithm outperformed SSO and the other algorithms: GWO obtained the best accuracy on all datasets and produced the highest precision and F1-score on two of the three datasets. Moreover, the precision of SSO on two of the three datasets was better than that of all the other algorithms. The results obtained from the two algorithms were very promising because the representation structure and flexible fitness function handled many different objectives simultaneously and efficiently. The study recommended that using different similarity metrics in model construction and testing improves the performance of the model.

Binary versions of metaheuristic optimization techniques can also be used on the converted document vectors, and adaptive and hybrid versions of the SSO and GWO algorithms were proposed to improve the results of the study. Another study [9] used a self multi-head attention-based CNN (SMHA-CNN). The study implemented CNN and self multi-head attention (SMHA) techniques and evaluated the truthfulness of news based on its content. The experiments were conducted on a public dataset collected from fakenews.mit.edu. The study conducted two experiments using 5-fold cross-validation, and the results showed that the model was effective in detecting fake news, with a precision of 0.95 and a recall of 0.95. The authors also compared their results with previous work and showed that the proposed technique, combining self multi-head attention with a CNN, achieved remarkable performance.

Additionally, the authors in [10] developed an exploratory analysis model using Facebook news from the 2016 US presidential election, based on the elaboration likelihood model and numerous cognitive and visual indicators of information, most of which have already been shown to affect the quality of online information. The study investigated how news posts' cognitive, visual, affective, and behavioral cues, together with features of the addressed user community, can be used by machine learning models to automatically detect fake news. The study used a BuzzFeed dataset of Facebook posts and trained several machine learning models appropriate for binary classification: LR, SVM, DT, random forest (RF), and extreme gradient boosting (XGB), all trained on the same feature set. The study achieved a highest accuracy of 0.80 and approximately 0.90 recall.

A study used a hybrid approach combining deep learning, natural language processing (NLP), and semantics on the LIAR and PolitiFact datasets [11]. The study compared the performance of classical machine learning models, namely multinomial Naïve Bayes (MNB), stochastic gradient descent (SGD), LR, DT, and SVM, with deep learning models, namely CNN, basic LSTM, Bi-LSTM, GRU, and CapsNet. The study found that CapsNet outperformed the other models, with an accuracy of 0.649 on the LIAR dataset. The integration of semantic features such as named entity recognition (NER) and sentiment into the LIAR dataset enhanced the performance of the classification model. Similarly, another study compared the performance of machine learning and deep learning models and found similar performance for SVM and Bi-LSTM, with an accuracy of 0.61 on the LIAR dataset [12]; however, the training time of the Bi-LSTM was very long. Recently, an ensemble-based machine learning approach was used for the classification of fake news on two datasets, LIAR and ISOT [13]. The ensemble model used DT, RF, and extra trees classifiers and achieved a testing accuracy of 44.15%.


CHAPTER 3
FEASIBILITY STUDY

Requirements are the basic constraints that must be satisfied to develop a system. Requirements are collected while designing the system.

The following are the requirements that are to be discussed.


1. Functional requirements
2. Non-Functional requirements
3. Technical requirements
A. Hardware requirements
B. Software requirements

3.1 Functional requirements


The software requirements specification is a technical specification of requirements for the software product. It is the first step in the requirements analysis process and lists the requirements of a particular software system. The implementation relies on specialized libraries such as scikit-learn, pandas, NumPy, Matplotlib, and seaborn.

3.2 Non-Functional Requirements


Process of functional steps:
I. Problem definition
II. Preparing data
III. Evaluating algorithms
IV. Improving results
V. Predicting the result


3.3 Technical Requirements

Software Requirements:
Operating System : Windows
Tool : Anaconda with Jupyter Notebook

Hardware requirements:
Processor : Pentium IV/III
Hard disk : minimum 80 GB
RAM : minimum 2 GB

3.4 Methods
3.4.1 Linear Regression

In the Deep Learning Based Fast Fake News Detection Method project, linear
regression plays a crucial role in feature engineering and model interpretation.
Initially, linear regression may be employed to identify relevant features from the
dataset that can effectively discriminate between real and fake news. Features such as
the frequency of certain words, sentiment scores, or linguistic patterns can be
quantified and used as input variables for the regression model. Through regression
analysis, the importance of each feature in predicting the authenticity of news can be
assessed, allowing for the selection of the most informative features for subsequent
modeling stages.

Once the relevant features are identified, linear regression can be utilized to build a baseline model for predicting the likelihood that a news article is fake. This initial model serves as a reference point for evaluating the performance of more complex deep learning models later in the project. By fitting a linear regression model to the training data, the relationships between the selected features and the target variable (i.e., the authenticity of the news) can be quantified, providing insights into the underlying patterns that differentiate real and fake news articles.

Furthermore, linear regression can be used for model interpretation and post hoc
analysis in the Deep Learning Based Fast Fake News Detection Method project. After
training more sophisticated deep learning models, such as recurrent neural networks
or convolutional neural networks, linear regression can help explain the predictions
made by these models. By examining the coefficients of the linear regression model,
researchers can gain insights into which features contribute most significantly to the
model's decisions, providing a better understanding of the factors influencing the
classification of news articles as real or fake. Overall, the integration of linear
regression techniques into the project enables not only feature selection and baseline
modeling but also enhances the interpretability of the final deep learning-based fake
news detection system.
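As a minimal illustration of such a baseline, the sketch below fits an ordinary least squares model on TF-IDF features and thresholds its output at 0.5. The file path and column names are assumptions for illustration, not the project's actual code.

```python
# Hedged baseline sketch: linear regression on TF-IDF features,
# thresholded at 0.5 to yield a fake/real decision.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("train.csv")                      # hypothetical dataset path
X_text, y = df["text"].fillna(""), df["label"]     # label: 1 = fake, 0 = real

X = TfidfVectorizer(max_features=20_000).fit_transform(X_text)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

reg = LinearRegression().fit(X_tr, y_tr)           # baseline regression model
y_pred = (reg.predict(X_te) >= 0.5).astype(int)    # threshold the score
print("baseline accuracy:", accuracy_score(y_te, y_pred))
```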

3.4.2 Decision Tree


In the Deep Learning Based Fast Fake News Detection Method project, decision trees
are employed as a crucial component for feature selection and classification. Initially,
decision trees are utilized to identify relevant features from the dataset, which may
include linguistic patterns, sentiment analysis scores, and other textual features
extracted from news articles or social media posts. Through an iterative process,
decision trees help discern which features contribute most significantly to
distinguishing between genuine and fake news, thereby optimizing the feature set for
subsequent classification tasks.
Once the relevant features are identified, decision trees are integrated into the
classification pipeline to efficiently categorize news articles or social media content as
either authentic or fabricated. Decision trees offer a transparent and interpretable
framework for understanding the decision-making process behind classifying each
piece of information. By traversing the tree's branches based on the extracted features,
the model can make informed decisions about the authenticity of the content with
high accuracy and speed, crucial for fast-paced environments such as social media
platforms where misinformation spreads rapidly.
Moreover, decision trees facilitate the creation of ensemble learning techniques, such as random forests or gradient boosting, to enhance the overall performance of the fake
news detection system. By combining multiple decision trees, each trained on
different subsets of the data or with different feature sets, ensemble methods can
mitigate overfitting and improve the robustness of the model. This approach enables
the Deep Learning Based Fast Fake News Detection Method project to achieve
superior detection accuracy and generalization capabilities, thereby providing more
reliable tools for combating the proliferation of misinformation in digital ecosystems.
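The sketch below illustrates the feature-selection use described above: a single tree is fit on TF-IDF features, and its learned importances are used to rank terms. The file path, column names, and hyperparameters are illustrative assumptions.

```python
# Hedged sketch: a decision tree over TF-IDF features, with the learned
# feature importances used to rank the most discriminative terms.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("train.csv")                       # hypothetical dataset path
vec = TfidfVectorizer(max_features=5_000)
X, y = vec.fit_transform(df["text"].fillna("")), df["label"]

tree = DecisionTreeClassifier(max_depth=10, random_state=0).fit(X, y)

# Rank terms by how much they contribute to the tree's splits.
terms = vec.get_feature_names_out()
top = sorted(zip(tree.feature_importances_, terms), reverse=True)[:10]
for importance, term in top:
    print(f"{term}: {importance:.3f}")
```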

3.4.3 Gradient Boosting Classifier

Gradient boosting is a powerful technique that has found extensive application in various machine learning tasks, including deep learning-based projects like fast fake news detection. In such projects, gradient boosting algorithms like XGBoost or LightGBM are often employed to enhance the predictive performance of the model. One of the primary advantages of gradient boosting lies in its ability to sequentially train weak learners, such as decision trees, with each new learner focusing on the instances that were poorly predicted in the preceding iterations. This iterative refinement process enables the model to progressively improve its predictive accuracy, making it well-suited for tasks like fake news detection, where the data is often noisy and complex.

In the context of deep learning-based fast fake news detection methods, gradient
boosting algorithms are typically utilized in conjunction with neural networks to boost
their performance. For instance, researchers may integrate gradient boosting models
with deep learning architectures such as convolutional neural networks (CNNs) or
recurrent neural networks (RNNs). By combining the strengths of both techniques, the
model can effectively capture complex patterns and relationships present in textual or
multimedia data, thereby improving the overall detection accuracy of fake news.

Moreover, gradient boosting techniques can play a crucial role in feature engineering and selection, which are essential steps in developing robust fake news detection systems. These algorithms can automatically identify and prioritize informative features from a large pool of potential predictors, thereby reducing the risk of overfitting and enhancing the model's generalization capabilities. Additionally, gradient boosting algorithms offer flexibility in handling various types of data sources, including text, images, and metadata, allowing researchers to create comprehensive and versatile fake news detection frameworks that can adapt to evolving information landscapes. By leveraging the strengths of gradient boosting in conjunction with deep learning techniques, researchers can develop fast and accurate solutions for combating the proliferation of fake news in online platforms.
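A minimal sketch of a boosted model on TF-IDF features, using scikit-learn's built-in implementation (XGBoost or LightGBM could be swapped in); the paths, column names, and hyperparameters are assumptions for illustration.

```python
# Hedged sketch: gradient boosting over TF-IDF features.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

df = pd.read_csv("train.csv")                       # hypothetical dataset path
X = TfidfVectorizer(max_features=5_000).fit_transform(df["text"].fillna(""))
X_tr, X_te, y_tr, y_te = train_test_split(X, df["label"], test_size=0.2,
                                          random_state=42)

# 200 shallow trees, each fit to the errors left by the previous ones.
gb = GradientBoostingClassifier(n_estimators=200, max_depth=3,
                                learning_rate=0.1, random_state=0)
gb.fit(X_tr, y_tr)
print("test accuracy:", gb.score(X_te, y_te))
```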

3.4.4 Random Forest

In the realm of Deep Learning Based Fast Fake News Detection, the integration of
Random Forest (RF) algorithms plays a pivotal role in enhancing model performance
and robustness. By incorporating RF into the framework, the system can efficiently
handle high-dimensional data and mitigate overfitting issues commonly encountered
in deep learning architectures. RF operates by constructing a multitude of decision
trees during training and outputs the mode of the classes for classification tasks. In the
context of fake news detection, RF aids in capturing intricate patterns and
relationships within the data, thereby enabling more accurate classification of news
articles as either genuine or fabricated.

One notable advantage of employing Random Forest in this project lies in its ability to
handle unbalanced datasets effectively. Fake news detection datasets often exhibit
class imbalance, where genuine news samples significantly outnumber fake ones. RF
tackles this challenge by weighting the classes appropriately during training, ensuring
that the model does not become biased towards the majority class. This capability
enhances the system's ability to identify subtle features indicative of fake news, thus
improving overall detection accuracy. Moreover, RF's inherent parallelism facilitates
expedited training and inference, aligning with the project's objective of achieving
fast fake news detection without compromising performance.

Furthermore, the ensemble nature of Random Forest contributes to the project's robustness against noise and outliers in the data. By aggregating predictions from multiple decision trees, RF diminishes the impact of individual errors, leading to more reliable classification outcomes. Additionally, RF's interpretability allows analysts to gain insights into the features driving the classification decisions, thereby facilitating model refinement and feature selection. Overall, the incorporation of Random Forest into the Deep Learning Based Fast Fake News Detection Method project not only enhances detection accuracy and efficiency but also fosters a deeper understanding of the underlying data dynamics, thus empowering stakeholders in the fight against misinformation.
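A minimal sketch of the class-weighted forest described above: scikit-learn's class_weight="balanced" option reweights classes inversely to their frequency, and n_jobs=-1 exploits the forest's parallelism. The data-loading details are illustrative assumptions.

```python
# Hedged sketch: a class-weighted random forest for imbalanced fake/real data.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score

df = pd.read_csv("train.csv")                       # hypothetical dataset path
X = TfidfVectorizer(max_features=10_000).fit_transform(df["text"].fillna(""))
y = df["label"]

# class_weight="balanced" reweights classes inversely to their frequency;
# n_jobs=-1 trains the trees in parallel.
rf = RandomForestClassifier(n_estimators=300, class_weight="balanced",
                            n_jobs=-1, random_state=0)
print("5-fold accuracy:", cross_val_score(rf, X, y, cv=5).mean())
```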

3.5 Python
Python is a widely used general-purpose, high-level programming language. It was created by Guido van Rossum in 1991 and is further developed by the Python Software Foundation. It was designed with an emphasis on code readability, and its syntax allows programmers to express concepts in fewer lines of code. Python lets you work quickly and integrate systems efficiently. There are two major Python versions, Python 2 and Python 3, which differ considerably.

3.5.1 Features of Python


Interpreted

● There are no separate compilation and execution steps as in C and C++.
● The program runs directly from the source code.
● Internally, Python converts the source code into an intermediate form called bytecode, which is then translated into the native language of the specific computer to run it.
● There is no need to worry about linking and loading with libraries, etc.

Platform Independent

● Python programs can be developed and executed on multiple operating system platforms.
● Python can be used on Linux, Windows, Macintosh, Solaris, and many more.
● Free and open source; redistributable.

High-level Language

● In Python, there is no need to take care of low-level details such as managing the memory used by the program.

Simple

● Closer to the English language; easy to learn.
● More emphasis on the solution to the problem than on the syntax.

Embeddable

● Python can be used within a C/C++ program to give scripting capabilities to the program's users.

Robust

● Exception handling features.
● Built-in memory management techniques.

Rich Library Support

● The Python Standard Library is very vast. Known as the "batteries included" philosophy of Python, it can help do various things involving regular expressions, documentation generation, unit testing, threading, databases, web browsers, CGI, email, XML, HTML, WAV files, cryptography, GUIs, and much more.
● Besides the standard library, there are various other high-quality libraries, such as the Python Imaging Library, which is an amazingly simple image manipulation library.


3.5.2 Python Libraries

3.5.2.1 Pandas
Pandas is a Python library used for working with data sets. It has functions for analyzing, cleaning, exploring, and manipulating data. The name "Pandas" refers to both "panel data" and "Python data analysis"; the library was created by Wes McKinney in 2008.
Pandas allows us to analyze big data and draw conclusions based on statistical theories, and it can clean messy data sets and make them readable and relevant. Relevant data is very important in data science.
Pandas answers questions about the data, such as: Is there a correlation between two or more columns? What is the average value? The maximum value? The minimum value?
Pandas can also delete rows that are not relevant or that contain wrong values, such as empty or NULL values. This is called cleaning the data.
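The toy snippet below illustrates these operations; the file name and columns are assumptions for illustration.

```python
import pandas as pd

df = pd.read_csv("news.csv")            # hypothetical data file
print(df.corr(numeric_only=True))       # correlation between numeric columns
print(df["word_count"].mean(),          # average value (assumed column)
      df["word_count"].max(),           # max value
      df["word_count"].min())           # min value
df = df.dropna()                        # drop rows with empty/NULL values
```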

3.5.2.2 NumPy
NumPy is a Python library used for working with arrays. It also has functions for working in the domains of linear algebra, Fourier transforms, and matrices. NumPy was created in 2005 by Travis Oliphant; it is an open-source project and can be used freely. NumPy stands for Numerical Python.
In Python, lists serve the purpose of arrays, but they are slow to process. NumPy aims to provide an array object that is up to 50x faster than traditional Python lists. The array object in NumPy is called ndarray; it provides many supporting functions that make working with ndarray very easy. Arrays are used very frequently in data science, where speed and resources are very important.
NumPy is written partially in Python, but most of the parts that require fast computation are written in C or C++.
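A minimal example of the ndarray object described above:

```python
import numpy as np

arr = np.array([1, 2, 3, 4])     # an ndarray
print(arr * 2)                   # vectorized arithmetic: [2 4 6 8]
print(arr.mean(), arr.dtype)     # built-in reductions and typed storage
```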

3.5.2.3 Matplotlib
Matplotlib is a low-level graph plotting library in Python that serves as a visualization utility. It was created by John D. Hunter and introduced in 2002; it is open source and can be used freely. Matplotlib is mostly written in Python, with a few segments written in C, Objective-C, and JavaScript for platform compatibility.
Matplotlib is an amazing visualization library in Python for 2D plots of arrays. It is a multi-platform data visualization library built on NumPy arrays and designed to work with the broader SciPy stack. One of the greatest benefits of visualization is that it gives us visual access to huge amounts of data in easily digestible form. Matplotlib supports several plot types, such as line, bar, scatter, and histogram plots.
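A minimal example, plotting an assumed class distribution as a bar chart:

```python
import matplotlib.pyplot as plt

counts = [120, 80]                      # illustrative class counts
plt.bar(["real", "fake"], counts)       # bar plot of the label distribution
plt.xlabel("class")
plt.ylabel("number of articles")
plt.show()
```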

3.5.2.4 Seaborn
Seaborn is a Python data visualization library based on Matplotlib. It provides a high-level, declarative, dataset-oriented interface for drawing attractive and informative statistical graphics, and it integrates closely with pandas data structures. Seaborn's functions make it easy to translate questions about data into graphics that can answer them: given a dataset and a specification of the plot to make, seaborn automatically maps the data values to visual attributes such as color, size, or style, internally computes statistical transformations, and decorates the plot with informative axis labels and a legend. Many seaborn functions can generate figures with multiple panels that elicit comparisons between conditional subsets of data or across different pairings of variables in a dataset. Seaborn is designed to be useful throughout the lifecycle of a scientific project: by producing complete graphics from a single function call with minimal arguments, it facilitates rapid prototyping and exploratory data analysis, and by offering extensive options for customization, along with exposing the underlying Matplotlib objects, it can be used to create polished, publication-quality figures.
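A one-call example in the spirit described above; the file and column names are assumptions:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("news.csv")            # hypothetical data file
sns.countplot(data=df, x="label")       # one bar per class label
plt.show()
```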

3.5.2.5 TensorFlow
TensorFlow is an open-source library for fast numerical computing. It was created and is maintained by Google and was released under the Apache 2.0 open-source license. The API is nominally for the Python programming language, although there is access to the underlying C++ API. Unlike other numerical libraries intended for use in deep learning, such as Theano, TensorFlow was designed for use both in research and development and in production systems, not least of which are RankBrain in Google Search and the fun DeepDream project. It can run on single-CPU systems and GPUs, as well as on mobile devices and large-scale distributed systems of hundreds of machines.
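A minimal eager-mode example of TensorFlow's numerical core:

```python
import tensorflow as tf

a = tf.constant([[1.0, 2.0]])            # 1x2 tensor
b = tf.constant([[3.0], [4.0]])          # 2x1 tensor
print(tf.matmul(a, b).numpy())           # eager matrix product: [[11.]]
```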

3.5.2.6 Keras

Keras runs on top of open-source machine learning libraries like TensorFlow, Theano, or the Cognitive Toolkit (CNTK). Theano is a Python library used for fast numerical computation tasks. TensorFlow is the most famous symbolic math library used for creating neural networks and deep learning models; it is very flexible, and its primary benefit is distributed computing. CNTK is a deep learning framework developed by Microsoft that can be used from Python, C#, or C++, or as a standalone machine learning toolkit. Theano and TensorFlow are very powerful libraries but difficult to understand when creating neural networks.
Keras is based on a minimal structure that provides a clean and easy way to create deep learning models on top of TensorFlow or Theano. Keras is designed to let you define deep learning models quickly, which makes it an optimal choice for deep learning applications.

Features

Keras leverages various optimization techniques to make its high-level neural network API easier to use:

● Consistent, simple, and extensible API.
● Minimal structure: easy to achieve the result without any frills.
● It supports multiple platforms and backends.
● It is a user-friendly framework that runs on both CPU and GPU.
● Highly scalable computation.

Benefits

Keras is a highly powerful and dynamic framework and comes with the following advantages:

● Large community support.
● Easy to test.
● Keras neural networks are written in Python, which makes things simpler.
● Keras supports both convolutional and recurrent networks.
● Deep learning models are built from discrete components that can be combined in many ways.


CHAPTER 4
METHODOLOGY

4.1 Dataset

Many real and fake news datasets are freely accessible on the internet. News websites such as the New York Times, Reuters, and the Washington Post, among others, are considered reliable sources of information. The data we use comes from websites like PolitiFact, Snopes, and Boom Live, all of which are fact-checking websites that host all kinds of news, from fake to real, gathered from various sources on the internet, and that also check whether articles are fake or real. We used two separate datasets for this project to be more inclusive of all forms of news. Previously, most papers centered on the PolitiFact website's politics dataset; we have therefore included news from entertainment, politics, sports, and other areas in this project.
The Kaggle dataset (D2) is maintained by the UTK Machine Learning Club and was collected from Kaggle [26]. It is divided into two files: the training dataset (20,386 articles) and the test dataset (5,126 articles), the latter used to test the model's performance. The dataset includes news from numerous topics, such as entertainment, crime, and sports, and is not limited to politics. It has five columns: id, title, author, text, and label.
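A quick way to inspect the dataset, assuming the two CSV files carry the standard Kaggle names; this is an illustrative sketch, not the project's exact code.

```python
import pandas as pd

train = pd.read_csv("train.csv")        # assumed file name, 20,386 articles
test = pd.read_csv("test.csv")          # assumed file name, 5,126 articles

print(train.columns.tolist())           # ['id', 'title', 'author', 'text', 'label']
print(train["label"].value_counts())    # class balance of real vs. fake
```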

Table 4.1. Descriptive table of the dataset.


4.2 Data Preprocessing

The goal of pre-processing is to remove noise. By removing unnecessary features from our text, we can reduce complexity and increase predictability (i.e., our model becomes faster and better). Removing punctuation, special characters, and 'filler' words (the, a, etc.) does not drastically change the meaning of a text. In the preprocessing, we preprocess the news column for fake news detection.

Lowercasing: Convert all text to lowercase. This ensures uniformity and helps the
model treat words with the same letters but different cases as the same.

Removing Punctuation: Remove any punctuation marks from the text. Punctuation
may not carry much information for fake news detection and removing it can help
simplify the text.

Tokenization: Tokenize the text, breaking it down into individual words or tokens.
This step is essential for further analysis as it allows the model to understand and
process the text at a granular level.

Removing Stop Words: Remove common stop words (e.g., "and," "the," "is") that do
not contribute much to the meaning of the text. This helps reduce noise in the data and
focuses on more meaningful words.

Stemming or Lemmatization: Reduce words to their root form. This can be done through stemming or lemmatization: stemming involves removing prefixes and suffixes, while lemmatization reduces words to their base or dictionary form. Choose the method that suits your requirements.

Removing Numbers: If numbers don't contribute significantly to fake news detection, you may consider removing them to simplify the text.

Handling Special Characters and URLs: Remove or replace special characters and
URLs as they may not provide meaningful information for fake news detection.

Spell Checking and Correction: Perform spell checking and correction to fix any
typos or misspelled words. This helps standardize the text and ensures that the model
doesn't misinterpret words due to spelling errors.

Handling Contractions: Expand contractions (e.g., "don't" to "do not") to ensure consistency in language and meaning.

Feature Engineering: Create additional features such as word counts, sentence lengths,
or any other features that may help capture relevant information for fake news
detection.

Vectorization: Convert the preprocessed text into numerical vectors using techniques
like TF-IDF (Term Frequency-Inverse Document Frequency) or word embeddings.
This step is crucial for inputting the data into machine learning models.
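A compact sketch of the pipeline above is shown below. It assumes NLTK's English stopword list is available and uses the column names of the Kaggle dataset from Section 4.1; it illustrates the steps rather than reproducing the project's exact code.

```python
# Hedged preprocessing sketch: lowercase, strip URLs/punctuation/numbers,
# remove stop words, stem, then vectorize with TF-IDF.
# Requires: nltk.download("stopwords")
import re
import string

import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def clean(text: str) -> str:
    text = text.lower()                                    # lowercasing
    text = re.sub(r"https?://\S+", " ", text)              # remove URLs
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\d+", " ", text)                       # remove numbers
    tokens = text.split()                                  # simple tokenization
    tokens = [stemmer.stem(t) for t in tokens if t not in stop_words]
    return " ".join(tokens)

df = pd.read_csv("train.csv")                  # assumed Kaggle file name
df["text"] = df["text"].fillna("").apply(clean)

# A modest vocabulary keeps later dense conversions manageable.
vectorizer = TfidfVectorizer(max_features=5_000)
X = vectorizer.fit_transform(df["text"])       # numerical vectors
y = df["label"]
```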

4.3 MACHINE LEARNING ALGORITHMS


The objective of this work is to analyze the possibilities offered by machine learning algorithms as tools for fake news classification as an alternative to traditional fact-checking methods. For this, the following algorithms were applied: K-NN, decision tree, random forest, and neural networks. The algorithms are described below.

4.3.1 K-NN Algorithm (K-Nearest Neighbors)

K-NN is a supervised algorithm (it therefore takes labeled data as input) that, for each unlabeled data sample, identifies the K closest samples in the input data and assigns to the unlabeled sample the class of the majority of those K nearest neighbors. The algorithm requires a function to calculate the distance between samples.

In theory, to choose the number K of closest samples to consider and to avoid overfitting and underfitting, a bias-variance tradeoff is performed. It is a balance between reducing the impact of data with high variance and not ignoring the trends present in small amounts of data: if a very high value is chosen for K, the model will always predict the most common class, and if a very small value is chosen, the result will be very noisy. In practice, the selection of the value of K depends on each case and on the amount of data used to train the model. However, there are several widely used techniques, such as choosing the square root of the number of samples as a starting point and iterating over different samples of training data to converge on a suitable value of K, or using a high value of K but applying a weight function that gives more importance to the neighbors closest to the sample being classified.
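A small sketch of these heuristics, assuming scikit-learn and TF-IDF train/validation splits (X_tr, y_tr, X_te, y_te) like those built in Section 4.2; the square-root starting point and distance weighting mirror the techniques described above.

```python
import math

from sklearn.neighbors import KNeighborsClassifier

# Suppose X_tr, y_tr, X_te, y_te are TF-IDF features and labels from earlier.
k = int(math.sqrt(X_tr.shape[0]))               # square-root heuristic for K
knn = KNeighborsClassifier(n_neighbors=k,
                           weights="distance",  # closer neighbors count more
                           metric="cosine")     # a common choice for text
knn.fit(X_tr, y_tr)
print("validation accuracy:", knn.score(X_te, y_te))
```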

4.3.2 Decision Trees

A decision tree is an algorithm that uses a tree structure to model the relationships between variables. The algorithm starts from an element called the root and divides into increasingly narrow branches. Each division corresponds to a decision and is represented by a decision node. The process ends at leaf nodes, which represent data sufficiently homogeneous that it cannot be divided further.
One difficulty of this algorithm is identifying the variable on which each division of the tree should be performed, the goal being to reach nodes that contain a single class. To identify the best division candidate, two measures are used: entropy and the Gini index. Entropy quantifies the randomness within a set of class values, such that sets with high entropy are very diverse and offer little information about other aspects of the set. The decision tree therefore tends to find divisions that decrease entropy, increasing homogeneity within groups. The Gini index, on the other hand, measures the probability that an element is misclassified when it is labeled randomly. Its value varies between 0 (all the elements belong to a particular class, or there is only one class) and 1 (the elements are randomly distributed across different classes); a value of 0.5 indicates that the elements are equally distributed across classes. When generating a decision tree, the variable with the smallest possible Gini index is therefore preferred as the root element.
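For reference, the standard definitions of the two measures described above, where \(p_i\) is the proportion of class \(i\) among the \(C\) classes at a node:

\[ \text{Entropy} = -\sum_{i=1}^{C} p_i \log_2 p_i, \qquad \text{Gini} = 1 - \sum_{i=1}^{C} p_i^2 \]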

4.3.3 Random Forest

Random forest is an algorithm that uses sets of decision trees, combining the principle of bagging (averaging many noisy models in order to reduce variance) with random selection of features to add additional diversity to the decision tree models. Once the set of decision trees has been generated, the model combines the predictions of the trees by voting.
The main advantages of this algorithm are that it reduces the possibility of overfitting, it allows unimportant variables to be eliminated, and it can work with noisy or missing data, with categorical or continuous numerical variables, and with large numbers of variables or samples. Its main disadvantage, however, is the difficulty of interpreting and visualizing the model.

4.3.4 Neural Networks

A neural network is an algorithm that models the relationship between a set of input values and a set of output values using a network of nodes called artificial neurons. In the network, a weight is applied to each input to represent its importance in the relationship. The weighted inputs are then summed, and a function called the activation function is applied to the result, transforming the combined inputs into an output. This function represents the way information is processed and transmitted through the network, and several types of activation function exist. To select which one to use, the type of learning to be carried out, the bounds of the functions (in X and Y), the variations among different functions, and the network architecture must be taken into account.
The architecture refers to the way neurons of the same type are grouped: it describes the number of neurons in the model, the number of layers, and the way they are interconnected. Each grouping is called a layer, and each architecture makes it possible to model a particular kind of data. There are three types of layers: input layers (which receive data), output layers (which provide the network's response to the input), and hidden layers (which neither receive nor supply external information and represent the network's internal processing). According to the number of layers, neural networks can be classified as single-layer networks (a single input layer connected directly to the output layer) or multi-layer networks (one or more layers between the inputs and the outputs, where, in general, every node of one layer is connected to every node of the next layer). Generally, connections run between neurons of different layers, but there may also be intra-layer or lateral connections, as well as feedback connections that run opposite to the input-output direction. According to the direction of information flow, networks can be classified as feedforward networks (unidirectional information flow) or recurrent networks (information flows in both directions using loops and delays).
Finally, the learning algorithm of a network specifies how to set the weights. It depends on the learning paradigm (supervised if the expected output is known, unsupervised if it is not), the learning rules, and the type of learning algorithm (based on error minimization, on random parameters, on competitive strategies, or on Hebb's law). Although the learning process (setting the weights) is complex, it has the advantage that, once trained, the network keeps its weights.

4.6 Key Performance Indicators


This section defines the metrics (or KPIs, Key Performance Indicators) used to evaluate the results of the algorithms:

1. Accuracy
Accuracy is a numeric value indicating the overall performance of the predictive model. It is calculated as follows:

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1} \]
where:
TP: True Positive. Result in which the model correctly predicts the positive class.
FP: False Positive. Result in which the model incorrectly predicts the positive class.
TN: True Negative. Result in which the model correctly predicts the negative class.
FN: False Negative. Result in which the model incorrectly predicts the negative class.

2. Kappa statistic
It measures the agreement between two examiners in their corresponding
classifications of N elements into C mutually exclusive categories. In the case of
machine learning, it refers to the actual class and the class expected by the model
used. It is calculated as follows:

\[ \kappa = \frac{\Pr(a) - \Pr(e)}{1 - \Pr(e)} \tag{2} \]
where:
Pr (a) is the observed relative agreement between observers, and Pr (e) is the
hypothesized probability of agreement by chance, using the observed data to compute
the probabilities that each observer randomly ranks each category. If raters fully
agree, then κ = 1. If there is no agreement between raters other than what would be
expected by chance (as defined by Pr (e)), then κ = 0.

3. Logarithmic Loss
It is the negative average of the log of corrected predicted probabilities for each
instance. It is calculated as follows:

\[ \text{LogLoss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log(p_{ij}) \tag{3} \]
where:
N is the number of samples;
M is the number of classes;
\(y_{ij}\) indicates whether sample i belongs to class j;
\(p_{ij}\) is the predicted probability that sample i belongs to class j.

4. Error


The error gives an indication of how far the predictions are from the actual output. Two common formulas are the Mean Absolute Error (MAE) and the Mean Squared Error (MSE). They are calculated as follows:

\[ \text{MAE} = \frac{1}{N} \sum_{k=1}^{N} \left| y_k - \hat{y}_k \right| \tag{4} \]

\[ \text{MSE} = \frac{1}{N} \sum_{k=1}^{N} \left( y_k - \hat{y}_k \right)^2 \tag{5} \]
where:
N corresponds to the total number of samples;
\(y_k\) corresponds to the class indicated by the classification model;
\(\hat{y}_k\) corresponds to the actual class.

5. Sensitivity
The sensitivity of a model (or the ratio of true positives) measures the proportion of
correctly classified positive examples. The total number of positives is the sum of
those that were correctly classified and those that were incorrectly classified. It is
calculated as follows:

Sensitivity = TP / (TP + FN)    (6)
where:
TP: True Positive. Result in which the model correctly predicts the positive class.
FN: False Negative. Result in which the model incorrectly predicts the negative class.
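Since sensitivity is the recall of the positive class, it can be sketched with recall_score (hypothetical labels again):

```python
from sklearn.metrics import recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(recall_score(y_true, y_pred))  # Eq. (6): TP / (TP + FN)
```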

6. Specificity
The specificity of a model (or the ratio of true negatives) measures the proportion of
correctly classified negative examples. The total number of negatives is the sum of
those that were correctly classified and those that were incorrectly classified. It is
calculated as follows:


Specificity = TN / (TN + FP)    (7)
where:
TN: True Negative. Result in which the model correctly predicts the negative class.
FP: False Positive. Result in which the model incorrectly predicts the positive class.
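scikit-learn has no dedicated specificity function, so one hedged option is to derive it from the confusion matrix:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn / (tn + fp))  # Eq. (7): the recall of the negative class
```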

7. Precision
Precision is defined as the proportion of examples classified as positive that are actually positive; that is, it measures how often the model is correct when it predicts the positive class. It is calculated as follows:

Precision = TP / (TP + FP)    (8)
where:
TP: True Positive. Result in which the model correctly predicts the positive class.
FP: False Positive. Result in which the model incorrectly predicts the positive class.

8. Recall
Recall is defined as the number of correctly classified positives over the total number
of positives. This formula is the same as that for sensitivity. It is calculated as follows:

Recall = TP / (TP + FN)    (9)
where:
TP: True Positive. Result in which the model correctly predicts the positive class.
FN: False Negative. Result in which the model incorrectly predicts the negative class.

9. F-measure

The F-measure (also called F1 score or F-score) is a measure of model performance that combines precision and recall into a single value using the harmonic mean, which is used for ratios. This type of averaging is used instead of the arithmetic mean because precision and recall are expressed as proportions between 0 and 1. It is calculated as follows:

F-measure = 2 · (Precision · Recall) / (Precision + Recall)    (10)
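Precision, recall, and the F-measure can all be sketched together (hypothetical labels; f1_score should match the manual harmonic mean):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
p = precision_score(y_true, y_pred)  # Eq. (8)
r = recall_score(y_true, y_pred)     # Eq. (9)
f = f1_score(y_true, y_pred)         # Eq. (10)
print(p, r, f, 2 * p * r / (p + r))  # the last two values should agree
```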

Chapter 5

SYSTEM DESIGN

5.1 Existing System


In the era preceding advanced AI-driven systems, fast fake news detection methods
often relied on rudimentary techniques such as keyword matching, source reputation
analysis, and basic linguistic pattern recognition. These systems typically scanned
news articles for predefined suspicious terms or phrases, cross-referenced against a
database of known fake news sources, and flagged content based on linguistic cues
indicative of misleading or fabricated information. However, their effectiveness was
limited by their reliance on manual input for updating databases and their inability to
adapt to evolving forms of misinformation.

5.2 Proposed System


The proposed system for the Deep Learning Based Fast Fake News Detection Method
project aims to leverage advanced deep learning architectures to efficiently identify
and categorize fake news instances from large volumes of textual data. By utilizing
convolutional neural networks (CNNs) and recurrent neural networks (RNNs) in
conjunction with natural language processing techniques, the system will extract
semantic features and patterns indicative of misinformation, enabling rapid and accurate classification. Additionally, attention mechanisms will be incorporated to prioritize relevant information within texts, enhancing the model's discernment capabilities. Through this approach, the system seeks to contribute to the ongoing battle against the proliferation of fake news by providing a fast and reliable method for its detection.
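A minimal Keras sketch of the kind of CNN + RNN architecture with attention described above; the vocabulary size, sequence length, and layer widths are illustrative assumptions, not values fixed by this report:

```python
from tensorflow.keras import layers, models

VOCAB_SIZE, SEQ_LEN = 20000, 300  # assumed preprocessing parameters

inputs = layers.Input(shape=(SEQ_LEN,))
x = layers.Embedding(VOCAB_SIZE, 128)(inputs)            # word vectors
x = layers.Conv1D(64, 5, activation="relu")(x)           # local n-gram features (CNN)
x = layers.MaxPooling1D(2)(x)
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)  # sequence context (RNN)
attn = layers.Attention()([x, x])                        # self-attention over time steps
x = layers.GlobalAveragePooling1D()(attn)
outputs = layers.Dense(1, activation="sigmoid")(x)       # genuine (0) vs fake (1)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```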


5.3 UML Diagrams


5.3.1 Data Flow Diagram

The following data flow diagram describes the proposed system for the whole project.

Fig 5.1 Data Flow Diagram

In the figure 5.1 the data flow diagram for the Deep Learning Based Fast Fake News
Detection Method project illustrates a streamlined process starting with the input of
textual data from various sources such as news articles, social media posts, and online
forums. This input data undergoes preprocessing steps including tokenization,
stemming, and stop-word removal to enhance its suitability for deep learning analysis.
Subsequently, the preprocessed data flows into the deep learning model, which
employs convolutional neural networks (CNNs) and recurrent neural networks
(RNNs) to extract meaningful features and classify the text as either genuine or fake
news. The model's predictions are then outputted, along with confidence scores, for
further analysis or dissemination. Additionally, feedback loops may be integrated to
continually improve the model's performance through iterative learning from labeled
data.


5.3.2 Use Case Diagram

A use case diagram in the Unified Modeling Language (UML) is a type of behavioral
diagram defined by and created from a Use-case analysis. Its purpose is to present a
graphical overview of the functionality provided by a system in terms of actors, their
goals (represented as use cases), and any dependencies between those use cases. The
main purpose of a use case diagram is to show what system functions are performed
for which actor. Roles of the actors in the system can be depicted.

Fig 5.2 Use Case Diagram


5.3.3 Class diagram

The class diagram identifies the main classes and their relationships within the fake news detection system.

Fig 5.3 Class Diagram


5.3.4 Sequence diagram


A sequence diagram in Unified Modeling Language (UML) is a kind of interaction
diagram that shows how processes operate with one another and in what order. It is a
construct of a Message Sequence Chart. Sequence diagrams are sometimes called
event diagrams, event scenarios, and timing diagrams.

Fig 5.4 Sequence Diagram


5.3.5 Activity Diagram for Splitting of Data

Activity diagrams are graphical representations of workflows of stepwise activities and actions with support for choice, iteration and concurrency. In the Unified Modeling Language, activity diagrams can be used to describe the business and operational step-by-step workflows of components in a system. An activity diagram shows the overall flow of control.

Fig 5.5 Activity Diagram for Splitting of data


Chapter 6
RESULTS AND DISCUSSIONS

6.1 Data gathering


The dataset contains two types of articles: fake and real news. This dataset was collected from real-world sources; the truthful articles were obtained by crawling articles from Reuters.com (a news website). The fake news articles were collected from different sources, namely unreliable websites that were flagged by PolitiFact (a fact-checking organization in the USA) and Wikipedia. The dataset contains different types of articles on different topics; however, the majority of articles focus on political and world news topics.

Fake news samples

The dataset consists of two CSV files. The first file, "Fake.csv", contains more than 12,600 articles collected from different fake news outlet resources.

Fig 6.1 top 5 rows of fake news


True news samples

The second file, "True.csv", contains more than 12,600 articles from reuters.com.

Fig 6.2 top 5 rows of true news

Each article contains the following information: article title, text, type, and the date the article was published on. To match the fake news data collected from kaggle.com, we focused mostly on collecting articles from 2016 to 2017. The data collected were cleaned and processed; however, the punctuation and mistakes that existed in the fake news were kept in the text.
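A hedged loading step with pandas, assuming the two CSV files sit in the working directory under the names given above:

```python
import pandas as pd

df_true = pd.read_csv("True.csv")  # truthful articles (Reuters)
df_fake = pd.read_csv("Fake.csv")  # articles from unreliable outlets

print(df_true.shape, df_fake.shape)
print(df_true.columns.tolist())    # expected: title, text, type/subject, date
```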


6.2 Data Preprocessing

6.2.1 Inserting a column "class" as target feature


There are 23,481 fake news articles, which come under class 0, and 21,417 true news articles, which come under class 1.
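A short sketch of how the target column can be inserted (frame names follow the hypothetical loading step above):

```python
# Label the two frames: 0 = fake news, 1 = true news
df_fake["class"] = 0
df_true["class"] = 1
print(len(df_fake), len(df_true))  # 23481 and 21417 rows respectively
```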

A sample of fake news with target class 0 is shown below.

Fig 6.3 sample dataset with target class as 0


A sample of true news with target class 1 is shown below.

Fig 6.4 sample dataset with target class as 1

6.2.2 Random Shuffling the data frame

By shuffling the dataset randomly before splitting it into training, validation, and
testing sets, biases or patterns in the data ordering are mitigated, preventing the model
from inadvertently learning spurious correlations. This process aids in achieving
better generalization performance by introducing diversity and randomness, enabling
the model to learn more effectively from various aspects of the data. Additionally,
random shuffling enhances the reliability of performance metrics obtained during
evaluation, providing a more accurate assessment of the model's capability to discern
fake news from genuine ones.
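One common way to shuffle, sketched with pandas (assuming the labeled frames from the previous step):

```python
import pandas as pd

# Merge both classes into a single frame, then shuffle the rows
df = pd.concat([df_fake, df_true], axis=0)
df = df.sample(frac=1, random_state=42).reset_index(drop=True)  # fixed seed for reproducibility
```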

Fig 6.5 dataframe after shuffling samples


6.2.3 Creating a function to process the texts


In the realm of deep learning-based fast fake news detection, a crucial step involves processing the text to eliminate noise and anomalies that might obscure genuine patterns. This processing typically entails tokenization, where the text is broken down into individual words or subwords, followed by the removal of stop words, punctuation, and special characters.
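A minimal cleaning function along these lines (the exact rules used in the project are not shown, so this is an assumption-laden sketch):

```python
import re
import string

def preprocess_text(text):
    """Lowercase a news text and strip URLs, HTML tags, punctuation and digits."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", "", text)                # URLs
    text = re.sub(r"<.*?>", "", text)                                # HTML tags
    text = re.sub(r"[%s]" % re.escape(string.punctuation), "", text)
    text = re.sub(r"\d+", "", text)                                  # digits
    return re.sub(r"\s+", " ", text).strip()                         # extra whitespace

df["text"] = df["text"].apply(preprocess_text)
```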

6.2.4 Convert text to vectors


Subsequently, the text is converted into numerical vectors, often using techniques
such as word embeddings or transformers, which represent each word or subword as a
high-dimensional vector. These vectors capture semantic relationships and contextual
information, enabling deep learning models to effectively discern between authentic
and fake news based on underlying patterns in the text.
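As one lightweight vectorization option (the report also mentions embeddings and transformers), a TF-IDF sketch with a train/test split:

```python
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

x_train, x_test, y_train, y_test = train_test_split(
    df["text"], df["class"], test_size=0.25, random_state=42)

vectorizer = TfidfVectorizer()
xv_train = vectorizer.fit_transform(x_train)  # fit the vocabulary on training data only
xv_test = vectorizer.transform(x_test)        # reuse it for the test data
```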


6.3 Result analysis

6.3.1 Logistic Regression

Logistic Regression is a commonly used algorithm for binary classification tasks. It's
simple, interpretable, and can be a good choice for text-based classification problems
like fake news detection.
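A hedged training sketch, reusing the TF-IDF features from the previous step:

```python
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=1000)  # raised iteration cap for sparse text features
lr.fit(xv_train, y_train)
print(lr.score(xv_test, y_test))        # held-out accuracy
```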

6.3.2 Decision Tree

Using a Decision Tree algorithm for fake news detection can be a reasonable choice, especially for its interpretability and simplicity. Decision Trees are effective at handling categorical and numerical data, making them suitable for text classification tasks.
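The corresponding sketch for a decision tree (same assumed features):

```python
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(random_state=0)  # depth can be limited to curb overfitting
dt.fit(xv_train, y_train)
print(dt.score(xv_test, y_test))
```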


6.3.3 Gradient Boosting

Boosting algorithms like XGBoost, LightGBM, and AdaBoost are known for their high performance. They can handle complex relationships in data and are effective in ensemble settings.
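An illustrative sketch using scikit-learn's own gradient boosting implementation (XGBoost or LightGBM could be substituted):

```python
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier(random_state=0)  # sequential ensemble of weak learners
gbc.fit(xv_train, y_train)
print(gbc.score(xv_test, y_test))
```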

6.3.4 Random Forest Classifier

Random Forest is an ensemble learning method that combines multiple decision trees.
It is robust and can handle noisy data well. Random Forest can be effective for fake
news detection, especially when dealing with diverse and complex datasets.
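And the random forest counterpart, again under the same assumptions:

```python
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=100, random_state=0)  # 100-tree ensemble
rfc.fit(xv_train, y_train)
print(rfc.score(xv_test, y_test))
```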

6.4 Model Evaluation


LR (Logistic Regression) determines the likelihood of a piece of news being fake based on a linear combination of features. DT (Decision Tree) splits the data into subsets based on feature thresholds, recursively learning decision rules to classify news. GBC (Gradient Boosting Classifier) sequentially builds a series of weak learners to improve classification accuracy. RFC (Random Forest Classifier) constructs multiple decision trees and combines their predictions to enhance robustness against overfitting.


Through iterative training on labeled data, these models optimize their parameters to
effectively discriminate between genuine and fake news, providing valuable tools in
the fight against misinformation.
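A hedged evaluation sketch that prints per-class precision, recall and F1 for each of the four fitted models:

```python
from sklearn.metrics import classification_report

for name, model in [("LR", lr), ("DT", dt), ("GBC", gbc), ("RFC", rfc)]:
    print(name)
    print(classification_report(y_test, model.predict(xv_test)))
```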

Fig 6.6 sample result of model evaluation.


Chapter 7
CONCLUSION AND FUTURE SCOPE

In conclusion, the best neural network model trained in this project work can achieve up to 100% accuracy with 100% recall, accurately detecting fake news with very few mistakes. Neural network models trained with N-gram vectors perform slightly better than models trained with sequence vectors, mainly because N-gram vectors use TF-IDF, which relies not only on term frequency but also on a weight score that emphasizes more important terms. Models trained with news titles are suitable for social media applications where users respond quickly to any updates or incoming messages, due to their fast computation time and high recall rate (low mistake rate). With fast computation, any message that spreads fake news can be stopped. On the other hand, for social media applications whose feeds are updated from time to time, fast computation time is not crucial; therefore, models trained with news content, which achieve higher accuracy and recall, will be the better choice to accurately detect fake news and stop users from spreading it.

For future improvement, the Keras neural network models can be further improved by tuning the parameters to achieve even higher accuracy and recall. A Recurrent Neural Network (RNN) with the long short-term memory (LSTM) algorithm can also be used to further enhance fake news detection performance with NLP. Furthermore, research can be done on the images, videos, and text embedded in images of the news to further improve the models. Finally, to implement this solution in Malaysia, similar approaches or techniques can be used to train the models with news datasets collected in Malaysia.
