
International Journal of Data Science and Analytics (2022) 13:335–362

https://doi.org/10.1007/s41060-021-00302-z

REGULAR PAPER

Fake news detection based on news content and social contexts: a transformer-based approach
Shaina Raza1 · Chen Ding1

Received: 24 April 2021 / Accepted: 13 December 2021 / Published online: 30 January 2022
© Crown 2021

Abstract
Fake news is a real problem in today’s world, and it has become more extensive and harder to identify. A major challenge
in fake news detection is to detect it in the early phase. Another challenge in fake news detection is the unavailability or the
shortage of labelled data for training the detection models. We propose a novel fake news detection framework that can address
these challenges. Our proposed framework exploits the information from the news articles and the social contexts to detect
fake news. The proposed model is based on a Transformer architecture, which has two parts: the encoder part to learn useful
representations from the fake news data and the decoder part that predicts the future behaviour based on past observations.
We also incorporate many features from the news content and social contexts into our model to help us classify the news
better. In addition, we propose an effective labelling technique to address the label shortage problem. Experimental results
on real-world data show that our model can detect fake news with higher accuracy than the baselines, within a few minutes of the news starting to propagate (early detection).

Keywords Fake news · Social contexts · Concept drift · Weak supervision · Transformer · User credibility · Zero shot learning

1 Introduction

Fake news detection is a subtask of text classification [1] and is often defined as the task of classifying news as real or fake. The term ‘fake news’ refers to the false or misleading information that appears as real news. It aims to deceive or mislead people. Fake news comes in many forms, such as clickbait (misleading headlines), disinformation (with malicious intention to mislead the public), misinformation (false information regardless of the motive behind), hoax, parody, satire, rumour, deceptive news and other forms as discussed in the literature [2].

Fake news is not a new topic; however, it has become a hot topic since the 2016 US election. Traditionally, people get news from trusted sources, media outlets and editors, usually following a strict code of practice. In the late twentieth century, the internet has provided a new way to consume, publish and share information with little or no editorial standards. Lately, social media has become a significant source of news for many people. According to a report by Statistica,1 there are around 3.6 billion social media users (about half the population) in the world. There are obvious benefits of social media sites and networks in news dissemination, such as instantaneous access to information, free distribution, no time limit, and variety. However, these platforms are largely unregulated. Therefore, it is often difficult to tell whether some news is real or fake.

Recent studies [2–4] show that the speed at which fake news travels is unprecedented, and the outcome is its wide-scale proliferation. A clear example of this is the spread of anti-vaccination misinformation2 and the rumour that incorrectly compared the number of registered voters in 2018 to the number of votes cast in US Elections 2020.3 The implications of such news are seen during the anti-vaccine movements that prevented the global fight against COVID-19 or in post-election unrest.

Corresponding author: Shaina Raza, [email protected] · Chen Ding, [email protected]
1 Ryerson University, Toronto, ON, Canada

1 https://www.statista.com/statistics/278414/number-of-worldwide-social-network-users/.
2 https://www.wrcbtv.com/story/43076383/first-doses-of-covid19-vaccines-administered-at-chattanooga-hospital-on-thursday.
3 https://archive.is/OXJ60.


Therefore, it is critically important to stop the spread of fake news at an early stage.

A significant research gap in the current state-of-the-art is that it focuses primarily on fake news detection rather than early fake news detection. The seminal works [4, 5] on early detection of fake news usually detect the fake news after at least 12 h of news propagation, which may be too late [6]. An effective model should be able to detect fake news early, which is the motivation of this research.

Another issue that we want to highlight here is the scarcity of labelled fake news data (news labelled as real or fake) in real-world scenarios. Existing state-of-the-art works [4, 7, 8] generally use fully labelled data to classify fake news. However, the real-world data is likely to be largely unlabelled [5]. Considering the practical constraints, such as unavailability of the domain experts for labelling, cost of manual labelling, and difficulty of choosing a proper label for each news item, we need to find an effective way to train a large-scale model. One alternative approach is to leverage noisy, limited, or imprecise sources to supervise labelling of large amounts of training data. The idea is that the training labels may be imprecise and partial but can be used to create a strong predictive model. This scheme of training labels is the weak supervision technique [9].

Usually, the fake news detection methods are trained on the current data (available during that time), which may not generalize to future events. Many of the labelled samples from the verified fake news get outdated soon with the newly developed events. For example, a model trained on fake news data before COVID-19 may not classify fake news properly during COVID-19. The problem of dealing with a target concept (e.g. news as ‘real’ or ‘fake’) when the underlying relationship between the input data and target variable changes over time is called concept drift [10]. In this paper, we investigate whether concept drift affects the performance of our detection model, and if so, how we can mitigate it.

This paper addresses the challenges mentioned above (early fake news detection and scarcity of labelled data) to identify fake news. We propose a novel framework based on a deep neural network architecture for fake news detection. The existing works, in this regard, rely on the content of news [7, 11, 12], social contexts [1, 4, 5, 8, 13, 14], or both [4, 8, 15]. We include a broader set of news-related features and social context features compared to the previous works. We try to detect fake news early (i.e. after a few minutes of news propagation). We address the label shortage problem that happens in real-world scenarios. Furthermore, our model can combat concept drift.

Inspired by the bidirectional and autoregressive Transformer (BART) [16] model from Facebook that is successfully used in language modelling tasks, we propose to apply a deep bidirectional encoder and a left-to-right decoder under the hood of one unified model for the task of fake news detection. We choose to work with the BART model over the state-of-the-art BERT model [17], which has demonstrated its abilities in NLP (natural language processing) tasks (e.g. question answering and language inference), as well as the GPT-2 model [18], which has impressive autoregressive (time-series) properties. The main reason is that the BART model combines the unique features (bidirectional and autoregressive) of both text generation and temporal modelling, which we require to meet our goals.

Though we take inspiration from BART, our model is different from the original BART in the following aspects: (1) in comparison with the original BART, which takes a single sentence/document as input, we incorporate a rich set of features (from news content and social contexts) into the encoder part; (2) we use a decoder to get predictions not only from previous text sequences (in this case, news articles) as in the original BART but also from previous user behaviour (how users respond to those articles) sequences, and we detect fake news early by temporally modelling user behaviour; (3) on top of the original BART model, we add a single linear layer to classify news as fake or real.

Our contributions are summarized as follows:

1. We propose a novel framework that exploits news content and social contexts to learn useful representations for predicting fake news. Our model is based on a Transformer [19] architecture, which facilitates representation learning from fake news data and helps us detect fake news early. We also use the side information (metadata) from the news content and the social contexts to support our model to classify the truth better.
2. We present a systematic approach to investigate the relationship between the user profile and news veracity. We propose a novel Transformer-based model using zero-shot learning [20] to determine the credibility levels of the users. The advantage of our approach is that it can determine the credibility of both long-term and new users, and it can detect the malicious users who often change their tactics to come back to the system or vulnerable users who spread misinformation.
3. We propose a novel weak supervision model to label the news articles. The proposed model is an effective labelling technique that lessens the burden of extensive labelling tasks. With this approach, the labels can be extracted instantaneously from known sources and updated in real-time.

We evaluate our system by conducting experiments on real-world datasets: (i) NELA-GT-19 [21] that consists of news articles from multiple sources and (ii) Fakeddit [22] that is a multi-modal dataset containing text and images in posts on the social media website Reddit. While the social contexts used in this model are from Reddit, consisting of upvotes, downvotes, and comments on posts, the same model can be generalized to fit other social media datasets.

The same method is also generalizable for any other news dataset. The results show that our proposed model can detect fake news earlier and more accurately than baselines.

The rest of the paper is organized as follows. Section 2 is the related work. Section 3 discusses the proposed framework. Section 4 explains the details of our fake news detection model, Sect. 5 describes the experimental set-up, and Sect. 6 shows the results and analyses. Finally, Sect. 7 is about the limitations, and Sect. 8 gives the conclusion and lists the future directions.

2 Literature review

Fake news is information that is false or misleading and is presented as real news [23]. The term ‘fake news’ became mainstream during the 2016 presidential elections in the United States. Following this, Google, Twitter, Facebook took steps to combat fake news. However, due to the exponential growth of information in online news portals and social media sites, distinguishing between real and fake news has become difficult.

In the state-of-the-art, the fake news detection methods are categorized into two types: (1) manual fact-checking; (2) automatic detection methods. Fact-checking websites, such as Reporterslab,4 Politifact5 and others [2], rely on human judgement to decide the truthfulness of some news. Crowdsourcing, e.g. Amazon’s Mechanical Turk,6 is also used for detecting fake news in online social networks. These fact-checking methods provide the ground truth (true/false labels) to determine the veracity of news. The manual fact-checking methods have some limitations: 1) it is time-consuming to detect and report every fake news produced on the internet; 2) it is challenging to scale well with the bulks of newly created news, especially on social media; 3) it is quite possible that the fact-checkers’ biases (such as gender, race, prejudices) may affect the ground truth label.

The automatic detection methods are alternative to the manual fact-checking ones, which are widely used to detect the veracity of the news. In the previous research, the characteristics of fake news are usually extracted from the news-related features (e.g. news content) [21] or from the social contexts (social engagements of the users) [4, 22, 24] using automatic detection methods.

The content-based methods [25–28] use various types of information from the news, such as article content, news source, headline, image/video, to build fake news detection classifiers. Most content-based methods use stylometry features (e.g. sentence segmentation, tokenization, and POS tagging) and linguistic features (e.g. lexical features, bag-of-words, frequency of words, case schemes) of the news articles to capture deceptive cues or writing styles. For example, Horne and Adalı [29] extract stylometry and psychological features from the news titles to differentiate fake news from real. Przybyla et al. [26] develop a style-based text classifier, in which they use bidirectional Long short-term memory (LSTM) to capture style-based features from the news articles. Zellers et al. [12] develop a neural network model to determine the veracity of news from the news text. Some other works [27, 30] consider lexicons, bag-of-words, syntax, part-of-speech, context-free grammar, TFIDF, latent topics to extract the content-based features from news articles.

A general challenge of content-based methods is that fake news’s style, platform, and topics keep changing. Models that are trained on one dataset may perform poorly on a new dataset with different content, style, or language. Furthermore, the target variables in fake news change over time, and some labels become obsolete, while others need to be re-labelled. Most content-based methods are not adaptable to these changes, which necessitates re-extracting news features and re-labelling data based on new features. These methods also require a large amount of training data to detect fake news. By the time these methods collect enough data, fake news has spread too far. Because the linguistic features used in content-based methods are mostly language-specific, their generality is also limited.

To address the shortcomings of content-based methods, a significant body of research has begun to focus on social contexts to detect fake news. The social context-based detection methods examine users’ social interactions and extract relevant features representing the users’ posts (review/post, comments, replies) and network aspects (followers–followee relationships) from social media. For example, Liu and Wu [5] propose a neural network classifier that uses social media tweets, retweet sequences, and Twitter user profiles to determine the veracity of the news.

The existing social contexts-based approaches are categorized into two types: (1) stance-based methods and (2) propagation-based methods. The stance-based approaches exploit the users’ viewpoints from social media posts to determine the truth. The users express the stances either explicitly or implicitly. The explicit stances are the direct expressions of users’ opinions usually available from their reactions on social media. Previous works [4, 5, 22] mostly use upvotes/downvotes, thumbs up/down to extract explicit stances. The implicit stance-based methods [5, 31], on the other hand, are usually based on extracting linguistic features from social media posts. To learn the latent stances from topics, some studies [11] use topic modelling.

4 https://reporterslab.org/fact-checking/.
5 https://www.politifact.com/.
6 https://www.mturk.com/.

Other studies [13, 32] look at fake users’ accounts and behaviours to see if they can detect implicit stances. A recent study also analyses users’ views on COVID-19 by focusing on people who interact and share information on Twitter [33]. This study provides an opportunity to assess early information flows on social media. Other related studies [34, 35] examine users’ feelings about fake news on social media and discover a link between sentiment analysis and fake news detection.

The propagation-based methods [36–39] utilize information related to fake news, e.g. how users spread it. In general, the input to a propagation-based method can be either a news cascade (direct representation of news propagation) or self-defined graph (indirect representation capturing information on news propagation) [2]. Hence, these methods use graphs and multi-dimensional points for fake news detection [36, 39]. The research in propagation-based methods is still in its early stages.

To conclude, social media contexts, malicious user profiles, and user activities can be used to identify fake news. However, these approaches pose additional challenges. Gathering social contexts, for example, is a broad subject. The data is not only big, but also incomplete, noisy, and unstructured, which may render existing detection algorithms ineffective.

Other than NLP methods, visual information is also used as a supplement to determine the veracity of the news. A few studies investigate the relationship between images and tweet credibility [40]. However, the visual information in this work [40] is hand-crafted, limiting its ability to extract complex visual information from the data. In capturing automatic visual information from data, Jin et al. [41] propose a deep neural network approach to combine high-level visual features with textual and social contexts automatically.

Recently, transfer learning has been applied to detect fake news [1, 7]. Although transfer learning has shown promising results in image processing and NLP tasks, its application in fake news detection is still under-explored. This is because fake news detection is a delicate task in which transfer learning must deal with semantics, hidden meanings, and contexts from fake news data. In this paper, we propose a transfer learning-based scheme, and we pay careful attention to the syntax, semantics and meanings in fake news data.

2.1 State-of-the-art fake news detection models

In one of earlier works, Karimi et al. [42] use convolutional neural network (CNN) and LSTM methods to combine various text-based features, such as those from statements (claims) related to news data. Liu et al. [39] also use RNN and CNN-based methods to build propagation paths for detecting fake news at the early stage of its propagation. Shu et al. [4] propose a matrix factorization method TriFN to model the relationships among the publishers, news stories and social media users for fake news detection.

Cui et al. [12] propose an explainable fake news detection system DEFEND based on LSTM networks. The DEFEND considers users’ comments to explain if some news is real or fake. Nguyen et al. [15] propose a fake news detection method FANG that uses the graph learning framework to learn the representations of social contexts. These methods discussed above are regarded as benchmark standards in the field of fake news research.

In recent years, there has been a greater focus in NLP research on pre-trained models. BERT [17] and GPT-2 [43] are two state-of-the-art pre-trained language models. In the first stage, the language model (e.g. BERT or GPT-2) is pre-trained on the unlabelled text to absorb maximum amount of knowledge from data (unsupervised learning). In the second stage, the model is fine-tuned on specific tasks using a small-labelled dataset. It is a semi-supervised sequence learning task. These models are also used in fake news research.

BERT is used in some fake news detection models [1, 7, 44] to classify news as real or fake. BERT uses bidirectional representations to learn information and is generally more suitable for NLP tasks, such as text classification and translation. The GPT-2, on the other hand, uses the unidirectional representation to predict the future using left-to-right context and is better suited for autoregressive tasks, where timeliness is a crucial factor. In related work, Zellers et al. [12] propose a Grover framework for the task of fake news detection, which uses a language model close to the architecture of GPT-2 trained on millions of news articles. Despite these models’ robust designs, there are a few research gaps. First, these models do not consider a broader set of features from news and social contexts. Second, these methods ignore the issue of label scarcity in real-world scenarios. Finally, the emphasis is not on early fake news detection.

The state-of-the-art focuses primarily on fake news detection methods rather than early fake news detection. A few works [4, 5] propose early detection of fake news. However, to detect fake news, these methods [4, 5] usually rely on a large amount of fake news data observed over a long period of time (depending upon the availability of the social contexts). The work in [4] detects fake news after at least 12 h of news propagation, as demonstrated in their experiments, which may be too late. According to research [6], the fake news spreads within minutes once planted. For example, the fake news that Elon Musk’s Tesla team is inviting people to give them any amount (ranging from 0.1 to 20) of bitcoins in exchange for double the amount resulted in a loss of millions of dollars within the first few minutes.7 Therefore, it is critical to detect fake news early on before it spreads.

7 https://www.bbc.com/news/technology-56402378.

Our work is intended to address these issues (early fake news detection, labels scarcity) in fake news research. BERT and GPT-2 (or similar) have not been used to their full potential for representation learning and autoregression tasks in a single unifying model that we intend to work on going forward in our research. We propose a combination of Transformer architectures that can be applied to a wide range of scenarios, languages, and platforms.

3 Overview of the proposed framework

3.1 Problem definition

Given a multi-source news dataset and social contexts of news consumers (social media users), the task of fake news detection is to determine if a news item is fake or real. Formally, we define the problem of fake news detection as:

• Input: News items, social contexts and associated side information
• Output: One of two labels: ‘fake’ or ‘real’.

3.2 Proposed architecture

Figure 1 shows an overview of our proposed framework. Initially, the news comes from the news ecosystem [45], which we refer to as the dataset (input) in this work. The news content and social contexts go into the respective components where the data is being preprocessed. The input to the embedding layer is the features from news content and social contexts. The output from the embedding layer is the vector representations of news content and social contexts. These vector representations are combined to produce a single representation that is passed as input to the Transformer block. The output from the Transformer is transferred to the classification layer, followed by the cross-entropy layer. We get a label (fake or real) for each news as the final output.

We utilize three types of embeddings in the embedding layer: (1) token embeddings: to transform words into vector representations; (2) segment embeddings: to distinguish different segments or sentences from the content; (3) positional embeddings: to show tokens’ positions within sequences.

We create sequences from the news content and social contexts (user behaviours). In our work, we maintain a temporal order in the sequences through positional encodings. The intuition is that each word in the sequence is temporally arranged and is assigned to a timestep, which means that the first few words correspond to the timestep 0, timestep 1, and so on, till the last word corresponding to the last timestep. We use the sinusoidal property of position encodings in sequences [46], where the distance between the neighbouring timesteps is symmetrical and decays over time.
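As a concrete illustration of such sinusoidal encodings, a minimal sketch is given below; the sequence length and model dimension are arbitrary choices for the example and are not taken from the paper.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal position encodings: even dimensions use sine, odd dimensions
    use cosine, with wavelengths growing geometrically across dimensions."""
    positions = np.arange(seq_len)[:, None]                     # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                          # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                            # (seq_len, d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])
    encoding[:, 1::2] = np.cos(angles[:, 1::2])
    return encoding

# Add temporal (positional) information to a sequence of word embeddings.
word_embeddings = np.random.randn(128, 1024)                    # (timesteps, d_model)
inputs = word_embeddings + sinusoidal_positional_encoding(128, 1024)
```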
Fig. 1 Overview of the proposed framework

We discuss the Transformer block that consists of the encoder, decoder and attention mechanism in detail in Sect. 4 and Fig. 3.

3.3 The news ecosystem

The news ecosystem consists of three basic entities: publishers (news media or editorial companies that publish the news article), information (news content) and users [4, 45, 47]. As shown in Fig. 1, initially, the news comes from publishers. Then, it goes to different websites or online news platforms. The users get news from these sources, sharing news on different platforms (blogs, social media). The multiple friends’ connections, followers–followees links, hashtags, and bots make up a social media network.

3.3.1 News content

The news content component takes news content from the news ecosystem. The news body (content) and the corresponding side information represent a news article. The news body is the main text that elaborates the news story; generally, the way a news story is written reflects an author’s main argument and viewpoint. We include the following side information related to news:

• Source The source of news (e.g. CNN, BBC).
• Headline The title text that describes the main topic of the article. Usually, the headlines are designed to catch the attention of readers.
• Author The author of the news article.
• Publication time The time when news is published; it is an indicator of recency or lateness of news.
• Partisan information This information is the adherence of a news source to a particular party. For example, a news source with many articles favouring the right-wing reflects the source’s and authors’ partisan bias.

3.3.2 Social contexts

The social contexts component takes the social contexts on the news, such as posts, likes, shares, replies, followers–followees and their activities. When the features related to the content of news articles are not enough or not available, social contexts can provide helpful information on fake news. Each social context is represented by a post (comment, review, reply) and the corresponding side information (metadata). The post is a social media object posted by a user; it contains useful information to understand a user’s view on a news article. We include the following side information related to social contexts:

• User A person or bot that registers on social media.
• Title The headline or short explanation of the post. The title of the post matches the news headline.
• Score A numeric score given to a post by another user; this feature determines whether another user approves or disapproves of the post.
• Source The source of news.
• Number of comments The count of comments on a post; this feature gives the popularity level of a post.
• Upvote–Downvote ratio An estimate of other users’ approval/disapproval of a post.
• Crowd (aggregate) response We calculate the aggregate responses of all users on each news article. To calculate the aggregate response, we take all the scores on a post to determine a user’s overall view of a news story. We assume that a news story or theme with a score less than 1 is not reliable and vice versa.
• User credibility We determine the credibility level of social media users as an additional social context feature. This feature is helpful to determine if a user tends to spread some fake news or not. For example, similar posts by a non-credible user on a news item are an indicator of a news being real or fake. We determine user credibility through a user credibility component, shown in Fig. 2 and discussed next.

3.4 User credibility module

The topic of determining the credibility of social media users is not new in the literature. Some previous works apply community detection [48], sentiment analysis [33] and profile ranking techniques [49]. However, there is not much work in the fake news detection that considers the user credibility of social media users. The seminal work [4], in this regard, is a simple clustering approach that assigns a credibility level to each user. We adopt a different approach to build the user credibility module, as shown in Fig. 2.

We use zero-shot learning (ZSL) [20] to determine the credibility of users. ZSL is a mechanism by which a computer program learns to recognize objects in an image or extract information from text without labelled training data. For example, a common approach to classifying news categories is training a model from scratch on task-specific data. However, ZSL enables this task to be completed without any previous task-specific training. ZSL can also detect and predict unknown classes that a model has never seen during training based on prior knowledge from the source domain [43, 51] or auxiliary information.

To determine the credibility level, we first group each user’s engagements (comments, posts, replies) and then feed this information into our ZSL classifier. We build our ZSL classifier based on the Transformer architecture. We attach the pre-trained checkpoint8 (weights of the model during the training phase) of a huge dataset: multi-genre natural language inference (MNLI) [50], with our classifier.

8 https://dl.fbaipublicfiles.com/fairseq/models/bart.large.mnli.tar.gz.
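A rough sketch of how such a zero-shot credibility classifier can be run with the publicly released BART-MNLI checkpoint is given below, using the Hugging Face zero-shot-classification pipeline; concatenating a user's engagements into one text block and the exact label phrasing are our own assumptions, not details prescribed by the paper.

```python
from transformers import pipeline

# Zero-shot classifier built on BART fine-tuned on the MNLI dataset.
zsl_classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

credibility_levels = ["New user", "Very uncredible", "Uncredible",
                      "Credible", "Very credible"]

def credibility_level(user_engagements):
    """Group a user's comments/posts/replies and let the ZSL model pick the
    most plausible credibility level without any task-specific training."""
    text = " ".join(user_engagements)[:4000]          # truncate long histories
    result = zsl_classifier(text, candidate_labels=credibility_levels)
    return result["labels"][0]                         # highest-scoring label

print(credibility_level(["This vaccine story is a hoax, do not trust the media!",
                         "The same fake claim was posted yesterday."]))
```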
Fig. 2 The user credibility module

A multi-sourced collection is frequently used to collect information or opinions from large groups of people who submit their data through social media or the internet. We are using MNLI because it is a large-scale crowd-sourced dataset that covers a range of genres of spoken and written text. User credibility and crowdsourcing have been linked in previous research [55, 56]. Therefore, we anticipate that a large amount of crowdsourced data in MNLI could reveal a link between users’ credibility and how they express their opinions. It would be expensive if we need to gather such crowd-sourced opinions as well as direct user feedbacks ourselves. We gain the benefits of a pre-trained model in terms of size and training time and the benefit of accuracy by using MNLI.

Through ZSL, the checkpoint that is pre-trained can be fine-tuned for a specialized task, e.g. the user credibility task in our work. We could classify the users into different unseen classes (user credibility levels). In total, we define five credibility levels: ‘New user’, ‘Very uncredible’, ‘Uncredible’, ‘Credible’, ‘Very credible’. We use the prior knowledge of a fine-tuned ZSL model and its checkpoint, and we also use the semantics of the auxiliary information to determine known user classes. Our model can also determine new unknown user classes. Later, we incorporate this information as the weak labels into our fake news detection model.

Another module in this framework is the weak supervision module that is related to our datasets and labelling scheme, so we will discuss it in the dataset section (Sect. 6.4).

4 Proposed method

4.1 Preliminaries

Let $N = \{n_1, n_2, \ldots, n_{|N|}\}$ be a set of news items, each of which is labelled as $y_i \in \{0, 1\}$; $y_i = 1$ is the fake news and $y_i = 0$ is the real news. The news item $n_i$ is represented by its news body (content) and the side information (headline, body, source, etc.). When a news item $n_i$ is posted on social media, it is usually responded to by a number of social media users $U = \{u_1, u_2, \ldots, u_{|U|}\}$. The social contexts include users’ social engagements (interactions), such as comments on news, posts, replies, and upvotes/downvotes.

We define social contexts on a news item $n_i$ as: $SC(n_i) = \{(u_1, sc_1, t_1), (u_2, sc_2, t_2), \ldots, (u_{|sc|}, sc_{|sc|}, t_{|sc|})\}$, where each tuple $(u, sc, t)$ refers to a user $u$'s social contexts $sc$ on a news item $n_i$ during time $t$. Here, a user may interact with a post multiple times, and each interaction is recorded with its timestamp.

The task of fake news detection is to find a model $M$ that predicts a label $\hat{y}(n_i) \in \{0, 1\}$ for each news item based on its news content and the corresponding social contexts. Therefore, the task of fake news detection, in this paper, is defined as shown in Eq. (1):

$\hat{y}(n_i) = M(C(n_i), SC(n_i))$  (1)

where $C(n_i)$ refers to the content of news and $SC(n_i)$ refers to the social contexts on the news. The notations used in this paper can be found in “Appendix A”.
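To make the inputs of Eq. (1) concrete, a small, purely illustrative sketch of the data structures is shown below; all field names are hypothetical choices of ours, and the detector M is left as a stub.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class NewsItem:
    """A news item n_i: its content C(n_i) plus side information."""
    content: str
    headline: str = ""
    source: str = ""
    author: str = ""
    published_at: float = 0.0                  # publication timestamp

# A social context tuple (u, sc, t): user id, post/comment text, timestamp.
SocialContext = Tuple[str, str, float]

@dataclass
class Example:
    news: NewsItem
    social_contexts: List[SocialContext] = field(default_factory=list)
    label: int = 0                             # 1 = fake, 0 = real

def M(news: NewsItem, social_contexts: List[SocialContext]) -> int:
    """Placeholder for the detection model of Eq. (1)."""
    raise NotImplementedError
```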

Fig. 3 The encoder and decoder blocks in FND-NS model

4.2 Proposed model: FND-NS

Here, we introduce our proposed classification model called FND-NS (Fake News Detection through News content and Social context), which adapts the bidirectional and auto-regressive Transformers (BART) for a new task—fake news detection, as shown in Fig. 3. The original BART [16] is a denoising autoencoder that is trained in two steps. It first corrupts the text with an arbitrary noising function, and then it learns a model to reconstruct the original text. We use the BART as a sequence-to-sequence Transformer with a bidirectional encoder (like BERT) and a left-to-right autoregressive decoder (like GPT-2).

Models such as BERT [17], which captures the text representations in both directions, and GPT-2 [18], which has autoregressive properties, are examples of self-supervised methods. Both are Transformer models with their strengths: BERT excels in discriminative tasks (identifying existing data to classify) but is limited in generative tasks. At the same time, GPT-2 is capable of generative tasks (learning the regularities in data to generate new samples) but not discriminative tasks due to its autoregressive properties. In comparison with these models, BART integrates text generation and comprehension using both bidirectional and autoregressive Transformers. Due to this reason, we choose to work on the BART architecture.

Though we get inspiration from BART, our network architecture is different from the original BART in the following manner. The first difference between our model and the original BART is the method of input. Original BART takes one piece of text as input in the encoder part. In contrast, we incorporate a rich set of features (from news content and social contexts) into the encoder part. We use multi-head attentions to weigh the importance of different pieces of information. For example, if the headline is more convincing in persuading readers to believe something, or if a post receives an exceptionally large number of interactions, we pay closer attention to such information. We have modified the data loader of the original BART to feed more information into our neural network architecture.

The second difference is the way the next token is predicted. By token, we mean the token-level tasks such as named entity recognition and question answering, where models are required to produce fine-grained output at the token level [52]. We randomly mask some tokens in the input sequence. We follow the default masking probability, i.e. masking 15% of tokens with (MASK), as in the original paper [16]. However, we predict the ids of those masked items based on the positions of missing inputs from the sequence. This way, we determine the next item in the sequence based on its temporal position. In our work, we use the decoder to make predictions based on the previous sequences of text (news articles) and the previous sequences of user behaviours (how users respond to those articles). Modelling user behaviours in such a temporal manner helps us detect fake news in the early stage.
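One minimal way to reproduce this kind of token masking with a Hugging Face BART tokenizer is sketched below; the 15% rate follows the text, while the tokenizer choice and the special-token handling are our own assumptions.

```python
import torch
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")

def mask_tokens(input_ids: torch.Tensor, mask_prob: float = 0.15) -> torch.Tensor:
    """Randomly replace ~15% of the non-special token ids with the <mask> id."""
    ids = input_ids.clone()
    special = torch.tensor(
        tokenizer.get_special_tokens_mask(ids.tolist(), already_has_special_tokens=True),
        dtype=torch.bool,
    )
    probs = torch.full(ids.shape, mask_prob)
    probs[special] = 0.0                      # never mask <s>, </s> or padding
    ids[torch.bernoulli(probs).bool()] = tokenizer.mask_token_id
    return ids

encoded = tokenizer("Breaking: miracle cure found overnight", return_tensors="pt")
print(mask_tokens(encoded["input_ids"][0]))
```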

Finally, different from the original BART, we add a linear transformation and SoftMax layer to output the final target label.

Next, we discuss our model (Fig. 3) and explain how we use it in fake news detection. Let N represent a set of news items. Each news item has a set of text and social context features. These features are merged to form a combined feature set, as shown in the flowchart in Fig. 4.

Fig. 4 Flowchart of proposed FND-NS model

These combined features are then encoded into a vector representation. Let $X$ represent a sequence of $k$ combined features for a news item, as shown in Eq. (2):

$X = \{x_1, x_2, \ldots, x_k\}$.  (2)

These features are given as input to the embedding layers. The embedding layer gives us a word embedding vector for each word (feature in our case) in the combined feature set. We also add a positional encoding vector with each word embedding vector. The word embedding vector gives us the (semantic) information for each word. The positional encoding describes the position of a token (word) in a sequence. Together they give us the semantic as well as temporal information for each word. We define this sequence of embedding vectors as $X' = \{x'_1, x'_2, \ldots, x'_k\}$.

In the sequence-to-sequence problem, we find a mapping $f$ from an input sequence of $k$ vectors $X'_{1:k}$ to a sequence of $l$ target vectors $Y_{1:l}$. The number of target vectors is unknown a priori and depends on the input sequence. The $f$ is shown in Eq. (3):

$f: X'_{1:k} \rightarrow Y_{1:l}$.  (3)

4.2.1 Encoder

The encoder is a stack of encoder blocks, as shown in green in Fig. 3. The encoder maps the input sequence to a contextualized encoding sequence. We use the bidirectional encoder to encode the input from both directions to get the contextualized information. The input to the encoder is the input sequence $X'_{1:k}$. The encoder maps the input sequence $X'$ to a contextualized encoding sequence $\bar{X}$, as shown in Eq. (4):

$f_{\theta_{enc}}: X'_{1:k} \rightarrow \bar{X}_{1:k}$.  (4)

The first encoder block transforms each context-independent input vector to a context-dependent vector representation. The next encoder blocks further refine the contextualized representation until the last encoder block outputs the final contextualized encoding $\bar{X}_{1:k}$. Each encoder block consists of a bidirectional self-attention layer, followed by two feed-forward layers. We skip the details of feed-forward layers, which are the same as in [17]. We focus more on the bidirectional self-attention layer that we apply to the given inputs.

The bidirectional self-attention layer takes the vector representation $x'_i \in X'_{1:k}$ as the input. Each input vector $x'_i$ in the encoder block is projected to a key vector $\kappa_i \in K_{1:k}$, value vector $v_i \in V_{1:k}$, and a query vector $q_i \in Q_{1:k}$, through three trainable weight matrices $W_q$, $W_v$, $W_\kappa$, as shown in Eq. (5):

$q_i = W_q x'_i; \quad v_i = W_v x'_i; \quad \kappa_i = W_\kappa x'_i$  (5)

where $\forall i \in \{1, 2, \ldots, k\}$. The same weight matrices are applied to each input vector $x'_i$. After projecting each input vector $x'_i$ to a query, key and value vector, each query vector is compared to all the key vectors. The intuition is that the higher the similarity between a key vector and a query vector, the more important is the corresponding value for the output vector.

The output from the self-attention layer is the output vector representation $\bar{x}_i$, which is a refined contextualized representation of $x'_i$. An output vector $\bar{x}$ is defined as the weighted sum of all value vectors $V$ plus the input vector $x'$. The weights are proportional to the cosine similarity between the query vectors and respective key vectors, shown in Eq. (6):

$\bar{X}_{1:k} = V_{1:k}\,\mathrm{SoftMax}(Q_{1:k}^{T} K_{1:k}) + X'_{1:k}$  (6)

here $\bar{X}$ is the sequence of output vectors generated from the input $X'$. $\bar{X}$ is given to the last encoder block, and the output from the last encoder is a sequence of encoder hidden states $\bar{X}$. The final output from the encoder is the contextualized encoded sequence $\bar{X}_{1:k}$, which is passed to the decoder.
is compared only to its respective key vector and previous
The decoder only models on the leftward context, so it ones to yield the respective attention weights. The attention
does not learn bidirectional interactions. Generally, the news weights are then multiplied by their respective value vectors
(either real or fake) is shown or read in the order of and summed together, as in Eq. (10):
publication timestamps. So, news reading is a left-to-right  
 
(backward-to-forward) process. Naturally, the timestamps of Y1:l  V1:l Softmax K 1:l
T
Q 1:l + Y1:l . (10)
users’ engagements also follow the order of the news. In our
work, we model the left-to-right interdependencies in the The cross-attention layer takes as input two vector
sequences through the decoder part. The recurrent structure sequences: (1) outputs of the unidirectional self-attention
inside the decoder helps us use the predictions from a pre- 
layer, i.e. Y0:l−1 ; (2) contextualized encoding vectors X 1:k
vious state to generate the next state. With autoregressive from the encoder. The cross-attention layer puts each of its
modelling, we can detect fake news in a timely manner, con- input vectors to condition the probability distribution of the
tributing to early detection. next target vectors on the encoder’s input. We summarize
The Transformer-based decoder is a stack of decoder cross-attention in Eq. (11):
blocks, as shown in orange in Fig. 3, and the dense layer
 
language modelling (LM) head is on the top. The LM head is  
Y1:l  V1:l SoftMax K 1:l
T
Q 1:l + Y1:l . (11)
a linear layer with weights tied to the input embeddings. Each
decoder block has a unidirectional self-attention layer, fol-
The index range of the key and value vectors is 1 : l,
lowed by a cross-attention layer and two feed-forward layers.
which corresponds to the number of contextualized encoding
The details about the feed-forward layers can be found in the
vectors. Y  is given to the last decoder block and the output
paper [18]. Here, we focus more on the details of attention
from the decoder is a sequence of hidden states Y .
layers.
The input to the decoder is the contextualized encoding
sequence X 1:k from the encoder part. The decoder models 4.2.3 Model training
the conditional probability distribution of the target vector
sequence Y1:l , given the input X 1:k , shown in Eq. (7): In this work, we implement the transfer learning solution
[19] for fake news detection. We leverage the previous learn-
pθdec : (Y1:l |X 1:k ) (7) ings from a BART pre-trained checkpoint9 and fine-tune
the model on the downstream task of fake news detection.
here l is the number of the target vectors and depends on We perform the classification task for fake news detection.
the input sequence k. By Bayes’ rule, this distribution can be For the classification task, we input the same sequences into
factorized into conditional distributions of a target sequence the encoder and decoder. The final hidden state of the final
yi ∈ Y1:l , as shown in Eq. (8): decoder token is fed into an output layer for classification.

pθdec : (Y1:l |X 1:k  li1 pθdec (yi |Y0:i−1 |X 1:k )) (8) 9 https://dl.fbaipublicfiles.com/fairseq/models/bart.large.mnli.tar.gz.

123
International Journal of Data Science and Analytics (2022) 13:335–362 345
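The decoder-side attentions differ from the encoder mainly in which positions a query may see; the following hypothetical sketch shows a causal (unidirectional) attention mask in the spirit of Eq. (10) and a cross-attention step over the encoder states in the spirit of Eq. (11), again single-head and without learned projections.

```python
import torch

def causal_self_attention(y: torch.Tensor) -> torch.Tensor:
    """Unidirectional attention: each position attends only to itself and to
    earlier positions, so future targets stay hidden (cf. Eq. 10)."""
    scores = y @ y.T                                          # (l, l) similarities
    future = torch.triu(torch.ones_like(scores), diagonal=1).bool()
    scores = scores.masked_fill(future, float("-inf"))        # mask future positions
    return torch.softmax(scores, dim=-1) @ y + y

def cross_attention(y: torch.Tensor, x_enc: torch.Tensor) -> torch.Tensor:
    """Cross-attention: decoder states query the encoder's contextualized
    encodings, conditioning the next targets on the input (cf. Eq. 11)."""
    scores = y @ x_enc.T                                      # (l, k)
    return torch.softmax(scores, dim=-1) @ x_enc + y

decoder_states = torch.randn(5, 64)
encoder_states = torch.randn(12, 64)
print(cross_attention(causal_self_attention(decoder_states), encoder_states).shape)
```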

4.2.3 Model training

In this work, we implement the transfer learning solution [19] for fake news detection. We leverage the previous learnings from a BART pre-trained checkpoint9 and fine-tune the model on the downstream task of fake news detection. We perform the classification task for fake news detection. For the classification task, we input the same sequences into the encoder and decoder. The final hidden state of the final decoder token is fed into an output layer for classification.

9 https://dl.fbaipublicfiles.com/fairseq/models/bart.large.mnli.tar.gz.

Fig. 5 The classification process; input fed into encoder goes into decoder; the output label

This approach is like the [CLS] representation (CLS for classification) in BERT that serves as the token for the output classification layer. The BERT has the CLS token returned by the encoder, but in BART, we need to add this additional token in the final decoder part. Therefore, we add the token < S > in the decoder to attend to other decoder states from the complete input. We show the classification process in Fig. 5 [16].

We represent the last hidden state [S] of the decoder as $h_{[S]}$. The number of the classes is two (fake is 1, real is 0). A probability distribution $p \in [0, 1]^2$ is computed over the two classes using a fully connected layer with two output neurons on top of $h_{[S]}$, which is followed by the SoftMax activation function, as shown in Eq. (12):

$p = \mathrm{SoftMax}(W \cdot h_{[S]} + b)$  (12)

where $W$ is the learnable projection matrix and $b$ is the bias. We train our model for the sequence-pair classification task [17] to classify fake news. Unlike the typical sequence-pair classification task, we use the binary Cross-Entropy with logits loss function instead of the vanilla cross-entropy loss used for the multi-class classification. However, the same model can be adapted for the multi-class classification if there is a need. Through binary cross-entropy loss, our model can assign independent probabilities to the labels. The cross-entropy function $H$ determines the distance between the true probability distribution and the predicted probability distribution, as shown in Eq. (13):

$H(y_j, \hat{y}_j) = -\sum_{j} y_j \log \hat{y}_j$  (13)

where $y_j$ is the ground truth for observation and $\hat{y}_j$ is the model prediction.
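Assuming the Hugging Face implementation of BART, the classification set-up of Eqs. (12)–(13) can be sketched roughly as below; BartForSequenceClassification pools the final decoder state of the end-of-sequence token, and the binary cross-entropy-with-logits loss follows the text, while the checkpoint name, example texts and target encoding are illustrative assumptions of ours.

```python
import torch
import torch.nn as nn
from transformers import BartForSequenceClassification, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
# Classification head on top of BART: two output neurons over the pooled
# final decoder hidden state (cf. Eq. 12).
model = BartForSequenceClassification.from_pretrained("facebook/bart-large", num_labels=2)

loss_fn = nn.BCEWithLogitsLoss()          # binary cross-entropy with logits (cf. Eq. 13)

texts = ["Celebrity spotted on Mars", "Parliament passes the budget"]
targets = torch.tensor([[0.0, 1.0],       # fake (class 1)
                        [1.0, 0.0]])      # real (class 0)

enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
logits = model(**enc).logits              # shape (batch, 2)
loss = loss_fn(logits, targets)           # independent per-label probabilities
loss.backward()                           # fine-tune the pre-trained checkpoint
print(float(loss))
```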

Based on the Transformer architecture, our model naturally takes the sequences of words as the input, which keep flowing up the stacks from encoder to decoder, while the new sequences are coming in. We organize the news data according to the timestamps of users’ engagements so that the temporal order is retained during the creation of the sequences. We use paddings to fill up the shorter readers’ sequences, while the longer sequences are truncated.

5 Experimental set-up

5.1 Datasets

It was not a trivial task to find a suitable dataset to evaluate our proposed model because most of the standard datasets available for fake news detection are either too small, sparse, or void of temporal information.

A few state-of-the-art datasets, such as FakeNewsNet [23], are not available as the full version but can be found as the sample data. This is mainly because most of these datasets use Twitter data for social contexts and thus cannot be publicly accessible due to license policies. Other available datasets that consider the fake news content are outdated. Since fake news producers typically change their strategies over time, such datasets are not suitable to solve the issue of fake news data for the recent news data. After extensive research and careful consideration, we found that the NELA-GT-19 and Fakeddit are most suitable for our proposed problem regarding the number of news articles, temporal information, social contexts, and associated side information.

To evaluate the effectiveness of our proposed FND-NS model, we conducted comprehensive experiments on the data from the real-world datasets: NELA-GT-2019 [21] and Fakeddit [22]. Both datasets are in English, and we take the same timeline for the two datasets.

5.1.1 NELA-GT-2019

For our news component, we use the NELA-GT-2019 dataset [21], a large-scale, multi-source, multi-labelled benchmark dataset for news veracity research. This dataset can be accessed from here.10 The dataset consists of 260 news sources with 1.12 million news articles. These news articles were published between January 1, 2019, and December 31, 2019. The actual news articles are not labelled. We get the ground truth labels (0—reliable, 1—mixed, 2—unreliable) at the source level and use the weak supervision (discussed in Sect. 5.2) to assign a label to each news article. We use the article ID, publication timestamp, source of news, title, content (body), and article’s author for the news features. We only use the ‘reliable’ and ‘unreliable’ source-level labels. For the ‘mixed’ labels, we change them to ‘unreliable’ if they are reported as ‘mixed’ by the majority of the assessment sites and omit the left-over ‘mixed’ sources. The statistics of the actual data can be found in the original paper [21].

10 https://doi.org/10.7910/DVN/O7FWPO.

5.1.2 Fakeddit

For the social contexts, we use the Fakeddit dataset [22], which is a large-scale, multi-modal (text, image), multi-labelled dataset sourced from Reddit (social news and discussion website). This dataset can be accessed from here.11 Fakeddit consists of over 1 million submissions from 22 different subreddits (users’ community boards) and over 300,000 unique individual users. The data are collected from March 19, 2008, till October 24, 2019. We consider the data from January 01, 2019, till October 24, 2019, to match it with the timeline of the NELA-GT-19. According to previous work [3], this amount of data is considered sufficient for testing the concept drift. We use the features of the social context from this dataset: submission (the post on a news article), submission title (title of the post matching with the headline of the news story), users’ comments on the submission, user IDs, subreddit (a forum dedicated to a specific topic on Reddit) source, news source, number of comments, up-vote to down-vote ratio, and timestamp of interaction. The statistics of the actual data can be found in the original paper [22].

11 https://github.com/entitize/fakeddit.

Fig. 6 Weak supervision module

5.2 Weak supervision

The weak supervision module is a part of our proposed framework as shown in Fig. 6. We utilize weak (distant) labelling to label news articles. Weak supervision (distant supervision) is an alternative approach to label creation, in which labels are created at the source level and can be used as proxies for the articles. One advantage of this method is that it reduces the labelling workload. Furthermore, the labels for articles from known sources are known instantly, allowing for real-time labelling, as well as parameter updates and news analysis. This method is also effective in the detection of misinformation [9, 53–55].

The intuition behind weak supervision is that the weak labels on the training data may be imprecise but can be used to make predictions using a strong model [54]. We overcome the scarcity issue of hand-labelled data by compiling a dataset like this, which can be done almost automatically and can yield good results as shown in Sect. 6.3.

In our work, we use the weak supervision to assign article-level labels for the NELA-GT-2019 dataset, where the source-level labels are provided by the dataset. This method is also suggested by the providers of the NELA-GT-2019 dataset [21]. For the Fakeddit dataset, the ground truth labels are provided by the dataset itself; we only create two new labels for this dataset—‘crowd response’ and ‘user credibility’. We use these labels provided by the datasets to create a new weighted aggregate label to be assigned to each news article.

From the NELA-GT-19, we get the ground truth labels associated with each source (e.g. NYTimes, CNN, BBC, theonion and many more). These labels are provided by seven different assessment sites: (1) Media Bias/Fact Check, (2) Pew Research Center, (3) Wikipedia, (4) OpenSources, (5) AllSides, (6) BuzzFeed News, and (7) Politifact, to each news source.

Based on these seven assessments, Gruppi et al. [21] created an aggregated 3-class label: unreliable, mixed and reliable, to assign to each source. We use the source-level labels as the proxies for the article-level labels. The assumption is that each news story belongs to a news source and the reliability of the news source has an impact on the news story. This approach is also suggested in the NELA-GT-18 [56] and NELA-GT-20 [57] papers and has shown promising results in the recent fake news detection work [3].

Once we get the label for each news article, we perform another step of processing over article-level labels. As mentioned earlier, the NELA-GT-19 provides the 3-class source-level labels: {‘Unreliable’, ‘Mixed’, ‘Reliable’}. According to Gruppi et al. [21], the ‘mixed’ label means mixed factual reporting. We have not used the ‘mixed’ label in our work. We change the ‘mixed’ label to ‘unreliable’ if it is reported as ‘mixed’ by the majority of the assessment sites. For the remaining left-over mixed labels, we remove those sources to avoid ambiguity. This gives a final news dataset with 2-class labels: {‘Reliable’, ‘Unreliable’}.

The other dataset used in this work is Fakeddit. Nakamura et al. [22] also use the weak supervision to create labels in the Fakeddit dataset. They use the Reddit themes to assign a label to each news story. More details about Reddit themes and the distant labelling process are available in their paper [22].

The dataset itself provides labels as 2-way, 3-way and 6-way labels. We use the 6-way label scheme, where the labels assigned to each sample are: ‘True’, ‘Satire’, ‘Misleading Content’, ‘Imposter Content’, ‘False Connection’, and ‘Manipulated Content’. We assign two more weak labels in addition to 6-way labels, which are user credibility and crowd response labels that we compute using the social contexts. The user credibility level has five classes: ‘New user’, ‘Very uncredible’, ‘Uncredible’, ‘Credible’, ‘Very credible’. The crowd response has two classes: ‘Fake’ and ‘Real’.

We get the user credibility levels through our ZSL classifier (Fig. 2). For the crowd response, we simply take the scores of all the comments (posts) of users on a news story to determine the overall view of users on this news story. The goal is to make the label learning more accurate by adding more weak labels to the available labels. In a preliminary test, we find that using weak supervision with multiple weak labels (our case) achieves a better result than using Fakeddit theme-based weak labels alone [22] (they learned using their weak supervision model).

Based on this, we design a formula to assign the final label (‘Real’, ‘Fake’) to each sample in the aggregate functionality part. We assign a final label ‘Fake’ to a new article if one of the following conditions is satisfied: (1) its 6-way label specified in Fakeddit is ‘Satire’, ‘Misleading content’, ‘Imposter’, ‘False connection’, or ‘Manipulated content’; (2) its label specified in NELA-GT-19 is ‘Unreliable’; (3) its label according to user credibility is ‘Very uncredible’ or ‘Uncredible’; (4) its label according to crowd response is ‘Fake’. We assign a label ‘Real’ to the news if all of the following conditions are satisfied: (1) its label in Fakeddit is ‘True’; (2) its label in NELA-GT-19 is ‘Reliable’; (3) its label according to user credibility is ‘New user’, ‘Credible’, or ‘Very credible’; (4) its label according to crowd response is ‘Real’. We do not penalize a new user because we do not have sufficient information for a new user.

• False Positive (FP): number of real news that are identified Table 3 Hyperparameters used for the proposed model
as fake news. Hyperparameter Value
• True negative (TN): number of real news that are identified
as real news. Model Bart Large MNLI
• False negative (FN): number of fake news that are identi- Vocabulary size 50,265 (vocabulary size defines
the number of different
fied as real news.
tokens)
Dimensionality size 1024 (dimensionality of the
For the Prec, Rec, F1 and ACC, we perform the specific layers and the pooling layer)
calculation as: No. of encoder layers 12

TP No. of decoder layers 12


Prec  (14) Attention heads 16 (number of attention heads
TP + FP for each attention layer in
TP encoder), 16 (number of
Rec  (15) attention heads for each
TP + FN
attention layer in decoder)
TP
F1  (16) Feed-forward layer 4096 (dimensionality of the
TP + 21 (FN + FP) dimensionality feed-forward layer in
encoder), 4096
TP + TN
ACC  . (17) (dimensionality of the
TP + TN + FP + FN feed-forward layer in decoder)
Activation function Gelu (nonlinear activation
To calculate the AUC, we calculate the true positive rate function in the encoder and
(TPR) and the false positive rate (FPR). TPR is a synonym pooler)
for the recall, whereas FPR is calculated as: Position embeddings 1024
Number of labels 2
FP
FPR  . (18) Batch size 8 (tested 8, 16, 32)
FP + TN
Epochs 10
The receiver operating characteristic (ROC) plots the Sequence length 700 (other values used are 512,
trade-offs between the TPR and FPR at different thresholds 1024, 2048 but 700 suits to
current settings)
in a binary classifier. The AUC is an aggregate measure to
Learning rate 1e−4 (tested 1e−2, 1e−3,
evaluate the performance of the model across all those possi- 1e−4)
ble thresholds. Compared to the accuracy measure ACC, the Dropout 0.1 (dropout probability for all
AUC is better at ranking predictions. For example, if there fully connected layers, tested
are more fake news samples in the classification, the accu- in {0.0, 0.1, 0.2, …, 0.9})
racy measure may favour the majority class. On the other Warm up steps 500 (tested 0, 100, 300, 500,
hand, the AUC measure gives us the score order (ranking) 1000)
along with the accuracy score. We also include the average Optimizer Adam
precision AP that gives the average precision at all such pos- Loss function Cross entropy
sible thresholds, similar to the area under the precision–recall Output layer SoftMax
curve.

5.4 Hyperparameters 1024. The model has 16-heads with around 1 million param-
eters. We add a 2-layer classification head fine-tuned on the
We implement our model with Pytorch on the GPUs provided MNLI. The model hyperparameters are shown in Table 3.
by Google Colab Pro.12 We use the pre-trained checkpoint Our model is trained using Adam optimizer [59]. In our
of bart-large-mnli.13 The MNLI is a crowd-sourced dataset experiments, the larger batch sizes did not work. So, we
that can be used for the tasks such as sentiment analysis, decrease the batch size from 32 (often used) to 8 until the
hate speech detection, detecting sarcastic tone, and textual memory issues get resolved. We keep the same batch size of
entailment (conclude a particular use of a word, phrase, or 8 during the training and validation process. The number of
sentence). The model is pre-trained on 12 encoder and 12 train epochs is 10. The default sequence length supported by
decoder layers, in total 24 layers with a dimensionality size of the BART is 1024. Through an initial analysis of our datasets,
we find that the mean length of a news story is around 550
12 https://colab.research.google.com/. words, whereas a Reddit post is on average 50 words. The
13 https://dl.fbaipublicfiles.com/fairseq/models/bart.large.mnli.tar.gz. maximum sequence length of BERT and GPT-2 is 512, which

123
International Journal of Data Science and Analytics (2022) 13:335–362 349

is less than the mean length of a news story. So, we set the contexts into the model. We also represent another variant of
sequence length to 700 to include the average news length and this model where we remove the news body and news source,
the side information from the news and social contexts. The keeping only social contexts (as in the default model) and
sequences are created based on the timestamps of the user’s represent it as 2-Stage Tranf. (nc-).
engagement. The longer sequences are truncated, while the exBAKE [7]: it is another fake news detection method
shorter ones are padded to the maximum length. based on deep neural networks. This model is also based on
the BERT model and is designed for the content of news.
5.5 Baseline approaches Besides showing the original model’s results, we also incor-
porate social contexts into the model by introducing another
We compare our model with the state-of-the-art fake news variant of this model. The model variants are exBAKE (with
detection methods, including deep neural and traditional both news content and social contexts) and exBAKE (sc-)
machine learning methods. We also consider other baselines, (default model, without social contexts).
including a few recent Transformer models, a neural method Declare [8]: it is a deep neural network model that assesses
(Text CNN) and a traditional baseline (XGboost). the credibility of news claims. This model uses both the news
A few state-of-the-art methods, such as a recent one by content and social contexts by default, so we feed this infor-
Liu and Wu [5], is not publicly accessible, so we have not mation to the model. An implementation of the model can be
included those in this experiment. Some of these baselines are found here.15
by default using content features only (e.g. exBake, Grover, TriFN [4]: it is a matrix factorization based model that uses
Transformer-based baselines, TextCNN), and some are using both news content and social contexts to detect fake news.
social contexts only (e.g. 2-Stage Tranf., SVM-L, SVM-G We give both the news and social contexts to the model. The
and LG group). A few baselines use the social contexts with model implementation can be accessed here.16
content-based features (e.g. TriFN, FANG, Declare). Our Grover [12]: it is a deep neural network-based fake news
model uses both the news content and the social contexts detection model based on GPT-2 [18] architecture. The model
with the side information. For a fair comparison, we test the takes news related information and can incorporate additional
baselines using their default settings. In addition, we also test social contexts too. We give both the news content and social
them by including both news content and social contexts. In contexts to Grover. In addition, we remove the social con-
this case, we create variants of baselines (default setting and texts and keep the content information only (as in default
setting with both news and social context). Grover model), which we represent as Grover (sc-). We use
To determine the optimal hyperparameter settings for the the Grover-base implementation of the model and initialize
baselines, we primarily consult the original papers. How- the model using the GPT-2 checkpoint.17 The model imple-
ever, there is little information provided on how the baselines mentation is available here.18
are tuned in these papers. So, we optimize the hyperparam- SVM-L; SVM-G; LG [14]: it is a machine learning model
eters of each baseline using our dataset. We also train all based on similarity among the friends’ networks to discover
the models from scratch. We optimize the standard hyper- fake accounts in social networks and detect fake news. We
parameters (epochs, batch size, dimensions, etc.) for each use all the proposed variants: linear support vector machine
baseline. Some of the hyperparameters specific to individ- (SVM), medium Gaussian SVM and logistic regression, and
ual models are reported below (along with the description of optimize them to their optimal settings.
each method). BERT [17]: BERT (bidirectional encoder representations
FANG [15]: it is a deep neural model to detect fake from Transformers) is a Google-developed Transformer-
news using graph learning. We optimize the following losses based model. We use both the cased (BERT-c) and uncased
simultaneously during the model training: 1) unsupervised (BERT-u) version of the BERT, with 24-layer, 1024-hidden,
proximity loss, 2) self-supervised stance loss, and 3) super- 16-heads, and 336 M parameters. The model implementation
vised fake news detection loss (details can be found in the can be found here.19
paper [15]), whereas the implementation details are available VGCN-BERT [60]: it is a deep neural network-based
here.14 We feed both the news-related information and social model that combines the capability of BERT with a vocab-
contexts into the model.
2-Stage Tranf. [1]: it is a deep neural fake news detection
model that focuses on fake news with short statements. The
15 https://github.com/atulkumarin/DeClare.
original model is based on BERT-base [17], and checkpoint is
16 https://github.com/KaiDMML/FakeNewsNet.
recommended to use, so we build this model using the same
17 https://openai.com/blog/tags/gpt-2/.
method. We feed the news-related information and social
18 https://github.com/rowanz/grover.
14 https://github.com/nguyenvanhoang7398/FANG. 19 https://github.com/google-research/bert.

123
350 International Journal of Data Science and Analytics (2022) 13:335–362

ulary graph convolutional network (VGCN). The model training loss validaon loss
0.7
implementation is available here.20
XLNET [61]: it is an extension of the Transformer-XL 0.6
model, which was pre-trained with an autoregressive method
to learn bidirectional contexts. We use the hyperparameters: 0.5

Loss
24-layer, 1024-hidden, 16-heads, 340 M parameters. The
0.4
model implementation is available here.21
GPT-2 [18]: it is a causal (unidirectional) Transformer pre- 0.3
trained using language modelling. We use the hyperparam-
eters: 24-layer, 1024-hidden, 16-heads, 345 M parameters. 0.2
The model implementation is available here.22 0 5 10 15 20 25 30
DistilBERT [62]: it is a BERT-based small, fast, cheap, Epochs
and light Transformer model, which uses 40% fewer param-
Fig. 7 Training versus validation loss
eters than BERT-base, runs 60% faster, and keeps over 95%
of BERT’s results, as measured in the paper [62]. We only
Table 4 Confusion matrix of the sample data
use the cased version for this model (based on the bet-
ter performance of the BERT cased version, also shown Actual fake Actual real
in the later experiments). We use the hyperparameters: 6- Predicted fake 35,104 12,997
layer, 768-hidden, 12-heads, 65 M parameters, and the model Predicted real 9793 32,893
implementation is available here.23
Longformer [63]: it is a Transformer-based model that
scales linearly with sequence length, making it easy to
XGBoost [65]: it is an optimized distributed machine
process documents of thousands of tokens or longer. We
learning algorithm under the gradient boosting framework,
use the hyperparameters with 24-layer, 1024-hidden, 16-
with implementation from here.28
heads, ~ 435 M parameters and the model is initiated from
We report the results for each baseline based on best per-
the RoBERTa-large24 checkpoint, trained on documents of
forming hyperparameters for each evaluation metric.
max length 4,096. The model implementation is available
here.25
We use both news content and social contexts to train the
Transformer-based models (BERT, VGCN-BERT, XLNET,
6 Results and analyses
GPT-2, DistillBERT), which are built for taking textual infor-
In this section, we present the results and analyse them.
mation. But these models can also handle the social contexts,
as evidenced in some preliminary tests where we first fed the
news content, then news content with social contexts, and 6.1 Model performance
found a marginal difference in performance.
Text CNN [64]: it is a convolution neural network (CNN) We show the learning curve for training loss and validation
with one layer of convolution on top of word vectors, where loss during model training in Fig. 7.
the vectors are pre-trained on a large number (~ 100 billion) In our model, the validation loss is quite close to the train-
of words from Google News.26 The model implementation ing loss. The validation loss is slightly higher than training
is available here.27 loss, but overall, both values are converging (when plotting
loss over time). Overall, it shows a good fit for model learn-
ing.
We also test the model’s performance on the test data to
show the confusion matrix in Table 4.
Based on the confusion matrix in Table 4, the model accu-
20 https://github.com/Louis-udm/VGCN-BERT. racy is 74.89%, which means more than 74% of the results
21 https://github.com/zihangdai/xlnet. are correct. We get the precision of 72.40%, which means
22 https://github.com/openai/gpt-2. that we have a few false positives (news is real but predicted
23 https://huggingface.co/transformers/model_doc/distilbert.html. as fake), and we can correctly predict a large portion of true
24 https://github.com/pytorch/fairseq/tree/master/examples/roberta. positives (i.e. the news is fake and predicted as fake). We get
25 https://github.com/allenai/longformer. a recall value of 77.68%, which shows we have many more
26 https://code.google.com/archive/p/word2vec/.
27 https://github.com/dennybritz/cnn-text-classification-tf. 28 https://xgboost.readthedocs.io/en/latest/.

123
International Journal of Data Science and Analytics (2022) 13:335–362 351

Table 5 Overall performance comparison • We exploit the knowledge transition from large-scale pre-
Method ACC Prec Rec F1 AUC AP trained models (e.g. MNLI) to the fake news detection
problem with transfer learning. The right choice of the pre-
FND-NS 0.748 0.724 0.776 0.749 0.704 0.710 trained checkpoints on the specific corpus helps us make
Fake news detection baselines better predictions. During empirical testing, we check the
FANG [15] 0.687 0.673 0.618 0.644 0.681 0.666 performance of our model with and without including the
2-Stage Tranf. [1] 0.621 0.647 0.626 0.636 0.620 0.614 pre-trained checkpoints. We find better results with the
2-Stage Tranf. (nc-) 0.612 0.634 0.610 0.622 0.612 0.607 inclusion of the MNLI checkpoint.
exBAKE [7] 0.665 0.625 0.614 0.619 0.640 0.601 • We model the timeliness in our model through an autore-
exBAKE (sc-) 0.685 0.630 0.610 0.620 0.651 0.610 gressive model, which helps us detect fake news in a timely
Declare [8] 0.621 0.610 0.607 0.608 0.578 0.590 and early manner.
TriFN [4] 0.615 0.601 0.614 0.607 0.601 0.596 • We address the label shortage problem through the pro-
Grover [12] 0.533 0.567 0.586 0.576 0.565 0.575 posed weak supervision module, which helps us make
Grover (sc-) 0.578 0.601 0.617 0.609 0.582 0.612 better predictions on unforeseen news.
SVM-L [14] 0.415 0.429 0.451 0.440 0.425 0.421
SVM-G [14] 0.434 0.450 0.455 0.452 0.428 0.430 We have the following more findings from the results:
LG [14] 0.425 0.468 0.457 0.462 0.431 0.420 Among the fake news detection baselines, the overall per-
Other baselines
formance of FANG is the best. The performance of FANG is
the second best after our FND-NS model. FANG uses graph
BERT-c [17] 0.640 0.610 0.620 0.615 0.622 0.639
learning to detect fake news and focus on learning context
BERT-u [17] 0.589 0.560 0.578 0.569 0.566 0.597
representations from the data. The overall performance of
VGCN-BERT [60] 0.627 0.598 0.610 0.604 0.610 0.645
exBAKE and 2-Stage Tranf. as indicated in most metrics is
XLNET [61] 0.520 0.600 0.530 0.563 0.525 0.557
the next best. These models (exBAKE and 2-Stage Tranf.) are
GPT-2 [18] 0.635 0.620 0.610 0.615 0.614 0.623
based on the BERT model and are suitable for representation
DistilBERT [62] 0.522 0.510 0.490 0.500 0.467 0.526
learning. Our model outperforms these models, most likely
Longformer [63] 0.543 0.573 0.550 0.561 0.536 0.545
because we focus on both autoregression and representation
Text CNN [64] 0.520 0.480 0.501 0.490 0.530 0.522 learning.
XGBoost [65] 0.510 0.454 0.487 0.470 0.525 0.514 The 2-Stage-Tranf. uses the claim data from the social
media. We also test this model with its default input set-
ting as in 2-Stage Tranf. (nc-), omitting the news content
true positives than false negatives. Generally, a false negative (news body, headline, source) and allowing only social con-
(news is fake but predicted as real) is worse than a false pos- text features (such as post, title, score). With this change,
itive in fake news detection. In our experiment, we get less we do not find much difference in the performance. We find
false negatives than false positives. Our F1-score is 74.95%, the better performance (though marginal) of 2-Stage-Tranf.
which is also quite high. when we only keep the news-related features (not includ-
ing social contexts). This is most likely due to the support
of the 2-Stage-Tranf. model for auxiliary information. Our
6.2 Overall performance comparison
model performs better than 2-Stage-Tranf. with its support
for side information. This is likely because our model can
We show the best results of all baselines and our FND-NS
handle longer sequence lengths than the baselines, resulting
model using all the evaluation metrics in Table 5. The results
in some loss of information and thus accuracy in those mod-
are based on data from both datasets, i.e. social contexts from
els.
Fakeddit on the NELA-GT-19 news. The input and hyperpa-
Then comes the performance of Declare, TriFN and
rameter optimization settings for each baseline model are
Grover models, all of which are considered the benchmark
given above (Sect. 5.5). The best scores are shown in bold.
models in fake news research. Grover is a content-based neu-
Overall, we see that our proposed FND-NS model has the
ral fake news detection model. Declare is a neural network
highest accuracy (74.8%), precision (72.4%), recall (77.6%),
framework that detects fake news based on the claim data.
AUC (70.4%) and average precision (71%) among all the
TriFN is a non-negative matrix factorization algorithm that
models. The superiority of our model is attributed to its
includes news content and social contexts to detect fake news.
advantages:
We also test Grover (content-based model) without social
contexts in Grover (sc-). We find some better performance
• Our model utilizes rich features from news content and of Grover (sc-) than Grover’s (with both inputs). This result
social contexts to extract the specifics of fake news. shows that a model built on rich content features (news body,

123
352 International Journal of Data Science and Analytics (2022) 13:335–362

headline, publication date) with autoregressive properties autoencoding models (DistilBERT, Longformer, BERT-u).
(GPT-2 like architecture) can perform better even without The exception is seen in BERT-c for some scores. The autore-
social contexts. gressive Transformers usually model the data from left to the
The SVM and LG are also used for fake news detection. right and are suitable for time-series modelling. They predict
Due to their limited capabilities and the use of hand-crafted the next token after reading all the previous ones. On the other
dataset features, the accuracies of SVM and LG are lower hand, the autoencoding models usually build a bidirectional
in these experiments. The results for SVM and LG do not representation from the whole sentences and are suitable for
generalize the performance of these models to all situations natural language understanding tasks, such as GLUE (general
in this field. language understanding evaluation), classification, and text
In general, the performance of the Transformer-based categorization [17]. Our fake news detection problem implic-
methods is better than the traditional neural-based methods itly involves data that vary over time. The autoregressive
(Text CNN) and the linear models (SVM, LG, XGBoost). models show relatively better results. Our FND-NS model
This is probably because the Transformer-based methods use performs the best because it has both the autoencoding model
the multi-head attention and positional embeddings, which and the autoregressive model.
are not by-default integrated with the CNNs (of text CNN) We find VGCN-BERT as a competitive model. The VGCN
and the linear methods. With the default attention mecha- is an extension of the CNN model combined with the graph-
nisms and more encoding schemes (e.g. token, segment and based method and the BERT model. The results in Table 5
position), the Transformers compute input and output rep- show the good performance of CNN in the TextCNN method
resentations better than the traditional neural methods. Our and that of the BERT model. The neural graph networks
FND-NS model, however, performs better than these Trans- have recently demonstrated noticeable performance in the
former models. This is because our framework includes many representative learning tasks by modelling the dependencies
add-ons, such as weak supervision, representation learning, among the graph states [66]. That is why the performance of
autoregression, which (all of them together) are not present VGCN-BERT (using BERT-u) is better than TextCNN and
in the typical Transformer models. BERT-u alone. This result also indicates that hybrid models
The general performance of simple neural methods (e.g. are better than standalone models. FANG also uses a graph
Text CNN, Declare) that are not Transformer-based is bet- neural network with the supervised learning loss function
ter than the linear methods (SVM, LG, XGBoost). This is and has shown promising results.
probably because the linear methods use manual feature
engineering, which is not optimal. On the other hand, the
6.3 Effectiveness of weak supervision
neural-based methods can capture both the global and the
local contexts in the news content and social contexts to detect
In this experiment, we test the effectiveness of the weak
the patterns of fake news.
supervision module on the validation data for the accuracy
Among the Transformers, the cased model (e.g. BERT-
measure.
c), in general, performs better than its respective uncased
We show different settings for weak supervision. These
version (e.g. BERT-u). Generally, fake or false news uses
settings are:
capital letters and emotion-bearing words to present some-
thing provoking. Horne and Adalı [29] also present several
examples where fake titles use capitalized words excessively. M1: Weak supervision on both datasets, NELA-GT-19 and
This shows why the cased models can detect fake news better Fakeddit with original labels + user credibility label +
compared to the uncased versions. crowd response label;
The overall performance of the distilled (condensed) ver- M2: Weak supervision on both datasets, NELA-GT-19 and
sions (Distill BERT) is slightly lower than their respective Fakeddit with original labels + user credibility label;
actual models (BERT). Based on the better performance of M3: Weak supervision on both datasets, NELA-GT-19 and
the BERT cased version over its uncased version, we use the Fakeddit with original labels + crowd response label;
cased version of Distill BERT. The Distill BERT does not use M4: Weak supervision on both datasets, NELA-GT-19 and
token-type embeddings and retains only half of the layers and Fakeddit with original labels;
parameters of the actual BERT, which probably results in the M5: Weak supervision on NELA-GT-19 only;
overall lower prediction accuracy. The distilled versions bal- M6: Weak supervision on Fakeddit only with original
ance the computational complexity and accuracy. This result labels;
suggests that using the distilled version can achieve compa- M7: Weak supervision on Fakeddit with original + user
rable results (to the original model) with better speed. credibility labels;
We also see that the general performance of the autore- M8: Weak supervision on Fakeddit with original + crowd
gressive models (XLNet and GPT-2) is better than the most response labels;

123
International Journal of Data Science and Analytics (2022) 13:335–362 353

80% The results on the original NELA-GT-19 may also be dif-


70% ferent in our work. This is because we do not consider much
of the mixed labels from the original dataset. Also, since
60% we use under-sampling for data balancing for both of our
Accuracy score

50% datasets, the results may vary for the experiments in this paper
versus the other papers using these datasets.
40%
30% 6.4 Ablation study
20%
In the ablation study, we remove a key component from our
10% model one a time and investigate its impact on the perfor-
0% mance. The list of reduced variants of our model are listed
M1 M2 M3 M4 M5 M6 M7 M8 M9 below:
Model variant
FND-NS: The original model with news and social con-
Fig. 8 Accuracy percentage on different settings of weak supervision
for FND-NS model
texts component;
FND-N: FND-NS with news component—removing
social contexts component;
M9: Weak supervision on Fakeddit with original + user FND-N(h-): FND-N with headlines removed from the
credibility labels + crowd response labels. news component;
FND-N(b-): FND-N with news body removed from the
news component;
The results of FND-NS on these settings are shown in FND-N(so-): FND-N with news source removed from the
Fig. 8. The results show that our model performs better when news component;
we include newly learned weak labels and worse when we FND-N(h-)S: FND-NS with headlines removed from the
omit any one of the weak labels. This is seen with the best per- news component;
formance of FND-NS in the M1 setting. The crowd response FND-N(b-)S: FND-NS with news body removed from the
label proves to be more productive than the user credibility news component;
label. This is seen with > 2% loss in accuracy in M2 and M7 FND-N(so-)S: FND-NS with news source removed from
(without crowd response) compared to the M3 and M8 (with the news component;
crowd response). This conclusion is also validated by another FND-S: FND-NS with social context component—remov-
experiment in the ablation study (Sect. 6.4) discussed later. ing news component;
We also see that the model performance improves when FND-S (uc-): FND-S with user credibility removed from
we include both datasets. This can be seen with the overall the social contexts;
better performance of the model with both datasets. In gen- FND-S (cr-): FND-S with crowd responses removed from
eral, all these results (in Fig. 8) indicate that the weak labels the social contexts;
may be imprecise but can be used to provide accurate pre- FND-NS (uc-): FND-NS with user credibility removed
dictions. Some of these FND-NS results are also present in from the social contexts;
the Ablation study (Sect. 6.4) and are explored in more detail FND-NS (cr-): FND-NS with crowd responses removed
there. from the social contexts;
In the Fakeddit paper [22], we see the performance of the FND (en-)-NS: FND-NS with the encoder block
original BERT model to be around 86%, which is under- removed—sequences from both the news and social con-
standable because the whole Fakeddit dataset (ranging from texts components are fed directly into the decoder;
the year 2008 till 2019) is used in that work. In our paper, FND (de-)-NS: FND-NS with the decoder block removed;
we use the Fakeddit data only for the year 2019. Usually, FND (12ly-)-NS: FND-NS with 12 layers removed (6 from
the models perform better with more data. In particular, deep encoder and 6 from decoder).
neural networks (e.g. BERT) perform better with more train-
ing examples. Omitting many training examples could affect The results of the ablation study are shown in Table 6.
the performance of the model. This is the possible reason we The findings from the results are summarized below:
see lower accuracy of our model in this experiment using the When we remove the news component, the model accu-
Fakeddit data. For the same reason, we see the performance racy drops. This is demonstrated by the lower scores of
of the original BERT a bit lower with the Fakeddit data in FND-S, compared to the original model FND-NS in Table
Table 5. 6. However, when we remove the social context component,

123
354 International Journal of Data Science and Analytics (2022) 13:335–362

Table 6 Variants and the results Accuracy F1-score AUC AP


0.8
Variant ACC F1 AUC AP
0.7
FND-NS 0.748 0.749 0.704 0.710
FND-N 0.618 0.609 0.572 0.589 0.6
FND-N(h-) 0.610 0.598 0.568 0.569
FND-N(b-) 0.569 0.574 0.545 0.547 0.5
FND-N(so-) 0.587 0.582 0.573 0.560
0.4
FND-N(h-)S 0.695 0.639 0.689 0.655
FND-N(b-)S 0.670 0.615 0.645 0.628 0.3
FND-N(so-)S 0.684 0.639 0.685 0.649
FND-S 0.659 0.621 0.619 0.607 0.2
FND-S (uc-) 0.641 0.614 0.587 0.591
FND-S (cr-) 0.622 0.597 0.572 0.582 0.1
FND-NS (uc-) 0.680 0.685 0.651 0.672
0.0
FND-NS (cr-) 0.667 0.660 0.635 0.622
50 100 250 500 700
FND (en-)-NS 0.575 0.567 0.551 0.580
FND (de-)-NS 0.527 0.526 0.520 0.534 Fig. 9 The FND-NS with different sequence lengths
FND (12ly-)-NS 0.515 0.510 0.468 0.520

Some users leave the system permanently, some change their


viewpoints, and new users keep coming into the system.
the model accuracy drops more. This is seen with the lower Therefore, the user credibility may not be as informative
accuracy of FND-N (without social contexts) compared to as crowd responses and thus has less effect on the overall
the FND-S. This result indicates that both the news content detection result.
and social contexts play an essential role in fake news detec- The model performance is impacted when we remove
tion, as indicated in the best performance of the FND-NS the encoder from the FND-NS. The model performance is
model. affected even more when we remove the decoder. This is
The results also show that the performance of the FND- seen with the lower scores of FND(de-)-NS, which is lower
NS model is impacted more when we remove the news body than FND(en-)-NS. In our work, the decoder is the autore-
than removing the headline or the source of the news. This is gressive model, and the encoder is the autoencoding model.
seen with relatively lower accuracy of FND-N(b-) compared This result also validates our previous finding from the base-
to both the FND-N(h-) and FND-N(so-). The same results are lines (Table 5), where we find the better performance of the
seen in the lower accuracy of FND-N(b-)S compared to both autoregressive model (e.g. GPT-2) compared to most autoen-
the FND-N(h-)S and FND-N(so-)S. The result shows that the coding models (Longformer, DistillBERT, BERT-c).
headline and source are important, but the news body alone Lastly, we find that removing layers from the model lowers
carries more information about fake news. The source seems the accuracy of the FND-NS model. We get better speed upon
to carry more information than the headline; this is perhaps removing almost half the layers and the parameters, but this
related to the partisan information. comes with the information loss and the lower accuracy. This
From the social contexts, we find that when we remove also validates our baseline results in Table 5, where we see
the user credibility or the crowd responses, the model perfor- that the distilled models are faster in speed, but they do not
mance in terms of accuracy is decreased. Between the user perform as good as the original models.
credibility and crowd responses, the model performance is We also test the sequence lengths in {50, 100, 250, 500,
impacted more when we remove crowd responses. This is 700} in our model. It is important to mention that the large
seen with the lower performance of FND-S(cr-) and FND- sequence length often causes memory issues and interrupts
NS(cr-) compared to FND-S(uc-) and FND-NS(uc-). The the model’s working. However, we adjust the sequence length
same finding is also observed in Fig. 8 for the results of weak according to the batch size. This facility to include sequence
supervision. The probable reason for the crowd responses length > 512 is provided by BART. Most models (e.g. BERT,
being more helpful for fake news detection could be that GPT-2) do not support sequence length > 512. Our model per-
they provide users’ overall score on a news article directly, formance with different sequence lengths is shown in Fig. 9.
whereas the user credibility only plays an indirect role in
the prediction process. According to the concept drift the- The results in Fig. 9 show that our model performs the best
ory, the credibility levels of the users may change over time. when we use a sequence length of 700. Our datasets consist

123
International Journal of Data Science and Analytics (2022) 13:335–362 355

of many features from the news content and social contexts 0.8
over the span of close to one year. The news stories are on
average 500 words or more, which carries important informa-
0.7
tion about the news veracity. The associated side information
is also essential.

AUC Score
The results clearly show that truncating the text could 0.6
result in information loss. It is why we see the informa-
tion loss with the smaller sequence lengths. With a larger
sequence length, we could accurately include more news fea- 0.5
tures and users’ engagement data to accurately reflect the
patterns in users’ behaviours.
We also observe that the sequence length depends on the 0.4

15-01-2019

30-01-2019

14-02-2019

01-03-2019

16-03-2019

31-03-2019

15-04-2019

30-04-2019

15-05-2019
average sequence length of the dataset. Since our datasets
are large and by default contain longer sequences, we get
better performance with a larger sequence length. Due to
the resource limitations, we could not test on further larger
lengths, which we leave for future work. Weeks

Fig. 10 AUC of FND-NS during different weeks


6.5 The impact of concept drift

The concept drift occurs when the interpretation of the data trained on those events. After this point, the model perfor-
changes over time, even when the data may not have changed mance becomes steady.
[10]. Concept drift is an issue that may lead to the predictions Overall, the results show that our model effectively deals
of trained classifiers becoming less accurate as time passes. with concept drift, as the model performance is not much
For example, the news that is classified as real may become impacted at a large scale during all the timesteps. In gen-
fake after some time. The news profiles and users’ profiles eral, these results suggest that simply retraining the model
classified as fake may also change over time (some profiles every so often is enough to keep changes with fake news. We
become obsolete and some are removed). Most importantly, also observe that fake news’s concept drift does not occur as
the tactics of fake news change over time, as new ways are often as in real news scenarios. The same kind of analysis is
developed to produce fake news. These types of changes will also seen in related work [3], where the authors performed
result in the concept drift. A good model can combat the extensive experiments on concept drift and concluded that
concept drift. the content of the fake news does not drift as abruptly as that
In this experiment, we train our model twice a month and of the real news. Though the fake news does not evolve so
then test on each week moving forward. At first, train on the often as the real news, once planted, the fake news travels
first two weeks’ data and test on the data from the third week. further, farther and broader than the real news. Therefore, it
Next time, the model is trained on the data from the next two is important to detect fake news as early as possible.
weeks plus the previous two weeks (e.g. week 1, 2, 3, 4)
and tested on the next week (e.g. week 5) and this process 6.6 The effectiveness of early fake news detection
(training on four weeks’ data and testing on the following
week) continues. We evaluate the performance of the model In this experiment, we compare the performance of our model
using AUC and report the results in Fig. 10. The reason we and baselines on early fake news detection. We follow the
choose AUC here is that it is good at ranking predictions methodology of Liu and Wu [39] to define the propagation
(compared to other metrics). path for a news story, shown in Eq. (19):
Overall, the concept drift impacts our FND-NS model’s  
performance, but these changes happen slowly over time. As P(n i , T ) ≺ x j , t < T  (19)
shown in Fig. 10, the model performance initially improves,
then the performance is impacted by the concept drift in mid where x j is the observation sample and T is the detection
of March. This probably shows the arrival of unforeseen deadline. The idea is that any observation data after the detec-
events during this time period. Once the model is trained tion deadline T cannot be used for training. For example, a
on these events, we see a rise in performance. This is shown piece of news with timestep t means that the news is propa-
by a better and steady performance of the model in April. gated t timesteps ago. Following [39] for choosing the unit
We then see a sudden rise in performance in mid-April. This for the detection deadlines, we also take the units in minutes.
is probably because, up to this point, the model has been According to the research in fake news detection, fake news

123
356 International Journal of Data Science and Analytics (2022) 13:335–362

usually takes less than an hour to spread. It is easy to detect 7 Limitations


fake news after 24 h, but earlier detection is a challenge, as
discussed earlier. Our data and approach have some limitations that we mention
In this experiment, we evaluate the performance of our below:
model and the baselines on different detection deadlines or
timesteps. To report the results, we take the observations 7.1 Domain-level error analysis
under the detection deadlines: 15, 30, 60, 100 and 120 min, as
shown in Fig. 11. For simplicity, we keep the best performing The NELA-GT-19 comprises 260 news sources, which can
models among the available variants, e.g. among the Trans- only represent a limited amount of fake news detection analy-
formers, we keep only BERT-c, GPT-2 and VGCN-BERT sis over a given period of time. As a result, the current results
based on better scores in the previous experiment (Table 5). are based on the provided information. There may be other
Similarly, we keep the LG from its group [14]. Among the datasets that are more recent, covering different languages
other fake news detection baselines, we include all (FANG, or target audiences, aligned with other fake news outlets
exBAKE, 2-Stage Tranf, Grover and Declare). We also keep (sources). They may have been missed in these results. In
the other baselines (TextCNN and XGBoost). We evaluate future, we would like to use other datasets such as NELA-GT-
the performance of the models using the AUC measure. 20 [57], or scrape more news sources from various websites
The results show that our FND-NS model outperforms all and social media platforms.
the models for early fake news detection. FND-NS also shows Due to concept drift, the model trained on our datasets
a steady performance during all these detection deadlines. may have biases [67], causing some legitimate news sites to
The autoregressive modelling (decoder) in FND-NS helps in be incorrectly labelled. This may necessitate a re-labelling
modelling future values based on past observations. We have and re-evaluation process using more recent data.
more observations listed below: According to recent research [21], the producers of disin-
formation change their tactics over time. We also want to see
• The autoregressive models (GPT-2, Grover) perform better how these tactics evolve and incorporate these changes into
for early detection, probably because these models implic- our detection models.
itly assume future values based on previous observations. At the moment, we evaluate our models on a binary classi-
• The autoencoding models (BERT, exBAKE) show rel- fication problem. Our next step will be to consider multi-label
atively lower performance than autoregressive models classification, which will broaden the model’s applicability
(GPT-2, Grover) in the early detection tasks. This is to various levels of fake news detection.
because these models are for representation learning tasks.
These models perform well when more training data is fed 7.2 Ground truth source-level labels for news
into the models (as seen with the better performance in articles
BERT-c in Table 5), but the deadline constraints have per-
haps limited their capacity to do early detection. We have used the Media Bias Fact Check’s source-level
• The FANG, exBAKE, 2-Stage Transf., TriFN, Declare, ground truth labels as proxies for the news articles. Accord-
VGCN-BERT perform better during later time steps. This ing to previous research, the choice of ground truth labels
is understandable, as the model learns over time. impacts downstream observations [68]. Our future research
• The LG, TextCNN and XGBoost do not perform as good should evaluate models using different ground truth from
as the other baselines. fake and mainstream news sites. Furthermore, some sources
consider more fine-grained fake news domains and more spe-
Overall, the results suggest that since linguistic features of cific subcategories. Understanding whether existing models
the fake news and the social contexts on the fake news are less perform better in some subcategories than others can provide
available during the early stage of the news, we see the lower helpful information about model bias and weaknesses.
performance of all the models during the early timesteps.
Our model shows better accuracy than other models because 7.3 Weak supervision
we consider both the news and the social contexts. The news
data and the social media posts contain sufficient linguistic Motivated by the success of weak supervision in similar
features and are supplementary to each other, which helps us previous works [9, 57, 69], we are currently using weak
determine the fake news earlier than the other methods. supervision to train deep neural network models effectively.
In our specific scenario, applying this weak supervision
scheme to the fake news classification problem also reduced
the model development time from weeks to days. Moreover,
despite noisy labels in weakly labelled training data, our

123
AUC score AUC score

0.0
0.2
0.4
0.6
0.8

0.0
0.2
0.4
0.6
0.8
FND-NS FND-NS

FANG FANG

exBAKE exBAKE
2-Stage Tranf 2-Stage Tranf
TriFN TriFN
Grover Grover

c
a
Declare Declare
15 min

30 min
LG LG
AUC score
BERT-c BERT-c

0.0
0.2
0.4
0.6
0.8
VGCN-BERT VGCN-BERT
FND-NS
GPT-2 GPT-2
FANG
Text CNN Text CNN
exBAKE
International Journal of Data Science and Analytics (2022) 13:335–362

XGBoost XGBoost
2-Stage Tranf
TriFN
Grover

e
AUC score AUC score
Declare
0.0
0.2
0.4
0.6
0.8

120 min
0.0
0.2
0.4
0.6
0.8

LG
BERT-c FND-NS
FND-NS
VGCN-BERT FANG FANG
GPT-2 exBAKE exBAKE
Text CNN 2-Stage Tranf 2-Stage Tranf
XGBoost TriFN TriFN

Grover Grover

d
b

Declare Declare
100 min
60 min

LG LG

deadline. d Fake news detection based on 100-min. deadline. e Fake news detection based on 120-min. deadline
BERT-c BERT-c

VGCN-BERT VGCN-BERT
GPT-2 GPT-2
Text CNN Text CNN
XGBoost XGBoost

Fig. 11 a Fake news detection based on 15-min. deadline. b Fake news detection based on 30-min. deadline. c Fake news detection based on 30-min.

123
357
358 International Journal of Data Science and Analytics (2022) 13:335–362

results show that our proposed model performs well using However, we must be cautious to avoid negative transfer,
weakly labelled data. However, we acknowledge that if we which is an open research problem.
rely too much on weakly labelled data, the model may not We conducted preliminary research to understand the
generalize in all cases. This limitation can be overcome by transferability between the source and target domains to
considering manual article-level labelling, which has its own avoid negative transfer learning. After that, we choose MNLI
set of consequences (e.g. laborious and time-consuming pro- to extract knowledge based on appropriate transferability
cess). measures for learning fake news detection and user credi-
In future, we intend to use semi-supervised learning bility. We understand that an entire domain (for example,
[70] techniques to leverage unlabelled data using structural from MNLI) cannot be used for transfer learning; however,
assumptions automatically. We could also use the transfer for the time being, we rely on a portion of the source domain
learning technique [71] to pre-train the model only on fake for useful learning in our target domain. The next step in this
news data. Furthermore, we plan to try knowledge-based research will be to identify a more specific transfer learning
weak supervision [54], which employs structured data to domain.
label a training corpus heuristically. The knowledge-based
weak supervision also allows the automated learning of an 7.6 User credibility
indefinite number of relation extractors.
As previously stated, we transfer relevant knowledge from
7.4 User profiles MNLI to user credibility, and we admit that the relatedness
between the two tasks can be partial. In future, we plan
Another limitation in this study is that we only use a small to get user credibility scores through other measures such
portion of users’ profiles from the currently available dataset as FaceTrust [76], Alexa Rank [77], community detection
(i.e. Fakeddit). Though Fakeddit covers users’ interactions algorithms [48], sentiment analysis [33] and profile ranking
over a long range of timestamps, we could only use a portion techniques [49].
because we need to match users’ interactions (social con-
texts) from the Fakeddit dataset with the timeline of news 7.7 Baselines
data from the NELA-GT-19 dataset. This limitation, how-
ever, only applies to our test scenarios. The preceding issue We include a variety of baseline methods in our experiment.
will not arise if a researcher or designer uses complete data While we choose algorithms with different behaviours and
to implement our model on their social media platform. benchmarking schemes in mind, we must acknowledge that
One future direction for our research is to expand the mod- our baseline selection is small compared to what is avail-
elling of users’ social contexts. First, we can include user able in the entire field. Our ultimate goal is to understand
connections in a social network in our model. User connec- broad trends. We recognize that our research does not eval-
tions information can reveal the social group to which a user uate enough algorithms to make a broad statement about the
belongs and how the social network nurtures the spread of whole fake news detection field.
fake news. Second, we may incorporate user historical data to
better estimate the user status, as a user’s tendency to spread 7.8 Sequence length
fake news may change over time.
Another approach is to crawl more real-world data from We find that the difference in sequence length is the most
news sites and social media platforms (such as Twitter) to critical factor contributing to FND-NS outperforming the
include more social contexts, which could help identify more benchmark models in our experiments. We acknowledge
fake news patterns. Crawling multi-modal data such as visual that most of the models used in this study do not support
content and video information can also be useful for detecting the sequence length larger than 512. We did not shorten the
fake news. sequence lengths during the ablation study, but ablation of
Our proposed fake news detection method can be applied heavier features such as the news body or headline tends to
to other domains, such as question-answering systems, news reduce total sequences, which is why our model performed
recommender systems [47, 72], to provide authentic news to differently (worse than expected) during the ablation study.
readers. Nevertheless, we would like to draw the readers’ attention to
a trade-off between the model’s predictive performance and
7.5 Transfer learning computational cost. In our experiments, models that consider
shorter sequences sacrifice some predictive performance for
We have used transfer learning to match the tasks of fake relatively shorter processing time. The predictive power of
news detection and user credibility classification. We have the classifiers usually improves by increasing the sequence
evidence that the MNLI can be useful for such tasks [73–75]. length [19, 63] that we choose to work with.

123
International Journal of Data Science and Analytics (2022) 13:335–362 359

7.9 Experimental set-up Notation Description

yi ∈ {0, 1} Label yi  1 is fake news; yi  0


Another limitation of this study is the availability of lim- is real news
ited resources (like GPUs, memory, data storage, etc.), due  
U  u 1 , u 2 , . . . , u |U | Set of users; |U | is the number of
to which we could not perform many experiments on other users
large-scale data sources. In future, we plan to expand our (u 0 , sc0 , t0 ) A tuple, user u and social context
experiments using better infrastructure. sc during timestamp t
So far, our model is trained offline. To satisfy the real- C(n i ) Content of news
time requirement, we just need to train and update the model SC(n i )  ((u 0 , sc0 , t0 ), . . . ,) Sequence of a user’ social contexts
periodically. on a news item, |sc| is the size of
SC
ŷ(n i ) ∈ {0, 1} Predicted label for news item n i
ŷ(n i )  M(n i , SC(n i )) Model M predicts a label for news
item based on its news features
8 Conclusion and social contexts
X  {x1 , x2 , . . . , xk } Sequence of k input vector
In this paper, we propose a novel deep neural framework representations, k is the length
 
for fake news detection. We identify and address two unique X   x1 , x2 , . . . , xk Sequence of embedding vectors
challenges related to fake news: (1) early fake news detec- from X
tion and (2) label shortage. The framework has three essential  →Y
f : X 1:k 1:l Mapping f from input sequence of
parts: (1) news module, (2) social contexts module and (3) k vectors to a sequence of l target
detection module. We design a unique Transformer model vectors
for the detection part, which is inspired by the BART archi- x  , X  Output vector representation from
input x  , and sequence of output
tecture. The encoder blocks in our model perform the task vectors of x 
of representation learning. The decoder blocks predict the κi ; vi ; qi ;K Key vector; value vector; query
future behaviour based on past observations, which also helps vector; set of key vectors
us address the challenges of early fake news detection. The Wv , Wk ,Wq , SoftMax Trainable weight vectors of κ; v; q,
decoders depend on the working of the encoders. So, both activation function
modules are essential for fake news detection. To address X 1:k ; Y1:l Contextualized input sequence to
the label shortage issue, we propose an effective weak super- decoder; the target vector
vision labelling scheme in our framework. To sum up, the sequence
inclusion of rich information from both the news and social f θenc ; pθdec , L1:k  1 , . . . , k Encoder function, decoder
function; logit vector
contexts and weak labels proves helpful in building a strong
y  ; y  Vector representation of y  , and y 
fake news detection classifier.
< S > ; [S ];h [S ] Token in decoder; last state of the
token; hien ste
p ∈ [0, 1]2 Probability distribution over
Declarations classes [0,1]
W ; b; h Project matrix; bias term;
Conflict of interest On behalf of all authors, the corresponding author
cross-entropy function
states that there is no conflict of interest.
X ; X  ; X  ;X Input sequence to encoder;
sequence generated from X ;
sequence generated from X  ;
output encoding sequence from
Appendix A: Notations used in paper X 
Y ; Y  ; Y  ; Y  ;Y Target sequence in decoder;
sequence generated from Y ; Y  ;
Y  ; and Y  , respectively
Notation Description y j ; ŷ j Ground truth label; model
  prediction
N  n 1 , n 2 , . . . , n |N | Set of news items, |N | is the size of
the news dataset

123
360 International Journal of Data Science and Analytics (2022) 13:335–362
