Defence University

College Of Engineering

M-Tech Thesis Progress Report

Thesis Title
DETECTION OF HATE SPEECH FACEBOOK COMMENT
USING RECURRENT NEURAL NETWORK
By
Tewodros Ambasajer
Supervisor: Dr. Taye Girma

Department: Department of Computer Science and Engineering

Specialization: Computer Engineering

June, 2021
Bishoftu
I. Introduction

Hate and dangerous speech is a serious and growing problem in Ethiopia, both online and offline. Hate
speech on social media has unfortunately become a common occurrence in the Ethiopian online
community, largely due to the substantial growth in the number of social media users in recent years.
It has contributed to the growing ethnic tensions and conflicts across the country. There were 21.14
million internet users in Ethiopia in January 2020. The number of internet users in Ethiopia increased
by 534 thousand (+2.6%) between 2019 and 2020. Internet penetration in Ethiopia stood at 19% in January
2020. There were 6,137,000 Facebook users in Ethiopia in January 2020, which accounted for 5.3% of
its entire population. The majority of them were men (68.9%). People aged 25 to 34 were the largest
user group (3,800,000) [1].

ARTICLE 19 has raised concerns about the text and the potential misuse of Ethiopia's hate speech and
disinformation law against those who are critical of the Government’s policies. The Proclamation to
Prevent the Spread of Hate Speech and False Information, which took effect on 23 March 2020, is
extremely problematic from a human rights and free speech perspective and should be immediately
revised. In any case, while the Proclamation remains in effect, it must not be misused and the
Government must not abuse its power under the pretext of addressing the public health crisis.

A large number of people have recently joined Online Social Networking (OSN) websites, mainly due
to the worldwide availability of low-cost Internet. According to a survey conducted by Statista, by
January 2020, more than 4.54 billion people were active Internet users, accounting for 59% of the
global population. Among these users, 3.8 billion people were active users of OSN websites, which
represents 83.70% of the total Internet users.

Hate speech detection is a relatively new research area. As social media has come into daily use,
the use of hate speech has increased as well. Social media companies rely on users to report hateful
content as well as on manual filtering.

With the widespread use of the Internet, large numbers of users take to various social media and online
forums to express their opinions and thoughts on numerous subject matters. Social media provides
benefits such as anonymity, which allows people to misuse the freedom of speech to convey hatred
towards others. As this is a serious issue, social media corporations such as YouTube,
Instagram, Facebook and Twitter are continuously looking for ways to detect hate speech. Previously
they relied on users to report such content. As artificial intelligence is on the rise, these companies
have turned to machine learning techniques to improve hate speech detection.

[Figure 1: block diagram. Raw Comments -> Text Processing -> Feature Extraction -> Classification
(with Training Data and Test Data) -> Hate Speech / Not Hate Speech]

Figure 1. Block Diagram of Hate Speech Detection
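
The text processing and feature extraction stages of Figure 1 can be illustrated with a minimal Python sketch. The report does not specify the exact preprocessing steps, so everything here is an assumption made for illustration: the function names `clean_comment`, `build_vocab`, and `encode` are hypothetical, comments are filtered to the Ethiopic Unicode block (U+1200 to U+137F), and tokens are split on whitespace.

```python
import re

# Assumption: Amharic text is identified by the Ethiopic Unicode block (U+1200 to U+137F).
ETHIOPIC_PUNCT = re.compile(r"[\u1360-\u1368]")     # Ethiopic punctuation, e.g. the full stop
NON_ETHIOPIC = re.compile(r"[^\u1200-\u137F\s]")    # Latin letters, digits, emoji, symbols


def clean_comment(text):
    """Keep only Ethiopic characters and collapse runs of whitespace."""
    text = ETHIOPIC_PUNCT.sub(" ", text)   # punctuation becomes a token boundary
    text = NON_ETHIOPIC.sub(" ", text)     # drop everything outside the Ethiopic block
    return " ".join(text.split())


def build_vocab(comments):
    """Map each distinct token to an integer id (0 = padding, 1 = unknown token)."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for comment in comments:
        for token in clean_comment(comment).split():
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(comment, vocab, max_len=10):
    """Turn a comment into a fixed-length list of token ids, padded with 0."""
    ids = [vocab.get(token, 1) for token in clean_comment(comment).split()]
    ids = ids[:max_len]
    return ids + [0] * (max_len - len(ids))
```

Fixed-length integer sequences of this kind are a common input format for recurrent classifiers. No stemming or stop word removal is applied in this sketch.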

Natural language processing is a branch of artificial intelligence that enables a computer system to
understand the natural language used by humans. However, it is very difficult for machines to understand
emotions. Sentiment analysis is the process of identifying the emotion expressed in a text,
i.e. positive, negative or neutral. It relies on natural language
processing to determine the polarity of the text.
Social media sites currently provide localization, which allows users to use different world languages
on their sites. One of these languages is Amharic, one of the most widely spoken languages in Ethiopia
and the working language of its federal government. The language is written left-to-right and
has its own unique script, which lacks capitalization and totals 275 characters, mainly consonant-vowel
pairs. It is the second-largest Semitic language after Arabic. The current estimated population of
Ethiopia is 107.53 million. Amharic is still an under-resourced language: few computing tools exist for
it, and there are few hate or offensive speech detection tools or research studies that propose a solution.
Nowadays, in Ethiopia, the recent spread of hate speech and of calls for violence and attacks against
individuals or groups based on their political views, ethnic origin, and religious affiliation is an
open secret. Therefore, it is important to monitor or automatically detect hate speech on social media
in order to prevent its spread and possibly reduce the acts of violence and hate crimes that destroy the
lives of individuals, families, communities, and also the country.
This research investigates how Natural Language Processing techniques can contribute to the
detection of hate speech. It also focuses on exploring and applying a current, effective
method for this classification task on a Facebook dataset.
We therefore propose a system that classifies each Facebook comment as hate speech or not,
using a Long Short-Term Memory (LSTM) network as the classifier. The
output of the system is a label, “hate speech” or “non-hate speech”, for every
comment given to it as input.
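
To make the role of the LSTM concrete, the following is a minimal pure-Python sketch of a standard LSTM cell followed by a sigmoid output unit. It is not the thesis implementation: the helper names (`affine`, `lstm_step`, `init_params`, `predict_proba`), the dimensions, and the random untrained weights are all assumptions chosen for illustration, and in practice a deep learning framework would likely be used instead.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, U, b, x, h):
    """Pre-activations of one gate: W x + U h + b (W, U are lists of rows)."""
    return [
        sum(w * xi for w, xi in zip(W[j], x))
        + sum(u * hi for u, hi in zip(U[j], h))
        + b[j]
        for j in range(len(b))
    ]


def lstm_step(x, h, c, params):
    """One time step of a standard LSTM cell; returns the new (h, c)."""
    i = [sigmoid(z) for z in affine(*params["i"], x, h)]    # input gate
    f = [sigmoid(z) for z in affine(*params["f"], x, h)]    # forget gate
    o = [sigmoid(z) for z in affine(*params["o"], x, h)]    # output gate
    g = [math.tanh(z) for z in affine(*params["g"], x, h)]  # candidate cell values
    c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
    h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
    return h_new, c_new


def init_params(input_dim, hidden_dim, seed=0):
    """Random, untrained weights (assumption: small uniform initialisation)."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        name: (matrix(hidden_dim, input_dim), matrix(hidden_dim, hidden_dim), [0.0] * hidden_dim)
        for name in ("i", "f", "o", "g")
    }


def predict_proba(sequence, params, w_out, b_out):
    """Run the cell over a sequence of word vectors and score the last hidden state."""
    h = [0.0] * len(w_out)
    c = [0.0] * len(w_out)
    for x in sequence:
        h, c = lstm_step(x, h, c, params)
    return sigmoid(sum(w * hj for w, hj in zip(w_out, h)) + b_out)
```

With trained weights, a comment would be labeled “hate speech” when `predict_proba` exceeds a chosen threshold such as 0.5, and “non-hate speech” otherwise.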

Literature Review

This section reviews related work on the automatic detection of hate speech in Facebook comments and
on social media more broadly, in order to understand the general techniques, methods, and results of
existing studies.

Mossie and Wang [1] performed a preliminary study on hate speech detection for the Amharic language.
They created a dataset of 1,821 posts and comments from Facebook, with 4,299 keyword and phrase
instances extracted from those posts and comments, and classified the text as “hate” or “not hate”
using word2vec and TF-IDF for feature extraction. Of the machine learning classifiers, the Naïve Bayes
(NB) model achieved 73.02% and 79.83% accuracy and Random Forest achieved 63.55% and 65.34% accuracy
for the two feature sets, respectively. The authors conclude that the results are promising for
processing large volumes of social network data. The study treats hate speech detection as a binary
problem.
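
As an illustration of the TF-IDF features mentioned above, here is a minimal Python sketch. It uses the plain, unsmoothed definition (term frequency times the log of the inverse document frequency), which is an assumption: the reviewed study does not state which variant it used, and library implementations often apply smoothing. The function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    weight = (count of term in doc / doc length) * log(N / number of docs containing term)
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))   # count each term once per document

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights
```

A term that occurs in every document gets weight 0 under this definition, so words shared by all comments carry no weight for the classifier.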

In [3], Yanling Zhou and Yanyan Yang presented the principles of three text
classification methods, ELMo, BERT and CNN, applied them to hate speech detection, and then
improved performance by fusion from two perspectives: the fusion of the classification results of
ELMo, BERT and CNN, and the fusion of the classification results of three CNN classifiers with
different parameters. The results showed that fusion is a viable way to improve the
performance of hate speech detection, yielding a practically significant gain in
performance at little extra cost.

In [2], Garima Koushik and Dr. Prof. K. Rajeswari studied a machine learning model that classifies a
tweet into one of two categories, hate speech or non-hateful speech, using natural language
processing in Python. They used a publicly available Twitter dataset, extracted bag-of-words and
TF-IDF (term frequency-inverse document frequency) features, and used these features to train a
logistic regression classifier. Their results show that the classifier achieves 94.11% accuracy in
detecting whether a tweet is hateful or not. Before extracting the features, the authors visualized
the tweets for better understanding, plotting separate word clouds for hate and non-hate tweets, in
which larger words are the most frequent and smaller words occur less frequently. This helps in
understanding which words are most often used in offensive or hate speech.
In [3], Hao-Ren Yao and Eugene Yang proposed a solution for the automatic detection of hate tweets
using machine learning with bag-of-words and TF-IDF features. They used a logistic regression
classifier to classify the tweets into one of the two categories. They held out 30% of the dataset as
test data and used the remainder as training data. Using bag-of-words features they obtained 94.11%
accuracy, whereas TF-IDF features gave an accuracy of 94.62%, so both feature sets give almost the same
accuracy. The performance of the model is good; however, it could be made more accurate by using
different classifiers and by incorporating linguistic features.
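
The 70/30 split and the bag-of-words features described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the reviewed authors' code: the function names `train_test_split` and `bag_of_words`, the fixed random seed, and the shuffling strategy are all hypothetical choices.

```python
import random
from collections import Counter


def train_test_split(samples, test_fraction=0.3, seed=42):
    """Shuffle the samples and hold out `test_fraction` of them as test data."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)


def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary term occurs in one tokenized text."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]
```

For example, `bag_of_words(["a", "b", "a"], ["a", "b", "c"])` returns `[2, 1, 0]`: word order is discarded and only the counts remain.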

The goal of the work in [4] is to explore a new way to automatically unearth hate speech on
Facebook. The process involves four stages: the discovery stage, the sensitive social data
collection stage, the sentiment and emotion analysis stage, and the clustering stage. The paper
introduces a new approach to identify pages that promote hate speech by posting comments related to
controversial topics on Facebook. The comments in those postings are clustered to filter the
sensitive topics discussed by all the users and are then automatically analysed to detect hatred.
The work shares many similarities with that of Ben-David and Matamoros-Fernandez, who focused on
detecting hate speech promoted by specific political parties in Spain by analysing the posts from
eight Facebook pages. In contrast, the framework proposed in [4] starts from seed pages that are
not restricted to political parties. Identifying hate speech on Facebook is a challenging task,
since many Facebook users try hard to conceal their real intentions. To tackle this problem, the
authors used graph analysis to identify pages that potentially promote hate speech. By applying
sentiment and emotion analysis, they obtained the most negative posts and comments, and then applied
K-means clustering to determine the most discussed topics. Analysing the topics generated by the best
combination of parameters, the authors observe that the approach yields promising results.

The aim of the research in [5] is to identify the maximum number of Hate Speech (HS) related tweets
on Twitter as soon as they are posted by users. The task is a binary classification problem in which
tweets are classified as either Hate Speech (HS) or Non-Hate Speech (NHS). No clear definition of HS
is given; a statement that hurts someone may be called hate speech. HS is sometimes also called
abusive speech; however, few works used the term hate speech in their research.

To detect HS on social media platforms, researchers have mainly used two methodologies:
(i) the traditional machine learning approach and
(ii) the deep learning based approach.
Among the traditional machine learning models, the SVM classifier achieved the best performance, but it
was limited to precision, recall, and F1-score values of 0.80, 0.40, and 0.54, respectively, in
detecting HS tweets, which is not satisfactory. The classifier misclassifies the majority of HS tweets.
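
As a quick check on the SVM figures above, the F1-score is the harmonic mean of precision P and recall R. The worked equation below uses the standard definition with the rounded values quoted in the text:

```latex
F_1 = \frac{2 \, P \, R}{P + R}
    = \frac{2 \times 0.80 \times 0.40}{0.80 + 0.40}
    = \frac{0.64}{1.20}
    \approx 0.53
```

This agrees with the reported 0.54 up to rounding of the inputs. A recall of 0.40 means that only about 40% of the actual HS tweets are detected, which is why the classifier is said to misclassify the majority of them.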
A basic deep learning model, a Convolutional Neural Network (CNN) with a single convolution layer
and a pooling layer, yielded better performance. Another deep learning model,
Long Short-Term Memory (LSTM), was also used in an attempt to improve performance. In the comparison of
the proposed model with existing models, the C-LSTM model gave a recall of 0.43, which is
lower than that of all the other models tested. The LSTM model generally performs well on
sequential data, where the model needs to preserve the semantics of the words for a long time;
however, HS detection is a subjective task in which the model has to process all the words of a
tweet and extract their context. This may be one of the reasons behind the good performance of
the CNN model, which worked with the complete words of the tweets to find their context.

In [3], the authors built a new dataset for detecting malice in the Bangla language that contains
hate speech in different categories such as religion, community, gender, and race. They used a web
scraper to collect the data and labeled each item as either hate speech or not. After processing the
dataset to remove anomalies, they extracted features and applied the Support Vector Machine and Naïve
Bayes machine learning algorithms. Both algorithms performed well on their dataset; the authors
report precision, recall and F1-score for both, and Naïve Bayes achieved an accuracy of 72%. The
contribution of the work is a new dataset for hate speech detection in the Bangla language and the
application of these algorithms to it.

II. Description of the study

Hate speech detection in Facebook comments is an important issue in the social network environment.
Since Facebook comments and their data are accessible to anyone on the network, social networks face
many hate speech challenges. Many studies have tried to solve hate speech issues, but some of these
issues remain unsolved. One efficient technique for detecting hate speech is the neural network. This
work proposes the detection of hate speech using a recurrent neural network, and the collection and
labeling of a dataset of Amharic Facebook comments.

Statement of the problem

Hate speech is a particularly nasty type of speech, but it is still speech, not conduct, and is covered
by the freedom of speech principle. In Ethiopia, hate speech and incitements to violence come from all
sides, including the government and influential media figures, and they are often posted in one of
the many languages used throughout the country.

Regulating problematic online content has become a pressing issue around the world. The ease and
speed with which harmful and dangerous content is disseminated and accessed via social media
particularly reinforces this challenge. Many countries have adopted or are considering some form of legislation
to regulate problematic online content, be it hate speech, fake news, or extremist and terrorist content.
Like many other nations, Ethiopia is grappling with the serious and growing problem of hate speech
and misinformation.

The general objective of this research work is to develop a machine learning based model to detect
hate speech in Facebook comments.

The specific objectives of this thesis work are as follows:

• To collect and label a dataset of Amharic Facebook comments.
• To design and develop a prototype to demonstrate a model using a recurrent neural network or
  other deep learning models.
• To develop an LSTM based model and compare it with other RNN and deep learning
  models using our own dataset.
• To evaluate the performance of the hate speech detection system using
  evaluation metrics.
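
The report does not name the evaluation metrics to be used. As a minimal sketch, assuming the metrics reported in the reviewed literature (accuracy, precision, recall, and F1-score) and treating “hate speech” as the positive class, they could be computed from true and predicted labels as follows. The function names `confusion_counts` and `evaluate` are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive="hate speech"):
    """Count true positives, false positives, false negatives and true negatives."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def evaluate(y_true, y_pred, positive="hate speech"):
    """Return accuracy, precision, recall and F1 (0.0 when a ratio is undefined)."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Reporting precision and recall alongside accuracy matters when the two classes are imbalanced, since a classifier can reach high accuracy while missing most hate speech comments.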
Scope of the study
The scope of this study is to provide a new system for hate speech detection in Amharic Facebook
comments. The proposed approach applies classification using a Recurrent Neural Network algorithm,
which is useful for learning the patterns on which the classification is based. This study aims to
create a system that detects hate speech comments and identifies the maximum number of Hate Speech
(HS) comments by building a new Amharic dataset and using multiple machine learning models.

III. Progresses made to date

Since the approval of the proposal, we have been working on the thesis and looking for a clear way
forward for the work. We have been reviewing related papers that can help in accomplishing the
thesis. We studied Python, an easy to learn and powerful programming language, in order to develop a
machine learning application. We use Anaconda Navigator, which allows us to launch development
applications and easily manage conda packages, environments, and channels without using
command-line commands, and Jupyter Notebook, an open-source web application that allows us to create
and share documents that contain live code, equations, visualizations, and narrative text. Its uses
include data cleaning and transformation, numerical simulation, statistical modelling, data
visualization, and machine learning.
IV. Problems encountered
This study encountered limitations at different phases of the research process. There is a lack of
other studies on hate speech detection for the Amharic language to compare against, as well as a lack
of publicly shared datasets and models for hate speech detection. The other constraints of this study
are as follows.

• Because of limited resources, the dataset annotation process was challenging, and there was a
  lack of hate speech related legal experts to consult.
• Because there is no standard Amharic stemmer or stop word list, the study
  did not apply these two preprocessing methods.
• Because of the tight schedule, the study only implemented the proposed machine learning classifier
  and was limited to developing and evaluating algorithms.

V. Future work plan

Table 2.2: Work plan and time schedule (Phase 2)

Schedule columns: June, July, August, September, October (weeks 1 to 4 of each month)

No.  Activities
1    Start Phase II: collecting, configuring and installing software tools
2    Gathering the Amharic post and comment textual data from public Facebook pages
3    Annotating the dataset
4    Collecting, cleaning, filtering, and consolidating data into one file or data table
5    Deploying the proposed system on Jupyter notebooks
6    Installing and configuring Flask for making web services in Python
8    Analyzing the trained models, tested for accuracy on each feature
9    Finalizing the thesis write-up
10   Final presentation

References

[1] S. Kemp, “Digital 2021: Ethiopia.” [Online]. Available: https://datareportal.com/reports/digital-2021-ethiopia.

[2] G. Koushik, K. Rajeswari, and S. K. Muthusamy, “Automated hate speech detection on Twitter,” Proc. 2019 5th
Int. Conf. Comput. Commun. Control Autom. ICCUBEA 2019, 2019, doi: 10.1109/ICCUBEA47591.2019.9128428.

[3] S. Ahammed, M. Rahman, M. H. Niloy, and S. M. M. H. Chowdhury, “Implementation of Machine Learning to
Detect Hate Speech in Bangla Language,” Proc. 2019 8th Int. Conf. Syst. Model. Adv. Res. Trends, SMART 2019,
pp. 317–320, 2020, doi: 10.1109/SMART46866.2019.9117214.

[4] A. Rodriguez, C. Argueta, and Y. L. Chen, “Automatic Detection of Hate Speech on Facebook Using Sentiment
and Emotion Analysis,” 1st Int. Conf. Artif. Intell. Inf. Commun. ICAIIC 2019, pp. 169–174, 2019, doi:
10.1109/ICAIIC.2019.8669073.

[5] P. K. Roy, A. K. Tripathy, T. K. Das, and X.-Z. Gao, “A Framework for Hate Speech Detection Using Deep
Convolutional Neural Network,” IEEE Access, vol. 8, pp. 204951–204962, 2020, doi:
10.1109/access.2020.3037073.

VI. Supervision of my study

My supervisor has provided supervision at least once a week on a regular basis, from the
proposal to the date of this progress report, by reading and commenting on drafts. Most of the time
he is accessible, within reason (e.g., by phone or email), outside planned supervision meetings
when advice may be required. He provides guidance on the nature and requirements of the research
degree being pursued and the standards expected. This includes giving me a clear understanding of
the main aspects of undertaking postgraduate research, the nature of a research degree awarded at DEC,
and the form and structure of a thesis. He also provides guidance and advice to ensure the research
can be completed, including the preparation of the thesis, normally by the end of the minimum submission
period. As my advisor, he assists me in producing a detailed work plan and timetable for my
research, monitors my progress in relation to this plan, and completes my progress reports as
required in a timely fashion.

Name of the student: Tewodros A.    No. RPG/0026/12

Signature: ___ Date of submission:

Major Supervisor's Comments and Approval:

Suggestions / Comments:
__________________________________________________________________________________
______________________________________________________________________________
________________________________________________________________________________
__________________________________________________________________________________
______________________________________________________________________________
__________________________________________________________________________________
______________________________________________________________________________
The above report is a true representation of the work performed by the student to
the above date. I consider that the student's progress is (please tick the one relevant
to this report):
Satisfactory Unsatisfactory

Name: ____    Approval date: ____    Signature: ____

Evaluation for Thesis Progress Report

Title of the thesis: __________________________________________

Name of student: _________________________________ ID. No. ______________
A). How well do you think the student demonstrates the following learning outcomes
in his/her thesis progress?

1. Understanding of major theories within area of specialization and the thesis retain a
focus on the stated research problem and the proposed argument
Satisfactory Unsatisfactory
2. Knowledge of major recent research within area of specialization, critical use of a
wide range of literature and theories
Satisfactory Unsatisfactory
3. Understanding of methods appropriate for research and ability to apply appropriate
theories and methods to his / her research within area of specialization.
Satisfactory Unsatisfactory
4. Adequacy of research design and execution.
Satisfactory Unsatisfactory
5. Thesis organization in an appropriate academic style, progression and structure of
its arguments.
Satisfactory Unsatisfactory
6. Ability to effectively summarize and present ideas in writing as well as in oral
presentation?
Satisfactory Unsatisfactory
7. Progress made so far, irrespective of the entire thesis work.
Satisfactory Unsatisfactory

B). Any suggestions/comments?

Name of Examiner: ________    Signature: ________    Date: ________

HOD Comments and Approval:

The student will then be informed of the thesis progress results and will also be given strict
instructions, with the help of this part, to undertake the corrections suggested by the thesis
progress report evaluation committee.

As considered by the thesis progress report evaluation committee, the student's progress is
(please tick the one relevant to this report):

Satisfactory Unsatisfactory

Agreed corrections, suggestions and Comments:

__________________________________________________________________________________
______________________________________________________________________________
________________________________________________________________________________
__________________________________________________________________________________
______________________________________________________________________________
__________________________________________________________________________________
______________________________________________________________________________
________________________________________________________________________________
__________________________________________________________________________________
______________________________________________________________________________
__________________________________________________________________________________
______________________________________________________________________________