Defence University College of Engineering: M-Tech Thesis Progress Report
College Of Engineering
June, 2021
Bishoftu
I. Introduction
Hate and dangerous speech is a serious and growing problem in Ethiopia, both online and offline. Hate
speech on social media has unfortunately become a common occurrence in the Ethiopian online
community, largely due to the substantial growth in the number of social media users in recent years.
It has contributed to the growing ethnic tensions and conflicts across the country. There were 21.14
million internet users in Ethiopia in January 2020. The number of internet users in Ethiopia increased
by 534 thousand (+2.6%) between 2019 and 2020. Internet penetration in Ethiopia stood at 19% in
January 2020. There were 6,137,000 Facebook users in Ethiopia in January 2020, which accounted for
5.3% of its entire population. The majority of them were men (68.9%). People aged 25 to 34 were the
largest user group (3,800,000) [1].
ARTICLE 19 is concerned about the text and the misuse of Ethiopia's hate speech and
disinformation law against those who are critical of the Government’s policies. The Proclamation to
Prevent the Spread of Hate Speech and False Information, which took effect on 23 March 2020, is
extremely problematic from a human rights and free speech perspective and should be immediately
revised. In any case, while the Proclamation remains in effect, it must not be misused and the
Government must not abuse its power under the pretext of addressing the public health crisis.
A large number of people have recently joined Online Social Networking (OSN) websites, mainly due
to the worldwide availability of low-cost Internet access. According to a survey conducted by Statista, by
January 2020, more than 4.54 billion people were active Internet users, accounting for 59% of the
global population. Among these users, 3.8 billion people were active users of OSN websites, which
represents 83.70% of the total Internet users.
Hate speech detection is a relatively new research area. With social media being used on a daily basis,
the usage of hate speech has increased as well. Social media companies rely on user reports of hateful
content as well as on manual filtering.
With widespread use of the Internet, large numbers of users take to various social media and online
forums to express their opinions and thoughts on numerous subject matters. Social media provides
benefits such as anonymity, which allows people to misuse the freedom of speech to convey hatred
towards others. As this is a serious issue, social media corporations such as YouTube,
Instagram, Facebook and Twitter are continuously looking for ways to detect hate speech. Previously
they relied on users to report such content. As artificial intelligence is on the rise, these companies have
turned to machine learning techniques to optimize hate speech detection.
[Figure: hate speech detection workflow with the stages Raw Comments, Text Processing, Classification, and Test Data]
Natural language processing is a branch of artificial intelligence that enables a computer system to
understand the natural language used by humans. However, it is very difficult for machines to understand
emotions. Sentiment analysis is the process of identifying the emotion expressed in a piece of language,
i.e. positive, negative or neutral. Sentiment analysis is done with the help of natural language
processing to determine the polarity of the language.
Social media platforms currently provide localization, which allows users to use different world
languages on their sites. One of these languages is Amharic, one of the most widely spoken languages
and the working language of the federal government of Ethiopia. The language is written left-to-right
and has its own unique script, which lacks capitalization and totals 275 characters, mainly
consonant-vowel pairs. It is the second-largest Semitic language after Arabic. The current estimated
population of the country is 107.53 million. Amharic is still an under-resourced language, with few
computing tools and few hate or offensive speech detection tools or research studies that propose a solution.
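Because Amharic has its own script, a simple first preprocessing step is to keep only Ethiopic characters in a comment. The following is a minimal sketch, not the method of this study: the function name `keep_ethiopic` is hypothetical, and it assumes that the main Unicode Ethiopic block (U+1200 to U+137F) is sufficient for the comments of interest.

```python
import re

# Assumption: the main Unicode Ethiopic block (U+1200 to U+137F) covers the
# Amharic characters of interest; the extended Ethiopic blocks are ignored.
NON_ETHIOPIC = re.compile(r"[^\u1200-\u137F\s]+")
WHITESPACE = re.compile(r"\s+")


def keep_ethiopic(comment: str) -> str:
    """Replace every run of non-Ethiopic, non-space characters with a space,
    then collapse repeated whitespace."""
    stripped = NON_ETHIOPIC.sub(" ", comment)
    return WHITESPACE.sub(" ", stripped).strip()


print(keep_ethiopic("ሰላም hello 123 ኢትዮጵያ!!"))  # ሰላም ኢትዮጵያ
```

Note that this range also keeps Ethiopic punctuation and numerals, which a real pipeline may want to handle separately.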
Nowadays, in Ethiopia, it is an open secret that hate speech and calls for violence and attacks on
particular individuals or groups, based on their political views, ethnic origin, and religious
affiliation, have recently become widespread. Therefore, it is important to monitor or automatically
detect hate speech on these platforms to prevent its spread and possibly reduce acts of violence and
hate crimes that destroy the lives of individuals, families, communities, and also the country.
This research investigates how Natural Language Processing techniques can contribute to the
detection of hate speech. It also focuses on exploring and applying a currently effective
method for this classification task on a Facebook dataset.
Therefore, as technology develops, we create a system that determines whether a Facebook comment is
hate speech or not, using the LSTM method as a classifier. The output of this system is a label of the
form “hate speech” or “non-hate speech” for every comment given as input to the system.
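To illustrate how an LSTM turns a sequence of word vectors into one of the two labels, the following is a minimal sketch of the forward pass only. It is not the model of this study: the class name `TinyLSTMClassifier`, the layer sizes, and the input vectors are hypothetical, the weights are random and untrained, and bias terms are omitted for brevity.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dot(weights, vector):
    return sum(w * v for w, v in zip(weights, vector))


class TinyLSTMClassifier:
    """Untrained single-layer LSTM followed by a sigmoid output unit."""

    def __init__(self, input_size: int, hidden_size: int, seed: int = 0):
        rng = random.Random(seed)
        width = input_size + hidden_size

        def matrix():
            return [[rng.uniform(-0.5, 0.5) for _ in range(width)]
                    for _ in range(hidden_size)]

        # One weight matrix per gate: input (i), forget (f), output (o), candidate (g).
        self.gates = {name: matrix() for name in "ifog"}
        self.out_weights = [rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
        self.hidden_size = hidden_size

    def step(self, x, h, c):
        z = list(x) + list(h)  # current input concatenated with previous hidden state
        i = [sigmoid(dot(row, z)) for row in self.gates["i"]]
        f = [sigmoid(dot(row, z)) for row in self.gates["f"]]
        o = [sigmoid(dot(row, z)) for row in self.gates["o"]]
        g = [math.tanh(dot(row, z)) for row in self.gates["g"]]
        # New cell state: keep part of the old state, add part of the candidate.
        c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
        return h_new, c_new

    def predict_proba(self, sequence) -> float:
        h = [0.0] * self.hidden_size
        c = [0.0] * self.hidden_size
        for x in sequence:
            h, c = self.step(x, h, c)
        return sigmoid(dot(self.out_weights, h))  # score from the last hidden state

    def predict(self, sequence, threshold: float = 0.5) -> str:
        is_hate = self.predict_proba(sequence) >= threshold
        return "hate speech" if is_hate else "non-hate speech"


model = TinyLSTMClassifier(input_size=2, hidden_size=3)
comment_vectors = [[0.2, -0.1], [0.7, 0.3], [-0.4, 0.9]]  # hypothetical word vectors
print(model.predict(comment_vectors))
```

Since the weights are untrained, the printed label carries no meaning; in practice the weights would be learned from labeled comments with a deep learning library.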
Literature Review
This section presents a comprehensive review of related work in the area of automatic detection
of hate speech in social media comments, particularly on Facebook, in order to clearly understand the
general techniques, methods, and results of existing studies.
In their work on Amharic hate speech [1], Mossie and Wang performed a preliminary study on hate speech
detection for the Amharic language. They created a dataset of 1821 posts and comments from Facebook
and 4299 keyword and phrase instances extracted from the posts and comments, and then classified the
speech as “hate” or “not hate” using word2vec and TF-IDF for feature extraction. Among the machine
learning classifiers, the Naïve Bayes (NB) model achieved 73.02% and 79.83% accuracy and Random Forest
achieved 63.55% and 65.34% accuracy for the two feature sets, respectively. The authors conclude that
the result is promising for processing a large volume of social network data. The study treats hate
speech detection as a binary problem.
In this paper [3], Yanling Zhou and Yanyan Yang presented the principles of three types of text
classification methods, ELMo, BERT and CNN, and applied them to hate speech detection. They then
improved the performance by fusion from two perspectives: the fusion of the classification results of
ELMo, BERT and CNN, and the fusion of the classification results of three CNN classifiers with
different parameters. The results showed that fusion processing is a viable way to improve the
performance of hate speech detection, and that a practically significant gain in performance can be
achieved at little extra cost.
In this paper [2], Garima Koushik and Dr. Prof. K. Rajeswari studied a machine learning model that
classifies a tweet into one of two categories, hate speech or non-hateful speech, using natural
language processing in Python. The authors used a publicly available Twitter dataset and extracted
bag-of-words and TF-IDF (term frequency-inverse document frequency) features, which were then used
to train a logistic regression classifier. Their results show that the classifier achieves
94.11% accuracy in detecting whether a tweet is hateful or not. Before extracting the features from
the tweets, the authors visualized the tweets for better understanding. They used separate word cloud
plots for hate tweets and non-hate tweets, in which the larger words are the most frequent and the
smaller words occur less frequently. This helps in understanding which words are most often used in
offensive or hate speech.
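The two feature types mentioned above can be sketched in a few lines. This is an illustrative sketch using the plain textbook definitions (term frequency multiplied by the logarithm of the inverse document frequency), not the code of the reviewed paper; the function names and the example tokens are hypothetical, and library implementations often use smoothed variants of the formula.

```python
import math
from collections import Counter


def bag_of_words(tokens):
    """Count how often each token occurs in one document."""
    return Counter(tokens)


def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical pre-tokenized comments.
docs = [["you", "are", "kind"], ["you", "are", "awful"], ["awful", "awful", "people"]]
print(bag_of_words(docs[2]))
print(tf_idf(docs)[2])
```

A term that occurs in every document receives a weight of zero under this definition, which is why TF-IDF downweights very common words relative to raw bag-of-words counts.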
In this paper [3], Hao-Ren Yao and Eugene Yang proposed a solution for the automatic detection of
hate tweets using machine learning with the help of bag-of-words and TF-IDF features. The authors used
a logistic regression classifier to classify the tweets into one of the two categories. They held out
30% of the dataset as test data and used the remainder as training data. Using bag-of-words features
they obtained 94.11% accuracy, whereas the TF-IDF features gave an accuracy of 94.62%. Both feature
sets give almost the same accuracy. The performance of the model is good; however, it could be made
more accurate by using different classifiers and by incorporating linguistic features.
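The 30% hold-out evaluation described above can be sketched as follows. This is a minimal illustration under assumed names (`train_test_split`, `accuracy`) with a fixed random seed for repeatability; it is not the code of the reviewed paper.

```python
import random


def train_test_split(samples, test_fraction=0.30, seed=42):
    """Shuffle the samples and hold out a fraction of them as test data."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (training data, test data)


def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


train, test = train_test_split(range(10))
print(len(train), len(test))  # 7 3
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```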
In this paper [4], the goal of the work is to explore a new way to automatically unearth hate speech on
Facebook. To achieve this goal, the process involves four stages: the discovery stage, the
sensitive social data collection stage, the sentiment and emotion analysis stage, and the clustering
stage. The paper introduces a new approach to identify pages that promote hate speech by posting
comments related to controversial topics on Facebook. The comments in those postings are
clustered to filter the sensitive topics discussed by all the users and are then automatically analysed
to detect hatred. For this reason, the work shares many similarities with the work done by Ben-David
and Matamoros-Fernandez. However, those authors focused on detecting hate speech promoted by
specific political parties in Spain, and to that end analysed the posts from eight Facebook
pages. In contrast, the proposed framework starts from some seed pages that are not restricted to
political parties. Identifying hate speech on Facebook is a challenging task, since many Facebook
users try hard to conceal their real intentions. To tackle this problem, the authors used graph
analysis to identify pages that potentially promote hate speech. By applying sentiment and emotion
analysis, the most negative posts and comments were obtained. K-means clustering was then applied to
determine the most discussed topics. Analysis of the topics generated by the best combination of
parameters shows that this new approach yields promising results.
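The clustering stage relies on K-means, which alternates between assigning each item to its nearest centroid and moving each centroid to the mean of its members. The following is a minimal sketch of that loop; the two-dimensional points are hypothetical stand-ins for comment feature vectors, and the function name and parameters are assumptions rather than the setup of the reviewed paper.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    return tuple(sum(values) / len(points) for values in zip(*points))


def k_means(points, k, iterations=20, seed=0):
    """Return (cluster index per point, list of centroids)."""
    centroids = random.Random(seed).sample(points, k)  # start from k random points
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = mean(members)
    return assignments, centroids


# Hypothetical 2-D feature vectors forming two well-separated groups.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
assignments, centroids = k_means(points, k=2)
print(assignments)
```

The cluster indices themselves are arbitrary; what matters is which points share an index.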
In this paper [5], the aim of the research is to identify the maximum number of Hate
Speech (HS) related tweets from Twitter as soon as they are posted by users. This issue falls within the
context of a binary classification problem where tweets are classified into two classes, i.e. either Hate
Speech (HS) or Non-Hate Speech (NHS). No proper or clear definition of HS has been established. A
statement that hurts someone may be called hate speech. Sometimes HS is also called an
abusive statement; however, few works used the term hate speech in their research.
In order to detect HS on social media platforms, researchers have mainly used two methodologies:
(i) the traditional machine learning approach and
(ii) the deep learning based approach.
The traditional machine learning based SVM classifier achieved the best performance, but it was limited
to precision, recall, and F1-score values of 0.80, 0.40, and 0.54, respectively, for detecting HS tweets,
which is not satisfactory. The classifier misclassifies the majority of HS tweets.
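The link between these metrics and the misclassified HS tweets can be made concrete with a small sketch. The counts below are hypothetical, chosen only so that precision is 0.80 and recall is 0.40; they are not taken from the reviewed paper, and the F1-score computed from them (about 0.53) need not match the reported 0.54 exactly.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall and F1 from true positives (tp),
    false positives (fp) and false negatives (fn)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical counts: of 100 HS tweets, 40 are detected and 60 are missed,
# and 10 non-HS tweets are wrongly flagged.
precision, recall, f1 = precision_recall_f1(tp=40, fp=10, fn=60)
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.8 0.4 0.53
```

A recall of 0.40 means that 60% of the actual HS tweets are missed, which is what is meant by the classifier misclassifying the majority of HS tweets.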
The basic deep learning model, a Convolutional Neural Network (CNN) with a single convolution layer
and pooling layer, yielded better performance. Another deep learning model, Long Short-Term Memory
(LSTM), was used to try to improve model performance. In the comparison of the proposed model with
existing models, the C-LSTM model provided a recall value of 0.43, which is lower than that of all
models tested so far. The LSTM model generally performs well on sequential data, where the model needs
to preserve the semantics of the words over a long span; however, HS detection is a subjective task in
which the model has to process all the words of the tweet and extract their context. This may be one
of the reasons behind the good performance of the CNN model, which worked with all the words of the
tweets to find their context.
In this paper [3], the authors built a new dataset to detect malice in the Bangla language, containing
hate speech in different categories such as religion, community, gender, and race. They used a web
scraper to collect the data and labeled each item in one of two categories: hate speech or not.
The contribution of the work is the creation of a new dataset for hate speech detection in the Bangla
language and the application of algorithms to detect it.
The authors divided the dataset into two groups and labeled them. There were anomalies in the dataset,
which were removed during processing. After that, features were extracted from the dataset for use in
the model. The Support Vector Machine and Naïve Bayes machine learning algorithms were then applied.
Both algorithms performed well on the dataset, and precision, recall and F1-score were reported for
both. Naïve Bayes gave an accuracy of 72%.
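For reference, the Naïve Bayes classifier used in several of the reviewed studies can be sketched as a multinomial model over word counts with add-one (Laplace) smoothing. The class name, the tiny training set and the English placeholder tokens below are hypothetical; this is an illustration of the technique, not the implementation of any reviewed paper.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
        self.vocabulary = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n_docs / total_docs)  # log prior of the class
            for token in tokens:
                if token in self.vocabulary:  # words never seen in training are ignored
                    score += math.log((counts[token] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical, pre-tokenized training comments.
train_docs = [["you", "are", "awful"], ["awful", "people"],
              ["you", "are", "kind"], ["kind", "people"]]
train_labels = ["hate speech", "hate speech", "non-hate speech", "non-hate speech"]
model = NaiveBayesTextClassifier().fit(train_docs, train_labels)
print(model.predict(["awful"]))  # hate speech
```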
II. Description of the study
Hate speech detection in Facebook comments is an important issue in the social network environment.
Since Facebook comments and their associated data are accessible to anyone over social networks,
social networks face many hate speech challenges. Many studies have tried to
solve hate speech issues, but some of these issues remain unsolved. One of the efficient
techniques to detect hate speech is the neural network. This study proposes the detection of hate
speech using a recurrent neural network, and collects and labels a dataset of Amharic Facebook comments.
Statement of the problem
Hate speech is a particularly nasty type of speech, but it is still speech, not conduct, and is covered
by the freedom of speech principle. In Ethiopia, hate speech and incitement to violence come from all
sides, including the government and influential media figures, and they are often posted in one of the
many languages used throughout the country.
Regulating problematic online content has become a pressing issue around the world. The ease and
speed with which harmful and dangerous content is disseminated and accessed via social media
particularly reinforces this challenge. Many countries have adopted or are considering some form of legislation
to regulate problematic online content, be it hate speech, fake news, or extremist and terrorist content.
Like many other nations, Ethiopia is grappling with the serious and growing problem of hate speech
and misinformation.
The general objective of this research work is to develop a machine learning based model to detect
hate speech in Facebook comments. The specific objectives are:
(i) To design and develop a prototype to demonstrate a model using a recurrent neural network or
other deep learning models.
(ii) To develop an LSTM based model and compare it with other RNN and other deep learning
models using our own dataset.
(iii) To evaluate the performance of the hate speech detection system using
evaluation metrics.
Scope of the study
The scope of this study is to provide a new system for hate speech detection in Amharic Facebook
comments. In this research, the proposed approach applies a classification technique using the
Recurrent Neural Network algorithm, which is useful for forming the patterns on which the
classification is based. This study aims to create a system that detects hate speech comments and
identifies the maximum number of Hate Speech (HS) comments by building a new Amharic dataset and
utilizing multiple machine learning techniques.
The study has the following limitations:
(i) Due to limited resources, the dataset annotation process was challenging, and there was a lack of
legal experts on hate speech to consult.
(ii) Due to the lack of a standard Amharic stemmer and stop word list, the study
did not apply these two preprocessing methods.
(iii) Due to the tight schedule, the study only implemented the proposed machine learning classifier
and was limited to developing and evaluating algorithms.
V. Future work plan
4. Collecting, cleaning, filtering, and consolidating data into one file or data table
5. Deploying the proposed system on Jupyter notebooks
6. Installing and configuring Flask for making web services in Python
6. References
[2] G. Koushik, K. Rajeswari, and S. K. Muthusamy, “Automated hate speech detection on Twitter,” Proc. - 2019 5th
Int. Conf. Comput. Commun. Control Autom. ICCUBEA 2019, 2019, doi: 10.1109/ICCUBEA47591.2019.9128428.
Detect Hate Speech in Bangla Language,” Proc. 2019 8th Int. Conf. Syst. Model. Adv. Res. Trends, SMART 2019,
[4] A. Rodriguez, C. Argueta, and Y. L. Chen, “Automatic Detection of Hate Speech on Facebook Using Sentiment
and Emotion Analysis,” 1st Int. Conf. Artif. Intell. Inf. Commun. ICAIIC 2019, pp. 169–174, 2019, doi:
10.1109/ICAIIC.2019.8669073.
[5] P. K. Roy, A. K. Tripathy, T. K. Das, and X.-Z. Gao, “A Framework for Hate Speech Detection Using Deep
Convolutional Neural Network,” IEEE Access, vol. 8, pp. 204951–204962, 2020, doi:
10.1109/access.2020.3037073.
VI. Supervision of my study
My supervisor has provided supervision at least once a week on a regular basis, from the
proposal stage to the date of writing this progress report, by reading and commenting on drafts. Most of
the time he is accessible, within reason (e.g., by phone or email), outside planned supervision meetings
when advice is required. He provides me with guidance on the nature and requirements of the research
degree being pursued and the standards expected. This includes providing me with a clear understanding
of the main aspects of undertaking postgraduate research, the nature of a research degree awarded at DEC,
and the form and structure of a thesis. He also provides guidance and advice to ensure that the research
can be completed, including the preparation of the thesis, normally by the end of the minimum submission
period. As my advisor, he assists me in producing a detailed work plan and timetable for my
research, monitors my progress in relation to this plan, and completes my Progression Reports as
required in a timely fashion.
Name of the student: Tewodros A. No. RPG/0026/12
Evaluation for Thesis Progress Report
1. Understanding of major theories within area of specialization and the thesis retain a
focus on the stated research problem and the proposed argument
Satisfactory Unsatisfactory
2. Knowledge of major recent research within area of specialization, critical use of a
wide range of literature and theories
Satisfactory Unsatisfactory
3. Understanding of methods appropriate for research and ability to apply appropriate
theories and methods to his / her research within area of specialization.
Satisfactory Unsatisfactory
4. Adequacy of research design and execution.
Satisfactory Unsatisfactory
5. Thesis organization in an appropriate academic style, progression and structure of
its arguments.
Satisfactory Unsatisfactory
6. Ability to effectively summarize and present ideas in writing as well as in oral
presentation.
Satisfactory Unsatisfactory
7. Progress made so far irrespective of the entire thesis work.
Satisfactory Unsatisfactory
HOD Comments and Approval:
The student will then be informed about the thesis progress results and will also be given strict
instructions to undertake the corrections suggested by the thesis progress report evaluation committee
with the help of this part.
Satisfactory Unsatisfactory