Arabic Language Sentiment Analysis On Health Services
Abstract: The social media network phenomenon has led to a massive amount of valuable data that is available online and easy to access. Many users share images, videos, comments, reviews, news and opinions on different social network sites, with Twitter being one of the most popular. Data collected from Twitter is highly unstructured, and extracting useful information from tweets is a challenging task. Twitter has a huge number of Arabic users who mostly post and write their tweets in Arabic. While there has been a lot of research on sentiment analysis in English, the amount of research and the number of datasets in Arabic are limited.

This paper introduces an Arabic language dataset of opinions on health services, collected from Twitter. The paper first details the process of collecting the data from Twitter and the process of filtering, pre-processing and annotating the Arabic text in order to build a large sentiment analysis dataset in Arabic. Several machine learning algorithms (Naïve Bayes, Support Vector Machine and Logistic Regression), alongside deep and convolutional neural networks, were used in our sentiment analysis experiments on the health dataset.

Keywords: Sentiment Analysis, Machine Learning, Deep Neural Networks, Arabic Language.

I. INTRODUCTION

In the past ten years, many social network sites (Facebook, Twitter, Instagram, etc.) have increased their presence on the web. These sites have an enormous number of users who produce a massive amount of data, including texts, images, videos, etc. According to [1], the estimated amount of data on the web will be about 40 thousand exabytes, or 40 trillion gigabytes, in 2020. Analysis of such data could be valuable. There are many different techniques for analyzing data collected from the web, with sentiment analysis a prominent one. Sentiment analysis (see for example [2]) is the study of people’s attitudes, emotions and opinions, and involves a combination of text mining and natural language processing. Sentiment analysis focuses on analyzing text messages that hold people’s opinions. Examples of topics for analysis include opinions on products, services, food, education, etc. [3].

Twitter is a popular social media platform where a huge number of tweets are shared, and many tweets contain valuable data. As reported in [4], in March 2014 active Arabic users wrote over 17 million tweets per day. Huge numbers of tweets are generated every minute, many of them in Arabic. Topics about health services appear frequently in Twitter trends. The aim of this paper is to introduce a new Arabic dataset on health services for opinion mining purposes, and to explain the process of collecting data from Twitter, pre-processing the Arabic text and annotating the dataset. After the dataset is collected and annotated, several processing tasks are applied, such as feature selection, machine learning algorithms and deep neural networks. The efficiency of these methods is assessed and compared.

This paper continues in Section II with a brief survey of work on sentiment analysis in English and Arabic. Section III details the collection, pre-processing and filtering that went into creating our dataset, followed by the annotation procedure. The use of deep neural networks and other machine learning methods on the dataset, including text feature selection, is described in Section IV. Finally, conclusions and ideas for future work are discussed in Section V.

II. RELATED WORK

There are many studies on sentiment analysis, and a variety of approaches have been developed. English has the greatest number of sentiment analysis studies, while research is more limited for other languages, including Arabic. This section discusses several papers in the field of sentiment analysis using either English or Arabic.

Speriosu et al. [5] compared three approaches: a lexicon-based predictor, maximum entropy classification and label propagation. Several English datasets were used as training, evaluation and test sets; only positive and negative tweets were included, and neutral tweets were eliminated. The experiment showed that the maximum entropy algorithm outperformed the lexicon-based predictor, improving accuracy on the test set of the polarity dataset from 58.1% to 62.9%. Label propagation obtained a better accuracy of 71.2% by combining tweets and lexical features.
Kumar and Sebastian [6] presented a novel way to perform sentiment analysis on an English Twitter dataset by extracting opinion words from the corpus. The types of words focused on were adjectives, verbs and adverbs. Two methods were used: a corpus-based method for finding the semantics of adjectives and a dictionary-based method for finding the semantics of verbs and adverbs. After the opinion words are extracted, each word receives a score indicating whether it is positive, negative or neutral. The overall score of a tweet is computed from the individual word scores using a linear equation.

Saif et al. [7] applied semantic features to the analysis of Twitter users’ opinions. Three English datasets were used: Stanford Twitter Sentiment (STS), Health Care Reform (HCR) and Obama-McCain Debate (OMD). Three approaches were used to add semantic features for sentiment classification: replacement, augmentation and interpolation. Baseline features such as unigrams, parts of speech (POS), sentiment-topic and semantic sentiment analysis were used in the experiments. The best result came from using the interpolation approach with unigram features and a naïve Bayes classifier. The paper showed that large datasets are best analyzed by semantic methods, while sentiment-topic is the best method for small datasets or limited topics.

Shoukry and Rafea [8] performed sentence-level sentiment analysis on 1000 Arabic tweets. Support Vector Machines (SVM) and Naïve Bayes (NB) were used in the experiment together with unigram and bigram text feature extraction. There were no differences between the results for the different text feature extractions, but accuracy varied between classifiers: SVM ≈ 72% and NB ≈ 65%.

Ben Salamah and Elkhlifi [9] collected about 340,000 Arabic tweets about debates in the Kuwait National Assembly. The data was classified into positive and negative classes using decision trees (J48, alternating decision tree and random tree) and SVMs. The average precision of the methods was 76% and the average recall was 61%.

Abdulla et al. [10] created an Arabic dataset for sentiment analysis containing 2000 tweets, divided into a positive half and a negative half. Two methods were applied to the dataset: corpus-based (supervised learning) and lexicon-based (unsupervised learning). Four supervised machine learning algorithms were applied: SVM, NB, decision tree (D-Tree) and K-Nearest Neighbor. SVM and NB obtained the better results, around 80%. For the lexicon-based approach, accuracy improved as the lexicon grew. There were three phases: the lexicon had 1000 words in phase I, 2500 words in phase II and 3500 words in phase III. Accuracy started at 16.5% in phase I, reached 48.8% in phase II and achieved 58.6% in phase III.

Kim [11] used convolutional neural networks to classify sentences from seven different English datasets, with the Movie Reviews dataset (introduced in [12]) being one of them. The experimental model was built on top of “word2vec” [13], and various model variations were used. The experiments achieved good results in classifying sentences from the different datasets.

III. AN ARABIC DATA SET ON HEALTH

The work described in this paper involved several steps: retrieving the data; filtering, pre-processing and annotating the dataset; and finally applying machine learning to the collected dataset. The first step is collecting data from Twitter using the Twitter API with keyword-based queries related to health. The second step is challenging because the retrieved data is very noisy and needs to be cleaned. After filtering, the tweets are annotated as either positive or negative. Once the dataset is annotated, several machine learning algorithms can be applied to it using different text feature extraction methods. Figure I shows the workflow of this project.

FIGURE I. VISUALIZING THE WORK FLOW

A. Data Collection

The data was collected via Twitter from 01/02/2016 to 31/07/2016. The first approach was to retrieve tweets using general words related to health, such as “مستشفى, Hospital”, “مستوصف, Clinic”, “صحه, Health”, etc. However, the majority of tweets found this way were not useful because they did not express any opinions, which are the focus of the study. Observing trending hashtags, which are the most popular topics on Twitter at a specific time and involve many users, proved more useful. Three health-related topics became trending hashtags, and many users shared their opinions about them. They were as follows:
#الصحة_تغلق_مستشفى
This topic is about the closing of a private hospital (Closing Hospital).

#من_يعالج_الصحة
This topic asks who will resolve the health problems (Solving Health).

#نتنتظر_تحسين_الصحة
This topic means that people are waiting for an improvement in the health services (Improving Health).

In addition to these topics, one topic was launched to ask users to post their opinions and experiences about health services:

#رأيك_بالخدمات_الصحية
This topic was launched especially for this study and concerns users’ opinions regarding health services (Opinions about Health).

The number of retrieved tweets was massive (over 126 thousand), but it decreased to 2026 tweets after filtering and pre-processing. Table I shows the number of tweets for each topic before and after the pre-processing of the data.

TABLE I. THE CHANGES IN NUMBER OF TWEETS FOR EACH TOPIC BEFORE AND AFTER FILTERING THE DATASET

Topic                      Tweets before filtering    Tweets after filtering
(Closing Hospital)         105275                     1009
(Resolving Health)         11624                      492
(Opinions about Health)    3033                       285
(Improving Health)         7027                       240
Total                      126959                     2026

B. Data Pre-processing and Normalization

The number of collected tweets was sufficient for the sentiment analysis experiment, as it contains a variety of words and sentence structures. However, the full set of retrieved tweets (126959 tweets) contained much noisy data and, as the study focused only on positive or negative tweets, all noisy data was removed. The following are examples of noisy data that were removed:

1) Spam tweets, which are tweets that contain advertisements or harmful links [14].
2) Neutral tweets, which do not express any opinions, such as news tweets.
3) Retweeted tweets, which start with “RT” [15].
4) Duplicated tweets, which were retrieved more than once.

In addition, as indicated in [15], further pre-processing steps were applied to the remaining tweets by removing:

1) Opinions unrelated to health.
2) Twitter user names, which look like @user_name [15].
3) URLs, which start with http:// and run until the next space, which marks the end of the URL [15].
4) Some words like “available” and “via” [15].
5) Hashtag topics.
6) Punctuation [15].

In addition to these tasks, normalization of some words or letters was performed, as summarized below:

1) Removing Arabic short vowels (diacritics) “ َ , ُ , ِ , ً , ٌ , ٍ , ّ , ْ ” [16].
2) Removing the Tatweel character “ـــ”, which does not affect the meaning of the word [16].
3) Replacing the letter “ة” with the letter “ه” [16].
4) Replacing the letters “آ، إ، أ” with the letter “ا” [16].
5) Normalizing some words to one form, especially words which contain the Hamzah letters “ئ، ؤ”, because some users write them with Hamzah and others write them without it. For example, the word “الطوارئ، الطواري” can be written in two ways by users, but it was normalized to the single form “الطوارئ”.
6) Normalizing any word with repeated letters, such as “كبيييييييييير”, which becomes “كبير” [16].
7) Normalizing some special letters, such as “ے” and “ڪ”, which are not Arabic letters but have the same shape as some Arabic letters.
8) Normalizing compound words using MS Excel by joining them with the character “_”. This applies to some city names, for example “المدينه المنوره”, which is normalized to “المدينه_المنوره”.
9) Manually correcting words which were missing some letters, had a letter replaced by a wrong one, or were written in a wrong form.
10) Some Twitter users compress two or more words by omitting the space between them because of the character limit on Twitter. These were normalized by manually restoring the spaces between words.
11) Some users post their opinions in more than one tweet; the solution here was to combine them into one long tweet. After that, the length of the combined tweet was reduced by removing unwanted words.

C. Data Annotating

The dataset was annotated manually by three annotators, and each tweet can be only positive or negative. The reason for having three judges is to obtain three independent opinions about each tweet and then take their majority vote (the mode). Table II has eight rows for the eight possible combinations of the annotators’ classifications. The reasons for having only two classes are the difficulty of rating opinions, the need for many annotators and the lack of scaled words in the corpus, such as “very” in English and “جدا” in Arabic.

Table II details how often each combination of the annotators’ classifications occurred in the dataset: for example, when all of them agree on positive or negative, when two agree on positive and one disagrees, and when two agree on negative and one disagrees. It also shows the total number of positive and negative tweets.
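The majority-vote rule described for the three annotators can be illustrated with a minimal sketch. It is not the authors' tooling: the function names `majority_vote` and `summarize` are hypothetical, and the only data used are the eight pattern counts reported in Table II. The sketch derives the final label of each pattern as the mode of the three votes, totals the positive and negative tweets, and measures how often each annotator agrees with the final label.

```python
from collections import Counter

# Counts of each (annotator 1, annotator 2, annotator 3) label pattern,
# copied from Table II. P = positive, N = negative.
PATTERN_COUNTS = {
    ("P", "P", "P"): 502,
    ("P", "P", "N"): 49,
    ("P", "N", "P"): 74,
    ("N", "P", "P"): 3,
    ("P", "N", "N"): 135,
    ("N", "P", "N"): 18,
    ("N", "N", "P"): 15,
    ("N", "N", "N"): 1230,
}


def majority_vote(labels):
    """Return the most frequent label (the mode) among an odd number of votes."""
    return Counter(labels).most_common(1)[0][0]


def summarize(pattern_counts):
    """Return final-label totals and each annotator's agreement rate with the final label."""
    final_totals = Counter()
    agreements = [0, 0, 0]
    for pattern, count in pattern_counts.items():
        final = majority_vote(pattern)
        final_totals[final] += count
        for i, label in enumerate(pattern):
            if label == final:
                agreements[i] += count
    total = sum(pattern_counts.values())
    return final_totals, [a / total for a in agreements]


final_totals, agreement_rates = summarize(PATTERN_COUNTS)
print(dict(final_totals))
print([round(100 * rate, 1) for rate in agreement_rates])
```

With three annotators and two classes a tie is impossible, so the mode is always well defined; adding a third class or using an even number of annotators would require a tie-breaking rule.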
TABLE II. SUMMARY OF ANNOTATION PROCESS (POSITIVE = P, NEGATIVE = N)

Annotator 1    Annotator 2    Annotator 3    Total number of occurrences    Final sentiment
P              P              P              502 times                      P
P              P              N              49 times                       P
P              N              P              74 times                       P
N              P              P              3 times                        P
P              N              N              135 times                      N
N              P              N              18 times                       N
N              N              P              15 times                       N
N              N              N              1230 times                     N

Total positive tweets: 628
Total negative tweets: 1398
Total: 2026 tweets

From Table II, the accuracy of each annotator can be measured as agreement with the final majority label. The accuracy of Annotator 1 is 93%, the accuracy of Annotator 2 is 95% and the accuracy of Annotator 3 is 97%.

Figure II shows the distribution of the numbers of positive and negative tweets for each annotator. It is clear that Annotators 2 and 3 have almost the same number of positive and negative tweets, while Annotator 1 is slightly different. Overall, the dataset is unbalanced, with negative tweets more prevalent than positive tweets.

FIGURE II. VISUALIZING THE NUMBER OF POSITIVE AND NEGATIVE TWEETS IN THE DATASET BASED ON THE THREE DIFFERENT ANNOTATORS

A. Machine Learning Algorithms

Vector Machines (SVMs) [18]. The NB algorithms used were Multinomial Naive Bayes and Bernoulli Naive Bayes, and the SVM-based classifiers used were Support Vector Classification, Linear Support Vector Classification, Stochastic Gradient Descent and Nu-Support Vector Classification. The experiment had three phases, using different sizes for the training set and the testing set. In Phase I the training set was 60% of the dataset and the testing set was 40%. In Phase II the testing set was reduced to 30% and the training set was 70%. In Phase III the training set was increased by a further 10% (to 80%) and the testing set was reduced by 10% (to 20%). Table III shows the results of all classifiers in the different phases using the previously explained text feature selection.

TABLE III. THE RESULTS OF 3 CLASSIFIERS THAT WERE USED WITH TF-IDF FEATURE SELECTION, “UNIGRAM” AND “BIGRAM”

                                          Accuracy
No.    Name of the Algorithm              Phase I    Phase II    Phase III
1      Multinomial Naive Bayes            87.42%     88.98%      90.14%
2      Bernoulli Naive Bayes              87.29%     87.50%      89.16%
3      Logistic Regression                86.92%     88.32%      86.94%
4      Support Vector                     89.27%     90.13%      90.88%
5      Linear Support Vector              89.39%     90.46%      91.37%
6      Stochastic Gradient Descent        88.28%     88.98%      91.87%
7      Nu-Support Vector                  86.31%     87.82%      86.20%
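To make the feature representation named in the caption of Table III more concrete, the following is a minimal pure-Python sketch of TF-IDF weighting over unigrams and bigrams. It rests on assumptions that the paper does not state: the weighting variant (term frequency as a proportion, inverse document frequency as log(N/df)), whitespace tokenization, the hypothetical function names `ngrams` and `tfidf`, and two invented toy tweets that are not taken from the dataset.

```python
import math
from collections import Counter


def ngrams(tokens, n_min=1, n_max=2):
    """Return all word n-grams of length n_min..n_max as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tfidf(documents):
    """Compute TF-IDF weights for unigram and bigram features.

    Assumed weighting: tf = term count / number of terms in the document,
    idf = log(N / df). Libraries often use smoothed variants instead.
    """
    term_lists = [ngrams(doc.split()) for doc in documents]
    # Document frequency: in how many documents each term appears at least once.
    doc_freq = Counter(term for terms in term_lists for term in set(terms))
    n_docs = len(documents)
    vectors = []
    for terms in term_lists:
        counts = Counter(terms)
        total = len(terms)
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical, already normalized toy tweets (not from the dataset).
tweets = ["خدمه ممتازه", "خدمه سيئه جدا"]
vectors = tfidf(tweets)
for vector in vectors:
    print(vector)
```

Under this unsmoothed variant, a term that occurs in every document receives weight zero, so only terms that distinguish tweets from one another carry weight in the classifier input.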
B. Deep Neural Networks Algorithms

Deep learning is a popular approach in computational