
Arabic Language Sentiment Analysis on Health Services

Abdulaziz M. Alayba1, Vasile Palade2, Matthew England3 and Rahat Iqbal4


Faculty of Engineering, Environment and Computing
Coventry University
Coventry, UK
1 [email protected]
2 [email protected]
3 [email protected]
4 [email protected]

Abstract — The social media network phenomenon has led to a massive amount of valuable data that is available online and easy to access. Many users share images, videos, comments, reviews, news and opinions on different social network sites, with Twitter being one of the most popular ones. Data collected from Twitter is highly unstructured, and extracting useful information from tweets is a challenging task. Twitter has a huge number of Arabic users, who mostly post and write their tweets in the Arabic language. While there has been a lot of research on sentiment analysis in English, the amount of research and the number of datasets in Arabic are limited. This paper introduces an Arabic language dataset of opinions on health services collected from Twitter. The paper first details the process of collecting the data from Twitter, and then the process of filtering, pre-processing and annotating the Arabic text in order to build a large sentiment analysis dataset in Arabic. Several machine learning algorithms (Naïve Bayes, Support Vector Machines and Logistic Regression), alongside Deep and Convolutional Neural Networks, were used in our sentiment analysis experiments on this health dataset.

Keywords — Sentiment Analysis, Machine Learning, Deep Neural Networks, Arabic Language.

I. INTRODUCTION

In the past ten years, many social network sites (Facebook, Twitter, Instagram, etc.) have increased their presence on the web. These sites have an enormous number of users who produce a massive amount of data, including text, images, videos, etc. According to [1], the estimated amount of data on the web will be about 40 thousand exabytes, or 40 trillion gigabytes, by 2020. Analysis of such data could be valuable. There are many different techniques for analyzing data collected from the web, with sentiment analysis being a prominent one. Sentiment analysis (see for example [2]) is the study of people's attitudes, emotions and opinions, and involves a combination of text mining and natural language processing. Sentiment analysis focuses on analyzing text messages that hold people's opinions. Examples of topics for analysis include opinions on products, services, food, education, etc. [3].

Twitter is a popular social media platform where a huge number of tweets are shared, many of which contain valuable data. As [4] reported, in March 2014 active Arabic users wrote over 17 million tweets per day. Huge numbers of tweets are generated every minute, and many of them are in the Arabic language. Topics about health services appear frequently in Twitter trends. The aim of this paper is to introduce a new Arabic dataset on health services for opinion mining purposes, and to explain the process of collecting the data from Twitter, pre-processing the Arabic text and annotating the dataset. After collecting and annotating the dataset, several data processing tasks are applied, such as feature selection, machine learning algorithms and deep neural networks, and the effectiveness of these methods is assessed and compared.

This paper continues in Section II with a brief survey of work on sentiment analysis in the English and Arabic languages. Section III details the process of collection, pre-processing and filtering that went into creating our dataset, and then the annotation procedure. The use of deep neural networks and other machine learning methods, including text feature selection, on the dataset is described in Section IV. Finally, conclusions and ideas for future work are discussed in Section V.

II. RELATED WORK

There are many studies on sentiment analysis and a variety of approaches have been developed. English has the greatest number of sentiment analysis studies, while research is more limited for other languages, including Arabic. This section discusses several papers in the field of sentiment analysis using either English or Arabic.

Speriosu et al. [5] compared three different approaches: lexicon-based, maximum entropy classification and label propagation. Several English datasets were used for training, evaluation and testing; only positive and negative tweets were included in the datasets, whereas neutral tweets were eliminated. The experiments showed that the maximum entropy algorithm outperformed the lexicon-based predictor, improving the accuracy on the test set of the polarity dataset from 58.1% to 62.9%. Label propagation obtained a better accuracy of 71.2% by combining tweets and lexical features.
Kumar and Sebastian [6] presented a novel way to perform sentiment analysis on an English Twitter dataset by extracting opinion words from the corpus. The types of words focused on were adjectives, verbs and adverbs. Two methods were used: a corpus-based method for finding the semantic orientation of adjectives, and a dictionary-based method for finding the semantic orientation of verbs and adverbs. After the opinion words are extracted, each word receives a score indicating whether it is positive, negative or neutral. The overall score of a tweet is then computed from the individual word scores using a linear equation.

Saif et al. [7] applied semantic features to the analysis of Twitter users' opinions. Three different English datasets were used: Stanford Twitter Sentiment (STS), Health Care Reform (HCR) and Obama-McCain Debate (OMD). Three approaches were used to add semantic features for sentiment classification: replacement, augmentation and interpolation. Baseline features such as unigrams, parts of speech (POS), sentiment-topic and semantic sentiment analysis were used in the experiments. The best result came from combining the interpolation approach, unigram features and a Naïve Bayes classifier. The paper showed that large datasets are best analyzed by semantic methods, while sentiment-topic is the best method for small datasets or limited topics.

Shoukry and Rafea [8] addressed Arabic sentence-level sentiment analysis on 1000 tweets. Support Vector Machines (SVM) and Naïve Bayes (NB) were used in the experiment, together with unigram and bigram text feature extraction. There were no differences between the results using the different text feature extractions, but there were variations in accuracy between the classifiers: SVM ≈ 72% and NB ≈ 65%.

Ben Salamah and Elkhlifi [9] collected about 340,000 Arabic tweets about debates in the Kuwait National Assembly. The data was classified into positive and negative classes using decision trees (J48, alternating decision tree and random tree) and SVMs. The average precision of the methods was 76% and the average recall was 61%.

Abdulla et al. [10] created an Arabic dataset for sentiment analysis containing 2000 tweets, the first half positive and the second half negative. Two methods were applied to the dataset: corpus-based ("supervised learning") and lexicon-based ("unsupervised learning"). Four supervised machine learning algorithms were applied: SVM, NB, decision trees and K-Nearest Neighbors; SVM and NB obtained the best results, around 80%. The lexicon-based approach, on the other hand, showed that accuracy improves as the lexicon grows. There were three phases: phase I had 1000 words, phase II 2500 words and phase III 3500 words. The accuracy started at 16.5% in phase one, reached 48.8% in phase two and achieved 58.6% in phase three.

Kim [11] utilized convolutional neural networks to classify sentences using seven different English datasets, with the Movie Reviews dataset (introduced in [12]) being one of them. The experimental models were built on top of "word2vec" [13] and various model variations were used. The experiments achieved good results in classifying sentences from the different datasets.

III. AN ARABIC DATA SET ON HEALTH

The work described in this paper involved several steps: retrieving the data; filtering, pre-processing and annotating the dataset; and finally applying some machine learning algorithms to the collected dataset. Collecting the data from Twitter using the Twitter API with keyword-based queries related to health is the first step. The second step is a challenging one, because the retrieved data contains much noise and needs to be cleaned. Annotating the tweets in the dataset with either positive or negative classes occurs after filtering. After annotation, several machine learning algorithms can be applied to the dataset using different text feature extractions. Figure I shows the workflow of this project.

FIGURE I. VISUALIZING THE WORK FLOW

A. Data Collection

The data was collected from 01/02/2016 to 31/07/2016 via Twitter. The first approach was to retrieve tweets using some general words related to health, such as "مستشفى, Hospital", "مستوصف, Clinic", "صحه, Health", etc. However, the majority of the tweets found this way were not useful, because they do not express any opinions, which is the aim of the study. Alternatively, observing trending hashtags, which are the most popular topics on Twitter at a specific time with many users involved, was more useful. Three topics regarding health were raised as trending hashtags and many users shared their opinions about them. They were as follows:
 #‫الصحة_تغلق_مستشفى‬
In addition to these tasks, normalization of some words or
This topic is about closing a private hospital (Closing Hospital). letters was performed as summarized below:
1) Removing Arabic short vowels (diacritics) “ ٍ, ٍ , ٍ, ٍ
 #‫من_يعالج_الصحة‬
, ٍ , ٍ , ٍ , ٍ” [16].
The meaning of this topic is asking a question about who will 2) Removing the Tatweel character “‫ ”ـــ‬which does not
resolve the health problems (Solving Health). affect the meaning of the word [16].
 #‫نتنتظر_تحسين_الصحة‬ 3) Replacing the letter “ ‫ ” ة‬to the letter “ ‫[ ” ه‬16].
4) Replacing the letters “ ‫ آ‬، ‫ إ‬، ‫ ” أ‬to the letter “ ‫[ ” ا‬16].
This topic means that people are waiting for an improvement in 5) Normalizing some words, especially words which
the health services (Improving Health). contain the letter “ Hamzah ” “ ‫ ئ‬، ‫ ” ؤ‬to one form
In addition to these topics, one topic was launched and asked because some users write it with “ Hamzah ” and other
users to post their opinions and experience about health services write it without it. For example the word “ ، ‫الطوارئ‬
which was: ‫ ” الطواري‬can be written in two ways by users, but the
words were normalized to one form which is “‫”الطوارئ‬
 #‫رأيك_بالخدمات_الصحية‬
6) Normalizing any word with repeated letters, such as “
This topic was launched especially for this study, which is
‫ ” كبييييييييير‬to be “ ‫[ ” كبير‬16].
about users’ opinions regarding health services (Opinions about
Health). 7) Normalizing some special letters “ ‫ ے‬, ‫” ڪ‬, which are
not Arabic letters, but they have the same shape of
The number of retrieved tweets was massive (over 126 some Arabic letter.
thousand) but it decreased to 2026 tweets after filtering and pre- 8) Normalizing compound words using MS Excel by
processing. Table I shows the number of tweets of each topic joining them by the character “ _ ”; such as some city
before and after the pre-processing of the data. names, for example “‫ ”المنوره المدينه‬which will be
normalized to “ ‫” المدينه_المنوره‬.
TABLE I. THE CHANGES IN NUMBER OF TWEETS FOR EACH TOPIC BEFORE
AND AFTER FILTERING THE DATASET
9) Correcting words manually which were either missing
some letters, replacing a letter by a wrong one, or
Number of the Number of the writing the word in a wrong form.
Topics tweets before the tweets after the 10) Some Twitter users compress two words or more by
filtering filtering
(Closing Hospital) 105275 tweets 1009 tweets
ignoring the space between the words because of the
(Resolving Health) 11624 tweets 492 tweets characters limit on Twitter. These are normalized by
(Opinions about Health) 3033 tweets 285 tweets manually returning the spaces between words.
(Improving Health) 7027 tweets 240 tweets 11) Some users post their opinions in more than one tweet
Total 126959 tweets 2026 tweets and the solution here was to combine them in one long
tweet. After that, the length of combined tweet was
B. Data Pre-processing and Normalization reduced by removing unwanted words.
The number of collected tweets was sufficient to do the C. Data Anotating
sentiment analysis experiment as it contains a variety of words
and sentence structures. In contrast, the total number of tweets The data set has been annotated manually by three annotators
(126959 tweets) contained much noisy data and as the study and each tweet can be either positive or negative only. The
focused on only positive or negative tweets, all noisy data was reason of having three judges is to get three different opinions
removed. The following points are examples of noisy data that about each tweet, then calculating the majority vote of them
were removed. (“The Mode”). There are eight rows in Table II for eight
different situations of the annotators’ classification. Also, the
1) Spam tweets which are tweets that contain reasons of having only two classes are the difficulty of rating the
advertisements or harmful links [14]. opinions, the need of many annotators and the lack of scaled
2) Neutral tweets which do not have any opinions, such as words in the corpus such as “very” in English language and “‫”جدا‬
news tweets. in the Arabic language.
3) Retweeted tweets, which start by “RT” [15].
Table II details the number of each different situation of the
4) Duplicated tweets, which were retrieved more than once. annotators’ classification occurred in the dataset. For example,
In addition, as indicated in [15], some pre-processing steps when all of them agree as positive or negative tweets, when two
were undertaken to the remaining tweets by removing: of them agree as positive and another one disagree and when two
of them agree as negative and another one disagree. In addition
1) Opinions unrelated to health. to that, it shows the total number of positive and negative tweets.
2) Twitter users name which are like @user_name [15].
3) URLs which started by http:// until the next space,
which indicates the end of the URL [15].
4) Some words like “available”, “via” [15].
5) Hashtags topics.
6) Punctuations [15].
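The normalization rules and the majority-vote annotation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the exact Unicode ranges and the order in which the rules were applied are assumptions.

```python
import re
from collections import Counter

# Assumed Unicode ranges covering rules 1) and 2) above.
DIACRITICS = re.compile(r"[\u064B-\u065F\u0670]")  # Arabic short vowels and marks
TATWEEL = "\u0640"                                 # elongation character

def normalize(text: str) -> str:
    """Apply a subset of the normalization steps listed above."""
    text = DIACRITICS.sub("", text)                        # 1) remove diacritics
    text = text.replace(TATWEEL, "")                       # 2) remove Tatweel
    text = text.replace("\u0629", "\u0647")                # 3) Teh Marbuta -> Heh
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)  # 4) Alef variants -> bare Alef
    text = re.sub(r"(.)\1{2,}", r"\1", text)               # 6) collapse repeated letters
    return text

def majority_label(votes):
    """Majority vote ('the mode') of the three annotators' P/N labels."""
    return Counter(votes).most_common(1)[0][0]

print(normalize("كبييييييييير"))        # repeated letters collapse, as in rule 6)
print(majority_label(["P", "N", "N"]))  # two of three annotators say negative
```

With three annotators and two classes there is always a strict majority, which is why the mode in Table II never needs a tie-breaking rule.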
TABLE II. SUMMARY OF ANNOTATION PROCESS (POSITIVE = P, NEGATIVE = N)

Annotator 1 | Annotator 2 | Annotator 3 | Number of occurrences | Final sentiment
P | P | P | 502 times | P
P | P | N | 49 times | P
P | N | P | 74 times | P
N | P | P | 3 times | P
P | N | N | 135 times | N
N | P | N | 18 times | N
N | N | P | 15 times | N
N | N | N | 1230 times | N
Total: 628 positive tweets, 1398 negative tweets, 2026 tweets overall.

From Table II the accuracy of each annotator can be measured: the accuracy of Annotator 1 is 93%, of Annotator 2 is 95% and of Annotator 3 is 97%. Figure II shows the distribution of positive and negative tweet counts per annotator. Annotators 2 and 3 have almost the same numbers of positive and negative tweets, while Annotator 1 is slightly different. Overall, the dataset is unbalanced, with negative tweets more prevalent than positive tweets.

FIGURE II. VISUALIZING THE NUMBER OF POSITIVE AND NEGATIVE TWEETS IN THE DATASET BASED ON THE THREE DIFFERENT ANNOTATORS (Annotator 1: 760 positive, 1266 negative; Annotator 2: 572 positive, 1454 negative; Annotator 3: 594 positive, 1432 negative)

IV. EXPERIMENTS AND RESULTS

The objective of these experiments is to investigate the efficiency of Deep Neural Networks and other machine learning algorithms on the newly developed Arabic Health Services dataset. Their efficiency can be measured by calculating the accuracy of the classification task, which is defined as:

Accuracy = (TP + TN) / (TP + TN + FP + FN),

where TP is the number of true positives, TN the number of true negatives, FP the number of false positives and FN the number of false negatives.

A. Machine Learning Algorithms

In this part of the experiment, a combination of "unigram" and "bigram" techniques was used for text feature selection. TF-IDF (Term Frequency - Inverse Document Frequency) [17] weighting was used to weight each feature in the corpus, and the 1000 highest-weighted features were fed to the machine learning algorithms. Three families of machine learning algorithms were used: Naïve Bayes (NB), Logistic Regression (LR) and Support Vector Machines (SVMs) [18]. The NB algorithms used were Multinomial Naïve Bayes and Bernoulli Naïve Bayes, and the SVM variants used were Support Vector Classification, Linear Support Vector Classification, Stochastic Gradient Descent and Nu-Support Vector Classification. The experiment had three phases, using different sizes for the training and testing sets: in Phase I the training set was 60% of the dataset and the testing set was 40%; in Phase II the testing set was reduced to 30% and the training set was 70%; in Phase III the training set was increased to 80% and the testing set reduced to 20%. Table III shows the results of all the classifiers in the different phases using the previously explained text feature selection.

TABLE III. THE RESULTS OF THE CLASSIFIERS THAT WERE USED WITH TF-IDF FEATURE SELECTION, "UNIGRAM" AND "BIGRAM"

No. | Name of the Algorithm | Phase I | Phase II | Phase III
1 | Multinomial Naive Bayes | 87.42% | 88.98% | 90.14%
2 | Bernoulli Naive Bayes | 87.29% | 87.50% | 89.16%
3 | Logistic Regression | 86.92% | 88.32% | 86.94%
4 | Support Vector Classification | 89.27% | 90.13% | 90.88%
5 | Linear Support Vector Classification | 89.39% | 90.46% | 91.37%
6 | Stochastic Gradient Descent | 88.28% | 88.98% | 91.87%
7 | Nu-Support Vector Classification | 86.31% | 87.82% | 86.20%

B. Deep Neural Network Algorithms

Deep learning is a popular approach in computational modeling today [19]. A model consists of a large number of hidden layers and neurons that represent the data at different levels of abstraction, and it works efficiently and effectively with large datasets. Neural networks with many hidden layers, Convolutional Neural Networks and some Recurrent Neural Networks are examples of this [20].

In this part of the experiment, Deep and Convolutional Neural Networks were used. The Deep Neural Network model had three hidden layers, each with 1500 neurons. The input features were 741 words, selected because their frequency in the corpus was between 5 and 100 occurrences, and the output of the model is either positive or negative. The dataset was divided into 80% for training and 20% for testing. Figure III shows the accuracy results of the experiment using the Deep Neural Network, which reached about 85% after 500 epochs. Table IV shows the confusion matrix of the Deep Neural Network experiment on the test set, detailing the numbers of actual and predicted classes.

FIGURE III. ACCURACY RESULTS PER 500 EPOCHS FOR THE DEEP NEURAL NETWORK (ARABIC HEALTH DATASET)
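The feature selection of Section IV-A (combined unigram and bigram counts, TF-IDF weighting, keeping the top 1000 features) can be sketched as follows. This is a simplified, self-contained version; the paper used scikit-learn [18], whose tokenization and exact TF-IDF formula (smoothing and normalization) differ.

```python
import math
from collections import Counter

def ngrams(tokens, orders=(1, 2)):
    """Combined "unigram" and "bigram" features from a token list."""
    feats = []
    for n in orders:
        feats += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return feats

def top_tfidf_features(docs, k=1000):
    """Rank features by a corpus-level TF-IDF weight and keep the top k."""
    n_docs = len(docs)
    tf, df = Counter(), Counter()
    for tokens in docs:
        feats = ngrams(tokens)
        tf.update(feats)       # total term frequency over the corpus
        df.update(set(feats))  # document frequency
    weight = {f: tf[f] * (math.log(n_docs / df[f]) + 1) for f in tf}
    return sorted(weight, key=weight.get, reverse=True)[:k]

# Toy corpus of pre-tokenized "tweets" (illustrative only).
corpus = [["good", "service"], ["bad", "service"], ["good", "hospital"]]
print(top_tfidf_features(corpus, k=2))
```

In scikit-learn the equivalent step would typically be TfidfVectorizer(ngram_range=(1, 2), max_features=1000), with the resulting matrix passed to the classifiers listed in Table III.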
TABLE IV. DEEP NEURAL NETWORK CONFUSION MATRIX

Actual \ Predicted | Negative | Positive | Total
Negative | 265 | 14 | 279
Positive | 43 | 83 | 126
Total | 308 | 97 | 405

Convolutional Neural Networks (CNNs) were also used in the experiment. All the vocabulary in the corpus was trained using "word2vec" [13] to create the input vectors of the models. The sequence length of each vector was 52, because the longest sentence in the dataset has 52 words. Sliding windows of sizes (3, 4) were used as filters over the input matrix, and the number of epochs was 100. The dataset was divided into 80% for training and 20% for testing. The accuracy obtained in this experiment was about 90%. Figure IV shows the accuracy results over the training epochs of the Convolutional Neural Network. Moreover, the confusion matrix was measured on the test set and can be found in Table V.

FIGURE IV. ACCURACY RESULTS PER 100 EPOCHS FOR THE CONVOLUTIONAL NEURAL NETWORK (ARABIC HEALTH DATASET)

TABLE V. CONVOLUTIONAL NEURAL NETWORK CONFUSION MATRIX

Actual \ Predicted | Negative | Positive | Total
Negative | 274 | 16 | 290
Positive | 23 | 93 | 116
Total | 297 | 109 | 406

V. CONCLUSION AND FUTURE WORK

This paper introduced a new Arabic dataset for sentiment analysis about health services. The paper also detailed the process of collecting tweets from Twitter; the way of filtering and pre-processing the Arabic text by removing unwanted data, removing unrelated words and text, and normalizing the text; and the procedure of annotating the dataset manually by three annotators. The initial experiments were conducted using Deep Neural Networks and several other machine learning algorithms: NB, LR, SVM, DNNs and CNNs were used, and the accuracy of each experiment was recorded. The accuracy results were roughly between 85% and 91%, and the best classifiers were the SVMs using Linear Support Vector Classification and Stochastic Gradient Descent. The SVM classifier accuracy is similar to the first annotator's accuracy.

There will be further studies and experiments using different text feature extractions and other Deep Neural Network and Recurrent Neural Network architectures in order to increase the accuracy of the results. In addition, negation words in Arabic will be studied to increase the prediction performance, and, as the dataset is unbalanced, techniques for dealing with unbalanced datasets will be another topic for our future studies.

VI. REFERENCES

[1] J. Gantz and D. Reinsel, "Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East," IDC, pp. 1–16, December 2012.
[2] B. Liu, "Sentiment Analysis and Opinion Mining," Synth. Lect. Hum. Lang. Technol., vol. 5, no. 1, pp. 1–167, May 2012.
[3] B. Agarwal, N. Mittal, P. Bansal, and S. Garg, "Sentiment Analysis Using Common-Sense and Context Information," Comput. Intell. Neurosci., vol. 2015, pp. 1–9, 2015.
[4] Mohammad Bin Rashid School of Government, "The Arab Social Media Report." [Online]. Available: http://www.arabsocialmediareport.com/home/index.aspx. [Accessed: 02-Apr-2016].
[5] M. Speriosu, N. Sudan, S. Upadhyay, and J. Baldridge, "Twitter Polarity Classification with Label Propagation over Lexical Links and the Follower Graph," Proc. Conf. Empir. Methods Nat. Lang. Process., pp. 53–56, 2011.
[6] A. Kumar and T. M. Sebastian, "Sentiment Analysis on Twitter," IJCSI Int. J. Comput. Sci. Issues, vol. 9, no. 4, pp. 372–378, 2012.
[7] H. Saif, Y. He, and H. Alani, "Semantic Sentiment Analysis of Twitter," in The Semantic Web – ISWC 2012, 2012, pp. 508–524.
[8] A. Shoukry and A. Rafea, "Sentence-level Arabic sentiment analysis," in 2012 Int. Conf. Collaboration Technologies and Systems (CTS), pp. 546–550, 2012.
[9] J. Ben Salamah and A. Elkhlifi, "Microblogging Opinion Mining Approach for Kuwaiti Dialect," Int. Conf. Comput. Technol. Inf. Manag., pp. 388–396, 2014.
[10] N. A. Abdulla, N. A. Ahmed, M. A. Shehab, and M. Al-Ayyoub, "Arabic sentiment analysis: Lexicon-based and corpus-based," in 2013 IEEE Jordan Conference on Applied Electrical Engineering and Computing Technologies (AEECT), 2013, pp. 1–6.
[11] Y. Kim, "Convolutional Neural Networks for Sentence Classification," in Proc. 2014 Conf. Empir. Methods Nat. Lang. Process. (EMNLP 2014), pp. 1746–1751, 2014.
[12] B. Pang and L. Lee, "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales," in Proceedings of the ACL, 2005.
[13] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, "Distributed Representations of Words and Phrases and their Compositionality," in Proc. NIPS, pp. 1–9, 2013.
[14] Twitter, "Reporting spam on Twitter," 2016. [Online]. Available: https://support.twitter.com/articles/64986. [Accessed: 17-May-2016].
[15] A. Pak and P. Paroubek, "Twitter as a Corpus for Sentiment Analysis and Opinion Mining," in Proc. LREC, pp. 1320–1326, 2010.
[16] S. Ahmed, M. Pasquier, and G. Qadah, "Key issues in conducting sentiment analysis on Arabic social media text," in 2013 9th International Conference on Innovations in Information Technology (IIT), 2013, pp. 72–77.
[17] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval, 1st ed. New York: Cambridge University Press, 2008.
[18] scikit-learn developers, "scikit-learn," 2010. [Online].
[19] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[20] J. Schmidhuber, "Deep Learning in neural networks: An overview," Neural Networks, vol. 61, pp. 85–117, 2015.
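As a sanity check, the accuracy definition from Section IV can be applied to the DNN and CNN confusion matrices reported on the test sets; the results match the reported figures of about 85% (DNN) and about 90% (CNN). The cell values below are read from those confusion matrices.

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN), as defined in Section IV."""
    return (tp + tn) / (tp + tn + fp + fn)

# Deep Neural Network confusion matrix (test set of 405 tweets).
dnn = accuracy(tp=83, tn=265, fp=14, fn=43)
# Convolutional Neural Network confusion matrix (test set of 406 tweets).
cnn = accuracy(tp=93, tn=274, fp=16, fn=23)

print(f"DNN: {dnn:.1%}, CNN: {cnn:.1%}")  # DNN: 85.9%, CNN: 90.4%
```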
