HAHA 2019 Dataset: A Corpus for Humor Analysis in Spanish
Luis Chiruzzo, Santiago Castro, Aiala Rosá
Abstract
This paper presents the development of a corpus of 30,000 Spanish tweets that were crowd-annotated with humor value and funniness
score. The corpus contains approximately 38.6% intended-humor tweets, with an average funniness score of 2.04 out of 5. It has been used in an automatic humor recognition and analysis competition, in which participants obtained encouraging results.
Keywords: Humor, Computational Humor, Humor Detection, Humor Analysis, Natural Language Processing
dataset, which has had two iterations so far and has been used for the 2018 and 2019 editions of the HAHA (Humor Analysis based on Human Annotations) evaluation campaign.
(two humorous and a non-humorous one). The purpose of these test tweets was to rule out users that did not understand the premise of the annotation process: we considered the sessions where any of these tweets were mislabeled as invalid sessions and did not use their votes in the final version of the corpus.

First, we aimed at getting at least five votes for each tweet and determining the humorous tweets by simple majority, i.e., the tweets that got at least three humor votes out of five would be considered humorous. We shuffled the tweets presented to the annotators, trying to keep the number of votes for each tweet as close as possible on average and not letting some of the tweets lag behind (a notable exception to this are the three test tweets, which received as many votes as sessions). As the voting period proceeded, we realized that the tweets that had already received three negative votes had no possibility of being considered humorous in the final corpus, even if the remaining two votes were of the humorous categories. Because of this, occasionally during the voting period, we manually deprioritized the tweets that had received three or more negative votes, keeping in the pool only the tweets that still had a chance of being considered positive. As a result, the corpus contains some tweets that do not have five votes, mainly the non-humorous ones.

Table 1: Examples of different types of near-duplicate tweets (English glosses in brackets).

(1a) Si tuviera un peso por cada persona que me dice "feo", pues sería pobre porque soy perfecto.
(1b) Si tuviera un peso por cada vez que alguien me dice "feo", sería pobre porque soy perfecto.
     [If I had a peso for every person who / every time someone calls me "ugly", I would be poor, because I am perfect.]

(2a) —¿Tienes Wi-Fi? —Claro. —¿Cuál es la clave? —Tener dinero y pagarlo.
(2b) - ¿Tienes wi-fi? - Sí - ¿Y cuál es la clave? - Tener dinero y pagarlo.
     ["Do you have Wi-Fi?" "Sure." "And what is the key?" "Having money and paying for it."]

(3a) Me encanta encontrar dinero en mi ropa. Es como un regalo para mí de mí.
(3b) Me encanta encontrar el dinero en mi ropa, es como un regalo para mí de mí.
     [I love finding money in my clothes. It is like a gift to me from me.]

(4a) Cuando te digan ESTUDIA no hagas nada, significa ES-TU-DIA, aprovéchalo.
(4b) Cuando te digan ESTUDIA no hagas nada, significa ES-TU-DIA, aprovechalo! #fb
     [When they tell you ESTUDIA (study), do nothing; it means ES-TU-DIA (it's your day), make the most of it.]

(5a) ¿Cursi yo? ¿Cursi YOO? Cursi el viento..!! ..que acaricia tu cabello, impregnándome de tu aroma, y el dulce terciopelo...
(5b) Cursi yo?? CURSI YOOOOOOO?????... cursi el viento que acaricia tu cabello impregnandome con tu aroma y el dulce terciopelo...
     [Me, cheesy? Cheesy is the wind that caresses your hair, filling me with your scent, and the sweet velvet...]
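The resulting rule is easy to state concretely; the following is a minimal sketch, where the per-vote encoding (None for "not humor", 1 to 5 for the funniness score) is our own illustration rather than the release format:

```python
# Minimal sketch of the labeling rule described above; the vote encoding
# (None = "not humor", 1..5 = funniness score) is our own illustration.
from statistics import mean
from typing import Optional

def label_tweet(votes: list[Optional[int]]) -> tuple[bool, Optional[float]]:
    """Return (is_humorous, average_funniness) for one tweet's votes."""
    humor_votes = [v for v in votes if v is not None]
    is_humorous = len(humor_votes) > len(votes) / 2   # simple majority
    score = mean(humor_votes) if is_humorous else None
    return is_humorous, score

# Three humor votes out of five make the tweet humorous:
print(label_tweet([2, None, 3, None, 2]))  # (True, 2.33...)
print(label_tweet([None, 5, None, None]))  # (False, None)
```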
Once the voting period ended, we had received 117,800 votes from 1,546 users. We collected all the annotations, discarded the invalid sessions, determined the humor value by simple majority, and computed the average score for the humorous tweets. In total, around 26.9% of the tweets were considered humorous. We then randomly discarded non-humorous tweets until we got 20,000 tweets in total, achieving a final proportion of 36.8% humorous tweets in the corpus. This 2018 version of the corpus contains 20,000 tweets, of which 7,357 are humorous and 12,643 are not; the average funniness score for the humorous tweets is 2.10. The corpus was divided into an 80/20 train-test split and was used in the HAHA at IberEval 2018 competition (Castro et al., 2018).

3.4. Corpus 2019
The second iteration was done between December 2018 and March 2019. First, we started by analyzing some tweets in the 2018 version of the corpus that we noticed were near-duplicates, i.e., the content was almost the same except for a few different words that did not change the semantics. We used a semi-automatic process to find duplicate candidates by collecting all pairs of tweets that had a Jaccard coefficient greater than 0.5. We manually inspected all pairs, clustered them into equivalence classes, and took one example from each class, discarding the others from the corpus. As a result, we pruned 1,278 tweets from the corpus, most of them humorous. Table 1 shows some examples of near-duplicate tweets found in the previous version of the corpus. The most common differences between tweets considered near-duplicates include slight changes in spelling or capitalization, differences in punctuation, repetition of characters, and use of hashtags.
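The candidate search can be sketched as follows; the whitespace tokenization and lowercasing are assumptions for illustration, since only the 0.5 threshold is specified above:

```python
# Sketch of the near-duplicate candidate search: collect all tweet pairs
# whose token sets have a Jaccard coefficient greater than 0.5.
from itertools import combinations

def jaccard(a: set[str], b: set[str]) -> float:
    """|A intersection B| / |A union B| for two token sets."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def duplicate_candidates(tweets: list[str], threshold: float = 0.5):
    token_sets = [set(t.lower().split()) for t in tweets]
    for i, j in combinations(range(len(tweets)), 2):
        if jaccard(token_sets[i], token_sets[j]) > threshold:
            yield i, j   # pair to be inspected manually
```

The exhaustive pairwise comparison is quadratic in the number of tweets, which remains tractable as an offline step for a corpus of this size; the candidate pairs were then inspected and clustered manually as described above.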
The aim for this second version of the corpus was to get 30,000 tweets in total, so we extracted 10,000 new tweets from humorous accounts (the same accounts as in the previous year plus thirteen new accounts) and 3,000 new random Spanish tweets. We used the same web tool for annotating the new tweets, with a small modification.

From our experience during the 2018 annotation, we found that some annotators were still confused between considering a tweet "non-humorous" and considering it "humorous but not funny". This was more evident for tweets that contained insults or offensive content: on occasion, tweets that would be considered a bad-taste joke (i.e., humorous but not funny) could be labeled as not humor if they contained insults. To alleviate this situation, we decided to slightly modify the graphical interface by adding a new option to mark a tweet as offensive. This option, as shown in Figure 1, is a checkbox, and its information is saved only after the user has chosen whether the tweet is humorous or not. The purpose of this new option is twofold: on the one hand, it could help us collect information about which tweets are offensive, in order to analyze whether there is any correlation between offensive content and humor; on the other hand, it would help make clearer to annotators that there are tweets that may be offensive or in bad taste but should be marked as humorous nonetheless. We hoped that making this option explicit would help disentangle these possibilities and show that offensiveness and humor are different dimensions.

Between February and March 2019, we received 74,312 votes from 780 users. This time we used two test tweets presented to all users, different from the ones used the previous year but with the same intent of detecting invalid sessions. After determining the humorous tweets and their respective scores, we discarded non-humorous tweets until we got the 30,000 tweets we wanted for this version of the corpus, which ended up slightly more balanced than the 2018 version, with 38.6% humorous tweets. In the 2019 version of the corpus, there are 30,000 tweets, of which 11,595 are humorous and 18,405 are not; the average funniness score for the humorous tweets is 2.04.
The corpus was divided into an 80/20 train-test split with the following criteria: all tweets that had been part of the train and test partitions in the 2018 version of the corpus are part of the training partition in the 2019 corpus, and the tweets that were annotated in 2019 were split between train and test so as to keep the best possible balance given the number of humorous tweets. In this way, the 2019 test partition contains only tweets that the participants of the previous year had not seen. This corpus was used in the HAHA at IberLEF 2019 competition (Chiruzzo et al., 2019). Refer to the Appendix for more details.
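A simplified sketch of this splitting rule follows; the stratified draw over the 2019 pool is our reading of "keep the best possible balance", and the exact procedure may differ:

```python
# Simplified sketch of the 2019 split rule: all 2018 tweets go to train;
# the test partition is drawn only from the 2019-annotated tweets,
# stratified by humor label to keep the balance as close as possible.
import random

def split_2019(tweets_2018, tweets_2019, test_fraction=0.2, seed=0):
    """Each input is a list of (tweet_id, is_humorous) pairs."""
    rng = random.Random(seed)
    train = list(tweets_2018)               # 2018 tweets never go to test
    n_test = int((len(tweets_2018) + len(tweets_2019)) * test_fraction)
    test = []
    for label in (True, False):
        pool = [t for t in tweets_2019 if t[1] == label]
        rng.shuffle(pool)
        # proportional share of the test set for this label
        share = round(n_test * len(pool) / len(tweets_2019))
        test.extend(pool[:share])
        train.extend(pool[share:])
    return train, test
```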
4. Analysis

In this section, we present the composition of the final dataset and an analysis of some aspects of the corpus.

4.1. Dataset information

[Figure 2: Distribution of votes in the final version of the corpus. The numbers 1 to 5 are the different scores the annotators could assign to the humorous tweets. Not humor: 53.9%; 1: 17.6%; 2: 13.6%; 3: 9.5%; 4: 4.1%; 5: 1.2%.]
              Train    Test    Total
Tweets        24,000   6,000   30,000
Non-humorous  14,757   3,658   18,405
Humorous       9,253   2,342   11,595
Average Score   2.04    2.03     2.04
Votes No      59,440  13,605   73,045
Votes 1       19,063   4,818   23,881
Votes 2       14,713   3,777   18,490
Votes 3       10,206   2,649   12,855
Votes 4        4,493   1,122    5,615
Votes 5        1,305     275    1,580

Table 2: Composition of the final corpus for the total count and each class.

Table 2 shows the composition of the corpus, which is provided as two CSV files containing the training data and the test data. Each row in the files includes the tweet's unique identifier, the text of the tweet, the number of votes for each category (not humor, 1, 2, 3, 4 or 5 stars), and two values that can be calculated from the number of votes: a boolean value indicating whether the tweet should be considered humorous and a real value indicating the average funniness score (if the tweet is not humorous, this value is NULL). Figure 2 shows the general distribution of votes for tweets in the corpus. The corpus contains around 38.6% humorous tweets (i.e., tweets that received fewer negative votes than positive votes), although in total the number of positive votes is around 46.1%.
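As a usage example, the two derived values can be recomputed from the raw vote counts when loading the files; the file and column names below are illustrative assumptions, not the official schema:

```python
# Loading sketch; the file and column names are illustrative assumptions.
import pandas as pd

train = pd.read_csv("haha_2019_train.csv")

# Recompute the two derived values from the raw vote counts.
score_cols = [f"votes_{i}" for i in range(1, 6)]
humor_votes = train[score_cols].sum(axis=1)
is_humor = humor_votes > train["votes_no"]            # more positive than negative
weighted = sum(train[f"votes_{i}"] * i for i in range(1, 6))
avg_score = (weighted / humor_votes).where(is_humor)  # NaN for non-humorous
```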
4.2. Agreement

                Humor           Funniness
                2018    2019    2018    2019
All sessions    0.551   0.605   0.144   0.208
Valid sessions  0.571   0.639   0.163   0.224

Table 3: Annotator agreement measured as Krippendorff's alpha for the categorical humor value and the ranged funniness value.

Table 3 shows the agreement of the annotators, calculated using Krippendorff's alpha, for the 2018 and 2019 versions of the corpus. First of all, the agreement for the humorous/non-humorous classes is above 0.5 in all cases, which indicates moderate to substantial agreement (Fleiss, 1971); compare this to the agreement value of 0.365 reported in (Castro et al., 2016) for a similar task. The agreement for the funniness score is considerably lower, which is expected due to the high subjectivity of this measure. It is also interesting that the agreement increases appreciably in all cases when considering only the valid sessions. This could indicate that presenting test tweets to all users helps rule out some low-quality annotations. The agreement values for the 2019 annotations have also increased significantly with respect to the 2018 corpus.
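For reference, both agreement values can be computed with the open-source krippendorff Python package; the data layout and encoding below are assumptions for illustration:

```python
# Illustrative computation of both agreement values with the open-source
# `krippendorff` package (pip install krippendorff); the data layout and
# encoding are assumptions, not the corpus format.
import numpy as np
import krippendorff

# Rows = annotators, columns = tweets; np.nan marks unseen tweets.
# 0 encodes "not humor", 1..5 encode the funniness score.
ratings = np.array([
    [0.0, 3.0, 2.0, 1.0],
    [0.0, 4.0, 2.0, np.nan],
    [2.0, 3.0, 1.0, 1.0],
])

# Humor/non-humor agreement: collapse the five scores into one class.
humor = np.where(np.isnan(ratings), np.nan, (ratings > 0).astype(float))
print(krippendorff.alpha(reliability_data=humor,
                         level_of_measurement="nominal"))

# Funniness agreement on humor votes only, treated as an interval scale;
# keep tweets with at least two scores so values remain pairable.
funny = np.where(ratings > 0, ratings, np.nan)
funny = funny[:, (~np.isnan(funny)).sum(axis=0) >= 2]
print(krippendorff.alpha(reliability_data=funny,
                         level_of_measurement="interval"))
```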
4.3. Offensiveness

In total, we received 1,438 votes that were marked as offensive. Although this number is not enough for creating a corpus of offensive tweets (indeed, very few tweets were voted as offensive more than once), we found an interesting property of the votes that had the offensive mark. Figure 3 shows the distribution by category of all votes marked as offensive and all votes not marked as offensive. Notice that in the cases where a user marked the tweet as offensive, the most common voted category is "1" (humorous, but with the lowest score). On the other hand, if the vote does not have the offensive mark, the most common category is "x" (not humor). This could indicate that the users who understood the possibility of marking a tweet as offensive also understood more clearly that a tweet can be both offensive and humorous, while other users opted for marking more tweets as not humorous. Another possibility is that offensive tweets (such as tweets containing insults) have a higher chance of being jokes in bad taste. Further analysis is needed to understand which is the case in our corpus.
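The comparison behind Figure 3 amounts to a distribution over vote categories conditioned on the offensive mark; a sketch with an assumed per-vote layout:

```python
# Sketch of the comparison behind Figure 3; the per-vote data layout
# ('category' in {'x','1'..'5'}, boolean 'offensive') is assumed.
import pandas as pd

votes = pd.DataFrame({
    "category":  ["x", "1", "1", "3", "x", "2"],
    "offensive": [False, True, True, False, False, False],
})
# Per-category vote fractions, with vs. without the offensive mark.
dist = (votes.groupby("offensive")["category"]
             .value_counts(normalize=True))
print(dist)
```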
Year  System           Precision  Recall  F1    Acc   RMSE
2018  INGEOTEC         77.9       81.6    79.2  84.5  0.978
2018  UO_UPV           81.6       75.7    78.5  84.6  1.592
2018  ELiRF_UPV        80.5       74.3    77.2  83.7  -
2018  random baseline  36.5       48.9    41.8  49.2  1.142
2018  dash baseline    93.9        9.3    16.9  65.9  -
2019  adilism          79.1       85.2    82.1  85.5  0.736
2019  Kevin & Hiromi   80.2       83.1    81.6  85.4  0.769
2019  bfarzin          78.2       83.9    81.0  84.6  0.746
2019  random baseline  39.4       49.7    44.0  50.5  1.651
2019  dash baseline    94.5       16.3    27.8  66.9  -

Table 4: Performance of the top three teams that took part in the competitions in 2018 and 2019. Task 1 refers to humor identification (classification task), evaluated with Precision, Recall, F1 and Accuracy; Task 2 refers to funniness score prediction (regression task), evaluated with RMSE.
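The measures in Table 4 correspond to standard classification and regression metrics; the following scikit-learn sketch shows one way to compute them, though the official evaluation scripts may differ:

```python
# Sketch of the evaluation measures in Table 4 using scikit-learn;
# the exact official evaluation setup may differ.
from sklearn.metrics import (accuracy_score, f1_score, mean_squared_error,
                             precision_score, recall_score)

def task1_scores(y_true, y_pred):
    """Task 1 (humor identification): P/R/F1 on the humorous class + accuracy."""
    return (precision_score(y_true, y_pred),
            recall_score(y_true, y_pred),
            f1_score(y_true, y_pred),
            accuracy_score(y_true, y_pred))

def task2_score(scores_true, scores_pred):
    """Task 2 (funniness prediction): root mean squared error."""
    return mean_squared_error(scores_true, scores_pred) ** 0.5
```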
[Figure 3: Distribution by category (not humor, 1-5) of votes marked as offensive vs. votes not marked as offensive; y-axis: % votes.]

5. Conclusion

We presented a corpus of Spanish tweets annotated with humor value and funniness score.
Ismailov, A. (2019). Humor Analysis Based on Human Annotation Challenge at IberLEF 2019: First-place Solution. In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019), CEUR Workshop Proceedings, Bilbao, Spain, September. CEUR-WS.
Krippendorff, K. (2012). Content Analysis: An Introduction to Its Methodology. Sage.
Ortiz-Bejar, J., Salgado, V., Graff, M., Moctezuma, D., Miranda-Jiménez, S., and Tellez, E. (2018). INGEOTEC at IberEval 2018 Task HaHa: mu-TC and EvoMSA to Detect and Score Humor in Texts. In Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018), co-located with the 34th Conference of the Spanish Society for Natural Language Processing (SEPLN 2018).

7. Language Resource References
Castro, S., Cubero, M., Garat, D., and Moncecchi, G. (2016). Is This a Joke? Detecting Humor in Spanish Tweets. In Ibero-American Conference on Artificial Intelligence, pages 139-150. Springer.
Castro, S., Chiruzzo, L., Rosá, A., Garat, D., and Moncecchi, G. (2018). A Crowd-Annotated Spanish Corpus for Humor Analysis. In Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media, pages 7-11.
Khandelwal, A., Swami, S., Akhtar, S. S., and Shrivastava, M. (2018). Humor Detection in English-Hindi Code-Mixed Social Media Content: Corpus and Baseline System. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May. European Language Resources Association (ELRA).
Mihalcea, R. and Strapparava, C. (2005). Making Computers Laugh: Investigations in Automatic Humor Recognition. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT '05, pages 531-538, Stroudsburg, PA, USA. Association for Computational Linguistics.
Potash, P., Romanov, A., and Rumshisky, A. (2017). SemEval-2017 Task 6: #HashtagWars: Learning a Sense of Humor. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 49-57.
Reyes, A., Rosso, P., and Veale, T. (2013). A Multidimensional Approach for Detecting Irony in Twitter. Language Resources and Evaluation, 47(1):239-268.
Sjöbergh, J. and Araki, K. (2007). Recognizing Humor Without Recognizing Meaning. In Francesco Masulli, et al., editors, WILF, volume 4578 of Lecture Notes in Computer Science, pages 469-476. Springer.
van den Beukel, S. and Aroyo, L. (2018). Homonym Detection for Humor Recognition in Short Text. In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 286-291, Brussels, Belgium, October. Association for Computational Linguistics.
Yang, D., Lavie, A., Dyer, C., and Hovy, E. (2015). Humor Recognition and Humor Anchor Extraction. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2367-2376, Lisbon, Portugal, September. Association for Computational Linguistics.

Appendix

We present the data statement following (Bender and Friedman, 2018). HAHA 2019 is a dataset of 30,000 tweets with annotations for intended humor (binary) and funniness (five-point scale). It can be accessed via https://www.fing.edu.uy/inco/grupos/pln/haha/.

A. Curation Rationale. Jokes are hard to find online automatically using heuristics. At the same time, finding jokes within in-the-wild long texts can be problematic, since their boundaries with respect to non-humorous content have to be accounted for. Thus, we collected jokes from Twitter, supposing a tweet is either completely humorous or not at all. We relied on cherry-picked humorous accounts to source humorous tweets, and on randomly sampled tweets for non-humorous content (which we hypothesize are harder to tell apart from jokes than specific types of tweets such as headlines or proverbs). We collected the data between December 2018 and February 2019.
Because the data from each source type is not clean, we carried out an online crowd-annotation between February and March 2019, in which any person could enter the web page and annotate tweets voluntarily. We shared this web page with our acquaintances and also on social networks (Facebook and Twitter). For spam detection, we used three tweets for which we knew the intended-humor answer, and we used an HTTP cookie with a long expiration time to avoid showing repeated tweets (note that a user could eventually see the same tweet twice if entering from different devices). We always showed a random tweet among those that had the minimum annotation count. Finally, because we detected duplicate tweet texts (same content with the same or different format), we merged them along with their annotations.
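The serving rule can be sketched as follows (assumed logic, not the annotation tool's actual code):

```python
# Sketch (assumed logic, not the tool's actual code) of the serving rule:
# pick a random tweet, among those the current user has not seen, from
# the ones with the fewest annotations so far.
import random

def next_tweet(annotation_counts: dict, seen: set):
    candidates = {t: c for t, c in annotation_counts.items() if t not in seen}
    if not candidates:
        return None
    fewest = min(candidates.values())
    return random.choice([t for t, c in candidates.items() if c == fewest])
```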
B. Language Variety. Because we do not target a specific Spanish dialect, for the former (humorous tweets) we used a varied number of humorous accounts that declared to be from each of the countries in which Spanish is the main language, while for the latter (non-humorous tweets) we obtained randomly streamed tweets in Spanish (using Twitter's language detection). The corpus includes dialects of Spanish (es) from the following origins: Argentina (es-AR), Bolivia (es-BO), Chile (es-CL), Colombia (es-CO), Costa Rica (es-CR), Dominican Republic (es-DO), Ecuador (es-EC), El Salvador (es-SV), Guatemala (es-GT), Honduras (es-HN), Mexico (es-MX), Nicaragua (es-NI), Panama (es-PA), Paraguay (es-PY), Peru (es-PE), Puerto Rico (es-PR), Spain (es-ES), and Uruguay (es-UY).

C. Speaker Demographic. The only information we know is that the speakers likely speak Spanish.

D. Annotator Demographic. For privacy and practical reasons, we do not ask annotators for information. However, we have the following information for the annotation period from Google Analytics (February 1st to March 31st, 2019):

• 92% bounce rate. The following data only counts the non-bounced visits.
• 1,222 page views (note that in one page view there can be many annotations).
• 1,083 sessions.