HAHA 2019 Dataset: A Corpus for Humor Analysis in Spanish
Luis Chiruzzo, Santiago Castro, Aiala Rosá
Abstract
This paper presents the development of a corpus of 30,000 Spanish tweets that were crowd-annotated with humor value and funniness
score. The corpus contains approximately 38.6% intended-humor tweets, with an average funniness score of 2.04 out of 5. It has been used in an automatic humor recognition and analysis competition, in which participants obtained encouraging results.
Keywords: Humor, Computational Humor, Humor Detection, Humor Analysis, Natural Language Processing
dataset, which has had two iterations so far and has been used for the 2018 and 2019 editions of the HAHA (Humor Analysis based on Human Annotations) evaluation campaign.
(two humorous and a non-humorous one). The purpose of these test tweets was to rule out users that did not understand the premise of the annotation process: we considered the sessions where any of these tweets were mislabeled as invalid sessions and did not use their votes in the final version of the corpus.

First, we aimed at getting at least five votes for each tweet and determining the humorous tweets by simple majority, i.e., the tweets that got at least three humor votes out of five would be considered humorous. We shuffled the tweets presented to the annotators, trying to keep the number of votes for each tweet as close as possible on average and not letting some of the tweets lag behind (a notable exception to this are the three test tweets, which received as many votes as sessions). As the voting period proceeded, we realized that the tweets that had already received three negative votes had no possibility of being considered humorous in the final corpus, even if the remaining two votes were of the humorous categories. Because of this, occasionally during the voting period, we manually deprioritized the tweets that had received three or more negative votes, keeping in the pool only the tweets that still had a chance of being considered positive. As a result, the corpus contains some tweets that do not have five votes, mainly the non-humorous ones.

Table 1: Examples of different types of near-duplicate tweets (English glosses in brackets).

(1a) Si tuviera un peso por cada persona que me dice "feo", pues sería pobre porque soy perfecto.
(1b) Si tuviera un peso por cada vez que alguien me dice "feo", sería pobre porque soy perfecto.
     [If I had a peso for every person who / every time someone calls me "ugly", I would be poor, because I am perfect.]

(2a) —¿Tienes Wi-Fi? —Claro. —¿Cuál es la clave? —Tener dinero y pagarlo.
(2b) - ¿Tienes wi-fi? - Sí - ¿Y cuál es la clave? - Tener dinero y pagarlo.
     ["Do you have Wi-Fi?" "Sure." "And what is the key?" "Having money and paying for it."]

(3a) Me encanta encontrar dinero en mi ropa. Es como un regalo para mí de mí.
(3b) Me encanta encontrar el dinero en mi ropa, es como un regalo para mí de mí.
     [I love finding money in my clothes. It is like a gift to me from me.]

(4a) Cuando te digan ESTUDIA no hagas nada, significa ES-TU-DIA, aprovéchalo.
(4b) Cuando te digan ESTUDIA no hagas nada, significa ES-TU-DIA, aprovechalo! #fb
     [When they tell you ESTUDIA (study), do nothing; it means ES-TU-DIA (it's your day), make the most of it.]

(5a) ¿Cursi yo? ¿Cursi YOO? Cursi el viento..!! ..que acaricia tu cabello, impregnándome de tu aroma, y el dulce terciopelo...
(5b) Cursi yo?? CURSI YOOOOOOO?????... cursi el viento que acaricia tu cabello impregnandome con tu aroma y el dulce terciopelo...
     [Me, cheesy? Cheesy is the wind that caresses your hair, filling me with your scent, and the sweet velvet...]
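The resulting rule is easy to state concretely; the following is a minimal sketch, where the per-vote encoding (None for "not humor", 1 to 5 for the funniness score) is our own illustration rather than the release format:

```python
# Minimal sketch of the labeling rule described above; the vote encoding
# (None = "not humor", 1..5 = funniness score) is our own illustration.
from statistics import mean
from typing import Optional

def label_tweet(votes: list[Optional[int]]) -> tuple[bool, Optional[float]]:
    """Return (is_humorous, average_funniness) for one tweet's votes."""
    humor_votes = [v for v in votes if v is not None]
    is_humorous = len(humor_votes) > len(votes) / 2   # simple majority
    score = mean(humor_votes) if is_humorous else None
    return is_humorous, score

# Three humor votes out of five make the tweet humorous:
print(label_tweet([2, None, 3, None, 2]))  # (True, 2.33...)
print(label_tweet([None, 5, None, None]))  # (False, None)
```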
Once the voting period ended, we had received 117,800 votes from 1,546 users. We collected all the annotations, discarded the invalid sessions, determined the humor value by simple majority, and computed the average score for the humorous tweets. In total, around 26.9% of the tweets were considered humorous. We then randomly discarded non-humorous tweets until we got 20,000 tweets in total, achieving a final proportion of 36.8% humorous tweets in the corpus. This 2018 version of the corpus contains 20,000 tweets, of which 7,357 are humorous and 12,643 are not; the average funniness score for the humorous tweets is 2.10. The corpus was divided into an 80/20 train-test split and was used in the HAHA at IberEval 2018 competition (Castro et al., 2018).

3.4. Corpus 2019
The second iteration was done between December 2018 and March 2019. First, we started by analyzing some tweets in the 2018 version of the corpus that we noticed were near-duplicates, i.e., the content was almost the same except for a few different words that did not change the semantics. We used a semi-automatic process to find duplicate candidates by collecting all pairs of tweets that had a Jaccard coefficient greater than 0.5. We manually inspected all pairs, clustered them into equivalence classes, and took one example from each class, discarding the others from the corpus. As a result, we pruned 1,278 tweets from the corpus, most of them humorous. Table 1 shows some examples of near-duplicate tweets found in the previous version of the corpus. The most common differences between tweets considered near-duplicates include slight changes in spelling or capitalization, differences in punctuation, repetition of characters, and use of hashtags.
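The candidate search can be sketched as follows; the whitespace tokenization and lowercasing are assumptions for illustration, since only the 0.5 threshold is specified above:

```python
# Sketch of the near-duplicate candidate search: collect all tweet pairs
# whose token sets have a Jaccard coefficient greater than 0.5.
from itertools import combinations

def jaccard(a: set[str], b: set[str]) -> float:
    """|A intersection B| / |A union B| for two token sets."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def duplicate_candidates(tweets: list[str], threshold: float = 0.5):
    token_sets = [set(t.lower().split()) for t in tweets]
    for i, j in combinations(range(len(tweets)), 2):
        if jaccard(token_sets[i], token_sets[j]) > threshold:
            yield i, j   # pair to be inspected manually
```

The exhaustive pairwise comparison is quadratic in the number of tweets, which remains tractable as an offline step for a corpus of this size; the candidate pairs were then inspected and clustered manually as described above.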
The aim for this second version of the corpus was to get 30,000 tweets in total, so we extracted 10,000 new tweets from humorous accounts (the same accounts as in the previous year plus thirteen new accounts) and 3,000 new random Spanish tweets. We used the same web tool for annotating the new tweets, with a small modification.

From our experience during the 2018 annotation, we found that some annotators were still confused between considering a tweet "non-humorous" and considering it "humorous but not funny". This was more evident for tweets that contained insults or offensive content: on occasion, tweets that would be considered a bad-taste joke (i.e., humorous but not funny) could be labeled as not humor if they contained insults. To alleviate this situation, we decided to slightly modify the graphical interface by adding a new option to mark a tweet as offensive. This option, as shown in Figure 1, is a checkbox, and its information is saved only after the user has chosen whether the tweet is humorous or not. The purpose of this new option is twofold: on the one hand, it could help us collect information about which tweets are offensive, in order to analyze whether there is any correlation between offensive content and humor; on the other hand, it would help make clearer to annotators that there are tweets that may be offensive or in bad taste but should be marked as humorous nonetheless. We hoped that making this option explicit would help disentangle these possibilities and show that offensiveness and humor are different dimensions.

Between February and March 2019, we received 74,312 votes from 780 users. This time we used two test tweets presented to all users, different from the ones used the previous year but with the same intent of detecting invalid sessions. After determining the humorous tweets and their respective scores, we discarded non-humorous tweets until we got the 30,000 tweets we wanted for this version of the corpus, which ended up slightly more balanced than the 2018 version, with 38.6% humorous tweets. In the 2019 version of the corpus, there are 30,000 tweets, of which 11,595 are humorous and 18,405 are not; the average funniness score for the humorous tweets is 2.04.
The corpus was divided into an 80/20 train-test split with the following criteria: all tweets that had been part of the train and test partitions in the 2018 version of the corpus are part of the training partition in the 2019 corpus, and the tweets that were annotated in 2019 were split between train and test so as to keep the best possible balance given the number of humorous tweets. In this way, the 2019 test partition contains only tweets that the participants of the previous year had not seen. This corpus was used in the HAHA at IberLEF 2019 competition (Chiruzzo et al., 2019). Refer to the Appendix for more details.
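A simplified sketch of this splitting rule follows; the stratified draw over the 2019 pool is our reading of "keep the best possible balance", and the exact procedure may differ:

```python
# Simplified sketch of the 2019 split rule: all 2018 tweets go to train;
# the test partition is drawn only from the 2019-annotated tweets,
# stratified by humor label to keep the balance as close as possible.
import random

def split_2019(tweets_2018, tweets_2019, test_fraction=0.2, seed=0):
    """Each input is a list of (tweet_id, is_humorous) pairs."""
    rng = random.Random(seed)
    train = list(tweets_2018)               # 2018 tweets never go to test
    n_test = int((len(tweets_2018) + len(tweets_2019)) * test_fraction)
    test = []
    for label in (True, False):
        pool = [t for t in tweets_2019 if t[1] == label]
        rng.shuffle(pool)
        # proportional share of the test set for this label
        share = round(n_test * len(pool) / len(tweets_2019))
        test.extend(pool[:share])
        train.extend(pool[share:])
    return train, test
```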
4. Analysis

In this section, we present the composition of the final dataset and an analysis of some aspects of the corpus.

4.1. Dataset information

[Figure 2: Distribution of votes in the final version of the corpus. The numbers 1 to 5 are the different scores the annotators could assign to the humorous tweets. Not humor: 53.9%; 1: 17.6%; 2: 13.6%; 3: 9.5%; 4: 4.1%; 5: 1.2%.]
              Train    Test    Total
Tweets        24,000   6,000   30,000
Non-humorous  14,757   3,658   18,405
Humorous       9,253   2,342   11,595
Average Score   2.04    2.03     2.04
Votes No      59,440  13,605   73,045
Votes 1       19,063   4,818   23,881
Votes 2       14,713   3,777   18,490
Votes 3       10,206   2,649   12,855
Votes 4        4,493   1,122    5,615
Votes 5        1,305     275    1,580

Table 2: Composition of the final corpus for the total count and each class.

Table 2 shows the composition of the corpus, which is provided as two CSV files containing the training data and the test data. Each row in the files includes the tweet's unique identifier, the text of the tweet, the number of votes for each category (not humor, 1, 2, 3, 4 or 5 stars), and two values that can be calculated from the number of votes: a boolean value indicating whether the tweet should be considered humorous and a real value indicating the average funniness score (if the tweet is not humorous, this value is NULL). Figure 2 shows the general distribution of votes for tweets in the corpus. The corpus contains around 38.6% humorous tweets (i.e., tweets that received fewer negative votes than positive votes), although in total the number of positive votes is around 46.1%.
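As a usage example, the two derived values can be recomputed from the raw vote counts when loading the files; the file and column names below are illustrative assumptions, not the official schema:

```python
# Loading sketch; the file and column names are illustrative assumptions.
import pandas as pd

train = pd.read_csv("haha_2019_train.csv")

# Recompute the two derived values from the raw vote counts.
score_cols = [f"votes_{i}" for i in range(1, 6)]
humor_votes = train[score_cols].sum(axis=1)
is_humor = humor_votes > train["votes_no"]            # more positive than negative
weighted = sum(train[f"votes_{i}"] * i for i in range(1, 6))
avg_score = (weighted / humor_votes).where(is_humor)  # NaN for non-humorous
```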
4.2. Agreement

                Humor           Funniness
                2018    2019    2018    2019
All sessions    0.551   0.605   0.144   0.208
Valid sessions  0.571   0.639   0.163   0.224

Table 3: Annotator agreement measured as Krippendorff's alpha for the categorical humor value and the ranged funniness value.

Table 3 shows the agreement of the annotators, calculated using Krippendorff's alpha, for the 2018 and 2019 versions of the corpus. First of all, the agreement for the humorous/non-humorous classes is above 0.5 in all cases, which indicates moderate to substantial agreement (Fleiss, 1971); compare this to the agreement value of 0.365 reported in (Castro et al., 2016) for a similar task. The agreement for the funniness score is considerably lower, which is expected due to the high subjectivity of this measure. It is also interesting that the agreement increases appreciably in all cases when considering only the valid sessions. This could indicate that presenting test tweets to all users helps rule out some low-quality annotations. The agreement values for the 2019 annotations have also increased significantly with respect to the 2018 corpus.
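For reference, both agreement values can be computed with the open-source krippendorff Python package; the data layout and encoding below are assumptions for illustration:

```python
# Illustrative computation of both agreement values with the open-source
# `krippendorff` package (pip install krippendorff); the data layout and
# encoding are assumptions, not the corpus format.
import numpy as np
import krippendorff

# Rows = annotators, columns = tweets; np.nan marks unseen tweets.
# 0 encodes "not humor", 1..5 encode the funniness score.
ratings = np.array([
    [0.0, 3.0, 2.0, 1.0],
    [0.0, 4.0, 2.0, np.nan],
    [2.0, 3.0, 1.0, 1.0],
])

# Humor/non-humor agreement: collapse the five scores into one class.
humor = np.where(np.isnan(ratings), np.nan, (ratings > 0).astype(float))
print(krippendorff.alpha(reliability_data=humor,
                         level_of_measurement="nominal"))

# Funniness agreement on humor votes only, treated as an interval scale;
# keep tweets with at least two scores so values remain pairable.
funny = np.where(ratings > 0, ratings, np.nan)
funny = funny[:, (~np.isnan(funny)).sum(axis=0) >= 2]
print(krippendorff.alpha(reliability_data=funny,
                         level_of_measurement="interval"))
```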
4.3. Offensiveness

In total, we received 1,438 votes that were marked as offensive. Although this number is not enough for creating a corpus of offensive tweets (indeed, very few tweets were voted as offensive more than once), we found an interesting property of the votes that had the offensive mark. Figure 3 shows the distribution by category of all votes marked as offensive and all votes not marked as offensive. Notice that in the cases where a user marked the tweet as offensive, the most common voted category is "1" (humorous, but with the lowest score). On the other hand, if the vote does not have the offensive mark, the most common category is "x" (not humor). This could indicate that the users who understood the possibility of marking a tweet as offensive also understood more clearly that a tweet can be both offensive and humorous, while other users opted for marking more tweets as not humorous. Another possibility is that offensive tweets (such as tweets containing insults) have a higher chance of being jokes in bad taste. Further analysis is needed to understand which is the case in our corpus.
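The comparison behind Figure 3 amounts to a distribution over vote categories conditioned on the offensive mark; a sketch with an assumed per-vote layout:

```python
# Sketch of the comparison behind Figure 3; the per-vote data layout
# ('category' in {'x','1'..'5'}, boolean 'offensive') is assumed.
import pandas as pd

votes = pd.DataFrame({
    "category":  ["x", "1", "1", "3", "x", "2"],
    "offensive": [False, True, True, False, False, False],
})
# Per-category vote fractions, with vs. without the offensive mark.
dist = (votes.groupby("offensive")["category"]
             .value_counts(normalize=True))
print(dist)
```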
Year  System           Precision  Recall  F1    Acc   RMSE
2018  INGEOTEC         77.9       81.6    79.2  84.5  0.978
2018  UO_UPV           81.6       75.7    78.5  84.6  1.592
2018  ELiRF_UPV        80.5       74.3    77.2  83.7  -
2018  random baseline  36.5       48.9    41.8  49.2  1.142
2018  dash baseline    93.9        9.3    16.9  65.9  -
2019  adilism          79.1       85.2    82.1  85.5  0.736
2019  Kevin & Hiromi   80.2       83.1    81.6  85.4  0.769
2019  bfarzin          78.2       83.9    81.0  84.6  0.746
2019  random baseline  39.4       49.7    44.0  50.5  1.651
2019  dash baseline    94.5       16.3    27.8  66.9  -

Table 4: Performance of the top three teams that took part in the competitions in 2018 and 2019. Task 1 refers to humor identification (classification task), evaluated with Precision, Recall, F1 and Accuracy; Task 2 refers to funniness score prediction (regression task), evaluated with RMSE.
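The measures in Table 4 correspond to standard classification and regression metrics; the following scikit-learn sketch shows one way to compute them, though the official evaluation scripts may differ:

```python
# Sketch of the evaluation measures in Table 4 using scikit-learn;
# the exact official evaluation setup may differ.
from sklearn.metrics import (accuracy_score, f1_score, mean_squared_error,
                             precision_score, recall_score)

def task1_scores(y_true, y_pred):
    """Task 1 (humor identification): P/R/F1 on the humorous class + accuracy."""
    return (precision_score(y_true, y_pred),
            recall_score(y_true, y_pred),
            f1_score(y_true, y_pred),
            accuracy_score(y_true, y_pred))

def task2_score(scores_true, scores_pred):
    """Task 2 (funniness prediction): root mean squared error."""
    return mean_squared_error(scores_true, scores_pred) ** 0.5
```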
[Figure 3: Distribution by category (not humor, 1-5) of votes marked as offensive vs. votes not marked as offensive; y-axis: % votes.]

5. Conclusion

We presented a corpus of Spanish tweets annotated with humor value and funniness score.
Ismailov, A. (2019). Humor Analysis Based on Human Annotation Challenge at IberLEF 2019: First-place Solution. In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019), CEUR Workshop Proceedings, Bilbao, Spain, September. CEUR-WS.
Krippendorff, K. (2012). Content Analysis: An Introduction to Its Methodology. Sage.
Ortiz-Bejar, J., Salgado, V., Graff, M., Moctezuma, D., Miranda-Jiménez, S., and Tellez, E. (2018). INGEOTEC at IberEval 2018 Task HaHa: mu-TC and EvoMSA to Detect and Score Humor in Texts. In Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018), co-located with the 34th Conference of the Spanish Society for Natural Language Processing (SEPLN 2018).

7. Language Resource References
Castro, S., Cubero, M., Garat, D., and Moncecchi, G. (2016). Is This a Joke? Detecting Humor in Spanish Tweets. In Ibero-American Conference on Artificial Intelligence, pages 139-150. Springer.
Castro, S., Chiruzzo, L., Rosá, A., Garat, D., and Moncecchi, G. (2018). A Crowd-Annotated Spanish Corpus for Humor Analysis. In Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media, pages 7-11.
Khandelwal, A., Swami, S., Akhtar, S. S., and Shrivastava, M. (2018). Humor Detection in English-Hindi Code-Mixed Social Media Content: Corpus and Baseline System. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May. European Language Resources Association (ELRA).
Mihalcea, R. and Strapparava, C. (2005). Making Computers Laugh: Investigations in Automatic Humor Recognition. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT '05, pages 531-538, Stroudsburg, PA, USA. Association for Computational Linguistics.
Potash, P., Romanov, A., and Rumshisky, A. (2017). SemEval-2017 Task 6: #HashtagWars: Learning a Sense of Humor. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 49-57.
Reyes, A., Rosso, P., and Veale, T. (2013). A Multidimensional Approach for Detecting Irony in Twitter. Language Resources and Evaluation, 47(1):239-268.
Sjöbergh, J. and Araki, K. (2007). Recognizing Humor Without Recognizing Meaning. In Francesco Masulli, et al., editors, WILF, volume 4578 of Lecture Notes in Computer Science, pages 469-476. Springer.
van den Beukel, S. and Aroyo, L. (2018). Homonym Detection for Humor Recognition in Short Text. In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 286-291, Brussels, Belgium, October. Association for Computational Linguistics.
Yang, D., Lavie, A., Dyer, C., and Hovy, E. (2015). Humor Recognition and Humor Anchor Extraction. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2367-2376, Lisbon, Portugal, September. Association for Computational Linguistics.

Appendix

We present the data statement following (Bender and Friedman, 2018). HAHA 2019 is a dataset of 30,000 tweets with annotations for intended humor (binary) and funniness (five-point scale). It can be accessed via https://www.fing.edu.uy/inco/grupos/pln/haha/.

A. Curation Rationale. Jokes are hard to find online automatically using heuristics. At the same time, finding jokes within in-the-wild long texts can be problematic, since their boundaries with respect to non-humorous content have to be accounted for. Thus, we collected jokes from Twitter, supposing a tweet is either completely humorous or not at all. We relied on cherry-picked humorous accounts to source humorous tweets, and on randomly sampled tweets for non-humorous content (which we hypothesize are harder to tell apart from jokes than specific types of tweets such as headlines or proverbs). We collected the data between December 2018 and February 2019.
Because the data from each source type is not clean, we carried out an online crowd-annotation between February and March 2019, in which any person could enter the web page and annotate tweets voluntarily. We shared this web page with our acquaintances and also on social networks (Facebook and Twitter). For spam detection, we used three tweets for which we knew the intended-humor answer, and we used an HTTP cookie with a long expiration time to avoid showing repeated tweets (note that a user could eventually see the same tweet twice if entering from different devices). We always showed a random tweet among those that had the minimum annotation count. Finally, because we detected duplicate tweet texts (same content with the same or different format), we merged them along with their annotations.
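The serving rule can be sketched as follows (assumed logic, not the annotation tool's actual code):

```python
# Sketch (assumed logic, not the tool's actual code) of the serving rule:
# pick a random tweet, among those the current user has not seen, from
# the ones with the fewest annotations so far.
import random

def next_tweet(annotation_counts: dict, seen: set):
    candidates = {t: c for t, c in annotation_counts.items() if t not in seen}
    if not candidates:
        return None
    fewest = min(candidates.values())
    return random.choice([t for t, c in candidates.items() if c == fewest])
```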
B. Language Variety. Because we do not target a specific Spanish dialect, for the former (humorous tweets) we used a varied number of humorous accounts that declared to be from each of the countries in which Spanish is the main language, while for the latter (non-humorous tweets) we obtained randomly streamed tweets in Spanish (using Twitter's language detection). The corpus includes dialects of Spanish (es) from the following origins: Argentina (es-AR), Bolivia (es-BO), Chile (es-CL), Colombia (es-CO), Costa Rica (es-CR), Dominican Republic (es-DO), Ecuador (es-EC), El Salvador (es-SV), Guatemala (es-GT), Honduras (es-HN), Mexico (es-MX), Nicaragua (es-NI), Panama (es-PA), Paraguay (es-PY), Peru (es-PE), Puerto Rico (es-PR), Spain (es-ES), and Uruguay (es-UY).

C. Speaker Demographic. The only information we know is that the speakers likely speak Spanish.

D. Annotator Demographic. For privacy and practical reasons, we do not ask annotators for information. However, we have the following information for the annotation period from Google Analytics (February 1st to March 31st, 2019):

• 92% bounce rate. The following data only counts the non-bounced visits.
• 1,222 page views (note that in one page view there can be many annotations).
• 1,083 sessions.