2025.chum-1.8
To measure the quantity of laughter elicited by
each joke, the recording of each set was labeled to
mark the segments in which laughter occurred. The
original audio was then converted to a graph of
decibels over time using Formula 1.
$$dB = 20 \cdot \log_{10}\left(|s| + 10^{-6}\right) \qquad (1)$$
In the formula, s is the original sound wave, and
$10^{-6}$ is the lowest sound level perceivable by
humans. The area under the curve, representing the
"quantity of laughter," was then computed using
Simpson’s numerical integration method
implemented in Python (Matthews, 2004). We refer
to the measure as Total Laughter; its units are
decibel-seconds (see Figure 1). We believe this
method best captures the quantity of laughter
compared to other potential methods such as the
average, median, or max, as those other methods
would be poor at capturing situations in which
different individuals in the audience "get the joke"
at different times, resulting in the same amount of
laughter spread over a longer period of time.

Figure 1: A demonstration of how the "Total
Laughter" of a single joke is measured. The sound
wave of the laughter segment following the joke is
converted to dB over time. The area under the
curve (here 241) is the Total Laughter in
decibel-seconds.
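
As a rough illustration of the Total Laughter computation described above, the following Python sketch applies Formula 1 to each sample of a laughter segment and integrates the resulting dB curve over time. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`to_decibels`, `simpson`, `total_laughter`) are hypothetical, the samples are assumed to be raw amplitudes recorded at a fixed sample rate, and the integration uses the composite Simpson's 1/3 rule with a single trapezoid on a leftover final interval, which may differ from the Simpson variant the authors used.

```python
import math

EPS = 1e-6  # floor from Formula 1; keeps log10 defined for silent samples


def to_decibels(samples):
    """Apply Formula 1 to each sample: dB = 20 * log10(|s| + 1e-6)."""
    return [20.0 * math.log10(abs(s) + EPS) for s in samples]


def simpson(y, dx):
    """Composite Simpson's 1/3 rule over equally spaced points.

    Assumption: if the number of intervals is odd, the last interval
    is integrated with the trapezoid rule instead.
    """
    n = len(y) - 1  # number of intervals
    if n < 1:
        return 0.0
    m = n if n % 2 == 0 else n - 1  # largest even interval count
    area = 0.0
    if m >= 2:
        odd_sum = sum(y[1:m:2])   # points weighted by 4
        even_sum = sum(y[2:m:2])  # interior points weighted by 2
        area += dx / 3.0 * (y[0] + y[m] + 4.0 * odd_sum + 2.0 * even_sum)
    if m != n:
        area += dx * (y[n - 1] + y[n]) / 2.0
    return area


def total_laughter(samples, sample_rate):
    """Area under the dB-over-time curve, in decibel-seconds."""
    return simpson(to_decibels(samples), 1.0 / sample_rate)
```

For example, a hypothetical segment with a constant amplitude of 1000 lasting 2 seconds sits at about 60 dB throughout, so `total_laughter([1000] * 201, 100)` returns approximately 120 decibel-seconds. The same total would result from a quieter laugh that lasts proportionally longer, which is the property the measure is meant to capture.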
For the present analysis, we used the audio of
two high-quality sets performed at the same North
Hollywood venue by the same comedian, Mike
Perkins, with audience sizes of 35 and 15. The sets
were performed a month apart at the same time of
day (at the end of the comedian's 10-minute set
opening the 8 p.m. show). Two other sets were
excluded from the analysis either because of poor
venue quality or small audience size (N<10).
The audio tracks were annotated to select the
segments of laughter associated with each joke. In
a typical set, sounds unrelated to the laughter, such
as heckling, would mix with the laughter. However,
these interferences were not an issue in the sets we
analyzed. Additionally, comedians might speak
over the laughter to make a comment or start the
next joke. But in the sets we analyzed, the
comedian made an effort to let the audience laugh
uninterrupted, though he often did start the next
joke when he felt the laughter was dying down. We
always ended the laugh segment before the
comedian resumed talking, so the audio segment
contained laughter only. Importantly, the comedian
was not aware which jokes had been written by AI,
so any such interference affected all jokes equally.
We compared the performance of Human vs. AI
jokes within sets and between sets. The
between-sets comparison required some form of
normalization of the laughs to remove any bias
resulting from the size of the audience or other
characteristics affecting the overall loudness of its
laughter. That normalization was achieved by:
1. Prior to conducting a paired t-test, we compared
two joke versions across sets. The Loudness
measure of all the jokes within a set was
normalized by the median Loudness across all
jokes in the set.
2. For the GLM the Set was included as a regressor
of no interest.

4.4 The Hypothesis

Historically, the standard for demonstrating that AI
had reached a certain milestone against human
performance involved only a few data points. For
example, Kasparov played only six games with
Deep Blue (AI) in 1997 (scoring 2.5-3.5). In 2011,
Watson (AI) competed only once against two
human champions on Jeopardy!, and won. While
such events would not meet the nominal standards
of statistical significance required to determine that
AI was "consistently" better than the human
champions, they are nevertheless considered
meaningful milestones, since before those events it
was considered inconceivable that AI would
perform at the level of those human champions
even once.
If generating jokes for a comedy/talk show-style
monologue, where the quality is judged by
audience laughter, was an AI-complete problem,
we would expect that:

H0: None of the AI-generated jokes would perform
better than any of the professional human writer’s.

We could reject this hypothesis if:

H1: Some of the AI-generated jokes performed
better than some of the Human’s.

5 Results and Discussion

5.1 Analysis Within a Set

Figure 2 displays the eight jokes performed in Set
1 ranked by the Total Laughter they elicited. Three
of the four jokes written by AI elicited more
laughter than at least one joke written by the human
expert. Additionally, the joke that elicited the most
laughter was AI-written.

Figure 2: The jokes written by the human expert
(H) and Witscript (AI) in order of the Total
Laughter they elicited in Set 1. Joke ID
corresponds to the actual order in which the jokes
were told. The jokes are listed in the Appendix.

This result is in line with H1, in that some of the
AI-written jokes did better than some of the
Human’s. The same pattern held true for Set 2; see
the Appendix for the data. If we deem this result to
be reliable, we can conclude that writing the type
of humor analyzed here is not AI-complete. How
can we determine this reliability?

How reliable is the measure itself? The measure
captures the total laughter of an audience of N=35
and 15 in Sets 1 and 2, respectively. In a classical
experiment, jokes are rated by a handful of raters.
While audience members' responses are not
entirely independent (e.g., laughter is contagious)
whatever effect audience members had on each
other was present for all jokes and presumably had
the effect of signal amplification rather than of
cancellation of individual differences.
Additionally, unlike with raters, it is not possible to
tease apart the contributions of individual raters
(here, audience members). Despite these
drawbacks, the number of raters/audience
members is much larger than in a typical study in
the field, suggesting higher reliability than the
standard. The validity of the measure is arguably
higher since the measure is of a natural response to
jokes in a natural environment. However, there
may be other forms of humor for which a
traditional approach using numerical ratings would
be better suited than our measurement method.

How did the Human and AI jokes compare? The
funniest joke (area under the curve = 241) was
written by AI. On average, AI did slightly better (M
= 106, SD = 96) than the Human (M = 104, SD =
86) in Set 1, with the reverse true in Set 2 (AI: M =
66, SD = 21; Human M = 99, SD = 93). However,
these differences were not significant (both sets:
Mann-Whitney U(4,4) = 8.0, ns). The lack of
statistical difference between the groups is not
meaningful with the present sample size. Instead,
as explained above (see the hypotheses), we rely on
a standard similar to Deep Blue's and Watson’s,
that of a limited live demonstration of equivalence
to human performance, which we have met.

5.2 Comparison Between the Sets

As described above, the two sets had the same eight
topics, for which half of the punchlines were
written by AI and half by the Human. The jokes
were counterbalanced so that if a particular topic
had a punchline written by the Human in Set 1 it
would have a punchline written by AI in Set 2, and
vice versa.
The audience size for Set 1 was bigger than for
Set 2 (35 vs. 15), resulting in longer laugh times (M
= 2.16 sec. vs. 1.71 sec.) and greater values on our
Total Laughter metric (M = 105 vs. 83). But
Figure 3: The Median Laughter Loudness (over the
duration of the laugh) elicited by the Human (H)-
vs. AI (A)-written jokes for each topic across the
two sets. The lack of pattern suggests equivalent
performance by the Human and AI sources.

7 Conclusion

AI-written jokes, performed in front of a live
audience, elicited laughter within the same range as
jokes written by a professional human comedy
writer.
Some AI-written jokes ranked higher than some
of the human-written jokes, and the funniest joke,
as measured by quantity of laughter, was written by
AI.
The study provides naturalistic, real-world
evidence that when it comes to generating
comedy/talk show monologue-style humor, an AI
system can perform at the level of a professional
human comedy writer.
Acknowledgments

We would like to thank the following comedians
for their insights and help: Mike Perkins, Kevin
Hickerson, Ajitesh Srivastava, and the comedy/talk
show writer who wrote the Human jokes for the
experiment.

References

Miriam Amin and Manuel Burghardt. 2020. A Survey
on Approaches to Computational Humor
Generation. In Proceedings of the 4th Joint
SIGHUM Workshop on Computational Linguistics
for Cultural Heritage, Social Sciences, Humanities
and Literature, pages 29–41, Online. International
Committee on Computational Linguistics.

Ori Amir et al. 2022. The elephant in the room:
attention to salient scene features increases with
comedic expertise. Cognitive Processing, 23(2),
203-215.

Ori Amir and Irving Biederman. 2016. The Neural
Correlates of Humor Creativity. Frontiers in Human
Neuroscience, 10(597).

Jacob Brawer and Ori Amir. 2021. Mapping the ‘funny
bone’: neuroanatomical correlates of humor
creativity in professional comedians. Social
Cognitive and Affective Neuroscience, 16(9),
915-925.

Tom B. Brown et al. 2020. Language models are
few-shot learners. arXiv preprint arXiv:2005.14165.

Fabricio Goes, Zisen Zhou, Piotr Sawicki, Marek
Grzes, and Daniel G. Brown. 2022. Crowd score: A
method for the evaluation of jokes using large
language model AI voters as judges. arXiv preprint
arXiv:2212.11214.

Drew Gorenz and Norbert Schwarz. 2024. How funny
is ChatGPT? A comparison of human- and
AI-produced jokes. PLoS ONE 19(7): e0305364.
https://doi.org/10.1371/journal.pone.0305364.

He He, Nanyun Peng and Percy Liang. 2019. Pun
Generation with Surprise. North American Chapter
of the Association for Computational Linguistics.

Nabil Hossain, John Krumm, Michael Gamon, and
Henry Kautz. 2020. SemEval-2020 Task 7:
Assessing Humor in Edited News Headlines. In
Proceedings of the Fourteenth Workshop on
Semantic Evaluation, pages 746–758, Barcelona
(online). International Committee for
Computational Linguistics.

Matthew M. Hurley, Daniel C. Dennett, and Reginald
B. Adams. 2011. Inside Jokes: Using Humor to
Reverse-Engineer the Mind. MIT Press.

Marcio L. Inácio and Hugo G. Oliveira. 2024.
Generation of Punning Riddles in Portuguese with
Prompt Chaining. 15th International Conference on
Computational Creativity (ICCC'24).

Carolyn Lamb, Daniel G. Brown, and Charles L.A.
Clarke. 2015. Human Competence in Creativity
Evaluation. Sixth International Conference on
Computational Creativity.

Tyler Loakman, Aaron Maladry, and Chenghua Lin.
2023. The Iron(ic) Melting Pot: Reviewing Human
Evaluation in Humour, Irony and Sarcasm
Generation. Conference on Empirical Methods in
Natural Language Processing.

John H. Matthews. 2004. Simpson’s 3/8 Rule for
Numerical Integration, Numerical Analysis-
Numerical Methods Project.

Anirudh Mittal, Yufei Tian, and Nanyun Peng.
2022. AmbiPun: Generating Humorous Puns with
Ambiguous Context. In Proceedings of the 2022
Conference of the North American Chapter of the
Association for Computational Linguistics: Human
Language Technologies, pages 1053–1062, Seattle,
United States. Association for Computational
Linguistics.

Stavros Petridis and Maja Pantic. 2009. Is this joke
really funny? Judging the mirth by audiovisual
laughter analysis. In 2009 IEEE International
Conference on Multimedia and Expo, New York,
NY, USA, pp. 1444-1447, doi:
10.1109/ICME.2009.5202774.

Saša Petrović and David Matthews.
2013. Unsupervised joke generation from big data.
In Proceedings of the 51st Annual Meeting of the
Association for Computational Linguistics (Volume
2: Short Papers), pages 228–232, Sofia, Bulgaria.
Association for Computational Linguistics.

Sophie Scott, Nadine Lavan, Sinead Chen, and Carolyn
McGettigan. 2014. The social life of laughter.
Trends in Cognitive Sciences, 18(12), 618–620.
https://doi.org/10.1016/j.tics.2014.09.002.

Alexey Tikhonov and Pavel Shtykovskiy. 2024. Humor
Mechanics: Advancing Humor Generation with
Multistep Reasoning. arXiv preprint
arXiv:2405.07280.

Joe Toplyn. 2014. Comedy Writing for Late-Night TV:
How to Write Monologue Jokes, Desk Pieces,
Sketches, Parodies, Audience Pieces, Remotes, and
Other Short-Form Comedy. Twenty Lane Media,
LLC, Rye, New York.

Joe Toplyn. 2020a. Systems and Methods for
Generating Jokes. U.S. Patent No. 10,642,939.
Washington, DC: U.S. Patent and Trademark Office.
Joe Toplyn. 2020b. Systems and Methods for
Generating Comedy. U.S. Patent No. 10,878,817.
Washington, DC: U.S. Patent and Trademark Office.

Joe Toplyn. 2021a. Systems and Methods for
Generating and Recognizing Jokes. U.S. Patent No.
11,080,485. Washington, DC: U.S. Patent and
Trademark Office.

Joe Toplyn. 2021b. Witscript: A System for Generating
Improvised Jokes in a Conversation. In Proceedings
of the 12th International Conference on
Computational Creativity, 22–31. Online:
Association for Computational Creativity.

Joe Toplyn. 2023. Witscript 3: A Hybrid AI System for
Improvising Jokes in a Conversation. arXiv,
abs/2301.02695.

Alessandro Valitutti, Hannu Toivonen, Antoine
Doucet, and Jukka M. Toivanen. 2013. “Let
Everything Turn Well in Your Wife”: Generation of
Adult Humor Using Lexical Constraints. In
Proceedings of the 51st Annual Meeting of the
Association for Computational Linguistics (Volume
2: Short Papers), pages 243–248, Sofia, Bulgaria.
Association for Computational Linguistics.

Thomas Winters. 2021. Computers Learning Humor Is
No Joke. Harvard Data Science Review, 3(2).
doi.org/10.1162/99608f92.f13a2337.

Hang Zhang, Dayiheng Liu, Jiancheng Lv, and Cheng
Luo. 2020. Let's be Humorous: Knowledge
Enhanced Humor Generation. Annual Meeting of
the Association for Computational Linguistics.

Appendix. The Jokes

Below is a full list of the jokes and their Joke ID,
which indicates the order in which they were told
in the sets. Each joke has a topic that serves as a
prompt/setup for both the AI- and human-written
punchlines. Each set randomly includes half of the
punchlines written by AI. Next to each joke, we
also provide these metrics for the laughter it
elicited: Total Laughter, in decibel-seconds (TL);
total laugh Time, in seconds (T); and Median
Laughter Loudness over the duration of the laugh,
in decibels (ML).

Joke 1
Topic:
A new report says that NASA officials are worried
about a leak on the International Space Station.
Human: (TL: 79, T: 2.00, ML: 40)
Will they fix it? Naw, even in space, landlords don't
fix leaks.
"But Houston, we have a potty problem."
That's on them, we subcontracted to Boeing.
AI: (TL: 51, T: 1.04, ML: 50)
They're especially concerned since the leak is
coming from one of their astronauts' space diapers.
Joke 2
Topic:
Why do TV stations air false political ads?
Human: (TL: 51, T: 1.04, ML: 50)
That's so after the election, we welcome the sound
of "Attention, Hemorrhoid Sufferers!"
AI: (TL: 101, T: 1.96, ML: 53)
Because they want to make sure the viewers are
just as confused as the candidates!
Joke 3
Topic:
A company just introduced a virtual dog leash that
uses wireless technology.
Human: (TL: 73, T: 1.58, ML: 47)
Wifi can control my dog's movements? So where's
his virtual pooper scooper?
AI: (TL: 53, T: 1.54, ML: 36)
But I'm pretty sure that's just a fancy way of saying
'I don't want to walk my dog.'
Joke 4
Topic:
Bob Yerkes, a stuntman who appeared in "Star
Wars," died at the age of 92.
Human: (TL: 231, T: 3.92, ML: 62)
In his long career, he broke so many bones, his
grave says Rest in Pieces. But true Star Wars fan to
the end, he asked to be buried in his parent's
basement.
AI: (TL: 48, T: 1.00, ML: 49)
He passed away surrounded by his loved ones and
a strategically placed pile of mattresses.
Joke 5
Topic:
BuzzFeed put out a list of 31 things to buy when
you finally decide to update your kitchen.
Human: (TL: 36, T: 1.08, ML: 33)
If you ask me, appliances are too smart already. The
clock on my coffee maker flashes 12 12 12...
What'll it do smarter--snicker? "Tsk tsk tsk. So
much for caffeine increasing brain function."
AI: (TL: 25, T: 0.92, ML: 26)
Number 32 on the list: a new Buzzfeed article on
31 ways to use all the unnecessary gadgets you
bought from the first list.
Joke 6
Topic:
Scientists have discovered a sixth ocean more than
400 miles below the surface of the Earth.
Human: (TL: 60, T: 1.29, ML: 48)
Great, I was just looking for a gnarly new place to
surf. (mime surfing around dangers) "Stalactite!
Stalagmite! Bats! Gollum!!"
AI: (TL: 94, T: 1.54, ML: 62)
Looks like Aquaman's commute just got a whole
lot longer.
Joke 7
Topic:
Scientists are studying whether astronauts in the
future could transform rocks into food.
Human: (TL: 236, T: 4.62, ML: 52)
Hey, don't give Fruity Pebbles any ideas. Rocky
Road with real rocks? You could chip a tooth on
Stone Ground Mustard!
AI: (TL: 241, T: 4.79, ML: 52)
Which is great news for anyone who's ever had a
craving for a pebble pie.
Joke 8
Topic:
A new study says that young children in the UK get
almost half their calories from ultra-processed
food.
Human: (TL: 46, T: 0.88, ML: 57)
If you think that's bad, the other half is British
cooking.
AI: (TL: 70, T: 1.79, ML: 40)
The most popular kids' meals in the UK are now
the Happy Meal, the Crispy Chicken Sandwich,
and Uncle Nigel's Deep-Fried Crumpets.