Debatts is a novel zero-shot text-to-speech synthesis system designed for rebuttal, where a speaker synthesizes their own persuasive articulation given the context from the opposing side. This page provides an overview of the Debatts model and the Debatts-Data dataset, and demonstrates the system's performance compared to baseline models.
In debating, rebuttal is one of the most critical stages, where a speaker addresses the arguments presented by the opposing side. During this process, the speaker synthesizes their own persuasive articulation given the context from the opposing side. This work proposes a novel zero-shot text-to-speech synthesis system for rebuttal, namely Debatts. Debatts takes two speech prompts, one from the opposing side (i.e., the opponent) and one from the speaker. The prompt from the opponent provides debating-style prosody, and the prompt from the speaker provides identity information. In particular, we pretrain the Debatts system on an in-the-wild dataset and integrate an additional reference encoder that takes the debating prompt for style. In addition, we create a debating dataset to develop Debatts. In this setting, Debatts can generate debating-style rebuttal speech for any voice. Experimental results confirm the effectiveness of the proposed system in comparison with classic zero-shot TTS systems.
The Debatts-Data dataset is constructed from a vast collection of professional Mandarin speech data sourced from diverse video platforms and podcasts on the Internet. The in-the-wild collection approach ensures real and natural rebuttal speech. This is the first Mandarin rebuttal speech dataset for expressive text-to-speech synthesis. The dataset contains annotations of transcription, duration, and style embedding.
The following table provides a comparison of publicly available debate datasets:
| Dataset | Language | Speakers | Duration (hrs) | Modality (T = text, S = speech) | Sample Rate (kHz) | Source (W = in-the-wild, S = studio) |
|---|---|---|---|---|---|---|
| V.Debate [1] | EN | - | 24 (E.) | T | - | W |
| Record [2] | EN | 10 | 6 (E.) | T+S | 44.1 | S |
| Rebuttal [3] | EN | 14 | 27 (E.) | T+S | - | S |
| DBates [4] | EN | 140 | 70 (E.) | T+S | 16 | W |
| Debatts-data | ZH | 2350 | 111.9 | T+S | 16 | W |
- [1] Vivesdebate: A new annotated multilingual corpus of argumentation in a debate tournament. Applied Sciences, 11(15), 7160.
- [2] A recorded debating dataset. arXiv preprint arXiv:1709.06438.
- [3] A dataset of general-purpose rebuttal. arXiv preprint arXiv:1909.00393.
- [4] DBates: Dataset for discerning benefits of audio, textual, and facial expression features in competitive debate speeches. IEEE Transactions on Affective Computing, 14(2), 1028-1043.
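To make the annotation format concrete, the sketch below shows how per-utterance metadata (transcription, duration, style embedding) might be read. All file and field names here are assumptions for illustration only, not the dataset's actual schema; consult the released data for the real layout.

```python
import json
from pathlib import Path

def load_metadata(metadata_file: str):
    """Read one JSON object per line, each describing a single utterance.
    The field names below are hypothetical placeholders."""
    entries = []
    with open(metadata_file, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            entries.append({
                "audio": Path(item["audio_path"]),           # 16 kHz rebuttal clip
                "text": item["text"],                        # Mandarin transcription
                "duration": float(item["duration"]),         # seconds
                "style_emb": Path(item["style_embedding"]),  # debating-style embedding
            })
    return entries

if __name__ == "__main__":
    meta = load_metadata("debatts_data/metadata.jsonl")
    print(f"{len(meta)} utterances, {sum(e['duration'] for e in meta) / 3600:.1f} hours")
```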
Debatts-Data Pipe is the first open-source preprocessing pipeline designed to transform in-the-wild professional debating speech data into annotated, high-quality rebuttal data for text-to-speech generation. It encompasses five key steps:
- Moderator detection
- Rebuttal session extraction
- Speaker diarization
- Overlap deletion
- Merging and speech enhancement with metadata extraction
The pipeline introduces a precise methodology for detecting moderators within extensive competitive speech datasets, facilitating the extraction of relevant sections based on moderator cues. This approach is adaptable to other languages with minimal modifications and can be extended to any context involving moderator-led discussions.
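The sketch below illustrates how these five steps could be chained. Every helper is a simplified placeholder rather than the released implementation; a real pipeline would use ASR to spot moderator cue phrases (step 1), an off-the-shelf diarization toolkit to produce speaker-labeled turns (step 3), and a speech-enhancement model plus metadata extraction in the final step.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Turn:
    start: float                     # seconds
    end: float
    speaker: str                     # diarization label (step 3 output), e.g. "SPK_0"
    is_moderator_cue: bool = False   # e.g. an announcement opening the rebuttal round

def detect_moderator(turns: List[Turn]) -> str:
    # Step 1: take the speaker who utters the session cue phrases as the moderator.
    cues = [t.speaker for t in turns if t.is_moderator_cue]
    return max(set(cues), key=cues.count) if cues else ""

def extract_rebuttal(turns: List[Turn], moderator: str) -> List[Turn]:
    # Step 2: keep debater turns between moderator cues (simplified here to
    # dropping the moderator's own turns).
    return [t for t in turns if t.speaker != moderator]

def delete_overlaps(turns: List[Turn]) -> List[Turn]:
    # Step 4: discard turns that overlap the previously kept turn.
    kept, last_end = [], 0.0
    for t in sorted(turns, key=lambda t: t.start):
        if t.start >= last_end:
            kept.append(t)
            last_end = t.end
    return kept

def merge_and_enhance(turns: List[Turn]) -> List[Turn]:
    # Step 5: merge consecutive turns of the same speaker; speech enhancement
    # and metadata extraction (transcription, duration) would follow here.
    merged: List[Turn] = []
    for t in turns:
        if merged and merged[-1].speaker == t.speaker:
            merged[-1].end = t.end
        else:
            merged.append(Turn(t.start, t.end, t.speaker))
    return merged

# Usage: diarized turns go in, clean single-speaker rebuttal turns come out.
turns = [
    Turn(0.0, 5.0, "SPK_0", is_moderator_cue=True),
    Turn(5.0, 30.0, "SPK_1"),
    Turn(28.0, 40.0, "SPK_2"),   # overlapping interjection, dropped in step 4
    Turn(40.0, 60.0, "SPK_1"),
]
moderator = detect_moderator(turns)
clean = merge_and_enhance(delete_overlaps(extract_rebuttal(turns, moderator)))
```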
Debatts is the first zero-shot debating text-to-speech (TTS) system that leverages the opponent's speech as a style prompt alongside the target speaker's speech as a speaker identity prompt. It features a two-stage model architecture comprising:
- Text-to-Semantic Stage: The model predicts target semantic tokens by integrating the semantic tokens from both the opponent and the target speaker, along with the text tokens.
- Semantic-to-Acoustic Stage: The model generates speech with a debating style based on the concatenated target speaker's and predicted semantic tokens, combined with the target speaker's acoustic tokens.
This approach enables the generation of natural and expressive rebuttal speech for any voice.
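A minimal sketch of this two-stage flow is shown below. Token streams are abstracted as plain lists and the two predictors as callables; the function and parameter names are assumptions for illustration, not the released API.

```python
from typing import Callable, List

Tokens = List[int]

def synthesize_rebuttal(
    text_tokens: Tokens,
    opponent_semantic: Tokens,   # style prompt: opponent's debating prosody
    speaker_semantic: Tokens,    # identity prompt (semantic stream)
    speaker_acoustic: Tokens,    # identity prompt (acoustic stream)
    text_to_semantic: Callable[[Tokens, Tokens, Tokens], Tokens],
    semantic_to_acoustic: Callable[[Tokens, Tokens], Tokens],
) -> Tokens:
    # Stage 1 (text-to-semantic): predict the target semantic tokens from the
    # opponent's semantic tokens, the speaker's semantic tokens, and the text tokens.
    target_semantic = text_to_semantic(opponent_semantic, speaker_semantic, text_tokens)

    # Stage 2 (semantic-to-acoustic): condition on the speaker's semantic tokens
    # concatenated with the predicted ones, plus the speaker's acoustic tokens,
    # to generate acoustic tokens that a codec decoder turns into a waveform.
    conditioning_semantic = speaker_semantic + target_semantic
    return semantic_to_acoustic(conditioning_semantic, speaker_acoustic)

# Dummy stand-ins just to show the call pattern.
dummy_t2s = lambda opp, spk, txt: list(range(len(txt)))
dummy_s2a = lambda sem, ac: sem + ac
acoustic_tokens = synthesize_rebuttal([1, 2, 3], [10, 11], [20, 21], [30, 31],
                                      dummy_t2s, dummy_s2a)
```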
In this section, we demonstrate the zero-shot TTS performance of Debatts compared to the baseline models. The samples below highlight the effectiveness of Debatts in generating rebuttal-style speech in comparison to the baseline system. Detailed explanations and examples can be found at https://amphionspace.github.io/debatts/.