Some Piper voices. #430

StoryHack · 2024-03-08T19:46:27Z

So, I have been playing with training voices for a while now. I really wanted to have several good sounding voices available that have friendly licenses. So I'm posting 6 voices (well, 5, with a high and medium quality version of one) that I have trained and think sound pretty good. I include ckpt files for several, in case you want to fine tune with them.

Updated with 3 additional voices on 5/10/2024

https://brycebeattie.com/files/tts

If somebody wants to upload these to huggingface or similar, you have my blessing.

synesthesiam · 2024-03-12T14:32:28Z

Maintainer

Awesome work, thank you! I will get these uploaded to the piper-voices repo 🙂

0 replies

StoryHack · 2024-03-12T21:11:03Z

Author

Forgot I had a medium quality settings version of the Cori voice. That is up as well now.

0 replies

bzp83 · 2024-03-13T13:14:49Z

that's awesome!

Would you mind sharing some screenshots of what the gen and disc graphs from TensorBoard look like for these training runs? I'm having difficulty understanding what a good graph looks like.

thanks!

1 reply

StoryHack Mar 13, 2024
Author

I never really figured out what looks good in those graphs, so I stopped checking them. I just do it by ear, running some samples every so often. When I do a voice from scratch, I go for as large a dataset as I can reasonably assemble (20+ hours), then train for enough epochs to keep my RTX 4080 running for about two days. For a fine-tuned model, I use at least 8 or so hours of clips and train for about a day. I'm not real scientific about it.

bzp83 · 2024-03-13T23:04:56Z

fair enough.

Would you by any chance have used a 3090? I borrow my son's 4090 to do some training, but I'm considering buying a second-hand 3090 or a 4080. I don't want to spend too much on a 4090 unless it's really that much faster than a 4080 or 3090.

Do you think I would benefit from training a large dataset (100+ hours of audio) until it's good enough, then training a smaller dataset ~10h using a checkpoint from the 100+ hours training? Would it get as good as the one trained with 100+ hours?

thanks!

2 replies

StoryHack Mar 14, 2024
Author

Sorry, no experience with a 3090. I had a 3060, but I had to keep the batch size pretty small, and training took forever.

I didn't post any of my fine-tunes, but I have done several starting from the "Kristin" model I posted, and they worked out well. Several of the English voices in the official repo were fine-tuned from the "lessac" voice the way you describe, and they sound good.

SeymourNickelson Apr 1, 2024

Training a modified version of the LJSpeech dataset on a 3090 for 2k epochs takes ~10 days for me. Batch size is 32.
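As a rough sanity check on those numbers (simple arithmetic, not a benchmark), ~10 days for 2,000 epochs works out to a little over 7 minutes per epoch. The helper name below is made up for illustration.

```python
def seconds_per_epoch(total_days: float, epochs: int) -> float:
    """Average wall-clock seconds per epoch for a run lasting `total_days`."""
    return total_days * 24 * 60 * 60 / epochs


per_epoch = seconds_per_epoch(total_days=10, epochs=2000)
print(f"{per_epoch:.0f} s per epoch (~{per_epoch / 60:.1f} min)")
```

The same helper can be turned around to estimate how long a planned run would take once you have timed a few epochs on your own card.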

jnhck · 2024-03-18T15:57:29Z

Thank you for your work.
Could your voices now be used to fine-tune other voices, as was done with all the voices that seem to be fine-tunes of the EN Lessac voice? Do you have any idea how many hours of labeled audio data one would need to fine-tune for another language? And how much compute would be necessary, and how long would it take?
Sorry for all the questions, but I am new to the world of TTS and am still a bit overwhelmed. Thanks in advance.

4 replies

StoryHack Mar 18, 2024
Author

Yes, you can fine-tune with most of those. You would use the .ckpt file to do so. The number of hours of dataset recordings needed varies wildly with how good the recordings are and how well the dataset was prepared. I've used 4, 6, 8, and 12+ hours with a variety of datasets and had all sorts of results, and I haven't really figured out all the secrets. I've just done a lot of training sessions, with datasets as big as I can get.

The first time I ran the ljspeech dataset, with my RTX 3060, it took over a month. My newer RTX 4080 did it in a few days, and fine-tuning generally needs fewer epochs to make it sound good.
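For readers who want to try this, here is a minimal sketch of a fine-tuning invocation that resumes from a downloaded .ckpt. The flag names are as best recalled from Piper's training guide and should be treated as assumptions to check against your installed version. The helper name, the paths, and the default values are hypothetical.

```python
import shlex


def build_finetune_command(dataset_dir: str, checkpoint: str,
                           batch_size: int = 32,
                           max_epochs: int = 10000) -> list[str]:
    """Build an argv list for a piper_train run that resumes from `checkpoint`.

    Assumption: flag names follow Piper's training guide; verify locally.
    """
    return [
        "python3", "-m", "piper_train",
        "--dataset-dir", dataset_dir,
        "--accelerator", "gpu",
        "--devices", "1",
        "--batch-size", str(batch_size),
        "--validation-split", "0.0",
        "--num-test-examples", "0",
        "--max_epochs", str(max_epochs),
        "--resume_from_checkpoint", checkpoint,
        "--checkpoint-epochs", "1",
        "--precision", "32",
    ]


# Hypothetical paths, for illustration only.
cmd = build_finetune_command("/data/my_voice", "/models/base_voice.ckpt")
print(shlex.join(cmd))
```

Building the command as a list keeps paths with spaces intact if it is later passed to `subprocess.run`. A smaller batch size may be needed on cards with less memory, as mentioned earlier in the thread for the 3060.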

jnhck Mar 19, 2024

Thanks for the answer. Maybe this is a stupid question, but wouldn't it be possible to go on Fiverr or something like that and pay people with decent microphones to read some stuff? Even if it takes 50 hrs per language, it seems like it would still be quite affordable, all things considered. I could possibly get access to some H100s for a while; I guess the fine-tuning would be quite fast with those (if I had any idea what I was doing).

StoryHack Mar 19, 2024
Author

Good readers with good equipment can get expensive pretty fast, but you might find somebody on Fiverr. I mostly use Librivox recordings, as they are in the public domain. But you have to hunt around Librivox to find the good narrators, and make sure you are using similar "sounding" recordings. (Many narrators upgrade their equipment/process over time, so earlier recordings sound much different.) There are thousands of recorded books released there. Whichever way you go, though, the trouble is getting all those recordings split up with matching text. You can do it by hand in Audacity and a text editor, but it takes a long time.

I've also used the Piper recording studio to record my own voice, which automatically does all the audio file creation and assembly of the metadata.csv. Once you get it set up, it is very easy to use. I just haven't recorded enough for the dataset to be useful yet. And setup is not super easy unless you are used to installing this kind of project. It's probably more than you can get a Fiverr narrator to do.
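For anyone assembling the metadata.csv by hand instead, here is a minimal sketch. It assumes the single-speaker, LJSpeech-style layout that Piper's training guide describes: one `id|text` line per clip, where `id` is the wav file name without its extension. The function names and the sample rows are hypothetical.

```python
from pathlib import Path


def metadata_line(clip_id: str, text: str) -> str:
    """Format one `id|text` row, collapsing whitespace in the transcript."""
    cleaned = " ".join(text.split())
    if "|" in clip_id or "|" in cleaned:
        # The pipe is the column delimiter, so it cannot appear in a field.
        raise ValueError(f"'|' is not allowed in clip {clip_id!r}")
    return f"{clip_id}|{cleaned}"


def write_metadata(rows, out_path) -> int:
    """Write (clip_id, text) pairs to `out_path`; return the number of rows."""
    lines = [metadata_line(clip_id, text) for clip_id, text in rows]
    Path(out_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


# Made-up example rows; real ones come from your own clips and transcripts.
rows = [("0001", "The quick brown fox."), ("0002", "Jumps over\nthe lazy dog.")]
for clip_id, text in rows:
    print(metadata_line(clip_id, text))
```

Rejecting the delimiter outright is a deliberate choice for the sketch: a silently shifted column is much harder to spot later than an error at dataset-build time.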

jnhck Mar 19, 2024

Thanks for the detailed answer. I see, the entire process seems to be more complicated than I thought. I guess I have to read up about it some more and then see where I go from there.

king-dahmanus · 2024-03-20T13:07:06Z

Hi.
The Kristin voice is really great, but like Lessac, it seems to have a problem with pauses. It pauses when it's not supposed to.

Also, it has the same issue as @StoryHack described. The quality of the generated audio varies from sentence to sentence. I guess there aren't any tools you could use to equalise all those recordings without some serious technical audio fiddling?

1 reply

StoryHack Mar 20, 2024
Author

The pause issue stems from the fact that I'm using a relatively automatic process to generate the datasets. I split a long recording into a bunch of short ones using Audacity. Audacity has a "detect sound" feature that will mark sections of a file, looking for gaps of silence as the guide. This can align with whole sentences if you tweak the settings just right, but it doesn't always work out. Once I export all the short wavs from Audacity, I use Whisper to transcribe the wavs and generate the text.
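The silence-gap idea can be sketched in a few lines. This is not Audacity's algorithm, just a simplified amplitude-threshold stand-in to show why the settings matter: a gap threshold that is too short splits mid-sentence, and one that is too long merges sentences. The function name, threshold, and gap length are assumptions for illustration.

```python
def find_segments(samples, sample_rate, threshold=0.02, min_silence_s=0.4):
    """Return (start, end) sample indices of sound runs.

    A new segment starts whenever at least `min_silence_s` seconds of
    samples below `threshold` separate two loud samples.
    """
    min_gap = int(min_silence_s * sample_rate)
    segments = []
    start = None      # first loud sample of the current segment
    last_loud = None  # most recent loud sample seen so far
    for i, sample in enumerate(samples):
        if abs(sample) < threshold:
            continue
        if start is None:
            start = i
        elif i - last_loud > min_gap:
            # The silent stretch was long enough: close the old segment.
            segments.append((start, last_loud + 1))
            start = i
        last_loud = i
    if start is not None:
        segments.append((start, last_loud + 1))
    return segments


# Toy signal at 10 samples per second: a short pause, then a long one.
signal = [0, 0.5, 0.5, 0, 0, 0.5, 0, 0, 0, 0, 0, 0.5, 0.5, 0]
print(find_segments(signal, sample_rate=10, min_silence_s=0.5))
```

In the toy signal the two-sample pause stays inside the first segment, while the five-sample pause produces a split. Real recordings would also need some padding around each segment so word endings are not clipped.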

This is not a perfect process, but after reading up on how some of the big publicly available datasets (LibriTTS and Hi-Fi TTS) were generated, looking at some of their metadata, and listening to their wavs, I've convinced myself that this process produces datasets of about the same accuracy.

One thing they use is automatic alignment tools, which take the original text and the recording and mark which parts of the audio correspond to which sentences of the text. I tried to figure out the tools in NVIDIA's NeMo toolkit to do this and see if it is more accurate, but I haven't gotten it to work at all.

The way that generates the best dataset is still manual marking and slicing up the recordings, using the original text. You get correct text, and you get natural pauses and phrasing (I actually don't know how much this factors into training.) Doing this manually takes forever though. Using various tools, I've worked for several hours to get 30 minutes of dataset prepared. Compare that to the fact that I can build a 20 hour dataset with about 30 minutes of human interaction and a few hours of my computer just grinding away at it.

I have considered manually building a smaller dataset of an hour or two of samples, then using it to finetune a model generated by a larger, automatically-produced dataset of the same narrator. See if that helps at all.

If the recordings of a narrator vary wildly in volume, I've occasionally used a compressor plugin (and other filters) to help them all sound more alike. That kind of audio fiddling was the goal of the group that released LibriTTS-R. They built an unreleased AI model for restoring audio, rather than just using high/low pass filters, compressors, or EQ in a DAW like Audacity, Reaper, Pro Tools, etc. And quite frankly the multispeaker voices from that dataset sound pretty "same" from sentence to sentence (my "clean100" and the official "libritts-r" models, which are both derived from that dataset). I've tried using just a single speaker from those datasets, but each speaker seems to have only 10-20 minutes of recordings. And according to one of the authors of LibriTTS-R when I contacted him, they have no plans to release the audio restoration model, so I can't play with "restoring" a larger number of recordings of a single speaker.
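As a simpler stand-in for the compressor approach (and nothing like the LibriTTS-R restoration model), clip loudness can be evened out by scaling every clip to the same RMS level. The sketch below assumes samples are floats in the range -1.0 to 1.0; the function names and target values are made up for illustration.

```python
import math


def rms(samples) -> float:
    """Root-mean-square level of a clip (0.0 for an empty clip)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def normalize_rms(samples, target_rms=0.1, peak_limit=0.99):
    """Scale a clip toward `target_rms` without pushing any peak past `peak_limit`."""
    current = rms(samples)
    if current == 0.0:
        return list(samples)  # pure silence: nothing to scale
    gain = target_rms / current
    peak = max(abs(s) for s in samples)
    gain = min(gain, peak_limit / peak)  # back off rather than clip
    return [s * gain for s in samples]


quiet_clip = [0.01, -0.01, 0.01, -0.01]
loud_clip = [0.5, -0.5, 0.5, -0.5]
print(round(rms(normalize_rms(quiet_clip)), 3), round(rms(normalize_rms(loud_clip)), 3))
```

Unlike a compressor, this applies one gain to the whole clip, so it evens out clip-to-clip volume but leaves the dynamics within each clip untouched.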

My current experiment is a multispeaker model with datasets of 8-14ish hours each of six different narrators (67K+ utterances). Got a ways to go on training, though.

king-dahmanus · 2024-03-22T01:03:53Z

Good luck @StoryHack

0 replies

StoryHack · 2024-05-10T17:13:06Z

Author

I just put 3 additional voices on the page, all public domain.

One is my voice (Bryce), which I was experimenting with to see the minimum samples needed. I recorded using a Vivitar USB mic and the piper recording studio. I used the Harvard balanced sentences as most of the corpus, along with some longer and shorter sentences that I made up. Some day I will record more and redo this voice, but it sounds reasonably close to real me now.

The other 2 are both US male voices built from datasets made from Librivox recordings.

0 replies

synesthesiam · 2024-05-30T21:22:09Z

Maintainer

Grabbing them now, thank you for training these!

0 replies

j0shm1lls · 2024-06-17T03:20:19Z

Do you happen to have screenshots of what your TensorBoard graphs looked like after, say, 1000 epochs when training from scratch? I have about 14 hours of source audio that I'm training from scratch on a 4080, batch size of 24, and after about 300 epochs my loss_disc_all starts dropping and loss_gen_all and val_loss start rising. Also, the inference output sounds a bit robotic. I'm afraid I'm overfitting, but maybe I'm just not letting it train long enough to resolve?

1 reply

StoryHack Jun 19, 2024
Author

I don't save tensorboard output at all. The graphs never really made sense to me, so I just kind of train by ear. When I've had robotic sounding results it usually means it's either not trained enough, or I screwed up my dataset and it has samples that cut off part of the last word in each of the recordings.
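One way to catch the cut-off problem before training is to flag clips that are still loud in their final few milliseconds. This is only a heuristic sketch with assumed values: the function name, the 50 ms tail, and the threshold are all hypothetical, and samples are taken to be floats in the range -1.0 to 1.0.

```python
import math


def ends_abruptly(samples, sample_rate, tail_s=0.05, threshold=0.02) -> bool:
    """Return True if the last `tail_s` seconds of a clip are above `threshold` RMS.

    A clip that ends cleanly should decay to near silence; one whose last
    word was chopped usually still carries speech energy at the very end.
    """
    tail_len = max(1, round(tail_s * sample_rate))
    tail = samples[-tail_len:]
    if not tail:
        return False
    level = math.sqrt(sum(s * s for s in tail) / len(tail))
    return level > threshold


# Toy clips at 10 samples per second with a half-second tail window.
clean_clip = [0.3] * 20 + [0.0] * 10
chopped_clip = [0.0] * 10 + [0.3] * 20
print(ends_abruptly(clean_clip, 10, tail_s=0.5), ends_abruptly(chopped_clip, 10, tail_s=0.5))
```

Flagged clips still need a listen, since background noise or a sustained final sound can trip the threshold on a perfectly good recording.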

VAlmqvist · 2025-09-27T20:05:18Z

Okay, so this convo finished long ago, but I figured I'd ask anyway since it's the best place I've found to ask... I am just starting my foray into fine-tuning and training voices, but since I'm Swedish my options are limited. I would like to start from a baseline using the medium Lisa model, but I can't seem to find a checkpoint for it. "sv" is missing from the Hugging Face repo when I check the checkpoints... Any ideas on what to do?

0 replies

Replies: 11 comments · 9 replies
